Hardware-Aware LLM Design: Bridging Theory and Silicon
Model architecture and hardware are inseparable. Understanding GPU memory hierarchy, tensor cores, and kernel optimization enables designs that are 10-100× faster in practice, even with equivalent theoretical complexity.
- Memory Hierarchy: Registers, shared memory, L1/L2 cache, HBM
- Tensor Cores: Matrix multiplication units optimized for specific data types
- Kernel Fusion: Combining operations to minimize memory transfers
- Architecture Design: Shaping models to match hardware capabilities
The fastest algorithm is the one that matches the hardware you have.
Hardware-Aware LLM Design
The theoretical complexity of an algorithm tells only part of the story. Real-world performance depends on how well the computation maps to the underlying hardware. For LLMs, this means understanding GPU architecture and designing models that exploit it.
Definition: Hardware-Aware Design
Hardware-aware design is the practice of co-designing model architecture and computation to match the characteristics of target hardware, including memory hierarchy, compute units, and interconnect topology.
GPU Memory Hierarchy
The Memory Pyramid
| Level | Scope | Capacity | Latency (illustrative) |
|---|---|---|---|
| Registers | per SM | ~256 KB | 0.1 ns |
| Shared Memory | per SM, on-chip | ~20 MB in total (about 192 KB per SM on an A100, shared with L1) | 0.3 ns |
| L1 Cache | per SM, on-chip | same on-chip SRAM as shared memory | 0.5 ns |
| L2 Cache | global, on-chip | ~50 MB | 5 ns |
| HBM (GPU RAM) | off-chip, DRAM | 80 GB | 20 ns |
| CPU RAM | off-chip | 512 GB | 100 ns |
The latency column shows relative ordering only. Measured access latencies on real GPUs are higher, typically tens of clock cycles for shared memory and several hundred cycles for HBM.
The key insight is that each level of the memory hierarchy is 10-100× larger but 10-100× slower than the level above it. The most efficient algorithms minimize data movement between levels.
Memory Bandwidth
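$$BW_{\text{eff}} = \frac{n \cdot d \cdot b}{t}$$
Here,
- $BW_{\text{eff}}$ = Effective bandwidth utilization (bytes per second)
- $n$ = Sequence length or batch size
- $d$ = Model dimension
- $b$ = Bytes per element (2 for FP16, 1 for INT8)
- $t$ = Time taken to move the data (seconds)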
Bandwidth Bottleneck
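A 70B-parameter model in FP16 needs 140 GB for its weights alone (70 billion parameters × 2 bytes). At the A100's roughly 2 TB/s of HBM bandwidth, streaming those weights once takes about 70 ms. Because 140 GB exceeds a single A100's 80 GB, the weights must in practice be sharded across several GPUs. In autoregressive generation with batch size 1, every new token requires reading all the weights, so this transfer time dominates compute time and the model is memory-bandwidth bound.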
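The arithmetic behind this bottleneck can be checked with a short sketch. The function names are hypothetical, and the model deliberately ignores KV-cache reads, activations, and any overlap of compute with memory traffic, so the result is a lower bound on per-token latency rather than a prediction.

```python
def weight_stream_time_s(n_params, bytes_per_param, bandwidth_bytes_per_s):
    """Lower bound on the time to read every weight once from memory."""
    return n_params * bytes_per_param / bandwidth_bytes_per_s


def max_decode_tokens_per_s(n_params, bytes_per_param, bandwidth_bytes_per_s):
    """Upper bound on batch-size-1 decoding speed if weight reads dominate."""
    return 1.0 / weight_stream_time_s(n_params, bytes_per_param, bandwidth_bytes_per_s)


# 70B parameters, FP16 (2 bytes each), 2 TB/s of memory bandwidth
t = weight_stream_time_s(70e9, 2, 2e12)
rate = max_decode_tokens_per_s(70e9, 2, 2e12)
print(f"{t * 1e3:.0f} ms per token, at most {rate:.1f} tokens/s")
# 70 ms per token, at most 14.3 tokens/s
```

Halving the bytes per parameter (for example with INT8 weights) halves the bound, which is why weight quantization helps decoding speed even when compute is unchanged.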
Arithmetic Intensity
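$$I = \frac{\text{FLOPs}}{\text{bytes moved}} = \frac{2 n^2 d}{b\,(2nd + n^2)}$$
Here, for the product of an $n \times d$ matrix with a $d \times n$ matrix,
- $I$ = Arithmetic intensity (FLOPs per byte)
- $n$ = Matrix dimension
- $d$ = Inner dimension
- $b$ = Bytes per element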
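The roofline model says that an operation is memory-bound when its arithmetic intensity falls below the hardware's ridge point, the ratio of peak compute to peak memory bandwidth (about 156 FLOPs per byte for an A100 at 312 TFLOPS and 2 TB/s). For matrix multiplication, intensity grows with matrix size, so larger matrices are more likely to be compute-bound.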
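As a rough illustration of the roofline test, the sketch below computes the arithmetic intensity of a general matrix multiply and compares it with a ridge point. The function names are hypothetical, and the byte count assumes each operand is read once and the result written once, which is the best case for data reuse.

```python
def matmul_arithmetic_intensity(m, n, k, bytes_per_element):
    """FLOPs per byte for an (m x k) @ (k x n) product with ideal reuse."""
    flops = 2 * m * n * k                      # one multiply and one add per term
    elements_moved = m * k + k * n + m * n     # read A, read B, write C
    return flops / (bytes_per_element * elements_moved)


def ridge_point(peak_flops, peak_bandwidth_bytes_per_s):
    """Intensity at which a kernel stops being memory-bound."""
    return peak_flops / peak_bandwidth_bytes_per_s


A100_RIDGE = ridge_point(312e12, 2e12)  # FP16 tensor-core peak and ~2 TB/s HBM

for m in (1, 128, 4096):
    intensity = matmul_arithmetic_intensity(m, 4096, 4096, bytes_per_element=2)
    bound = "compute-bound" if intensity > A100_RIDGE else "memory-bound"
    print(f"m={m}: {intensity:.1f} FLOPs/byte ({bound})")
```

The `m=1` case corresponds to a single decoding step against a 4096×4096 weight matrix: its intensity is close to 1 FLOP per byte, far below the ridge point, which is why small-batch decoding is memory-bound.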
Tensor Cores and Matrix Operations
Tensor Core Architecture
<svg viewBox="0 0 400 300" className="w-full h-auto">
<!-- GPU SM -->
<rect x="20" y="20" width="360" height="260" rx="10" fill="#1e1e2e" stroke="#89b4fa"/>
<text x="200" y="45" textAnchor="middle" fill="#cdd6f4" fontSize="14" fontWeight="bold">GPU Streaming Multiprocessor (SM)</text>
<!-- Tensor Cores -->
<rect x="40" y="60" width="150" height="100" rx="8" fill="#a6e3a1" opacity="0.3" stroke="#a6e3a1"/>
<text x="115" y="85" textAnchor="middle" fill="#a6e3a1" fontSize="12" fontWeight="bold">Tensor Cores</text>
<text x="115" y="105" textAnchor="middle" fill="#a6e3a1" fontSize="10">FP16/BF16/INT8</text>
<text x="115" y="120" textAnchor="middle" fill="#a6e3a1" fontSize="10">Matrix Multiply-Accumulate</text>
<text x="115" y="135" textAnchor="middle" fill="#a6e3a1" fontSize="10">312 TFLOPS (A100)</text>
<!-- CUDA Cores -->
<rect x="210" y="60" width="150" height="100" rx="8" fill="#f9e2af" opacity="0.3" stroke="#f9e2af"/>
<text x="285" y="85" textAnchor="middle" fill="#f9e2af" fontSize="12" fontWeight="bold">CUDA Cores</text>
<text x="285" y="105" textAnchor="middle" fill="#f9e2af" fontSize="10">FP32/FP64 Operations</text>
<text x="285" y="120" textAnchor="middle" fill="#f9e2af" fontSize="10">Scalar Operations</text>
<text x="285" y="135" textAnchor="middle" fill="#f9e2af" fontSize="10">19.5 TFLOPS (A100)</text>
<!-- Shared Memory -->
<rect x="40" y="180" width="320" height="60" rx="8" fill="#89b4fa" opacity="0.3" stroke="#89b4fa"/>
<text x="200" y="205" textAnchor="middle" fill="#89b4fa" fontSize="12" fontWeight="bold">Shared Memory / L1 Cache</text>
<text x="200" y="225" textAnchor="middle" fill="#89b4fa" fontSize="10">~192 KB per SM • 19 TB/s bandwidth</text>
<!-- Legend -->
<text x="40" y="270" fill="#6c7086" fontSize="10">Tensor cores are 16× faster than CUDA cores for matrix operations</text>
</svg>
Tensor Core Operations
Tensor Core FMA
$$D = A \times B + C$$
Here,
- $A$ = Input matrix (FP16/BF16)
- $B$ = Weight matrix (FP16/BF16)
- $C$ = Accumulator (FP32)
- $D$ = Output matrix (FP32)
Tensor cores perform matrix multiply-accumulate (MMA) operations on small tiles such as 16×16. An A100 SM has 4 tensor cores, each performing 256 FP16 FMAs per cycle. With 108 SMs at about 1.41 GHz, and counting each FMA as 2 FLOPs, this gives 312 TFLOPS, 16× the 19.5 TFLOPS of FP32 on CUDA cores.
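The peak figure can be reproduced from the hardware parameters. This is a minimal sketch with a hypothetical function name; it assumes the published A100 configuration of 108 SMs, 4 tensor cores per SM, 256 dense FP16 FMAs per tensor core per clock, and a 1.41 GHz boost clock.

```python
def peak_tensor_flops(n_sms, tensor_cores_per_sm, fmas_per_core_per_clock, clock_hz):
    """Peak dense throughput in FLOPs per second (1 FMA = 2 FLOPs)."""
    return n_sms * tensor_cores_per_sm * fmas_per_core_per_clock * 2 * clock_hz


a100_fp16 = peak_tensor_flops(108, 4, 256, 1.41e9)
a100_fp32_cuda = 19.5e12  # FP32 peak on CUDA cores, from the table in this section

print(f"{a100_fp16 / 1e12:.0f} TFLOPS, {a100_fp16 / a100_fp32_cuda:.0f}x FP32")
```

The same formula makes it easy to see how peak throughput scales with SM count and clock speed on other parts, provided the per-core FMA rate is known.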
Data Type Considerations
| Data Type | Size | Tensor Core Support | Throughput (A100) |
|---|---|---|---|
| FP32 | 4 bytes | No | 19.5 TFLOPS |
| TF32 | 4 bytes | Yes | 156 TFLOPS |
| FP16 | 2 bytes | Yes | 312 TFLOPS |
| BF16 | 2 bytes | Yes | 312 TFLOPS |
| INT8 | 1 byte | Yes | 624 TOPS |
| INT4 | 0.5 bytes | Yes | 1,248 TOPS |
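TF32 (TensorFloat-32) combines the 8-bit exponent of FP32 (the same range) with the 10-bit mantissa of FP16 (the same precision). On Ampere GPUs it lets FP32 matrix operations run on tensor cores at up to 8× the FP32 peak (156 vs. 19.5 TFLOPS) with minimal quality loss. Whether it is enabled by default depends on the library; in PyTorch, for example, TF32 use in matrix multiplies is controlled by `torch.backends.cuda.matmul.allow_tf32`.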
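To make the "FP32 range, FP16 precision" trade-off concrete, the following sketch emulates TF32 rounding in plain Python by clearing the 13 low mantissa bits of an FP32 value. The function name is hypothetical, and the simple round-half-up step is an assumption; real hardware rounding may differ in edge cases.

```python
import struct


def to_tf32(x):
    """Round x to TF32 precision: FP32 exponent, 10 explicit mantissa bits."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]   # FP32 bit pattern
    bits = (bits + 0x1000) & 0xFFFFE000                   # round, drop low 13 bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]


print(to_tf32(1.0001))   # 1.0 (the difference is below TF32 resolution near 1)
print(to_tf32(1e38))     # still finite, whereas FP16 overflows above 65504
```

Values near 1 are resolved in steps of 2^-10, the same spacing as FP16, while very large and very small magnitudes remain representable as in FP32.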
Kernel Fusion
The Memory Transfer Problem
Definition: Kernel Fusion
Kernel fusion combines multiple GPU operations (kernels) into a single kernel, reducing memory transfers between operations. This is critical for LLM inference where memory bandwidth is the bottleneck.

    import math
    import torch
    import torch.nn.functional as F

    # WITHOUT fusion: three kernels, each making a round-trip through HBM
    def unfused_attention(q, k, v):
        scale = 1.0 / math.sqrt(q.shape[-1])
        scores = torch.matmul(q, k.transpose(-2, -1)) * scale  # Load q, k; store scores
        weights = F.softmax(scores, dim=-1)                     # Load scores; store weights
        output = torch.matmul(weights, v)                       # Load weights, v; store output
        return output

    # WITH fusion: the tiling scheme behind flash attention. This PyTorch version
    # only illustrates the algorithm; the single HBM round-trip comes from
    # running the same loop inside one CUDA or Triton kernel.
    def fused_attention(q, k, v, block_size=256):
        """Tiled attention with an online softmax."""
        B, H, L, D = q.shape
        scale = 1.0 / math.sqrt(D)
        output = torch.zeros_like(q)
        # Process in blocks
        for i in range(0, L, block_size):
            q_block = q[:, :, i:i+block_size] * scale
            rows = q_block.shape[2]
            # Running row max, softmax denominator, and unnormalized output
            row_max = torch.full((B, H, rows, 1), float('-inf'), device=q.device, dtype=q.dtype)
            row_sum = torch.zeros((B, H, rows, 1), device=q.device, dtype=q.dtype)
            acc = torch.zeros_like(q_block)
            for j in range(0, L, block_size):
                k_block = k[:, :, j:j+block_size]
                v_block = v[:, :, j:j+block_size]
                # Block scores (held in registers/shared memory in a real kernel)
                scores = torch.matmul(q_block, k_block.transpose(-2, -1))
                # Online softmax update: rescale the old state to the new max
                new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)
                correction = torch.exp(row_max - new_max)
                exp_scores = torch.exp(scores - new_max)
                row_sum = row_sum * correction + exp_scores.sum(dim=-1, keepdim=True)
                acc = acc * correction + torch.matmul(exp_scores, v_block)
                row_max = new_max
            # Final normalization for this query block
            output[:, :, i:i+block_size] = acc / row_sum
        return output

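The step that makes tiling possible is the online softmax: a running maximum and running denominator are corrected each time a new block arrives, so the full row of scores never has to be materialized. The sketch below shows that idea for a single query row in plain Python, with hypothetical function names, and checks it against an ordinary two-pass softmax.

```python
import math


def softmax_two_pass(scores):
    """Reference softmax that needs the whole row at once."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def online_softmax_weighted_sum(scores, values, block_size):
    """Compute sum_i softmax(scores)_i * values_i one block at a time."""
    running_max = float("-inf")
    running_sum = 0.0   # softmax denominator, relative to running_max
    acc = 0.0           # unnormalized weighted sum, relative to running_max
    for start in range(0, len(scores), block_size):
        s_block = scores[start:start + block_size]
        v_block = values[start:start + block_size]
        new_max = max(running_max, max(s_block))
        correction = math.exp(running_max - new_max)  # 0.0 on the first block
        exps = [math.exp(s - new_max) for s in s_block]
        running_sum = running_sum * correction + sum(exps)
        acc = acc * correction + sum(e * v for e, v in zip(exps, v_block))
        running_max = new_max
    return acc / running_sum


scores = [0.5, 2.0, -1.0, 3.0, 0.0]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
reference = sum(w * v for w, v in zip(softmax_two_pass(scores), values))
streamed = online_softmax_weighted_sum(scores, values, block_size=2)
print(abs(reference - streamed) < 1e-12)
```

The result does not depend on the block size, which is what lets a kernel pick whatever tile size fits in shared memory.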
Flash Attention
Flash Attention IO Complexity
$$\text{HBM accesses} = O\!\left(\frac{n^2 d^2}{M}\right)$$
Here,
- $n$ = Sequence length
- $d$ = Head dimension
- $M$ = SRAM size
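Flash Attention (Dao et al., 2022) tiles the attention computation so that blocks fit in SRAM, reducing HBM accesses from O(nd + n²) for standard attention to O(n²d²/M). Since d² is typically several times smaller than M, this is a substantial reduction. The reported result is a 2-4× speedup and 5-20× memory savings compared to standard attention.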
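To get a feel for the two bounds, the sketch below evaluates them for one configuration. The function names are hypothetical, constant factors are dropped, and the SRAM budget of 100 KB holding FP16 elements is an assumed value, so only the ratio is meaningful and only roughly.

```python
def standard_attention_hbm_accesses(n, d):
    """Asymptotic HBM traffic of standard attention: n*d + n^2 elements."""
    return n * d + n * n


def flash_attention_hbm_accesses(n, d, sram_elements):
    """Asymptotic HBM traffic of tiled attention: n^2 * d^2 / M elements."""
    return n * n * d * d / sram_elements


n, d = 4096, 64
sram_elements = 100 * 1024 // 2   # assumed: 100 KB of SRAM, 2 bytes per element

standard = standard_attention_hbm_accesses(n, d)
flash = flash_attention_hbm_accesses(n, d, sram_elements)
print(f"standard/flash traffic ratio: {standard / flash:.1f}")
```

The ratio grows as the head dimension shrinks or the available SRAM grows, which matches the intuition that larger tiles mean fewer passes over the keys and values.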
Architecture Design for Hardware
Optimal Hidden Dimensions
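Hardware-Optimal Dimension
$$d_{\text{opt}} = k \cdot a, \quad a \in \{64, 128\}$$
Here,
- $d_{\text{opt}}$ = Optimal hidden dimension
- $k$ = Integer multiplier
- $a$ = Alignment granularity (64 or 128)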
Modern LLMs use dimensions like 4096, 5120, 6144, and 8192, all multiples of 64 or 128. Aligned dimensions let tensor cores operate at full utilization, whereas non-aligned dimensions waste compute cycles on padding up to the next valid tile size.
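A small helper makes the alignment cost visible. This is a sketch with hypothetical function names; it assumes padding is the only penalty and that kernels pad up to the next multiple of the chosen granularity.

```python
def round_up(dim, multiple):
    """Smallest multiple of `multiple` that is >= dim."""
    return -(-dim // multiple) * multiple


def padding_waste(dim, multiple):
    """Fraction of the padded dimension that is padding."""
    padded = round_up(dim, multiple)
    return (padded - dim) / padded


for dim in (4096, 4100, 5000):
    print(dim, round_up(dim, 128), f"{padding_waste(dim, 128):.1%}")
```

A dimension of 4100 is padded to 4224 under 128-alignment, so about 3% of the work along that dimension is spent on zeros, whereas 4096 incurs none.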
Layer Normalization Placement
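
    import torch.nn as nn

    class PreNormBlock(nn.Module):
        """Pre-norm transformer block (hardware efficient)."""

        def __init__(self, d_model, n_heads, d_ffn):
            super().__init__()
            self.norm1 = nn.LayerNorm(d_model)
            # Built-in stand-ins for the MultiHeadAttention and FeedForward
            # modules, which are not defined in this section.
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm2 = nn.LayerNorm(d_model)
            self.ffn = nn.Sequential(
                nn.Linear(d_model, d_ffn), nn.GELU(), nn.Linear(d_ffn, d_model)
            )

        def forward(self, x):
            # Pre-norm: normalize the sublayer input, leave the residual path untouched
            h = self.norm1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
            x = x + self.ffn(self.norm2(x))
            return x
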
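Pre-norm (applying LayerNorm before attention/FFN) keeps the residual stream unnormalized, which stabilizes training in deep models. It can also improve GPU utilization: the residual add can be computed in-place and fused with the following normalization into a single kernel, saving a read and a write of the activation tensor per sublayer. The quoted 30-40% reduction in memory bandwidth requirements depends on the implementation and should be read as an upper-end estimate.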
Activation Function Selection
| Activation | Compute | Memory | Hardware Efficiency |
|---|---|---|---|
| ReLU | Minimal | Low | Excellent |
| GELU | Moderate | Low | Good |
| SiLU/Swish | Moderate | Low | Good |
| GeGLU | Higher | Higher | Moderate |
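Gated variants such as GeGLU and SwiGLU (the latter is used in LLaMA and PaLM) tend to give better quality, but they add a third weight matrix to the FFN, which is 1.5× the matrix-multiply work at the same hidden width. Models often shrink the FFN width to about two-thirds to compensate. The quality improvement generally justifies the cost in large models, while a plain GELU or SiLU FFN is cheaper for small ones.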
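The cost of a gated FFN is easy to count. In this sketch the function name is hypothetical and biases are ignored; the width 11008 is the FFN width used by LLaMA-7B with a model dimension of 4096, included to show the effect of shrinking the hidden width.

```python
def ffn_params(d_model, d_ffn, gated):
    """Weight count of a transformer FFN, ignoring biases.

    Standard FFN: up and down projections (2 matrices).
    Gated FFN (GeGLU/SwiGLU): gate, up, and down projections (3 matrices).
    """
    n_matrices = 3 if gated else 2
    return n_matrices * d_model * d_ffn


standard = ffn_params(4096, 4 * 4096, gated=False)
gated_same_width = ffn_params(4096, 4 * 4096, gated=True)
gated_shrunk = ffn_params(4096, 11008, gated=True)

print(gated_same_width / standard)          # 1.5
print(round(gated_shrunk / standard, 3))    # close to 1
```

Because FFN FLOPs per token are proportional to the weight count, the same ratios apply to compute.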
Quantization and Hardware
Hardware-Specific Quantization

    class HardwareAwareQuantization:
        """Quantization strategies for different hardware."""

        @staticmethod
        def get_optimal_quantization(hardware_type):
            if hardware_type == "A100":
                return {
                    "weight": "INT8",      # Tensor core INT8 support
                    "activation": "FP16",  # Keep activations in FP16
                    "kv_cache": "INT8",    # Save memory on KV cache
                }
            elif hardware_type == "RTX_4090":
                return {
                    "weight": "INT4",      # Maximize memory savings
                    "activation": "FP16",  # FP16 activations
                    "kv_cache": "FP16",    # Limited INT4 support
                }
            elif hardware_type == "CPU":
                return {
                    "weight": "INT4",      # GGUF format
                    "activation": "FP32",  # CPU prefers FP32
                    "kv_cache": "FP32",    # No special support
                }
            # Fail loudly instead of silently returning None
            raise ValueError(f"Unknown hardware type: {hardware_type!r}")

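The choice of weight format is largely a question of what fits in device memory. The sketch below estimates weight storage per format; the names are hypothetical, and the estimate ignores quantization scales, zero points, activations, and the KV cache, so real footprints are somewhat larger.

```python
BYTES_PER_WEIGHT = {"FP32": 4.0, "FP16": 2.0, "BF16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gb(n_params, dtype):
    """Approximate weight storage in GB (10^9 bytes) for a given format."""
    return n_params * BYTES_PER_WEIGHT[dtype] / 1e9


for dtype in ("FP16", "INT8", "INT4"):
    print(f"70B in {dtype}: {weight_memory_gb(70e9, dtype):.0f} GB")
```

At INT4 a 70B model needs about 35 GB for weights, which fits on one 80 GB A100 but not on a 24 GB consumer card, while a 7B model at INT4 needs only about 3.5 GB.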
Tensor Core Utilization
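$$U = \frac{\text{FLOPS}_{\text{achieved}}}{\text{FLOPS}_{\text{peak}}}$$
Here,
- $U$ = Tensor core utilization (0-1)
- $\text{FLOPS}_{\text{achieved}}$ = Measured throughput
- $\text{FLOPS}_{\text{peak}}$ = Theoretical maximum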
Achieving high tensor core utilization requires: (1) matrix dimensions aligned to 16, (2) data in supported formats (FP16/BF16/INT8), (3) sufficient parallelism to hide memory latency, (4) kernel fusion to reduce memory transfers.
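Utilization can be estimated from a timed matrix multiply. In this sketch the function name is hypothetical and the 5 ms timing is an invented input for illustration, not a measurement; in practice the elapsed time would come from a synchronized GPU benchmark.

```python
def matmul_utilization(m, n, k, elapsed_s, peak_flops):
    """Fraction of peak throughput achieved by an (m x k) @ (k x n) product."""
    achieved_flops = 2 * m * n * k / elapsed_s
    return achieved_flops / peak_flops


# Hypothetical timing: an 8192^3 FP16 matmul taking 5 ms on a 312 TFLOPS part
u = matmul_utilization(8192, 8192, 8192, elapsed_s=5e-3, peak_flops=312e12)
print(f"utilization: {u:.0%}")
```

A value well below 1 on a large, aligned matmul usually points to one of the four requirements above not being met.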
Memory Optimization Techniques
Weight Sharing
Definition: Weight Sharing
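Weight sharing reduces memory footprint by reusing the same weights across layers or attention heads, for example by tying the input embedding to the output projection, or by sharing key/value projections across heads as in multi-query and grouped-query attention. It is a form of parameter-efficient design that helps a model fit within hardware memory constraints.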
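As one concrete case, tying the input embedding to the output projection removes an entire vocabulary-sized matrix. The sketch below counts the saving; the function name is hypothetical, and the vocabulary size of 32,000 and model dimension of 4096 are example values only.

```python
def embedding_params(vocab_size, d_model, tied):
    """Parameters in the input embedding plus the output projection."""
    n_tables = 1 if tied else 2
    return n_tables * vocab_size * d_model


untied = embedding_params(32_000, 4096, tied=False)
tied = embedding_params(32_000, 4096, tied=True)
saved = untied - tied
print(f"{saved / 1e6:.0f}M parameters saved, {saved * 2 / 1e6:.0f} MB in FP16")
```

The saving is fixed by vocabulary size and model width, so it matters most for small models, where the embedding tables are a large share of the total.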
Gradient Checkpointing

    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint

    class CheckpointedTransformer(nn.Module):
        """Transformer with gradient checkpointing."""

        def __init__(self, d_model, n_layers, n_heads):
            super().__init__()
            # nn.TransformerEncoderLayer stands in for TransformerBlock,
            # which is not defined in this section.
            self.layers = nn.ModuleList([
                nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
                for _ in range(n_layers)
            ])

        def forward(self, x):
            for layer in self.layers:
                # Checkpoint: drop this layer's activations, recompute them in backward
                x = checkpoint(layer, x, use_reentrant=False)
            return x

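Gradient checkpointing trades compute for memory: intermediate activations are discarded during the forward pass and recomputed during the backward pass, at the cost of roughly one extra forward pass. It reduces activation memory only, for example by about 4× (from 140 GB to 35 GB in the scenario considered here). Weights, gradients, and optimizer state are unaffected, so a 70B model, whose FP16 weights alone occupy 140 GB, still cannot be trained on a single 80 GB A100 with checkpointing alone.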
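A simplified accounting model shows where the saving comes from. The function name and the memory model are assumptions made for illustration: without checkpointing every layer keeps all of its intermediate activations, while with per-layer checkpointing only each layer's input is kept, plus the full activations of the one layer currently being recomputed.

```python
def activation_memory(n_layers, per_layer_bytes, boundary_bytes, checkpointed):
    """Peak activation memory under a simplified per-layer checkpointing model.

    per_layer_bytes: all intermediate activations saved inside one layer.
    boundary_bytes:  the single input tensor saved at a layer boundary.
    """
    if not checkpointed:
        return n_layers * per_layer_bytes
    # One saved input per layer, plus one layer's internals during recompute
    return n_layers * boundary_bytes + per_layer_bytes


# Illustrative units: each layer stores 16 tensors of equal size internally
full = activation_memory(80, per_layer_bytes=16, boundary_bytes=1, checkpointed=False)
ckpt = activation_memory(80, per_layer_bytes=16, boundary_bytes=1, checkpointed=True)
print(full, ckpt, round(full / ckpt, 1))
```

The achievable reduction depends on how many tensors a layer saves internally and on checkpoint granularity, which is why reported savings vary between setups.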
Benchmarking Hardware Efficiency
Measuring Real Performance

    import time
    import torch

    def benchmark_model(model, input_ids, n_warmup=10, n_iter=100):
        """Benchmark model inference performance (assumes a CUDA device)."""
        model.eval()
        # Warmup
        with torch.no_grad():
            for _ in range(n_warmup):
                model(input_ids)
        # Benchmark
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
        start = time.perf_counter()
        with torch.no_grad():
            for _ in range(n_iter):
                model(input_ids)
        torch.cuda.synchronize()  # Wait for queued kernels before stopping the clock
        elapsed = time.perf_counter() - start
        # Calculate metrics (count every token in the batch, not just one sequence)
        tokens_per_second = input_ids.numel() * n_iter / elapsed
        memory_used = torch.cuda.max_memory_allocated() / 1024**3
        return {
            "tokens_per_second": tokens_per_second,
            "memory_gb": memory_used,
            "latency_ms": elapsed / n_iter * 1000,
        }

Practice Exercises
- Conceptual: Explain why autoregressive generation is memory-bandwidth bound while prefill is compute-bound. How does this affect optimization strategies?
- Mathematical: Calculate the arithmetic intensity of matrix multiplication for two 4096×4096 matrices in FP16. Is this operation compute-bound or memory-bound on an A100?
- Practical: Benchmark the same transformer model with different hidden dimensions (2048, 4096, 8192) and measure how tensor core alignment affects throughput.
- Research: Investigate how mixed-precision training (FP16/BF16) affects both training speed and final model quality. What is the optimal precision strategy?
Key Takeaways:
- GPU memory hierarchy levels differ by roughly 10-100× in capacity and speed
- Tensor cores provide up to 16× the throughput of CUDA cores for matrix operations
- Kernel fusion reduces memory transfers and improves throughput
- Model dimensions should align to 64 or 128 for tensor core efficiency
- Pre-norm architecture can reduce memory traffic when the residual add is fused with normalization
- Gradient checkpointing trades compute for memory (3-4× reduction in activation memory)
- Autoregressive generation is memory-bandwidth bound; prefill is compute-bound
What to Learn Next
- Flash Attention and Memory Efficiency: IO-aware attention optimization for modern GPUs.
- Quantization Techniques Deep Dive: GPTQ, AWQ, GGUF, and hardware-specific quantization.
- Model Parallelism and Tensor Parallelism: Distributing models across multiple GPUs.
- KV Cache Optimization: Optimizing transformer inference memory.
- LLM Inference Optimization: Speeding up model inference for production.
- Distributed Training for LLMs: Training large models across multiple GPUs.