Hardware-Aware LLM Design — Bridging Theory and Silicon

Model architecture and hardware are inseparable. Understanding GPU memory hierarchy, tensor cores, and kernel optimization enables designs that run several times faster in practice than alternatives with the same theoretical complexity.

Memory Hierarchy — Registers, shared memory, L1/L2 cache, HBM
Tensor Cores — Matrix multiplication units optimized for specific data types
Kernel Fusion — Combining operations to minimize memory transfers
Architecture Design — Shaping models to match hardware capabilities

The fastest algorithm is the one that matches the hardware you have.

Hardware-Aware LLM Design

The theoretical complexity of an algorithm tells only part of the story. Real-world performance depends on how well the computation maps to the underlying hardware. For LLMs, this means understanding GPU architecture and designing models that exploit it.

Definition: Hardware-Aware Design

Hardware-aware design is the practice of co-designing model architecture and computation to match the characteristics of target hardware, including memory hierarchy, compute units, and interconnect topology.

GPU Memory Hierarchy

The Memory Pyramid

Architecture Diagram

┌─────────────────────────────┐
│         Registers           │  256 KB per SM, ~1 cycle
│        (per SM)             │
├─────────────────────────────┤
│       Shared Memory         │  up to 164 KB per SM, tens of cycles
│     (per SM, on-chip)       │
├─────────────────────────────┤
│        L1 Cache             │  shares the 192 KB per-SM pool with shared memory
│      (per SM, on-chip)      │
├─────────────────────────────┤
│        L2 Cache             │  40 MB, ~200 cycles
│       (global, on-chip)     │
├─────────────────────────────┤
│       HBM (GPU RAM)         │  80 GB, several hundred cycles
│     (off-chip, DRAM)        │
├─────────────────────────────┤
│       CPU RAM               │  e.g. 512 GB, reached over PCIe or NVLink
│       (off-chip)            │
└─────────────────────────────┘

Capacities are for an A100 80 GB; latencies are rough orders of magnitude and vary with access pattern.

The key insight is that each level down the hierarchy offers far more capacity, but at higher latency and lower bandwidth than the level above it. The most efficient algorithms keep their working data in the upper levels and minimize data movement between levels.

Memory Bandwidth

$$B_{\text{eff}} = \frac{\text{Bytes transferred}}{\text{Time}} = \frac{2 \cdot n \cdot d \cdot \text{bytes}}{\text{time}}$$

Here,

$B_{\text{eff}}$ = Effective (achieved) memory bandwidth
$n$ = Sequence length or batch size
$d$ = Model dimension
$\text{bytes}$ = Bytes per element (2 for FP16, 1 for INT8)

Bandwidth Bottleneck

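A 70B-parameter model in FP16 needs 140 GB for its weights alone. That is more than a single 80 GB A100 holds, so in practice the weights are sharded across GPUs, but the bandwidth arithmetic is still instructive: at roughly 2 TB/s of HBM bandwidth, streaming 140 GB of weights once takes about 70 ms. Autoregressive generation at batch size 1 reads every weight once per generated token, so this transfer time, not compute, sets the floor on per-token latency. The model is memory-bandwidth bound.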
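The arithmetic above can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch, not a performance model: the function name `min_decode_latency_ms` is made up for illustration, and it assumes every weight is read exactly once per token at the full quoted bandwidth.

```python
def min_decode_latency_ms(n_params, bytes_per_param, bandwidth_bytes_per_s):
    """Lower bound on per-token latency if every weight is read once per token."""
    weight_bytes = n_params * bytes_per_param
    return weight_bytes / bandwidth_bytes_per_s * 1e3


# Assumed figures: 70B parameters and roughly 2 TB/s of HBM bandwidth.
BANDWIDTH = 2e12
latency_fp16 = min_decode_latency_ms(70e9, 2, BANDWIDTH)  # about 70 ms
latency_int8 = min_decode_latency_ms(70e9, 1, BANDWIDTH)  # about 35 ms
max_tokens_per_s_fp16 = 1e3 / latency_fp16                # about 14 tokens/s

print(f"FP16: {latency_fp16:.1f} ms/token, at most {max_tokens_per_s_fp16:.1f} tokens/s")
print(f"INT8: {latency_int8:.1f} ms/token")
```

Halving the bytes per weight halves the bound, which is why weight quantization helps decode latency even when compute throughput is unchanged.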

Arithmetic Intensity

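$$AI = \frac{\text{FLOPs}}{\text{Bytes}} = \frac{O(n^2 d)}{O(nd + n^2)} = O(\min(n, d))$$

Here,

$AI$ = Arithmetic intensity (FLOPs per byte)
$n$ = Outer matrix dimension, for the product of an $n \times d$ and a $d \times n$ matrix
$d$ = Inner dimension

The denominator counts the two $n \times d$ inputs and the $n \times n$ output. When $n \le d$ the result simplifies to $O(n)$.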

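The roofline model says that an operation is memory-bound when its arithmetic intensity falls below the hardware's ridge point, which is the ratio of peak compute to memory bandwidth. For an A100 in FP16 that is roughly 312 TFLOPS / 2 TB/s ≈ 156 FLOPs per byte. Larger matrix multiplications have higher arithmetic intensity and are therefore more likely to be compute-bound.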
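A minimal sketch of the roofline check follows. It assumes each operand is read from memory once and the result written once, which is an idealization, and it uses approximate A100 figures (312 TFLOPS FP16, 2 TB/s). The function names are illustrative, not from any library.

```python
def matmul_arithmetic_intensity(m, n, k, bytes_per_elem=2):
    """FLOPs per byte for an (m x k) @ (k x n) matmul with ideal data reuse."""
    flops = 2 * m * n * k                                # one multiply + one add per term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


def ridge_point(peak_flops, bandwidth_bytes_per_s):
    """Arithmetic intensity at which compute and memory time are equal."""
    return peak_flops / bandwidth_bytes_per_s


ridge = ridge_point(312e12, 2e12)                        # 156 FLOPs/byte
ai_square = matmul_arithmetic_intensity(4096, 4096, 4096)
ai_decode = matmul_arithmetic_intensity(1, 4096, 4096)   # one token against a weight matrix

for name, ai in [("4096^3 matmul", ai_square), ("single-token matvec", ai_decode)]:
    bound = "compute-bound" if ai > ridge else "memory-bound"
    print(f"{name}: {ai:.1f} FLOPs/byte -> {bound}")
```

The square matmul lands well above the ridge point, while the single-token case sits near 1 FLOP per byte, far below it.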

Tensor Cores and Matrix Operations

Tensor Core Architecture

<svg viewBox="0 0 400 300" className="w-full h-auto">
  <!-- GPU SM -->
  <rect x="20" y="20" width="360" height="260" rx="10" fill="#1e1e2e" stroke="#89b4fa"/>
  <text x="200" y="45" textAnchor="middle" fill="#cdd6f4" fontSize="14" fontWeight="bold">GPU Streaming Multiprocessor (SM)</text>
  
  <!-- Tensor Cores -->
  <rect x="40" y="60" width="150" height="100" rx="8" fill="#a6e3a1" opacity="0.3" stroke="#a6e3a1"/>
  <text x="115" y="85" textAnchor="middle" fill="#a6e3a1" fontSize="12" fontWeight="bold">Tensor Cores</text>
  <text x="115" y="105" textAnchor="middle" fill="#a6e3a1" fontSize="10">FP16/BF16/INT8</text>
  <text x="115" y="120" textAnchor="middle" fill="#a6e3a1" fontSize="10">Matrix Multiply-Accumulate</text>
  <text x="115" y="135" textAnchor="middle" fill="#a6e3a1" fontSize="10">312 TFLOPS (A100)</text>
  
  <!-- CUDA Cores -->
  <rect x="210" y="60" width="150" height="100" rx="8" fill="#f9e2af" opacity="0.3" stroke="#f9e2af"/>
  <text x="285" y="85" textAnchor="middle" fill="#f9e2af" fontSize="12" fontWeight="bold">CUDA Cores</text>
  <text x="285" y="105" textAnchor="middle" fill="#f9e2af" fontSize="10">FP32/FP64 Operations</text>
  <text x="285" y="120" textAnchor="middle" fill="#f9e2af" fontSize="10">Scalar Operations</text>
  <text x="285" y="135" textAnchor="middle" fill="#f9e2af" fontSize="10">19.5 TFLOPS (A100)</text>
  
  <!-- Shared Memory -->
  <rect x="40" y="180" width="320" height="60" rx="8" fill="#89b4fa" opacity="0.3" stroke="#89b4fa"/>
  <text x="200" y="205" textAnchor="middle" fill="#89b4fa" fontSize="12" fontWeight="bold">Shared Memory / L1 Cache</text>
  <text x="200" y="225" textAnchor="middle" fill="#89b4fa" fontSize="10">~192 KB per SM • 19 TB/s bandwidth</text>
  
  <!-- Legend -->
  <text x="40" y="270" fill="#6c7086" fontSize="10">Tensor cores are 16× faster than CUDA cores for matrix operations</text>
</svg>

Tensor Core Operations

Tensor Core FMA

$$D = A \times B + C \quad \text{(where } A, B, C, D \text{ are } 16 \times 16 \text{ matrices)}$$

Here,

$A$ = Input matrix (FP16/BF16)
$B$ = Weight matrix (FP16/BF16)
$C$ = Accumulator (FP32)
$D$ = Output matrix (FP32)

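Tensor cores perform matrix multiply-accumulate (MMA) operations on small fixed-size tiles, such as the 16×16 tiles above. Each A100 SM has 4 tensor cores, and each performs 256 FP16 FMAs per cycle. With 108 SMs at a 1.41 GHz boost clock and 2 FLOPs per FMA, this gives about 312 TFLOPS, 16× the FP32 throughput of the CUDA cores.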
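The peak figure can be checked by multiplying out the hardware counts. The sketch below assumes the published A100 numbers (108 SMs, 4 tensor cores per SM, 256 dense FP16 FMAs per tensor core per cycle, 1.41 GHz boost clock); the function name is illustrative.

```python
def peak_tensor_tflops(n_sms, tensor_cores_per_sm, fma_per_core_per_cycle, clock_hz):
    """Peak dense throughput in TFLOPS, counting each FMA as 2 FLOPs."""
    flops_per_cycle = n_sms * tensor_cores_per_sm * fma_per_core_per_cycle * 2
    return flops_per_cycle * clock_hz / 1e12


a100_fp16 = peak_tensor_tflops(108, 4, 256, 1.41e9)
speedup_vs_cuda_cores = a100_fp16 / 19.5   # 19.5 TFLOPS FP32 on CUDA cores

print(f"A100 FP16 tensor core peak: {a100_fp16:.1f} TFLOPS")
print(f"Ratio to FP32 CUDA cores:   {speedup_vs_cuda_cores:.1f}x")
```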

Data Type Considerations

Data Type	Size	Tensor Core Support	Throughput (A100)
FP32	4 bytes	No	19.5 TFLOPS
TF32	4 bytes	Yes	156 TFLOPS
FP16	2 bytes	Yes	312 TFLOPS
BF16	2 bytes	Yes	312 TFLOPS
INT8	1 byte	Yes	624 TOPS
INT4	0.5 bytes	Yes	1,248 TOPS

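TF32 (TensorFloat-32) combines the 8-bit exponent of FP32, and therefore its range, with the 10-bit mantissa of FP16. Values are still stored as 32-bit floats; only the tensor core multiply uses the reduced precision. On Ampere GPUs this gives up to 8× the peak FP32 throughput (156 vs. 19.5 TFLOPS) with little quality loss for most deep learning workloads. Whether it is enabled by default depends on the library: PyTorch, for example, has disabled TF32 for matrix multiplication by default since version 1.12, and it is turned on with torch.backends.cuda.matmul.allow_tf32 = True.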
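To make the range/precision trade-off concrete, the sketch below emulates TF32 on the CPU by clearing the 13 lowest mantissa bits of an FP32 value. This is an assumption-laden illustration: it truncates toward zero, whereas real hardware may round differently, and `to_tf32` is a made-up helper, not a library function.

```python
import struct


def to_tf32(x):
    """Emulate TF32 by keeping FP32's sign and exponent and 10 of 23 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # clear the 13 low mantissa bits (truncation, not rounding)
    return struct.unpack("<f", struct.pack("<I", bits))[0]


kept = to_tf32(1.0 + 2**-10)     # smallest step above 1.0 that TF32 can represent
dropped = to_tf32(1.0 + 2**-11)  # finer than 10 mantissa bits, collapses to 1.0
large = to_tf32(1e38)            # far beyond FP16's maximum of 65504, still finite

print(kept, dropped, large)
```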

Kernel Fusion

The Memory Transfer Problem

Definition: Kernel Fusion

Kernel fusion combines multiple GPU operations (kernels) into a single kernel, reducing memory transfers between operations. This is critical for LLM inference where memory bandwidth is the bottleneck.

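import math
import torch
import torch.nn.functional as F

# WITHOUT fusion: 3 memory round-trips (each step is a separate kernel)
def unfused_attention(q, k, v):
    scale = 1.0 / math.sqrt(q.shape[-1])
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale  # Load q, k; store scores
    weights = F.softmax(scores, dim=-1)                     # Load scores; store weights
    output = torch.matmul(weights, v)                       # Load weights, v; store output
    return output

# WITH fusion: the tiling scheme behind flash attention, written in plain PyTorch.
# This loop only demonstrates the arithmetic; the single HBM round-trip comes from
# a real fused kernel that keeps each tile in registers/shared memory.
def fused_attention(q, k, v, block_size=256):
    """Tiled attention with an online (streaming) softmax."""
    B, H, L, D = q.shape
    scale = 1.0 / math.sqrt(D)
    output = torch.zeros_like(q)

    # Process queries in blocks
    for i in range(0, L, block_size):
        q_block = q[:, :, i:i+block_size]
        Lq = q_block.shape[2]

        # Running row max, softmax denominator, and unnormalized output
        row_max = torch.full((B, H, Lq, 1), float('-inf'), device=q.device, dtype=q.dtype)
        row_sum = torch.zeros((B, H, Lq, 1), device=q.device, dtype=q.dtype)
        acc = torch.zeros_like(q_block)

        # Stream over key/value blocks for this query block
        for j in range(0, L, block_size):
            k_block = k[:, :, j:j+block_size]
            v_block = v[:, :, j:j+block_size]

            # Block scores (held in registers/shared memory in a real kernel)
            scores = torch.matmul(q_block, k_block.transpose(-2, -1)) * scale

            # Online softmax: rescale earlier partial results to the new row max
            new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)
            correction = torch.exp(row_max - new_max)
            exp_scores = torch.exp(scores - new_max)

            # Update normalization and output accumulators
            row_sum = row_sum * correction + exp_scores.sum(dim=-1, keepdim=True)
            acc = acc * correction + torch.matmul(exp_scores, v_block)
            row_max = new_max

        # Final normalization for this query block
        output[:, :, i:i+block_size] = acc / row_sum
    return output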

Flash Attention

Flash Attention IO Complexity

$$O\left(\frac{n^2 d^2}{M}\right)$$

Here,

$n$ = Sequence length
$d$ = Head dimension
$M$ = SRAM size

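Flash Attention (Dao et al., 2022) tiles the attention computation so that each block fits in on-chip SRAM and the full n×n score matrix is never written to HBM. This reduces HBM accesses from Θ(nd + n²) for standard attention to O(n²d²/M), and makes memory use linear rather than quadratic in sequence length. The reported gains are a 2-4× speedup and 5-20× memory savings compared to standard attention.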
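The two access counts can be compared numerically. The sketch below plugs values into the asymptotic expressions with all constant factors dropped, so only the ratio and the scaling trend are meaningful. The SRAM capacity of 48K elements is a hypothetical value chosen for illustration.

```python
def standard_attention_hbm_accesses(n, d):
    """Theta(n*d + n^2): read Q, K, V and round-trip the n x n score matrix."""
    return n * d + n * n


def flash_attention_hbm_accesses(n, d, sram_elems):
    """O(n^2 * d^2 / M) with M the on-chip SRAM capacity in elements."""
    return n * n * d * d / sram_elems


n, d = 4096, 64
sram_elems = 48 * 1024  # hypothetical on-chip capacity, in elements

standard = standard_attention_hbm_accesses(n, d)
flash = flash_attention_hbm_accesses(n, d, sram_elems)
ratio = standard / flash

print(f"standard: {standard:,} accesses")
print(f"flash:    {flash:,.0f} accesses ({ratio:.1f}x fewer)")
```

The advantage holds whenever d² is much smaller than M, which is the usual case for head dimensions of 64 to 128.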

Architecture Design for Hardware

Optimal Hidden Dimensions

Hardware-Optimal Dimension

$$d_{\text{opt}} = k \cdot 64 \quad \text{(multiple of 64 for tensor cores)}$$

Here,

$d_{\text{opt}}$ = Optimal hidden dimension
$k$ = Integer multiplier

Modern LLMs use dimensions like 4096, 5120, 6144, 8192—all multiples of 64 or 128. This ensures tensor cores operate at full utilization. Non-aligned dimensions waste compute cycles padding to valid sizes.
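A small helper makes the alignment rule easy to apply when choosing dimensions. This is a sketch; `round_up_to_multiple` is an illustrative name, and the example sizes are arbitrary apart from 50257, the GPT-2 vocabulary size that is commonly padded in this way.

```python
def round_up_to_multiple(dim, multiple=64):
    """Smallest multiple of `multiple` that is >= dim."""
    return ((dim + multiple - 1) // multiple) * multiple


def padding_fraction(dim, multiple=64):
    """Fraction of the padded dimension that is wasted on padding."""
    padded = round_up_to_multiple(dim, multiple)
    return (padded - dim) / padded


hidden = round_up_to_multiple(4000)   # 4032
aligned = round_up_to_multiple(4096)  # already aligned, unchanged
vocab = round_up_to_multiple(50257)   # 50304

print(hidden, aligned, vocab, f"{padding_fraction(50257):.4%} padding")
```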

Layer Normalization Placement

import torch.nn as nn

class PreNormBlock(nn.Module):
    """Pre-norm transformer block (hardware efficient).

    MultiHeadAttention and FeedForward are assumed to be defined elsewhere
    as modules mapping (batch, seq, d_model) -> (batch, seq, d_model).
    """
    
    def __init__(self, d_model, n_heads, d_ffn):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = FeedForward(d_model, d_ffn)
    
    def forward(self, x):
        # Pre-norm: normalize the sublayer input, leave the residual stream untouched
        x = x + self.attn(self.norm1(x))
        x = x + self.ffn(self.norm2(x))
        return x

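Pre-norm applies LayerNorm before attention and the FFN, so the residual stream itself is never normalized. Its main benefit is more stable training in deep models. On the hardware side, the residual add can be fused with the normalization that follows it into a single kernel, which saves a round trip through HBM. The size of that saving depends on the kernel implementation and is best measured on the target hardware.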

Activation Function Selection

Activation	Compute	Memory	Hardware Efficiency
ReLU	Minimal	Low	Excellent
GELU	Moderate	Low	Good
SiLU/Swish	Moderate	Low	Good
GeGLU	Higher	Higher	Moderate

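Gated activations such as GeGLU and SwiGLU (the variant used in LLaMA and PaLM) tend to improve quality but add a third weight matrix to the FFN. At the same hidden width that means 1.5× the parameters and FLOPs of a standard two-matrix FFN, which is why gated models often shrink the hidden width to about two-thirds to keep the cost level. The quality improvement justifies the extra complexity in large models, but for small models a standard SiLU FFN can be more efficient.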
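The cost difference between a standard and a gated FFN comes down to counting weight matrices. The sketch below ignores biases and assumes a standard FFN has two d_model × d_ffn matrices while a gated FFN has three; the sizes 4096, 16384, and 11008 are example values (the last matches the commonly cited LLaMA-7B hidden width).

```python
def ffn_params(d_model, d_ffn, gated):
    """Weight count of an FFN block, ignoring biases."""
    n_matrices = 3 if gated else 2  # gated: gate, up, down; standard: up, down
    return n_matrices * d_model * d_ffn


d_model = 4096
standard = ffn_params(d_model, 4 * d_model, gated=False)
gated_same_width = ffn_params(d_model, 4 * d_model, gated=True)
gated_two_thirds = ffn_params(d_model, 11008, gated=True)

print(f"standard, d_ffn=16384: {standard:,}")
print(f"gated,    d_ffn=16384: {gated_same_width:,} ({gated_same_width / standard:.2f}x)")
print(f"gated,    d_ffn=11008: {gated_two_thirds:,} ({gated_two_thirds / standard:.2f}x)")
```

FLOPs per token scale with the same counts, so the parameter ratio is also the compute ratio.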

Quantization and Hardware

Hardware-Specific Quantization

class HardwareAwareQuantization:
    """Quantization strategies for different hardware (starting points, not rules)."""
    
    @staticmethod
    def get_optimal_quantization(hardware_type):
        if hardware_type == "A100":
            return {
                "weight": "INT8",      # Tensor core INT8 support
                "activation": "FP16",  # Keep activations in FP16
                "kv_cache": "INT8",    # Save memory on KV cache
            }
        elif hardware_type == "RTX_4090":
            return {
                "weight": "INT4",      # Maximize memory savings
                "activation": "FP16",  # FP16 activations
                "kv_cache": "FP16",    # Limited INT4 support
            }
        elif hardware_type == "CPU":
            return {
                "weight": "INT4",      # GGUF format
                "activation": "FP32",  # CPU prefers FP32
                "kv_cache": "FP32",    # No special support
            }
        # Fail loudly instead of silently returning None
        raise ValueError(f"Unknown hardware type: {hardware_type!r}")

Tensor Core Utilization

$$U_{\text{TC}} = \frac{\text{Actual TFLOPS}}{\text{Peak TFLOPS}}$$

Here,

$U_{\text{TC}}$ = Tensor core utilization (0-1)
$\text{Peak TFLOPS}$ = Theoretical maximum

Achieving high tensor core utilization requires: (1) matrix dimensions aligned to 16, (2) data in supported formats (FP16/BF16/INT8), (3) sufficient parallelism to hide memory latency, (4) kernel fusion to reduce memory transfers.
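Utilization is straightforward to compute once a kernel has been timed. The sketch below does so for a single matrix multiplication; the 5 ms timing is a hypothetical measurement used only to show the calculation, and 312 TFLOPS is the A100 FP16 peak.

```python
def matmul_flops(m, n, k):
    """FLOPs for an (m x k) @ (k x n) matmul: one multiply and one add per term."""
    return 2 * m * n * k


def tensor_core_utilization(flops, seconds, peak_flops):
    """Achieved throughput as a fraction of the theoretical peak."""
    return (flops / seconds) / peak_flops


flops = matmul_flops(8192, 8192, 8192)
measured_seconds = 5e-3  # hypothetical timing, not a real measurement
utilization = tensor_core_utilization(flops, measured_seconds, 312e12)

print(f"achieved {flops / measured_seconds / 1e12:.1f} TFLOPS, utilization {utilization:.1%}")
```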

Memory Optimization Techniques

Weight Sharing

Definition: Weight Sharing

Weight sharing reduces memory footprint by reusing weights across layers or attention heads. This is a form of parameter-efficient design that matches hardware memory constraints.
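One widely used form of sharing across attention heads is multi-query or grouped-query attention, where several query heads reuse a single key/value projection. It is offered here as an example of the idea, not as the specific technique the text has in mind. The sketch counts key and value projection weights under that assumption, ignoring biases.

```python
def kv_projection_params(d_model, n_heads, n_kv_heads):
    """Weights in the K and V projections when n_kv_heads are shared by n_heads."""
    head_dim = d_model // n_heads
    return 2 * d_model * n_kv_heads * head_dim  # factor 2: one K and one V projection


d_model, n_heads = 4096, 32
mha = kv_projection_params(d_model, n_heads, n_kv_heads=32)  # no sharing
gqa = kv_projection_params(d_model, n_heads, n_kv_heads=8)   # 4 query heads per KV head
mqa = kv_projection_params(d_model, n_heads, n_kv_heads=1)   # all heads share one KV head

print(f"MHA: {mha:,}  GQA: {gqa:,}  MQA: {mqa:,}")
```

The KV cache shrinks by the same factor as the number of KV heads, which matters more than the weight savings for bandwidth-bound decoding.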

Gradient Checkpointing

import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedTransformer(nn.Module):
    """Transformer with gradient checkpointing.

    TransformerBlock is assumed to be defined elsewhere.
    """
    
    def __init__(self, d_model, n_layers, n_heads):
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerBlock(d_model, n_heads) 
            for _ in range(n_layers)
        ])
    
    def forward(self, x):
        for layer in self.layers:
            # Checkpoint: keep only the layer input and recompute the
            # layer's activations during the backward pass
            x = checkpoint(layer, x, use_reentrant=False)
        return x

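Gradient checkpointing trades compute for memory. Instead of keeping every layer's intermediate activations for the backward pass, it stores only the input of each checkpointed block and recomputes the rest on demand, at the cost of roughly one extra forward pass. It reduces activation memory only. Weights, gradients, and optimizer states are unaffected, so a 70B model, whose FP16 weights alone take 140 GB, still has to be sharded across multiple GPUs for training.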
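A simple model shows where the savings come from. It assumes activation memory without checkpointing is the per-layer activation size times the number of layers, and that with checkpointing only one segment's activations plus each segment's input are live at a time. The sizes used (80 layers, 1 GB of activations per layer, 0.05 GB per layer input) are hypothetical.

```python
import math


def activation_memory_gb(n_layers, per_layer_gb, boundary_gb, checkpoint_every=None):
    """Rough peak activation memory with or without gradient checkpointing."""
    if checkpoint_every is None:
        # Every layer's intermediate activations stay alive until backward
        return n_layers * per_layer_gb
    n_segments = math.ceil(n_layers / checkpoint_every)
    # Stored segment inputs + activations of the one segment being recomputed
    return n_segments * boundary_gb + checkpoint_every * per_layer_gb


full = activation_memory_gb(80, 1.0, 0.05)
every_layer = activation_memory_gb(80, 1.0, 0.05, checkpoint_every=1)
every_two = activation_memory_gb(80, 1.0, 0.05, checkpoint_every=2)

print(f"no checkpointing: {full:.1f} GB")
print(f"every layer:      {every_layer:.1f} GB")
print(f"every 2 layers:   {every_two:.1f} GB")
```

The actual reduction depends on how much larger a layer's internal activations are than its input, so measured savings vary by architecture and sequence length.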

Benchmarking Hardware Efficiency

Measuring Real Performance

import time
import torch

def benchmark_model(model, input_ids, n_warmup=10, n_iter=100):
    """Benchmark forward-pass throughput on a CUDA device."""
    with torch.no_grad():
        # Warmup: lets kernels autotune and caches fill
        for _ in range(n_warmup):
            model(input_ids)
        
        # Track peak memory for the timed region only
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
        start = time.perf_counter()
        
        for _ in range(n_iter):
            model(input_ids)
        
        # CUDA kernels launch asynchronously; wait before stopping the clock
        torch.cuda.synchronize()
        end = time.perf_counter()
    
    # Calculate metrics (count every token in the batch)
    elapsed = end - start
    tokens_per_second = input_ids.numel() * n_iter / elapsed
    memory_used = torch.cuda.max_memory_allocated() / 1024**3
    
    return {
        "tokens_per_second": tokens_per_second,
        "memory_gb": memory_used,
        "latency_ms": elapsed / n_iter * 1000
    }

Practice Exercises

Conceptual: Explain why autoregressive generation is memory-bandwidth bound while prefill is compute-bound. How does this affect optimization strategies?
Mathematical: Calculate the arithmetic intensity of matrix multiplication for two 4096×4096 matrices in FP16. Is this operation compute-bound or memory-bound on an A100?
Practical: Benchmark the same transformer model with different hidden dimensions (2048, 4096, 8192) and measure how tensor core alignment affects throughput.
Research: Investigate how mixed-precision training (FP16/BF16) affects both training speed and final model quality. What is the optimal precision strategy?

Key Takeaways:

Each level of the GPU memory hierarchy is far larger but slower than the one above it, so minimizing data movement matters most
Tensor cores provide 16× speedup over CUDA cores for matrix operations
Kernel fusion reduces memory transfers and improves throughput
Model dimensions should align to 64 or 128 for tensor core efficiency
Pre-norm architecture reduces memory bandwidth requirements
Gradient checkpointing trades recomputation for lower activation memory
Autoregressive generation is memory-bandwidth bound; prefill is compute-bound

What to Learn Next

-> Flash Attention and Memory Efficiency: IO-aware attention optimization for modern GPUs.

-> Quantization Techniques Deep Dive: GPTQ, AWQ, GGUF, and hardware-specific quantization.

-> Model Parallelism and Tensor Parallelism: Distributing models across multiple GPUs.

-> KV Cache Optimization: Optimizing transformer inference memory.

-> LLM Inference Optimization: Speeding up model inference for production.

-> Distributed Training for LLMs: Training large models across multiple GPUs.
