Hardware-Aware LLM Design

Hardware-Aware LLM Design: Bridging Theory and Silicon

Model architecture and hardware are inseparable. Understanding GPU memory hierarchy, tensor cores, and kernel optimization enables designs that are 10-100× faster in practice, even with equivalent theoretical complexity.

  • Memory Hierarchy: Registers, shared memory, L1/L2 cache, HBM
  • Tensor Cores: Matrix multiplication units optimized for specific data types
  • Kernel Fusion: Combining operations to minimize memory transfers
  • Architecture Design: Shaping models to match hardware capabilities

The fastest algorithm is the one that matches the hardware you have.

The theoretical complexity of an algorithm tells only part of the story. Real-world performance depends on how well the computation maps to the underlying hardware. For LLMs, this means understanding GPU architecture and designing models that exploit it.

Definition: Hardware-Aware Design

Hardware-aware design is the practice of co-designing model architecture and computation to match the characteristics of target hardware, including memory hierarchy, compute units, and interconnect topology.

GPU Memory Hierarchy

The Memory Pyramid

Architecture Diagram
┌─────────────────────────────┐
│         Registers           │  ~256 KB per SM, <1 ns
│        (per SM)             │
├─────────────────────────────┤
│       Shared Memory         │  up to ~164 KB per SM, ~20 ns
│     (per SM, on-chip)       │
├─────────────────────────────┤
│        L1 Cache             │  shares the ~192 KB per-SM SRAM (~20 MB chip-wide), ~20 ns
│      (per SM, on-chip)      │
├─────────────────────────────┤
│        L2 Cache             │  ~40 MB, ~150 ns
│       (global, on-chip)     │
├─────────────────────────────┤
│       HBM (GPU RAM)         │  80 GB, ~400 ns
│     (off-chip, DRAM)        │
├─────────────────────────────┤
│       CPU RAM               │  e.g. 512 GB, >1 µs from the GPU (over PCIe)
│       (off-chip)            │
└─────────────────────────────┘
Figures are approximate A100-class values; latencies are order-of-magnitude estimates.

The key insight is that capacity and speed trade off: moving down the hierarchy, each level is typically one to two orders of magnitude larger, but also markedly slower in both latency and bandwidth. The most efficient algorithms minimize data movement between levels.

Memory Bandwidth

$$B_{\text{eff}} = \frac{\text{Bytes transferred}}{\text{Time}} = \frac{2 \cdot n \cdot d \cdot \text{bytes}}{\text{time}}$$

Here,

  • $B_{\text{eff}}$ = Effective bandwidth utilization
  • $n$ = Sequence length or batch size
  • $d$ = Model dimension
  • $\text{bytes}$ = Bytes per element (2 for FP16, 1 for INT8)

Bandwidth Bottleneck

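A 70B parameter model in FP16 needs 140 GB for its weights alone, more than a single 80 GB A100 holds, so in practice the weights are sharded across GPUs. The bandwidth arithmetic is the same either way: at 2 TB/s of HBM bandwidth, streaming 140 GB takes 70 ms. In autoregressive generation with batch size 1, every weight is read once per generated token, so this transfer time dominates compute time and the model is memory-bandwidth bound.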
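The 70 ms figure is simply bytes divided by bandwidth. A minimal sketch of that arithmetic follows, assuming every weight is read exactly once per generated token and ignoring KV-cache traffic and compute time; the function name is illustrative, not from any library.

```python
def decode_latency_ms(n_params: float, bytes_per_param: float,
                      bandwidth_bytes_per_s: float) -> float:
    """Lower bound on per-token latency when weight loading dominates."""
    weight_bytes = n_params * bytes_per_param
    return weight_bytes / bandwidth_bytes_per_s * 1e3


# 70B parameters streamed at 2 TB/s, in FP16 (2 bytes) and INT8 (1 byte)
fp16_ms = decode_latency_ms(70e9, 2, 2e12)
int8_ms = decode_latency_ms(70e9, 1, 2e12)

print(f"FP16: {fp16_ms:.0f} ms/token, about {1e3 / fp16_ms:.1f} tokens/s")
print(f"INT8: {int8_ms:.0f} ms/token, about {1e3 / int8_ms:.1f} tokens/s")
```

Halving the bytes per weight halves the bound, which is why weight quantization helps batch-1 generation even when it adds no compute throughput.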

Arithmetic Intensity

$$AI = \frac{\text{FLOPs}}{\text{Bytes}} = \frac{O(n^2 d)}{O(nd)} = O(n)$$

Here,

  • $AI$ = Arithmetic intensity (FLOPs per byte)
  • $n$ = Matrix dimension
  • $d$ = Inner dimension

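The roofline model says that an operation is memory-bound when its arithmetic intensity falls below the hardware's ridge point, the ratio of peak compute to memory bandwidth (about 312 TFLOPS / 2 TB/s ≈ 156 FLOPs per byte for FP16 on an A100). Because the arithmetic intensity of matrix multiplication grows with matrix size, larger matrices are more likely to be compute-bound.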
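The roofline bound can be written in a few lines. The sketch below assumes the A100 figures quoted in this lesson (312 TFLOPS FP16 peak, 2 TB/s HBM bandwidth) as defaults; the function names are illustrative.

```python
def ridge_point(peak_tflops: float = 312.0, bandwidth_tb_per_s: float = 2.0) -> float:
    """Arithmetic intensity (FLOPs/byte) at which compute and bandwidth limits meet."""
    return peak_tflops / bandwidth_tb_per_s


def attainable_tflops(intensity: float, peak_tflops: float = 312.0,
                      bandwidth_tb_per_s: float = 2.0) -> float:
    """Roofline: throughput is capped by compute or by bandwidth x intensity."""
    # TB/s times FLOPs/byte gives TFLOP/s, so the units line up directly
    return min(peak_tflops, bandwidth_tb_per_s * intensity)


def is_memory_bound(intensity: float, peak_tflops: float = 312.0,
                    bandwidth_tb_per_s: float = 2.0) -> bool:
    return intensity < ridge_point(peak_tflops, bandwidth_tb_per_s)


# Batch-1 matrix-vector product in FP16: 2 FLOPs per 2-byte weight = 1 FLOP/byte
for intensity in (1.0, 50.0, 156.0, 1000.0):
    print(intensity, attainable_tflops(intensity), is_memory_bound(intensity))
```

At 1 FLOP per byte the bound is 2 TFLOPS, far below peak, which is the roofline view of why batch-1 decoding cannot keep the tensor cores busy.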

Tensor Cores and Matrix Operations

Tensor Core Architecture

<svg viewBox="0 0 400 300" className="w-full h-auto">
  <!-- GPU SM -->
  <rect x="20" y="20" width="360" height="260" rx="10" fill="#1e1e2e" stroke="#89b4fa"/>
  <text x="200" y="45" textAnchor="middle" fill="#cdd6f4" fontSize="14" fontWeight="bold">GPU Streaming Multiprocessor (SM)</text>
  
  <!-- Tensor Cores -->
  <rect x="40" y="60" width="150" height="100" rx="8" fill="#a6e3a1" opacity="0.3" stroke="#a6e3a1"/>
  <text x="115" y="85" textAnchor="middle" fill="#a6e3a1" fontSize="12" fontWeight="bold">Tensor Cores</text>
  <text x="115" y="105" textAnchor="middle" fill="#a6e3a1" fontSize="10">FP16/BF16/INT8</text>
  <text x="115" y="120" textAnchor="middle" fill="#a6e3a1" fontSize="10">Matrix Multiply-Accumulate</text>
  <text x="115" y="135" textAnchor="middle" fill="#a6e3a1" fontSize="10">312 TFLOPS (A100)</text>
  
  <!-- CUDA Cores -->
  <rect x="210" y="60" width="150" height="100" rx="8" fill="#f9e2af" opacity="0.3" stroke="#f9e2af"/>
  <text x="285" y="85" textAnchor="middle" fill="#f9e2af" fontSize="12" fontWeight="bold">CUDA Cores</text>
  <text x="285" y="105" textAnchor="middle" fill="#f9e2af" fontSize="10">FP32/FP64 Operations</text>
  <text x="285" y="120" textAnchor="middle" fill="#f9e2af" fontSize="10">Scalar Operations</text>
  <text x="285" y="135" textAnchor="middle" fill="#f9e2af" fontSize="10">19.5 TFLOPS (A100)</text>
  
  <!-- Shared Memory -->
  <rect x="40" y="180" width="320" height="60" rx="8" fill="#89b4fa" opacity="0.3" stroke="#89b4fa"/>
  <text x="200" y="205" textAnchor="middle" fill="#89b4fa" fontSize="12" fontWeight="bold">Shared Memory / L1 Cache</text>
  <text x="200" y="225" textAnchor="middle" fill="#89b4fa" fontSize="10">~192 KB per SM • 19 TB/s bandwidth</text>
  
  <!-- Legend -->
  <text x="40" y="270" fill="#6c7086" fontSize="10">Tensor cores are 16× faster than CUDA cores for matrix operations</text>
</svg>

Tensor Core Operations

Tensor Core FMA

$$D = A \times B + C \quad \text{(where } A, B, C, D \text{ are } 16 \times 16 \text{ matrices)}$$

Here,

  • $A$ = Input matrix (FP16/BF16)
  • $B$ = Weight matrix (FP16/BF16)
  • $C$ = Accumulator (FP32)
  • $D$ = Output matrix (FP32)

Tensor cores perform matrix multiply-accumulate (MMA) operations on 16×16 tiles. An A100 has 108 SMs with 4 tensor cores each, and each tensor core performs 256 FP16 FMAs per cycle. At about 1.4 GHz, this gives 108 × 4 × 256 × 2 FLOPs per cycle ≈ 312 TFLOPS, 16× the FP32 throughput of the CUDA cores.
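The 312 TFLOPS peak can be rebuilt from per-unit counts. The sketch below assumes the published A100 figures of 108 SMs, 4 tensor cores per SM, 256 dense FP16 FMAs per tensor core per clock, and a 1.41 GHz boost clock; the function name is illustrative.

```python
def peak_tensor_tflops(n_sms: int = 108, tensor_cores_per_sm: int = 4,
                       fma_per_core_per_clock: int = 256,
                       clock_ghz: float = 1.41) -> float:
    """Peak dense tensor-core throughput in TFLOPS."""
    # One fused multiply-add counts as 2 floating-point operations
    flops_per_clock = n_sms * tensor_cores_per_sm * fma_per_core_per_clock * 2
    return flops_per_clock * clock_ghz * 1e9 / 1e12


tensor_peak = peak_tensor_tflops()
cuda_fp32_peak = 19.5  # TFLOPS, the FP32 CUDA-core figure quoted in this lesson

print(f"Tensor-core peak: {tensor_peak:.1f} TFLOPS")
print(f"Ratio to FP32 CUDA cores: {tensor_peak / cuda_fp32_peak:.1f}x")
```

This is a ceiling, not an expectation: real kernels reach it only when tiles are aligned and data arrives fast enough to keep every tensor core fed.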

Data Type Considerations

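| Data Type | Size | Tensor Core Support | Throughput (A100) |
|-----------|------|---------------------|-------------------|
| FP32 | 4 bytes | No | 19.5 TFLOPS |
| TF32 | 4 bytes | Yes | 156 TFLOPS |
| FP16 | 2 bytes | Yes | 312 TFLOPS |
| BF16 | 2 bytes | Yes | 312 TFLOPS |
| INT8 | 1 byte | Yes | 624 TOPS |
| INT4 | 0.5 bytes | Yes | 1,248 TOPS |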

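TF32 (TensorFloat-32) provides the range of FP32 (8 exponent bits) with the precision of FP16 (10 mantissa bits). On Ampere GPUs, libraries can route FP32 matrix operations through TF32 tensor cores; whether this is enabled by default depends on the framework and version. It gives up to 8× the peak throughput of FP32 (156 vs 19.5 TFLOPS), usually with minimal quality loss.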
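The "FP32 range, FP16 precision" trade-off can be seen by emulating TF32 on the CPU. The sketch below zeroes the low 13 mantissa bits of an FP32 value, leaving 1 sign, 8 exponent, and 10 mantissa bits. It is a simplification: how real hardware rounds the dropped bits is not modeled, and the function name is illustrative.

```python
import struct


def truncate_to_tf32(x: float) -> float:
    """Emulate TF32 precision by clearing the low 13 mantissa bits of an FP32 value."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    bits &= 0xFFFFE000  # keep 1 sign + 8 exponent + top 10 mantissa bits
    (y,) = struct.unpack("<f", struct.pack("<I", bits))
    return y


print(truncate_to_tf32(1.0 + 2**-10))  # smallest step above 1.0 that survives
print(truncate_to_tf32(1.0 + 2**-11))  # finer than TF32 resolution, collapses to 1.0
print(truncate_to_tf32(1e30))          # far beyond FP16's maximum of 65504, still finite
```

The last line is the point of the format: a value that would overflow FP16 is still representable, at the cost of about three decimal digits of precision.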

Kernel Fusion

The Memory Transfer Problem

Definition: Kernel Fusion

Kernel fusion combines multiple GPU operations (kernels) into a single kernel, reducing memory transfers between operations. This is critical for LLM inference where memory bandwidth is the bottleneck.

import math
import torch
import torch.nn.functional as F

# WITHOUT fusion: 3 kernels, each round-tripping through HBM
def unfused_attention(q, k, v):
    scale = 1.0 / math.sqrt(q.shape[-1])
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale  # Load q, k; store scores
    weights = F.softmax(scores, dim=-1)                     # Load scores; store weights
    output = torch.matmul(weights, v)                       # Load weights, v; store output
    return output

# WITH fusion: the tiling logic of flash attention, written as a reference loop.
# A real fused kernel runs the inner loop in registers/shared memory; this
# Python version only shows the arithmetic and is not itself faster.
def fused_attention(q, k, v, block_size=256):
    """Tiled attention with online softmax. q, k, v: (B, H, L, D)."""
    B, H, L, D = q.shape
    scale = 1.0 / math.sqrt(D)
    output = torch.zeros_like(q)

    for i in range(0, L, block_size):
        q_block = q[:, :, i:i+block_size]
        Lq = q_block.shape[2]
        # Running max, running softmax denominator, and unnormalized output
        row_max = torch.full((B, H, Lq, 1), float('-inf'), device=q.device, dtype=q.dtype)
        row_sum = torch.zeros((B, H, Lq, 1), device=q.device, dtype=q.dtype)
        acc = torch.zeros_like(q_block)

        for j in range(0, L, block_size):
            k_block = k[:, :, j:j+block_size]
            v_block = v[:, :, j:j+block_size]
            scores = torch.matmul(q_block, k_block.transpose(-2, -1)) * scale

            # Online softmax: rescale earlier partial results to the new max
            new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)
            correction = torch.exp(row_max - new_max)
            exp_scores = torch.exp(scores - new_max)
            row_sum = row_sum * correction + exp_scores.sum(dim=-1, keepdim=True)
            acc = acc * correction + torch.matmul(exp_scores, v_block)
            row_max = new_max

        # Final normalization for this query block
        output[:, :, i:i+block_size] = acc / row_sum
    return output

Flash Attention

Flash Attention IO Complexity

$$O\left(\frac{n^2 d^2}{M}\right)$$

Here,

  • $n$ = Sequence length
  • $d$ = Head dimension
  • $M$ = SRAM size

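Flash Attention (Dao et al., 2022) tiles the attention computation so that blocks fit in on-chip SRAM, reducing HBM accesses from O(nd + n²) to O(n²d²/M). Because d² is usually much smaller than M, this is a large reduction. The n×n score matrix is never written to HBM, so memory grows linearly rather than quadratically with sequence length. The reported gains are a 2-4× speedup and 5-20× memory savings compared to standard attention, depending on sequence length.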
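To get a feel for the two access counts, the sketch below evaluates the asymptotic expressions with constants dropped, so only the ratio is meaningful, and only roughly. It assumes the O(nd + n²) bound for standard attention from the Flash Attention paper and measures M in elements rather than bytes; the value of about 100K elements is an illustrative assumption, as are the function names.

```python
def standard_attention_hbm_accesses(n: int, d: int) -> float:
    """Standard attention: read/write Q, K, V, O (n*d) plus the n x n score matrix."""
    return n * d + n * n


def flash_attention_hbm_accesses(n: int, d: int, sram_elems: int) -> float:
    """Flash attention: n^2 * d^2 / M, with M the SRAM size in elements."""
    return n * n * d * d / sram_elems


d = 64
sram_elems = 98_304  # assumed: roughly 192 KB of SRAM holding 2-byte elements

for n in (1024, 4096, 16384):
    std = standard_attention_hbm_accesses(n, d)
    flash = flash_attention_hbm_accesses(n, d, sram_elems)
    print(f"n={n:6d}  standard={std:.3e}  flash={flash:.3e}  ratio={std / flash:.1f}")
```

The ratio approaches M/d² as n grows, which is why a larger SRAM or a smaller head dimension both favor the tiled algorithm.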

Architecture Design for Hardware

Optimal Hidden Dimensions

Hardware-Optimal Dimension

$$d_{\text{opt}} = k \cdot 64 \quad \text{(multiple of 64 for tensor cores)}$$

Here,

  • $d_{\text{opt}}$ = Optimal hidden dimension
  • $k$ = Integer multiplier

Modern LLMs use dimensions like 4096, 5120, 6144, and 8192, all multiples of 64 or 128. This ensures tensor cores operate at full utilization. Non-aligned dimensions waste compute cycles padding to valid sizes.
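A small helper makes the alignment rule concrete. This is a sketch under the simplifying assumption that a misaligned dimension is padded up to the next multiple and the padded elements are pure waste; actual kernels choose tile sizes in more complicated ways, and both function names are illustrative.

```python
def round_up(dim: int, multiple: int = 64) -> int:
    """Smallest multiple of `multiple` that is >= dim."""
    return ((dim + multiple - 1) // multiple) * multiple


def padding_waste(dim: int, multiple: int = 64) -> float:
    """Fraction of the padded dimension that holds no useful data."""
    padded = round_up(dim, multiple)
    return (padded - dim) / padded


for dim in (4096, 4097, 5000, 5120):
    print(dim, round_up(dim), f"{padding_waste(dim):.2%}")
```

When choosing a hidden size for a new model, rounding the target up front with `round_up` avoids paying this padding on every matrix multiply.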

Layer Normalization Placement

import torch.nn as nn

# MultiHeadAttention and FeedForward are assumed to be defined elsewhere;
# each maps (batch, seq, d_model) -> (batch, seq, d_model).

class PreNormBlock(nn.Module):
    """Pre-norm transformer block."""
    
    def __init__(self, d_model, n_heads, d_ffn):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = MultiHeadAttention(d_model, n_heads)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = FeedForward(d_model, d_ffn)
    
    def forward(self, x):
        # Pre-norm: normalize each sublayer's input; the residual path stays an identity
        x = x + self.attn(self.norm1(x))
        x = x + self.ffn(self.norm2(x))
        return x

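Pre-norm (applying LayerNorm before attention/FFN) leaves the residual path as a plain addition, which stabilizes training and lets the residual add be computed in place or fused with neighboring kernels. The bandwidth saved depends on how aggressively the implementation fuses these operations; figures of 30-40% should be read as implementation-specific rather than inherent to pre-norm.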

Activation Function Selection

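| Activation | Compute | Memory | Hardware Efficiency |
|------------|---------|--------|---------------------|
| ReLU | Minimal | Low | Excellent |
| GELU | Moderate | Low | Good |
| SiLU/Swish | Moderate | Low | Good |
| GeGLU | Higher | Higher | Moderate |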

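Gated activations such as GeGLU and its close relative SwiGLU (the variant used in LLaMA and PaLM) provide better quality but add a third projection matrix, so at the same hidden width the FFN needs 1.5× the parameters and matmul FLOPs. Models usually shrink the FFN width to about two thirds to compensate. The quality improvement justifies the extra complexity in large models, but for small models an ungated activation such as SiLU is more efficient.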
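Counting weight matrices makes the cost of gating concrete: a standard FFN has two projections (up and down), while a gated FFN has three (gate, up, and down). The sketch below counts parameters only, with biases ignored; since each weight contributes one multiply-add per token, the FLOP ratio is the same. The function name and the example sizes are illustrative.

```python
def ffn_params(d_model: int, d_ffn: int, gated: bool) -> int:
    """Weight count of a transformer FFN, biases ignored."""
    n_matrices = 3 if gated else 2  # gated: gate + up + down; standard: up + down
    return n_matrices * d_model * d_ffn


d_model = 4096
standard = ffn_params(d_model, 4 * d_model, gated=False)
gated_same_width = ffn_params(d_model, 4 * d_model, gated=True)
# Shrinking the hidden width to 2/3 of 4*d_model restores parameter parity
gated_two_thirds = ffn_params(d_model, (8 * d_model) // 3, gated=True)

print(f"standard FFN:        {standard:,}")
print(f"gated, same width:   {gated_same_width:,} ({gated_same_width / standard:.2f}x)")
print(f"gated, 2/3 width:    {gated_two_thirds:,} ({gated_two_thirds / standard:.2f}x)")
```

In practice the two-thirds width is then rounded to a hardware-friendly multiple, which ties this choice back to the alignment rule above.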

Quantization and Hardware

Hardware-Specific Quantization

class HardwareAwareQuantization:
    """Quantization strategies for different hardware."""
    
    @staticmethod
    def get_optimal_quantization(hardware_type):
        if hardware_type == "A100":
            return {
                "weight": "INT8",      # Tensor core INT8 support
                "activation": "FP16",  # Keep activations in FP16
                "kv_cache": "INT8",    # Save memory on KV cache
            }
        elif hardware_type == "RTX_4090":
            return {
                "weight": "INT4",      # Maximize memory savings
                "activation": "FP16",  # FP16 activations
                "kv_cache": "FP16",    # Limited INT4 support
            }
        elif hardware_type == "CPU":
            return {
                "weight": "INT4",      # GGUF format
                "activation": "FP32",  # CPU prefers FP32
                "kv_cache": "FP32",    # No special support
            }
        else:
            # Fail loudly instead of silently returning None
            raise ValueError(f"Unsupported hardware type: {hardware_type!r}")

Tensor Core Utilization

$$U_{\text{TC}} = \frac{\text{Actual TFLOPS}}{\text{Peak TFLOPS}}$$

Here,

  • $U_{\text{TC}}$ = Tensor core utilization (0-1)
  • $\text{Peak TFLOPS}$ = Theoretical maximum

Achieving high tensor core utilization requires: (1) matrix dimensions aligned to 16, (2) data in supported formats (FP16/BF16/INT8), (3) sufficient parallelism to hide memory latency, (4) kernel fusion to reduce memory transfers.
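Utilization is straightforward to estimate once a matrix multiply has been timed. The sketch below assumes a dense (m × k) by (k × n) multiply costing 2·m·n·k FLOPs and the 312 TFLOPS FP16 peak quoted above; the 0.6 ms timing is a made-up value for illustration, not a measurement.

```python
def tensor_core_utilization(m: int, n: int, k: int, seconds: float,
                            peak_tflops: float = 312.0) -> float:
    """Achieved / peak throughput for an (m x k) @ (k x n) matrix multiply."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    achieved_tflops = flops / seconds / 1e12
    return achieved_tflops / peak_tflops


# Hypothetical timing of 0.6 ms for a 4096 x 4096 x 4096 FP16 matmul
u = tensor_core_utilization(4096, 4096, 4096, seconds=0.6e-3)
print(f"utilization: {u:.1%}")
```

The `seconds` argument should come from a synchronized GPU timing, since unsynchronized timings measure only kernel launch overhead.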

Memory Optimization Techniques

Weight Sharing

Definition: Weight Sharing

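Weight sharing reduces memory footprint by reusing weights across layers or attention heads, for example by tying the input and output embedding matrices, or by sharing key/value projections across heads as in multi-query and grouped-query attention. This is a form of parameter-efficient design that matches hardware memory constraints.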

Gradient Checkpointing

import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# TransformerBlock is assumed to be defined elsewhere.

class CheckpointedTransformer(nn.Module):
    """Transformer with gradient checkpointing."""
    
    def __init__(self, d_model, n_layers, n_heads):
        super().__init__()
        self.layers = nn.ModuleList([
            TransformerBlock(d_model, n_heads) 
            for _ in range(n_layers)
        ])
    
    def forward(self, x):
        for layer in self.layers:
            # Checkpoint: keep only the layer input, discard intermediate
            # activations, and recompute them during the backward pass
            x = checkpoint(layer, x, use_reentrant=False)
        return x

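Gradient checkpointing trades compute for memory: only the activations at layer boundaries are kept, and everything inside a layer is recomputed during the backward pass, at the cost of roughly one extra forward pass. Activation memory typically drops by 3-4×, for example from 140 GB to 35 GB. Weights, gradients, and optimizer state are unaffected, so checkpointing alone does not make a 70B model trainable on a single 80 GB A100; it is normally combined with sharding across GPUs.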
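A simple accounting model shows where the savings come from. The sketch assumes per-layer checkpointing: without it, every layer's full set of activations is stored; with it, only each layer's input is stored, plus one layer's full activations while that layer is being recomputed. All sizes below are hypothetical values picked for illustration, not measurements of any real model.

```python
def activation_memory_gb(n_layers: int, per_layer_gb: float,
                         boundary_gb: float, checkpoint: bool) -> float:
    """Peak activation memory under a simple per-layer checkpointing model."""
    if not checkpoint:
        # Every intermediate activation of every layer is kept for backward
        return n_layers * per_layer_gb
    # Keep one boundary tensor per layer, plus one layer being recomputed
    return n_layers * boundary_gb + per_layer_gb


full = activation_memory_gb(80, per_layer_gb=1.75, boundary_gb=0.4, checkpoint=False)
ckpt = activation_memory_gb(80, per_layer_gb=1.75, boundary_gb=0.4, checkpoint=True)

print(f"without checkpointing: {full:.1f} GB")
print(f"with checkpointing:    {ckpt:.1f} GB ({full / ckpt:.1f}x smaller)")
```

The achievable reduction depends on the ratio of boundary size to total per-layer activations, which is why quoted savings vary between models and sequence lengths.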

Benchmarking Hardware Efficiency

Measuring Real Performance

import time
import torch

def benchmark_model(model, input_ids, n_warmup=10, n_iter=100):
    """Benchmark model inference performance on a CUDA device."""
    model.eval()
    
    with torch.no_grad():
        # Warmup: lets kernel selection and caches settle before timing
        for _ in range(n_warmup):
            model(input_ids)
        
        # Reset the peak-memory counter and wait for warmup kernels to finish
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()
        start = time.perf_counter()
        
        for _ in range(n_iter):
            model(input_ids)
        
        # CUDA launches are asynchronous: synchronize before stopping the clock
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    
    # input_ids is (batch, seq_len); count every token processed
    tokens_per_second = input_ids.numel() * n_iter / elapsed
    memory_used = torch.cuda.max_memory_allocated() / 1024**3
    
    return {
        "tokens_per_second": tokens_per_second,
        "memory_gb": memory_used,
        "latency_ms": elapsed / n_iter * 1000
    }

Practice Exercises

  1. Conceptual: Explain why autoregressive generation is memory-bandwidth bound while prefill is compute-bound. How does this affect optimization strategies?

  2. Mathematical: Calculate the arithmetic intensity of matrix multiplication for two 4096×4096 matrices in FP16. Is this operation compute-bound or memory-bound on an A100?

  3. Practical: Benchmark the same transformer model with different hidden dimensions (2048, 4096, 8192) and measure how tensor core alignment affects throughput.

  4. Research: Investigate how mixed-precision training (FP16/BF16) affects both training speed and final model quality. What is the optimal precision strategy?

Key Takeaways:

  • GPU memory hierarchy has up to 100× bandwidth differences between levels
  • Tensor cores provide 16× speedup over CUDA cores for matrix operations
  • Kernel fusion reduces memory transfers and improves throughput
  • Model dimensions should align to 64 or 128 for tensor core efficiency
  • Pre-norm architecture keeps the residual path simple; bandwidth savings depend on kernel fusion
  • Gradient checkpointing trades compute for memory (3-4× reduction in activation memory)
  • Autoregressive generation is memory-bandwidth bound; prefill is compute-bound

What to Learn Next

-> Flash Attention and Memory Efficiency: IO-aware attention optimization for modern GPUs.

-> Quantization Techniques Deep Dive: GPTQ, AWQ, GGUF, and hardware-specific quantization.

-> Model Parallelism and Tensor Parallelism: Distributing models across multiple GPUs.

-> KV Cache Optimization: Optimizing transformer inference memory.

-> LLM Inference Optimization: Speeding up model inference for production.

-> Distributed Training for LLMs: Training large models across multiple GPUs.