LLM Architecture Deep Dive — How Transformers Power Language Models

Modern LLMs are built on the decoder-only Transformer architecture. This guide dives deep into self-attention mechanisms, positional encoding, and the critical KV cache optimization that powers efficient inference.

Transformers — Decoder-only architecture is the modern standard
Self-Attention — The core mechanism enabling context understanding
KV Cache — Cuts the per-step attention cost of autoregressive generation from O(T²) to O(T)

Architecture is destiny—understand the model to unlock its potential.

The Transformer Architecture

The original Transformer (Vaswani et al., 2017) uses an encoder-decoder structure. However, modern LLMs predominantly use a decoder-only architecture, which simplifies training and inference while maintaining strong performance.

Self-Attention Mechanism

The core computation in each Transformer layer is self-attention, which allows each token to attend to all previous tokens in the sequence.

In decoder-only models, we apply causal masking to prevent attention to future tokens:
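A minimal single-head sketch of this masked computation in plain Python, assuming the standard scaled dot-product form softmax(QKᵀ/√d_k + M)V, where M is -inf above the diagonal and 0 elsewhere. The function name `causal_attention` and the toy inputs are illustrative only; restricting the loop to positions 0..i is equivalent to adding the -inf mask.

```python
import math

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of T vectors (lists of floats); q and k share dimension d_k.
    """
    d_k = len(q[0])
    dim_v = len(v[0])
    out = []
    for i, q_i in enumerate(q):
        # Token i may only look at positions 0..i (future positions are masked out).
        scores = [sum(a * b for a, b in zip(q_i, k[j])) / math.sqrt(d_k)
                  for j in range(i + 1)]
        # Numerically stable softmax over the visible positions.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the visible value vectors.
        out.append([sum(w * v[j][c] for j, w in enumerate(weights))
                    for c in range(dim_v)])
    return out

q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = causal_attention(q, k, v)
print(out[0])  # [1.0, 2.0]: the first token can only attend to itself
```

Changing the value vector of a later token leaves the outputs of earlier tokens untouched, which is exactly the property the causal mask is meant to guarantee.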

Multi-Head Attention

Multi-head attention allows the model to attend to information from different representation subspaces:
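In the standard formulation of Vaswani et al. (2017), each head applies attention to its own learned projections of the layer input, and the head outputs are concatenated and projected back to d_model. The notation is an assumption made here: X is the layer input, h the number of heads, and W_i^Q, W_i^K, W_i^V, W^O the learned projection matrices.

```latex
\begin{aligned}
\mathrm{head}_i &= \mathrm{Attention}\big(X W_i^Q,\; X W_i^K,\; X W_i^V\big), \qquad i = 1, \dots, h \\
\mathrm{MultiHead}(X) &= \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O \\
W_i^Q, W_i^K &\in \mathbb{R}^{d_{\text{model}} \times d_k}, \quad
W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}, \quad
W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}
\end{aligned}
```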

The per-head dimensions are typically:

d_k = d_model / h
d_v = d_model / h

Positional Encoding

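Self-attention on its own is permutation-equivariant: reordering the input tokens simply reorders the outputs, so the layer has no built-in notion of token order. Positional information must therefore be injected explicitly. Modern LLMs commonly use Rotary Position Embeddings (RoPE) or ALiBi.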

Rotary Position Embeddings (RoPE)

RoPE encodes position by rotating the query and key vectors in the attention computation:
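In the usual formulation, each consecutive pair of dimensions of a query or key vector x at position m is rotated by an angle proportional to m. The symbols are assumptions made here: d is the per-head dimension, i = 0, ..., d/2 - 1 indexes the pairs, and 10000 is the commonly used frequency base.

```latex
\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix}
=
\begin{pmatrix}
\cos m\theta_i & -\sin m\theta_i \\
\sin m\theta_i & \cos m\theta_i
\end{pmatrix}
\begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix},
\qquad \theta_i = 10000^{-2i/d}
```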

A key property of RoPE is that the attention score between tokens at positions m and n depends only on the relative distance (m - n):
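This property can be checked numerically. The sketch below assumes the common pairwise-rotation form of RoPE with frequencies θ_i = 10000^(-2i/d); the helper names `rope` and `dot` and the toy vectors are illustrative. It rotates the same query and key at two different absolute positions with the same offset and compares the resulting scores.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs (x[0], x[1]), (x[2], x[3]), ... by pos * theta_i."""
    d = len(x)
    out = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        a, b = x[2 * i], x[2 * i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# Same offset (m - n = 3) at two different absolute positions.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))
print(abs(score_a - score_b) < 1e-9)  # True
```

The two scores agree up to floating-point error because rotating q by mθ and k by nθ leaves their dot product dependent only on (m - n)θ.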

ALiBi (Attention with Linear Biases)

ALiBi adds a linear bias to attention scores based on distance, without any learned parameters:
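Concretely, the score between query position i and key position j (with j ≤ i) becomes q_i · k_j - m_h (i - j), where m_h is a fixed slope for head h. A minimal sketch follows, assuming the slope schedule from the ALiBi paper for a power-of-two head count (a geometric sequence starting at 2^(-8/n_heads)); the function names are illustrative, and the causal mask is folded into the bias as -inf.

```python
def alibi_slopes(n_heads):
    """Head-specific slopes: a geometric sequence starting at 2^(-8/n_heads).

    Assumes n_heads is a power of two.
    """
    start = 2 ** (-8 / n_heads)
    return [start ** (h + 1) for h in range(n_heads)]

def alibi_bias(seq_len, slope):
    """Bias added to attention scores: slope * (j - i) for j <= i, -inf for future j."""
    return [[slope * (j - i) if j <= i else float("-inf")
             for j in range(seq_len)]
            for i in range(seq_len)]

slopes = alibi_slopes(8)
print(slopes[0], slopes[-1])  # 0.5 0.00390625

bias = alibi_bias(4, slopes[0])
print(bias[3])  # [-1.5, -1.0, -0.5, 0.0]
```

Heads with large slopes focus on nearby tokens, while heads with small slopes can still attend far back in the sequence.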

The KV Cache

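In autoregressive generation, the model produces one token at a time. Without optimization, each step re-runs attention over the entire prefix, which costs O(T²) per generation step for a prefix of length T. The KV cache stores the key and value vectors of all previous tokens, so each step only computes the query, key, and value for the new token and attends over the cached entries, which costs O(T) per step.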
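A minimal sketch of the idea for one layer and one head, in plain Python. The class name `KVCache`, the helper `attend`, and the toy vectors (stand-ins for the projected query, key, and value of each token) are assumptions for illustration. Each decoding step appends one key and one value, then runs a single query against the stored entries.

```python
import math

def attend(q_t, keys, values):
    """Attention for one query over the given keys/values: O(T) work."""
    d_k = len(q_t)
    scores = [sum(a * b for a, b in zip(q_t, k)) / math.sqrt(d_k) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[c] for e, v in zip(exps, values))
            for c in range(len(values[0]))]

class KVCache:
    """Stores keys and values of tokens already processed (one layer, one head)."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q_t, k_t, v_t):
        # Append only the new token's key/value, then attend over the whole cache.
        self.keys.append(k_t)
        self.values.append(v_t)
        return attend(q_t, self.keys, self.values)

# Toy per-token projections (stand-ins for x_t W_Q, x_t W_K, x_t W_V).
qs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
ks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

cache = KVCache()
cached_out = [cache.step(q, k, v) for q, k, v in zip(qs, ks, vs)]

# Reference: attention over the explicit prefix at every step, which is what
# a cache-free implementation would have to rebuild each time.
full_out = [attend(qs[t], ks[:t + 1], vs[:t + 1]) for t in range(len(qs))]
print(cached_out == full_out)  # True
```

The cached and cache-free paths produce identical outputs; the cache only avoids recomputing keys and values for tokens that have already been seen, at the price of keeping them in memory.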

KV Cache Example

For a 70B parameter model with 80 layers, d_model = 8192, and sequence length 4096:

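KV cache size per token: 2 (keys and values) × 80 layers × 8192 × 2 bytes (FP16) = 2,621,440 bytes = 2.5 MiB (about 2.6 MB), assuming standard multi-head attention where every head stores its own keys and values
For batch size 1 and sequence length 4096: 4096 × 2.5 MiB = 10 GiB (about 10.7 GB)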
This is why KV cache management is critical for LLM serving
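The arithmetic for this configuration can be reproduced with a short helper. The function name `kv_cache_bytes` is illustrative, and the calculation assumes full multi-head attention (keys and values each of width d_model per layer); models that share keys and values across heads would need proportionally less.

```python
def kv_cache_bytes(n_layers, d_model, seq_len, batch_size=1, bytes_per_value=2):
    """KV cache size in bytes, assuming keys and values each have width d_model."""
    per_token = 2 * n_layers * d_model * bytes_per_value  # 2 = one key + one value
    return per_token * seq_len * batch_size

per_token = kv_cache_bytes(80, 8192, seq_len=1)
total = kv_cache_bytes(80, 8192, seq_len=4096)
print(per_token / 2**20)  # 2.5  (MiB per token)
print(total / 2**30)      # 10.0 (GiB for 4096 tokens)
```

Because the size scales linearly with both batch size and sequence length, serving many concurrent long requests can require more memory for the cache than for the model weights themselves.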

Comparison: Encoder vs Decoder vs Encoder-Decoder

Encoder-Only (BERT)

Bidirectional attention (no masking)
Pre-trained with masked language modeling (MLM)
Best for: classification, NER, sentence similarity
NOT suitable for text generation

Decoder-Only (GPT)

Causal (masked) attention
Pre-trained with next-token prediction (CLM)
Best for: text generation, in-context learning, instruction following
Dominant architecture for LLMs

Encoder-Decoder (T5, BART)

Encoder processes input, decoder generates output
Cross-attention between encoder and decoder
Best for: translation, summarization, question answering
More parameters for same input/output capacity

Feed-Forward Network

Each Transformer layer includes a position-wise feed-forward network (FFN):
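In its original form (Vaswani et al., 2017), the FFN is two linear maps with a nonlinearity in between, applied independently at every position. The symbols are the conventional ones and are assumed here: σ is the activation (ReLU in the original paper), W_1 and W_2 are the weight matrices, and b_1 and b_2 the biases.

```latex
\mathrm{FFN}(x) = \sigma(x W_1 + b_1)\, W_2 + b_2,
\qquad W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{ff}},\quad
W_2 \in \mathbb{R}^{d_{ff} \times d_{\text{model}}}
```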

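Many modern LLMs use SwiGLU (Shazeer, 2020) instead of ReLU/GELU. Because SwiGLU has three weight matrices rather than two, d_ff is set to (8/3) × d_model instead of the usual 4 × d_model, which keeps the parameter count roughly the same (d_ff is typically rounded to a multiple of 128 for hardware efficiency).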
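The hidden size can be computed with a small helper. The function name `swiglu_hidden_dim` is illustrative, and rounding up (rather than to the nearest multiple) is an assumption; it reproduces the values 1408 for d_model = 512 and 11008 for d_model = 4096 that appear in the example and exercises of this tutorial.

```python
import math

def swiglu_hidden_dim(d_model, multiple_of=128):
    """Round (8/3) * d_model up to the next multiple of `multiple_of`."""
    raw = 8 * d_model / 3
    return multiple_of * math.ceil(raw / multiple_of)

print(swiglu_hidden_dim(512))   # 1408
print(swiglu_hidden_dim(4096))  # 11008
```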

Transformer Forward Pass

The complete forward pass through a decoder-only Transformer:
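A sketch of the computation, assuming the pre-normalization layout with RMSNorm and residual connections that the TransformerBlock example in this tutorial uses. The notation is introduced here for illustration: x_{1:T} are the input token ids, L is the number of layers, and W_out is the output projection to vocabulary logits.

```latex
\begin{aligned}
h_0 &= \mathrm{Embed}(x_{1:T}) \\
h'_\ell &= h_{\ell-1} + \mathrm{Attention}\big(\mathrm{RMSNorm}(h_{\ell-1})\big), \qquad \ell = 1, \dots, L \\
h_\ell &= h'_\ell + \mathrm{FFN}\big(\mathrm{RMSNorm}(h'_\ell)\big) \\
\mathrm{logits} &= \mathrm{RMSNorm}(h_L)\, W_{\text{out}} \\
p(x_{t+1} \mid x_{1:t}) &= \mathrm{softmax}(\mathrm{logits}_t)
\end{aligned}
```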

Practical Example: Building a Minimal Transformer

import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))
    
    def forward(self, x):
        rms = torch.sqrt(torch.mean(x ** 2, dim=-1, keepdim=True) + self.eps)
        return x / rms * self.weight

class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)
    
    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class TransformerBlock(nn.Module):
    def __init__(self, dim: int, n_heads: int, hidden_dim: int):
        super().__init__()
        self.attention_norm = RMSNorm(dim)
        self.ffn_norm = RMSNorm(dim)
        self.attention = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = SwiGLU(dim, hidden_dim)
    
    def forward(self, x, mask=None):
        # Self-attention with residual
        h = self.attention_norm(x)
        h, _ = self.attention(h, h, h, attn_mask=mask)
        x = x + h
        
        # FFN with residual
        h = self.ffn_norm(x)
        h = self.ffn(h)
        x = x + h
        return x

# Example usage
dim, n_heads, hidden_dim = 512, 8, 1408  # a single block is roughly 3.2M params
block = TransformerBlock(dim, n_heads, hidden_dim)
x = torch.randn(2, 128, dim)  # (batch, seq_len, dim)
# Causal mask: -inf above the diagonal blocks attention to future tokens
mask = torch.triu(torch.full((128, 128), float('-inf')), diagonal=1)
output = block(x, mask)
print(f"Output shape: {output.shape}")  # torch.Size([2, 128, 512])

Practice Exercises

Architecture: Draw the block diagram of a single Transformer layer in a decoder-only model. Label all components and show the flow of information.
Mathematical: For a model with 32 layers, d_model = 4096, and 32 attention heads, calculate the total number of parameters in the attention blocks (Q, K, V, O projections) and the FFN layers (assuming SwiGLU with d_ff = 11008).
Implementation: Implement a simplified version of the KV cache for autoregressive generation. Show how it reduces computation from O(T²) to O(T) per generation step.
Analysis: Compare the memory requirements of a 7B parameter model in FP16 vs INT4 quantization. How does this affect the maximum sequence length you can use with a given GPU memory?

What to Learn Next

-> Tokenization for LLMs: How LLMs break text into manageable pieces using BPE, WordPiece, and more.

-> Pretraining Language Models: Learning language from the internet with CLM, scaling laws, and data curation.

-> Fine-Tuning LLMs: Customizing language models for your specific tasks and domains.

-> LoRA and PEFT: Efficient fine-tuning without full retraining using low-rank adaptation.

-> QLoRA and Quantization: Running LLMs on consumer hardware with INT4 quantization.

-> Prompt Engineering: Getting the most out of language models through effective input design.
