LLM Architecture Deep Dive

Modern LLMs are built on the decoder-only Transformer architecture. This tutorial provides a rigorous treatment of the architecture, including self-attention, positional encoding, and the critical KV cache optimization.

The Transformer Architecture

The original Transformer (Vaswani et al., 2017) uses an encoder-decoder structure. However, modern LLMs predominantly use a decoder-only architecture, which simplifies training and inference while maintaining strong performance.

Definition: Decoder-Only Transformer

A decoder-only Transformer uses causal (masked) self-attention, so each position can attend only to itself and to earlier positions. This prevents information leaking from future tokens, while still letting all positions of a training sequence be processed in parallel. Each layer consists of (1) masked multi-head self-attention and (2) a position-wise feed-forward network, each preceded by layer normalization and wrapped in a residual connection.

Self-Attention Mechanism

The core computation in each Transformer layer is self-attention, which allows each token to attend to all previous tokens in the sequence.

Scaled Dot-Product Self-Attention

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Here,

  • Q = Query matrix (n × d_k)
  • K = Key matrix (n × d_k)
  • V = Value matrix (n × d_v)
  • d_k = Dimension of keys/queries
  • n = Sequence length

In decoder-only models, we apply causal masking to prevent attention to future tokens:

Causal Self-Attention

$$\text{CausalAttention}(Q, K, V) = \text{softmax}\left(\frac{QK^T + M}{\sqrt{d_k}}\right)V$$

Here,

  • M = Causal mask matrix where M_{ij} = 0 if i ≥ j, else −∞
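The masked formula above can be sketched in plain Python for a single head. This is a minimal illustration, not an optimized implementation: the function names `softmax` and `causal_attention` are chosen here for the example, and matrices are represented as lists of row vectors.

```python
import math

def softmax(row):
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head causal attention on lists of row vectors."""
    n, d_k, d_v = len(Q), len(Q[0]), len(V[0])
    outputs, weights = [], []
    for i in range(n):
        scores = []
        for j in range(n):
            if j > i:
                # M_ij = -inf: query i may not look at a future key j
                scores.append(float("-inf"))
            else:
                dot = sum(q * k for q, k in zip(Q[i], K[j]))
                scores.append(dot / math.sqrt(d_k))
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * V[j][c] for j in range(n)) for c in range(d_v)])
    return outputs, weights

# Toy input: three tokens with d_k = d_v = 2
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = causal_attention(X, X, X)
for row in attn:
    print([round(w, 3) for w in row])
```

Every row of the printed weight matrix sums to 1, and all entries above the diagonal are exactly 0, which is the effect of the mask M.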

Multi-Head Attention

Multi-head attention allows the model to attend to information from different representation subspaces:

Multi-Head Attention

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O$$

Here,

  • h = Number of attention heads
  • head_i = Attention head i, computed as Attention(QW_i^Q, KW_i^K, VW_i^V)
  • W^O = Output projection matrix
  • W_i^Q, W_i^K, W_i^V = Learned projection matrices for head i

The per-head dimensions are typically:

  • d_k = d_model / h
  • d_v = d_model / h

For a model with d_model = 4096 and h = 32 heads, each head has dimension d_k = d_v = 128. This is the configuration used by models like LLaMA 2 7B.
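As a quick check of the head arithmetic, the following sketch computes the per-head dimension and the number of weights in the four attention projections of one layer. It assumes bias-free projections with d_k = d_v = d_model / h, so that W^Q, W^K, W^V (all heads stacked) and W^O are each d_model × d_model. The helper name `attention_shapes` is made up for this example.

```python
def attention_shapes(d_model: int, n_heads: int):
    """Return (per-head dimension, projection weights per layer)."""
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    d_head = d_model // n_heads
    # Q, K, V and O projections: four d_model x d_model matrices, no biases
    proj_params = 4 * d_model * d_model
    return d_head, proj_params

d_head, proj_params = attention_shapes(4096, 32)
print(d_head)        # 128
print(proj_params)   # 67108864
```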

Positional Encoding

Self-attention by itself has no notion of token order (it is permutation-equivariant: shuffling the input tokens simply shuffles the outputs), so positional information must be injected explicitly. Modern LLMs use Rotary Position Embeddings (RoPE) or ALiBi.

Rotary Position Embeddings (RoPE)

RoPE encodes position by rotating the query and key vectors in the attention computation:

RoPE Rotation

$$f(x, m) = \begin{pmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_{d-2} \\ x_{d-1} \end{pmatrix} \otimes \begin{pmatrix} \cos(m\theta_0) \\ \cos(m\theta_0) \\ \cos(m\theta_1) \\ \cos(m\theta_1) \\ \vdots \\ \cos(m\theta_{d/2-1}) \\ \cos(m\theta_{d/2-1}) \end{pmatrix} + \begin{pmatrix} -x_1 \\ x_0 \\ -x_3 \\ x_2 \\ \vdots \\ -x_{d-1} \\ x_{d-2} \end{pmatrix} \otimes \begin{pmatrix} \sin(m\theta_0) \\ \sin(m\theta_0) \\ \sin(m\theta_1) \\ \sin(m\theta_1) \\ \vdots \\ \sin(m\theta_{d/2-1}) \\ \sin(m\theta_{d/2-1}) \end{pmatrix}$$

Here,

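  • x = Input vector (a query or key)
  • m = Position index
  • θ_i = Frequency parameter = 10000^{-2i/d}
  • d = Dimension of x (in practice the per-head dimension, which must be even)
  • ⊗ = Element-wise multiplication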

A key property of RoPE is that the attention score between tokens at positions m and n depends only on the relative distance (m - n):

RoPE Relative Attention

$$\langle f(q, m), f(k, n) \rangle = g(q, k, m - n)$$

Here,

  • q, k = Query and key vectors
  • m, n = Absolute positions
  • g = Function of relative position (m - n)
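The rotation and its relative-position property can be checked numerically. The sketch below implements f(x, m) pair by pair, exactly as in the formula above, and compares two query/key placements with the same offset m - n = 3. The function names `rope` and `dot` and the sample vectors are illustrative.

```python
import math

def rope(x, m, base=10000.0):
    """Rotate each pair (x_2i, x_2i+1) by the angle m * theta_i."""
    d = len(x)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even dimension")
    out = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        c, s = math.cos(m * theta), math.sin(m * theta)
        a, b = x[2 * i], x[2 * i + 1]
        # Matches x * cos + (-x_1, x_0, ...) * sin from the formula
        out.extend([a * c - b * s, b * c + a * s])
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.2]

near = dot(rope(q, 5), rope(k, 2))      # positions 5 and 2
far = dot(rope(q, 103), rope(k, 100))   # positions 103 and 100
print(abs(near - far) < 1e-9)           # same offset, same score
```

Because each pair is only rotated, the vector norm is unchanged; position affects the attention score solely through the angle difference between query and key.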

ALiBi (Attention with Linear Biases)

ALiBi adds a linear bias to attention scores based on distance, without any learned parameters:

ALiBi Bias

$$\text{softmax}\left(\frac{q_i k_j^T}{\sqrt{d_k}} - m \cdot |i - j|\right)$$

Here,

  • m = Head-specific slope (geometric sequence)
  • i, j = Token positions
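To make the bias concrete, here is a small sketch of the slopes and the bias matrix. It assumes the number of heads is a power of two and uses the geometric sequence 2^(-8/h), 2^(-16/h), ... from the ALiBi paper; the function names are illustrative.

```python
def alibi_slopes(n_heads):
    """Geometric slopes; assumes n_heads is a power of two."""
    return [2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)]

def alibi_bias(slope, n):
    """Bias added to the score of query i and key j: -slope * |i - j|."""
    return [[-slope * abs(i - j) for j in range(n)] for i in range(n)]

slopes = alibi_slopes(8)
print(slopes[0], slopes[-1])   # 0.5 0.00390625

bias = alibi_bias(slopes[0], 3)
for row in bias:
    print(row)
```

Heads with a large slope penalize distant tokens heavily and focus on local context, while heads with a small slope can still attend far back. In a causal model only the entries with j ≤ i are used.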

RoPE is the dominant positional encoding in modern LLMs (LLaMA, Mistral, Qwen). ALiBi was popularized by BLOOM and is used in some models for its simplicity and extrapolation capabilities.

The KV Cache

In autoregressive generation, tokens are produced one at a time. Without optimization, every step re-runs the model over the whole prefix, recomputing keys, values, and attention for all T tokens seen so far, which costs O(T²) attention work per generation step.

Definition: KV Cache

The KV Cache stores the key and value tensors from previous tokens, avoiding redundant computation. At each generation step, we only compute Q, K, V for the new token and attend to the cached K, V from all previous tokens.
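The effect of caching can be illustrated with a toy single-head decoder. This is a sketch under simplifying assumptions: `project_kv` is a made-up stand-in for the learned K and V projections, the raw token vector doubles as the query, and a counter records how many K/V projections are computed.

```python
import math

def attend(q, keys, values):
    """Softmax attention output for one query over the given keys and values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    dim_v = len(values[0])
    return [sum(e * v[c] for e, v in zip(exps, values)) / total for c in range(dim_v)]

def project_kv(x, counter):
    """Toy stand-in for the K and V projections; counts how often it runs."""
    counter[0] += 1
    return [2.0 * xi for xi in x], [xi + 1.0 for xi in x]

def generate_without_cache(tokens):
    counter, outputs = [0], []
    for t in range(len(tokens)):
        # Recompute K and V for the entire prefix at every step
        pairs = [project_kv(x, counter) for x in tokens[: t + 1]]
        keys = [k for k, _ in pairs]
        values = [v for _, v in pairs]
        outputs.append(attend(tokens[t], keys, values))
    return outputs, counter[0]

def generate_with_cache(tokens):
    counter, outputs = [0], []
    cached_keys, cached_values = [], []
    for x in tokens:
        # Only the new token is projected; earlier K/V come from the cache
        k, v = project_kv(x, counter)
        cached_keys.append(k)
        cached_values.append(v)
        outputs.append(attend(x, cached_keys, cached_values))
    return outputs, counter[0]

tokens = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5], [-0.2, 0.4]]
slow_out, slow_calls = generate_without_cache(tokens)
fast_out, fast_calls = generate_with_cache(tokens)
print(slow_out == fast_out)    # True: identical outputs
print(slow_calls, fast_calls)  # 10 4
```

For T tokens the uncached loop performs T(T+1)/2 projections in total, while the cached loop performs T. The price is the memory needed to keep every past key and value.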

KV Cache Memory

$$\text{Memory} = 2 \times L \times n_{\text{layers}} \times d_{\text{model}} \times \text{batch\_size} \times \text{precision\_bytes}$$

Here,

  • L = Sequence length
  • n_layers = Number of transformer layers
  • d_model = Model dimension (this assumes one key/value head per query head, so K and V each have width d_model)
  • batch_size = Number of sequences processed together
  • precision_bytes = Bytes per stored value (2 for FP16)
  • 2 = For both K and V

KV Cache Example

For a 70B parameter model with 80 layers, d_model = 8192, and sequence length 4096, assuming standard multi-head attention:

  • KV cache size per token: 2 × 80 × 8192 × 2 bytes (FP16) = 2,621,440 bytes = 2.5 MiB (about 2.6 MB)
  • For batch size 1 and sequence length 4096: 4096 × 2.5 MiB = 10 GiB (about 10.7 GB)
  • This is why KV cache management is critical for LLM serving

Modern techniques like Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) reduce KV cache size by sharing key-value heads across attention groups. LLaMA 2 70B uses GQA with 8 KV heads (vs 64 query heads).
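The saving from GQA follows directly from the memory formula once d_model is split into (number of KV heads) × (head dimension). The sketch below uses the figures from this section (80 layers, 64 query heads, 8 KV heads, sequence length 4096, FP16) and a head dimension of 8192 / 64 = 128; the function and parameter names are illustrative.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   batch_size=1, bytes_per_value=2):
    """Bytes needed to cache K and V for every layer and position."""
    return (2 * seq_len * n_layers * n_kv_heads * head_dim
            * batch_size * bytes_per_value)

GIB = 2 ** 30

# One K/V head per query head (standard multi-head attention)
mha = kv_cache_bytes(seq_len=4096, n_layers=80, n_kv_heads=64, head_dim=128)
# Grouped-query attention with 8 shared K/V heads
gqa = kv_cache_bytes(seq_len=4096, n_layers=80, n_kv_heads=8, head_dim=128)

print(mha / GIB)   # 10.0
print(gqa / GIB)   # 1.25
print(mha // gqa)  # 8
```

With 8 KV heads instead of 64, the cache shrinks by a factor of 8, which leaves room for longer contexts or larger batches on the same GPU.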

Comparison: Encoder vs Decoder vs Encoder-Decoder

Encoder-Only (BERT)

  • Bidirectional attention (no masking)
  • Pre-trained with masked language modeling (MLM)
  • Best for: classification, NER, sentence similarity
  • NOT suitable for text generation

Decoder-Only (GPT)

  • Causal (masked) attention
  • Pre-trained with next-token prediction (CLM)
  • Best for: text generation, in-context learning, instruction following
  • Dominant architecture for LLMs

Encoder-Decoder (T5, BART)

  • Encoder processes input, decoder generates output
  • Cross-attention between encoder and decoder
  • Best for: translation, summarization, question answering
  • More parameters for same input/output capacity

Architecture Comparison

$$\text{BERT: } P(z|x) \quad \text{GPT: } P(x) = \prod_t P(x_t|x_{<t}) \quad \text{T5: } P(y|x) = \prod_t P(y_t|x, y_{<t})$$

Here,

  • x = Input sequence
  • y = Output sequence
  • z = Latent representation

Decoder-only models are preferred for LLMs because: (1) they use a single unified architecture for all tasks, (2) they scale more efficiently, and (3) in-context learning emerges naturally from the autoregressive objective.

Feed-Forward Network

Each Transformer layer includes a position-wise feed-forward network (FFN):

SwiGLU Feed-Forward

$$\text{FFN}(x) = \text{SwiGLU}(xW_1, W_3)W_2 = (\text{SiLU}(xW_1) \odot xW_3)W_2$$

Here,

  • W_1 = Gate projection (d_model → d_ff)
  • W_2 = Down projection (d_ff → d_model)
  • W_3 = Up projection (d_model → d_ff)
  • SiLU = Swish activation: x · σ(x)
  • ⊙ = Element-wise multiplication

Modern LLMs use SwiGLU (Shazeer, 2020) instead of ReLU/GELU, with d_ff = (8/3) × d_model (typically rounded to a multiple of 128 for hardware efficiency).
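The sizing rule can be worked through in a few lines. This sketch assumes the value is rounded up (rather than to the nearest multiple) and that the three projections carry no biases; `swiglu_hidden_dim` and `swiglu_param_count` are names invented for the example.

```python
def swiglu_hidden_dim(d_model, multiple_of=128):
    """(8/3) * d_model, rounded up to the next multiple of `multiple_of`."""
    raw = (8 * d_model) // 3
    return multiple_of * ((raw + multiple_of - 1) // multiple_of)

def swiglu_param_count(d_model, d_ff):
    """Weights in W_1, W_2 and W_3 (no biases)."""
    return 3 * d_model * d_ff

print(swiglu_hidden_dim(4096))         # 11008
print(swiglu_hidden_dim(512))          # 1408
print(swiglu_param_count(512, 1408))   # 2162688
```

The factor 8/3 keeps the parameter count close to that of a classic two-matrix FFN with d_ff = 4 × d_model, since 3 × (8/3) = 2 × 4.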

Transformer Forward Pass

The complete forward pass through a decoder-only Transformer:

$$h_0 = \text{Embed}(x) + \text{PosEnc}(x)$$

$$\text{for } l = 1 \ldots L: \quad h'_l = h_{l-1} + \text{MHA}(\text{LN}(h_{l-1})), \quad h_l = h'_l + \text{FFN}(\text{LN}(h'_l))$$

$$\text{logits} = \text{LM\_Head}(\text{LN}(h_L))$$

Here L is the number of layers and h'_l is the hidden state after the attention sub-layer, so each sub-layer has its own residual connection. With RoPE or ALiBi, position is injected inside the attention computation and the additive PosEnc term is omitted.

Practical Example: Building a Minimal Transformer

import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))
    
    def forward(self, x):
        rms = torch.sqrt(torch.mean(x ** 2, dim=-1, keepdim=True) + self.eps)
        return x / rms * self.weight

class SwiGLU(nn.Module):
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)
    
    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class TransformerBlock(nn.Module):
    def __init__(self, dim: int, n_heads: int, hidden_dim: int):
        super().__init__()
        self.attention_norm = RMSNorm(dim)
        self.ffn_norm = RMSNorm(dim)
        self.attention = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = SwiGLU(dim, hidden_dim)
    
    def forward(self, x, mask=None):
        # Self-attention with residual
        h = self.attention_norm(x)
        h, _ = self.attention(h, h, h, attn_mask=mask)
        x = x + h
        
        # FFN with residual
        h = self.ffn_norm(x)
        h = self.ffn(h)
        x = x + h
        return x

# Example usage
dim, n_heads, hidden_dim = 512, 8, 1408  # one block, roughly 3.2M parameters
block = TransformerBlock(dim, n_heads, hidden_dim)
x = torch.randn(2, 128, dim)  # (batch, seq_len, dim)
# Additive causal mask: -inf above the diagonal (future positions), 0 elsewhere
mask = torch.triu(torch.full((128, 128), float('-inf')), diagonal=1)
output = block(x, mask)
print(f"Output shape: {output.shape}")  # torch.Size([2, 128, 512])

For a comprehensive treatment of attention mechanisms, see our module on Attention Mechanisms Deep Dive.

Practice Exercises

  1. Architecture: Draw the block diagram of a single Transformer layer in a decoder-only model. Label all components and show the flow of information.

  2. Mathematical: For a model with 32 layers, d_model = 4096, and 32 attention heads, calculate the total number of parameters in the attention blocks (Q, K, V, O projections) and the FFN layers (assuming SwiGLU with d_ff = 11008).

  3. Implementation: Implement a simplified version of the KV cache for autoregressive generation. Show how it reduces computation from O(T²) to O(T) per generation step.

  4. Analysis: Compare the memory requirements of a 7B parameter model in FP16 vs INT4 quantization. How does this affect the maximum sequence length you can use with a given GPU memory?

Key Takeaways:

  • Modern LLMs use decoder-only Transformers with causal self-attention
  • Self-attention: Attention(Q, K, V) = softmax(QK^T / √d_k) V
  • RoPE encodes position via rotation, enabling relative position awareness
  • The KV cache reduces the cost of each autoregressive generation step from O(T²) to O(T)
  • SwiGLU FFN layers with RMSNorm are the modern standard
  • GQA reduces KV cache size while maintaining performance
