Transformer Architecture — Complete Guide

Why Transformers?

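Before the Transformer (2017), NLP relied mainly on RNNs and LSTMs, which process tokens one at a time. That makes training hard to parallelise across a sequence, and information from distant tokens tends to fade, so long-range dependencies are captured poorly. Transformers replace recurrence with attention, so all positions in a sequence can be processed in parallel during training.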

"Attention Is All You Need" — Vaswani et al., 2017

Self-Attention — The Core Mechanism

Self-attention lets each token look at every other token in the sequence simultaneously.

Step 1 — Create Q, K, V matrices:

$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$
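As a minimal sketch of Step 1, the projections are plain matrix products. The sizes and the random weights below are hypothetical, chosen only to make the shapes visible; in a trained model the three weight matrices are learned.

```python
import torch

torch.manual_seed(0)

# Hypothetical toy sizes, for illustration only.
seq_len, d_model, d_k = 6, 16, 8

X = torch.randn(seq_len, d_model)   # one row per token embedding

# Projection matrices (random here; learned in a real model).
W_Q = torch.randn(d_model, d_k)
W_K = torch.randn(d_model, d_k)
W_V = torch.randn(d_model, d_k)

Q = X @ W_Q   # queries: what each token is looking for
K = X @ W_K   # keys: what each token offers for matching
V = X @ W_V   # values: the content that gets mixed together

print(Q.shape, K.shape, V.shape)
```

Each of the three results has one row per token and $d_k$ columns.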

Step 2 — Compute attention scores:

$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$

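Where $d_k$ is the key dimension. Dividing by $\sqrt{d_k}$ keeps the dot products from growing with $d_k$; without it, large scores push the softmax into saturated regions where gradients become very small.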
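The formula can be sketched in a few lines of PyTorch. The function name and the `scale` flag are assumptions made for this illustration, and the inputs are random, so only the shapes and the qualitative effect of scaling are meaningful.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, scale=True):
    """Return (output, weights) for Q, K of shape (seq, d_k) and V of shape (seq, d_v)."""
    scores = Q @ K.transpose(-2, -1)            # (seq, seq) raw similarities
    if scale:
        scores = scores / math.sqrt(Q.size(-1)) # divide by sqrt(d_k)
    weights = F.softmax(scores, dim=-1)         # each row sums to 1
    return weights @ V, weights

torch.manual_seed(0)
seq_len, d_k = 10, 512
Q = torch.randn(seq_len, d_k)
K = torch.randn(seq_len, d_k)
V = torch.randn(seq_len, d_k)

out, w_scaled = scaled_dot_product_attention(Q, K, V, scale=True)
_, w_raw = scaled_dot_product_attention(Q, K, V, scale=False)

print(out.shape)
print("mean max weight, scaled:  ", w_scaled.max(dim=-1).values.mean().item())
print("mean max weight, unscaled:", w_raw.max(dim=-1).values.mean().item())
```

Without scaling, the largest weight in each row sits much closer to 1, so the softmax behaves almost like a hard argmax and passes very little gradient to the other positions.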

Intuition:

Token: "The cat sat on the mat"
              ↕   ↕   ↕  ↕   ↕
"cat" attends to: "sat" (subject-verb), "mat" (context), "The" (article)
Each token collects information weighted by relevance to itself.

Multi-Head Attention

Run $h$ attention heads in parallel, each learning different relationships:

$\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O$

$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$

Head	May learn (illustrative)
Head 1	Syntactic structure (subject-verb)
Head 2	Coreference (pronoun → noun)
Head 3	Long-range dependencies
Head h	Semantic similarity
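To see what individual heads attend to, the per-head attention weights can be inspected. The sketch below uses PyTorch's built-in `nn.MultiheadAttention` rather than a hand-written layer. The sizes are hypothetical, and the weights are untrained, so the patterns carry no linguistic meaning here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_heads = 64, 4   # hypothetical toy sizes; d_k = 64 / 4 = 16 per head

mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)
mha.eval()

x = torch.randn(2, 5, d_model)   # (batch, seq, d_model)

with torch.no_grad():
    # average_attn_weights=False keeps one weight matrix per head
    out, weights = mha(x, x, x, need_weights=True, average_attn_weights=False)

print(out.shape)      # (batch, seq, d_model)
print(weights.shape)  # (batch, heads, seq, seq)
```

In a trained model, plotting `weights[0, i]` for each head `i` is one common way to check whether heads have specialised along lines like those in the table.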

Positional Encoding

Self-attention on its own ignores token order, so the model has no built-in sense of position. Positional encoding injects position information:

$PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$

$PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$

Input to the first layer = token embedding + positional encoding (added element-wise; both have dimension $d_{model}$).
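A minimal sketch of the sinusoidal encoding, assuming an even $d_{model}$. The function name and the toy sizes are illustrative choices, and the random `emb` tensor stands in for real token embeddings.

```python
import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) tensor of sinusoidal encodings (d_model assumed even)."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)        # 0, 2, 4, ...
    div = torch.exp(-math.log(10000.0) * two_i / d_model)           # 1 / 10000^(2i/d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)

emb = torch.randn(2, 20, 16)   # stand-in token embeddings: (batch, seq, d_model)
x = emb + pe[:20]              # positions broadcast across the batch
print(pe.shape, x.shape)
```

The encoding is computed once and sliced to the current sequence length, so it adds no learned parameters.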

Full Architecture

INPUT SEQUENCE
      ↓
[Embedding + Positional Encoding]
      ↓
┌─────────────────────────────┐
│       ENCODER (×N)          │
│  ┌──────────────────────┐   │
│  │  Multi-Head           │   │
│  │  Self-Attention       │   │
│  └──────────────────────┘   │
│           +                 │
│  [Add & Layer Norm]         │
│           ↓                 │
│  ┌──────────────────────┐   │
│  │  Feed-Forward        │   │
│  │  (2-layer MLP)       │   │
│  └──────────────────────┘   │
│           +                 │
│  [Add & Layer Norm]         │
└─────────────────────────────┘
      ↓ Context (memory)
┌─────────────────────────────┐
│       DECODER (×N)          │
│  Masked Self-Attention      │
│  + Cross-Attention          │
│  + Feed-Forward             │
└─────────────────────────────┘
      ↓
[Linear + Softmax]
      ↓
OUTPUT PROBABILITIES
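The data flow in the diagram can be traced end to end with PyTorch's built-in `nn.Transformer`. This is only a shape-level sketch: the vocabulary size and layer sizes are hypothetical, the weights are untrained, and positional encodings are left out to keep it short.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, d_model, n_heads, n_layers = 100, 32, 4, 2   # hypothetical toy sizes

embed = nn.Embedding(vocab_size, d_model)
core = nn.Transformer(d_model=d_model, nhead=n_heads,
                      num_encoder_layers=n_layers, num_decoder_layers=n_layers,
                      dim_feedforward=64, dropout=0.0, batch_first=True)
to_vocab = nn.Linear(d_model, vocab_size)

src = torch.randint(0, vocab_size, (2, 7))   # encoder input token ids
tgt = torch.randint(0, vocab_size, (2, 5))   # decoder input token ids

# Causal mask for the decoder: -inf above the diagonal blocks future positions.
T = tgt.size(1)
tgt_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)

with torch.no_grad():
    hidden = core(embed(src), embed(tgt), tgt_mask=tgt_mask)   # (batch, tgt_len, d_model)
    probs = torch.softmax(to_vocab(hidden), dim=-1)            # (batch, tgt_len, vocab)

print(hidden.shape, probs.shape)
```

Each decoder position ends up with a probability distribution over the vocabulary, which is the final box in the diagram.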

PyTorch Implementation

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_k    = d_model // n_heads
        self.n_heads = n_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def split_heads(self, x, batch_size):
        # (B, seq, d_model) → (B, heads, seq, d_k)
        x = x.view(batch_size, -1, self.n_heads, self.d_k)
        return x.transpose(1, 2)

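    def forward(self, Q, K, V, mask=None):
        # Q, K, V: (B, seq, d_model). mask (optional) must broadcast to
        # (B, heads, seq_q, seq_k); entries equal to 0 are hidden.
        B = Q.size(0)
        # Project, then split into heads: (B, heads, seq, d_k)
        Q = self.split_heads(self.W_q(Q), B)
        K = self.split_heads(self.W_k(K), B)
        V = self.split_heads(self.W_v(V), B)

        # Scaled dot-product scores: (B, heads, seq_q, seq_k)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)
        if mask is not None:
            # Use the dtype's minimum instead of -1e9, which overflows in fp16
            scores = scores.masked_fill(mask == 0, torch.finfo(scores.dtype).min)
        attn   = F.softmax(scores, dim=-1)
        # Merge heads back: (B, heads, seq, d_k) → (B, seq, d_model)
        out    = (attn @ V).transpose(1, 2).contiguous()
        out    = out.view(B, -1, self.n_heads * self.d_k)
        return self.W_o(out)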

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attn  = MultiHeadAttention(d_model, n_heads)
        self.ff    = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop  = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention + residual
        x = self.norm1(x + self.drop(self.attn(x, x, x, mask)))
        # Feed-forward + residual
        x = self.norm2(x + self.drop(self.ff(x)))
        return x

# Quick test
model = TransformerBlock(d_model=512, n_heads=8, d_ff=2048)
x = torch.randn(2, 20, 512)   # (batch=2, seq_len=20, d_model=512)
out = model(x)
print(f"Input:  {x.shape}")
print(f"Output: {out.shape}")
# Output: torch.Size([2, 20, 512])

# Count parameters
n_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {n_params:,}")
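The parameter count can be checked by hand. The sketch below assumes the block layout used above: four `d_model × d_model` linear layers with biases in attention, a two-layer feed-forward network with biases, and two LayerNorms with a scale and a shift each. The helper names are made up for this illustration.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

def transformer_block_params(d_model, d_ff):
    attention = 4 * linear_params(d_model, d_model)   # W_q, W_k, W_v, W_o
    feed_forward = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
    layer_norms = 2 * (2 * d_model)                   # scale and shift, twice
    return attention + feed_forward + layer_norms

total = transformer_block_params(d_model=512, d_ff=2048)
print(f"{total:,}")   # 3,152,384
```

The number of heads does not appear in the formula: splitting `d_model` into heads changes how the projections are reshaped, not how many weights they hold.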

BERT vs GPT — Key Differences

Feature	BERT	GPT
Direction	Bidirectional	Left-to-right (causal)
Pre-training	Masked LM + NSP	Causal LM
Adaptation	Fine-tuning with task-specific heads	Prompting / few-shot (fine-tuning also possible)
Best for	Classification, NER, QA	Text generation
Attention mask	Full (all tokens see all)	Causal (no future peeking)
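The attention-mask row is the easiest difference to show in code. In this sketch the scores are random and the helper `masked_softmax` is a name invented for the example; the point is only which positions end up with zero weight.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len = 5
scores = torch.randn(seq_len, seq_len)   # stand-in attention scores

# True = position is visible to the query in that row.
full_mask = torch.ones(seq_len, seq_len, dtype=torch.bool)                # BERT-style
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # GPT-style

def masked_softmax(scores, mask):
    """Softmax over the last dimension with hidden positions set to -inf."""
    return F.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)

bert_style = masked_softmax(scores, full_mask)
gpt_style = masked_softmax(scores, causal_mask)

print(bert_style)
print(gpt_style)   # zeros above the diagonal: no attention to future tokens
```

With the causal mask, row `t` only spreads weight over positions `0..t`, so the first token can attend only to itself.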

Key Takeaways

Self-attention lets every token attend to every other — full context at once
Scaling by $\sqrt{d_k}$ prevents softmax saturation with large dot products
Multi-head attention learns multiple relationship types simultaneously
Residual connections + LayerNorm make deep Transformers trainable
BERT = encoder-only (understanding); GPT = decoder-only (generation)