Why Transformers?
Before the Transformer was introduced in 2017, NLP relied on RNNs and LSTMs, which process tokens one at a time. That made training slow to parallelise and made long-range dependencies hard to capture. Transformers replaced recurrence with attention, so every position in a sequence can be processed in parallel.
"Attention Is All You Need" — Vaswani et al., 2017
Self-Attention — The Core Mechanism
Self-attention lets each token look at every other token in the sequence simultaneously.
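Step 1: Create the Q, K, V matrices by projecting the input embeddings $X$ with learned weight matrices:
$$Q = XW_q, \quad K = XW_k, \quad V = XW_v$$
Step 2: Compute attention scores and use them to weight the values:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
where $d_k$ is the key dimension. Dividing by $\sqrt{d_k}$ keeps large dot products from saturating the softmax, which would otherwise produce vanishingly small gradients.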
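The two steps above can be sketched in plain Python so that each operation is visible. This is a minimal illustration rather than an optimised implementation: the 3-token input `X` and the identity projection weights are made-up values chosen for readability, and the helper names (`matmul`, `transpose`, `softmax`, `attention`) are hypothetical.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention for one sequence (single head, no batch)."""
    # Step 1: project the inputs into queries, keys and values.
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(K[0])
    # Step 2: scaled scores, row-wise softmax, weighted sum of values.
    scores = matmul(Q, transpose(K))  # shape (seq, seq)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights

# Toy input: 3 tokens with d_model = d_k = 2 (assumed values, identity projections).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
out, weights = attention(X, I2, I2, I2)

for row in weights:
    print([round(w, 3) for w in row])  # each row sums to 1
```

Each row of `weights` says how much one token draws from every token in the sequence, and `out` holds the resulting weighted mixtures of value vectors.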
Intuition:
Token: "The cat sat on the mat"
↕ ↕ ↕ ↕ ↕
"cat" attends to: "sat" (subject-verb), "mat" (context), "The" (article)
Each token collects information weighted by relevance to itself.
Multi-Head Attention
Run $h$ attention heads in parallel, each able to learn a different kind of relationship. The roles below are illustrative:
| Head | Learns |
|---|---|
| Head 1 | Syntactic structure (subject-verb) |
| Head 2 | Coreference (pronoun → noun) |
| Head 3 | Long-range dependencies |
| Head h | Semantic similarity |
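To make the parallel-heads idea concrete, here is a small plain-Python sketch of the bookkeeping: each token's `d_model`-sized vector is split into `h` chunks of size `d_model // h`, attention runs independently on each chunk, and the per-head outputs are concatenated back. The learned projections are deliberately omitted (treated as identity), the input values are made up, and the function names are hypothetical.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(Q, K, V):
    """Scaled dot-product attention for one head; inputs are lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) for j in range(len(V[0]))])
    return out

def split_heads(X, h):
    """(seq, d_model) -> list of h matrices, each (seq, d_model // h)."""
    d_k = len(X[0]) // h
    return [[row[i * d_k:(i + 1) * d_k] for row in X] for i in range(h)]

def multi_head(X, h):
    # Learned projections W_q, W_k, W_v, W_o are omitted here (assumed identity).
    heads = [attend(Xh, Xh, Xh) for Xh in split_heads(X, h)]
    # Concatenate the per-head outputs back to (seq, d_model).
    return [sum((head[t] for head in heads), []) for t in range(len(X))]

X = [[1.0, 0.0, 0.5, 0.5],
     [0.0, 1.0, 0.5, 0.5],
     [1.0, 1.0, 0.0, 1.0]]  # 3 tokens, d_model = 4 (made-up values)
out = multi_head(X, h=2)
print(len(out), len(out[0]))  # 3 4
```

Because each head sees only its own slice of the features, the heads compute different attention patterns over the same tokens.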
Positional Encoding
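Self-attention has no built-in sense of token order, so positional encoding injects position information. The original paper uses fixed sinusoids:
$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \quad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
Token embedding = word embedding + positional encoding.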
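One common choice, used in the original paper, is a fixed sinusoidal encoding: even dimensions take a sine and odd dimensions a cosine of the position divided by a dimension-dependent wavelength. The sketch below is a minimal plain-Python version; the sequence length, model width, and constant word embeddings are assumptions for illustration only.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd dimensions."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(seq_len=4, d_model=6)

# Made-up word embeddings for 4 tokens; the model input is the element-wise sum.
word_emb = [[0.1] * 6 for _ in range(4)]
token_emb = [[w + p for w, p in zip(w_row, p_row)]
             for w_row, p_row in zip(word_emb, pe)]

print([round(v, 3) for v in pe[0]])  # position 0: [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

Identical word embeddings at different positions now produce different token embeddings, which is what lets attention distinguish word order.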
Full Architecture
        INPUT SEQUENCE
               ↓
[Embedding + Positional Encoding]
               ↓
┌─────────────────────────────┐
│        ENCODER (×N)         │
│  ┌──────────────────────┐   │
│  │      Multi-Head      │   │
│  │    Self-Attention    │   │
│  └──────────────────────┘   │
│              +              │
│     [Add & Layer Norm]      │
│              ↓              │
│  ┌──────────────────────┐   │
│  │     Feed-Forward     │   │
│  │    (2-layer MLP)     │   │
│  └──────────────────────┘   │
│              +              │
│     [Add & Layer Norm]      │
└─────────────────────────────┘
               ↓ Context (memory)
┌─────────────────────────────┐
│        DECODER (×N)         │
│    Masked Self-Attention    │
│    + Cross-Attention        │
│    + Feed-Forward           │
└─────────────────────────────┘
               ↓
      [Linear + Softmax]
               ↓
     OUTPUT PROBABILITIES
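Each "[Add & Layer Norm]" box in the diagram wraps a sublayer as LayerNorm(x + Sublayer(x)). The following is a minimal plain-Python sketch for a single token vector. It leaves out LayerNorm's learnable gain and bias, and `sublayer` is a hypothetical stand-in for attention or the feed-forward network.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalise one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer):
    """Residual connection followed by layer normalisation (post-norm, as in the diagram)."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def sublayer(x):
    # Hypothetical stand-in for attention or the feed-forward network.
    return [2.0 * v for v in x]

x = [1.0, 2.0, 3.0, 4.0]
y = add_and_norm(x, sublayer)
print([round(v, 3) for v in y])
```

The residual path gives gradients a direct route around the sublayer, and the normalisation keeps activations on a consistent scale from layer to layer.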
PyTorch Implementation
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.d_k = d_model // n_heads  # per-head dimension
        self.n_heads = n_heads
        # Learned projections for queries, keys, values, and the output
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def split_heads(self, x, batch_size):
        # (B, seq, d_model) → (B, heads, seq, d_k)
        x = x.view(batch_size, -1, self.n_heads, self.d_k)
        return x.transpose(1, 2)

    def forward(self, Q, K, V, mask=None):
        B = Q.size(0)
        Q = self.split_heads(self.W_q(Q), B)
        K = self.split_heads(self.W_k(K), B)
        V = self.split_heads(self.W_v(V), B)
        # Scaled dot-product scores: (B, heads, seq_q, seq_k)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)
        if mask is not None:
            # mask must broadcast to the scores shape; positions where
            # mask == 0 get a large negative score (near-zero weight)
            scores = scores.masked_fill(mask == 0, -1e9)
        attn = F.softmax(scores, dim=-1)
        # Merge heads back: (B, heads, seq, d_k) → (B, seq, d_model)
        out = (attn @ V).transpose(1, 2).contiguous()
        out = out.view(B, -1, self.n_heads * self.d_k)
        return self.W_o(out)

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, n_heads)
        # Position-wise feed-forward network (2-layer MLP)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(),
            nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention + residual, then LayerNorm (post-norm)
        x = self.norm1(x + self.drop(self.attn(x, x, x, mask)))
        # Feed-forward + residual, then LayerNorm
        x = self.norm2(x + self.drop(self.ff(x)))
        return x

# Quick test
model = TransformerBlock(d_model=512, n_heads=8, d_ff=2048)
x = torch.randn(2, 20, 512) # (batch=2, seq_len=20, d_model=512)
out = model(x)
print(f"Input: {x.shape}")
print(f"Output: {out.shape}")
# Output: torch.Size([2, 20, 512])
# Count parameters
n_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {n_params:,}")
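As a sanity check on the printed parameter count, the total can be worked out by hand from the layer shapes. The sketch below assumes the block is exactly as defined above: four `d_model × d_model` attention projections with biases, a two-layer feed-forward network with biases, and two LayerNorms with a gain and a bias each. The helper `linear_params` is a hypothetical name.

```python
d_model, d_ff = 512, 2048

def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

attention = 4 * linear_params(d_model, d_model)  # W_q, W_k, W_v, W_o
feed_forward = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
layer_norms = 2 * (2 * d_model)                  # gain + bias, for two norms

total = attention + feed_forward + layer_norms
print(f"{total:,}")  # 3,152,384
```

The number of heads does not appear in the calculation: splitting `d_model` across heads changes how the projections are used, not how many weights they contain.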
BERT vs GPT — Key Differences
| Feature | BERT | GPT |
|---|---|---|
| Direction | Bidirectional | Left-to-right (causal) |
| Pre-training | Masked LM + NSP | Causal LM |
| Fine-tuning | Task-specific heads | Prompt-based / few-shot |
| Best for | Classification, NER, QA | Text generation |
| Attention mask | Full (all tokens see all) | Causal (no future peeking) |
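The attention-mask row is the most mechanical of these differences, so a small sketch may help. It builds a full mask and a causal (lower-triangular) mask for a 4-token sequence and applies each to uniform raw scores. The convention that 1 means "may attend" and 0 means "blocked" is an assumption chosen to match the `mask == 0` check in the implementation above, and `masked_softmax` is a hypothetical helper.

```python
import math

def masked_softmax(scores, mask_row):
    """Softmax over one row of scores, ignoring positions where the mask is 0."""
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask_row)]
    top = max(masked)
    exps = [math.exp(s - top) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

seq_len = 4
# BERT-style: every token may attend to every token.
full_mask = [[1] * seq_len for _ in range(seq_len)]
# GPT-style: token i may attend only to positions j <= i.
causal_mask = [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]

scores = [0.0] * seq_len  # uniform raw scores, so only the mask shapes the result
for name, mask in [("full", full_mask), ("causal", causal_mask)]:
    print(name)
    for i in range(seq_len):
        print("  ", [round(w, 2) for w in masked_softmax(scores, mask[i])])
```

Under the causal mask the first token can only attend to itself, while the last token sees the whole prefix. This is what allows a decoder-only model to be trained on next-token prediction without seeing the answer.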
Key Takeaways
- Self-attention lets every token attend to every other — full context at once
- Scaling by $\sqrt{d_k}$ prevents softmax saturation with large dot products
- Multi-head attention learns multiple relationship types simultaneously
- Residual connections + LayerNorm make deep Transformers trainable
- BERT = encoder-only (understanding); GPT = decoder-only (generation)
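The scaling takeaway can be checked numerically. The sketch below draws one random query and a handful of random keys with unit-variance components (an assumption about how the vectors are distributed), then compares the softmax over raw dot products with the softmax over dot products divided by the square root of the key dimension. The sizes and the seed are arbitrary.

```python
import math
import random

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
d_k = 512
n_keys = 8

# One query and several keys with unit-variance components (assumed distribution).
q = [random.gauss(0, 1) for _ in range(d_k)]
keys = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(n_keys)]

raw = [sum(a * b for a, b in zip(q, k)) for k in keys]  # spread grows like sqrt(d_k)
scaled = [s / math.sqrt(d_k) for s in raw]              # spread back to about 1

print("max weight, unscaled:", round(max(softmax(raw)), 4))
print("max weight, scaled:  ", round(max(softmax(scaled)), 4))
```

Without scaling, one key tends to take almost all of the weight, and the softmax gradient with respect to the other scores becomes very small. Dividing by the square root of `d_k` keeps the distribution softer.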