Transformers — Attention Is All You Need
Transformers replaced RNNs as the dominant architecture for sequence processing. They power GPT, BERT, and nearly all modern LLMs.
Self-Attention
The core innovation:
Each token can attend to every other token in the sequence simultaneously (decoder models such as GPT mask out future tokens)
Query (Q): What am I looking for?
Key (K): What do I contain?
Value (V): What information do I pass on?
Attention(Q,K,V) = softmax(QK^T / √d_k) V
Intuition:
Token "cat" queries: "What animals am I with?"
Token "sat" queries: "What is sitting?"
Token "on" queries: "What is on what?"
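The attention formula above can be sketched directly in PyTorch. This is a minimal illustration under stated assumptions, not an optimized implementation: the function name `attention_sketch`, the tensor shapes, and the optional boolean `mask` convention are all choices made for this example, and the learned projections that normally produce Q, K, and V from the input are omitted for brevity.

```python
import math
import torch

def attention_sketch(q, k, v, mask=None):
    """Compute softmax(Q K^T / sqrt(d_k)) V for tensors shaped (..., seq_len, d_k)."""
    d_k = q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # Assumed convention: True marks positions a query may NOT attend to
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ v, weights

torch.manual_seed(0)
x = torch.randn(3, 4)  # 3 tokens, d_k = 4; Q = K = V = x (projections omitted)
out, weights = attention_sketch(x, x, x)
print(out.shape)            # one output vector per token
print(weights.sum(dim=-1))  # attention weights for each token sum to 1
```

Passing an upper-triangular mask such as `torch.triu(torch.ones(3, 3, dtype=torch.bool), diagonal=1)` restricts each token to itself and earlier tokens, which is the behavior decoder models rely on.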
Multi-Head Attention
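Several heads (8 in the original Transformer) attend in parallel, and each can learn to focus on a different aspect. The roles below are illustrative; learned heads rarely specialize this cleanly: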
Head 1: Syntax (subject-verb)
Head 2: Semantics (animal-action)
Head 3: Coreference (pronouns)
...
Output = Concat(head₁, ..., head₈) × W^O
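The split-attend-concatenate pattern above can be sketched as follows. This is an illustrative sketch, not the code behind `nn.MultiheadAttention`: the class name `MultiHeadSelfAttentionSketch`, the fused `qkv` projection, and the small sizes in the usage lines are assumptions made for the example.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttentionSketch(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must divide evenly across heads"
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)  # Q, K, V projections fused into one layer
        self.w_o = nn.Linear(d_model, d_model)      # output projection W^O

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split(z):
            # (batch, seq, d_model) -> (batch, n_heads, seq, d_head)
            return z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        # Each head runs scaled dot-product attention independently
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        heads = torch.softmax(scores, dim=-1) @ v
        # Concat(head_1, ..., head_h): merge heads back into d_model
        concat = heads.transpose(1, 2).reshape(b, t, d)
        return self.w_o(concat)

mha = MultiHeadSelfAttentionSketch(d_model=64, n_heads=8)
y = mha(torch.randn(2, 5, 64))
print(y.shape)  # torch.Size([2, 5, 64])
```

Each head works in a `d_model / n_heads` dimensional subspace, so the total cost is similar to a single full-width head while allowing several attention patterns at once.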
Transformer Block
┌─────────────────────────────────┐
│   Input Embedding + Position    │
│                │                │
│                ▼                │
│  ┌─────────────────────────┐    │
│  │  Multi-Head Attention   │    │
│  │  + Add & Norm           │    │
│  └─────────────┬───────────┘    │
│                │                │
│                ▼                │
│  ┌─────────────────────────┐    │
│  │  Feed-Forward Network   │    │
│  │  + Add & Norm           │    │
│  └─────────────┬───────────┘    │
│                │                │
│                ▼                │
│        Output Embeddings        │
└─────────────────────────────────┘
Stacked N times:
BERT-base: 12 layers
GPT-3 (175B): 96 layers
GPT-4: not publicly disclosed (~120 layers is an unofficial estimate)
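Stacking can be sketched with PyTorch's built-in encoder modules. The sizes here (d_model=64, 4 layers) are assumptions chosen to keep the example small; a BERT-base-sized stack would use d_model=768, 12 heads, and 12 layers.

```python
import torch
import torch.nn as nn

# One block: multi-head self-attention + feed-forward, each with Add & Norm
layer = nn.TransformerEncoderLayer(
    d_model=64,
    nhead=8,
    dim_feedforward=256,  # 4x expansion in the feed-forward network
    activation="gelu",
    batch_first=True,     # inputs are (batch, seq_len, d_model)
)
# The same block design repeated N = 4 times, each copy with its own weights
encoder = nn.TransformerEncoder(layer, num_layers=4)

x = torch.randn(2, 10, 64)
y = encoder(x)
print(y.shape)  # the stack preserves the input shape
```

Because every block maps a `(batch, seq_len, d_model)` tensor to a tensor of the same shape, depth can be changed without touching the rest of the model.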
PyTorch Implementation
import torch.nn as nn


class TransformerBlock(nn.Module):
    """Post-norm encoder block: self-attention, then a feed-forward network."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        # Position-wise feed-forward network with a 4x hidden expansion
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.GELU(),
            nn.Linear(d_model * 4, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        # Self-attention (Q = K = V = x), then residual connection and LayerNorm
        attn_out, _ = self.attention(x, x, x)
        x = self.norm1(x + attn_out)
        # Feed-forward, then residual connection and LayerNorm
        ffn_out = self.ffn(x)
        x = self.norm2(x + ffn_out)
        return x
Key Takeaways
- Self-attention enables parallel processing of sequences
- Multi-head attention captures different types of relationships
- Positional encoding adds sequence order information
- Transformers have replaced RNNs for most sequence tasks
- Encoder-only models (BERT) suit understanding tasks; decoder-only models (GPT) suit generation
- Because all positions are processed at once, transformers train efficiently on GPUs
- Scaling data and parameters together tends to improve performance
- Transformers power nearly all modern LLMs (GPT, Claude, Gemini)
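The takeaway about positional encoding can be made concrete with the fixed sinusoidal scheme from the original Transformer paper. This is a sketch of one option only: many models learn their position embeddings instead, and the function name `sinusoidal_positions` and the sizes used below are assumptions for the example.

```python
import math
import torch

def sinusoidal_positions(seq_len, d_model):
    """Return a (seq_len, d_model) table of position encodings; d_model assumed even."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    even_dims = torch.arange(0, d_model, 2, dtype=torch.float32)   # 0, 2, 4, ...
    # Frequencies 1 / 10000^(even_dim / d_model), computed in log space
    freq = torch.exp(-math.log(10000.0) * even_dims / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * freq)  # even dimensions use sine
    pe[:, 1::2] = torch.cos(pos * freq)  # odd dimensions use cosine
    return pe

pe = sinusoidal_positions(seq_len=10, d_model=16)
embeddings = torch.randn(10, 16)  # stand-in for token embeddings
x = embeddings + pe               # order information is added before the first block
print(pe.shape)
```

Without this step, self-attention would treat the input as an unordered set, since the attention formula itself has no notion of position.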