Transformers — Attention Is All You Need

Transformers replaced RNNs as the dominant architecture for sequence processing. They power GPT, BERT, and all modern LLMs.

Self-Attention

The core innovation:

Each token can attend to EVERY other token simultaneously

Query (Q): What am I looking for?
Key (K):   What do I contain?
Value (V): What information do I pass on?

Attention(Q,K,V) = softmax(QK^T / √d_k) V

Intuition:
Token "cat" queries: "What animals am I with?"
Token "sat" queries: "What is sitting?"
Token "on" queries: "What is on what?"

Multi-Head Attention

8 heads attending to different aspects:

Head 1: Syntax (subject-verb)
Head 2: Semantics (animal-action)
Head 3: Coreference (pronouns)
...

Output = Concat(head₁, ..., head₈) × W^O

Transformer Block

┌─────────────────────────────────┐
│  Input Embedding + Position     │
│       │                         │
│       ▼                         │
│  ┌─────────────────────────┐   │
│  │  Multi-Head Attention   │   │
│  │  + Add & Norm           │   │
│  └─────────────┬───────────┘   │
│                │                │
│                ▼                │
│  ┌─────────────────────────┐   │
│  │  Feed-Forward Network   │   │
│  │  + Add & Norm           │   │
│  └─────────────┬───────────┘   │
│                │                │
│                ▼                │
│  Output Embeddings             │
└─────────────────────────────────┘

Stacked N times:
BERT-base: 12 layers
GPT-4: ~120 layers

PyTorch Implementation

import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.GELU(),
            nn.Linear(d_model * 4, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Self-attention
        attn_out, _ = self.attention(x, x, x)
        x = self.norm1(x + attn_out)
        # Feed-forward
        ffn_out = self.ffn(x)
        x = self.norm2(x + ffn_out)
        return x

Key Takeaways

Self-attention enables parallel processing of sequences
Multi-head attention captures different types of relationships
Positional encoding adds sequence order information
Transformers replace RNNs for most sequence tasks
Encoder for understanding (BERT), Decoder for generation (GPT)
Transformers are highly parallelizable — fast on GPUs
Scaling (more data + parameters) improves performance
Transformers power all modern LLMs (GPT, Claude, Gemini)

Transformers — Attention Is All You Need Complete Guide

Transformers — Attention Is All You Need

Self-Attention

Multi-Head Attention

Transformer Block

PyTorch Implementation

Key Takeaways

Need Expert Machine Learning Help?