Attention Mechanisms — Deep Dive
Attention allows models to focus on relevant parts of the input when producing output.
Types of Attention
Self-Attention:
├─ Input attends to itself
├─ Each token relates to all other tokens
├─ Captures internal relationships
└─ Used in: Encoder, decoder (with mask)
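
The points above can be sketched as scaled dot-product self-attention. This is a minimal, dependency-free illustration and not a production implementation: it assumes Q = K = V = the input (a real layer applies learned projections first), and the names `softmax`, `self_attention`, and `tokens` are made up for the example.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Self-attention with Q = K = V = x (assumption: no learned weights)."""
    d = len(x[0])
    out, weights = [], []
    for q in x:
        # Score this token against every token in the same sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of all token vectors.
        out.append([sum(wj * v[c] for wj, v in zip(w, x)) for c in range(d)])
    return out, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = self_attention(tokens)
```

Each row of `weights` sums to 1 and says how much one token draws from every token in the sequence, itself included.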
Cross-Attention:
├─ Output attends to input
├─ Decoder queries encoder output
├─ Links input and output
└─ Used in: Encoder-Decoder models
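
A hedged sketch of cross-attention: queries come from the decoder, while keys and values come from the encoder output. As an assumption for brevity, the raw states are used directly in place of learned projections; `cross_attention`, `encoder_states`, and `decoder_states` are illustrative names.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_states):
    """Queries from the decoder; keys and values from the encoder (no learned weights)."""
    d = len(encoder_states[0])
    out = []
    for q in decoder_states:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in encoder_states]
        w = softmax(scores)
        out.append([sum(wj * v[c] for wj, v in zip(w, encoder_states)) for c in range(d)])
    return out

encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]  # 4 source tokens
decoder_states = [[1.0, 0.0], [0.0, 2.0]]                          # 2 target tokens
context = cross_attention(decoder_states, encoder_states)
```

The result has one row per decoder token, and each row is a weighted mix of encoder vectors, so the two sequences may differ in length.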
Causal (Masked) Attention:
├─ Each token can attend only to itself and earlier tokens
├─ Prevents looking ahead
└─ Used in: GPT, autoregressive models
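
Causal masking is commonly implemented by setting the scores of future positions to negative infinity before the softmax, so they receive zero weight. A minimal sketch under that assumption follows; the function name and the all-zero `scores` matrix are illustrative.

```python
import math

def causal_attention_weights(scores):
    """Mask an n x n score matrix so position i sees positions 0..i, then softmax each row."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Future positions (j > i) get -inf, which becomes weight 0 after softmax.
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0] * 4 for _ in range(4)]  # uniform scores make the mask easy to see
weights = causal_attention_weights(scores)
```

With uniform scores, row `i` spreads its weight evenly over positions `0..i` and assigns zero to everything after `i`.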
Multi-Head Attention
Why multiple heads?
Head 1: "What is the subject?"
Head 2: "What is the verb?"
Head 3: "What describes the noun?"
Head 4: "What is the temporal context?"
Each head can learn a different relationship type. The roles above are illustrative; learned heads are rarely this cleanly separated.
Output = Concat(heads) × W^O
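
The formula above can be sketched as follows. As a simplifying assumption, each head works on its own slice of the feature dimension, standing in for the learned per-head projections W^Q, W^K, and W^V; `w_o` plays the role of W^O. All function and variable names are illustrative.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention over lists of vectors."""
    d = len(q[0])
    out = []
    for qi in q:
        w = softmax([sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k])
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) for c in range(len(v[0]))])
    return out

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def multi_head_attention(x, num_heads, w_o):
    d_head = len(x[0]) // num_heads
    heads = []
    for h in range(num_heads):
        # Assumption: slicing features replaces learned per-head projections.
        xs = [row[h * d_head:(h + 1) * d_head] for row in x]
        heads.append(attention(xs, xs, xs))
    # Concat(heads): join the per-head outputs for each token.
    concat = [sum((head[t] for head in heads), []) for t in range(len(x))]
    return matmul(concat, w_o)  # Concat(heads) x W^O

x = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
out = multi_head_attention(x, num_heads=2, w_o=identity)
```

Because every head runs its own softmax, two heads can weight the same pair of tokens very differently, which is what lets them capture different relationships.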
Efficient Attention
Problem: Standard attention costs O(n²) time and memory in sequence length n
Solutions:
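├─ Sparse attention: Each token attends to a subset of tokens
├─ Linear attention: Approximates attention at O(n) cost
├─ Flash Attention: Exact attention, tiled to reduce GPU memory traffic (compute is still O(n²))
├─ Sliding window: Each token attends to a fixed-size local window
└─ Grouped Query Attention: Groups of query heads share K/V heads, shrinking the KV cache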
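
To make the savings concrete, here is a small sketch of one option from the list, sliding-window attention in its causal form. It only builds the mask and counts how many query-key pairs get scored; the function names and the values of `n` and `window` are illustrative assumptions.

```python
def sliding_window_mask(n, window):
    """mask[i][j] is True when token i may attend to token j.

    Causal variant: each token sees itself and the previous window - 1 tokens.
    """
    return [[i - window < j <= i for j in range(n)] for i in range(n)]

def count_scored_pairs(mask):
    """Number of (query, key) pairs that actually need a score."""
    return sum(sum(row) for row in mask)

n, window = 8, 3
mask = sliding_window_mask(n, window)
full_pairs = n * n                       # standard attention scores every pair
local_pairs = count_scored_pairs(mask)   # grows roughly as n * window
```

The number of scored pairs grows roughly as n × window instead of n², so for a fixed window the cost is linear in sequence length.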
Key Takeaways
- Self-attention is the core of Transformers
- Multi-head captures different relationship types
- Causal masking prevents looking ahead in generation
- Cross-attention links encoder and decoder
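- Flash Attention is a widely used optimization: exact attention with far less memory traffic
- KV cache speeds up autoregressive generation by reusing keys and values from earlier steps
- Sliding-window attention makes long sequences tractable by limiting each token to a local window
- Standard attention is O(n²) in sequence length, a key limitation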
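
The KV cache point can be sketched as follows: during generation, the keys and values of earlier tokens are stored, so each step only computes attention for the newest token. This is a single-head toy under stated assumptions (no learned projections; the `KVCache` class and its `attend` method are hypothetical names), not how any particular library implements it.

```python
import math

class KVCache:
    """Stores keys and values of already-processed tokens for one attention head."""

    def __init__(self):
        self.keys = []
        self.values = []

    def attend(self, q, k, v):
        """Append the new token's key/value, then attend from q over the whole cache."""
        self.keys.append(k)
        self.values.append(v)
        d = len(q)
        scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(d) for key in self.keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        return [sum(e / total * val[c] for e, val in zip(exps, self.values))
                for c in range(len(v))]

cache = KVCache()
steps = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one vector per generated token
outputs = [cache.attend(t, t, t) for t in steps]
```

Each step does work proportional to the current cache length rather than recomputing attention for the whole prefix, and the causal constraint holds automatically because the cache only ever contains past tokens.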