Attention Mechanisms — Deep Dive

Attention lets a model weight the parts of the input that are most relevant to each output it produces.

Types of Attention

Self-Attention:
├─ Input attends to itself
├─ Each token attends to every token in the same sequence
├─ Captures internal relationships
└─ Used in: Encoder, decoder (with mask)
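
As a rough illustration of the points above, here is a minimal single-head sketch of scaled dot-product self-attention in NumPy. The function name, the weight matrices w_q, w_k, w_v, and the toy dimensions are assumptions made for this example, not part of any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over one sequence.

    x: (n_tokens, d_model); w_q, w_k, w_v: (d_model, d_head).
    Returns the output (n_tokens, d_head) and the weights (n_tokens, n_tokens).
    """
    # Queries, keys, and values all come from the same input.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Every token is scored against every token, then scaled by sqrt(d_head).
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 4, 8, 8
x = rng.normal(size=(n_tokens, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Row i of weights shows how strongly token i attends to each token in the sequence, itself included.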

Cross-Attention:
├─ Output attends to input
├─ Decoder queries encoder output
├─ Links input and output
└─ Used in: Encoder-Decoder models
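
The only change from self-attention is where the inputs come from: queries are computed from decoder states, keys and values from encoder output. A minimal sketch, assuming single-head attention and hypothetical names (cross_attention, decoder_states, encoder_states):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_states, w_q, w_k, w_v):
    """Single-head cross-attention.

    decoder_states: (n_tgt, d_model), encoder_states: (n_src, d_model).
    Returns the output (n_tgt, d_head) and the weights (n_tgt, n_src).
    """
    q = decoder_states @ w_q  # queries come from the decoder
    k = encoder_states @ w_k  # keys come from the encoder output
    v = encoder_states @ w_v  # values come from the encoder output
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
n_src, n_tgt, d_model = 6, 3, 8
encoder_states = rng.normal(size=(n_src, d_model))
decoder_states = rng.normal(size=(n_tgt, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, weights = cross_attention(decoder_states, encoder_states, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

The weight matrix is rectangular here: one row per output position, one column per input position.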

Causal (Masked) Attention:
├─ Each token can attend only to itself and earlier tokens
├─ Prevents looking ahead
└─ Used in: GPT and other autoregressive models
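
One common way to implement the mask is to set the scores of future positions to negative infinity before the softmax, so they receive zero weight. A minimal sketch, assuming the scores are already computed and using a hypothetical helper name:

```python
import numpy as np

def causal_attention_weights(scores):
    """Apply a causal mask to an (n, n) score matrix, then softmax each row."""
    n = scores.shape[-1]
    # True above the diagonal marks future positions that must be hidden.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # The diagonal is never masked, so each row maximum is finite.
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)  # exp(-inf) is exactly 0 for masked positions
    return e / e.sum(axis=-1, keepdims=True)

# Uniform scores make the effect of the mask easy to see.
scores = np.zeros((4, 4))
weights = causal_attention_weights(scores)
print(np.round(weights, 2))
```

With uniform scores, the first token puts all of its weight on itself, the second splits its weight evenly over two positions, and so on. No row places weight to the right of the diagonal.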

Multi-Head Attention

Why multiple heads?

Head 1: "What is the subject?"
Head 2: "What is the verb?"
Head 3: "What describes the noun?"
Head 4: "What is the temporal context?"

Each head can learn a different type of relationship; the questions above are illustrative, not fixed roles.
Output = Concat(head_1, ..., head_h) × W^O
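
The formula above can be sketched in NumPy as follows. This is a minimal version under stated assumptions: d_model is divisible by the number of heads, there is no masking or batching, and the weight names (w_q, w_k, w_v, w_o) and toy sizes are chosen for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads):
    """Multi-head self-attention over one sequence x of shape (n, d_model)."""
    n, d_model = x.shape
    d_head = d_model // n_heads  # assumes d_model is divisible by n_heads

    def split(t):
        # (n, d_model) -> (n_heads, n, d_head)
        return t.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    # Each head computes its own (n, n) attention pattern.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ v  # (n_heads, n, d_head)
    # Concat(heads): put the heads back side by side, then project with W^O.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ w_o, weights

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
n, d_model, n_heads = 5, 16, 4
x = rng.normal(size=(n, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads)
print(out.shape, weights.shape)
```

Each head works in a d_model / n_heads slice, so the total cost stays close to that of one full-width head while allowing several independent attention patterns.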

Efficient Attention

Problem: Standard attention costs O(n²) time and memory in sequence length n
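
To make the quadratic growth concrete, the short calculation below estimates the size of one attention score matrix. It assumes 2 bytes per value (as with fp16), a single head, and a single sequence; the helper name is made up for this example.

```python
def score_matrix_bytes(n_tokens, bytes_per_value=2):
    # One score per (query, key) pair gives n_tokens * n_tokens values.
    return n_tokens * n_tokens * bytes_per_value

for n in (1024, 4096, 16384):
    mib = score_matrix_bytes(n) / 2**20
    print(f"{n} tokens -> {mib:.0f} MiB")
# 1024 tokens -> 2 MiB
# 4096 tokens -> 32 MiB
# 16384 tokens -> 512 MiB
```

Each 4x increase in sequence length multiplies the score matrix by 16, before counting heads, layers, or batch size.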

Solutions:
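├─ Sparse attention: Each token attends to a subset of tokens
├─ Linear attention: Approximates attention in O(n) time
├─ Flash Attention: Exact attention with a hardware-aware kernel that avoids materializing the full n × n score matrix
├─ Sliding window: Each token attends only to a fixed-size local window
└─ Grouped Query Attention: Groups of query heads share key/value heads, shrinking the KV cache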
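
As one example from the list, a sliding window can be expressed as a boolean mask over (query, key) pairs. The sketch below assumes a causal variant in which the window counts the current token; the function name and sizes are illustrative.

```python
import numpy as np

def sliding_window_mask(n_tokens, window):
    """Boolean (n_tokens, n_tokens) mask; True means the query may attend to the key.

    Assumes a causal window: each token sees itself and the window - 1
    tokens immediately before it.
    """
    i = np.arange(n_tokens)[:, None]  # query positions
    j = np.arange(n_tokens)[None, :]  # key positions
    return (j <= i) & (i - j < window)

mask = sliding_window_mask(n_tokens=6, window=3)
print(mask.astype(int))
print(int(mask.sum()))  # number of allowed (query, key) pairs
```

The number of allowed pairs grows roughly as n × window rather than n². In this toy case the mask allows 15 pairs, compared with 21 for a full causal mask over 6 tokens.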

Key Takeaways

Self-attention is the core of Transformers
Multi-head captures different relationship types
Causal masking prevents looking ahead in generation
Cross-attention links encoder and decoder
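Flash Attention is a widely used optimization that computes exact attention faster and with less memory
KV cache speeds up autoregressive generation by reusing keys and values from earlier tokens
Sliding window attention makes long sequences tractable by limiting each token to local context
Standard attention is O(n²) in sequence length, a key limitation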
