Attention Mechanisms — Deep Dive

Attention lets a model weight the most relevant parts of its input when computing each output, rather than treating every position equally.


Types of Attention

Self-Attention:
├─ Input attends to itself
├─ Each token relates to all other tokens
├─ Captures internal relationships
└─ Used in: Encoders, and decoders (with a causal mask)
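
As a rough illustration of the points above, here is a minimal NumPy sketch of single-head scaled dot-product self-attention, the usual formulation in Transformers. The projection matrices w_q, w_k, w_v, the sizes, and the random inputs are assumptions made for the example, not values from this lesson.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # Queries, keys, and values all come from the same input x.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # scores[i, j] says how strongly token i relates to token j.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
n_tokens, d_model = 4, 8  # illustrative sizes
x = rng.normal(size=(n_tokens, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape)      # (4, 8)
print(weights.shape)  # (4, 4): every token attends to every token
```

Each row of weights sums to 1, so every output row is a weighted average of the value vectors.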

Cross-Attention:
├─ Output attends to input
├─ Decoder queries encoder output
├─ Links input and output
└─ Used in: Encoder-Decoder models
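
The same computation can be sketched for cross-attention: the only change is that queries come from the decoder while keys and values come from the encoder output. The names decoder_states and encoder_states, the weight matrices, and the sizes are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_states, w_q, w_k, w_v):
    q = decoder_states @ w_q  # queries from the output side
    k = encoder_states @ w_k  # keys from the input side
    v = encoder_states @ w_v  # values from the input side
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
src_len, tgt_len, d_model = 6, 3, 8  # illustrative sizes
encoder_states = rng.normal(size=(src_len, d_model))
decoder_states = rng.normal(size=(tgt_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, weights = cross_attention(decoder_states, encoder_states, w_q, w_k, w_v)
print(weights.shape)  # (3, 6): one row per output token, one column per input token
print(out.shape)      # (3, 8)
```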

Causal (Masked) Attention:
├─ Each token can attend only to itself and earlier tokens
├─ Prevents looking ahead
└─ Used in: GPT, autoregressive models
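
One common way to implement the mask, sketched here under the assumption of a single head and random toy inputs, is to set the scores for future positions to negative infinity before the softmax so they receive zero weight.

```python
import numpy as np

def causal_attention_weights(q, k):
    n = q.shape[0]
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # True above the diagonal marks "future" positions (column j > row i).
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Stable softmax; exp(-inf) is exactly 0, so masked entries vanish.
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))  # toy queries: 4 tokens, dimension 8
k = rng.normal(size=(4, 8))  # toy keys

weights = causal_attention_weights(q, k)
print(np.round(weights, 2))  # lower-triangular: zeros above the diagonal
```

The first token can only attend to itself, so the first row is always [1, 0, 0, 0].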

Multi-Head Attention

Why multiple heads?

Head 1: "What is the subject?"
Head 2: "What is the verb?"
Head 3: "What describes the noun?"
Head 4: "What is the temporal context?"

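Each head can learn a different relationship type. The questions above are an intuition, not a guarantee of what trained heads actually pick up.
Output = Concat(head_1, ..., head_h) × W^O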
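
The Concat(heads) × W^O formula can be sketched in NumPy as follows. This is a minimal version assuming d_model is divisible by the number of heads; the weight names, sizes, and random inputs are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads):
    n, d_model = x.shape
    d_head = d_model // n_heads  # assumes d_model % n_heads == 0

    def split(t):
        # (n, d_model) -> (n_heads, n, d_head)
        return t.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    # Each head computes its own attention pattern independently.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores, axis=-1) @ v  # (n_heads, n, d_head)
    # Concatenate the heads along the feature axis, then mix them with W^O.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ w_o

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 5, 16, 4  # illustrative sizes
x = rng.normal(size=(n_tokens, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out = multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads)
print(out.shape)  # (5, 16)
```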

Efficient Attention

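Problem: Standard attention is O(n²) in sequence length, in both time and memory

Solutions:
├─ Sparse attention: Attend to a subset of tokens
├─ Linear attention: O(n) approximation of softmax attention
├─ Flash Attention: Exact attention, tiled to cut GPU memory traffic (compute stays O(n²))
├─ Sliding window: Each token attends to a fixed-size local window
└─ Grouped Query Attention: Groups of query heads share K/V heads, shrinking the KV cache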
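
To make the sliding-window idea concrete, here is a small sketch of a causal local mask. The helper name sliding_window_mask and the sizes are hypothetical; real implementations typically avoid materializing the full n × n mask.

```python
import numpy as np

def sliding_window_mask(n, window):
    # True where token i may attend to token j:
    # j is not in the future and lies within the last `window` positions.
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)

n, window = 6, 3  # illustrative sizes
mask = sliding_window_mask(n, window)
print(mask.astype(int))

full_causal_pairs = n * (n + 1) // 2
print(mask.sum(), "attended pairs vs", full_causal_pairs, "for full causal attention")
```

Each row has at most `window` allowed positions, so the number of attended pairs grows roughly as n × window instead of n².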

Key Takeaways

  1. Self-attention is the core of Transformers
  2. Multi-head captures different relationship types
  3. Causal masking prevents looking ahead in generation
  4. Cross-attention links encoder and decoder
  5. Flash Attention speeds up exact attention by reducing memory traffic
  6. KV cache speeds up autoregressive generation
  7. Sliding window enables long sequences
  8. Attention complexity is O(n²) — a key limitation
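
The KV cache in takeaway 6 can be sketched as follows: during generation, keys and values for earlier tokens are stored and reused, so each step only projects the newest token. This toy single-head example uses random vectors in place of real token embeddings, and all names and sizes are assumptions for illustration. It checks the cached result against full causal attention.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_steps, d = 5, 8  # illustrative sizes
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
tokens = rng.normal(size=(n_steps, d))  # stand-in for token embeddings

# Incremental decoding: append one key/value row per step, reuse the rest.
k_cache = np.empty((0, d))
v_cache = np.empty((0, d))
cached_outputs = []
for x_t in tokens:
    k_cache = np.vstack([k_cache, x_t @ w_k])
    v_cache = np.vstack([v_cache, x_t @ w_v])
    q_t = x_t @ w_q
    weights = softmax(q_t @ k_cache.T / np.sqrt(d))
    cached_outputs.append(weights @ v_cache)
cached_outputs = np.stack(cached_outputs)

# Reference: recompute everything at once with a causal mask.
q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
scores = q @ k.T / np.sqrt(d)
future = np.triu(np.ones((n_steps, n_steps), dtype=bool), k=1)
reference = softmax(np.where(future, -np.inf, scores)) @ v

print(np.allclose(cached_outputs, reference))  # True
```

The cache trades memory for compute: it grows by one key row and one value row per generated token.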
