Transformer Architecture
The Transformer Revolution
Introduced in the paper "Attention Is All You Need" (Vaswani et al., 2017), the Transformer architecture reshaped natural language processing and became the foundation for most modern large language models and many other generative AI systems.
Why Transformers?
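- Parallelization: Unlike RNNs, which consume tokens one at a time, a Transformer processes all tokens of a sequence in parallel during training (autoregressive generation still produces one token per step)
- Long-range Dependencies: The attention mechanism relates every token directly to every other token, so relationships across the entire sequence are captured in a single step
- Scalability: Empirically, performance tends to improve smoothly as model size, data, and compute grow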
Core Components
Self-Attention Mechanism
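As a brief sketch of what this component computes, the scaled dot-product attention defined by Vaswani et al. (2017) can be written as follows, where $Q$, $K$, and $V$ are the query, key, and value matrices and $d_k$ is the dimension of each key vector (the same names are used in the multi-head attention code later in this section):

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

In self-attention, $Q$, $K$, and $V$ are all linear projections of the same input sequence. Dividing by $\sqrt{d_k}$ keeps the dot products from growing with the dimension, which would otherwise push the softmax into regions with very small gradients.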
Positional Encoding
Because self-attention by itself is insensitive to token order, positional encodings are added to the token embeddings to inject sequence order information:
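import math

import torch

def positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) tensor of sinusoidal encodings (d_model assumed even)."""
    pe = torch.zeros(max_len, d_model)
    # Column vector of positions 0..max_len-1, shape (max_len, 1)
    position = torch.arange(0, max_len).unsqueeze(1).float()
    # One frequency per sin/cos pair: 1 / 10000^(2i / d_model)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe

# Example: 100 positions, 512 dimensions
pe = positional_encoding(100, 512)
print(f"Positional encoding shape: {pe.shape}")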
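To show where these encodings are used, here is a minimal sketch that adds them to token embeddings before the first Transformer layer. The vocabulary size, batch shape, and the names `sinusoidal_pe`, `embedding`, and `token_ids` are illustrative assumptions rather than part of the original example; `sinusoidal_pe` simply restates the function above so the snippet runs on its own.

```python
import math

import torch
import torch.nn as nn


def sinusoidal_pe(max_len, d_model):
    # Same computation as positional_encoding above, repeated so this runs standalone
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(max_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe


# Hypothetical sizes chosen only for illustration
vocab_size, d_model, max_len = 1000, 64, 100
batch_size, seq_len = 2, 10

embedding = nn.Embedding(vocab_size, d_model)
token_ids = torch.randint(0, vocab_size, (batch_size, seq_len))

# (batch, seq_len, d_model); scaling by sqrt(d_model) follows Vaswani et al. (2017)
x = embedding(token_ids) * math.sqrt(d_model)

# Slice the table to the actual sequence length; broadcasting adds the same
# (seq_len, d_model) encodings to every sequence in the batch
pe = sinusoidal_pe(max_len, d_model)
x = x + pe[:seq_len]

print(x.shape)  # torch.Size([2, 10, 64])
```

Precomputing the table once for `max_len` positions and slicing it per batch avoids recomputing the sines and cosines on every forward pass.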
Multi-Head Attention
Multi-head attention allows the model to jointly attend to information from different representation subspaces:
import math

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        # Each head works on an equal slice of the model dimension
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_model = d_model
        self.d_k = d_model // num_heads
        # Learned projections for queries, keys, values, and the final output
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        # Q, K, V: (batch, num_heads, seq_len, d_k)
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            # Positions where mask == 0 get a large negative score, so softmax gives them ~0 weight
            scores = scores.masked_fill(mask == 0, -1e9)
        attention_weights = torch.softmax(scores, dim=-1)
        output = torch.matmul(attention_weights, V)
        return output, attention_weights

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        # Project, split d_model into (num_heads, d_k), and move the head axis ahead of the sequence axis
        Q = self.W_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        output, attention_weights = self.scaled_dot_product_attention(Q, K, V, mask)
        # Merge the heads back into shape (batch, seq_len, d_model)
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        return self.W_o(output)
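The `mask` argument above follows the convention that positions where `mask == 0` are blocked. As a minimal sketch, assuming a decoder-style (causal) setting, the snippet below builds a lower-triangular mask and applies the same `masked_fill` and softmax steps to dummy scores. The sizes and the all-zero scores are illustrative assumptions, chosen so the resulting weights are easy to read.

```python
import torch

seq_len = 4  # hypothetical sequence length for illustration

# Lower-triangular matrix: row i has ones in columns 0..i, so token i
# may attend to itself and to earlier tokens only
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

# Dummy scores shaped (batch, num_heads, seq_len, seq_len); all zeros so that
# every allowed position ends up with equal weight
scores = torch.zeros(1, 2, seq_len, seq_len)

# Same steps as in scaled_dot_product_attention: the (seq_len, seq_len) mask
# broadcasts across the batch and head dimensions
masked_scores = scores.masked_fill(causal_mask == 0, -1e9)
weights = torch.softmax(masked_scores, dim=-1)

print(weights[0, 0])
# Row i is uniform over the first i + 1 positions and zero elsewhere,
# e.g. row 1 is [0.5, 0.5, 0.0, 0.0]
```

Passing a mask like `causal_mask` as the `mask` argument of `forward` applies this pattern to every head, since the mask broadcasts against the `(batch, num_heads, seq_len, seq_len)` score tensor.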
Model Variants
| Variant | Architecture | Use Case |
|---|---|---|
| BERT | Encoder-only | Understanding, Classification |
| GPT | Decoder-only | Text Generation |
| T5 | Encoder-Decoder | Translation, Summarization |
| BART | Encoder-Decoder | Summarization, QA |
Key Hyperparameters
# Example configuration for a small Transformer (layer sizes comparable to GPT-2 small)
config = {
    "d_model": 768,        # Embedding dimension
    "num_heads": 12,       # Number of attention heads (768 / 12 = 64 dimensions per head)
    "num_layers": 12,      # Number of transformer layers
    "d_ff": 3072,          # Feed-forward dimension (4 x d_model)
    "vocab_size": 50257,   # Vocabulary size
    "max_seq_len": 2048,   # Maximum sequence length
    "dropout": 0.1,        # Dropout rate
}
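To make these numbers concrete, the following sketch estimates the parameter count implied by the configuration. It assumes a GPT-2-style decoder-only layout: learned positional embeddings, biases on every linear layer, two LayerNorms per block plus a final one, and an output projection tied to the token embedding. Those layout choices and the helper name `count_parameters` are assumptions for illustration; the configuration itself does not specify them.

```python
def count_parameters(cfg):
    """Rough parameter count for a decoder-only Transformer (assumed layout)."""
    d, d_ff = cfg["d_model"], cfg["d_ff"]

    token_emb = cfg["vocab_size"] * d
    pos_emb = cfg["max_seq_len"] * d          # assumes learned positions

    attention = 4 * (d * d + d)               # W_q, W_k, W_v, W_o with biases
    feed_forward = (d * d_ff + d_ff) + (d_ff * d + d)
    layer_norms = 2 * (2 * d)                 # scale and shift, twice per block
    per_layer = attention + feed_forward + layer_norms

    final_norm = 2 * d
    return token_emb + pos_emb + cfg["num_layers"] * per_layer + final_norm


config = {
    "d_model": 768, "num_heads": 12, "num_layers": 12, "d_ff": 3072,
    "vocab_size": 50257, "max_seq_len": 2048, "dropout": 0.1,
}

total = count_parameters(config)
print(f"{total:,} parameters")  # 125,226,240 parameters
```

Under these assumptions the 12 blocks account for roughly 85M parameters and the token embedding for roughly 39M. `num_heads` does not appear in the count because the heads split `d_model` among themselves rather than adding to it.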
Summary
The Transformer architecture's self-attention mechanism enables efficient parallel processing and captures long-range dependencies, making it the foundation of modern generative AI.
Next: We'll explore the attention mechanism in greater detail.