Transformers + BERT
The Transformer architecture (Vaswani et al., 2017) revolutionized NLP by replacing recurrence with self-attention. BERT leverages Transformers for bidirectional language understanding. This lesson covers the full architecture and practical fine-tuning.
1. The Attention Mechanism
Attention computes a weighted sum of values, where weights reflect the importance of each value to a query.
Scaled Dot-Product Attention
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
Here,
- $Q$ = Query matrix ($n \times d_k$)
- $K$ = Key matrix ($m \times d_k$)
- $V$ = Value matrix ($m \times d_v$)
- $d_k$ = Dimension of keys ($\sqrt{d_k}$ is the scaling factor)
Visual: Attention Score Matrix
Query words: "The cat sat on the mat"
Key words: "The cat sat on the mat"
        The    cat    sat    on     the    mat
      +------+------+------+------+------+------+
"The" | 0.82 | 0.03 | 0.01 | 0.05 | 0.06 | 0.03 |
"cat" | 0.02 | 0.71 | 0.12 | 0.02 | 0.02 | 0.11 |
"sat" | 0.01 | 0.15 | 0.68 | 0.08 | 0.01 | 0.07 |
"on"  | 0.03 | 0.01 | 0.09 | 0.78 | 0.03 | 0.06 |
"the" | 0.11 | 0.02 | 0.01 | 0.04 | 0.79 | 0.03 |
"mat" | 0.02 | 0.09 | 0.06 | 0.07 | 0.02 | 0.74 |
      +------+------+------+------+------+------+
Each row = attention weights for that word (each row sums to 1)
High diagonal = words attend strongly to themselves
"cat" attends to "mat" (0.11, semantic relationship)
Note: Why Scale by sqrt(d_k)?
Without scaling, dot products grow large with d_k, pushing softmax into regions with tiny gradients. Dividing by sqrt(d_k) keeps variance at 1, maintaining stable gradients.
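Variance of Unscaled Dot Products
$$\mathrm{Var}(q \cdot k) = \mathrm{Var}\left(\sum_{i=1}^{d_k} q_i k_i\right) = \sum_{i=1}^{d_k} \mathrm{Var}(q_i)\,\mathrm{Var}(k_i) = d_k$$
Here,
- $d_k$ = Dimension of keys
- $q_i$ = Individual query element (zero mean, unit variance)
- $k_i$ = Individual key element (zero mean, unit variance, independent of $q_i$)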
Why Scaled Dot-Product?
Without scaling:
d_k = 512 dimensions
Q and K are random vectors with mean=0, var=1
dot(Q, K) = sum(q_i * k_i) for i=1..512
Var(dot) = d_k * Var(q_i) * Var(k_i) = 512
Large values -> softmax saturates -> gradients vanish
With scaling (divide by sqrt(d_k) = sqrt(512) = 22.6):
Var(dot/sqrt(d_k)) = 512 / 512 = 1
Variance stays at 1 regardless of d_k
Softmax stays in its "sensitive" region
Gradients flow properly during backprop
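The variance argument above can be checked numerically. The following is a minimal sketch, assuming query and key entries drawn independently from a standard normal distribution; the function name `dot_product_variance`, the sample count, and the seed are illustrative choices rather than part of the lesson.

```python
import math
import random


def dot_product_variance(d_k, num_samples=1000, seed=0):
    """Estimate Var(q . k) and Var(q . k / sqrt(d_k)) for standard-normal q and k."""
    rng = random.Random(seed)
    raw = []
    for _ in range(num_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        raw.append(sum(qi * ki for qi, ki in zip(q, k)))
    scaled = [x / math.sqrt(d_k) for x in raw]

    def sample_variance(xs):
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

    return sample_variance(raw), sample_variance(scaled)


for d_k in (64, 512):
    var_raw, var_scaled = dot_product_variance(d_k)
    print(f"d_k={d_k:4d}  unscaled variance ~ {var_raw:7.1f}  scaled variance ~ {var_scaled:.2f}")
```

The unscaled estimate should land near d_k and the scaled estimate near 1, up to sampling noise.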
Theorem: Attention Scaling Property
For queries and keys whose entries are independent with zero mean and unit variance, the variance of the unscaled dot product equals the key dimension d_k. Scaling by 1/sqrt(d_k) normalizes the variance to 1, which keeps softmax in a regime with non-vanishing gradients.
Example: Computing Attention Weights
Consider a single query Q = [1, 0, 1] and a single key K = [1, 0, 1] with d_k = 3:
- Dot product: QK^T = 1*1 + 0*0 + 1*1 = 2
- Scaled: 2 / sqrt(3) ≈ 1.155
- Softmax: e^1.155 / e^1.155 = 1.0 (a single key receives all the weight)
- For multiple keys, softmax distributes weights across tokens
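To see how softmax spreads weight once there is more than one key, the sketch below reuses the query [1, 0, 1] from the example. The two extra keys are hypothetical values chosen only for illustration, and `attention_weights` is a helper name introduced here.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over several keys."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    return softmax(scores)


query = [1, 0, 1]
keys = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]  # hypothetical keys; the first matches the query
weights = attention_weights(query, keys)
print([round(w, 3) for w in weights])
```

The key that matches the query gets the largest weight, the orthogonal key gets the smallest, and the weights sum to 1.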
2. Multi-Head Attention
Instead of a single attention function, run h parallel attention heads and concatenate:
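Multi-Head Attention
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O$$
Here,
- $\mathrm{head}_i$ = Individual attention head
- $W^O$ = Output projection matrix
Individual Attention Head
$$\mathrm{head}_i = \mathrm{Attention}(QW_i^Q,\; KW_i^K,\; VW_i^V)$$
Here,
- $W_i^Q$ = Query projection for head i
- $W_i^K$ = Key projection for head i
- $W_i^V$ = Value projection for head i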
Input: X ∈ ℝ^(n × d_model)
  |
  +-- Q = XW^Q --> Head 1 (Attention) --+
  +-- K = XW^K --> Head 2 (Attention) --+--> Concat -> Linear -> Output
  +-- V = XW^V --> Head 3 (Attention) --+
  +--          --> Head 4 (Attention) --+
Q, K, and V all feed every head; only 4 of the h = 8 heads are drawn.
Each head: d_k = d_model / h = 512 / 8 = 64 dimensions
Why Multiple Heads?
Different heads learn different types of relationships:
Head 1 (syntactic): "cat" <-> "sat" (subject-verb)
Head 2 (semantic): "cat" <-> "mat" (object)
Head 3 (positional): "the" <-> "cat" (determiner)
Head 4 (long-range): "cat" <-> "the mat" (noun phrase)
Definition: Self-Attention
Self-attention lets each token "attend to" every other token to gather context. Think of it as each word asking: "Which other words should I pay attention to?"
Multi-Head Attention Intuition
Multiple heads allow the model to jointly attend to information from different representation subspaces at different positions. Each head can specialize in capturing different types of dependencies (syntactic, semantic, positional).
Step-by-step for "The cat sat on the mat":
1. Create Query, Key, Value vectors for each token:
Token Query (what am I looking for?)
------ -------------------------------
The [0.1, 0.3, -0.2, ...] "I'm a determiner, looking for a noun"
cat [0.5, -0.1, 0.4, ...] "I'm a noun, looking for my verb"
sat [-0.3, 0.6, 0.2, ...] "I'm a verb, looking for my subject"
the [0.1, 0.3, -0.2, ...] (same as first "The")
mat [0.4, 0.2, -0.1, ...] "I'm a noun, looking for my verb"
2. Compute attention scores (dot product of Q and K):
score("cat", "sat") = dot(Q_cat, K_sat) = 0.8 (high! subject-verb)
score("cat", "mat") = dot(Q_cat, K_mat) = 0.6 (medium, object)
score("cat", "the") = dot(Q_cat, K_the) = 0.1 (low, just a determiner)
3. Scale by sqrt(d_k) and apply softmax:
attention_weights = softmax(scores / sqrt(d_k))
"cat" attends to: sat=0.6, mat=0.3, the=0.05, The=0.05
4. Weighted sum of Values:
output("cat") = 0.6*V_sat + 0.3*V_mat + 0.05*V_the + 0.05*V_The
This output encodes "cat" with context from "sat" (its verb)
and "mat" (its object), while mostly ignoring determiners.
Attention Matrix Visualization
The cat sat on the mat
+------+------+------+------+------+------+
The | 0.82 | 0.03 | 0.01 | 0.05 | 0.06 | 0.03 |
+------+------+------+------+------+------+
cat | 0.02 | 0.71 | 0.12 | 0.02 | 0.02 | 0.11 |
+------+------+------+------+------+------+
sat | 0.01 | 0.15 | 0.68 | 0.08 | 0.01 | 0.07 |
+------+------+------+------+------+------+
on | 0.03 | 0.01 | 0.09 | 0.78 | 0.03 | 0.06 |
+------+------+------+------+------+------+
the | 0.11 | 0.02 | 0.01 | 0.04 | 0.79 | 0.03 |
+------+------+------+------+------+------+
mat | 0.02 | 0.09 | 0.06 | 0.07 | 0.02 | 0.74 |
+------+------+------+------+------+------+
Reading: Row i, Column j = how much token i attends to token j
Diagonal is high (tokens attend to themselves)
Apart from itself (0.71), "cat" attends most to "sat" (0.12) and "mat" (0.11)
import torch
import torch.nn as nn
import math


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            # Positions where mask == 0 receive zero attention weight
            scores = scores.masked_fill(mask == 0, float('-inf'))
        attention_weights = torch.softmax(scores, dim=-1)
        output = torch.matmul(attention_weights, V)
        return output, attention_weights

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)
        # Linear projections, then reshape to (batch, heads, seq_len, d_k)
        Q = self.W_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        # Attention
        output, attn_weights = self.scaled_dot_product_attention(Q, K, V, mask)
        # Concatenate heads back to (batch, seq_len, d_model)
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        return self.W_o(output)

# Example
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512)  # batch=2, seq_len=10
output = mha(x, x, x)  # self-attention
print(output.shape)  # torch.Size([2, 10, 512])
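The `mask` argument in the code above is how a decoder stops a position from attending to later positions: masked scores are set to negative infinity before softmax, so they end up with zero weight. The sketch below shows that effect in plain Python on a small score matrix; `causal_mask` and `masked_softmax` are hypothetical helper names, not part of the class above. In PyTorch an equivalent mask could be built with `torch.tril(torch.ones(n, n))`.

```python
import math


def causal_mask(n):
    """mask[i][j] is 1 if position i may attend to position j (j <= i), else 0."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask):
    """Row-wise softmax in which positions with mask == 0 receive zero weight."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        masked = [s if m else float("-inf") for s, m in zip(score_row, mask_row)]
        top = max(masked)  # finite, because the diagonal is never masked
        exps = [math.exp(s - top) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


n = 4
scores = [[0.0] * n for _ in range(n)]  # equal scores make the mask's effect easy to read
for row in masked_softmax(scores, causal_mask(n)):
    print([round(w, 2) for w in row])
```

With equal scores, row i spreads its weight uniformly over positions 0..i and gives none to later positions.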
3. Transformer Architecture
Encoder Block
Input Embedding + Positional Encoding
                |
                v
+--------------------------------+
| Multi-Head Self-Attn           | --> Add & LayerNorm
+--------------------------------+
                |
                v
+--------------------------------+
| Feed-Forward Network           | --> Add & LayerNorm
| (d_model -> d_ff -> d_model)   |
+--------------------------------+
                |
                v
              Output
Decoder Block
Target Embedding + Positional Encoding
                |
                v
+--------------------------------+
| Masked Multi-Head              | --> Add & LayerNorm
| Self-Attention                 |
+--------------------------------+
                |
                v
+--------------------------------+
| Cross-Attention                | --> Add & LayerNorm
| (Q from decoder,               |
|  K,V from encoder)             |
+--------------------------------+
                |
                v
+--------------------------------+
| Feed-Forward Network           | --> Add & LayerNorm
+--------------------------------+
Positional Encoding
Since Transformers have no recurrence, positional information is injected:
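Positional Encoding (Sine)
$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
Positional Encoding (Cosine)
$$PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
Here,
- $pos$ = Position in the sequence
- $i$ = Dimension index (dimensions $2i$ and $2i+1$ share one frequency)
- $d_{\text{model}}$ = Model dimension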
Definition: Positional Encoding
Positional encoding injects sequence order information into the model by adding sinusoidal functions of different frequencies to the input embeddings. This allows the model to distinguish between different positions in the sequence.
Note: Why Sinusoidal Encoding?
The sinusoidal functions let the model learn to attend by relative position, since for any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos). Vaswani et al. also hypothesized that this may allow the model to extrapolate to sequence lengths longer than those seen during training.
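The "linear function" claim can be made concrete: for each sine/cosine pair, moving from pos to pos+k is a rotation whose angle depends only on k. The sketch below checks this numerically, assuming the standard sinusoidal form PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The helper names `pe_pair` and `shift_pair` are introduced here for illustration.

```python
import math


def pe_pair(pos, i, d_model=512):
    """Return (PE(pos, 2i), PE(pos, 2i+1)) for the sinusoidal encoding."""
    angle = pos / (10000 ** (2 * i / d_model))
    return math.sin(angle), math.cos(angle)


def shift_pair(pair, k, i, d_model=512):
    """Map the pair at some position to the pair k positions later.

    The rotation angle theta depends on the offset k and the index i only,
    never on the position itself, so the map is linear in the pair.
    """
    s, c = pair
    theta = k / (10000 ** (2 * i / d_model))
    return (s * math.cos(theta) + c * math.sin(theta),
            c * math.cos(theta) - s * math.sin(theta))


pos, k, i = 5, 7, 3
print(shift_pair(pe_pair(pos, i), k, i))  # rotated encoding of position 5
print(pe_pair(pos + k, i))                # direct encoding of position 12
```

Both printed pairs should agree up to floating-point error, which is the angle-addition identity applied to each frequency.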
class PositionalEncoding(nn.Module):
    def __init__(self, d_model=512, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        # 1 / 10000^(2i / d_model), computed in log space
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        seq_len = x.size(1)
        return x + self.pe[:, :seq_len]

# Visualize positional encodings
pe = PositionalEncoding(d_model=128)
positions = pe.pe[0, :100, :].detach().numpy()  # shape (100, 128)
# Each row is a positional encoding vector
# Sine/cosine waves of different frequencies
# Similar positions have similar encodings (smooth interpolation)
Example: Positional Encoding Computation
For position pos=5, dimension index i=0, d_model=512:
- Compute angle: 5 / 10000^(0/512) = 5 / 1 = 5
- Sine encoding (dimension 0): sin(5) ≈ -0.959
- Cosine encoding (dimension 1): cos(5) ≈ 0.284
For i=1 (dimensions 2 and 3): angle = 5 / 10000^(2/512) ≈ 5 / 1.037 ≈ 4.823
Full Transformer Encoder
class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention with residual connection
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        # FFN with residual connection
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout(ffn_output))
        return x


class TransformerEncoder(nn.Module):
    def __init__(self, vocab_size, d_model=512, num_heads=8, d_ff=2048,
                 num_layers=6, max_len=5000, dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_len)
        self.layers = nn.ModuleList([
            TransformerEncoderLayer(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        self.d_model = d_model

    def forward(self, x, mask=None):
        # Scale embeddings by sqrt(d_model) before adding positional encodings
        x = self.embedding(x) * math.sqrt(self.d_model)
        x = self.pos_encoding(x)
        for layer in self.layers:
            x = layer(x, mask)
        return x

# Example
encoder = TransformerEncoder(vocab_size=30000, d_model=512)
x = torch.randint(0, 30000, (2, 20))  # batch=2, seq_len=20
output = encoder(x)
print(output.shape)  # torch.Size([2, 20, 512])
4. BERT (Bidirectional Encoder Representations from Transformers)
Architecture
BERT uses only the Transformer encoder (no decoder):
| Model | Layers | Hidden | Heads | Parameters |
|---|---|---|---|---|
| BERT-base | 12 | 768 | 12 | 110M |
| BERT-large | 24 | 1024 | 16 | 340M |
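The parameter counts in the table can be roughly reproduced from the layer and hidden sizes. The sketch below is a back-of-the-envelope count, assuming the `bert-base-uncased` conventions: a 30,522-token vocabulary, 512 positions, 2 segment types, a feed-forward size of 4 × hidden, biases on every linear layer, and a pooler on top. The function name `bert_param_count` is introduced here for illustration.

```python
def bert_param_count(num_layers, hidden, vocab_size=30522, max_positions=512, type_vocab=2):
    """Approximate parameter count of a BERT encoder (embeddings + layers + pooler)."""
    ffn = 4 * hidden                 # feed-forward inner size
    layer_norm = 2 * hidden          # scale and bias
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)  # Q, K, V, and output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + layer_norm + feed_forward + layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * per_layer + pooler


for name, layers, hidden in [("BERT-base", 12, 768), ("BERT-large", 24, 1024)]:
    total = bert_param_count(layers, hidden)
    print(f"{name}: {total:,} parameters (~{total / 1e6:.0f}M)")
```

The number of heads does not appear in the count because splitting d_model across heads does not change the size of the projection matrices. The totals come out slightly below the rounded figures in the table.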
Pre-training Objectives
Definition: Masked Language Model (MLM)
Randomly select 15% of tokens and predict them. Because the model sees context on both sides of each selected position, BERT learns bidirectional context.
Input: The [MASK] sat on the mat
Target: The cat sat on the mat
Masking strategy (applied to the selected 15%):
- 80% -> replace with [MASK]
- 10% -> replace with a random token
- 10% -> keep the original token
Note: Why 80/10/10 Masking?
The 80/10/10 strategy prevents the model from over-relying on the [MASK] token. Since [MASK] never appears during fine-tuning, always masking would create a mismatch between pre-training and fine-tuning.
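The masking rule can be sketched in a few lines. This is a simplified illustration on a list of string tokens, not BERT's actual data pipeline: it ignores special tokens such as [CLS] and [SEP], and the function name `mask_tokens`, the toy vocabulary, and the seed are assumptions made for the example.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (inputs, labels); labels[i] is None where no prediction is required."""
    rng = random.Random(seed)
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for idx, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                     # token not selected for prediction
        labels[idx] = token              # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[idx] = "[MASK]"       # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[idx] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels


tokens = "the cat sat on the mat".split()
vocab = ["dog", "ran", "under", "a", "rug"]  # toy vocabulary for random replacement
inputs, labels = mask_tokens(tokens, vocab, mask_prob=0.5, seed=1)  # high rate so a short sentence shows masking
print(inputs)
print(labels)
```

Only positions with a non-None label contribute to the MLM loss, including those where the input token was left unchanged.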
Next Sentence Prediction (NSP): Predict whether sentence B follows sentence A.
Input: [CLS] The cat sat on the mat [SEP] It was soft [SEP]
Label: IsNext (1)
Input: [CLS] The cat sat on the mat [SEP] Stocks rose [SEP]
Label: NotNext (0)
Theorem: BERT Pre-training Objective
BERT optimizes a combined loss: L = L_MLM + L_NSP, where L_MLM is the cross-entropy loss for masked token prediction and L_NSP is the binary cross-entropy loss for next sentence prediction. This joint objective trains bidirectional token representations together with a sentence-pair signal.
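As a rough illustration of how the two terms combine, the sketch below computes a cross-entropy for each masked position, averages them, and adds the next-sentence cross-entropy. It works on plain Python lists of logits. The names `cross_entropy` and `pretraining_loss` and the toy logits are assumptions for the example, not BERT's actual implementation.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum_exp - logits[target]


def pretraining_loss(mlm_logits, mlm_targets, nsp_logits, nsp_target):
    """L = L_MLM + L_NSP, with L_MLM averaged over the masked positions."""
    mlm = sum(cross_entropy(l, t) for l, t in zip(mlm_logits, mlm_targets)) / len(mlm_targets)
    nsp = cross_entropy(nsp_logits, nsp_target)
    return mlm + nsp


# Toy case: two masked positions over a 4-token vocabulary, one sentence pair
mlm_logits = [[2.0, 0.5, 0.1, -1.0], [0.0, 0.0, 3.0, 0.0]]
mlm_targets = [0, 2]
nsp_logits = [1.5, -0.5]  # index 1 = IsNext, index 0 = NotNext (as in the labels above)
print(round(pretraining_loss(mlm_logits, mlm_targets, nsp_logits, nsp_target=0), 4))
```

With two classes, the softmax cross-entropy used for the NSP term is equivalent to binary cross-entropy.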
Using Pre-trained BERT
from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Input
text = "BERT learns contextual word representations."
inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)

# Forward pass (no gradients needed for feature extraction)
with torch.no_grad():
    outputs = model(**inputs)

# Access different outputs
last_hidden = outputs.last_hidden_state  # (batch, seq_len, 768), includes [CLS] and [SEP]
pooler_output = outputs.pooler_output    # (batch, 768): [CLS] state after the pooler's dense + tanh layer
print(f"Token embeddings shape: {last_hidden.shape}")
print(f"Pooled [CLS] shape: {pooler_output.shape}")

# Extract specific token embeddings
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)
# e.g. ['bert', 'learns', 'contextual', 'word', 'representations', '.']
# (rare words may instead be split into '##' word pieces)
# Each token has a 768-dim contextual embedding
Example: BERT Token Embeddings
For input "The cat sat" (leaving aside the [CLS] and [SEP] tokens the tokenizer adds), BERT builds its input from:
- Token embeddings: [The], [cat], [sat] each with shape (768,)
- Position embeddings: [pos0], [pos1], [pos2]
- Segment embeddings: [sentence A] for all tokens
Combined: embedding = token + position + segment
Output: (3, 768) tensor with contextual representations, or (5, 768) once [CLS] and [SEP] are counted
5. Fine-tuning BERT for Classification
import torch
from torch.optim import AdamW  # use PyTorch's AdamW; transformers.AdamW is deprecated
from torch.utils.data import DataLoader, Dataset
from transformers import BertForSequenceClassification, BertTokenizer


class TextClassificationDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoding = self.tokenizer(
            self.texts[idx],
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].squeeze(0),
            'attention_mask': encoding['attention_mask'].squeeze(0),
            'label': torch.tensor(self.labels[idx])
        }

# Data: train_texts (list of str) and train_labels (list of int) are placeholders for your own dataset
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
train_dataset = TextClassificationDataset(train_texts, train_labels, tokenizer)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)  # batch size is an assumed value

# Model
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2,  # binary classification
)

# Freeze BERT layers (optional, for small datasets)
for param in model.bert.parameters():
    param.requires_grad = False

# Only train classifier
for param in model.classifier.parameters():
    param.requires_grad = True

# Optimizer (only over parameters that still require gradients)
optimizer = AdamW([p for p in model.parameters() if p.requires_grad], lr=2e-5, weight_decay=0.01)

# Training loop
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
for epoch in range(3):
    model.train()
    total_loss = 0
    for batch in train_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)
        optimizer.zero_grad()
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    avg_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch+1} | Loss: {avg_loss:.4f}")
Fine-tuning Best Practices
Use a small learning rate (2e-5 to 5e-5) when fine-tuning BERT. Larger learning rates can cause catastrophic forgetting of pre-trained knowledge. For small datasets, freeze BERT layers and only train the classifier head.
6. Other BERT Variants
RoBERTa (Robustly Optimized BERT)
- Removes NSP objective
- Uses larger batches and more data
- Dynamic masking (different mask each epoch)
ALBERT (A Lite BERT)
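- Factorized embedding parameterization: the V × H embedding matrix is split into a V × E matrix and an E × H matrix, with E much smaller than H
- Cross-layer parameter sharing (all layers share weights)
- Far fewer parameters: ALBERT-base has about 12M (vs. 110M for BERT-base); factorized embeddings alone, without sharing, give about 89M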
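The saving from factorizing the embedding matrix is easy to check with arithmetic. The sketch below assumes illustrative sizes of V = 30,000 vocabulary entries, H = 768 hidden units, and E = 128 embedding dimensions; these values and the function name are assumptions for the example.

```python
def embedding_params(vocab_size, hidden, embed_dim=None):
    """Parameters in the token embedding: V*H directly, or V*E + E*H when factorized."""
    if embed_dim is None:
        return vocab_size * hidden
    return vocab_size * embed_dim + embed_dim * hidden


V, H, E = 30000, 768, 128  # assumed sizes for illustration
full = embedding_params(V, H)
factorized = embedding_params(V, H, E)
print(f"V x H         = {full:,}")
print(f"V x E + E x H = {factorized:,}")
print(f"reduction     = {full / factorized:.1f}x")
```

The factorization pays off whenever E is much smaller than H, because the large vocabulary dimension V is then multiplied by E instead of H.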
DistilBERT
- Knowledge distillation from BERT
- 6 layers, 66M parameters (40% smaller, 60% faster)
- 97% of BERT's performance
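Knowledge distillation trains the small student model to match the teacher's softened output distribution. The sketch below shows the generic soft-target loss (a KL divergence at temperature T, scaled by T^2). It is not DistilBERT's full training recipe, and the function names, logits, and temperature are hypothetical values for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature; higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * (math.log(pt) - math.log(ps))
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return kl * temperature ** 2


teacher = [4.0, 1.0, 0.5]   # hypothetical teacher logits
student = [2.5, 1.5, 0.2]   # hypothetical student logits
print(round(distillation_loss(student, teacher), 4))
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge.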
DeBERTa
- Disentangled attention: separate content and position embeddings
- Enhanced mask decoder
- Surpassed the human baseline on the SuperGLUE benchmark
7. Visualization: How Attention Captures Syntax
Sentence: "The cat that the dog chased was brown"
Attention Head (layer 6, head 3):
           The    cat    that   the    dog    chased was    brown
         +------+------+------+------+------+------+------+------+
"cat"    | 0.45 | 0.02 | 0.12 | 0.03 | 0.08 | 0.25 | 0.02 | 0.03 |
"chased" | 0.02 | 0.38 | 0.03 | 0.02 | 0.31 | 0.04 | 0.12 | 0.08 |
"brown"  | 0.01 | 0.02 | 0.01 | 0.01 | 0.02 | 0.01 | 0.52 | 0.40 |
         +------+------+------+------+------+------+------+------+
"cat" attends to "The" (0.45, its determiner) and "chased" (0.25, the relative-clause verb whose object it is)
"chased" attends to "cat" (0.38, its object) and "dog" (0.31, its subject)
"brown" attends to "was" (0.52, adjective-copula)
8. Key Takeaways
Summary: Transformers + BERT
- Self-attention computes relationships between all pairs of tokens in O(n^2) time for a sequence of length n
- Multi-head attention captures different types of relationships in parallel
- Positional encoding injects sequence order information using sinusoidal functions
- BERT pre-trains with MLM and NSP, then fine-tunes for downstream tasks
- Fine-tuning with a small learning rate (2e-5 to 5e-5) works best
- Domain-specific BERT models (BioBERT, SciBERT, FinBERT) often outperform general BERT on domain tasks
- The Transformer's parallelization advantage over RNNs enables training on massive datasets
- Attention weights offer some interpretability for model decisions
9. Practice Exercises
Exercise 1: Implement Attention from Scratch
# TODO: Implement scaled dot-product attention
# Test with Q=K=V (self-attention) and Q≠K (cross-attention)
# Verify output shapes and attention weight properties
Exercise 2: Fine-tune BERT for NER
# TODO: Fine-tune BERT for Named Entity Recognition
# Use CoNLL-2003 dataset
# Target: F1 > 90% on entity recognition
Exercise 3: Compare Transformer Variants
# TODO: Compare BERT-base, RoBERTa, and DistilBERT on:
# 1. Training time
# 2. Inference speed
# 3. Accuracy on GLUE benchmark
# 4. Model size
Exercise 4: Attention Visualization
# TODO: Extract and visualize attention weights
# Show which tokens attend to which
# Identify syntactic patterns (subject-verb, adjective-noun)