Transformers + BERT

💡 The Transformer architecture (Vaswani et al., 2017) revolutionized NLP by replacing recurrence with self-attention. BERT leverages Transformers for bidirectional language understanding. This lesson covers the full architecture and practical fine-tuning.

1. The Attention Mechanism

Attention computes a weighted sum of values, where weights reflect the importance of each value to a query.

Scaled Dot-Product Attention

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V

Here,

$Q \in \mathbb{R}^{n \times d_k}$ = Query matrix
$K \in \mathbb{R}^{m \times d_k}$ = Key matrix
$V \in \mathbb{R}^{m \times d_v}$ = Value matrix
$d_k$ = Dimension of keys (scaling factor)

Visual: Attention Score Matrix

Query words:  "The cat sat on the mat"
Key words:    "The cat sat on the mat"

              The    cat    sat    on     the    mat
           ┌──────┬──────┬──────┬──────┬──────┬──────┐
"The"      │ 0.82 │ 0.03 │ 0.01 │ 0.05 │ 0.06 │ 0.03 │
"cat"      │ 0.02 │ 0.71 │ 0.12 │ 0.02 │ 0.02 │ 0.11 │
"sat"      │ 0.01 │ 0.15 │ 0.68 │ 0.08 │ 0.01 │ 0.07 │
"on"       │ 0.03 │ 0.01 │ 0.09 │ 0.78 │ 0.03 │ 0.06 │
"the"      │ 0.11 │ 0.02 │ 0.01 │ 0.04 │ 0.79 │ 0.03 │
"mat"      │ 0.02 │ 0.09 │ 0.06 │ 0.07 │ 0.02 │ 0.74 │
           └──────┴──────┴──────┴──────┴──────┴──────┘

Each row = attention weights for that word
High diagonal = words attend strongly to themselves
"cat" attends to "mat" (semantic relationship)

ℹ️ Why Scale by sqrt(d_k)?

Without scaling, dot products grow large with d_k, pushing softmax into regions with tiny gradients. Dividing by sqrt(d_k) keeps variance at 1, maintaining stable gradients.

Variance of Unscaled Dot Products

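\text{Var}(q \cdot k) = \text{Var}\left(\sum_{i=1}^{d_k} q_i k_i\right) = d_k \cdot \text{Var}(q_i) \cdot \text{Var}(k_i) = d_k

Here,

$d_k$ = Dimension of keys
$q_i$ = $i$-th component of a query vector $q$ (independent, zero mean, unit variance)
$k_i$ = $i$-th component of a key vector $k$ (independent, zero mean, unit variance)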

Why Scaled Dot-Product?

Without scaling:
  d_k = 512 dimensions
  Q and K are random vectors with mean=0, var=1
  dot(Q, K) = sum(q_i * k_i) for i=1..512
  Var(dot) = d_k * Var(q_i) * Var(k_i) = 512

  Large values -> softmax saturates -> gradients vanish

With scaling (divide by sqrt(d_k) = sqrt(512) = 22.6):
  Var(dot/sqrt(d_k)) = 512 / 512 = 1

  Variance stays at 1 regardless of d_k
  Softmax stays in its "sensitive" region
  Gradients flow properly during backprop

Theorem: Attention Scaling Property

For queries and keys with independent entries having zero mean and unit variance, the variance of the unscaled dot product equals the key dimension d_k. Scaling by 1/sqrt(d_k) normalizes the variance to 1, ensuring softmax operates in a regime with non-vanishing gradients.
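
The scaling property can be checked empirically. The following is a minimal simulation sketch, assuming standard-normal entries; the helper name `dot_variance`, the sample count, and the seed are illustrative choices rather than part of the lesson.

```python
import math
import random

def dot_variance(d_k, n_samples=4000, scale=False, seed=0):
    """Estimate Var(q . k) for random q, k with i.i.d. N(0, 1) entries."""
    rng = random.Random(seed)
    dots = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dot = sum(qi * ki for qi, ki in zip(q, k))
        if scale:
            dot /= math.sqrt(d_k)  # the 1/sqrt(d_k) factor from the attention formula
        dots.append(dot)
    mean = sum(dots) / n_samples
    return sum((x - mean) ** 2 for x in dots) / (n_samples - 1)

unscaled = dot_variance(64)
scaled = dot_variance(64, scale=True)
print(f"unscaled variance ~ {unscaled:.1f} (theory: 64)")
print(f"scaled variance   ~ {scaled:.2f} (theory: 1)")
```

The estimates fluctuate with the seed and sample count, but they should land close to d_k and 1 respectively.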

📝 Computing Attention Weights

Consider a single query q = [1, 0, 1] and a single key k = [1, 0, 1] with d_k = 3:

Dot product: q · k = 1*1 + 0*0 + 1*1 = 2
Scaled: 2 / sqrt(3) ≈ 1.155
Softmax: e^1.155 / e^1.155 = 1.0 (a single key always receives weight 1)
With multiple keys, softmax distributes the weight across them according to their scaled scores
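
The worked example can be reproduced in plain Python. This is a sketch for a single query; the second key `[0, 1, 0]` is a made-up addition to show how the weight is shared once more than one key is present.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(q, keys):
    """Softmax of scaled dot products between one query and a list of keys."""
    d_k = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
    return softmax(scores)

q = [1, 0, 1]
print(attention_weights(q, [[1, 0, 1]]))             # one key: weight 1.0
print(attention_weights(q, [[1, 0, 1], [0, 1, 0]]))  # roughly 0.76 and 0.24
```

The second key is orthogonal to the query, so its scaled score is 0 and it receives the smaller share.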

2. Multi-Head Attention

Instead of a single attention function, run h parallel attention heads and concatenate:

Multi-Head Attention

\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h) W^O

Here,

$\text{head}_i$ = Individual attention head
$W^O$ = Output projection matrix
$h$ = Number of attention heads

Individual Attention Head

\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)

Here,

$W_i^Q$ = Query projection for head i
$W_i^K$ = Key projection for head i
$W_i^V$ = Value projection for head i

Architecture Diagram

Input: X ∈ ℝ^(n × d_model)
  │
  ├─→ Q = XW^Q ──→ Head 1 (Attention) ──→
  ├─→ K = XW^K ──→ Head 2 (Attention) ──→ Concat → Linear → Output
  ├─→ V = XW^V ──→ Head 3 (Attention) ──→
  └─→          ──→ Head 4 (Attention) ──→

Each head: d_k = d_model / h, e.g. 512 / 8 = 64 dimensions for h = 8 (the diagram draws only four heads)
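
To make the d_k = d_model / h split concrete, here is a small sketch on a single token vector. The helpers `split_heads` and `concat_heads` are hypothetical names for illustration; a tensor implementation would do the same thing with reshapes.

```python
def split_heads(vector, num_heads):
    """Split one d_model-dimensional vector into num_heads chunks of size d_k."""
    d_model = len(vector)
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_k = d_model // num_heads
    return [vector[h * d_k:(h + 1) * d_k] for h in range(num_heads)]

def concat_heads(heads):
    """Inverse of split_heads: concatenate per-head chunks back to d_model."""
    return [x for head in heads for x in head]

token = list(range(512))          # stand-in for one projected token vector
heads = split_heads(token, 8)
print(len(heads), len(heads[0]))  # 8 64
assert concat_heads(heads) == token
```

Each head attends using only its own 64-dimensional slice, and concatenation restores the full 512 dimensions before the output projection.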

Why Multiple Heads?

Different heads can specialize in different types of relationships (the pairings below are illustrative, not measured):

Head 1 (syntactic):   "cat" <-> "sat"     (subject-verb)
Head 2 (semantic):    "cat" <-> "mat"     (object)
Head 3 (positional):  "the" <-> "cat"     (determiner)
Head 4 (long-range):  "cat" <-> "the mat" (noun phrase)

Definition: Self-Attention

Self-attention lets each token "attend to" every token in the sequence, itself included, to gather context. Think of it as each word asking: "Which words should I pay attention to?"

💡 Multi-Head Attention Intuition

Multiple heads allow the model to jointly attend to information from different representation subspaces at different positions. Each head can specialize in capturing different types of dependencies (syntactic, semantic, positional).

Step-by-step for "The cat sat on the mat":

1. Create Query, Key, Value vectors for each token:
   Token    Query (what am I looking for?)
   ------   -------------------------------
   The      [0.1, 0.3, -0.2, ...]  "I'm a determiner, looking for a noun"
   cat      [0.5, -0.1, 0.4, ...]  "I'm a noun, looking for my verb"
   sat      [-0.3, 0.6, 0.2, ...]  "I'm a verb, looking for my subject"
   the      [0.1, 0.3, -0.2, ...]  (same as first "The")
   mat      [0.4, 0.2, -0.1, ...]  "I'm a noun, looking for my verb"

2. Compute attention scores (dot product of Q and K):
   score("cat", "sat") = dot(Q_cat, K_sat) = 0.8  (high! subject-verb)
   score("cat", "mat") = dot(Q_cat, K_mat) = 0.6  (medium, object)
   score("cat", "the") = dot(Q_cat, K_the) = 0.1  (low, just a determiner)

3. Scale by sqrt(d_k) and apply softmax:
   attention_weights = softmax(scores / sqrt(d_k))

   "cat" attends to (illustrative weights): sat=0.6, mat=0.3, the=0.05, The=0.05

4. Weighted sum of Values:
   output("cat") = 0.6*V_sat + 0.3*V_mat + 0.05*V_the + 0.05*V_The

   This output encodes "cat" with context from "sat" (its verb)
   and "mat" (its object), while mostly ignoring determiners.
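
Step 4 can be sketched in a few lines. The attention weights are the ones listed in step 3; the 2-dimensional value vectors are invented purely for illustration, since real value vectors are learned and much larger.

```python
# Attention weights for "cat", taken from step 3
weights = {"sat": 0.6, "mat": 0.3, "the": 0.05, "The": 0.05}

# Hypothetical 2-dimensional value vectors, made up for this example
values = {
    "sat": [1.0, 0.0],
    "mat": [0.0, 1.0],
    "the": [0.5, 0.5],
    "The": [0.5, 0.5],
}

def weighted_sum(weights, values):
    """Return sum_j weights[j] * values[j] as a list."""
    dim = len(next(iter(values.values())))
    out = [0.0] * dim
    for token, w in weights.items():
        for d in range(dim):
            out[d] += w * values[token][d]
    return out

output_cat = weighted_sum(weights, values)
print([round(x, 2) for x in output_cat])  # [0.65, 0.35]
```

The output sits closest to the value vector of "sat", which carries the largest weight.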

Attention Matrix Visualization

           The    cat    sat    on    the    mat
        +------+------+------+------+------+------+
   The  | 0.82 | 0.03 | 0.01 | 0.05 | 0.06 | 0.03 |
        +------+------+------+------+------+------+
   cat  | 0.02 | 0.71 | 0.12 | 0.02 | 0.02 | 0.11 |
        +------+------+------+------+------+------+
   sat  | 0.01 | 0.15 | 0.68 | 0.08 | 0.01 | 0.07 |
        +------+------+------+------+------+------+
   on   | 0.03 | 0.01 | 0.09 | 0.78 | 0.03 | 0.06 |
        +------+------+------+------+------+------+
   the  | 0.11 | 0.02 | 0.01 | 0.04 | 0.79 | 0.03 |
        +------+------+------+------+------+------+
   mat  | 0.02 | 0.09 | 0.06 | 0.07 | 0.02 | 0.74 |
        +------+------+------+------+------+------+
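
The matrix can also be inspected programmatically. This sketch copies the table values, checks that each row is a probability distribution, and reports the strongest off-diagonal weight per token; `top_other` is a hypothetical helper name.

```python
tokens = ["The", "cat", "sat", "on", "the", "mat"]
attn = [
    [0.82, 0.03, 0.01, 0.05, 0.06, 0.03],
    [0.02, 0.71, 0.12, 0.02, 0.02, 0.11],
    [0.01, 0.15, 0.68, 0.08, 0.01, 0.07],
    [0.03, 0.01, 0.09, 0.78, 0.03, 0.06],
    [0.11, 0.02, 0.01, 0.04, 0.79, 0.03],
    [0.02, 0.09, 0.06, 0.07, 0.02, 0.74],
]

# Each row is a softmax output, so it must sum to 1
for row in attn:
    assert abs(sum(row) - 1.0) < 1e-9

def top_other(i):
    """Token that row i attends to most, excluding itself."""
    j = max((j for j in range(len(tokens)) if j != i), key=lambda j: attn[i][j])
    return tokens[j], attn[i][j]

for i, tok in enumerate(tokens):
    other, w = top_other(i)
    print(f"{tok!r} -> {other!r} ({w:.2f})")
```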

Reading: Row i, Column j = how much token i attends to token j
Diagonal is high (tokens attend to themselves)
"cat" attends mostly to itself (0.71); its largest off-diagonal weights go to "sat" (0.12) and "mat" (0.11)

import torch
import torch.nn as nn
import math

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        attention_weights = torch.softmax(scores, dim=-1)
        output = torch.matmul(attention_weights, V)
        return output, attention_weights

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        # Linear projections and reshape
        Q = self.W_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # Attention
        output, attn_weights = self.scaled_dot_product_attention(Q, K, V, mask)

        # Concatenate heads
        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        return self.W_o(output)

# Example
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = torch.randn(2, 10, 512)  # batch=2, seq_len=10
output = mha(x, x, x)  # self-attention
print(output.shape)  # torch.Size([2, 10, 512])

3. Transformer Architecture

Encoder Block

Architecture Diagram

Input Embedding + Positional Encoding
         │
         ▼
┌─────────────────────────┐
│  Multi-Head Self-Attn   │──→ Add & LayerNorm ──┐
└─────────────────────────┘                      │
         │                                       │
         ▼                                       │
┌─────────────────────────┐                      │
│  Feed-Forward Network   │──→ Add & LayerNorm ──┘
│ (d_model→d_ff→d_model)  │
└─────────────────────────┘
         │
         ▼
      Output

Decoder Block

Architecture Diagram

Target Embedding + Positional Encoding
         │
         ▼
┌─────────────────────────┐
│  Masked Multi-Head      │──→ Add & LayerNorm
│  Self-Attention         │
└─────────────────────────┘
         │
         ▼
┌─────────────────────────┐
│  Cross-Attention        │──→ Add & LayerNorm
│  (Q from decoder,       │
│   K,V from encoder)     │
└─────────────────────────┘
         │
         ▼
┌─────────────────────────┐
│  Feed-Forward Network   │──→ Add & LayerNorm
└─────────────────────────┘
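
The "masked" self-attention in the decoder stops each position from looking at later positions. Below is a minimal sketch of that causal mask, assuming the convention that 0 means "blocked"; the helper names are illustrative.

```python
import math

def causal_mask(n):
    """mask[i][j] is 1 if position i may attend to position j (j <= i), else 0."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    """Softmax over one row of scores, giving masked positions zero weight."""
    masked = [s if m else float("-inf") for s, m in zip(scores, mask_row)]
    top = max(masked)  # finite, because the diagonal is never masked
    exps = [math.exp(s - top) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

n = 4
mask = causal_mask(n)
scores = [[0.0] * n for _ in range(n)]  # uniform scores make the masking visible
weights = [masked_softmax(scores[i], mask[i]) for i in range(n)]
for row in weights:
    print([round(w, 2) for w in row])
```

With uniform scores, row i spreads its weight evenly over positions 0..i and assigns exactly zero to every later position.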

Positional Encoding

Since Transformers have no recurrence, positional information is injected:

Positional Encoding (Sine)

PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)

Here,

$pos$ = Position in the sequence
$i$ = Index of the sine/cosine dimension pair ($0 \le i < d_{\text{model}}/2$)
$d_{\text{model}}$ = Model dimension

Positional Encoding (Cosine)

PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)

Here,

$pos$ = Position in the sequence
$i$ = Index of the sine/cosine dimension pair ($0 \le i < d_{\text{model}}/2$)
$d_{\text{model}}$ = Model dimension

Definition: Positional Encoding

Positional encoding injects sequence order information into the model by adding sinusoidal functions of different frequencies to the input embeddings. This allows the model to distinguish between different positions in the sequence.

ℹ️ Why Sinusoidal Encoding?

Sinusoidal functions make it easy for the model to attend by relative position: for any fixed offset k, PE(pos+k) is a linear function of PE(pos). The original authors also hypothesized that this may let the model extrapolate to sequence lengths longer than those seen during training.
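
The linearity claim follows from the angle-addition identities: for each sine/cosine pair, shifting by k is a 2x2 rotation that depends on k but not on pos. The sketch below checks this numerically; `pe_pair`, `shift_matrix`, and `apply` are hypothetical helper names.

```python
import math

def pe_pair(pos, i, d_model=512):
    """(sin, cos) positional-encoding pair for position pos and pair index i."""
    angle = pos / (10000 ** (2 * i / d_model))
    return math.sin(angle), math.cos(angle)

def shift_matrix(k, i, d_model=512):
    """2x2 matrix M with PE(pos + k) = M @ PE(pos); it depends on k, not on pos."""
    w = 1.0 / (10000 ** (2 * i / d_model))
    c, s = math.cos(w * k), math.sin(w * k)
    return [[c, s], [-s, c]]

def apply(m, v):
    """Multiply a 2x2 matrix by a 2-vector."""
    return (m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1])

k, i = 7, 3
M = shift_matrix(k, i)
for pos in (0, 5, 100):
    predicted = apply(M, pe_pair(pos, i))
    actual = pe_pair(pos + k, i)
    assert all(abs(p - a) < 1e-9 for p, a in zip(predicted, actual))
print("PE(pos + k) matches M(k) @ PE(pos) for every tested pos")
```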

class PositionalEncoding(nn.Module):
    def __init__(self, d_model=512, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe.unsqueeze(0))

    def forward(self, x):
        seq_len = x.size(1)
        return x + self.pe[:, :seq_len]

# Visualize positional encodings
pe = PositionalEncoding(d_model=128)
positions = pe.pe[0, :100, :].detach().numpy()
# Each row is a positional encoding vector
# Sine/cosine waves of different frequencies
# Similar positions have similar encodings (smooth interpolation)

📝 Positional Encoding Computation

For position pos=5, dimension pair i=0, d_model=512:

Compute the angle: 5 / 10000^(0/512) = 5 / 1 = 5
Sine encoding (dimension 0): sin(5) ≈ -0.959
Cosine encoding (dimension 1): cos(5) ≈ 0.284

For i=1: angle = 5 / 10000^(2/512) ≈ 5 / 1.037 ≈ 4.823
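
The same numbers can be reproduced with a direct, loop-based version of the formula. This is a sketch for a single position, written for readability rather than speed.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal encoding vector for one position (d_model must be even)."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))  # dimension 2i
        pe.append(math.cos(angle))  # dimension 2i + 1
    return pe

pe5 = positional_encoding(5, 512)
print(round(pe5[0], 3), round(pe5[1], 3))  # -0.959 0.284
print(round(5 / 10000 ** (2 / 512), 3))    # 4.823 (angle for i=1)
```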

Full Transformer Encoder

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention with residual connection
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output))

        # FFN with residual connection
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout(ffn_output))
        return x

class TransformerEncoder(nn.Module):
    def __init__(self, vocab_size, d_model=512, num_heads=8, d_ff=2048,
                 num_layers=6, max_len=5000, dropout=0.1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoding = PositionalEncoding(d_model, max_len)
        self.layers = nn.ModuleList([
            TransformerEncoderLayer(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        self.d_model = d_model

    def forward(self, x, mask=None):
        x = self.embedding(x) * math.sqrt(self.d_model)
        x = self.pos_encoding(x)
        for layer in self.layers:
            x = layer(x, mask)
        return x

# Example
encoder = TransformerEncoder(vocab_size=30000, d_model=512)
x = torch.randint(0, 30000, (2, 20))  # batch=2, seq_len=20
output = encoder(x)
print(output.shape)  # torch.Size([2, 20, 512])

4. BERT (Bidirectional Encoder Representations from Transformers)

Architecture

BERT uses only the Transformer encoder (no decoder):

Model	Layers	Hidden	Heads	Parameters
BERT-base	12	768	12	110M
BERT-large	24	1024	16	340M
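
The parameter counts in the table can be approximated from the layer and hidden sizes. This sketch assumes details that are not in the table: the 30,522-token WordPiece vocabulary of the uncased checkpoints, 512 positions, 2 segment types, and a feed-forward size of 4 × hidden. The function name is hypothetical.

```python
def bert_param_count(layers, hidden, vocab=30522, max_pos=512, segments=2):
    """Count parameters of a BERT encoder (embeddings + layers + pooler)."""
    ffn = 4 * hidden                                   # feed-forward inner size
    layer_norm = 2 * hidden                            # scale and bias
    embeddings = (vocab + max_pos + segments) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)         # Q, K, V, output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler

print(f"BERT-base:  {bert_param_count(12, 768) / 1e6:.1f}M")   # 109.5M
print(f"BERT-large: {bert_param_count(24, 1024) / 1e6:.1f}M")  # 335.1M
```

The results are close to the rounded 110M and 340M figures. The number of heads does not appear in the count because the heads split the hidden size rather than adding to it.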

Pre-training Objectives

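Definition: Masked Language Model (MLM)

Randomly select 15% of the input tokens and train the model to predict the originals. Because each prediction can use context on both sides of the selected token, BERT learns bidirectional representations.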

Input:    The [MASK] sat on the mat
Target:   The  cat  sat on the mat

Masking strategy (applied to the selected 15% of tokens):
- 80% → replace with [MASK]
- 10% → replace with a random token
- 10% → keep original

ℹ️ Why 80/10/10 Masking?

The 80/10/10 strategy prevents the model from over-relying on the [MASK] token. Since [MASK] never appears during fine-tuning, always masking would create a mismatch between pre-training and fine-tuning.
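
The masking procedure can be sketched as follows. This is a simplified illustration, not BERT's actual data pipeline: it selects each token independently with probability 0.15, works on whole words, and uses a toy vocabulary. The function name and the `None` label convention are assumptions.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Apply BERT-style masking; returns (corrupted tokens, labels).

    labels[i] holds the original token at selected positions and None elsewhere.
    """
    rng = random.Random(seed)
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                          # not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"           # 80%: mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: random token
        # remaining 10%: keep the original token
    return corrupted, labels

tokens = "the cat sat on the mat".split()
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
corrupted, labels = mask_tokens(tokens, vocab, seed=0)
print(corrupted)
print(labels)
```

Only positions with a non-`None` label contribute to the MLM loss, including the ones whose token was left unchanged.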

Next Sentence Prediction (NSP): Predict whether sentence B follows sentence A.

Input:  [CLS] The cat sat on the mat [SEP] It was soft [SEP]
Label:  IsNext (1)

Input:  [CLS] The cat sat on the mat [SEP] Stocks rose [SEP]
Label:  NotNext (0)
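
The sentence-pair input can be assembled like this. It is a sketch of the layout only, using whitespace-split words in place of WordPiece tokens; `build_pair_input` is a hypothetical helper.

```python
def build_pair_input(tokens_a, tokens_b):
    """Return (tokens, segment_ids) for a sentence pair in BERT's input format."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 covers the rest
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segment_ids = build_pair_input(
    "the cat sat on the mat".split(), "it was soft".split()
)
print(tokens)
print(segment_ids)
```

The segment ids tell the model which sentence each token belongs to, and the NSP classifier reads its prediction from the [CLS] position.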

Theorem: BERT Pre-training Objective

BERT optimizes a combined loss: L = L_MLM + L_NSP, where L_MLM is the cross-entropy loss for masked token prediction and L_NSP is the binary cross-entropy loss for next sentence prediction. MLM supplies the bidirectional token-level context, while NSP adds a signal about how sentence pairs relate.
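
As a numeric illustration of the combined objective, the sketch below adds a masked-token loss and a next-sentence loss for one example. All probabilities are invented, and averaging the MLM loss over masked positions is an assumption of this sketch.

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target index under a probability list."""
    return -math.log(probs[target])

# Hypothetical model outputs for one training example
mlm_probs = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]]  # one distribution per masked position
mlm_targets = [1, 0]                            # index of the true token at each position
nsp_probs = [0.2, 0.8]                          # [NotNext, IsNext]
nsp_target = 1                                  # IsNext

loss_mlm = sum(cross_entropy(p, t) for p, t in zip(mlm_probs, mlm_targets)) / len(mlm_targets)
loss_nsp = cross_entropy(nsp_probs, nsp_target)
loss = loss_mlm + loss_nsp
print(f"L_MLM={loss_mlm:.3f}  L_NSP={loss_nsp:.3f}  L={loss:.3f}")
```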

Using Pre-trained BERT

from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Input
text = "BERT learns contextual word representations."
inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)

# Forward pass
with torch.no_grad():
    outputs = model(**inputs)

# Access different outputs
last_hidden = outputs.last_hidden_state     # (batch, seq_len, 768)
pooler_output = outputs.pooler_output       # (batch, 768): [CLS] state after a dense + tanh layer

print(f"Token embeddings shape: {last_hidden.shape}")
print(f"Pooled [CLS] output shape: {pooler_output.shape}")

# Inspect the WordPiece tokens (without the [CLS]/[SEP] that the call above adds)
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)
# e.g. ['bert', 'learns', 'contextual', 'word', 'representations', '.']
# (exact splits depend on the WordPiece vocabulary)
# Each position in last_hidden, including [CLS] and [SEP], has a 768-dim contextual embedding

📝 BERT Token Embeddings

For input "The cat sat", BERT produces:

Token embeddings: [The], [cat], [sat] each with shape (768,)
Position embeddings: [pos0], [pos1], [pos2]
Segment embeddings: [sentence A] for all tokens

Combined: embedding = token + position + segment

Output: (3, 768) tensor with contextual representations (not counting the [CLS] and [SEP] tokens that the tokenizer adds)

5. Fine-tuning BERT for Classification

import torch
from torch.optim import AdamW  # the AdamW class in transformers is deprecated
from torch.utils.data import DataLoader, Dataset
from transformers import BertForSequenceClassification, BertTokenizer

class TextClassificationDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoding = self.tokenizer(
            self.texts[idx],
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].squeeze(),
            'attention_mask': encoding['attention_mask'].squeeze(),
            'label': torch.tensor(self.labels[idx])
        }

# Model
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2,  # binary classification
)

# Freeze BERT layers (optional — for small datasets)
for param in model.bert.parameters():
    param.requires_grad = False

# Only train classifier
for param in model.classifier.parameters():
    param.requires_grad = True

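# Optimizer: pass only the parameters that still require gradients
optimizer = AdamW(
    [p for p in model.parameters() if p.requires_grad],
    lr=2e-5, weight_decay=0.01,
)

# Data loader. train_texts and train_labels are placeholders for your own
# lists of strings and integer labels; batch_size=16 is an arbitrary choice.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
train_dataset = TextClassificationDataset(train_texts, train_labels, tokenizer)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

# Training loop
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)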

for epoch in range(3):
    model.train()
    total_loss = 0
    for batch in train_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['label'].to(device)

        optimizer.zero_grad()
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    avg_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch+1} | Loss: {avg_loss:.4f}")

💡 Fine-tuning Best Practices

Use a small learning rate (2e-5 to 5e-5) when fine-tuning BERT. Larger learning rates can cause catastrophic forgetting of pre-trained knowledge. For small datasets, freeze BERT layers and only train the classifier head.

6. Other BERT Variants

RoBERTa (Robustly Optimized BERT)

Removes NSP objective
Uses larger batches and more data
Dynamic masking (different mask each epoch)

ALBERT (A Lite BERT)

Factorized embedding parameterization: $V \times H \rightarrow V \times E + E \times H$
Cross-layer parameter sharing (all layers share weights)
ALBERT-base has about 12M parameters (vs. 110M for BERT-base); larger ALBERT configurations are used to match or exceed BERT's accuracy
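
The saving from the factorized embedding is easy to quantify. The sizes below (V = 30,000, H = 768, E = 128) are illustrative values chosen for this sketch, and `embedding_params` is a hypothetical helper.

```python
def embedding_params(vocab_size, hidden, embed=None):
    """Parameter count of the token-embedding block, with optional factorization."""
    if embed is None:
        return vocab_size * hidden              # V x H
    return vocab_size * embed + embed * hidden  # V x E + E x H

V, H, E = 30000, 768, 128  # illustrative sizes
full = embedding_params(V, H)
factorized = embedding_params(V, H, E)
print(f"V x H         = {full:,}")        # 23,040,000
print(f"V x E + E x H = {factorized:,}")  # 3,938,304
print(f"reduction     = {full / factorized:.1f}x")
```

The benefit grows with the gap between E and H, because the vocabulary-sized matrix now scales with E instead of H.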

DistilBERT

Knowledge distillation from BERT
6 layers, 66M parameters (40% smaller, 60% faster)
97% of BERT's performance

DeBERTa

Disentangled attention: separate content and position embeddings
Enhanced mask decoder
Surpassed the human baseline on the SuperGLUE benchmark

7. Visualization: How Attention Captures Syntax

Sentence: "The cat that the dog chased was brown"

Attention Head (layer 6, head 3):
               The   cat   that  the   dog   chased was    brown
           ┌───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┐
"cat"      │  0.45 │  0.02 │  0.12 │  0.03 │  0.08 │  0.25 │  0.02 │  0.03 │
"chased"   │  0.02 │  0.38 │  0.03 │  0.02 │  0.31 │  0.04 │  0.12 │  0.08 │
"brown"    │  0.01 │  0.02 │  0.01 │  0.01 │  0.02 │  0.01 │  0.52 │  0.40 │
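           └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘

"cat" attends to "chased" (0.25), the verb of its relative clause, though most of its weight goes to "The" (0.45)
"chased" attends to "cat" (0.38, its object) and "dog" (0.31, its subject)
"brown" attends to "was" (0.52), the copula that links it to the subject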

8. Key Takeaways

📋 Summary: Transformers + BERT

Self-attention computes relationships between all tokens in $O(n^2 \cdot d)$ time
Multi-head attention captures different types of relationships in parallel
Positional encoding injects sequence order information using sinusoidal functions
BERT pre-trains with MLM and NSP, then fine-tunes for downstream tasks
Fine-tuning with a small learning rate ($2 \times 10^{-5}$ to $5 \times 10^{-5}$) typically works best
Domain-specific BERT models (BioBERT, SciBERT, FinBERT) often outperform general BERT on tasks in their domain
The Transformer's parallelization advantage over RNNs enables training on massive datasets
Attention weights offer a partial view into model behavior, but they are not a complete explanation of its decisions

9. Practice Exercises

Exercise 1: Implement Attention from Scratch

# TODO: Implement scaled dot-product attention
# Test with Q=K=V (self-attention) and Q≠K (cross-attention)
# Verify output shapes and attention weight properties

Exercise 2: Fine-tune BERT for NER

# TODO: Fine-tune BERT for Named Entity Recognition
# Use CoNLL-2003 dataset
# Target: F1 > 90% on entity recognition

Exercise 3: Compare Transformer Variants

# TODO: Compare BERT-base, RoBERTa, and DistilBERT on:
# 1. Training time
# 2. Inference speed
# 3. Accuracy on GLUE benchmark
# 4. Model size

Exercise 4: Attention Visualization

# TODO: Extract and visualize attention weights
# Show which tokens attend to which
# Identify syntactic patterns (subject-verb, adjective-noun)
