From RNNs to Transformers
The Limitations of Recurrent Architectures
Recurrent Neural Networks (RNNs) and their variants (LSTM, GRU) processed sequences step-by-step, maintaining a hidden state that accumulated information over time. While effective for many tasks, they suffered from fundamental limitations:
Sequential Bottleneck: RNNs process tokens one at a time, which prevents parallelization across time steps. For a sequence of length $n$, a forward pass requires $O(n)$ sequential operations, limiting how well GPUs can be utilized.
Long-Range Dependencies: Despite gating mechanisms, RNNs struggled to maintain information over long distances. The gradient signal must propagate through every intermediate step, leading to vanishing or exploding gradients.
Information Bottleneck: The fixed-size hidden state must compress all relevant information from the entire sequence, creating a capacity bottleneck.
The Attention Revolution
The Transformer architecture, introduced in "Attention Is All You Need" (Vaswani et al., 2017), eliminated recurrence entirely. The key insight: use self-attention to model relationships between all positions simultaneously, enabling full parallelization and direct connections across arbitrary distances.
The Transformer needs only $O(1)$ sequential operations per layer (fully parallelizable across positions) at $O(n^2 \cdot d)$ compute complexity, trading sequential depth for quadratic attention computation, a favorable tradeoff on modern hardware.
| RNNs | Transformers |
|---|---|
| Sequential: $O(n)$ steps | Parallel: $O(1)$ steps |
| Hidden state bottleneck | Full pairwise attention |
| $O(n \cdot d^2)$ compute | $O(n^2 \cdot d)$ compute |
| Struggles with long range | Direct connections across all distances |
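To make the compute tradeoff concrete, here is a small sketch that compares rough per-layer operation counts. It assumes the usual estimates of about $n \cdot d^2$ multiply-adds for a recurrent layer and $n^2 \cdot d$ for self-attention; the function names are illustrative, and constant factors and projection costs are ignored.

```python
def recurrent_layer_cost(n: int, d: int) -> int:
    """Rough multiply-adds for a recurrent layer: n steps, each a d x d matrix-vector product."""
    return n * d * d


def self_attention_layer_cost(n: int, d: int) -> int:
    """Rough multiply-adds for self-attention: n x n scores over d-dimensional vectors."""
    return n * n * d


d = 512
for n in (128, 512, 4096):
    # Self-attention is cheaper while n < d and more expensive once n > d.
    print(n, recurrent_layer_cost(n, d), self_attention_layer_cost(n, d))
```

Under these assumptions the two costs cross at $n = d$, which is why the quadratic term only dominates for sequences longer than the model width.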
Self-Attention Mechanism
Query, Key, Value Framework
Self-attention computes a weighted sum of all positions in a sequence, where the weights are determined by the compatibility (dot product) between positions. Each input token is projected into three vectors:
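$$q_i = x_i W^Q, \qquad k_i = x_i W^K, \qquad v_i = x_i W^V$$
where $W^Q, W^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$ and $W^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$ are learned projection matrices.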
Scaled Dot-Product Attention
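Given a sequence of $n$ tokens with embeddings $X \in \mathbb{R}^{n \times d_{\text{model}}}$, we compute the full attention operation as:
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$
where:
- $Q = XW^Q \in \mathbb{R}^{n \times d_k}$: Query matrix
- $K = XW^K \in \mathbb{R}^{n \times d_k}$: Key matrix
- $V = XW^V \in \mathbb{R}^{n \times d_v}$: Value matrix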
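Why scale by $\sqrt{d_k}$? When $d_k$ is large, the dot products tend to have large magnitudes, pushing the softmax into regions with extremely small gradients. Scaling by $\sqrt{d_k}$ (the standard deviation of the dot product when the components of $q$ and $k$ are independent with zero mean and unit variance, as is roughly the case at random initialization) keeps the softmax in a regime with useful gradients:
$$\mathrm{Var}(q \cdot k) = \sum_{i=1}^{d_k} \mathrm{Var}(q_i k_i) = d_k \quad\Rightarrow\quad \mathrm{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = 1$$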
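The scaling argument can be checked empirically. The sketch below assumes query and key components drawn independently from a standard normal distribution (an idealization of random initialization) and estimates the standard deviation of their dot product; `dot_product_std` is an illustrative helper, not part of any library.

```python
import math
import random


def dot_product_std(d_k: int, trials: int = 1000, seed: int = 0) -> float:
    """Estimate the standard deviation of q . k for random unit-variance vectors."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        samples.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(samples) / trials
    variance = sum((s - mean) ** 2 for s in samples) / trials
    return math.sqrt(variance)


for d_k in (16, 64, 256):
    raw = dot_product_std(d_k)
    # The raw spread grows like sqrt(d_k); dividing by sqrt(d_k) brings it back near 1.
    print(d_k, round(raw, 2), round(raw / math.sqrt(d_k), 2))
```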
Attention as Soft Retrieval
Self-attention can be interpreted as a soft content-based retrieval system:
- Queries represent what each token is "looking for"
- Keys represent what each token "offers"
- Values represent the actual information carried by each token
- The attention weights $\alpha_{ij}$ determine how much information to retrieve from position $j$ for position $i$
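The retrieval view can be sketched in a few lines of plain Python. This is a minimal, unbatched illustration of scaled dot-product attention on nested lists, assuming scores are dot products divided by the square root of the key dimension; real implementations use a tensor library, and the toy vectors are made up.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Return (outputs, weights) for scaled dot-product attention."""
    d_k = len(K[0])
    weights = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[c] for w, v in zip(row, V)) for c in range(len(V[0]))]
        for row in weights
    ]
    return outputs, weights


# One query that matches the first and third keys far better than the second.
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
Q = [[4.0, 0.0]]
out, w = attention(Q, K, V)
print(w[0])    # most weight on positions 0 and 2
print(out[0])  # a blend of the corresponding values
```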
Computational Complexity
The full attention computation requires:
- Matrix multiplication $QK^\top$: $O(n^2 \cdot d_k)$
- Softmax: $O(n^2)$
- Weighted sum of values: $O(n^2 \cdot d_v)$
- Total: $O(n^2 \cdot d)$
This quadratic complexity in sequence length is the primary limitation of standard Transformers for very long sequences.
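The quadratic growth also shows up in memory. As a rough sketch, assuming the full $n \times n$ weight matrix is materialized for every head in 32-bit floats (memory-efficient kernels avoid this), the footprint can be estimated as follows; the helper name is illustrative.

```python
def attention_matrix_bytes(n: int, num_heads: int = 1, bytes_per_value: int = 4) -> int:
    """Bytes needed to store one n x n attention matrix per head."""
    return n * n * num_heads * bytes_per_value


for n in (512, 2048, 8192):
    mib = attention_matrix_bytes(n, num_heads=12) / 2**20
    # Each 4x increase in sequence length costs 16x the memory.
    print(n, f"{mib:.0f} MiB")
```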
Multi-Head Attention
Parallel Attention Heads
Multi-Head Attention runs multiple attention operations in parallel, allowing the model to attend to different types of relationships simultaneously:
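$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\, W^O$$
$$\text{head}_i = \text{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)$$
where $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$, and $W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}$.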
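A minimal sketch of the multi-head computation, in plain Python for readability: each head has its own query, key, and value projections, the head outputs are concatenated along the feature dimension, and an output projection mixes them. The random weights, sizes, and helper names are assumptions for illustration only.

```python
import math
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(M):
    return [list(col) for col in zip(*M)]


def softmax(row):
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)


def multi_head_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) projection triples, one per head."""
    head_outputs = [
        attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Concatenate the per-head outputs for each position, then project.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(X))]
    return matmul(concat, W_o)


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n, d_model, h = 4, 8, 2
d_k = d_model // h  # each head works in a lower-dimensional subspace
X = random_matrix(n, d_model, rng)
heads = [tuple(random_matrix(d_model, d_k, rng) for _ in range(3)) for _ in range(h)]
W_o = random_matrix(h * d_k, d_model, rng)
Y = multi_head_attention(X, heads, W_o)
print(len(Y), len(Y[0]))  # output keeps the input shape (n, d_model)
```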
Why Multiple Heads Work
Different heads can specialize in different linguistic phenomena, for example:
- Head 1 may attend to syntactic dependencies (subject-verb agreement)
- Head 2 may capture semantic relationships (modifier-modified)
- Head 3 may track positional patterns (adjacent tokens)
- Head 4 may resolve coreference (pronoun-antecedent)
With $h$ heads each using $d_k = d_{\text{model}} / h$ dimensions, the total compute is roughly that of a single head with full dimensionality, while each head remains free to learn its own attention pattern.
Attention Head Visualization
In practice, attention patterns can be visualized as heatmaps where each row represents a query token and each column represents a key token.
Positional Encoding
The Need for Position Information
Self-attention is permutation-equivariant: it treats the input as a set, not a sequence. Without positional information, "The cat sat" and "sat cat The" would produce the same representation for each token, merely in a different order. Positional encodings inject sequence order information.
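The equivariance claim is easy to verify on a toy input. The sketch below assumes identity projections (queries, keys, and values are all the raw embeddings) and made-up three-dimensional vectors for the three words; reversing the input order simply reverses the output order.

```python
import math


def self_attention(X):
    """Self-attention with Q = K = V = X and no positional encoding."""
    d = len(X[0])
    outputs = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        outputs.append(
            [sum(e / total * v[c] for e, v in zip(exps, X)) for c in range(d)]
        )
    return outputs


# Hypothetical embeddings for "The", "cat", "sat".
the, cat, sat = [1.0, 0.0, 0.5], [0.0, 1.0, 0.2], [0.3, 0.3, 1.0]
original = self_attention([the, cat, sat])
shuffled = self_attention([sat, cat, the])

# The representation of "The" is the same in both orders, just at a different index.
print(original[0])
print(shuffled[2])
```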
Sinusoidal Positional Encoding
The original Transformer uses fixed sinusoidal functions:
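$$PE_{(pos,\, 2i)} = \sin\!\left(\frac{pos}{10000^{2i / d_{\text{model}}}}\right), \qquad PE_{(pos,\, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i / d_{\text{model}}}}\right)$$
where $pos$ is the position index and $i$ is the dimension index.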
Key Properties:
- Bounded values: $PE_{(pos,\, j)} \in [-1, 1]$ for all positions and dimensions
- Relative positions learnable: $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$ (a rotation matrix in each frequency pair)
- Unique encoding per position: Each position has a distinct encoding vector
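- Generalization to unseen lengths: The encoding is defined for every position, so the model may extrapolate to sequences longer than those seen in training (a hypothesis of the original paper rather than a guarantee)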
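The properties above can be checked with a short sketch. It assumes the standard formulation with sine on even dimensions and cosine on odd dimensions, base 10000, and an even model dimension; `shift_encoding` is an illustrative helper showing that the encoding of position `pos + k` is a fixed rotation of the encoding of `pos`.

```python
import math


def sinusoidal_encoding(pos: int, d_model: int):
    """Encoding vector for one position; sin on even indices, cos on odd indices."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe


def shift_encoding(pe, k: int, d_model: int):
    """Rotate each (sin, cos) pair by k steps, giving the encoding of pos + k."""
    shifted = []
    for i in range(d_model // 2):
        theta = k / (10000 ** (2 * i / d_model))
        s, c = pe[2 * i], pe[2 * i + 1]
        shifted.extend([
            s * math.cos(theta) + c * math.sin(theta),  # sin(a + b)
            c * math.cos(theta) - s * math.sin(theta),  # cos(a + b)
        ])
    return shifted


d_model = 16
direct = sinusoidal_encoding(8, d_model)
rotated = shift_encoding(sinusoidal_encoding(5, d_model), 3, d_model)
print(max(abs(a - b) for a, b in zip(direct, rotated)))  # numerically zero
```

The rotation depends only on the offset `k`, not on the absolute position, which is what makes relative offsets linearly recoverable.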
Alternative Positional Encodings
Learned Positional Embeddings: BERT and GPT use learned position embeddings stored in a lookup table. This is simpler but limits generalization to unseen sequence lengths.
Rotary Position Embeddings (RoPE): Encodes positions by rotating query and key vectors in 2D planes, enabling relative position awareness through dot products. Used in modern LLMs like LLaMA.
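A minimal sketch of the rotary idea: each adjacent pair of dimensions is rotated by an angle proportional to the token position, so the dot product between a rotated query and a rotated key depends only on their relative offset. Pairing adjacent dimensions and the base of 10000 are assumptions here; libraries differ in how they pair dimensions.

```python
import math


def rope(vec, pos: int, base: float = 10000.0):
    """Rotate consecutive dimension pairs of vec by position-dependent angles."""
    d = len(vec)
    rotated = []
    for i in range(d // 2):
        theta = pos * base ** (-2 * i / d)
        x, y = vec[2 * i], vec[2 * i + 1]
        rotated.extend([
            x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta),
        ])
    return rotated


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.5, -1.0, 2.0, 0.3]
k = [1.5, 0.2, -0.7, 1.0]

# Same relative offset (4) at very different absolute positions.
score_near = dot(rope(q, 7), rope(k, 3))
score_far = dot(rope(q, 104), rope(k, 100))
print(score_near, score_far)
```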
ALiBi (Attention with Linear Biases): Adds linear bias terms to attention scores based on relative distance, without explicit positional encoding.
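As a sketch of the ALiBi bias, the functions below build the per-head penalty that is added to the attention scores before the softmax, assuming a causal setting and the geometric slope schedule described in the ALiBi paper for power-of-two head counts. The function names are illustrative.

```python
def alibi_slopes(num_heads: int):
    """Geometric slopes 2^(-8/h), 2^(-16/h), ... (assumes num_heads is a power of two)."""
    return [2 ** (-8 * (head + 1) / num_heads) for head in range(num_heads)]


def alibi_bias(n: int, slope: float):
    """Bias matrix: -slope * distance for past keys, -inf for future keys."""
    return [
        [-slope * (i - j) if j <= i else float("-inf") for j in range(n)]
        for i in range(n)
    ]


slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])
for row in bias:
    print(row)  # the penalty grows linearly with distance to the key
```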
Transformer Encoder Architecture
Encoder Block
Each Transformer encoder block consists of two sub-layers, multi-head self-attention and a position-wise feed-forward network, each wrapped in a residual connection followed by layer normalization.
Mathematical Formulation
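Sub-layer 1: Multi-Head Self-Attention
$$Z = \text{LayerNorm}\big(X + \text{MultiHead}(X, X, X)\big)$$
Sub-layer 2: Position-wise Feed-Forward Network
$$\text{FFN}(z) = \max(0,\, zW_1 + b_1)\, W_2 + b_2, \qquad Y = \text{LayerNorm}\big(Z + \text{FFN}(Z)\big)$$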
The FFN is applied independently to each position but shares the same parameters across positions — hence "position-wise."
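The "position-wise" property can be illustrated with a small sketch: one set of weights is applied to every position separately, so changing one token never affects the output at another position. The ReLU activation follows the original Transformer (BERT uses GELU instead), and the tiny weight matrices are made up.

```python
def linear(x, W, b):
    """Compute x @ W + b for one vector x; W is stored as a list of rows."""
    return [sum(xi * w for xi, w in zip(x, col)) + bj for col, bj in zip(zip(*W), b)]


def ffn(x, W1, b1, W2, b2):
    """Two-layer feed-forward network with ReLU for a single position."""
    hidden = [max(0.0, h) for h in linear(x, W1, b1)]
    return linear(hidden, W2, b2)


def apply_position_wise(X, W1, b1, W2, b2):
    """Apply the same FFN (shared parameters) to each position independently."""
    return [ffn(x, W1, b1, W2, b2) for x in X]


W1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -0.5]]    # d_model=2 -> d_ff=3
b1 = [0.0, 0.0, 0.1]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # d_ff=3 -> d_model=2
b2 = [0.0, 0.0]

out_a = apply_position_wise([[1.0, 2.0], [3.0, 4.0]], W1, b1, W2, b2)
out_b = apply_position_wise([[1.0, 2.0], [-5.0, 0.0]], W1, b1, W2, b2)
print(out_a[0], out_b[0])  # identical: position 0 ignores what sits at position 1
```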
Layer Normalization
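$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$
where $\mu$ and $\sigma^2$ are computed across the feature dimension for each token, and $\gamma$, $\beta$ are learned scale and shift parameters.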
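A minimal sketch of layer normalization for a single token vector, assuming the standard form with a small epsilon for numerical stability; the default `eps` value is an assumption, and frameworks expose it as a parameter.

```python
import math


def layer_norm(x, gamma, beta, eps: float = 1e-5):
    """Normalize one token's features to zero mean and unit variance, then scale and shift."""
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    return [
        g * (v - mean) / math.sqrt(variance + eps) + b
        for v, g, b in zip(x, gamma, beta)
    ]


x = [1.0, 2.0, 3.0, 4.0]
normalized = layer_norm(x, gamma=[1.0] * 4, beta=[0.0] * 4)
print(normalized)
```

Because the statistics are computed per token, the result does not depend on batch size or on other positions in the sequence.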
Transformer Decoder Architecture
Decoder Block
The decoder block has three sub-layers: masked self-attention, encoder-decoder cross-attention over the encoder output, and a position-wise feed-forward network, each with a residual connection and layer normalization.
Causal Masking
For autoregressive generation, the decoder uses a causal mask to prevent attending to future positions:
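$$M_{ij} = \begin{cases} 0 & \text{if } j \le i \\ -\infty & \text{if } j > i \end{cases}$$
Applied before softmax: $\text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)$ ensures position $i$ only attends to positions $j \le i$.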
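The masking step can be sketched as follows: an additive mask of zeros and negative infinity is applied to the raw scores, so that future positions receive exactly zero weight after the softmax. The helper names and toy scores are illustrative.

```python
import math


def causal_mask(n: int):
    """0 where key j is at or before query i, -inf where j lies in the future."""
    return [[0.0 if j <= i else float("-inf") for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask):
    """Add the mask to the scores, then apply a row-wise softmax."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        masked = [s + m for s, m in zip(score_row, mask_row)]
        top = max(masked)  # finite, because the diagonal is never masked
        exps = [math.exp(v - top) for v in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


scores = [[1.0, 2.0, 3.0]] * 3
weights = masked_softmax(scores, causal_mask(3))
for row in weights:
    print([round(w, 3) for w in row])  # the upper triangle is all zeros
```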
BERT: Bidirectional Encoder Representations from Transformers
Architecture Overview
BERT (Devlin et al., 2019) uses only the Transformer encoder stack, enabling bidirectional context understanding — unlike GPT which is left-to-right only.
BERT Model Variants
| Model | Layers ($L$) | Hidden ($H$) | Heads ($A$) | Parameters |
|---|---|---|---|---|
| BERT-base | 12 | 768 | 12 | 110M |
| BERT-large | 24 | 1024 | 16 | 340M |
Pre-Training Objectives
BERT is pre-trained on two self-supervised tasks:
1. Masked Language Modeling (MLM)
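Randomly select 15% of input tokens and train the model to predict the original tokens at those positions:
$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log P(x_i \mid \tilde{x})$$
where $\mathcal{M}$ is the set of selected positions and $\tilde{x}$ is the corrupted input sequence.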
Of the selected tokens:
- 80% are replaced with [MASK]
- 10% are replaced with a random token
- 10% are left unchanged
Always substituting [MASK] would create a mismatch between pre-training and fine-tuning, because [MASK] never appears in downstream inputs. The 80/10/10 strategy mitigates this.
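The corruption procedure can be sketched in plain Python. This is a simplified illustration operating on string tokens: it assumes every token is eligible for selection (real preprocessing skips special tokens such as [CLS] and [SEP]), and `mask_tokens` and the toy vocabulary are made up for the example.

```python
import random


def mask_tokens(tokens, vocab, rng, select_prob: float = 0.15):
    """Return (corrupted_inputs, labels); labels are None where no prediction is required."""
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = "[MASK]"           # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return inputs, labels


rng = random.Random(0)
vocab = ["the", "cat", "sat", "on", "mat", "dog"]
tokens = ["the", "cat", "sat", "on", "the", "mat"]
corrupted, labels = mask_tokens(tokens, vocab, rng)
print(corrupted)
print(labels)
```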
2. Next Sentence Prediction (NSP)
Given sentence pairs $(A, B)$, predict whether $B$ actually follows $A$ in the corpus, using a binary classifier on the final [CLS] representation.
BERT Input Representation
The input to BERT is the sum of three embeddings:
$$E_{\text{input}} = E_{\text{token}} + E_{\text{position}} + E_{\text{segment}}$$
where:
- Token Embedding: WordPiece tokenization (max 512 tokens)
- Position Embedding: Learned (max 512 positions)
- Segment Embedding: Distinguishes sentence A vs B (for NSP)
Special tokens: [CLS] (classification), [SEP] (separator), [MASK] (masked), [PAD] (padding)
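To show how the three index sequences line up for a sentence pair, here is a small sketch that assembles the token, segment, and position ids before embedding lookup. It works on already-tokenized strings; `build_bert_inputs` is an illustrative helper, not a library function.

```python
def build_bert_inputs(tokens_a, tokens_b):
    """Assemble [CLS] A [SEP] B [SEP] with matching segment and position ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


tokens, segment_ids, position_ids = build_bert_inputs(["the", "cat"], ["it", "sat"])
for row in zip(tokens, segment_ids, position_ids):
    print(row)
```

Each of the three id sequences indexes its own embedding table, and the looked-up vectors are summed position by position.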
GPT: Autoregressive Language Modeling
GPT Architecture
GPT (Radford et al., 2018) uses only the Transformer decoder stack, training autoregressively to predict the next token:
$$\mathcal{L}_{\text{LM}} = -\sum_{t=1}^{T} \log P(x_t \mid x_1, \dots, x_{t-1}; \theta)$$
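The next-token objective amounts to a shifted cross-entropy: the prediction made at position $t$ is scored against the token at position $t+1$. The sketch below assumes the model's output distributions are already given as lists of probabilities, and averages rather than sums the per-token losses; the numbers are made up.

```python
import math


def next_token_nll(token_ids, probs):
    """Average negative log-likelihood of each token given its prefix.

    probs[t] is the predicted distribution over the vocabulary for the token
    that follows token_ids[t], so it is scored against token_ids[t + 1].
    """
    targets = token_ids[1:]
    log_likelihood = sum(math.log(p[target]) for p, target in zip(probs, targets))
    return -log_likelihood / len(targets)


token_ids = [0, 2, 1]
probs = [
    [0.1, 0.2, 0.7],    # after token 0, the true next token is 2
    [0.25, 0.5, 0.25],  # after token 2, the true next token is 1
]
loss = next_token_nll(token_ids, probs)
print(loss)
```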
BERT vs GPT: Key Differences
| Aspect | BERT | GPT |
|---|---|---|
| Architecture | Encoder only | Decoder only |
| Attention | Bidirectional | Unidirectional (causal) |
| Pre-training | MLM + NSP | Next token prediction |
| Fine-tuning | Classification, QA, NER | Generation, classification |
| Best for | Understanding tasks | Generation tasks |
BERT Variants and Evolution
- RoBERTa (Liu et al., 2019): Removed NSP, more data, larger batches
- ALBERT (Lan et al., 2020): Parameter sharing, factorized embeddings
- DistilBERT (Sanh et al., 2019): Knowledge distillation, about 40% fewer parameters while retaining about 97% of BERT's performance
- DeBERTa (He et al., 2021): Disentangled attention, enhanced mask decoder
Fine-Tuning Pretrained Models
Transfer Learning Paradigm
The modern NLP paradigm follows a two-stage approach:
- Pre-training: Learn general language representations on large corpora (self-supervised)
- Fine-tuning: Adapt to specific downstream tasks (supervised)
Task-Specific Heads
Different tasks require different output layers:
| Task | Input | Output | Head |
|---|---|---|---|
| Sentiment | [CLS] token | Binary/multi-class | Linear + softmax |
| NER | All tokens | Per-token labels | Linear (optionally + CRF) |
| QA | Passage + Question | Start/end span | Two linear layers |
| Similarity | Two [CLS] | Similarity score | Cosine / MLP |
| Generation | All tokens | Next token distribution | Language head |
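For the QA row, the two linear layers produce a start logit and an end logit per token, and the answer span is chosen from them. A minimal sketch of that decoding step, assuming the best span maximizes the sum of both logits subject to the end not preceding the start; `best_span` and its `max_len` cap are illustrative choices.

```python
def best_span(start_logits, end_logits, max_len: int = 30):
    """Return (start, end) maximizing start_logit + end_logit with start <= end."""
    best = (0, 0)
    best_score = float("-inf")
    for start, start_logit in enumerate(start_logits):
        last = min(len(end_logits), start + max_len)
        for end in range(start, last):
            score = start_logit + end_logits[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best


start_logits = [0.1, 5.0, 0.2]
end_logits = [6.0, 0.3, 4.0]
# Taking each argmax independently would give start=1, end=0, which is not a valid span.
print(best_span(start_logits, end_logits))
```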
Hugging Face Transformers Implementation
Installation
pip install transformers datasets accelerate evaluate seqeval
Basic Usage
from transformers import AutoTokenizer, AutoModel
# Load pretrained BERT
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
# Tokenize input
text = "The cat sat on the mat"
inputs = tokenizer(text, return_tensors="pt")
# Forward pass
outputs = model(**inputs)
# outputs.last_hidden_state: (batch, seq_len, hidden_dim)
# outputs.pooler_output: (batch, hidden_dim), the [CLS] state after a dense + tanh layer
Fine-Tuning for Text Classification
import numpy as np
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Load dataset
dataset = load_dataset("imdb")

# Load tokenizer and model with classification head
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
)

# Tokenize
def tokenize_fn(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=512,
    )

tokenized = dataset.map(tokenize_fn, batched=True)

# Accuracy metric; the Trainer passes (logits, labels) for the evaluation set
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}

# Training arguments
args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy="epoch",  # named `evaluation_strategy` in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Trainer
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    compute_metrics=compute_metrics,
)

# Fine-tune
trainer.train()
Fine-Tuning for Named Entity Recognition
import evaluate
import numpy as np
from datasets import load_dataset
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Load CoNLL-2003 (assumed to be available as "conll2003" on the Hugging Face Hub;
# the identifier may differ across `datasets` versions)
ner_dataset = load_dataset("conll2003")
label_names = ner_dataset["train"].features["ner_tags"].feature.names
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Load model
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(label_names),  # 9 CoNLL-2003 NER tags
)

# Tokenize with alignment
def tokenize_and_align_labels(examples):
    tokenized = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True,
    )
    labels = []
    for i, word_labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        label_ids = []
        prev_word_id = None
        for word_id in word_ids:
            if word_id is None:
                label_ids.append(-100)  # Special tokens
            elif word_id != prev_word_id:
                label_ids.append(word_labels[word_id])  # First subword of a word
            else:
                label_ids.append(-100)  # Remaining subword tokens
            prev_word_id = word_id
        labels.append(label_ids)
    tokenized["labels"] = labels
    return tokenized

tokenized_ner = ner_dataset.map(tokenize_and_align_labels, batched=True)

# Compute metrics (seqeval expects string tags, not integer ids)
metric = evaluate.load("seqeval")

def compute_metrics_ner(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=2)
    # Drop positions labelled -100 (special tokens and subword continuations)
    true_labels = [
        [label_names[l] for l in label_row if l != -100]
        for label_row in labels
    ]
    true_preds = [
        [label_names[p] for p, l in zip(pred_row, label_row) if l != -100]
        for pred_row, label_row in zip(predictions, labels)
    ]
    return metric.compute(predictions=true_preds, references=true_labels)
Using Pipelines for Inference
from transformers import pipeline

# Sentiment Analysis
classifier = pipeline("sentiment-analysis")
result = classifier("This movie is amazing!")
# e.g. [{'label': 'POSITIVE', 'score': 0.9998}]

# Named Entity Recognition
# aggregation_strategy replaces the deprecated grouped_entities=True
ner = pipeline("ner", aggregation_strategy="simple")
result = ner("Apple was founded by Steve Jobs in California")
# e.g. [{'entity_group': 'ORG', 'word': 'Apple', ...},
#       {'entity_group': 'PER', 'word': 'Steve Jobs', ...},
#       {'entity_group': 'LOC', 'word': 'California', ...}]

# Question Answering
qa = pipeline("question-answering")
result = qa(
    question="When was BERT published?",
    context="BERT was published by Google in October 2018.",
)
# e.g. {'answer': 'October 2018', 'score': 0.95, ...}

# Text Generation (GPT-2)
generator = pipeline("text-generation", model="gpt2")
result = generator("The future of AI is", max_length=50)
Advanced: Custom Model Architecture
import torch
import torch.nn as nn
from transformers import AutoModel


class CustomTransformerModel(nn.Module):
    def __init__(self, model_name, num_classes):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.classifier = nn.Sequential(
            nn.Dropout(0.1),
            nn.Linear(self.encoder.config.hidden_size, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, num_classes),
        )

    def forward(self, input_ids, attention_mask=None):
        outputs = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )
        # Pooled [CLS] representation; assumes the encoder exposes pooler_output (BERT does)
        pooled = outputs.pooler_output
        logits = self.classifier(pooled)
        return logits


# Usage
model = CustomTransformerModel("bert-base-uncased", num_classes=5)
Key Takeaways
- Self-attention enables parallel processing of sequences with direct connections between all positions, solving the sequential bottleneck of RNNs.
- Multi-head attention allows the model to capture diverse relationships (syntactic, semantic, positional) simultaneously.
- Positional encoding injects sequence order information into the permutation-equivariant self-attention mechanism.
- BERT (encoder-only) excels at understanding tasks through bidirectional pre-training with MLM and NSP objectives.
- GPT (decoder-only) excels at generation tasks through autoregressive next-token prediction.
- Fine-tuning pretrained models on task-specific data achieves strong performance with minimal labeled data and compute.
- Hugging Face Transformers provides a unified API for accessing, fine-tuning, and deploying transformer models.
References
- Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS.
- Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL.
- Radford, A., et al. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI.
- Radford, A., et al. (2019). Language Models are Unsupervised Multitask Learners. OpenAI.
- Brown, T., et al. (2020). Language Models are Few-Shot Learners. NeurIPS.
- Liu, Y., et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692.
- Wolf, T., et al. (2020). Transformers: State-of-the-Art Natural Language Processing. EMNLP.