Transformers and BERT: Attention Is All You Need

Module 14: NLP

From RNNs to Transformers

The Limitations of Recurrent Architectures

Recurrent Neural Networks (RNNs) and their variants (LSTM, GRU) processed sequences step-by-step, maintaining a hidden state that accumulated information over time. While effective for many tasks, they suffered from fundamental limitations:

Sequential Bottleneck: RNNs process tokens one at a time, so computation cannot be parallelized across time steps. For a sequence of length $n$, the forward pass requires $O(n)$ sequential operations, which leaves much of a GPU's parallel capacity unused.

Long-Range Dependencies: Despite gating mechanisms, RNNs struggled to maintain information over long distances. The gradient signal must propagate through every intermediate step, leading to vanishing or exploding gradients.

Information Bottleneck: The fixed-size hidden state $h_t \in \mathbb{R}^d$ must compress all relevant information from the entire sequence, creating a capacity bottleneck.

The Attention Revolution

The Transformer architecture, introduced in "Attention Is All You Need" (Vaswani et al., 2017), eliminated recurrence entirely. The key insight: use self-attention to model relationships between all positions simultaneously, enabling full parallelization and direct connections across arbitrary distances.

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Each Transformer layer needs only $O(1)$ sequential operations (it is fully parallelizable across positions) at a cost of $O(n^2 \cdot d)$ compute. It trades sequential depth for quadratic attention computation, a favorable tradeoff on modern hardware.

| RNNs | Transformers |
| --- | --- |
| Sequential: $O(n)$ steps | Parallel: $O(1)$ steps |
| Hidden state bottleneck | Full pairwise attention |
| $O(n \cdot d^2)$ compute | $O(n^2 \cdot d)$ compute |
| Struggles with long range | Direct connections across all distances |

Self-Attention Mechanism

Query, Key, Value Framework

Self-attention computes a weighted sum of all positions in a sequence, where the weights are determined by the compatibility (dot product) between positions. Each input token $x_i$ is projected into three vectors:

$$q_i = W_Q x_i, \quad k_i = W_K x_i, \quad v_i = W_V x_i$$

where $W_Q, W_K \in \mathbb{R}^{d_k \times d}$ and $W_V \in \mathbb{R}^{d_v \times d}$ are learned projection matrices.

[Figure: Scaled dot-product attention. Each input x_i is projected by W_Q, W_K, and W_V into q_i, k_i, and v_i. Step 1: compute scores S = QK^T / sqrt(d_k). Step 2: apply softmax, alpha = softmax(S). Step 3: take the weighted sum z_i = sum_j alpha_ij v_j. Each position attends to all positions simultaneously.]

Scaled Dot-Product Attention

Given a sequence of $n$ tokens with embeddings $X \in \mathbb{R}^{n \times d}$ stacked as rows, we compute the full attention operation as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

where the projection matrices are written in row-vector form ($W_Q, W_K \in \mathbb{R}^{d \times d_k}$ and $W_V \in \mathbb{R}^{d \times d_v}$, the transposes of the per-token matrices above):

  • $Q = XW_Q \in \mathbb{R}^{n \times d_k}$: Query matrix
  • $K = XW_K \in \mathbb{R}^{n \times d_k}$: Key matrix
  • $V = XW_V \in \mathbb{R}^{n \times d_v}$: Value matrix
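
As a minimal sketch of the formula above, the following NumPy code computes scaled dot-product attention for a toy sequence. The sizes (`n = 4`, `d = 8`, `d_k = d_v = 4`), the random weights, and the helper names `softmax` and `scaled_dot_product_attention` are illustrative assumptions, not part of any library.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for Q: (n, d_k), K: (n, d_k), V: (n, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) compatibility scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # (n, d_v), (n, n)

# Toy example with random (untrained) projections; sizes are illustrative
rng = np.random.default_rng(0)
n, d, d_k, d_v = 4, 8, 4, 4
X = rng.normal(size=(n, d))
W_Q = rng.normal(size=(d, d_k))
W_K = rng.normal(size=(d, d_k))
W_V = rng.normal(size=(d, d_v))

Z, A = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(Z.shape, A.shape)  # (4, 4) (4, 4)
```

Row `i` of `A` holds the attention weights that position `i` places on every position in the sequence.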

Why scale by $\sqrt{d_k}$? When $d_k$ is large, the dot products $q_i^T k_j$ tend to have large magnitudes, pushing the softmax into regions with extremely small gradients. If the components of $q_i$ and $k_j$ are independent with zero mean and unit variance, the dot product has variance $d_k$, so dividing by $\sqrt{d_k}$ (its standard deviation) restores unit variance and keeps the softmax in a regime with useful gradients:

$$\text{Var}(q_i^T k_j) = \sum_{l=1}^{d_k} \text{Var}(q_{il}) \cdot \text{Var}(k_{jl}) = d_k \quad \Rightarrow \quad \text{Var}\left(\frac{q_i^T k_j}{\sqrt{d_k}}\right) = \frac{d_k}{d_k} = 1$$
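
The variance argument can be checked empirically. The sketch below assumes query and key components drawn independently from a standard normal distribution (an idealization of random initialization); the sample count and the `variances` dictionary are choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples = 10_000
variances = {}  # d_k -> (variance of raw dot product, variance after scaling)

for d_k in (16, 64, 256):
    q = rng.normal(size=(num_samples, d_k))
    k = rng.normal(size=(num_samples, d_k))
    dots = np.sum(q * k, axis=1)                 # one dot product per sample
    variances[d_k] = (dots.var(), (dots / np.sqrt(d_k)).var())
    raw, scaled = variances[d_k]
    print(f"d_k={d_k:3d}  Var(q.k)={raw:7.1f}  Var(q.k/sqrt(d_k))={scaled:.2f}")
```

The raw variance should grow roughly in proportion to `d_k`, while the scaled variance should stay close to 1.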

Attention as Soft Retrieval

Self-attention can be interpreted as a soft content-based retrieval system:

  • Queries represent what each token is "looking for"
  • Keys represent what each token "offers"
  • Values represent the actual information carried by each token
  • The attention weights $\alpha_{ij}$ determine how much information to retrieve from position $j$ for position $i$

Computational Complexity

The full attention computation requires:

  • Matrix multiplication $QK^T$: $O(n^2 \cdot d_k)$
  • Softmax: $O(n^2)$
  • Weighted sum: $O(n^2 \cdot d_v)$
  • Total: $O(n^2 \cdot d)$

This quadratic complexity in sequence length is the primary limitation of standard Transformers for very long sequences.
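
To make the quadratic growth concrete, here is a rough back-of-the-envelope calculation. The helper `attention_cost` is hypothetical; it counts only multiply-adds for the two matrix products and assumes float32 scores for a single head and a single sequence.

```python
def attention_cost(n, d_k, d_v, bytes_per_score=4):
    """Rough cost of full self-attention for one head and one sequence.

    Returns (multiply-adds for QK^T and the weighted sum,
             bytes needed to hold the n x n score matrix).
    """
    score_macs = n * n * d_k           # QK^T
    value_macs = n * n * d_v           # weights @ V
    score_bytes = n * n * bytes_per_score
    return score_macs + value_macs, score_bytes

for n in (512, 2048, 8192):
    macs, mem = attention_cost(n, d_k=64, d_v=64)
    print(f"n={n:5d}  multiply-adds={macs:,}  score matrix={mem / 2**20:.0f} MiB")
```

Under these assumptions the score matrix alone grows from 1 MiB at `n = 512` to 16 MiB at `n = 2048` and 256 MiB at `n = 8192`: each 4x increase in sequence length costs 16x in both compute and memory.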


Multi-Head Attention

Parallel Attention Heads

Multi-Head Attention runs multiple attention operations in parallel, allowing the model to attend to different types of relationships simultaneously:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W_O$$
$$\text{head}_i = \text{Attention}(QW_Q^i, KW_K^i, VW_V^i)$$

where $W_Q^i, W_K^i \in \mathbb{R}^{d \times d_k}$, $W_V^i \in \mathbb{R}^{d \times d_v}$, and $W_O \in \mathbb{R}^{h d_v \times d}$. With the usual choice $d_k = d_v = d/h$, the output projection is $W_O \in \mathbb{R}^{d \times d}$.
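
A minimal NumPy sketch of these two equations for self-attention (Q = K = V = X) follows. It assumes the per-head projections are stacked into arrays of shape `(h, d, d/h)`; the sizes, random weights, and function names are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """Multi-head self-attention.

    X:             (n, d) token representations
    W_Q, W_K, W_V: (h, d, d_head) per-head projections
    W_O:           (h * d_head, d) output projection
    """
    heads = []
    for W_q, W_k, W_v in zip(W_Q, W_K, W_V):
        Q, K, V = X @ W_q, X @ W_k, X @ W_v        # each (n, d_head)
        scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (n, n)
        heads.append(softmax(scores) @ V)          # (n, d_head)
    return np.concatenate(heads, axis=-1) @ W_O    # (n, d)

rng = np.random.default_rng(0)
n, d, h = 5, 16, 4
d_head = d // h
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(h, d, d_head)) for _ in range(3))
W_O = rng.normal(size=(h * d_head, d))

Z = multi_head_attention(X, W_Q, W_K, W_V, W_O)
print(Z.shape)  # (5, 16)
```

The explicit loop over heads is for readability; practical implementations typically reshape to a batched tensor and compute all heads in one matrix product.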

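[Figure: Multi-head attention. The input X feeds parallel heads (illustrated as syntactic, semantic, positional, and coreference heads); their outputs are concatenated as [head_1; head_2; ...] and passed through the linear projection W_O to give the output Z in R^(n x d). Each head learns distinct attention patterns.]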

Why Multiple Heads Work

Different heads can specialize in different linguistic phenomena. For example:

  • Head 1 may attend to syntactic dependencies (subject-verb agreement)
  • Head 2 may capture semantic relationships (modifier-modified)
  • Head 3 may track positional patterns (adjacent tokens)
  • Head 4 may resolve coreference (pronoun-antecedent)

With $h$ heads each using $d_k = d/h$ dimensions, the total compute is roughly the same as for a single head with full dimensionality, while each head remains free to learn its own attention pattern.

Attention Head Visualization

In practice, attention patterns can be visualized as heatmaps where each row represents a query token and each column represents a key token:

[Figure: Attention weight heatmap for Head 3 on "The cat sat on the mat", with query tokens as rows and key tokens as columns, shaded from low (~0.0) through medium (~0.3) to high (~0.7). Pattern: Head 3 focuses on predicate-argument structure: "cat" -> "sat" (subject-verb), "sat" -> "mat" (verb-object), and prepositions -> their objects.]

Positional Encoding

The Need for Position Information

Self-attention is permutation-equivariant: it treats the input as a set, not a sequence. Without positional information, reordering the input tokens merely reorders the output vectors, so each word in "The cat sat" and "sat cat The" would receive exactly the same representation. Positional encodings inject sequence order information.
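
This property is easy to verify numerically. The sketch below uses a single attention head with random weights and no positional encoding (all sizes are illustrative) and checks that permuting the input rows permutes the output rows in the same way.

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention without any positional information."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d = 3, 8
X = rng.normal(size=(n, d))                       # stand-in for "The cat sat"
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

perm = np.array([2, 1, 0])                        # "sat cat The"
out = self_attention(X, W_Q, W_K, W_V)
out_permuted = self_attention(X[perm], W_Q, W_K, W_V)

# Each token gets the same vector regardless of where it appears
print(np.allclose(out_permuted, out[perm]))  # True
```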

Sinusoidal Positional Encoding

The original Transformer uses fixed sinusoidal functions:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$

where $pos$ is the position index and $i$ indexes the sine/cosine dimension pairs ($0 \leq i < d/2$).
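
The two formulas translate directly into a few lines of NumPy. This is a sketch assuming an even model dimension `d`; the function name and the sizes in the example are illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d):
    """Return a (max_len, d) matrix of sinusoidal encodings (d must be even)."""
    positions = np.arange(max_len)[:, None]        # (max_len, 1)
    pair_index = np.arange(d // 2)[None, :]        # i in the formula, (1, d/2)
    angles = positions / (10000 ** (2 * pair_index / d))
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)   # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)   # odd dimensions 2i + 1
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d=16)
print(pe.shape)   # (50, 16)
print(pe[0, :4])  # [0. 1. 0. 1.]
```

Position 0 encodes as alternating 0 and 1 because $\sin(0) = 0$ and $\cos(0) = 1$; the matrix is simply added to the token embeddings.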

Key Properties:

  1. Bounded values: $PE_{(pos, i)} \in [-1, 1]$ for all positions and dimensions
  2. Relative positions learnable: $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$ (a rotation matrix in each frequency pair)
  3. Unique encoding per position: Each position has a distinct encoding vector
  4. Generalization to unseen lengths: The encoding is defined for every position, which may let the model extrapolate to sequences longer than those seen in training

[Figure: Sinusoidal positional encoding. Encoding value (between -1 and +1) plotted against position (pos) for several embedding dimensions.]

Alternative Positional Encodings

Learned Positional Embeddings: BERT and GPT use learned position embeddings $E_{pos} \in \mathbb{R}^{d}$ stored in a lookup table. This is simpler but limits generalization to unseen sequence lengths.

Rotary Position Embeddings (RoPE): Encodes positions by rotating query and key vectors in 2D planes, enabling relative position awareness through dot products. Used in modern LLMs like LLaMA.
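
The relative-position property of RoPE can be illustrated with a small sketch. It assumes the common pairing of consecutive dimensions and a base of 10000; the function `rope` is written for this example and is not taken from any particular library.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs (x[2i], x[2i+1]) by the angle pos * base**(-2i/d)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per 2D plane
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same relative offset (3) at two different absolute positions
score_a = rope(q, 5) @ rope(k, 2)
score_b = rope(q, 105) @ rope(k, 102)
print(np.isclose(score_a, score_b))  # True
```

Because each pair is rotated, the attention score between a query at position $m$ and a key at position $n$ depends only on the offset $m - n$.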

ALiBi (Attention with Linear Biases): Adds linear bias terms to attention scores based on relative distance, without explicit positional encoding.


Transformer Encoder Architecture

Encoder Block

Each Transformer encoder block consists of two sub-layers with residual connections and layer normalization:

[Figure: Transformer encoder block, repeated N times. Input embeddings plus positional encoding (X + PE) -> multi-head self-attention (Q = K = V = X) -> add and LayerNorm -> feed-forward network, FFN(x) = max(0, xW1 + b1)W2 + b2 -> add and LayerNorm -> encoder output Z in R^(n x d).]

Mathematical Formulation

Sub-layer 1: Multi-Head Self-Attention

$$\text{MHA}(x) = \text{MultiHead}(x, x, x)$$
$$z = \text{LayerNorm}(x + \text{MHA}(x))$$

Sub-layer 2: Position-wise Feed-Forward Network

$$\text{FFN}(z) = \max(0, zW_1 + b_1)W_2 + b_2$$
$$\text{output} = \text{LayerNorm}(z + \text{FFN}(z))$$

The FFN is applied independently to each position but shares the same parameters across positions — hence "position-wise."

Layer Normalization

$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where $\mu$ and $\sigma^2$ are computed across the feature dimension for each token, and $\gamma, \beta$ are learned scale and shift parameters.
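
The following sketch puts layer normalization together with the feed-forward sub-layer described earlier in this section. The weights are random placeholders, `gamma` and `beta` are left at their usual initial values of ones and zeros, and the sizes (`d = 8`, `d_ff = 32`) are arbitrary choices for illustration.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each row (token) of x over its feature dimension."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)   # population variance, as in the formula
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def ffn(z, W1, b1, W2, b2):
    """Position-wise feed-forward network with ReLU."""
    return np.maximum(0, z @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
n, d, d_ff = 4, 8, 32
z = rng.normal(size=(n, d))
W1, b1 = rng.normal(size=(d, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)) * 0.1, np.zeros(d)
gamma, beta = np.ones(d), np.zeros(d)

# Sub-layer 2: residual connection followed by LayerNorm
output = layer_norm(z + ffn(z, W1, b1, W2, b2), gamma, beta)
print(output.shape)                   # (4, 8)
print(output.mean(axis=-1).round(6))  # approximately 0 for every token
print(output.std(axis=-1).round(3))   # approximately 1 for every token
```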


Transformer Decoder Architecture

Decoder Block

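The decoder block modifies the encoder block in two ways: its self-attention sub-layer is masked so that a position cannot attend to later positions, and an encoder-decoder cross-attention sub-layer is inserted between the self-attention and feed-forward sub-layers: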

[Figure: Transformer decoder block, repeated N times. Decoder input (shifted right, Y + PE) -> masked multi-head self-attention with a lower-triangular causal mask, which prevents attending to future tokens -> add and LayerNorm -> multi-head cross-attention (Q from the decoder, K and V from the encoder output) -> add and LayerNorm -> feed-forward network (same as the encoder FFN) -> add and LayerNorm -> linear + softmax -> P(y_t).]

Causal Masking

For autoregressive generation, the decoder uses a causal mask to prevent attending to future positions:

$$\text{Mask}_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases}$$

Applied before softmax: $\text{softmax}(S + \text{Mask})$ ensures position $i$ only attends to positions $\leq i$.
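
A small NumPy sketch of the mask and its effect on the softmax is shown below. The helper names and the 4 x 4 random score matrix are illustrative.

```python
import numpy as np

def causal_mask(n):
    """Return an (n, n) mask with 0 where j <= i and -inf where j > i."""
    mask = np.zeros((n, n))
    mask[np.triu_indices(n, k=1)] = -np.inf
    return mask

def masked_softmax(scores, mask):
    s = scores + mask
    s = s - s.max(axis=-1, keepdims=True)   # row max is finite: the diagonal is never masked
    w = np.exp(s)                           # exp(-inf) = 0, so future positions get zero weight
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
S = rng.normal(size=(4, 4))
A = masked_softmax(S, causal_mask(4))
print(np.round(A, 2))  # lower-triangular weights; every row sums to 1
```

The first row always places all of its weight on position 0, since that is the only position the first token is allowed to see.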


BERT: Bidirectional Encoder Representations from Transformers

Architecture Overview

BERT (Devlin et al., 2019) uses only the Transformer encoder stack, enabling bidirectional context understanding — unlike GPT which is left-to-right only.

[Figure: BERT architecture (bidirectional). The input tokens [CLS] The cat [MASK] sat on [SEP] pass through Transformer encoder layers 1 to L (BERT-base: L=12, BERT-large: L=24), each with self-attention, FFN, and LayerNorm, producing outputs h_cls, h_1, ..., h_5, h_sep. Key: each token attends to ALL other tokens (left and right).]

BERT Model Variants

| Model | Layers ($L$) | Hidden ($d$) | Heads ($h$) | Parameters |
| --- | --- | --- | --- | --- |
| BERT-base | 12 | 768 | 12 | 110M |
| BERT-large | 24 | 1024 | 16 | 340M |

Pre-Training Objectives

BERT is pre-trained on two self-supervised tasks:

1. Masked Language Modeling (MLM)

Randomly select 15% of input tokens as prediction targets and train the model to recover them:

$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log P(x_i \mid x_{\backslash \mathcal{M}})$$

where $\mathcal{M}$ is the set of selected positions and $x_{\backslash \mathcal{M}}$ is the corrupted input.

Of the selected tokens:

  • 80% are replaced with [MASK]
  • 10% are replaced with a random token
  • 10% are left unchanged

Replacing every selected token with [MASK] would create a mismatch between pre-training and fine-tuning, because [MASK] never appears in downstream inputs; the 80/10/10 strategy mitigates this.
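
The selection and corruption procedure can be sketched in plain Python. This is a simplified illustration, not BERT's actual data pipeline: it does not exclude special tokens such as [CLS] and [SEP], and the function name `mask_tokens` is made up for the example. The constants are the [MASK] id and vocabulary size of bert-base-uncased, and -100 follows the ignore-label convention used in the NER example later in this lesson.

```python
import random

MASK_ID = 103        # [MASK] id in the bert-base-uncased vocabulary
VOCAB_SIZE = 30522   # bert-base-uncased vocabulary size

def mask_tokens(token_ids, mask_prob=0.15, seed=None):
    """Return (corrupted_ids, labels); labels are -100 where no prediction is made."""
    rng = random.Random(seed)
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue                                   # token not selected
        labels[i] = tok                                # model must recover the original token
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK_ID                     # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.randrange(VOCAB_SIZE)   # 10%: replace with a random token
        # remaining 10%: keep the original token
    return corrupted, labels

ids = list(range(1000, 1020))   # hypothetical token ids
corrupted, labels = mask_tokens(ids, seed=0)
print(corrupted)
print(labels)
```

The loss is then computed only at positions whose label is not -100.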

2. Next Sentence Prediction (NSP)

Given sentence pairs $(A, B)$, predict whether $B$ actually follows $A$ in the corpus:

$$\mathcal{L}_{\text{NSP}} = -\left[ y \log P(\text{IsNext} \mid \text{[CLS]}, A, B) + (1-y) \log P(\text{NotNext} \mid \text{[CLS]}, A, B) \right]$$

[Figure: BERT pre-training tasks. Task 1, masked language modeling: the input "The [M] sat on the [M]" passes through the bidirectional encoder layers, which predict "cat" (P = 0.72) and "mat" (P = 0.65). Task 2, next sentence prediction: sentence A "The cat sat" and sentence B "on the mat" are packed as "[CLS] The cat sat [SEP] on the mat [SEP]" and classified as IsNext or NotNext from the [CLS] token.]

BERT Input Representation

The input to BERT is the sum of three embeddings:

$$\text{Input} = \text{TokenEmbed}(x) + \text{PosEmbed}(pos) + \text{SegEmbed}(seg)$$

where:

  • Token Embedding: WordPiece tokenization (max 512 tokens)
  • Position Embedding: Learned (max 512 positions)
  • Segment Embedding: Distinguishes sentence A vs B (for NSP)

Special tokens: [CLS] (classification), [SEP] (separator), [MASK] (masked), [PAD] (padding)
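
The sum of three embeddings can be sketched with plain lookup tables. Everything here is a toy stand-in: the tables are random rather than learned, the vocabulary has only 100 entries, and the token ids are hypothetical. The layer normalization and dropout that BERT applies after the sum are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_positions, num_segments, d = 100, 512, 2, 16  # toy sizes

token_table = rng.normal(size=(vocab_size, d))
position_table = rng.normal(size=(max_positions, d))
segment_table = rng.normal(size=(num_segments, d))

def bert_input_embeddings(token_ids, segment_ids):
    """Sum token, position, and segment embeddings for one sequence."""
    token_ids = np.asarray(token_ids)
    segment_ids = np.asarray(segment_ids)
    positions = np.arange(len(token_ids))
    return (token_table[token_ids]
            + position_table[positions]
            + segment_table[segment_ids])

# Hypothetical ids for: [CLS] the cat sat [SEP] on the mat [SEP]
token_ids   = [1, 10, 11, 12, 2, 13, 10, 14, 2]
segment_ids = [0, 0, 0, 0, 0, 1, 1, 1, 1]
E = bert_input_embeddings(token_ids, segment_ids)
print(E.shape)  # (9, 16)
```

The word "the" appears twice (positions 1 and 6) with the same token embedding, but its two input vectors differ because the position and segment embeddings differ.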


GPT: Autoregressive Language Modeling

GPT Architecture

GPT (Radford et al., 2018) uses only the Transformer decoder stack, training autoregressively to predict the next token:

$$\mathcal{L}_{\text{GPT}} = -\sum_{t=1}^{T} \log P(x_t \mid x_1, \ldots, x_{t-1}; \theta)$$

[Figure: GPT autoregressive generation. The input "The cat sat on the ?" passes through N layers of masked multi-head self-attention and feed-forward networks (GPT-3: 96 layers, 175B parameters); with causal attention each token sees only previous tokens. Top predictions for the next token: mat (0.42), floor (0.18), bed (0.12).]
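
As a sketch of how the autoregressive loss is evaluated from model outputs, the function below takes a matrix of logits and sums the negative log-probabilities of each next token. It is a simplification under stated assumptions: the first token has no context, so the sum starts at the second token, and the name `next_token_nll` and the uniform-logits example are made up for illustration.

```python
import numpy as np

def next_token_nll(logits, token_ids):
    """Sum of -log P(x_t | x_<t) over t = 2..T.

    logits:    (T, vocab) array; row t scores the token that follows position t
    token_ids: length-T sequence of token ids
    """
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(token_ids)[1:]              # tokens to be predicted
    pred = logits[:-1]                               # the last row has no target
    pred = pred - pred.max(axis=-1, keepdims=True)   # stabilize the log-softmax
    log_probs = pred - np.log(np.exp(pred).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].sum()

# With uniform logits every next token has probability 1 / vocab
vocab, T = 5, 4
uniform_logits = np.zeros((T, vocab))
loss = next_token_nll(uniform_logits, [0, 3, 1, 4])
print(np.isclose(loss, 3 * np.log(vocab)))  # True
```

Shifting the targets by one position relative to the logits is what makes this a next-token objective; the causal mask guarantees that row `t` of the logits was computed without access to later tokens.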

BERT vs GPT: Key Differences

| Aspect | BERT | GPT |
| --- | --- | --- |
| Architecture | Encoder only | Decoder only |
| Attention | Bidirectional | Unidirectional (causal) |
| Pre-training | MLM + NSP | Next token prediction |
| Fine-tuning | Classification, QA, NER | Generation, classification |
| Best for | Understanding tasks | Generation tasks |

BERT Variants and Evolution

  • RoBERTa (Liu et al., 2019): Removed NSP, more data, larger batches
  • ALBERT (Lan et al., 2020): Parameter sharing, factorized embeddings
  • DistilBERT (Sanh et al., 2019): Knowledge distillation, about 60% of BERT-base's parameters, about 97% of its performance
  • DeBERTa (He et al., 2021): Disentangled attention, enhanced mask decoder

Fine-Tuning Pretrained Models

Transfer Learning Paradigm

The modern NLP paradigm follows a two-stage approach:

  1. Pre-training: Learn general language representations on large corpora (self-supervised)
  2. Fine-tuning: Adapt to specific downstream tasks (supervised)

[Figure: Fine-tuning workflow. Pre-training: large corpus (books, Wikipedia), self-supervised objectives, days to weeks on TPU pods (BERT: 3.3B words; GPT-3: 300B tokens), producing a pretrained model. Fine-tuning: task-specific labeled data, task head plus full model update, minutes to hours on GPU (SST-2: 67K sentences; MNLI: 393K pairs), producing a task-specific model. Deploy: production model.]

Task-Specific Heads

Different tasks require different output layers:

| Task | Input | Output | Head |
| --- | --- | --- | --- |
| Sentiment | [CLS] token | Binary/multi-class | Linear + softmax |
| NER | All tokens | Per-token labels | Linear + CRF |
| QA | Passage + Question | Start/end span | Two linear layers |
| Similarity | Two [CLS] | Similarity score | Cosine / MLP |
| Generation | All tokens | Next token distribution | Language head |

Hugging Face Transformers Implementation

Installation

pip install transformers datasets accelerate evaluate seqeval

Basic Usage

from transformers import AutoTokenizer, AutoModel

# Load pretrained BERT
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenize input
text = "The cat sat on the mat"
inputs = tokenizer(text, return_tensors="pt")

# Forward pass
outputs = model(**inputs)
# outputs.last_hidden_state: (batch, seq_len, hidden_dim)
# outputs.pooler_output: (batch, hidden_dim) - [CLS] hidden state after a linear layer + tanh

Fine-Tuning for Text Classification

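import numpy as np
import evaluate
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
)
from datasets import load_dataset

# Tokenizer matching the checkpoint fine-tuned below
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Accuracy metric passed to the Trainer further down
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)

# Load dataset
dataset = load_dataset("imdb")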

# Load model with classification head
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
)

# Tokenize
def tokenize_fn(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=512,
    )

tokenized = dataset.map(tokenize_fn, batched=True)

# Training arguments
args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    weight_decay=0.01,
    evaluation_strategy="epoch",  # named eval_strategy in recent transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Trainer
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    compute_metrics=compute_metrics,
)

# Fine-tune
trainer.train()

Fine-Tuning for Named Entity Recognition

from transformers import AutoModelForTokenClassification, AutoTokenizer
from datasets import load_dataset
import numpy as np
import evaluate

# Assumption: CoNLL-2003 is loaded under this Hub name; the original
# snippet did not show which dataset object it mapped over
dataset = load_dataset("conll2003")
label_names = dataset["train"].features["ner_tags"].feature.names

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
metric = evaluate.load("seqeval")  # entity-level precision/recall/F1

# Load model
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(label_names),  # 9 CoNLL-2003 NER tags
)

# Tokenize with alignment
def tokenize_and_align_labels(examples):
    tokenized = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True,
    )
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        label_ids = []
        prev_word_id = None
        for word_id in word_ids:
            if word_id is None:
                label_ids.append(-100)  # Special tokens
            elif word_id != prev_word_id:
                label_ids.append(label[word_id])  # First subword of each word
            else:
                label_ids.append(-100)  # Remaining subword tokens
            prev_word_id = word_id
        labels.append(label_ids)
    tokenized["labels"] = labels
    return tokenized

tokenized_ner = dataset.map(tokenize_and_align_labels, batched=True)

# Compute metrics
def compute_metrics_ner(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=2)
    # Keep only positions with a real label (-100 marks ignored tokens)
    # and convert ids to tag strings, which seqeval expects
    true_labels = [
        [label_names[l] for l in label_row if l != -100]
        for label_row in labels
    ]
    true_preds = [
        [label_names[p] for p, l in zip(pred_row, label_row) if l != -100]
        for pred_row, label_row in zip(predictions, labels)
    ]
    results = metric.compute(predictions=true_preds, references=true_labels)
    return results

Using Pipelines for Inference

from transformers import pipeline

# Sentiment Analysis
classifier = pipeline("sentiment-analysis")
result = classifier("This movie is amazing!")
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Named Entity Recognition
ner = pipeline("ner", aggregation_strategy="simple")
result = ner("Apple was founded by Steve Jobs in California")
# [{'entity_group': 'ORG', 'word': 'Apple', ...},
#  {'entity_group': 'PER', 'word': 'Steve Jobs', ...},
#  {'entity_group': 'LOC', 'word': 'California', ...}]

# Question Answering
qa = pipeline("question-answering")
result = qa(
    question="When was BERT published?",
    context="BERT was published by Google in October 2018.",
)
# {'answer': 'October 2018', 'score': 0.95, ...}

# Text Generation (GPT-2)
generator = pipeline("text-generation", model="gpt2")
result = generator("The future of AI is", max_length=50)

Advanced: Custom Model Architecture

import torch
import torch.nn as nn
from transformers import AutoModel

class CustomTransformerModel(nn.Module):
    def __init__(self, model_name, num_classes):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.classifier = nn.Sequential(
            nn.Dropout(0.1),
            nn.Linear(self.encoder.config.hidden_size, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, num_classes),
        )

    def forward(self, input_ids, attention_mask=None):
        outputs = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )
        pooled = outputs.pooler_output  # [CLS] state after BERT's pooler (linear + tanh)
        logits = self.classifier(pooled)
        return logits

# Usage
model = CustomTransformerModel("bert-base-uncased", num_classes=5)

Key Takeaways

  1. Self-attention enables parallel processing of sequences with direct connections between all positions, solving the sequential bottleneck of RNNs.

  2. Multi-head attention allows the model to capture diverse relationships (syntactic, semantic, positional) simultaneously.

  3. Positional encoding injects sequence order information into the permutation-equivariant self-attention mechanism.

  4. BERT (encoder-only) excels at understanding tasks through bidirectional pre-training with MLM and NSP objectives.

  5. GPT (decoder-only) excels at generation tasks through autoregressive next-token prediction.

  6. Fine-tuning pretrained models on task-specific data achieves strong performance with minimal labeled data and compute.

  7. Hugging Face Transformers provides a unified API for accessing, fine-tuning, and deploying transformer models.


References

  1. Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS.
  2. Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL.
  3. Radford, A., et al. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI.
  4. Radford, A., et al. (2019). Language Models are Unsupervised Multitask Learners. OpenAI.
  5. Brown, T., et al. (2020). Language Models are Few-Shot Learners. NeurIPS.
  6. Liu, Y., et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach.
  7. Wolf, T., et al. (2020). Transformers: State-of-the-Art Natural Language Processing. EMNLP.
