Transformers and BERT: Attention Is All You Need

From RNNs to Transformers

The Limitations of Recurrent Architectures

Recurrent Neural Networks (RNNs) and their variants (LSTM, GRU) processed sequences step-by-step, maintaining a hidden state that accumulated information over time. While effective for many tasks, they suffered from fundamental limitations:

Sequential Bottleneck: RNNs process tokens one at a time, so computation cannot be parallelized across time steps. For a sequence of length $n$, a forward pass requires $O(n)$ sequential operations, which limits how much GPU parallelism can be exploited.

Long-Range Dependencies: Despite gating mechanisms, RNNs struggled to maintain information over long distances. The gradient signal must propagate through every intermediate step, leading to vanishing or exploding gradients.

Information Bottleneck: The fixed-size hidden state must compress all relevant information from the entire sequence, creating a capacity bottleneck.

The Attention Revolution

The Transformer architecture, introduced in "Attention Is All You Need" (Vaswani et al., 2017), eliminated recurrence entirely. The key insight: use self-attention to model relationships between all positions simultaneously, enabling full parallelization and direct connections across arbitrary distances.

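The Transformer needs only $O(1)$ sequential operations per layer (all positions are processed in parallel) at a cost of $O(n^2 \cdot d)$ compute per layer, for sequence length $n$ and model dimension $d$. It trades sequential depth for quadratic attention computation, a favorable tradeoff on modern hardware.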

RNNs	Transformers
Sequential: $O(n)$ steps	Parallel: $O(1)$ steps
Hidden state bottleneck	Full pairwise attention
$O(n \cdot d^2)$ compute	$O(n^2 \cdot d)$ compute
Struggles with long range	Direct connections across all distances

Self-Attention Mechanism

Query, Key, Value Framework

Self-attention computes a weighted sum of all positions in a sequence, where the weights are determined by the compatibility (dot product) between positions. Each input token is projected into three vectors:

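$$q_i = x_i W^Q, \quad k_i = x_i W^K, \quad v_i = x_i W^V$$

where $W^Q, W^K \in \mathbb{R}^{d \times d_k}$ and $W^V \in \mathbb{R}^{d \times d_v}$ are learned projection matrices.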

Scaled Dot-Product Attention

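Given a sequence of $n$ tokens with embeddings $X \in \mathbb{R}^{n \times d}$, we compute the full attention operation as:

$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

where:

$Q \in \mathbb{R}^{n \times d_k}$: Query matrix
$K \in \mathbb{R}^{n \times d_k}$: Key matrix
$V \in \mathbb{R}^{n \times d_v}$: Value matrix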
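
A minimal NumPy sketch of the scaled dot-product attention operation described above. The toy sizes, the random projection matrices, and the helper names `softmax` and `scaled_dot_product_attention` are assumptions made for illustration, not part of any library.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) compatibility scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d, d_k, d_v = 4, 8, 8, 8              # toy sizes chosen for illustration
X = rng.normal(size=(n, d))              # stand-in token embeddings
W_Q, W_K, W_V = (rng.normal(size=(d, m)) for m in (d_k, d_k, d_v))

out, attn = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape, attn.shape)  # (4, 8) (4, 4)
```

Each row of `attn` is a probability distribution over the positions that the corresponding query attends to.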

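Why scale by $\sqrt{d_k}$? When $d_k$ is large, the dot products tend to have large magnitudes, pushing the softmax into regions with extremely small gradients. Dividing by $\sqrt{d_k}$ (the standard deviation of the dot product when the components of $q$ and $k$ are independent with zero mean and unit variance) keeps the softmax in a regime with useful gradients:

$$\text{Var}(q \cdot k) = \sum_{i=1}^{d_k} \text{Var}(q_i k_i) = d_k \quad \Rightarrow \quad \text{Var}\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = 1$$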
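
The effect of the scaling can be checked empirically. The sketch below assumes query and key components drawn i.i.d. from a standard normal distribution (an idealization of random initialization); `dot_product_std` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_product_std(d_k, num_samples=5_000):
    """Empirical std of q . k, unscaled and scaled, for N(0, 1) components."""
    q = rng.normal(size=(num_samples, d_k))
    k = rng.normal(size=(num_samples, d_k))
    dots = np.sum(q * k, axis=1)
    return dots.std(), (dots / np.sqrt(d_k)).std()

for d_k in (4, 64, 512):
    raw, scaled = dot_product_std(d_k)
    print(f"d_k={d_k:4d}  std(q.k)={raw:6.2f}  std(q.k/sqrt(d_k))={scaled:4.2f}")
```

The unscaled standard deviation should grow roughly like $\sqrt{d_k}$, while the scaled one should stay close to 1 for every $d_k$.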

Attention as Soft Retrieval

Self-attention can be interpreted as a soft content-based retrieval system:

Queries represent what each token is "looking for"
Keys represent what each token "offers"
Values represent the actual information carried by each token
The attention weights $\alpha_{ij}$ determine how much information position $i$ retrieves from position $j$

Computational Complexity

The full attention computation requires:

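Matrix multiplication $QK^\top$: $O(n^2 \cdot d_k)$
Softmax: $O(n^2)$
Weighted sum (attention weights times $V$): $O(n^2 \cdot d_v)$
Total: $O(n^2 \cdot d)$, since $d_k$ and $d_v$ are proportional to the model dimension $d$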

This quadratic complexity in sequence length is the primary limitation of standard Transformers for very long sequences.
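
To make the quadratic growth concrete, here is a rough back-of-the-envelope calculation. It assumes a single head, float32 storage for the $n \times n$ attention matrix, and counts only the two matrix products; `attention_cost` is a hypothetical helper, and real implementations may differ (for example by never materializing the full matrix).

```python
def attention_cost(n, d, bytes_per_value=4):
    """Rough per-head, per-layer cost of full self-attention.

    Returns (multiply-adds for Q K^T plus the weighted sum,
             bytes needed to hold the n x n attention matrix).
    """
    score_macs = n * n * d          # Q K^T
    weighted_sum_macs = n * n * d   # attention weights times V
    matrix_bytes = n * n * bytes_per_value
    return score_macs + weighted_sum_macs, matrix_bytes

for n in (512, 2048, 8192):
    macs, mem = attention_cost(n, d=64)
    print(f"n={n:5d}  multiply-adds={macs:.2e}  attention matrix={mem / 2**20:.0f} MiB")
```

Under these assumptions, quadrupling the sequence length multiplies both the compute and the attention-matrix memory by 16.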

Multi-Head Attention

Parallel Attention Heads

Multi-Head Attention runs multiple attention operations in parallel, allowing the model to attend to different types of relationships simultaneously:

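$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\, W^O$$

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$

where $W_i^Q \in \mathbb{R}^{d \times d_k}$, $W_i^K \in \mathbb{R}^{d \times d_k}$, $W_i^V \in \mathbb{R}^{d \times d_v}$, and $W^O \in \mathbb{R}^{h d_v \times d}$.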
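
The multi-head computation can be sketched in NumPy as follows. This is an illustrative sketch, not a reference implementation: it assumes $d_k = d_v = d / h$, stacks the per-head projection matrices into arrays of shape `(h, d, d_k)`, and uses random weights and the hypothetical function name `multi_head_self_attention`.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_self_attention(X, W_Q, W_K, W_V, W_O):
    """Self-attention with h heads.

    X: (n, d); W_Q, W_K, W_V: (h, d, d_k); W_O: (h * d_k, d).
    """
    d_k = W_Q.shape[-1]
    # Broadcasting projects X once per head: each result is (h, n, d_k).
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, n, n)
    heads = softmax(scores, axis=-1) @ V               # (h, n, d_k)
    # Concatenate the heads along the feature axis, then mix with W_O.
    concat = heads.transpose(1, 0, 2).reshape(X.shape[0], -1)  # (n, h * d_k)
    return concat @ W_O

rng = np.random.default_rng(0)
n, d, h = 5, 16, 4
d_k = d // h
W_Q, W_K, W_V = (rng.normal(size=(h, d, d_k)) for _ in range(3))
W_O = rng.normal(size=(h * d_k, d))
X = rng.normal(size=(n, d))

out = multi_head_self_attention(X, W_Q, W_K, W_V, W_O)
print(out.shape)  # (5, 16)
```

Batching the heads into one array is a common design choice because it turns $h$ small matrix products into a single larger one.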

Why Multiple Heads Work

Different heads specialize in different linguistic phenomena:

Head 1 may attend to syntactic dependencies (subject-verb agreement)
Head 2 may capture semantic relationships (modifier-modified)
Head 3 may track positional patterns (adjacent tokens)
Head 4 may resolve coreference (pronoun-antecedent)

With $h$ heads each using $d_k = d_v = d / h$ dimensions, the total compute is similar to that of a single head with full dimensionality $d$, while each head can attend to a different representation subspace.

Attention Head Visualization

In practice, attention patterns can be visualized as heatmaps in which each row represents a query token, each column represents a key token, and each cell's intensity is the attention weight between them.

Positional Encoding

The Need for Position Information

Self-attention is permutation-equivariant: it treats the input as a set, not a sequence. Without positional information, "The cat sat" and "sat cat The" would yield the same representation for each word, merely in a different order. Positional encodings inject sequence order information.

Sinusoidal Positional Encoding

The original Transformer uses fixed sinusoidal functions:

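$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)$$

$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$

where $pos$ is the position index and $i$ is the dimension index.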

Key Properties:

Bounded values: every entry of $PE$ lies in $[-1, 1]$ for all positions and dimensions
Relative positions learnable: $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$ (a rotation matrix in each frequency pair)
Unique encoding per position: Each position has a distinct encoding vector
Generalization to unseen lengths: The encoding is defined for every position, so the model may extrapolate to sequences longer than those seen in training
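
The sinusoidal encoding and its relative-position property can be sketched in NumPy. The function name `sinusoidal_positional_encoding` and the toy sizes are assumptions for illustration; the check at the end covers only the first sine/cosine pair, whose frequency is 1.

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions, d):
    """Return a (num_positions, d) matrix of sinusoidal encodings (d even)."""
    pos = np.arange(num_positions)[:, None]        # (num_positions, 1)
    i = np.arange(d // 2)[None, :]                 # (1, d / 2)
    angles = pos / np.power(10000.0, 2 * i / d)    # (num_positions, d / 2)
    pe = np.zeros((num_positions, d))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(num_positions=50, d=16)

# Relative-position property for the first frequency pair: shifting by k
# positions is a fixed 2x2 rotation, independent of the starting position.
k, omega = 3, 1.0   # omega = 1 / 10000^(0 / d) for dimensions 0 and 1
rotation = np.array([[np.cos(k * omega), np.sin(k * omega)],
                     [-np.sin(k * omega), np.cos(k * omega)]])
print(np.allclose(pe[10 + k, :2], rotation @ pe[10, :2]))  # True
```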

Alternative Positional Encodings

Learned Positional Embeddings: BERT and GPT use learned position embeddings stored in a lookup table. This is simpler but limits generalization to unseen sequence lengths.

Rotary Position Embeddings (RoPE): Encodes positions by rotating query and key vectors in 2D planes, enabling relative position awareness through dot products. Used in modern LLMs like LLaMA.

ALiBi (Attention with Linear Biases): Adds linear bias terms to attention scores based on relative distance, without explicit positional encoding.
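
As an illustration of the relative-position behaviour of rotary embeddings, the sketch below rotates consecutive feature pairs by position-dependent angles. The function name `rope`, the interleaved pairing of dimensions, and the base of 10000 are assumptions for this sketch; production implementations may pair dimensions differently.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive feature pairs of a 1D vector x by pos-dependent angles."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per 2D plane
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# The same offset (3) at different absolute positions gives the same score.
score_a = rope(q, 5) @ rope(k, 2)
score_b = rope(q, 105) @ rope(k, 102)
print(np.isclose(score_a, score_b))  # True
```

Because each rotation preserves vector norms, the attention score between a rotated query and a rotated key depends on the positions only through their difference.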

Transformer Encoder Architecture

Encoder Block

Each Transformer encoder block consists of two sub-layers, each wrapped in a residual connection followed by layer normalization.

Mathematical Formulation

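Sub-layer 1: Multi-Head Self-Attention

$$Z = \text{LayerNorm}(X + \text{MultiHead}(X, X, X))$$

Sub-layer 2: Position-wise Feed-Forward Network

$$\text{FFN}(z) = \max(0, zW_1 + b_1)W_2 + b_2$$

$$\text{Output} = \text{LayerNorm}(Z + \text{FFN}(Z))$$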

The FFN is applied independently to each position but shares the same parameters across positions — hence "position-wise."
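
The "position-wise" property can be verified directly: applying the network to the whole sequence at once gives the same result as applying it to each position separately. The sketch assumes the two-layer ReLU form used in the original Transformer; the sizes and random weights are illustrative.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Two-layer feed-forward network with ReLU, applied along the last axis."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
n, d, d_ff = 6, 8, 32   # toy sizes; d_ff = 4 * d mirrors the original paper's ratio
W1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)
Z = rng.normal(size=(n, d))

whole = ffn(Z, W1, b1, W2, b2)                                  # all positions at once
per_position = np.stack([ffn(z, W1, b1, W2, b2) for z in Z])    # one position at a time
print(np.allclose(whole, per_position))  # True
```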

Layer Normalization

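$$\text{LayerNorm}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where $\mu$ and $\sigma^2$ are computed across the feature dimension for each token, $\gamma$ and $\beta$ are learned scale and shift parameters, and $\epsilon$ is a small constant for numerical stability.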
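
A minimal NumPy sketch of layer normalization over the feature dimension. The value of `eps` and the identity initialization of the scale and shift parameters are assumptions chosen for the example.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each row (token) of x over its features, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(4, 16))   # 4 tokens, 16 features
y = layer_norm(x, gamma=np.ones(16), beta=np.zeros(16))
print(y.mean(axis=-1).round(6), y.std(axis=-1).round(3))
```

With unit scale and zero shift, every token's output has approximately zero mean and unit standard deviation, regardless of the statistics of the input.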

Transformer Decoder Architecture

Decoder Block

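The decoder block has three sub-layers: masked self-attention (in place of the encoder's unmasked self-attention), encoder-decoder cross-attention (queries come from the decoder, keys and values from the encoder output), and a position-wise feed-forward network. Each sub-layer is wrapped in a residual connection and layer normalization.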

Causal Masking

For autoregressive generation, the decoder uses a causal mask to prevent attending to future positions:

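$$M_{ij} = \begin{cases} 0 & \text{if } j \le i \\ -\infty & \text{if } j > i \end{cases}$$

Applied before softmax: $\text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + M\right)$ ensures position $i$ only attends to positions $j \le i$.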
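
A small NumPy sketch of causal masking. The random `scores` matrix stands in for the scaled dot products, and `causal_mask` is a hypothetical helper name.

```python
import numpy as np

def causal_mask(n):
    """(n, n) additive mask: 0 where j <= i, -inf where j > i."""
    return np.triu(np.full((n, n), -np.inf), k=1)

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n = 4
scores = rng.normal(size=(n, n))              # stand-in for Q K^T / sqrt(d_k)
weights = softmax(scores + causal_mask(n))    # mask is added before the softmax
print(np.round(weights, 2))
```

The resulting weight matrix is lower triangular: the first position attends only to itself, and each later position distributes its attention over itself and earlier positions.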

BERT: Bidirectional Encoder Representations from Transformers

Architecture Overview

BERT (Devlin et al., 2019) uses only the Transformer encoder stack, enabling bidirectional context understanding — unlike GPT which is left-to-right only.

Key property: attention is bidirectional, so each token attends to all other tokens, to its left and to its right.

BERT Model Variants

Model	Layers ($L$)	Hidden size ($d$)	Heads ($h$)	Parameters
BERT-base	12	768	12	110M
BERT-large	24	1024	16	340M

Pre-Training Objectives

BERT is pre-trained on two self-supervised tasks:

1. Masked Language Modeling (MLM)

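Randomly select 15% of input tokens and predict the original token at each selected position:

$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log P(x_i \mid \tilde{x})$$

where $\mathcal{M}$ is the set of selected positions and $\tilde{x}$ is the corrupted input.

Of the selected tokens: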

80% are replaced with [MASK]
10% are replaced with a random token
10% are left unchanged

Always replacing the selected tokens with [MASK] would create a mismatch between pre-training and fine-tuning, because [MASK] never appears in downstream inputs. The 80/10/10 strategy mitigates this.
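
The 80/10/10 corruption scheme can be sketched in plain Python. This is a simplified illustration, not BERT's actual data pipeline: the function name `mask_tokens`, the toy vocabulary, and the use of `None` for unselected labels are assumptions, and special tokens are not excluded from selection here.

```python
import random

def mask_tokens(tokens, vocab, select_prob=0.15, rng=None):
    """Apply 80/10/10 corruption; return (corrupted, labels).

    labels[i] holds the original token at selected positions, None elsewhere.
    """
    rng = rng or random.Random()
    corrupted, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                          # position not selected for prediction
        labels[i] = tok
        r = rng.random()
        if r < 0.8:
            corrupted[i] = "[MASK]"           # 80%: replace with [MASK]
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return corrupted, labels

vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
tokens = ["the", "cat", "sat", "on", "the", "mat"] * 2000
corrupted, labels = mask_tokens(tokens, vocab, rng=random.Random(0))

selected = [i for i, lab in enumerate(labels) if lab is not None]
masked = sum(corrupted[i] == "[MASK]" for i in selected)
print(f"selected {len(selected) / len(tokens):.1%}, "
      f"of which [MASK] {masked / len(selected):.1%}")
```

The loss is computed only at positions where `labels` is not `None`, including those left unchanged, so the model cannot tell from the input alone which tokens it will be asked to predict.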

2. Next Sentence Prediction (NSP)

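Given sentence pairs $(A, B)$, predict whether $B$ actually follows $A$ in the corpus. During pre-training, $B$ is the true next sentence for half of the pairs and a randomly chosen sentence otherwise, and the prediction is made from the final hidden state of the [CLS] token.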

BERT Input Representation

The input to BERT is the sum of three embeddings:

$$E = E_{\text{token}} + E_{\text{position}} + E_{\text{segment}}$$

where:

Token Embedding: WordPiece tokenization (max 512 tokens)
Position Embedding: Learned (max 512 positions)
Segment Embedding: Distinguishes sentence A vs B (for NSP)

Special tokens: [CLS] (classification), [SEP] (separator), [MASK] (masked), [PAD] (padding)
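
The layout of a two-segment input can be sketched in plain Python. The helper `build_bert_input` is hypothetical and works on already-tokenized word lists; a real tokenizer would also map tokens to vocabulary ids before the three embeddings are looked up and summed.

```python
def build_bert_input(tokens_a, tokens_b=None, max_len=512):
    """Return (tokens, segment_ids, position_ids) for one or two segments."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)              # segment A, including [CLS] and [SEP]
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # segment B and its [SEP]
    if len(tokens) > max_len:
        raise ValueError("sequence exceeds the maximum number of positions")
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

tokens, segments, positions = build_bert_input(
    ["the", "cat", "sat"], ["it", "was", "tired"]
)
print(tokens)    # ['[CLS]', 'the', 'cat', 'sat', '[SEP]', 'it', 'was', 'tired', '[SEP]']
print(segments)  # [0, 0, 0, 0, 0, 1, 1, 1, 1]
```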

GPT: Autoregressive Language Modeling

GPT Architecture

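GPT (Radford et al., 2018) uses only the Transformer decoder stack (masked self-attention and feed-forward sub-layers, with no cross-attention because there is no encoder), training autoregressively to predict the next token:

$$\mathcal{L}_{\text{LM}} = -\sum_{t=1}^{n} \log P(x_t \mid x_1, \dots, x_{t-1}; \theta)$$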
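
The next-token objective can be sketched as an average negative log-likelihood computed from a model's output logits. The function `next_token_nll`, the toy vocabulary size, and the convention that row $t$ of the logits scores the token at position $t + 1$ are assumptions for this illustration.

```python
import numpy as np

def next_token_nll(logits, token_ids):
    """Average negative log-likelihood of token_ids[t + 1] given logits[t].

    logits: (n, vocab_size) array, where row t scores the token at t + 1.
    token_ids: length-n integer sequence.
    """
    logits = logits[:-1]                       # the last position has no target
    targets = np.asarray(token_ids[1:])
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

vocab_size = 10
token_ids = [3, 1, 4, 1, 5]
uniform_logits = np.zeros((len(token_ids), vocab_size))
print(next_token_nll(uniform_logits, token_ids))   # log(10) for a uniform model
```

A model that assigns equal probability to every token scores $\log(\text{vocab size})$; training lowers this value by concentrating probability on the tokens that actually follow.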

BERT vs GPT: Key Differences

Aspect	BERT	GPT
Architecture	Encoder only	Decoder only
Attention	Bidirectional	Unidirectional (causal)
Pre-training	MLM + NSP	Next token prediction
Fine-tuning	Classification, QA, NER	Generation, classification
Best for	Understanding tasks	Generation tasks

BERT Variants and Evolution

RoBERTa (Liu et al., 2019): Removed NSP, more data, larger batches
ALBERT (Lan et al., 2020): Parameter sharing, factorized embeddings
DistilBERT (Sanh et al., 2019): Knowledge distillation, about 60% of BERT-base's parameters while retaining about 97% of its language-understanding performance
DeBERTa (He et al., 2021): Disentangled attention, enhanced mask decoder

Fine-Tuning Pretrained Models

Transfer Learning Paradigm

The modern NLP paradigm follows a two-stage approach:

Pre-training: Learn general language representations on large corpora (self-supervised)
Fine-tuning: Adapt to specific downstream tasks (supervised)

Task-Specific Heads

Different tasks require different output layers:

Task	Input	Output	Head
Sentiment	`[CLS]` token	Binary/multi-class	Linear + softmax
NER	All tokens	Per-token labels	Linear (+ optional CRF)
QA	Passage + Question	Start/end span	Two linear layers
Similarity	Two `[CLS]`	Similarity score	Cosine / MLP
Generation	All tokens	Next token distribution	Language head

Hugging Face Transformers Implementation

Installation

pip install transformers datasets accelerate

Basic Usage

from transformers import AutoTokenizer, AutoModel

# Load pretrained BERT
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenize input
text = "The cat sat on the mat"
inputs = tokenizer(text, return_tensors="pt")

# Forward pass
outputs = model(**inputs)
# outputs.last_hidden_state: (batch, seq_len, hidden_dim), one vector per token
# outputs.pooler_output: (batch, hidden_dim), the [CLS] hidden state passed
#   through BERT's pooling layer (linear + tanh)

Fine-Tuning for Text Classification

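import numpy as np
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
)
from datasets import load_dataset

# Load dataset (IMDB movie reviews with binary sentiment labels)
dataset = load_dataset("imdb")

# Load tokenizer and model with a 2-class classification head
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
)

# Tokenize: pad or truncate every review to BERT's 512-token limit
def tokenize_fn(examples):
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=512,
    )

tokenized = dataset.map(tokenize_fn, batched=True)

# Accuracy metric passed to the Trainer below
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}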

# Training arguments
args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    weight_decay=0.01,
    evaluation_strategy="epoch",  # named eval_strategy in newer transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Trainer
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    compute_metrics=compute_metrics,
)

# Fine-tune
trainer.train()

Fine-Tuning for Named Entity Recognition

from datasets import load_dataset
from transformers import AutoModelForTokenClassification, AutoTokenizer
import numpy as np
import evaluate

# Load CoNLL-2003 (assumed to be available under this Hub id) and its tag names
dataset = load_dataset("conll2003")
label_list = dataset["train"].features["ner_tags"].feature.names
# seqeval is assumed as the metric here; it requires the seqeval package
metric = evaluate.load("seqeval")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Load model
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(label_list),  # 9 CoNLL-2003 NER tags
)

# Tokenize with alignment
def tokenize_and_align_labels(examples):
    tokenized = tokenizer(
        examples["tokens"],
        truncation=True,
        is_split_into_words=True,
    )
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        label_ids = []
        prev_word_id = None
        for word_id in word_ids:
            if word_id is None:
                label_ids.append(-100)  # Special tokens
            elif word_id != prev_word_id:
                label_ids.append(label[word_id])  # First subword of each word
            else:
                label_ids.append(-100)  # Remaining subword tokens
            prev_word_id = word_id
        labels.append(label_ids)
    tokenized["labels"] = labels
    return tokenized

tokenized_ner = dataset.map(tokenize_and_align_labels, batched=True)

# Compute metrics
def compute_metrics_ner(pred):
    predictions, labels = pred
    predictions = np.argmax(predictions, axis=2)
    # Drop ignored positions (-100) and map ids to tag strings for seqeval
    true_labels = [
        [label_list[l] for p, l in zip(pred_row, label_row) if l != -100]
        for pred_row, label_row in zip(predictions, labels)
    ]
    true_preds = [
        [label_list[p] for p, l in zip(pred_row, label_row) if l != -100]
        for pred_row, label_row in zip(predictions, labels)
    ]
    results = metric.compute(predictions=true_preds, references=true_labels)
    return results

Using Pipelines for Inference

from transformers import pipeline

# Sentiment Analysis
classifier = pipeline("sentiment-analysis")
result = classifier("This movie is amazing!")
# [{'label': 'POSITIVE', 'score': 0.9998}]  (illustrative score)

# Named Entity Recognition
# aggregation_strategy="simple" merges subword pieces into whole entities
ner = pipeline("ner", aggregation_strategy="simple")
result = ner("Apple was founded by Steve Jobs in California")
# [{'entity_group': 'ORG', 'word': 'Apple', ...},
#  {'entity_group': 'PER', 'word': 'Steve Jobs', ...},
#  {'entity_group': 'LOC', 'word': 'California', ...}]

# Question Answering
qa = pipeline("question-answering")
result = qa(
    question="When was BERT published?",
    context="BERT was published by Google in October 2018.",
)
# {'answer': 'October 2018', 'score': 0.95, ...}

# Text Generation (GPT-2)
generator = pipeline("text-generation", model="gpt2")
result = generator("The future of AI is", max_length=50)

Advanced: Custom Model Architecture

import torch
import torch.nn as nn
from transformers import AutoModel

class CustomTransformerModel(nn.Module):
    def __init__(self, model_name, num_classes):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.classifier = nn.Sequential(
            nn.Dropout(0.1),
            nn.Linear(self.encoder.config.hidden_size, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, num_classes),
        )

    def forward(self, input_ids, attention_mask=None):
        outputs = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )
        pooled = outputs.pooler_output  # pooled [CLS] representation (linear + tanh)
        logits = self.classifier(pooled)
        return logits

# Usage
model = CustomTransformerModel("bert-base-uncased", num_classes=5)

Key Takeaways

Self-attention enables parallel processing of sequences with direct connections between all positions, solving the sequential bottleneck of RNNs.
Multi-head attention allows the model to capture diverse relationships (syntactic, semantic, positional) simultaneously.
Positional encoding injects sequence order information into the permutation-equivariant self-attention mechanism.
BERT (encoder-only) excels at understanding tasks through bidirectional pre-training with MLM and NSP objectives.
GPT (decoder-only) excels at generation tasks through autoregressive next-token prediction.
Fine-tuning pretrained models on task-specific data achieves strong performance with far less labeled data and compute than training from scratch.
Hugging Face Transformers provides a unified API for accessing, fine-tuning, and deploying transformer models.

References

Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS.
Devlin, J., et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL.
Radford, A., et al. (2018). Improving Language Understanding by Generative Pre-Training. OpenAI.
Radford, A., et al. (2019). Language Models are Unsupervised Multitask Learners. OpenAI.
Brown, T., et al. (2020). Language Models are Few-Shot Learners. NeurIPS.
Liu, Y., et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692.
Wolf, T., et al. (2020). Transformers: State-of-the-Art Natural Language Processing. EMNLP.