Transformer Architecture

The Transformer Revolution

Introduced in the paper "Attention Is All You Need" (Vaswani et al., 2017), the Transformer architecture reshaped natural language processing and became the foundation for most modern large language models and many other generative AI systems.

Why Transformers?

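Parallelization: Unlike RNNs, which step through a sequence one token at a time, a Transformer processes all tokens of a sequence simultaneously during training
Long-range Dependencies: The attention mechanism lets every token attend directly to every other token, so relationships are captured across the entire sequence
Scalability: Performance tends to improve predictably as model size, training data, and compute grow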

Core Components

Self-Attention Mechanism
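
As a minimal sketch of the idea, self-attention can be written as single-head scaled dot-product attention over one sequence: each token is projected to a query, key, and value, and each output row is a softmax-weighted average of the values. The function name `self_attention`, the random projection matrices, and the tiny dimensions are assumptions chosen for illustration, not part of any library API.

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (illustrative sketch).

    x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) projection matrices.
    """
    q = x @ w_q  # queries, (seq_len, d_k)
    k = x @ w_k  # keys,    (seq_len, d_k)
    v = x @ w_v  # values,  (seq_len, d_k)

    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))  # (seq_len, seq_len)

    # Softmax over the key dimension, so each row of weights sums to 1
    weights = torch.softmax(scores, dim=-1)
    return weights @ v, weights

torch.manual_seed(0)
seq_len, d_model, d_k = 4, 8, 8  # hypothetical toy sizes
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_k) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Row i of `weights` describes how strongly token i attends to each token in the sequence, which is why the matrix is square with side `seq_len`.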

Positional Encoding

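Because Transformers process all tokens in parallel, the model has no built-in notion of token order. Positional encodings inject that order information; the sinusoidal scheme from the original paper can be implemented as follows: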

import torch
import math

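def positional_encoding(max_len, d_model):
    # Note: assumes d_model is even, so sine and cosine slots pair up
    pe = torch.zeros(max_len, d_model)
    # Column vector of positions 0..max_len-1, shape (max_len, 1)
    position = torch.arange(0, max_len).unsqueeze(1).float()
    # Frequencies 1 / 10000^(2i / d_model), computed in log space
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))

    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions

    return pe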

# Example: 100 positions, 512 dimensions
pe = positional_encoding(100, 512)
print(f"Positional encoding shape: {pe.shape}")

Multi-Head Attention

Multi-head attention splits the model dimension into several heads, each with its own learned projections, so the model can jointly attend to information from different representation subspaces:

import math  # needed for math.sqrt in scaled_dot_product_attention

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        self.d_k = d_model // num_heads

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)

        attention_weights = torch.softmax(scores, dim=-1)
        output = torch.matmul(attention_weights, V)
        return output, attention_weights

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        Q = self.W_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        output, attention_weights = self.scaled_dot_product_attention(Q, K, V, mask)

        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        return self.W_o(output)
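
The `mask` argument above is compared against 0, so positions marked 0 are hidden from attention. As a hedged illustration, here is one way a causal (look-ahead) mask could be built with `torch.tril` and what it does to the softmax. The uniform zero scores and the 4-token length are assumptions made so the effect is easy to read.

```python
import torch

seq_len = 4  # hypothetical toy length

# Lower-triangular matrix: position i may attend to positions 0..i only
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

# Uniform raw scores, purely for illustration
scores = torch.zeros(seq_len, seq_len)

# Same convention as the class above: entries where mask == 0 get a large negative score
masked = scores.masked_fill(causal_mask == 0, -1e9)

# After softmax, masked positions receive zero weight
weights = torch.softmax(masked, dim=-1)
print(weights)
```

A mask of shape `(seq_len, seq_len)` broadcasts across the batch and head dimensions of the scores computed in `scaled_dot_product_attention`, so the same mask can be shared by every head.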

Model Variants

Variant	Architecture	Use Case
BERT	Encoder-only	Understanding, Classification
GPT	Decoder-only	Text Generation
T5	Encoder-Decoder	Translation, Summarization
BART	Encoder-Decoder	Summarization, QA

Key Hyperparameters

# Example configuration (d_model, num_heads, num_layers, d_ff, and vocab_size match GPT-2 small)
config = {
    "d_model": 768,      # Embedding dimension
    "num_heads": 12,     # Number of attention heads
    "num_layers": 12,    # Number of transformer layers
    "d_ff": 3072,        # Feed-forward dimension
    "vocab_size": 50257, # Vocabulary size
    "max_seq_len": 2048, # Maximum sequence length
    "dropout": 0.1       # Dropout rate
}
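
To show how these values relate to one another, the following sketch derives the per-head dimension and a rough parameter count from the configuration. It assumes four `d_model x d_model` attention projections with biases (as in the `MultiHeadAttention` class above) and a feed-forward block made of two linear layers with biases (`d_model -> d_ff -> d_model`). That feed-forward layout is an assumption, since it is not spelled out here. Layer norms and positional embeddings are ignored, so the totals are estimates only.

```python
config = {
    "d_model": 768,
    "num_heads": 12,
    "num_layers": 12,
    "d_ff": 3072,
    "vocab_size": 50257,
    "max_seq_len": 2048,
    "dropout": 0.1,
}

# d_model must divide evenly across the heads
assert config["d_model"] % config["num_heads"] == 0
d_k = config["d_model"] // config["num_heads"]  # per-head dimension

d, d_ff = config["d_model"], config["d_ff"]

# W_q, W_k, W_v, W_o: four (d x d) weight matrices plus biases
attn_params = 4 * (d * d + d)

# Assumed feed-forward block: Linear(d, d_ff) followed by Linear(d_ff, d), with biases
ffn_params = (d * d_ff + d_ff) + (d_ff * d + d)

per_layer = attn_params + ffn_params
embedding_params = config["vocab_size"] * d
total_estimate = config["num_layers"] * per_layer + embedding_params

print(f"d_k per head: {d_k}")
print(f"Parameters per layer (approx.): {per_layer:,}")
print(f"Estimated total (approx.): {total_estimate:,}")
```

With these values the feed-forward block holds roughly twice as many parameters as the attention projections, which is typical when `d_ff` is four times `d_model`.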

Summary

The Transformer architecture's self-attention mechanism enables efficient parallel processing and captures long-range dependencies, making it the foundation of modern generative AI.

Next: We'll explore the attention mechanism in greater detail.
