Transformer Architecture

🟒 Free Lesson

[Figure: Transformer Architecture Overview. Encoder: Embeddings, Multi-Head Attention, Add & Norm, Feed Forward, Add & Norm. Decoder: Masked Attention, Add & Norm, Cross-Attention (over the encoder output), Add & Norm, Feed Forward, Add & Norm, then Linear + Softmax to produce the output.]

The Transformer Revolution

Introduced in the paper "Attention Is All You Need" (Vaswani et al., 2017), the Transformer architecture replaced recurrence with attention, reshaped natural language processing, and now underlies most modern large language models and many other generative AI systems.

Why Transformers?

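  • Parallelization: Unlike RNNs, which step through a sequence one token at a time, a Transformer processes all tokens of a sequence at once during training
  • Long-range Dependencies: Attention connects every token directly to every other token, so relationships across the entire sequence can be captured within a single layer
  • Scalability: Performance tends to improve predictably as model size, training data, and compute grow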

Core Components

Self-Attention Mechanism

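[Figure: Self-Attention Flow. The input X (embeddings) is multiplied by W_Q, W_K, and W_V to give Queries, Keys, and Values. Attention scores Q x K^T are divided by sqrt(d_k) to give scaled scores, Softmax turns them into weights, and the weights multiply V to produce the output: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.]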
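
The flow above can be sketched for a single attention head in a few lines of PyTorch. This is a minimal illustration, not a production layer: the function name self_attention, the random projection matrices, and the tensor sizes are assumptions chosen only to make the shapes visible.

```python
import math
import torch

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention; x has shape (seq_len, d_model)."""
    q = x @ w_q                                # queries: (seq_len, d_k)
    k = x @ w_k                                # keys:    (seq_len, d_k)
    v = x @ w_v                                # values:  (seq_len, d_k)
    scores = q @ k.transpose(-2, -1)           # attention scores: (seq_len, seq_len)
    scaled = scores / math.sqrt(k.size(-1))    # scaled scores: divide by sqrt(d_k)
    weights = torch.softmax(scaled, dim=-1)    # each row is a distribution over positions
    return weights @ v, weights

torch.manual_seed(0)
seq_len, d_model, d_k = 4, 8, 8                # illustrative sizes (assumed)
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_k) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Each row of weights sums to 1, so every output vector is a weighted average of the value vectors.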

Positional Encoding

Since Transformers process all tokens in parallel, positional encodings inject sequence order information:

import torch
import math

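def positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) tensor of sinusoidal positional encodings."""
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len).unsqueeze(1).float()  # (max_len, 1)
    # One frequency per sin/cos pair: 1 / 10000^(2i / d_model)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))

    pe[:, 0::2] = torch.sin(position * div_term)                  # even columns
    pe[:, 1::2] = torch.cos(position * div_term[: d_model // 2])  # odd columns (slice handles odd d_model)

    return pe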

# Example: 100 positions, 512 dimensions
pe = positional_encoding(100, 512)
print(f"Positional encoding shape: {pe.shape}")
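
As a usage sketch, the encodings are typically added to the token embeddings before the first layer. The example below repeats the function so that it runs on its own; the vocabulary size, dimensions, and token ids are assumptions for illustration, and the embedding scaling used in some implementations is omitted.

```python
import math
import torch
import torch.nn as nn

def positional_encoding(max_len, d_model):
    # Same sinusoidal scheme as the function above (d_model assumed even here)
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * -(math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

vocab_size, d_model, max_len = 1000, 16, 50         # illustrative sizes (assumed)
embedding = nn.Embedding(vocab_size, d_model)
pe = positional_encoding(max_len, d_model)

token_ids = torch.tensor([[5, 42, 7, 42]])          # (batch=1, seq_len=4); token 42 appears twice
x = embedding(token_ids) + pe[: token_ids.size(1)]  # encodings broadcast over the batch dimension

# Position 0 is sin(0) = 0 in even columns and cos(0) = 1 in odd columns
print(pe[0, :4])
# The two occurrences of token 42 sit at different positions, so their inputs differ
print(torch.allclose(x[0, 1], x[0, 3]))
```

Without the added encodings, the two occurrences of token 42 would enter the model as identical vectors and attention could not tell them apart by position.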

Multi-Head Attention

Multi-head attention allows the model to jointly attend to information from different representation subspaces:

import math  # needed for math.sqrt in scaled_dot_product_attention
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.d_model = d_model
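        # Each head works on an equal slice of the model dimension
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_k = d_model // num_heads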

        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)

        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)

        attention_weights = torch.softmax(scores, dim=-1)
        output = torch.matmul(attention_weights, V)
        return output, attention_weights

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        Q = self.W_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        output, attention_weights = self.scaled_dot_product_attention(Q, K, V, mask)

        output = output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)
        return self.W_o(output)
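
The mask argument is what turns this layer into the decoder's masked attention. The sketch below builds a causal (lower-triangular) mask and applies it with the same masked_fill convention as scaled_dot_product_attention above. The helper name causal_mask and the tensor sizes are assumptions for illustration.

```python
import math
import torch

def causal_mask(seq_len):
    # Lower-triangular matrix of ones: position i may attend to positions 0..i only
    return torch.tril(torch.ones(seq_len, seq_len))

torch.manual_seed(0)
batch, heads, seq_len, d_k = 2, 4, 5, 8          # illustrative sizes (assumed)
Q = torch.randn(batch, heads, seq_len, d_k)
K = torch.randn(batch, heads, seq_len, d_k)

scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
mask = causal_mask(seq_len)                      # (seq_len, seq_len), broadcasts over batch and heads
scores = scores.masked_fill(mask == 0, -1e9)     # masked positions get a very large negative score
weights = torch.softmax(scores, dim=-1)

print(weights[0, 0])                             # entries above the diagonal are numerically zero
```

Because the masked scores are so negative, softmax assigns them effectively zero weight, so each position only mixes information from itself and earlier positions.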

Model Variants

Variant   Architecture      Use Case
BERT      Encoder-only      Understanding, Classification
GPT       Decoder-only      Text Generation
T5        Encoder-Decoder   Translation, Summarization
BART      Encoder-Decoder   Summarization, QA

Key Hyperparameters

# Typical Transformer configuration
config = {
    "d_model": 768,      # Embedding dimension
    "num_heads": 12,     # Number of attention heads
    "num_layers": 12,    # Number of transformer layers
    "d_ff": 3072,        # Feed-forward dimension
    "vocab_size": 50257, # Vocabulary size
    "max_seq_len": 2048, # Maximum sequence length
    "dropout": 0.1       # Dropout rate
}
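
To see what these numbers imply, the following back-of-the-envelope sketch derives the per-head dimension and a rough weight count from the configuration. It is an estimate under stated assumptions: biases, layer-norm parameters, and positional embeddings are ignored, and the output projection is assumed to share weights with the token embedding.

```python
config = {
    "d_model": 768,
    "num_heads": 12,
    "num_layers": 12,
    "d_ff": 3072,
    "vocab_size": 50257,
}

d_model, d_ff = config["d_model"], config["d_ff"]

# Each head attends in a d_model / num_heads dimensional subspace
head_dim = d_model // config["num_heads"]

# Attention: four d_model x d_model projections (W_q, W_k, W_v, W_o)
attn_params = 4 * d_model * d_model
# Feed-forward: d_model -> d_ff and d_ff -> d_model
ffn_params = 2 * d_model * d_ff
per_layer = attn_params + ffn_params

# Token embedding table (assumed tied with the output projection)
embedding_params = config["vocab_size"] * d_model

total = config["num_layers"] * per_layer + embedding_params
print(f"head_dim={head_dim}, per_layer={per_layer:,}, total={total:,}")
```

Under these assumptions the configuration comes to roughly 124 million weights, about two thirds of which sit in the stacked layers, with the rest in the embedding table.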

Summary

The Transformer architecture's self-attention mechanism enables efficient parallel processing and captures long-range dependencies, making it the foundation of modern generative AI.

Next: We'll explore the attention mechanism in greater detail.

⭐
