GPT Architecture — Complete Guide

GPT (Generative Pre-trained Transformer) is a decoder-only transformer trained to predict the next token. The same design underlies the GPT model family, including GPT-4 and the models behind ChatGPT, and most other modern LLMs.

Decoder-Only Architecture

GPT Architecture:

Input tokens → Embedding + Position → N Decoder Blocks → Output logits

Each Decoder Block:
┌─────────────────────────────────┐
│  Masked Self-Attention          │
│  (can only attend to past)      │
│  + Add & Norm                   │
├─────────────────────────────────┤
│  Feed-Forward Network           │
│  + Add & Norm                   │
└─────────────────────────────────┘
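A minimal NumPy sketch of one decoder block may help make the diagram concrete. It follows the Add & Norm ordering shown above, with a single attention head, no biases, and no learnable layer-norm parameters. The names decoder_block and params and the weight shapes are assumptions made for illustration, not any particular GPT implementation. GPT-2 and later models move the layer norm to the input of each sub-layer, and GPT models use GELU rather than the ReLU used here for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector to zero mean, unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract row max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def decoder_block(x, params):
    """x has shape (seq_len, d_model); params is a hypothetical weight dict."""
    seq_len, d_model = x.shape
    q, k, v = x @ params["wq"], x @ params["wk"], x @ params["wv"]
    scores = q @ k.T / np.sqrt(d_model)
    # True above the diagonal marks future positions, which are masked out.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    attn = softmax(scores) @ v @ params["wo"]
    x = layer_norm(x + attn)                      # Add & Norm
    hidden = np.maximum(0.0, x @ params["w1"])    # feed-forward, ReLU for brevity
    return layer_norm(x + hidden @ params["w2"])  # Add & Norm

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 5
params = {name: rng.normal(scale=0.1, size=shape) for name, shape in {
    "wq": (d_model, d_model), "wk": (d_model, d_model),
    "wv": (d_model, d_model), "wo": (d_model, d_model),
    "w1": (d_model, d_ff), "w2": (d_ff, d_model)}.items()}
x = rng.normal(size=(seq_len, d_model))
y = decoder_block(x, params)
print(y.shape)  # (5, 8)
```

The block maps a (seq_len, d_model) array to an array of the same shape, which is what allows N such blocks to be stacked.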

Mask prevents looking ahead:
Token 4 can attend to: 1, 2, 3, 4 (NOT 5, 6, 7)
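The mask can be sketched as a lower-triangular boolean matrix. The snippet below is an illustration in NumPy, with function names chosen here for clarity (they are not taken from any library). It reproduces the seven-token example: row 4 of the mask allows positions 1 through 4 only.

```python
import numpy as np

def causal_mask(seq_len):
    """Entry [i, j] is True when position i may attend to position j."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_attention_weights(scores):
    """Apply the causal mask to raw scores, then softmax over each row."""
    mask = causal_mask(scores.shape[0])
    scores = np.where(mask, scores, -np.inf)  # future positions get -inf
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                  # exp(-inf) is exactly 0
    return weights / weights.sum(axis=-1, keepdims=True)

mask = causal_mask(7)
# Token 4 is row index 3; report positions using 1-based numbering.
visible = [j + 1 for j in range(7) if mask[3, j]]
print(visible)  # [1, 2, 3, 4]

weights = masked_attention_weights(np.zeros((7, 7)))
print(weights[3])  # 0.25 on the first four positions, 0 on the rest
```

Setting masked scores to negative infinity before the softmax gives those positions a weight of exactly zero, so no information flows from future tokens.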

Autoregressive Generation

Generation process:

Step 1: "The" → predict next → "cat"
Step 2: "The cat" → predict next → "sat"
Step 3: "The cat sat" → predict next → "on"
Step 4: "The cat sat on" → predict next → "the"
...

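Each prediction uses ALL previous tokens as context (up to the context window limit), and each predicted token is appended to the input for the next step.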
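The loop above can be sketched with greedy decoding, where the highest-scoring token is chosen at every step. The function toy_logits is a hypothetical stand-in for a trained model, hard-wired so that the output matches the example. A real model would compute its logits from the full prefix with the decoder stack.

```python
import numpy as np

VOCAB = ["The", "cat", "sat", "on", "the", "mat", "<eos>"]

def toy_logits(token_ids):
    """Hypothetical model stub: returns one score per vocabulary entry.

    It simply favors the vocabulary entry whose index equals the prefix
    length, which is enough to reproduce the walkthrough above.
    """
    logits = np.zeros(len(VOCAB))
    logits[min(len(token_ids), len(VOCAB) - 1)] = 1.0
    return logits

def generate(prompt_ids, max_new_tokens, logits_fn=toy_logits):
    ids = list(prompt_ids)
    eos = VOCAB.index("<eos>")
    for _ in range(max_new_tokens):
        # Greedy decoding: feed the whole prefix, pick the top-scoring token.
        next_id = int(np.argmax(logits_fn(ids)))
        ids.append(next_id)
        if next_id == eos:
            break
    return ids

out = generate([0], max_new_tokens=4)
print(" ".join(VOCAB[i] for i in out))  # The cat sat on the
```

Greedy selection is only one option. Sampling from the softmax of the logits, optionally with a temperature, follows the same loop with a different selection rule.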

Scaling Laws

Key insight: Performance scales with:

1. Model size (parameters)
   More parameters → better performance
   GPT-1: 117M → GPT-2: 1.5B → GPT-3: 175B → GPT-4: undisclosed (unofficial estimates: ~1.8T)

2. Dataset size (tokens)
   More data → better performance
   GPT-3: 300B tokens

3. Compute (FLOPs)
   More compute → better performance
   GPT-3: ~3.14 × 10²³ FLOPs

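Power law: when the other two factors are not the bottleneck, loss falls as a power of each one:
L(N) ∝ N^(-α),  L(D) ∝ D^(-β),  L(C) ∝ C^(-γ)
(L = test loss, N = parameters, D = training tokens, C = training compute)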
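As a rough check on the numbers above, training compute is commonly approximated as C ≈ 6 · N · D FLOPs (about six floating-point operations per parameter per training token). That rule of thumb is an assumption brought in here, not something stated in this guide. The sketch below applies it to the GPT-3 figures and also evaluates a single-factor power law of the form L(x) = (x_c / x)^α. The constants passed to power_law_loss are illustrative values, not fitted ones.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute with the common C ≈ 6 * N * D rule."""
    return 6.0 * n_params * n_tokens

def power_law_loss(x, x_c, alpha):
    """Single-factor power law: loss as a function of one scaled quantity."""
    return (x_c / x) ** alpha

gpt3_flops = training_flops(175e9, 300e9)
print(f"{gpt3_flops:.2e}")  # 3.15e+23

# Illustrative constants only: how much does doubling model size help?
small = power_law_loss(1e9, x_c=1e13, alpha=0.08)
large = power_law_loss(2e9, x_c=1e13, alpha=0.08)
print(large / small)  # equals 2 ** -0.08, roughly a 5% reduction in loss
```

The estimate of about 3.15 × 10²³ FLOPs agrees closely with the GPT-3 compute figure listed above. The second calculation shows why scaling is expensive: with a small exponent, each doubling buys only a few percent of loss reduction.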

Training

Pre-training:
├─ Predict next token on massive text corpus
├─ Objective: minimize cross-entropy loss
├─ Hundreds of billions to trillions of tokens (GPT-3: 300B)
└─ Weeks/months on thousands of GPUs
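The pre-training objective can be sketched in a few lines of NumPy. The logits at position t are scored against the token at position t + 1, so the targets are the input sequence shifted by one. The function name next_token_loss and the toy shapes are assumptions for illustration. Real training computes the same quantity over large batches with an autodiff framework.

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Mean cross-entropy of next-token predictions.

    logits:    (seq_len, vocab_size) scores, one row per input position
    token_ids: (seq_len,) integer token ids of the same sequence
    """
    preds = logits[:-1]                   # the last position has no target
    targets = np.asarray(token_ids)[1:]   # inputs shifted left by one
    preds = preds - preds.max(axis=-1, keepdims=True)
    log_probs = preds - np.log(np.exp(preds).sum(axis=-1, keepdims=True))
    picked = log_probs[np.arange(len(targets)), targets]
    return float(-picked.mean())

tokens = np.array([0, 1, 2, 3])
uniform = np.zeros((4, 5))                # a model that knows nothing
print(next_token_loss(uniform, tokens))   # ln(5), about 1.609

confident = np.full((4, 5), -10.0)
confident[np.arange(3), tokens[1:]] = 10.0  # high score on each true next token
print(next_token_loss(confident, tokens))   # close to 0
```

A model that spreads probability uniformly over a vocabulary of size V has loss ln(V). Training pushes the loss below that baseline by placing more probability on the token that actually comes next.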

Fine-tuning:
├─ Supervised fine-tuning (SFT)
├─ RLHF (Reinforcement Learning from Human Feedback)
└─ Constitutional AI (Anthropic's approach, used for Claude rather than GPT models)

Key Takeaways

GPT is a decoder-only transformer — predicts next token
Masked attention prevents looking ahead
Autoregressive generation produces text one token at a time
Scaling laws predict performance from size, data, compute
Pre-training + fine-tuning is the two-stage approach
RLHF aligns models with human preferences
GPT-4 is reported, but not officially confirmed, to use a MoE (Mixture of Experts) architecture
Context window limits how much text the model can process
