GPT Architecture — Complete Guide
GPT (Generative Pre-trained Transformer) is a decoder-only transformer trained to predict the next token. GPT models power ChatGPT, and the same decoder-only design underlies most modern LLMs.
Decoder-Only Architecture
GPT Architecture:
Input tokens → Embedding + Position → N Decoder Blocks → Output logits
Each Decoder Block:
┌─────────────────────────────────┐
│ Masked Self-Attention           │
│ (current and past tokens only)  │
│ + Add & Norm                    │
├─────────────────────────────────┤
│ Feed-Forward Network            │
│ + Add & Norm                    │
└─────────────────────────────────┘
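The block in the diagram can be sketched in NumPy as follows. This is a minimal single-head illustration, not a reference implementation: the parameter names (`w_q`, `w_k`, `w_v`, `w_o`, `w_1`, `w_2`), the sizes, and the random weights are assumptions made for the example. Biases and the learned scale/shift of layer normalization are omitted, and ReLU stands in for the GELU activation used in GPT models. The sketch follows the diagram's order (sublayer, then Add & Norm); GPT-2 and later models apply the normalization before each sublayer instead.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's features to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def causal_attention(x, w_q, w_k, w_v, w_o):
    """Single-head self-attention where position i only sees positions <= i."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    mask = np.tril(np.ones(scores.shape, dtype=bool))
    scores = np.where(mask, scores, -np.inf)          # hide future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return (weights @ v) @ w_o

def feed_forward(x, w_1, w_2):
    """Position-wise two-layer MLP (ReLU used here for brevity)."""
    return np.maximum(x @ w_1, 0.0) @ w_2

def decoder_block(x, p):
    """Sublayer, residual add, then layer norm, as drawn in the diagram."""
    x = layer_norm(x + causal_attention(x, p["w_q"], p["w_k"], p["w_v"], p["w_o"]))
    x = layer_norm(x + feed_forward(x, p["w_1"], p["w_2"]))
    return x

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 16, 64, 5                    # arbitrary toy sizes
shapes = {"w_q": (d_model, d_model), "w_k": (d_model, d_model),
          "w_v": (d_model, d_model), "w_o": (d_model, d_model),
          "w_1": (d_model, d_ff), "w_2": (d_ff, d_model)}
params = {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}

x = rng.normal(size=(seq_len, d_model))               # stand-in for embedded tokens
y = decoder_block(x, params)
print(y.shape)  # (5, 16)
```

The output has the same shape as the input, which is what lets N such blocks be stacked one after another.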
Mask prevents looking ahead:
Token 4 can attend to: 1, 2, 3, 4 (NOT 5, 6, 7)
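As a small illustration of the mask, the sketch below builds a lower-triangular mask for 7 tokens and applies it to attention scores before the softmax. The random `scores` matrix is only a stand-in for real query-key scores, and the variable names are chosen for this example.

```python
import numpy as np

seq_len = 7
# mask[i, j] is True when position i may attend to position j (j <= i).
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))     # stand-in for query-key scores
masked = np.where(mask, scores, -np.inf)         # future positions get -inf
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row

allowed = np.flatnonzero(mask[3]) + 1            # token 4 is row index 3
print(allowed)          # [1 2 3 4]
print(weights[3, 4:])   # [0. 0. 0.]
```

Setting the masked scores to negative infinity makes their softmax weight exactly zero, so tokens 5, 6, and 7 contribute nothing to token 4's output.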
Autoregressive Generation
Generation process:
Step 1: "The" → predict next → "cat"
Step 2: "The cat" → predict next → "sat"
Step 3: "The cat sat" → predict next → "on"
Step 4: "The cat sat on" → predict next → "the"
...
Each prediction uses ALL previous tokens as context.
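The loop above can be sketched with greedy decoding. The `toy_logits` function and its lookup table are hypothetical stand-ins for a trained model, included only so the example runs; a real GPT computes the logits with its decoder blocks and often samples from them rather than always taking the maximum.

```python
# Hypothetical stand-in for a trained model: maps the full context to scores
# over a tiny vocabulary. A real GPT computes these with a neural network.
VOCAB = ["The", "cat", "sat", "on", "the", "mat", "<eos>"]
TOY_TABLE = {
    ("The",): "cat",
    ("The", "cat"): "sat",
    ("The", "cat", "sat"): "on",
    ("The", "cat", "sat", "on"): "the",
    ("The", "cat", "sat", "on", "the"): "mat",
}

def toy_logits(context):
    """Return one score per vocabulary entry, given all previous tokens."""
    target = TOY_TABLE.get(tuple(context), "<eos>")
    return [1.0 if word == target else 0.0 for word in VOCAB]

def generate(prompt, max_new_tokens=10):
    """Greedy decoding: append the highest-scoring token, then repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = toy_logits(tokens)          # conditioned on the whole sequence so far
        next_token = VOCAB[logits.index(max(logits))]
        if next_token == "<eos>":            # stop when the model signals the end
            break
        tokens.append(next_token)
    return tokens

print(" ".join(generate(["The"])))  # The cat sat on the mat
```

Each iteration feeds the entire sequence back in, which is why generation cost grows with the length of the text produced so far.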
Scaling Laws
Key insight: Performance scales with:
1. Model size (parameters)
More parameters → better performance
GPT-1: 117M → GPT-3: 175B → GPT-4: undisclosed (~1.8T is an unconfirmed estimate)
2. Dataset size (tokens)
More data → better performance
GPT-3: 300B tokens
3. Compute (FLOPs)
More compute → better performance
GPT-3: ~3.14 × 10²³ FLOPs
Power laws (each holds when the other two factors are not the bottleneck):
Loss ∝ N^(-α), Loss ∝ D^(-β), Loss ∝ C^(-γ)
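The three factors are not independent. A common rule of thumb estimates training compute as C ≈ 6·N·D (about 6 FLOPs per parameter per training token). The sketch below applies it to the GPT-3 figures quoted above; the function name is chosen for this example, and the result is an approximation rather than an exact accounting.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute with the C ≈ 6·N·D rule of thumb."""
    return 6 * n_params * n_tokens

# GPT-3 figures from the text: 175B parameters, 300B training tokens.
flops = training_flops(175e9, 300e9)
print(f"{flops:.2e}")  # 3.15e+23
```

The estimate lands close to the ~3.14 × 10²³ FLOPs listed for GPT-3, which suggests the quoted numbers are mutually consistent.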
Training
Pre-training:
├─ Predict next token on massive text corpus
├─ Objective: minimize cross-entropy loss
├─ Hundreds of billions to trillions of tokens (GPT-3: 300B; GPT-4's count is undisclosed)
└─ Weeks/months on thousands of GPUs
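The pre-training objective can be sketched as follows: the targets are the input tokens shifted by one position, and the loss is the mean negative log-probability the model assigns to each actual next token. The token ids, vocabulary size, and all-zero logits are made up for the example; in practice the logits come from the model.

```python
import numpy as np

def next_token_cross_entropy(logits, targets):
    """Mean cross-entropy between predicted distributions and actual next tokens.

    logits:  (seq_len, vocab_size) scores, one row per position.
    targets: (seq_len,) integer ids of the token that follows each position.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

token_ids = np.array([0, 1, 2, 3])                 # e.g. "The cat sat on" with made-up ids
inputs, targets = token_ids[:-1], token_ids[1:]    # targets are the inputs shifted by one

vocab_size = 5
uniform_logits = np.zeros((len(inputs), vocab_size))  # a model that knows nothing
loss = next_token_cross_entropy(uniform_logits, targets)
print(loss)  # ln(5) ≈ 1.609
```

A model that spreads probability evenly over the vocabulary scores ln(vocab_size); training pushes the loss below that by concentrating probability on the tokens that actually follow.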
Fine-tuning:
├─ Supervised fine-tuning (SFT)
├─ RLHF (Reinforcement Learning from Human Feedback)
└─ Constitutional AI (an Anthropic method based on AI feedback; not part of GPT training)
Key Takeaways
- GPT is a decoder-only transformer — predicts next token
- Masked attention prevents looking ahead
- Autoregressive generation produces text one token at a time
- Scaling laws predict performance from model size, data, and compute
- Pre-training + fine-tuning is the two-stage approach
- RLHF aligns models with human preferences
- GPT-4 is widely reported, though not officially confirmed, to use a Mixture of Experts (MoE) architecture
- The context window limits how much text the model can process at once