GPT Architecture — Decoder-Only Transformers Complete Guide

GPT (Generative Pre-trained Transformer) is a decoder-only transformer trained to predict the next token. The GPT series powers ChatGPT, and the same decoder-only design underlies most modern LLMs.


Decoder-Only Architecture

GPT Architecture:

Input tokens → Embedding + Position → N Decoder Blocks → Output logits
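
The "Embedding + Position" step can be sketched as two lookup tables whose rows are added together. This is a minimal illustration, assuming learned position embeddings and toy sizes; `VOCAB_SIZE`, `MAX_LEN`, `D_MODEL`, `random_table`, and `embed` are hypothetical names, not part of any library.

```python
import random

random.seed(0)
VOCAB_SIZE, MAX_LEN, D_MODEL = 10, 8, 4  # hypothetical toy sizes

def random_table(rows, cols):
    """Stand-in for a learned embedding table: small random values."""
    return [[random.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

token_emb = random_table(VOCAB_SIZE, D_MODEL)  # one row per vocabulary entry
pos_emb = random_table(MAX_LEN, D_MODEL)       # one row per position

def embed(token_ids):
    """Add each token's embedding to the embedding of its position."""
    return [
        [t + p for t, p in zip(token_emb[tok], pos_emb[pos])]
        for pos, tok in enumerate(token_ids)
    ]

x = embed([3, 1, 3])
print(len(x), len(x[0]))  # 3 4
print(x[0] != x[2])       # True: same token, different positions
```

The same token ID at two positions yields two different vectors, which is how the decoder blocks can tell word order apart.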

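Each Decoder Block (shown in the original post-norm layout; GPT-2 and later apply LayerNorm before each sub-layer instead):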
┌─────────────────────────────────┐
│  Masked Self-Attention          │
│  (can only attend to past)      │
│  + Add & Norm                   │
├─────────────────────────────────┤
│  Feed-Forward Network           │
│  + Add & Norm                   │
└─────────────────────────────────┘

Mask prevents looking ahead:
Token 4 can attend to: 1, 2, 3, 4 (NOT 5, 6, 7)
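
The mask can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions: the attention scores are hypothetical (all equal) so that only the mask shapes the result, and `causal_mask` and `masked_softmax` are names invented for this example.

```python
import math

def causal_mask(n):
    """Return an n x n matrix: True where query i may attend to key j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax with disallowed positions forced to weight 0."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        # Replace future positions with -inf so exp() gives exactly 0.
        masked = [s if ok else float("-inf") for s, ok in zip(row_scores, row_mask)]
        peak = max(masked)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

n = 7
mask = causal_mask(n)
scores = [[0.0] * n for _ in range(n)]  # hypothetical raw attention scores
weights = masked_softmax(scores, mask)

# Token 4 (index 3) spreads its attention over tokens 1-4 only.
print(weights[3])  # [0.25, 0.25, 0.25, 0.25, 0.0, 0.0, 0.0]
```

Because the masked scores become exactly zero after the softmax, information from tokens 5, 6, and 7 cannot leak into the representation of token 4.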

Autoregressive Generation

Generation process:

Step 1: "The" → predict next → "cat"
Step 2: "The cat" → predict next → "sat"
Step 3: "The cat sat" → predict next → "on"
Step 4: "The cat sat on" → predict next → "the"
...

Each prediction uses all previous tokens as context, up to the limit of the context window.
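
The loop above can be sketched with greedy decoding. `TOY_MODEL` is a hypothetical lookup table standing in for a trained network, and `next_token_probs` and `generate` are illustrative names; real systems usually sample from the distribution rather than always taking the top token.

```python
# Hypothetical stand-in for a trained model: maps the full context so far
# to a probability for each candidate next token.
TOY_MODEL = {
    ("The",): {"cat": 0.9, "dog": 0.1},
    ("The", "cat"): {"sat": 0.8, "ran": 0.2},
    ("The", "cat", "sat"): {"on": 0.7, "down": 0.3},
    ("The", "cat", "sat", "on"): {"the": 0.6, "a": 0.4},
}

def next_token_probs(context):
    """Look up next-token probabilities; unknown contexts end the sequence."""
    return TOY_MODEL.get(tuple(context), {"<eos>": 1.0})

def generate(prompt, max_new_tokens=10):
    """Greedy decoding: repeatedly append the most probable next token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)  # conditions on every token so far
        best = max(probs, key=probs.get)
        if best == "<eos>":
            break
        tokens.append(best)
    return tokens

print(" ".join(generate(["The"])))  # The cat sat on the
```

Each iteration feeds the whole sequence back in, so the output of one step becomes part of the input to the next.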

Scaling Laws

Key insight: Performance scales with:

1. Model size (parameters)
   More parameters → better performance
   GPT-1: 117M → GPT-2: 1.5B → GPT-3: 175B → GPT-4: undisclosed (~1.8T is an unconfirmed estimate)

2. Dataset size (tokens)
   More data → better performance
   GPT-3: 300B tokens

3. Compute (FLOPs)
   More compute → better performance
   GPT-3: ~3.14 × 10²³ FLOPs

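Power laws: when the other factors are not the bottleneck, loss falls as a power of each one separately:

   L(N) ∝ N^(-α),   L(D) ∝ D^(-β),   L(C) ∝ C^(-γ)

The three are not independent: training compute is roughly C ≈ 6 × N × D.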
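
A quick arithmetic check ties the three quantities together. The sketch below assumes the common approximation that training compute is about 6 FLOPs per parameter per token, and one widely used parametric loss form, L(N, D) = E + A/N^α + B/D^β. The constants passed to `parametric_loss` are placeholders chosen only for illustration, not fitted values.

```python
def training_flops(n_params, n_tokens):
    """Rough training compute, assuming C ≈ 6 * N * D."""
    return 6 * n_params * n_tokens

def parametric_loss(n_params, n_tokens, E, A, B, alpha, beta):
    """L(N, D) = E + A / N**alpha + B / D**beta, with caller-supplied constants."""
    return E + A / n_params**alpha + B / n_tokens**beta

# GPT-3: 175B parameters, 300B tokens (figures quoted above).
gpt3_flops = training_flops(175e9, 300e9)
print(f"{gpt3_flops:.2e}")  # 3.15e+23, close to the ~3.14e23 quoted above

# Placeholder constants, chosen only for illustration (not fitted values).
CONSTS = dict(E=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28)
small = parametric_loss(1e9, 20e9, **CONSTS)
large = parametric_loss(10e9, 200e9, **CONSTS)
print(small > large)  # True: more parameters and data give lower predicted loss
```

The constant term E models an irreducible loss floor, so the predicted loss flattens out rather than reaching zero as N and D grow.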

Training

Pre-training:
├─ Predict next token on massive text corpus
├─ Objective: minimize cross-entropy loss
├─ Hundreds of billions to trillions of tokens (GPT-3: 300B; ~13T for GPT-4 is an unconfirmed report)
└─ Weeks/months on thousands of GPUs
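
The pre-training objective can be written out directly: at every position, take the negative log of the probability the model assigns to the token that actually comes next, then average. This is a minimal sketch with a hypothetical 4-token vocabulary; `softmax` and `next_token_loss` are illustrative helpers, not a library API.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, target_ids):
    """Mean cross-entropy: average of -log p(correct next token) over positions."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Hypothetical 4-token vocabulary; one row of logits per position.
uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
confident = [[5.0, 0.0, 0.0, 0.0], [0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0]]
targets = [0, 1, 2]

print(round(next_token_loss(uniform, targets), 4))  # 1.3863, i.e. ln(4)
print(next_token_loss(confident, targets) < next_token_loss(uniform, targets))  # True
```

A model that guesses uniformly scores ln(vocabulary size); training pushes the loss below that by putting more probability on the correct next token.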

Fine-tuning:
├─ Supervised fine-tuning (SFT)
├─ RLHF (Reinforcement Learning from Human Feedback)
└─ Constitutional AI (Anthropic's alternative, used for Claude rather than GPT models)
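
One ingredient of RLHF is a reward model trained on human comparisons between pairs of responses. A common formulation, assumed here, is a pairwise loss that pushes the reward of the preferred response above the rejected one. The function name `pairwise_preference_loss` and the reward values are hypothetical, and the later policy-optimization step is not shown.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)): small when the chosen response scores higher."""
    margin = reward_chosen - reward_rejected
    # log(1 + exp(-margin)) is algebraically identical to -log(sigmoid(margin)).
    return math.log1p(math.exp(-margin))

print(round(pairwise_preference_loss(2.0, 0.0), 4))  # 0.1269 (ranking agrees with humans)
print(round(pairwise_preference_loss(0.0, 0.0), 4))  # 0.6931 (no preference, ln 2)
print(round(pairwise_preference_loss(0.0, 2.0), 4))  # 2.1269 (ranking disagrees)
```

Only the difference between the two rewards matters, so the reward model learns a relative ranking rather than an absolute score.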

Key Takeaways

  1. GPT is a decoder-only transformer — predicts next token
  2. Masked attention prevents looking ahead
  3. Autoregressive generation produces text one token at a time
  4. Scaling laws predict performance from size, data, compute
  5. Pre-training + fine-tuning is the two-stage approach
  6. RLHF aligns models with human preferences
  7. GPT-4 is widely reported, but not officially confirmed, to use a MoE (Mixture of Experts) architecture
  8. Context window limits how much text the model can process
