🎯 The Interview Question
"Explain the neural scaling laws and their implications for training large language models. What is the Chinchilla scaling law, and how does it differ from the earlier Kaplan scaling law? What does compute-optimal training mean, and how do you determine the optimal model size and data size for a given compute budget?"
This question is fundamental for anyone training large models at OpenAI (GPT) and Anthropic (Claude).
📚 Detailed Answer
Neural Scaling Laws
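Scaling laws describe how model performance improves with more compute, data, and parameters. A common parametric form is:

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

where:
- $L$: loss (cross-entropy)
- $N$: number of parameters
- $D$: number of training tokens
- $C$: compute budget (FLOPs), roughly $C \approx 6ND$ for training
- $\alpha, \beta$: scaling exponents (with fitted constants $A$, $B$)
- $E$: irreducible loss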
Kaplan Scaling Law (OpenAI, 2020)
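Found power-law relationships between loss and each resource when the other two are not the bottleneck:

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}$$

with $\alpha_N \approx 0.076$, $\alpha_D \approx 0.095$, and $\alpha_C \approx 0.050$.

Key finding: For fixed compute $C$, optimal allocation:

$$N_{\text{opt}} \propto C^{0.73}, \quad D_{\text{opt}} \propto C^{0.27}$$

This suggests spending most additional compute on a larger model and comparatively little on more data.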
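To make the Kaplan allocation concrete, the short sketch below computes how parameters and tokens would grow for a given compute multiplier, assuming the exponents 0.73 and 0.27 from the Kaplan fit. The function name `growth_factors` is hypothetical and used only for illustration.

```python
def growth_factors(compute_multiplier, n_exponent=0.73, d_exponent=0.27):
    """Growth in (parameters, tokens) when compute grows by compute_multiplier.

    Assumes N_opt ~ C^n_exponent and D_opt ~ C^d_exponent (Kaplan-style fit).
    """
    n_growth = compute_multiplier ** n_exponent
    d_growth = compute_multiplier ** d_exponent
    return n_growth, d_growth


n_growth, d_growth = growth_factors(10.0)
print(f"10x compute -> {n_growth:.2f}x parameters, {d_growth:.2f}x tokens")
```

Under these exponents, a 10× larger budget goes mostly into model size (a bit over 5×) while the token count does not even double.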
Chinchilla Scaling Law (DeepMind, 2022)
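Reanalyzed scaling with a parametric loss fit and found a different optimal allocation:

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

where:
- $A = 406.4$, $\alpha = 0.34$
- $B = 410.7$, $\beta = 0.28$
- $E = 1.69$ (irreducible loss)

Optimal allocation for compute $C$:

$$N_{\text{opt}} \propto C^{0.5}, \quad D_{\text{opt}} \propto C^{0.5}$$

This suggests scaling parameters and data in equal proportion, which works out to roughly 20 training tokens per parameter.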
⚠️
The Chinchilla results overturned the prevailing wisdom. By this standard, GPT-3 (175B params, 300B tokens) was significantly undertrained. Chinchilla (70B params, 1.4T tokens) outperformed Gopher (280B params, 300B tokens) at a similar compute budget, with 4× fewer parameters and over 4× more data.
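As a rough illustration of the point above, the sketch below evaluates the Chinchilla parametric loss $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$ for a Gopher-sized and a Chinchilla-sized configuration. It assumes the published rounded constants ($A = 406.4$, $B = 410.7$, $\alpha = 0.34$, $\beta = 0.28$, $E = 1.69$); the function name is hypothetical, and the predicted values are fit estimates, not measured results.

```python
def chinchilla_loss(n_params, n_tokens,
                    A=406.4, B=410.7, alpha=0.34, beta=0.28, E=1.69):
    """Parametric loss fit: irreducible term + model-size term + data term."""
    return E + A / n_params ** alpha + B / n_tokens ** beta


# Two configurations with broadly similar training compute (C ~ 6 * N * D)
gopher_like = chinchilla_loss(280e9, 300e9)      # large model, fewer tokens
chinchilla_like = chinchilla_loss(70e9, 1.4e12)  # smaller model, more tokens

print(f"Gopher-like predicted loss:     {gopher_like:.3f}")
print(f"Chinchilla-like predicted loss: {chinchilla_like:.3f}")
```

Under this fit, the smaller model trained on more tokens gets the lower predicted loss, because the data term dominates when $D$ is small relative to $N$.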
Mathematical Derivation
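For fixed compute $C$ (approximate training FLOPs):

$$C \approx 6ND$$

Minimize loss subject to the compute constraint:

$$\min_{N, D} \; L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

subject to:

$$6ND = C$$

Using Lagrange multipliers:

$$\mathcal{L} = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} + \lambda (6ND - C)$$

$$\frac{\partial \mathcal{L}}{\partial N} = -\frac{\alpha A}{N^{\alpha + 1}} + 6 \lambda D = 0, \quad \frac{\partial \mathcal{L}}{\partial D} = -\frac{\beta B}{D^{\beta + 1}} + 6 \lambda N = 0$$

Multiplying the first condition by $N$ and the second by $D$ shows that both equal $6 \lambda N D$, so $\alpha A / N^{\alpha} = \beta B / D^{\beta}$. Substituting $D = C / (6N)$ and solving gives:

$$N_{\text{opt}} = G \left(\frac{C}{6}\right)^{a}, \quad D_{\text{opt}} = G^{-1} \left(\frac{C}{6}\right)^{b}$$

where $a = \frac{\beta}{\alpha + \beta}$, $b = \frac{\alpha}{\alpha + \beta}$, and $G = \left(\frac{\alpha A}{\beta B}\right)^{1 / (\alpha + \beta)}$.

For Chinchilla values ($\alpha = 0.34$, $\beta = 0.28$): $a \approx 0.45$, $b \approx 0.55$, close to an equal split.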
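The closed-form solution can be checked numerically. The sketch below assumes the loss $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$ with the rounded Chinchilla constants and the constraint $C = 6ND$; `compute_optimal` is a hypothetical helper name, and the budget of $10^{21}$ FLOPs is an arbitrary example.

```python
A, B, ALPHA, BETA, E = 406.4, 410.7, 0.34, 0.28, 1.69


def loss(n_params, n_tokens):
    """Parametric loss L(N, D) = E + A / N^alpha + B / D^beta."""
    return E + A / n_params ** ALPHA + B / n_tokens ** BETA


def compute_optimal(compute_flops):
    """Closed-form minimizer of loss(N, D) subject to 6 * N * D = compute_flops."""
    a = BETA / (ALPHA + BETA)
    b = ALPHA / (ALPHA + BETA)
    g = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    n_opt = g * (compute_flops / 6.0) ** a
    d_opt = (compute_flops / 6.0) ** b / g
    return n_opt, d_opt


budget = 1e21  # example budget in FLOPs (assumed for illustration)
n_opt, d_opt = compute_optimal(budget)

# Sanity check: moving along the constraint in either direction should not help
for factor in (0.5, 2.0):
    n_alt = n_opt * factor
    d_alt = budget / (6.0 * n_alt)
    assert loss(n_alt, d_alt) >= loss(n_opt, d_opt)

print(f"N_opt ~ {n_opt:.3e} params, D_opt ~ {d_opt:.3e} tokens")
```

With these rounded constants the closed form does not reproduce the 20-tokens-per-parameter rule of thumb exactly, so the numbers are best read as illustrative of the method, not as a planning target.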
Practical Implications
Compute Budget Allocation
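Each row keeps $D \approx 20N$ (about 20 tokens per parameter); the budget column is computed from $C \approx 6ND$.

| Budget (FLOPs) | Optimal N | Optimal D | Example Model |
|---|---|---|---|
| ~$1.9 \times 10^{19}$ | 400M | 8B | Small LM |
| ~$1.2 \times 10^{20}$ | 1B | 20B | Medium LM |
| ~$1.1 \times 10^{21}$ | 3B | 60B | GPT-NeoX |
| ~$1.2 \times 10^{22}$ | 10B | 200B | Chinchilla-class |
| ~$1.1 \times 10^{23}$ | 30B | 600B | Frontier |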
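The rows of the table pair each model size with about 20 tokens per parameter. A minimal sketch of that rule of thumb, assuming $C \approx 6ND$ and $D = 20N$ (so $N = \sqrt{C / 120}$), is shown below; `rule_of_thumb_allocation` is a hypothetical name.

```python
import math


def rule_of_thumb_allocation(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget into (params, tokens) using C = 6*N*D and D = k*N."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


for budget in (1.2e20, 1.2e22):
    n_params, n_tokens = rule_of_thumb_allocation(budget)
    print(f"C = {budget:.1e} FLOPs -> N ~ {n_params:.2e}, D ~ {n_tokens:.2e}")
```

The `tokens_per_param` argument is exposed so the same helper can be reused for deliberately over-trained configurations.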
Beyond Chinchilla
Inference-Optimal Training
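Chinchilla optimizes for training efficiency, not total cost. Over a model's lifetime, a rough accounting is:

$$C_{\text{total}} \approx 6ND + 2N \cdot D_{\text{inf}}$$

where $D_{\text{inf}}$ is the total number of tokens processed at inference time.

If inference is expensive (many queries), prefer smaller models trained on more data than the compute-optimal point:

$$N < N_{\text{opt}}, \quad D > D_{\text{opt}}$$

This explains why LLaMA-7B is popular even though it is not compute-optimal by Chinchilla's standard: it was trained on about 1T tokens, far more than the roughly 140B that 20 tokens per parameter would suggest.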
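The trade-off can be sketched with a simple lifetime-cost model, assuming roughly $6ND$ FLOPs for training and $2N$ FLOPs per token at inference. The two configurations below are hypothetical, and the comparison only makes sense if they reach comparable quality, which this sketch does not check.

```python
def lifetime_flops(n_params, train_tokens, inference_tokens):
    """Approximate total FLOPs: 6*N*D for training plus 2*N per inference token."""
    return 6.0 * n_params * train_tokens + 2.0 * n_params * inference_tokens


def break_even_inference_tokens(small, large):
    """Inference tokens at which a smaller, longer-trained model matches the
    lifetime cost of a larger one. Each argument is (n_params, train_tokens)."""
    (n_small, d_small), (n_large, d_large) = small, large
    extra_training = 6.0 * (n_small * d_small - n_large * d_large)
    saving_per_token = 2.0 * (n_large - n_small)
    return extra_training / saving_per_token


large = (70e9, 1.4e12)  # hypothetical compute-optimal-style configuration
small = (30e9, 4.0e12)  # hypothetical smaller model trained on more tokens

t_star = break_even_inference_tokens(small, large)
print(f"Break-even at about {t_star:.2e} inference tokens")
```

Past the break-even point, every additional served token favors the smaller model, which is why heavily used models are often trained well beyond the compute-optimal token count.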
Over-Training
For deployment efficiency, train longer than Chinchilla suggests:

$$D \gg D_{\text{opt}} \approx 20N$$
Benefits:
- Better performance per parameter
- Lower inference cost
- More efficient serving
Example: LLaMA-2 7B trains on 2T tokens, roughly 14× the ~140B tokens that the 20-tokens-per-parameter rule would suggest for a 7B model.
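The degree of over-training can be expressed as a ratio against the 20-tokens-per-parameter rule of thumb. The helper below is a minimal sketch under that assumption; `overtraining_ratio` is a hypothetical name.

```python
def overtraining_ratio(n_params, train_tokens, tokens_per_param=20.0):
    """Training tokens divided by the rule-of-thumb optimum (k tokens per param)."""
    return train_tokens / (tokens_per_param * n_params)


# Publicly stated sizes: a 7B model trained on 1T tokens, and one on 2T tokens
for name, n_params, train_tokens in [
    ("7B @ 1T tokens", 7e9, 1e12),
    ("7B @ 2T tokens", 7e9, 2e12),
]:
    ratio = overtraining_ratio(n_params, train_tokens)
    print(f"{name}: {ratio:.1f}x the rule-of-thumb token count")
```

A ratio near 1 corresponds to compute-optimal training; values well above 1 indicate a model tuned for cheaper inference instead of cheaper training.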
Data-Constrained Scaling
When unique data is limited, the remaining option is to train for multiple epochs over the same tokens. Repeated data has diminishing returns but still helps.
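One way to picture diminishing returns from repetition is a saturating-exponential model of "effective" tokens. The functional form and the constant `r_star` below are assumptions chosen for illustration, loosely in the spirit of published data-constrained scaling work; they are not taken from the text above.

```python
import math


def effective_tokens(unique_tokens, epochs, r_star=15.0):
    """Toy model: each repeat adds less value than the one before it.

    unique_tokens: size of the deduplicated dataset
    epochs: total passes over the data (1 means no repetition)
    r_star: assumed decay constant controlling how fast repeats lose value
    """
    repeats = max(epochs - 1.0, 0.0)
    return unique_tokens * (1.0 + r_star * (1.0 - math.exp(-repeats / r_star)))


unique = 100e9  # hypothetical dataset of 100B unique tokens
for epochs in (1, 2, 4, 16, 64):
    seen = unique * epochs
    eff = effective_tokens(unique, epochs)
    print(f"{epochs:>2} epochs: seen {seen:.2e}, effective {eff:.2e}")
```

In this toy model the first few epochs are worth nearly as much as fresh data, while the effective token count flattens out as repetition grows.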
Scaling Laws for Emergent Abilities
Some abilities only appear above certain scale thresholds:
| Ability | Emergence Scale | Example |
|---|---|---|
| In-context learning | ~1B params | Few-shot prompting |
| Chain-of-thought | ~100B params | Step-by-step reasoning |
| Code generation | ~10B params | Python code |
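These thresholds are approximate. Scaling laws predict a smooth decline in loss, not these abilities: on downstream benchmarks they can appear abruptly once a model crosses a certain scale.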
Practical Guidelines
Follow-Up Questions
Q: Why did Chinchilla contradict Kaplan's findings? A: Kaplan's experiments used the same learning rate schedule length for all runs instead of matching it to each run's token count, which made shorter training runs look worse than they should and biased the fit toward larger models. Most of those runs also used fairly small models. Chinchilla matched the schedule to the training length for each configuration and included much larger models.
Q: How do scaling laws apply to fine-tuning? A: Scaling laws are primarily for pre-training. Fine-tuning scales differently: smaller datasets can be sufficient because the model already has learned representations.
Q: What are the limitations of scaling laws? A: They don't predict emergent abilities, don't account for data quality, and may not hold at extreme scales. They also don't capture downstream task performance.