Scaling Laws: Chinchilla, Compute-Optimal Training — Asked at OpenAI & Anthropic

🎯 The Interview Question

"Explain the neural scaling laws and their implications for training large language models. What is the Chinchilla scaling law, and how does it differ from the earlier Kaplan scaling law? What does compute-optimal training mean, and how do you determine the optimal model size and data size for a given compute budget?"

This question is fundamental for anyone training large models at OpenAI (GPT) and Anthropic (Claude).

📚 Detailed Answer

Neural Scaling Laws

Scaling laws describe how model performance improves with more compute, data, and parameters:

L(N, D) \approx \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + L_\infty

where:

$L$ : loss (cross-entropy)
$N$ : number of parameters
$D$ : number of training tokens
$C$ : compute budget (FLOPs)
$\alpha_N, \alpha_D$ : scaling exponents
$N_c, D_c$ : fitted constants that set the scale of each term
$L_\infty$ : irreducible loss

Kaplan Scaling Law (OpenAI, 2020)

Kaplan et al. found that loss follows a power law in each resource when the other two are not the bottleneck:

L(N) \propto N^{-0.076}

L(D) \propto D^{-0.095}

L(C) \propto C^{-0.050}
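To get a feel for how shallow these exponents are, the sketch below computes the relative loss reduction from doubling a single resource. It assumes a pure power law in that resource with nothing else limiting, and the helper name `loss_ratio` is made up for this illustration.

```python
# Exponents from the Kaplan et al. power laws quoted above.
KAPLAN_EXPONENTS = {"parameters": 0.076, "tokens": 0.095, "compute": 0.050}


def loss_ratio(scale_factor: float, exponent: float) -> float:
    """Ratio L(k * x) / L(x) for a pure power law L(x) proportional to x^(-exponent)."""
    return scale_factor ** (-exponent)


for resource, exponent in KAPLAN_EXPONENTS.items():
    reduction = 1.0 - loss_ratio(2.0, exponent)
    print(f"Doubling {resource}: loss falls by about {100 * reduction:.1f}%")
```

Each doubling buys only a few percent of loss, which is why scaling decisions are usually discussed in orders of magnitude.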

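Key finding: for a fixed compute budget $C$ , the optimal allocation scales as:

N \propto C^{0.73}, \quad D \propto C^{0.27}

This suggests spending most of any extra compute on a larger model rather than on more data: a 10× larger budget implies a model about 5.4× larger trained on only about 1.9× more tokens.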
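As a rough illustration of this allocation rule, the snippet below turns a compute multiplier into model-size and data multipliers using the two exponents above. The function name `kaplan_growth` is hypothetical; only the exponents come from the text.

```python
# Kaplan et al. compute-optimal exponents, as quoted above.
KAPLAN_N_EXP = 0.73
KAPLAN_D_EXP = 0.27


def kaplan_growth(compute_multiplier: float) -> tuple[float, float]:
    """Return (model-size multiplier, data multiplier) for a compute multiplier."""
    return compute_multiplier**KAPLAN_N_EXP, compute_multiplier**KAPLAN_D_EXP


for multiplier in (10.0, 100.0, 1000.0):
    n_mult, d_mult = kaplan_growth(multiplier)
    print(f"{multiplier:>6.0f}x compute -> {n_mult:6.1f}x parameters, {d_mult:4.1f}x tokens")
```

Because the exponents sum to 1, the two multipliers always multiply back to the compute multiplier.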

Chinchilla Scaling Law (DeepMind, 2022)

Hoffmann et al. re-examined the trade-off between model size and data, and modeled the loss with a parametric fit:

L(N, D) = \frac{A}{N^\alpha} + \frac{B}{D^\beta} + E

where:

$A = 406.4$ , $\alpha = 0.34$
$B = 410.7$ , $\beta = 0.28$
$E = 1.69$ (irreducible loss)
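A minimal sketch of evaluating this parametric fit is shown below, using the constants listed above. The function name `chinchilla_loss` is an assumption for illustration, and the predicted values are only meaningful relative to each other, since the fit depends on the dataset and tokenizer it was estimated on.

```python
# Fitted constants for the parametric loss, as quoted above.
A, ALPHA = 406.4, 0.34
B, BETA = 410.7, 0.28
E = 1.69


def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted loss L(N, D) = A / N^alpha + B / D^beta + E."""
    return A / n_params**ALPHA + B / n_tokens**BETA + E


# Two configurations mentioned in this article.
gpt3_like = chinchilla_loss(175e9, 300e9)
chinchilla_like = chinchilla_loss(70e9, 1.4e12)
print(f"175B params, 300B tokens: predicted loss {gpt3_like:.3f}")
print(f"70B params, 1.4T tokens:  predicted loss {chinchilla_like:.3f}")
```

Under this fit the smaller model trained on more tokens gets the lower predicted loss, because the data term dominates for the 300B-token run.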

Optimal allocation for compute $C$ :

N_{opt} \propto C^{0.50}, \quad D_{opt} \propto C^{0.50}

This suggests scaling parameters and data in equal proportion: every doubling of model size should come with a doubling of training tokens.

⚠️

The Chinchilla result overturned the prevailing wisdom. By its estimate, GPT-3 (175B parameters, 300B tokens) was significantly undertrained. Chinchilla (70B parameters, 1.4T tokens) outperformed Gopher (280B parameters, 300B tokens) at the same compute budget, with 4× fewer parameters and more than 4× the data.
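The gap is easiest to see as tokens per parameter. The sketch below computes that ratio and an approximate training compute for the two configurations, assuming the common approximation of 6 FLOPs per parameter per token. The helper name `training_flops` is illustrative.

```python
# (parameters, training tokens) for the configurations quoted above.
MODELS = {
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}


def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute, assuming C ~ 6 * N * D."""
    return 6.0 * n_params * n_tokens


for name, (n_params, n_tokens) in MODELS.items():
    ratio = n_tokens / n_params
    flops = training_flops(n_params, n_tokens)
    print(f"{name}: {ratio:.1f} tokens per parameter, ~{flops:.2e} training FLOPs")
```

Chinchilla's ratio of 20 tokens per parameter is where the widely quoted rule of thumb comes from.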

Mathematical Derivation

Training compute is approximately $C = 6ND$ FLOPs. For a fixed budget $C$ , minimize the loss subject to that constraint:

\mathcal{L} = \frac{A}{N^\alpha} + \frac{B}{D^\beta} + E

subject to: $ND = C/6$

Using Lagrange multipliers:

\frac{\partial \mathcal{L}}{\partial N} = \lambda \frac{\partial (ND)}{\partial N}

-\frac{\alpha A}{N^{\alpha+1}} = \lambda D

-\frac{\beta B}{D^{\beta+1}} = \lambda N
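To make the elimination of $\lambda$ explicit, one way to proceed (a sketch using only the symbols defined above) is to multiply the first condition by $N$ and the second by $D$, equate the two, and then substitute $D = C/(6N)$ from the constraint:

```latex
-\frac{\alpha A}{N^{\alpha}} = \lambda N D = -\frac{\beta B}{D^{\beta}}
\quad\Longrightarrow\quad
\alpha A \, N^{-\alpha} = \beta B \, D^{-\beta}

D = \frac{C}{6N}
\quad\Longrightarrow\quad
N^{\alpha + \beta} = \frac{\alpha A}{\beta B} \left(\frac{C}{6}\right)^{\beta}
```

Taking the $(\alpha + \beta)$-th root of the last line yields the power law in $C$ for $N$, and the constraint then fixes $D$.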

Solving gives:

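N_{opt} = G \left(\frac{C}{6}\right)^a, \quad D_{opt} = G^{-1} \left(\frac{C}{6}\right)^b

where $a = \frac{\beta}{\alpha + \beta}$ , $b = \frac{\alpha}{\alpha + \beta}$ , and $G = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha + \beta}}$ .

For the fitted values above ( $\alpha = 0.34$ , $\beta = 0.28$ ): $a \approx 0.45$ , $b \approx 0.55$ (the paper reports $0.46$ and $0.54$ from the unrounded fit). The paper's other two estimation approaches give exponents close to $0.50$ each, which is the source of the equal-scaling rule.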
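The closed form can be checked numerically. The sketch below implements it with the constraint $ND = C/6$ and the fitted constants quoted earlier, then confirms that moving along the constraint in either direction raises the loss. The budget of `1e23` FLOPs and the function names are assumptions chosen for illustration.

```python
# Fitted Chinchilla constants quoted earlier in this article.
A, ALPHA = 406.4, 0.34
B, BETA = 410.7, 0.28
E = 1.69


def loss(n_params: float, n_tokens: float) -> float:
    """Parametric loss L(N, D) = A / N^alpha + B / D^beta + E."""
    return A / n_params**ALPHA + B / n_tokens**BETA + E


def compute_optimal(compute_flops: float) -> tuple[float, float]:
    """Closed-form (N_opt, D_opt) under the constraint 6 * N * D = compute_flops."""
    a = BETA / (ALPHA + BETA)
    b = ALPHA / (ALPHA + BETA)
    g = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    nd_product = compute_flops / 6.0
    return g * nd_product**a, nd_product**b / g


budget = 1e23  # hypothetical compute budget in FLOPs
n_opt, d_opt = compute_optimal(budget)
print(f"N_opt ~ {n_opt:.3e} parameters, D_opt ~ {d_opt:.3e} tokens")

# Moving along the constraint (N -> f * N, D -> D / f) should only increase the loss.
for factor in (0.5, 0.9, 1.1, 2.0):
    assert loss(n_opt * factor, d_opt / factor) > loss(n_opt, d_opt)
```

With these rounded constants the optimum lands at a smaller model and more tokens than the 20-tokens-per-parameter heuristic would give, so the numbers are better read as an illustration of the method than as a recipe.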

Practical Implications

Compute Budget Allocation

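Approximate values from $C = 6ND$ and the rule of thumb $D \approx 20N$ (about 20 tokens per parameter, the ratio Chinchilla itself uses):

Budget (FLOPs)	Optimal N (parameters)	Optimal D (tokens)
$10^{20}$	0.9B	18B
$10^{21}$	2.9B	58B
$10^{22}$	9.1B	183B
$10^{23}$	29B	577B
$10^{24}$	91B	1.8T

For reference, Chinchilla (70B parameters, 1.4T tokens) corresponds to about $5.9 \times 10^{23}$ FLOPs.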
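A quick way to produce an allocation like this for any budget is to combine $C = 6ND$ with a fixed tokens-per-parameter ratio, which gives $N = \sqrt{C / (6r)}$. The sketch below assumes $r = 20$ as a rule of thumb (a heuristic, not a fitted constant), and `allocate` is a hypothetical helper name.

```python
import math

TOKENS_PER_PARAM = 20.0       # assumed rule of thumb, not a fitted constant
FLOPS_PER_PARAM_TOKEN = 6.0   # from the approximation C ~ 6 * N * D


def allocate(compute_flops: float, tokens_per_param: float = TOKENS_PER_PARAM) -> tuple[float, float]:
    """Return (N, D) with D = tokens_per_param * N and 6 * N * D = compute_flops."""
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    return n_params, tokens_per_param * n_params


for exponent in range(20, 25):
    n_params, n_tokens = allocate(10.0**exponent)
    print(f"C = 1e{exponent}: N ~ {n_params / 1e9:.1f}B params, D ~ {n_tokens / 1e9:.0f}B tokens")
```

Changing `tokens_per_param` is the simplest way to explore deliberately over-trained or under-trained configurations at the same budget.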

Beyond Chinchilla

Inference-Optimal Training

Chinchilla optimizes for training efficiency, not total cost:

\text{Total Cost} = \text{Training Cost} + \text{Inference Cost per Query} \times \text{Queries}

If inference is expensive (many queries), prefer smaller models:

N_{\text{inference-opt}} < N_{\text{Chinchilla}}

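This explains why models like LLaMA-7B are popular even though they are far from Chinchilla-optimal: LLaMA-7B was trained on about 1T tokens, several times more than the compute-optimal amount for a 7B model, in exchange for cheaper inference.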
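The trade-off can be sketched with the parametric loss from earlier. The example below takes a compute-optimal model, halves its size, solves for the tokens the smaller model needs to reach the same predicted loss, and finds the inference volume at which the smaller model becomes cheaper overall. It assumes about 6 FLOPs per parameter per training token and about 2 FLOPs per parameter per generated token. The budget, the halving factor, and all function names are illustrative choices.

```python
# Fitted Chinchilla constants quoted earlier in this article.
A, ALPHA = 406.4, 0.34
B, BETA = 410.7, 0.28
E = 1.69


def loss(n_params: float, n_tokens: float) -> float:
    return A / n_params**ALPHA + B / n_tokens**BETA + E


def compute_optimal(compute_flops: float) -> tuple[float, float]:
    """Closed-form optimum of the parametric loss under 6 * N * D = compute_flops."""
    a = BETA / (ALPHA + BETA)
    b = ALPHA / (ALPHA + BETA)
    g = (ALPHA * A / (BETA * B)) ** (1.0 / (ALPHA + BETA))
    nd_product = compute_flops / 6.0
    return g * nd_product**a, nd_product**b / g


def tokens_to_match(target_loss: float, n_params: float) -> float:
    """Training tokens a model of size n_params needs to reach target_loss."""
    remainder = target_loss - E - A / n_params**ALPHA
    if remainder <= 0:
        raise ValueError("model too small to reach the target loss")
    return (B / remainder) ** (1.0 / BETA)


def total_flops(n_params: float, n_tokens: float, inference_tokens: float) -> float:
    """Training plus inference compute (assumed 6ND and 2N per generated token)."""
    return 6.0 * n_params * n_tokens + 2.0 * n_params * inference_tokens


n_base, d_base = compute_optimal(1e23)  # hypothetical training budget
target = loss(n_base, d_base)
n_small = 0.5 * n_base                  # arbitrary choice: half the size
d_small = tokens_to_match(target, n_small)

extra_training = 6.0 * (n_small * d_small - n_base * d_base)
break_even_tokens = extra_training / (2.0 * (n_base - n_small))
print(f"Smaller model needs {d_small / d_base:.2f}x the training tokens")
print(f"It is cheaper overall beyond ~{break_even_tokens:.2e} inference tokens")
```

Below the break-even volume the compute-optimal model is cheaper in total; above it, the extra training spent on the smaller model is repaid by cheaper serving.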

Over-Training

For deployment efficiency, train longer than Chinchilla suggests:

D_{over} > D_{chinchilla}

Benefits:

Better performance per parameter
Lower inference cost
More efficient serving

Example: LLaMA-2 7B was trained on 2T tokens, roughly 14× the ~140B tokens that a 20-tokens-per-parameter rule would suggest for a 7B model.
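How far a model sits from the Chinchilla ratio is a one-line calculation. The sketch below assumes the 20-tokens-per-parameter heuristic as the reference point, and `overtraining_ratio` is a hypothetical helper name.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20.0  # assumed heuristic reference ratio


def overtraining_ratio(
    n_params: float,
    n_tokens: float,
    tokens_per_param: float = CHINCHILLA_TOKENS_PER_PARAM,
) -> float:
    """Training tokens divided by the heuristic compute-optimal token count."""
    return n_tokens / (tokens_per_param * n_params)


n_params, n_tokens = 7e9, 2e12  # LLaMA-2 7B figures quoted above
print(f"{n_tokens / n_params:.0f} tokens per parameter")
print(f"{overtraining_ratio(n_params, n_tokens):.1f}x the heuristic token count")
```

A ratio near 1 means the model is close to the heuristic optimum; values well above 1 indicate deliberate over-training for cheaper deployment.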

Data-Constrained Scaling

When data is limited:

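L(N, D, D_{unique}) = \frac{A}{N^\alpha} + \frac{B}{D^\beta} + \frac{K}{D_{unique}^\gamma} + E

Here $D_{unique}$ is the number of unique tokens, and $K$ and $\gamma$ are additional fitted constants. Repeating data has diminishing returns but still helps.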

Scaling Laws for Emergent Abilities

Some abilities only appear above certain scale thresholds:

Ability	Emergence Scale	Example
In-context learning	~1B params	Few-shot prompting
Chain-of-thought	~100B params	Step-by-step reasoning
Code generation	~10B params	Python code

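Smooth scaling laws for loss do not predict these thresholds: on task benchmarks the abilities seem to appear abruptly, and the scales above are rough orders of magnitude.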

Practical Guidelines

Follow-Up Questions

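Q: Why did Chinchilla contradict Kaplan's findings? A: Kaplan et al. used the same learning rate schedule for every run regardless of how many tokens it was trained on, which overestimates the loss of shorter runs and biases the fit toward larger models; most of their models were also small. Chinchilla matched the cosine schedule length to each run's token count and covered a wider range of model sizes.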

Q: How do scaling laws apply to fine-tuning? A: Scaling laws are primarily for pre-training. Fine-tuning scales differently — smaller datasets can be sufficient because the model already has learned representations.

Q: What are the limitations of scaling laws? A: They don't predict emergent abilities, don't account for data quality, and may not hold at extreme scales. They also describe pre-training loss, which does not translate directly into downstream task performance.