
Scaling Laws: Chinchilla, Compute-Optimal Training — Asked at OpenAI & Anthropic


Scaling Laws: Chinchilla, Compute-Optimal Training & LLM Design

Premium Interview Preparation — Scaling Laws Mastery

🎯 The Interview Question

"Explain the neural scaling laws and their implications for training large language models. What is the Chinchilla scaling law, and how does it differ from the earlier Kaplan scaling law? What does compute-optimal training mean, and how do you determine the optimal model size and data size for a given compute budget?"

This question is fundamental for anyone training large models at OpenAI (GPT) and Anthropic (Claude).


📚 Detailed Answer

Neural Scaling Laws

Scaling laws describe how model performance improves with more compute, data, and parameters:

$$L(N, D, C) \approx \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + L_\infty$$

where:

  • $L$: loss (cross-entropy)
  • $N$: number of parameters
  • $D$: number of training tokens
  • $C$: compute budget (FLOPs)
  • $N_c, D_c$: fitted constants
  • $\alpha_N, \alpha_D$: scaling exponents
  • $L_\infty$: irreducible loss

Kaplan Scaling Law (OpenAI, 2020)

Kaplan et al. found that loss follows a power law in each of parameters, data, and compute when the other two are not the bottleneck:

$$L(N) \propto N^{-0.076}$$
$$L(D) \propto D^{-0.095}$$
$$L(C) \propto C^{-0.050}$$
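To get a feel for how shallow these exponents are, the sketch below inverts a pure power law to ask how much a resource must grow for a given loss reduction. It assumes the loss is exactly proportional to `x ** (-exponent)` with no irreducible term; the function name `scale_needed` is illustrative.

```python
def scale_needed(loss_ratio: float, exponent: float) -> float:
    """Factor by which a resource must grow so the loss is multiplied by
    `loss_ratio` (0.9 means 10% lower), assuming L is proportional to
    x ** (-exponent) with no irreducible offset."""
    if not 0.0 < loss_ratio <= 1.0:
        raise ValueError("loss_ratio must be in (0, 1]")
    return loss_ratio ** (-1.0 / exponent)


# Exponents quoted in the text for parameters, tokens, and compute.
KAPLAN_EXPONENTS = {"parameters": 0.076, "tokens": 0.095, "compute": 0.050}

for name, exponent in KAPLAN_EXPONENTS.items():
    factor = scale_needed(0.9, exponent)
    print(f"{name}: about {factor:.1f}x more for a 10% lower loss")
```

Under this assumption, a 10% lower loss from parameters alone needs roughly 4× more parameters, which is why small exponent differences matter so much at scale.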

Key finding: for a fixed compute budget $C$, the optimal allocation is:

$$N \propto C^{0.73}, \quad D \propto C^{0.27}$$

This suggests spending most of any additional compute on a larger model and comparatively little on more data.

Chinchilla Scaling Law (DeepMind, 2022)

DeepMind re-ran the scaling analysis and found a different optimal allocation. One of their fits models the loss as:

$$L(N, D) = \frac{A}{N^\alpha} + \frac{B}{D^\beta} + E$$

where:

  • $A = 406.4$, $\alpha = 0.34$
  • $B = 410.7$, $\beta = 0.28$
  • $E = 1.69$ (irreducible loss)
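As a minimal sketch, the fitted form can be evaluated directly. The constants are the ones listed above; the two configurations are illustrative inputs, and because the fit was made on one specific dataset and tokenizer, the outputs are only indicative for other models.

```python
# Fitted constants quoted in the text.
A, ALPHA = 406.4, 0.34
B, BETA = 410.7, 0.28
E = 1.69  # irreducible loss


def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Parametric loss L(N, D) = A / N^alpha + B / D^beta + E."""
    return A / n_params**ALPHA + B / n_tokens**BETA + E


# Illustrative configurations: (label, parameters, training tokens).
CONFIGS = [
    ("70B params, 1.4T tokens", 70e9, 1.4e12),
    ("175B params, 300B tokens", 175e9, 300e9),
]

for label, n_params, n_tokens in CONFIGS:
    print(f"{label}: predicted loss {chinchilla_loss(n_params, n_tokens):.3f}")
```

The smaller model trained on more tokens gets the lower predicted loss, because the data term dominates when a large model sees only 300B tokens.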

Optimal allocation for a compute budget $C$:

$$N_{opt} \propto C^{0.50}, \quad D_{opt} \propto C^{0.50}$$

This suggests scaling parameters and training tokens in roughly equal proportion as compute grows.
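The practical difference between the two allocation rules shows up when the budget grows. The sketch below applies the exponents quoted in the text to a 10× larger compute budget; `growth_factors` is an illustrative helper name.

```python
def growth_factors(compute_multiplier: float, n_exponent: float, d_exponent: float):
    """Return how much N and D grow when compute grows by `compute_multiplier`,
    given N proportional to C^n_exponent and D proportional to C^d_exponent."""
    return compute_multiplier**n_exponent, compute_multiplier**d_exponent


# (N exponent, D exponent) for each allocation rule, as quoted in the text.
RULES = {"Kaplan": (0.73, 0.27), "Chinchilla": (0.50, 0.50)}

for name, (n_exp, d_exp) in RULES.items():
    n_x, d_x = growth_factors(10.0, n_exp, d_exp)
    print(f"{name}: 10x compute -> {n_x:.2f}x parameters, {d_x:.2f}x tokens")
```

With 10× more compute, the Kaplan exponents grow the model about 5.4× and the data about 1.9×, while the Chinchilla exponents grow both by about 3.2×.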

⚠️ The Chinchilla result overturned the prevailing wisdom. By its criterion, GPT-3 (175B params, 300B tokens) was significantly undertrained. Chinchilla (70B params, 1.4T tokens) outperformed Gopher (280B params, 300B tokens), a model trained with a similar compute budget, using 4× fewer parameters and more than 4× as much data.
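A quick way to see the gap is to compare tokens per parameter for the GPT-3 and Chinchilla figures quoted above. This is a minimal sketch; the dictionary and function names are illustrative.

```python
# (parameters, training tokens) as quoted in the text.
MODELS = {
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}


def tokens_per_parameter(n_params: float, n_tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return n_tokens / n_params


for name, (n_params, n_tokens) in MODELS.items():
    ratio = tokens_per_parameter(n_params, n_tokens)
    print(f"{name}: {ratio:.1f} tokens per parameter")
```

Chinchilla's ratio of about 20 tokens per parameter is the origin of the common rule of thumb; GPT-3 sits below 2.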

Mathematical Derivation

Training compute is approximately $C = 6ND$ FLOPs. For a fixed budget, minimize the loss subject to that constraint:

$$\mathcal{L} = \frac{A}{N^\alpha} + \frac{B}{D^\beta} + E$$

subject to: $ND = C/6$

Using Lagrange multipliers:

$$\frac{\partial \mathcal{L}}{\partial N} = \lambda \frac{\partial (ND)}{\partial N}$$
$$-\frac{\alpha A}{N^{\alpha+1}} = \lambda D$$
$$-\frac{\beta B}{D^{\beta+1}} = \lambda N$$

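Multiplying the first condition by $N$ and the second by $D$ shows that both sides equal $\lambda ND$, so $\alpha A N^{-\alpha} = \beta B D^{-\beta}$. Substituting $D = C/(6N)$ and solving gives:

$$N_{opt} = G \left(\frac{C}{6}\right)^a, \quad D_{opt} = G^{-1} \left(\frac{C}{6}\right)^b$$

where $a = \frac{\beta}{\alpha + \beta}$, $b = \frac{\alpha}{\alpha + \beta}$, and $G = \left(\frac{\alpha A}{\beta B}\right)^{1/(\alpha + \beta)}$.

For the fitted Chinchilla values ($\alpha = 0.34$, $\beta = 0.28$): $a \approx 0.45$, $b \approx 0.55$, close to the equal split $a \approx b \approx 0.50$.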
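The closed form can be sanity-checked numerically. The sketch below evaluates the Lagrange solution, including the constant $G = (\alpha A / \beta B)^{1/(\alpha+\beta)}$ that follows from the two stationarity conditions, and compares it with a brute-force scan over $N$ along the constraint $D = C/(6N)$. The budget of $10^{23}$ FLOPs and the scan range are arbitrary choices for illustration.

```python
import math

# Fitted constants quoted in the text.
A, ALPHA = 406.4, 0.34
B, BETA = 410.7, 0.28
E = 1.69


def loss(n_params: float, n_tokens: float) -> float:
    return A / n_params**ALPHA + B / n_tokens**BETA + E


def closed_form_optimum(budget_flops: float):
    """Lagrange-multiplier solution under the constraint N * D = C / 6."""
    total = ALPHA + BETA
    a, b = BETA / total, ALPHA / total
    g = (ALPHA * A / (BETA * B)) ** (1.0 / total)
    per_six = budget_flops / 6.0
    return g * per_six**a, per_six**b / g


def grid_optimum(budget_flops: float, steps: int = 2001):
    """Brute force: scan N on a log grid from 1e7 to 1e13 with D = C / (6N)."""
    best_n, best_loss = None, math.inf
    for i in range(steps):
        n_params = 10.0 ** (7.0 + 6.0 * i / (steps - 1))
        n_tokens = budget_flops / (6.0 * n_params)
        current = loss(n_params, n_tokens)
        if current < best_loss:
            best_n, best_loss = n_params, current
    return best_n, budget_flops / (6.0 * best_n)


budget = 1e23  # illustrative budget in FLOPs
n_closed, d_closed = closed_form_optimum(budget)
n_grid, d_grid = grid_optimum(budget)
print(f"closed form: N = {n_closed:.3e}, D = {d_closed:.3e}")
print(f"grid search: N = {n_grid:.3e}, D = {d_grid:.3e}")
```

The two answers should agree to within the grid resolution, which confirms that the exponents depend only on $\alpha$ and $\beta$ while $A$ and $B$ enter through $G$.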

Practical Implications

Compute Budget Allocation

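Values assume $C = 6ND$ and about 20 tokens per parameter (the ratio of Chinchilla's 1.4T tokens to its 70B parameters).

| Budget (FLOPs) | Optimal N | Optimal D | Example Model |
|---|---|---|---|
| $10^{20}$ | ~0.9B | ~18B | Small LM |
| $10^{21}$ | ~2.9B | ~58B | Medium LM |
| $10^{22}$ | ~9B | ~180B | Large LM |
| $10^{23}$ | ~29B | ~580B | Very large LM |
| $10^{24}$ | ~91B | ~1.8T | Chinchilla-class (Chinchilla itself used about $5.9 \times 10^{23}$ FLOPs) |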
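Any row of such a table can be checked against the compute approximation $C = 6ND$. The sketch below derives $N$ and $D$ from a budget under the assumption of a fixed 20 tokens per parameter, which is a rule of thumb and not an exact law; the constant and function names are illustrative.

```python
import math

TOKENS_PER_PARAM = 20.0       # assumed rule-of-thumb ratio D / N
FLOPS_PER_PARAM_TOKEN = 6.0   # from the approximation C = 6 * N * D


def rule_of_thumb_allocation(budget_flops: float):
    """Solve C = 6 * N * (20 * N) for N, then set D = 20 * N."""
    n_params = math.sqrt(budget_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return n_params, TOKENS_PER_PARAM * n_params


for exponent in range(20, 25):
    budget = 10.0**exponent
    n_params, n_tokens = rule_of_thumb_allocation(budget)
    print(f"1e{exponent} FLOPs: N = {n_params / 1e9:.1f}B, D = {n_tokens / 1e9:.0f}B")
```

Because both $N$ and $D$ scale as $\sqrt{C}$ under this rule, each 10× increase in budget raises both by about 3.2×.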

Beyond Chinchilla

Inference-Optimal Training

Chinchilla optimizes for training efficiency, not total cost:

$$\text{Total Cost} = \text{Training Cost} + \text{Inference Cost} \times \text{Queries}$$

If inference is expensive (many queries), prefer smaller models:

$$N_{\text{inference-opt}} < N_{\text{chinchilla}}$$

This explains why LLaMA-7B is popular even though it is not Chinchilla-optimal: it is smaller, and trained on more tokens, than the compute-optimal model for its training budget.
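The trade-off can be made concrete with the parametric loss fit from the Chinchilla section. The sketch below takes the compute-optimal model for a budget, halves its size, finds how many tokens the smaller model needs to reach the same predicted loss, and computes the number of inference tokens at which the smaller model becomes cheaper overall. It assumes training costs about $6ND$ FLOPs and inference costs about $2N$ FLOPs per token (forward pass only); the $2N$ figure, the $10^{23}$ budget, and all function names are assumptions for illustration.

```python
# Fitted constants quoted in the text.
A, ALPHA = 406.4, 0.34
B, BETA = 410.7, 0.28
E = 1.69


def loss(n_params, n_tokens):
    return A / n_params**ALPHA + B / n_tokens**BETA + E


def compute_optimal(budget_flops):
    """Closed-form optimum under N * D = C / 6."""
    total = ALPHA + BETA
    g = (ALPHA * A / (BETA * B)) ** (1.0 / total)
    n_opt = g * (budget_flops / 6.0) ** (BETA / total)
    return n_opt, budget_flops / (6.0 * n_opt)


def tokens_to_match(n_params, target_loss):
    """Tokens a model of size n_params needs to reach target_loss."""
    gap = target_loss - E - A / n_params**ALPHA
    if gap <= 0:
        raise ValueError("model is too small to reach the target loss")
    return (B / gap) ** (1.0 / BETA)


def total_flops(n_params, n_tokens, inference_tokens):
    """Training (6ND) plus inference (assumed 2N per generated token)."""
    return 6.0 * n_params * n_tokens + 2.0 * n_params * inference_tokens


def break_even_inference_tokens(n_big, d_big, n_small, d_small):
    """Inference volume at which the smaller model's total cost matches."""
    extra_training = 6.0 * (n_small * d_small - n_big * d_big)
    saving_per_token = 2.0 * (n_big - n_small)
    return extra_training / saving_per_token


n_big, d_big = compute_optimal(1e23)
n_small = n_big / 2.0
d_small = tokens_to_match(n_small, loss(n_big, d_big))
t_star = break_even_inference_tokens(n_big, d_big, n_small, d_small)
print(f"compute-optimal: N = {n_big:.2e}, D = {d_big:.2e}")
print(f"half-size model: N = {n_small:.2e}, D = {d_small:.2e}")
print(f"break-even inference volume: {t_star:.2e} tokens")
```

The smaller model costs more to train to the same loss, but every served token is cheaper, so it wins once inference volume passes the break-even point.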

Over-Training

For deployment efficiency, train longer than Chinchilla suggests:

$$D_{\text{over}} > D_{\text{chinchilla}}$$

Benefits:

  • Better performance per parameter
  • Lower inference cost
  • More efficient serving

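Example: LLaMA-2 7B was trained on 2T tokens, about 285 tokens per parameter, or roughly 14× the ≈20 tokens per parameter that Chinchilla suggests for a model of that size.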
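How large the over-training factor looks depends on the baseline. The sketch below computes it two ways for the LLaMA-2 7B figures: against the Chinchilla-style token count for a model of the same size, and against the token count of the compute-optimal model for the same training budget. Both baselines assume $C = 6ND$ and a fixed 20 tokens per parameter, which is a heuristic; the function name is illustrative.

```python
import math

TOKENS_PER_PARAM = 20.0  # assumed Chinchilla-style rule of thumb


def overtraining_ratios(n_params: float, n_tokens: float):
    """Return (tokens per parameter, ratio vs same-size baseline,
    ratio vs same-compute baseline)."""
    tokens_per_param = n_tokens / n_params
    # Baseline 1: keep the model size, use 20 tokens per parameter.
    same_size = n_tokens / (TOKENS_PER_PARAM * n_params)
    # Baseline 2: keep the training compute C = 6ND, re-split it optimally.
    budget = 6.0 * n_params * n_tokens
    n_matched = math.sqrt(budget / (6.0 * TOKENS_PER_PARAM))
    same_compute = n_tokens / (TOKENS_PER_PARAM * n_matched)
    return tokens_per_param, same_size, same_compute


tpp, vs_same_size, vs_same_compute = overtraining_ratios(7e9, 2e12)
print(f"tokens per parameter: {tpp:.0f}")
print(f"vs same-size baseline: {vs_same_size:.1f}x")
print(f"vs same-compute baseline: {vs_same_compute:.1f}x")
```

Against a same-size baseline the factor is about 14×; against the compute-matched optimum it is just under 4×, so quoted over-training multiples should always state which baseline they use.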

Data-Constrained Scaling

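When data is limited, a schematic way to extend the loss is to add a term for the number of unique tokens:

$$L(N, D, D_{unique}) = \frac{A}{N^\alpha} + \frac{B}{D^\beta} + \frac{C}{D_{unique}^\gamma} + E$$

Here $C$ and $\gamma$ are fitted constants ($C$ is not the compute budget). Repeated data has diminishing returns but still helps.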

Scaling Laws for Emergent Abilities

Some abilities only appear above certain scale thresholds:

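| Ability | Emergence Scale | Example |
|---|---|---|
| In-context learning | ~1B params | Few-shot prompting |
| Chain-of-thought | ~100B params | Step-by-step reasoning |
| Code generation | ~10B params | Python code |

Scaling laws, which describe smooth changes in loss, do not predict these thresholds: on task metrics the abilities appear suddenly.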

Practical Guidelines

Follow-Up Questions

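Q: Why did Chinchilla contradict Kaplan's findings? A: Kaplan's analysis used a fixed number of training tokens and a fixed learning-rate schedule for all models, so runs that stopped before the schedule finished decaying looked worse than they should have, which biased the fit toward larger models. Chinchilla matched the learning-rate schedule to each run's training length.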

Q: How do scaling laws apply to fine-tuning? A: Scaling laws are primarily for pre-training. Fine-tuning scales differently — smaller datasets can be sufficient because the model already has learned representations.

Q: What are the limitations of scaling laws? A: They don't predict emergent abilities, don't account for data quality, and may not hold at extreme scales. They also don't capture downstream task performance.
