Scaling Laws & Chinchilla
Scaling laws describe the relationship between model performance and key factors like model size, dataset size, and compute budget. Understanding these laws is crucial for making informed decisions about LLM training.
Why Scaling Laws Matter
Training large language models requires enormous resources. Scaling laws help us:
- Predict performance before committing resources
- Allocate compute budget optimally between model size and data
- Compare training runs across different scales
- Plan training for new models
Kaplan Scaling Laws
OpenAI's Kaplan et al. (2020) established the first comprehensive scaling laws for language models.
Their central observation is that language model performance, measured as cross-entropy loss, follows a power law in model size (N), dataset size (D), and compute budget (C), with each relationship holding when the other two factors are not the bottleneck.
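For model size alone, the power law can be written as below. This is a sketch that uses the constants appearing as defaults in `kaplan_scaling_law` in this section; N is assumed to count non-embedding parameters.

```latex
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N},
\qquad N_c \approx 8.8 \times 10^{13}, \quad \alpha_N \approx 0.076
```

Under this form, doubling N multiplies the predicted loss by $2^{-\alpha_N} \approx 0.949$, a reduction of roughly 5% per doubling.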
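Kaplan Data Scaling
$$L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}$$
Here,
- $L(D)$ = predicted cross-entropy loss when data is the limiting factor
- $D$ = dataset size in tokens
- $D_c$ = fitted constant, about $5.4 \times 10^{13}$ tokens
- $\alpha_D$ = fitted exponent, about 0.095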
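Kaplan Compute Scaling
$$L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}$$
Here,
- $L(C)$ = predicted cross-entropy loss when compute is the limiting factor
- $C$ = training compute in PF-days (1 PF-day = $8.64 \times 10^{19}$ FLOPs)
- $C_c$ = fitted constant, about $3.1 \times 10^{8}$ PF-days
- $\alpha_C$ = fitted exponent, about 0.050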
import numpy as np


def kaplan_scaling_law(N, N_c=8.8e13, alpha_N=0.076):
    """Predicted loss from model size N (non-embedding parameters)."""
    return (N_c / N) ** alpha_N


def kaplan_data_scaling(D, D_c=5.4e13, alpha_D=0.095):
    """Predicted loss from dataset size D (tokens)."""
    return (D_c / D) ** alpha_D


def kaplan_compute_scaling(C, C_c=3.1e8, alpha_C=0.050):
    """Predicted loss from compute C. Both C and C_c are in PF-days, not FLOPs."""
    return (C_c / C) ** alpha_C


# Example predictions
model_sizes = np.logspace(6, 11, 100)  # 1M to 100B parameters
losses = [kaplan_scaling_law(N) for N in model_sizes]
print(f"1B parameter model predicted loss: {kaplan_scaling_law(1e9):.3f}")
print(f"10B parameter model predicted loss: {kaplan_scaling_law(1e10):.3f}")
print(f"100B parameter model predicted loss: {kaplan_scaling_law(1e11):.3f}")
Kaplan's original work suggested that model size mattered more than dataset size, which led to very large models trained on relatively little data. The Chinchilla work later revised this conclusion.
Chinchilla Optimal Training
DeepMind's Chinchilla paper (Hoffmann et al., 2022) revised the scaling laws, showing that model size and training data should be scaled in equal proportion.
For a given compute budget C, the compute-optimal configuration grows model size and token count at roughly the same rate, each approximately as the square root of C. In practice this works out to about 20 training tokens per parameter.
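The difference between the two allocation rules is easiest to see as growth factors. The sketch below assumes approximate allocation exponents (N growing as C^0.73 and D as C^0.27 for Kaplan, and C^0.5 each for Chinchilla); these values are recalled from the papers rather than derived here, and `growth_factors` is a hypothetical helper.

```python
# Approximate compute-allocation exponents: N_opt grows as C**a, D_opt as C**b.
# Assumed values recalled from the two papers, not fitted in this example.
ALLOCATION_EXPONENTS = {
    "Kaplan (2020)": (0.73, 0.27),
    "Chinchilla (2022)": (0.50, 0.50),
}


def growth_factors(compute_multiplier, a, b):
    """Return (model growth, data growth) when compute is multiplied."""
    return compute_multiplier ** a, compute_multiplier ** b


for name, (a, b) in ALLOCATION_EXPONENTS.items():
    n_growth, d_growth = growth_factors(10.0, a, b)
    print(f"{name}: 10x compute -> {n_growth:.2f}x params, {d_growth:.2f}x tokens")
```

Because the two exponents sum to 1 in both cases, the product of the growth factors always equals the compute multiplier; the rules differ only in how that growth is split.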
Why Chinchilla Matters
Before Chinchilla:
- GPT-3: 175B parameters, trained on 300B tokens (about 1.7 tokens per parameter)
- By the Chinchilla criterion, this model was heavily undertrained
After Chinchilla:
- Chinchilla: 70B parameters, trained on 1.4T tokens (20 tokens per parameter)
- Achieved better performance than GPT-3 with fewer parameters
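The ratios above can be reproduced with a few lines of arithmetic. This is a minimal sketch using the parameter and token counts quoted in this section and the approximation C ≈ 6ND for training compute; `training_summary` is a hypothetical helper name.

```python
def training_summary(params, tokens):
    """Return tokens per parameter and approximate training FLOPs (C ~ 6ND)."""
    return {
        "tokens_per_param": tokens / params,
        "train_flops": 6 * params * tokens,
    }


models = {
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}

for name, (params, tokens) in models.items():
    s = training_summary(params, tokens)
    print(f"{name}: {s['tokens_per_param']:.1f} tokens/param, "
          f"{s['train_flops']:.2e} training FLOPs")
```

By this estimate the two runs used training compute of the same order of magnitude, but split it very differently between parameters and tokens.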
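Chinchilla Compute Budget Allocation
$$C \approx 6ND, \quad D \approx 20N \quad\Rightarrow\quad N_{\text{opt}} = \sqrt{\frac{C}{120}}, \quad D_{\text{opt}} = 20\,N_{\text{opt}}$$
Here,
- $C$ = training compute budget in FLOPs
- $N$ = number of model parameters
- $D$ = number of training tokens
- $6$ = approximate FLOPs per parameter per training token (forward and backward pass)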
def chinchilla_optimal_allocation(compute_budget_flops):
    """Compute optimal model size and data for given compute budget."""
    # From Chinchilla: C ≈ 6ND and D ≈ 20N
    # Substituting: C ≈ 6N(20N) = 120N²
    # Therefore: N = sqrt(C / 120)
    N_optimal = np.sqrt(compute_budget_flops / 120)
    D_optimal = 20 * N_optimal
    return {
        "optimal_params": N_optimal,
        "optimal_tokens": D_optimal,
        "tokens_per_param": D_optimal / N_optimal,
        "model_size_billions": N_optimal / 1e9,
        "data_size_trillions": D_optimal / 1e12,
    }


# Example compute budgets
compute_budgets = {
    "1e21 FLOPs (small)": 1e21,
    "1e23 FLOPs (medium)": 1e23,
    "1e25 FLOPs (large)": 1e25,
    "1e26 FLOPs (very large)": 1e26,
}

for name, budget in compute_budgets.items():
    allocation = chinchilla_optimal_allocation(budget)
    print(f"\n{name}:")
    print(f"  Optimal model: {allocation['model_size_billions']:.1f}B params")
    print(f"  Optimal data: {allocation['data_size_trillions']:.2f}T tokens")
    print(f"  Ratio: {allocation['tokens_per_param']:.1f} tokens/param")
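As a sanity check on the allocation rule, the sketch below starts from Chinchilla's own configuration (70B parameters, 1.4T tokens), computes its budget with C ≈ 6ND, and inverts the rule to recover the model size. The `tokens_per_param` argument is a hypothetical generalization so the same function also covers ratios other than 20.

```python
import math


def chinchilla_allocation(compute_flops, tokens_per_param=20.0):
    """Solve C = 6 * N * D with D = tokens_per_param * N for N and D."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Chinchilla itself: 70B parameters trained on 1.4T tokens
budget = 6 * 70e9 * 1.4e12
n_params, n_tokens = chinchilla_allocation(budget)
print(f"Budget: {budget:.2e} FLOPs")
print(f"Recovered model size: {n_params / 1e9:.1f}B parameters")
print(f"Recovered token count: {n_tokens / 1e12:.2f}T tokens")
```

With the default ratio of 20, the constant 6 × 20 = 120 is the same one used in the derivation above.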
Compute-Optimal Frontier
The compute-optimal frontier represents the best achievable loss for a given compute budget:
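Compute-Optimal Loss
$$L^*(C) = \left(\frac{C_c^*}{C}\right)^{\alpha^*}$$
Here,
- $L^*(C)$ = best achievable loss at compute budget $C$
- $C$ = training compute budget
- $C_c^*$ = reference compute constant ($3.1 \times 10^{8}$ in this section's code, in the same units as $C$)
- $\alpha^*$ = compute-optimal exponent (0.034 in this section)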
PF_DAY_FLOPS = 8.64e19  # one petaFLOP/s sustained for one day


def compute_optimal_frontier():
    """Generate compute-optimal frontier data."""
    compute_range = np.logspace(17, 27, 100)  # FLOPs
    # The constant 3.1e8 is expressed in PF-days, so convert before applying it
    compute_pf_days = compute_range / PF_DAY_FLOPS
    # Kaplan frontier
    kaplan_loss = [kaplan_compute_scaling(C) for C in compute_pf_days]
    # Chinchilla frontier as modeled in this section (exponent 0.034)
    chinchilla_loss = [(3.1e8 / C) ** 0.034 for C in compute_pf_days]
    return compute_range, kaplan_loss, chinchilla_loss


# Compare frontiers
C_range, L_kaplan, L_chinchilla = compute_optimal_frontier()

# At 1e24 FLOPs, converted to PF-days
C_target = 1e24 / PF_DAY_FLOPS
L_k = kaplan_compute_scaling(C_target)
L_c = (3.1e8 / C_target) ** 0.034
print("At 1e24 FLOPs:")
print(f"  Kaplan prediction: {L_k:.3f}")
print(f"  Chinchilla prediction: {L_c:.3f}")
print(f"  Improvement: {(L_k - L_c) / L_k * 100:.1f}%")
Practical Implications for Training
Training Budget Planning
PF_DAY_FLOPS = 8.64e19  # the frontier constant 3.1e8 is expressed in PF-days


class TrainingPlanner:
    def __init__(self, target_loss: float, available_flops: float):
        self.target_loss = target_loss
        self.available_flops = available_flops

    def plan_training(self):
        """Create training plan based on compute budget."""
        chinchilla = chinchilla_optimal_allocation(self.available_flops)
        # Check if target loss is achievable at this budget
        optimal_loss = (3.1e8 / (self.available_flops / PF_DAY_FLOPS)) ** 0.034
        if optimal_loss > self.target_loss:
            return {
                "feasible": False,
                "message": f"Target loss {self.target_loss:.3f} not achievable. "
                           f"Optimal loss: {optimal_loss:.3f}",
            }
        # Use the Chinchilla-optimal allocation for this budget
        N = chinchilla["optimal_params"]
        D = chinchilla["optimal_tokens"]
        return {
            "feasible": True,
            "model_size_b": N / 1e9,
            "data_tokens_t": D / 1e12,
            "training_tokens": int(D),
            "estimated_loss": optimal_loss,
            "tokens_per_param": D / N,
        }

    def compare_approaches(self):
        """Compare different training approaches.

        All three share the same token count, so their total compute differs,
        and the loss estimate here depends only on total compute.
        """
        chinchilla = chinchilla_optimal_allocation(self.available_flops)
        approaches = {
            "Chinchilla Optimal": {
                "N": chinchilla["optimal_params"],
                "D": chinchilla["optimal_tokens"],
                "tokens_per_param": 20,
            },
            "Overtrained (40 tok/param)": {
                "N": chinchilla["optimal_params"] / 2,
                "D": chinchilla["optimal_tokens"],
                "tokens_per_param": 40,
            },
            "Large Model (10 tok/param)": {
                "N": chinchilla["optimal_params"] * 2,
                "D": chinchilla["optimal_tokens"],
                "tokens_per_param": 10,
            },
        }
        results = {}
        for name, config in approaches.items():
            C = 6 * config["N"] * config["D"]  # FLOPs
            loss = (3.1e8 / (C / PF_DAY_FLOPS)) ** 0.034
            results[name] = {
                "model_size": config["N"] / 1e9,
                "data_tokens": config["D"] / 1e12,
                "compute_flops": C,
                "predicted_loss": loss,
            }
        return results
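`compare_approaches` keeps the token count fixed while halving or doubling N, so its three configurations use different amounts of compute. A fixed-budget comparison needs a loss estimate that depends on N and D separately. The sketch below assumes the parametric form L(N, D) = E + A/N^α + B/D^β with constants recalled from Hoffmann et al. (2022); the constants, the helper names, and the 1e23 FLOPs budget are illustrative assumptions, not part of the planner above.

```python
import math

# Assumed constants for L(N, D) = E + A / N**alpha + B / D**beta, recalled from
# the parametric fit reported by Hoffmann et al. (2022); treat as illustrative.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def parametric_loss(n_params, n_tokens):
    """Loss estimate that depends on model size and data separately."""
    return E + A / n_params ** ALPHA + B / n_tokens ** BETA


def config_at_fixed_compute(compute_flops, tokens_per_param):
    """Pick N and D so that 6 * N * D equals the budget at the given ratio."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params


budget = 1e23  # FLOPs, hypothetical
for ratio in (10, 20, 40):
    n_params, n_tokens = config_at_fixed_compute(budget, ratio)
    loss = parametric_loss(n_params, n_tokens)
    print(f"{ratio:>3} tok/param: {n_params / 1e9:5.1f}B params, "
          f"{n_tokens / 1e12:4.2f}T tokens, loss ~ {loss:.3f}")
```

Every row spends the same compute, so the printed losses isolate the effect of the tokens-per-parameter ratio. The parametric fit and the 20-tokens rule of thumb come from different estimation approaches, so the fit's minimum need not sit at exactly 20.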
Common Training Configurations
| Configuration | Parameters | Training Tokens | Tokens per Parameter | Use Case |
|---|---|---|---|---|
| Chinchilla Optimal | 70B | 1.4T | 20 | Research baseline |
| Overtrained Small | 7B | 2T | 286 | Production efficiency |
| Overtrained Medium | 13B | 2T | 154 | Balanced performance |
| Undertrained Large | 70B | 400B | 5.7 | Quick experimentation |
In practice, many teams "overtrain" smaller models (using more tokens per parameter than Chinchilla optimal) because smaller models are much cheaper to serve. The inference cost savings often outweigh the training inefficiency.
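The trade-off can be made concrete by adding up training and serving compute. This sketch assumes roughly 6N FLOPs per training token and roughly 2N FLOPs per generated or processed token at inference (forward pass only); the lifetime serving volume of 1e13 tokens is a hypothetical figure, and `lifetime_flops` is an illustrative helper.

```python
def lifetime_flops(n_params, train_tokens, inference_tokens):
    """Approximate training (6ND) and inference (2N per token) FLOPs."""
    train = 6 * n_params * train_tokens
    inference = 2 * n_params * inference_tokens
    return train, inference


INFERENCE_TOKENS = 1e13  # hypothetical lifetime serving volume

configs = {
    "70B on 1.4T tokens": (70e9, 1.4e12),
    "7B on 2T tokens": (7e9, 2e12),
}

for name, (n_params, train_tokens) in configs.items():
    train, inference = lifetime_flops(n_params, train_tokens, INFERENCE_TOKENS)
    print(f"{name}: train {train:.2e}, inference {inference:.2e}, "
          f"total {train + inference:.2e} FLOPs")
```

This compares cost only. The smaller model will generally reach a higher loss, so the saving has to be weighed against the quality the application needs.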
Scaling Laws Beyond Chinchilla
Emergent Abilities
Some capabilities have been reported to appear only above certain model scales. The thresholds below are approximate and depend on the benchmark and training setup:
emergent_abilities = {
    "chain_of_thought_reasoning": "Appears around 62B parameters",
    "few_shot_learning": "Significant improvement above 10B",
    "instruction_following": "Strong above 13B",
    "code_generation": "Noticeable above 6B",
    "multilingual_transfer": "Improves dramatically above 10B",
}
Inference-Optimal Scaling
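Inference-Optimal Model Size
$$N^* = \arg\min_N \left[\, c_{\text{train}} \cdot 6ND + c_{\text{inf}} \cdot 2N\,T_{\text{inf}} \,\right] \quad \text{subject to reaching the target loss}$$
Here,
- $N$ = number of model parameters
- $D$ = number of training tokens
- $T_{\text{inf}}$ = expected number of tokens processed at inference over the model's lifetime
- $c_{\text{train}}, c_{\text{inf}}$ = cost per training FLOP and per inference FLOP
def inference_optimal_size(
    training_budget_flops: float,
    expected_inference_tokens: float,
    cost_per_training_flop: float = 1.0,
    cost_per_inference_flop: float = 1.0,
):
    """Search candidate model sizes for the lowest training-plus-inference cost.

    Simplification: every candidate is trained at 20 tokens per parameter and
    no loss target is enforced, so both cost terms grow with N and the search
    returns the smallest candidate. A loss constraint is needed for a
    meaningful trade-off.
    """
    # Chinchilla-optimal size for the training budget (reference point only)
    chinchilla = chinchilla_optimal_allocation(training_budget_flops)
    N_chinchilla = chinchilla["optimal_params"]

    # Total cost = training cost + inference cost
    N_range = np.logspace(8, 11, 100)
    total_costs = []
    for N in N_range:
        # Training tokens at the Chinchilla-optimal ratio
        D = 20 * N
        C_train = 6 * N * D
        train_cost = C_train * cost_per_training_flop
        # Inference is forward pass only: about 2N FLOPs per token, not 6N
        C_infer = 2 * N * expected_inference_tokens
        infer_cost = C_infer * cost_per_inference_flop
        total_costs.append(train_cost + infer_cost)
    optimal_idx = np.argmin(total_costs)
    return N_range[optimal_idx]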
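The function above trains every candidate at 20 tokens per parameter and enforces no loss target, so both cost terms grow with N. A loss-constrained version is sketched below: for each candidate N it solves for the tokens needed to hit a target loss, then minimizes training plus inference FLOPs. It assumes the parametric form L(N, D) = E + A/N^α + B/D^β with constants recalled from Hoffmann et al. (2022); the constants, the target loss of 2.0, and the helper names are illustrative assumptions.

```python
import math

# Assumed parametric loss L(N, D) = E + A / N**alpha + B / D**beta, with
# constants recalled from Hoffmann et al. (2022); treat them as illustrative.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def tokens_needed(n_params, target_loss):
    """Training tokens for a model of size N to reach target_loss, or None."""
    gap = target_loss - E - A / n_params ** ALPHA
    if gap <= 0:
        return None  # this model size cannot reach the target at any D
    return (B / gap) ** (1.0 / BETA)


def cheapest_model(target_loss, inference_tokens, candidates):
    """Return (N, D, total FLOPs) minimizing 6ND + 2N * inference_tokens."""
    best = None
    for n_params in candidates:
        n_tokens = tokens_needed(n_params, target_loss)
        if n_tokens is None:
            continue
        total = 6 * n_params * n_tokens + 2 * n_params * inference_tokens
        if best is None or total < best[2]:
            best = (n_params, n_tokens, total)
    return best


candidates = [10 ** (8 + 0.05 * i) for i in range(81)]  # 1e8 to 1e12 params
for inference_tokens in (0.0, 1e12, 1e13):
    n_params, n_tokens, total = cheapest_model(2.0, inference_tokens, candidates)
    print(f"inference tokens {inference_tokens:.0e}: "
          f"{n_params / 1e9:.1f}B params, {n_tokens / 1e12:.2f}T tokens")
```

As the expected inference volume grows, the selected model can only stay the same size or shrink, and its training token count rises to compensate. That is the overtraining pattern described earlier in this section.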
Summary
- Kaplan scaling laws describe power-law relationships between loss and model/data/compute
- Chinchilla established that optimal training uses ~20 tokens per parameter
- This section models compute-optimal loss as L*(C) = (C_c*/C)^0.034
- Chinchilla corrects Kaplan's finding that model size is more important than data
- In practice, many teams overtrain smaller models for inference efficiency
- Training budget planning should account for both training and inference costs
- Emergent abilities appear at certain scale thresholds
Practice Exercises
- Scaling Law Prediction: Use the Kaplan and Chinchilla scaling laws to predict performance for 1B, 10B, and 100B parameter models.
- Budget Planning: Given a $100K compute budget, determine the optimal model size and training tokens.
- Overtraining Analysis: Compare Chinchilla-optimal and overtrained configurations for a 7B model. What are the inference cost savings?
- Compute Allocation: You have about 1e23 FLOPs. Should you train a 70B model on 240B tokens or a 14B model on 1.2T tokens?
- Emergent Abilities: Research which capabilities emerge at different scales. Create a matrix of scale versus ability.