Scaling Laws and Chinchilla — The Science of Making LLMs Bigger

Scaling laws reveal predictable power-law relationships between model performance and factors like size, data, and compute.

Kaplan Laws — Performance follows power laws with model size, dataset size, and compute budget independently
Chinchilla Optimal — The compute-optimal ratio is roughly 20 tokens per parameter, correcting Kaplan's model-first bias
Practical Implications — Many teams overtrain smaller models because inference cost savings outweigh training inefficiency

"Chinchilla proved that data is as important as model size — GPT-3 was severely undertrained at 1.7 tokens per parameter."

Scaling Laws & Chinchilla

Scaling laws describe the relationship between model performance and key factors like model size, dataset size, and compute budget. Understanding these laws is crucial for making informed decisions about LLM training.

Why Scaling Laws Matter

Training large language models requires enormous resources. Scaling laws help us:

Predict performance before committing resources
Allocate compute budget optimally between model size and data
Compare training runs across different scales
Plan training for new models
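As a rough illustration of the first point, the sketch below fits a power law to losses from a few small runs and extrapolates to a larger model. The data is synthetic (generated from an assumed exponent, with no noise), and `fit_power_law` is a hypothetical helper written for this example, not part of any library.

```python
import numpy as np

def fit_power_law(sizes, losses):
    """Fit loss = a * size**(-b) by linear regression in log-log space."""
    slope, intercept = np.polyfit(np.log(sizes), np.log(losses), 1)
    return np.exp(intercept), -slope  # (a, b)

# Synthetic "small-run" results: made-up numbers generated from a known law,
# so we can check that the fit recovers it. Real runs would be noisy.
true_a, true_b = 10.0, 0.08
small_sizes = np.array([1e7, 3e7, 1e8, 3e8])
small_losses = true_a * small_sizes ** (-true_b)

a, b = fit_power_law(small_sizes, small_losses)
predicted_1b = a * 1e9 ** (-b)

print(f"Fitted exponent: {b:.3f}")
print(f"Extrapolated loss at 1B parameters: {predicted_1b:.3f}")
```

With real measurements the fit is only as trustworthy as the range it was fitted on, so extrapolating across many orders of magnitude should be treated with caution.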

Kaplan Scaling Laws

OpenAI's Kaplan et al. (2020) established the first comprehensive scaling laws for language models.

import numpy as np
import matplotlib.pyplot as plt

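def kaplan_scaling_law(N, N_c=8.8e13, alpha_N=0.076):
    """Predicted loss (nats/token) from non-embedding parameter count N.

    Assumes data and compute are not the bottleneck.
    """
    return (N_c / N) ** alpha_N

def kaplan_data_scaling(D, D_c=5.4e13, alpha_D=0.095):
    """Predicted loss from dataset size D in tokens (model size not limiting)."""
    return (D_c / D) ** alpha_D

# Kaplan et al. report the compute constant C_c in PF-days, not FLOPs.
FLOPS_PER_PF_DAY = 1e15 * 86400  # 8.64e19 FLOPs

def kaplan_compute_scaling(C, C_c=3.1e8, alpha_C=0.050):
    """Predicted loss from compute C given in FLOPs (C_c is in PF-days)."""
    return (C_c * FLOPS_PER_PF_DAY / C) ** alpha_C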

# Example predictions
model_sizes = np.logspace(6, 11, 100)  # 1M to 100B parameters
losses = [kaplan_scaling_law(N) for N in model_sizes]

print(f"1B parameter model predicted loss: {kaplan_scaling_law(1e9):.3f}")
print(f"10B parameter model predicted loss: {kaplan_scaling_law(1e10):.3f}")
print(f"100B parameter model predicted loss: {kaplan_scaling_law(1e11):.3f}")
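One way to read the model-size law is in terms of what a 10x increase in parameters buys. The sketch below assumes only the functional form and the exponent used above; `loss_multiplier` is a hypothetical helper name.

```python
alpha_N = 0.076  # Kaplan et al. exponent for model size, as used above

def loss_multiplier(scale_factor, alpha=alpha_N):
    """Factor applied to the loss when N is multiplied by scale_factor.

    Under L(N) = (N_c / N) ** alpha, scaling N by k multiplies the loss by
    k ** (-alpha), independent of the starting size.
    """
    return scale_factor ** (-alpha)

per_decade = loss_multiplier(10)
print(f"10x more parameters -> loss multiplied by {per_decade:.3f}")
print(f"Reduction per 10x: {(1 - per_decade) * 100:.1f}%")

# Factor by which N must grow to halve the loss under this law alone
growth_to_halve = 2 ** (1 / alpha_N)
print(f"Growth in N needed to halve the loss: {growth_to_halve:.2e}x")
```

The small exponent is the reason parameter count alone is an expensive lever: each order of magnitude yields only a modest drop in predicted loss.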

Chinchilla Optimal Training

DeepMind's Chinchilla (Hoffmann et al., 2022) revised the scaling laws, showing that for compute-optimal training, model size and training tokens should grow in roughly equal proportion as the compute budget increases.

Why Chinchilla Matters

Before Chinchilla:

GPT-3: 175B parameters, trained on 300B tokens (about 1.7 tokens per parameter)
By Chinchilla's estimate, this was heavily undertrained for its size

After Chinchilla:

Chinchilla: 70B parameters, trained on 1.4T tokens (about 20 tokens per parameter)
Achieved better performance than GPT-3 with fewer than half the parameters
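The ratios quoted above can be reproduced directly, along with an approximate training cost from the common C ≈ 6ND rule. This is a minimal sketch; `training_summary` is a hypothetical helper, and the FLOP counts are approximations rather than reported figures.

```python
def training_summary(params, tokens):
    """Tokens per parameter and approximate training FLOPs (C ~ 6 * N * D)."""
    return {
        "tokens_per_param": tokens / params,
        "train_flops": 6 * params * tokens,
    }

# (parameters, training tokens) as quoted in the text
models = {
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}

summaries = {name: training_summary(n, d) for name, (n, d) in models.items()}
for name, s in summaries.items():
    print(f"{name}: {s['tokens_per_param']:.1f} tokens/param, "
          f"~{s['train_flops']:.2e} training FLOPs")
```

Under this approximation the two runs are not compute-matched: the Chinchilla run comes out at somewhat under twice the training FLOPs of GPT-3, so the comparison is about how compute was allocated as well as how much was spent.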

def chinchilla_optimal_allocation(compute_budget_flops):
    """Compute optimal model size and data for given compute budget."""
    # From Chinchilla: C ≈ 6ND and D ≈ 20N
    # Substituting: C ≈ 6N(20N) = 120N²
    # Therefore: N = sqrt(C / 120)
    N_optimal = np.sqrt(compute_budget_flops / 120)
    D_optimal = 20 * N_optimal

    return {
        "optimal_params": N_optimal,
        "optimal_tokens": D_optimal,
        "tokens_per_param": D_optimal / N_optimal,
        "model_size_billions": N_optimal / 1e9,
        "data_size_trillions": D_optimal / 1e12
    }

# Example compute budgets
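compute_budgets = {
    "1e21 FLOPs (small)": 1e21,
    "1e23 FLOPs (medium)": 1e23,
    "1e25 FLOPs (large)": 1e25,
    # Training compute for frontier models such as GPT-4 is not public,
    # so this tier is labeled generically.
    "1e26 FLOPs (frontier scale)": 1e26
}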

for name, budget in compute_budgets.items():
    allocation = chinchilla_optimal_allocation(budget)
    print(f"\n{name}:")
    print(f"  Optimal model: {allocation['model_size_billions']:.1f}B params")
    print(f"  Optimal data: {allocation['data_size_trillions']:.2f}T tokens")
    print(f"  Ratio: {allocation['tokens_per_param']:.1f} tokens/param")
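To relate a FLOP budget like those above to hardware time, a rough conversion is sketched below. The accelerator numbers are placeholders chosen for illustration (not the specification of any real device), and `gpu_hours_for_budget` is a hypothetical helper.

```python
def gpu_hours_for_budget(compute_flops, peak_flops_per_gpu, utilization):
    """GPU-hours needed to execute compute_flops at a sustained fraction of peak."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    sustained = peak_flops_per_gpu * utilization  # FLOP/s actually achieved
    return compute_flops / sustained / 3600

# Placeholder hardware assumptions: 1e15 FLOP/s peak, 40% sustained utilization
hours = gpu_hours_for_budget(1e23, peak_flops_per_gpu=1e15, utilization=0.4)
print(f"1e23 FLOPs ~ {hours:,.0f} GPU-hours under these assumptions")
```

Multiplying the result by an hourly price gives a cost estimate; both the utilization and the price vary widely between setups, so they should be measured or quoted rather than assumed.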

Compute-Optimal Frontier

The compute-optimal frontier represents the best achievable loss for a given compute budget:

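def chinchilla_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Parametric loss fit from Hoffmann et al. (2022): E + A/N^alpha + B/D^beta."""
    return E + A / N ** alpha + B / D ** beta

def chinchilla_frontier_loss(C):
    """Predicted loss when C FLOPs are split at roughly 20 tokens per parameter."""
    allocation = chinchilla_optimal_allocation(C)
    return chinchilla_loss(allocation["optimal_params"], allocation["optimal_tokens"])

def compute_optimal_frontier():
    """Generate compute-optimal frontier data."""
    compute_range = np.logspace(17, 27, 100)  # FLOPs

    # Kaplan frontier (pure power law in compute)
    kaplan_loss = [kaplan_compute_scaling(C) for C in compute_range]

    # Chinchilla frontier (parametric fit evaluated at the 20:1 allocation)
    chinchilla_loss_values = [chinchilla_frontier_loss(C) for C in compute_range]

    return compute_range, kaplan_loss, chinchilla_loss_values

# Compare frontiers
C_range, L_kaplan, L_chinchilla = compute_optimal_frontier()

# At 1e24 FLOPs
C_target = 1e24
L_k = kaplan_compute_scaling(C_target)
L_c = chinchilla_frontier_loss(C_target)
print("At 1e24 FLOPs:")
print(f"  Kaplan prediction: {L_k:.3f}")
print(f"  Chinchilla prediction: {L_c:.3f}")
# The two fits come from different datasets and tokenizers, so the absolute
# values are not directly comparable; compare how each curve changes with C.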
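A frontier point is the minimum of a fixed-compute ("IsoFLOP") curve. The sketch below sweeps the tokens-per-parameter ratio at one compute budget and evaluates the parametric loss reported by Hoffmann et al. (2022); the constants come from that paper's fit, while `parametric_loss` and `isoflop_profile` are hypothetical names used only here.

```python
import numpy as np

# Parametric fit from Hoffmann et al. (2022): L = E + A / N^alpha + B / D^beta
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def parametric_loss(N, D):
    return E + A / N ** ALPHA + B / D ** BETA

def isoflop_profile(compute_flops, ratios):
    """Model size, tokens, and loss along a fixed-compute curve."""
    ratios = np.asarray(ratios, dtype=float)
    # From C = 6 * N * D with D = ratio * N
    N = np.sqrt(compute_flops / (6 * ratios))
    D = ratios * N
    return N, D, parametric_loss(N, D)

ratios = np.logspace(0, 3, 61)  # 1 to 1000 tokens per parameter
N, D, losses = isoflop_profile(1e23, ratios)
best = int(np.argmin(losses))

print(f"Lowest predicted loss at ~{ratios[best]:.0f} tokens/param "
      f"(N ~ {N[best] / 1e9:.1f}B, D ~ {D[best] / 1e12:.2f}T)")
```

With these constants the minimum falls at a higher ratio than 20 tokens per parameter. The parametric fit is one of several estimation approaches in the paper and they do not agree exactly, so the 20:1 figure is best read as a rule of thumb rather than a sharp optimum.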

Practical Implications for Training

Training Budget Planning

class TrainingPlanner:
    # Parametric fit from Hoffmann et al. (2022): L(N, D) = E + A/N^alpha + B/D^beta
    E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

    def __init__(self, target_loss: float, available_flops: float):
        self.target_loss = target_loss
        self.available_flops = available_flops

    @classmethod
    def predicted_loss(cls, N, D):
        """Predicted loss for N parameters trained on D tokens."""
        return cls.E + cls.A / N ** cls.ALPHA + cls.B / D ** cls.BETA

    def plan_training(self):
        """Create training plan based on compute budget."""
        chinchilla = chinchilla_optimal_allocation(self.available_flops)
        N = chinchilla["optimal_params"]
        D = chinchilla["optimal_tokens"]

        # Check whether the target loss is reachable at the 20:1 allocation
        estimated_loss = self.predicted_loss(N, D)

        if estimated_loss > self.target_loss:
            return {
                "feasible": False,
                "message": f"Target loss {self.target_loss:.3f} not reached at "
                           f"20 tokens/param. Predicted loss: {estimated_loss:.3f}"
            }

        return {
            "feasible": True,
            "model_size_b": N / 1e9,
            "data_tokens_t": D / 1e12,
            "training_tokens": int(D),
            "estimated_loss": estimated_loss,
            "tokens_per_param": D / N
        }

    def compare_approaches(self):
        """Compare allocations that all spend the same compute budget."""
        ratios = {
            "Chinchilla Optimal": 20,
            "Overtrained (40 tok/param)": 40,
            "Large Model (10 tok/param)": 10
        }

        results = {}
        for name, ratio in ratios.items():
            # C = 6 * N * D with D = ratio * N  =>  N = sqrt(C / (6 * ratio))
            N = np.sqrt(self.available_flops / (6 * ratio))
            D = ratio * N
            results[name] = {
                "model_size": N / 1e9,
                "data_tokens": D / 1e12,
                "compute_flops": 6 * N * D,
                # Loss depends on how the budget is split, not only on its size
                "predicted_loss": self.predicted_loss(N, D)
            }

        return results

Common Training Configurations

Configuration	Parameters	Training Tokens	Tokens per Parameter	Use Case
Chinchilla Optimal	70B	1.4T	20	Research baseline
Overtrained Small	7B	2T	286	Production efficiency
Overtrained Medium	13B	2T	154	Balanced performance
Undertrained Large	70B	400B	5.7	Quick experimentation
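The ratio column, and the reason small overtrained models are attractive in production, can be checked with a few lines. This sketch uses the common approximations of about 6 FLOPs per parameter per training token and about 2 FLOPs per parameter per generated token; the dictionary layout is an assumption made for the example.

```python
# (parameters, training tokens) for each row of the table
configs = {
    "Chinchilla Optimal": (70e9, 1.4e12),
    "Overtrained Small": (7e9, 2e12),
    "Overtrained Medium": (13e9, 2e12),
    "Undertrained Large": (70e9, 400e9),
}

baseline_params = 70e9  # compare per-token inference cost against a 70B model

rows = {}
for name, (params, tokens) in configs.items():
    rows[name] = {
        "tokens_per_param": tokens / params,
        "train_flops": 6 * params * tokens,        # forward + backward pass
        "infer_flops_per_token": 2 * params,       # forward pass only
        "relative_inference_cost": params / baseline_params,
    }

for name, r in rows.items():
    print(f"{name}: {r['tokens_per_param']:.1f} tok/param, "
          f"train ~{r['train_flops']:.2e} FLOPs, "
          f"inference {r['relative_inference_cost']:.2f}x of 70B")
```

Per generated token, the 7B configuration costs roughly a tenth of the 70B one under this approximation, which is the saving that can justify training far past the compute-optimal ratio.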

Scaling Laws Beyond Chinchilla

Emergent Abilities

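Some capabilities have been reported to appear only beyond certain model scales. The thresholds below are rough and depend on the benchmark, model family, and training data: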

emergent_abilities = {
    "chain_of_thought_reasoning": "Appears around 62B parameters",
    "few_shot_learning": "Significant improvement above 10B",
    "instruction_following": "Strong above 13B",
    "code_generation": "Noticeable above 6B",
    "multilingual_transfer": "Improves dramatically above 10B"
}

Inference-Optimal Scaling

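def inference_optimal_size(
    training_budget_flops: float,
    expected_inference_tokens: float,
    cost_per_training_flop: float = 1.0,
    cost_per_inference_flop: float = 1.0
):
    """Find the model size that minimizes training + inference cost.

    The quality target is the loss a Chinchilla-style (20 tokens/param) model
    would reach with training_budget_flops, using the parametric fit from
    Hoffmann et al. (2022): L(N, D) = E + A/N^alpha + B/D^beta. Smaller
    candidates must train on more tokens to hit the same loss, so their
    training spend can exceed training_budget_flops.
    """
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

    chinchilla = chinchilla_optimal_allocation(training_budget_flops)
    N_chinchilla = chinchilla["optimal_params"]
    D_chinchilla = chinchilla["optimal_tokens"]
    target_loss = E + A / N_chinchilla ** alpha + B / D_chinchilla ** beta

    # Candidate sizes from 0.1x to about 3x the Chinchilla-style size
    N_range = N_chinchilla * np.logspace(-1, 0.5, 200)
    total_costs = []

    for N in N_range:
        # Loss budget left for the data term once the model-size term is paid
        data_term = target_loss - E - A / N ** alpha
        if data_term <= 0:
            # Too small: no amount of data reaches the target loss
            total_costs.append(np.inf)
            continue

        # Tokens needed so that B / D^beta equals the remaining loss budget
        D = (B / data_term) ** (1 / beta)

        # Training: about 6 FLOPs per parameter per token (forward + backward)
        train_cost = 6 * N * D * cost_per_training_flop

        # Inference: about 2 FLOPs per parameter per token (forward only)
        infer_cost = 2 * N * expected_inference_tokens * cost_per_inference_flop

        total_costs.append(train_cost + infer_cost)

    optimal_idx = np.argmin(total_costs)
    return N_range[optimal_idx]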
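A quick way to judge whether inference cost matters for a given deployment is to ask when lifetime inference compute overtakes training compute. The sketch below assumes about 6 FLOPs per parameter per training token and about 2 FLOPs per parameter per generated token; both function names are hypothetical.

```python
def breakeven_inference_tokens(training_tokens):
    """Tokens served at which inference FLOPs equal training FLOPs.

    6 * N * D = 2 * N * T  =>  T = 3 * D  (the parameter count N cancels).
    """
    return 3 * training_tokens

def inference_share(params, training_tokens, inference_tokens):
    """Fraction of lifetime compute spent on inference."""
    train_flops = 6 * params * training_tokens
    infer_flops = 2 * params * inference_tokens
    return infer_flops / (train_flops + infer_flops)

D = 1.4e12  # training tokens for a 70B Chinchilla-style model
T_breakeven = breakeven_inference_tokens(D)
share = inference_share(70e9, D, T_breakeven)

print(f"Break-even at {T_breakeven:.2e} served tokens")
print(f"Inference share of total compute at break-even: {share:.0%}")
```

If a model is expected to serve far more than three times its training tokens, inference dominates lifetime compute and a smaller, longer-trained model becomes more attractive.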

Summary

Practice Exercises

Scaling Law Prediction: Use Kaplan and Chinchilla scaling laws to predict performance for 1B, 10B, and 100B parameter models.
Budget Planning: Given a $100K compute budget, determine the optimal model size and training tokens. State the price per GPU-hour and hardware utilization you assume when converting dollars to FLOPs.
Overtraining Analysis: Compare Chinchilla-optimal vs overtrained configurations for a 7B model. What are the inference cost savings?
Compute Allocation: You have roughly 1e23 FLOPs. Should you train a 70B model on 240B tokens or a 14B model on 1.2T tokens?
Emergent Abilities: Research which capabilities emerge at different scales. Create a scale-ability matrix.

What to Learn Next

-> Building Production LLM Applications Applying scaling insights to real-world training and deployment budgets.

-> Pretraining Language Models The training process where scaling laws directly apply.

-> LLM Architecture Deep Dive How architectural choices interact with scaling laws.

-> Mixture of Experts MoE as an alternative scaling strategy with different compute tradeoffs.

-> Long Context and Context Window How context length scaling interacts with overall model scaling.

-> Multimodal LLMs Scaling laws applied to multimodal model training.
