Scaling Laws & Chinchilla

Scaling laws describe the relationship between model performance and key factors like model size, dataset size, and compute budget. Understanding these laws is crucial for making informed decisions about LLM training.

Why Scaling Laws Matter

Training large language models requires enormous resources. Scaling laws help us:

  • Predict performance before committing resources
  • Allocate compute budget optimally between model size and data
  • Compare training runs across different scales
  • Plan training for new models
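As a rough illustration of the first point, the sketch below fits a power law to a handful of small runs and extrapolates to a larger model. The data is synthetic: it is generated from a made-up power law (`TRUE_SCALE` and `TRUE_ALPHA` are hypothetical values, not measurements), and the helper names `fit_power_law` and `predict_loss` are assumptions for this example only.

```python
import math

def fit_power_law(sizes, losses):
    """Fit loss = exp(intercept) * size**(-alpha) by least squares in log-log space."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_mean - slope * x_mean
    return -slope, intercept  # alpha is the negated log-log slope

def predict_loss(size, alpha, intercept):
    """Evaluate the fitted power law at a new model size."""
    return math.exp(intercept - alpha * math.log(size))

# Hypothetical ground truth used only to generate synthetic "small run" losses
TRUE_SCALE, TRUE_ALPHA = 1e13, 0.08
small_sizes = [1e6, 3e6, 1e7, 3e7, 1e8]
small_losses = [(TRUE_SCALE / n) ** TRUE_ALPHA for n in small_sizes]

alpha, intercept = fit_power_law(small_sizes, small_losses)
predicted = predict_loss(1e9, alpha, intercept)
actual = (TRUE_SCALE / 1e9) ** TRUE_ALPHA

print(f"Fitted exponent: {alpha:.4f}")
print(f"Extrapolated loss at 1B params: {predicted:.4f} (generating law: {actual:.4f})")
```

Real training runs are noisy, so an extrapolation from real data carries far more uncertainty than this noise-free sketch suggests.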

Kaplan Scaling Laws

OpenAI's Kaplan et al. (2020) established the first comprehensive scaling laws for language models.

They observed that language model performance, measured as cross-entropy loss, follows a power law in model size (N), dataset size (D), and compute budget (C), as long as the other two factors are not the bottleneck.

L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}
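To get a feel for how slowly a power law with a small exponent improves, the sketch below computes how much the loss changes when the model grows, using the exponent α_N = 0.076 reported by Kaplan et al. The function names are assumptions made for this example.

```python
ALPHA_N = 0.076  # Kaplan et al. exponent for model size

def loss_multiplier(scale_factor, alpha=ALPHA_N):
    """Factor by which L(N) changes when N is multiplied by scale_factor."""
    return scale_factor ** (-alpha)

def scale_factor_for(target_multiplier, alpha=ALPHA_N):
    """Growth in N needed to multiply the loss by target_multiplier."""
    return target_multiplier ** (-1.0 / alpha)

print(f"Doubling N multiplies the loss by {loss_multiplier(2):.4f}")
print(f"10x larger N multiplies the loss by {loss_multiplier(10):.4f}")
print(f"A 10% lower loss needs {scale_factor_for(0.9):.1f}x more parameters")
```

Because N_c cancels out of the ratio, only the exponent matters for relative changes like these.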

Kaplan Data Scaling

L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}

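Here,

  • L(D) = cross-entropy loss when dataset size is the limiting factor
  • D = dataset size in tokens
  • D_c = fitted constant (5.4e13 tokens in the code below)
  • α_D = fitted exponent (0.095 in the code below)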

Kaplan Compute Scaling

L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}

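Here,

  • L(C) = cross-entropy loss when compute is the limiting factor
  • C = training compute, measured in PF-days in Kaplan et al.
  • C_c = fitted constant (3.1e8 PF-days in the code below)
  • α_C = fitted exponent (0.050 in the code below)

import numpy as np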
import matplotlib.pyplot as plt

def kaplan_scaling_law(N, N_c=8.8e13, alpha_N=0.076):
    """Compute predicted loss using Kaplan scaling law."""
    return (N_c / N) ** alpha_N

def kaplan_data_scaling(D, D_c=5.4e13, alpha_D=0.095):
    """Compute predicted loss using data scaling law."""
    return (D_c / D) ** alpha_D

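def kaplan_compute_scaling(C, C_c=3.1e8, alpha_C=0.050):
    """Compute predicted loss using compute scaling law.

    C is given in FLOPs. Kaplan et al. report C_c in PF-days, so C is
    converted first (1 PF-day = 1e15 FLOP/s * 86,400 s = 8.64e19 FLOPs).
    """
    C_pf_days = C / 8.64e19
    return (C_c / C_pf_days) ** alpha_C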

# Example predictions
model_sizes = np.logspace(6, 11, 100)  # 1M to 100B parameters
losses = [kaplan_scaling_law(N) for N in model_sizes]

print(f"1B parameter model predicted loss: {kaplan_scaling_law(1e9):.3f}")
print(f"10B parameter model predicted loss: {kaplan_scaling_law(1e10):.3f}")
print(f"100B parameter model predicted loss: {kaplan_scaling_law(1e11):.3f}")

Kaplan's original work suggested that model size mattered more than dataset size, which led to very large models trained on relatively little data. Chinchilla later corrected this conclusion.

Chinchilla Optimal Training

DeepMind's Chinchilla paper (Hoffmann et al., 2022) revised the scaling laws, showing that model size and training tokens should be scaled in equal proportion as the compute budget grows.

D_{\text{optimal}} \approx 20 \times N

For a given compute budget C, the optimal configuration allocates roughly equal resources to model size and data size. Specifically, the optimal number of tokens is approximately 20 times the number of parameters.

Why Chinchilla Matters

Before Chinchilla:

  • GPT-3: 175B parameters, trained on 300B tokens (ratio: 1.7)
  • By the Chinchilla criterion, GPT-3 was heavily undertrained

After Chinchilla:

  • Chinchilla: 70B parameters, trained on 1.4T tokens (ratio: 20)
  • Achieved better performance than GPT-3 with fewer parameters
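The ratios quoted above can be checked with a few lines of arithmetic. The sketch below also estimates training compute with the common approximation C ≈ 6ND; the helper names are assumptions for this example, and the parameter and token counts are the ones listed above.

```python
def tokens_per_param(n_params, n_tokens):
    """Training tokens seen per model parameter."""
    return n_tokens / n_params

def training_flops(n_params, n_tokens):
    """Approximate training compute using C = 6 * N * D."""
    return 6 * n_params * n_tokens

models = {
    "GPT-3": (175e9, 300e9),        # (parameters, training tokens)
    "Chinchilla": (70e9, 1.4e12),
}

for name, (n_params, n_tokens) in models.items():
    ratio = tokens_per_param(n_params, n_tokens)
    flops = training_flops(n_params, n_tokens)
    print(f"{name}: {ratio:.1f} tokens/param, about {flops:.2e} training FLOPs")
```

Under this approximation Chinchilla used somewhat more training compute than GPT-3, but with 2.5x fewer parameters it is considerably cheaper to run at inference time.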

Chinchilla Compute Budget Allocation

C \approx 6ND

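Here,

  • C = total training compute in FLOPs
  • N = number of model parameters
  • D = number of training tokens
  • 6 = approximate FLOPs per parameter per token (forward plus backward pass)

def chinchilla_optimal_allocation(compute_budget_flops):
    """Compute optimal model size and data for given compute budget."""
    # From Chinchilla: C ≈ 6ND and D ≈ 20N
    # Substituting: C ≈ 6N(20N) = 120N²
    # Therefore: N = sqrt(C / 120)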
    N_optimal = np.sqrt(compute_budget_flops / 120)
    D_optimal = 20 * N_optimal

    return {
        "optimal_params": N_optimal,
        "optimal_tokens": D_optimal,
        "tokens_per_param": D_optimal / N_optimal,
        "model_size_billions": N_optimal / 1e9,
        "data_size_trillions": D_optimal / 1e12
    }

# Example compute budgets
compute_budgets = {
    "1e21 FLOPs (small)": 1e21,
    "1e23 FLOPs (medium)": 1e23,
    "1e25 FLOPs (large)": 1e25,
    "1e26 FLOPs (frontier scale)": 1e26
}

for name, budget in compute_budgets.items():
    allocation = chinchilla_optimal_allocation(budget)
    print(f"\n{name}:")
    print(f"  Optimal model: {allocation['model_size_billions']:.1f}B params")
    print(f"  Optimal data: {allocation['data_size_trillions']:.2f}T tokens")
    print(f"  Ratio: {allocation['tokens_per_param']:.1f} tokens/param")

Compute-Optimal Frontier

The compute-optimal frontier represents the best achievable loss for a given compute budget:

Compute-Optimal Loss

L^*(C) = \left(\frac{C_c^*}{C}\right)^{\alpha_C^*}

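Here,

  • L*(C) = lowest loss reachable with compute budget C
  • C = training compute budget
  • C_c* = fitted constant of the frontier (3.1e8 in the code below)
  • α_C* = fitted exponent of the frontier (0.034 in the code below)
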
PF_DAY_FLOPS = 8.64e19  # 1 PF-day = 1e15 FLOP/s * 86,400 s

def frontier_loss(C, C_c, alpha_C):
    """Power-law loss at compute C (in FLOPs), with C_c expressed in PF-days."""
    return (C_c / (C / PF_DAY_FLOPS)) ** alpha_C

def compute_optimal_frontier():
    """Generate compute-optimal frontier data."""
    compute_range = np.logspace(17, 27, 100)  # FLOPs

    # Kaplan frontier (C_c = 3.1e8 PF-days, alpha_C = 0.050)
    kaplan_loss = [frontier_loss(C, 3.1e8, 0.050) for C in compute_range]

    # Chinchilla-style frontier; the constants are this lesson's illustrative values
    chinchilla_loss = [frontier_loss(C, 3.1e8, 0.034) for C in compute_range]

    return compute_range, kaplan_loss, chinchilla_loss

# Compare frontiers
C_range, L_kaplan, L_chinchilla = compute_optimal_frontier()

# At 1e24 FLOPs
C_target = 1e24
L_k = frontier_loss(C_target, 3.1e8, 0.050)
L_c = frontier_loss(C_target, 3.1e8, 0.034)
print("At 1e24 FLOPs:")
print(f"  Kaplan prediction: {L_k:.3f}")
print(f"  Chinchilla prediction: {L_c:.3f}")
print(f"  Improvement: {(L_k - L_c) / L_k * 100:.1f}%")

Practical Implications for Training

Training Budget Planning

class TrainingPlanner:
    def __init__(self, target_loss: float, available_flops: float):
        self.target_loss = target_loss
        self.available_flops = available_flops

    def plan_training(self):
        """Create training plan based on compute budget."""
        chinchilla = chinchilla_optimal_allocation(self.available_flops)

        # Check if target loss is achievable. The frontier constant 3.1e8 is in
        # PF-days, so convert the FLOP budget first (1 PF-day = 8.64e19 FLOPs).
        optimal_loss = (3.1e8 / (self.available_flops / 8.64e19)) ** 0.034

        if optimal_loss > self.target_loss:
            return {
                "feasible": False,
                "message": f"Target loss {self.target_loss:.3f} not achievable. "
                          f"Optimal loss: {optimal_loss:.3f}"
            }

        # Split the budget at the Chinchilla-optimal ratio
        N = chinchilla["optimal_params"]
        D = chinchilla["optimal_tokens"]

        return {
            "feasible": True,
            "model_size_b": N / 1e9,
            "data_tokens_t": D / 1e12,
            "training_tokens": int(D),
            "estimated_loss": optimal_loss,
            "tokens_per_param": D / N
        }

    def compare_approaches(self):
        """Compare different training approaches."""
        chinchilla = chinchilla_optimal_allocation(self.available_flops)

        approaches = {
            "Chinchilla Optimal": {
                "N": chinchilla["optimal_params"],
                "D": chinchilla["optimal_tokens"],
                "tokens_per_param": 20
            },
            "Overtrained (40 tok/param)": {
                "N": chinchilla["optimal_params"] / 2,
                "D": chinchilla["optimal_tokens"],
                "tokens_per_param": 40
            },
            "Large Model (10 tok/param)": {
                "N": chinchilla["optimal_params"] * 2,
                "D": chinchilla["optimal_tokens"],
                "tokens_per_param": 10
            }
        }

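        # N changes while D stays fixed, so the three variants spend different
        # amounts of compute (C = 6ND). The frontier formula gives the best loss
        # reachable at each C, so it is optimistic for the two non-optimal splits.
        results = {}
        for name, config in approaches.items():
            C = 6 * config["N"] * config["D"]
            # Frontier constant is in PF-days: 1 PF-day = 8.64e19 FLOPs
            loss = (3.1e8 / (C / 8.64e19)) ** 0.034
            results[name] = {
                "model_size": config["N"] / 1e9,
                "data_tokens": config["D"] / 1e12,
                "compute_flops": C,
                "predicted_loss": loss
            }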

        return results

Common Training Configurations

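| Configuration | Parameters | Training Tokens | Tokens per Parameter | Use Case |
| --- | --- | --- | --- | --- |
| Chinchilla Optimal | 70B | 1.4T | 20 | Research baseline |
| Overtrained Small | 7B | 2T | 286 | Production efficiency |
| Overtrained Medium | 13B | 2T | 154 | Balanced performance |
| Undertrained Large | 70B | 400B | 5.7 | Quick experimentation |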

In practice, many teams "overtrain" smaller models (using more tokens per parameter than Chinchilla optimal) because smaller models are much cheaper to serve. The inference cost savings often outweigh the training inefficiency.
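A back-of-the-envelope sketch of this trade-off follows. It assumes training costs about 6N FLOPs per token and inference about 2N FLOPs per generated or processed token (forward pass only); both are approximations. The serving volume `SERVED_TOKENS` is a hypothetical number, and the function names are made up for this example. The two configurations come from the table above and are not claimed to reach the same quality.

```python
def training_flops(n_params, n_train_tokens):
    """Approximate training compute: 6 FLOPs per parameter per token."""
    return 6 * n_params * n_train_tokens

def inference_flops(n_params, n_served_tokens):
    """Approximate inference compute: 2 FLOPs per parameter per token."""
    return 2 * n_params * n_served_tokens

def lifetime_flops(n_params, n_train_tokens, n_served_tokens):
    """Total compute over training plus serving."""
    return training_flops(n_params, n_train_tokens) + inference_flops(
        n_params, n_served_tokens
    )

SERVED_TOKENS = 1e13  # hypothetical lifetime serving volume

configs = {
    "Chinchilla Optimal (70B, 1.4T)": (70e9, 1.4e12),
    "Overtrained Small (7B, 2T)": (7e9, 2e12),
}

for name, (n_params, n_train_tokens) in configs.items():
    train = training_flops(n_params, n_train_tokens)
    serve = inference_flops(n_params, SERVED_TOKENS)
    print(f"{name}: train {train:.2e}, serve {serve:.2e}, total {train + serve:.2e} FLOPs")
```

Under these assumptions, inference compute overtakes training compute once the number of served tokens exceeds three times the number of training tokens, regardless of model size.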

Scaling Laws Beyond Chinchilla

Emergent Abilities

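Some capabilities have been reported to appear only beyond certain scales. The thresholds below are rough and depend on the model family, training data, and evaluation: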

emergent_abilities = {
    "chain_of_thought_reasoning": "Appears around 62B parameters",
    "few_shot_learning": "Significant improvement above 10B",
    "instruction_following": "Strong above 13B",
    "code_generation": "Noticeable above 6B",
    "multilingual_transfer": "Improves dramatically above 10B"
}

Inference-Optimal Scaling

Inference-Optimal Model Size

N_{\text{inference-optimal}} = N_{\text{Chinchilla}} \times \left(\frac{C_{\text{inference}}}{C_{\text{training}}}\right)^{1/3}

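Here,

  • N_inference-optimal = model size that accounts for inference cost
  • N_Chinchilla = Chinchilla-optimal model size for the training budget
  • C_inference = expected total inference compute
  • C_training = training compute budget
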
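def inference_optimal_size(
    training_budget_flops: float,
    expected_inference_tokens: float,
    cost_per_training_flop: float = 1.0,
    cost_per_inference_flop: float = 1.0
):
    """Find the model size that minimizes training + inference cost at a fixed loss.

    The target is the loss a Chinchilla-optimal run reaches with
    training_budget_flops, estimated with the parametric fit
    L(N, D) = E + A / N**alpha + B / D**beta from Hoffmann et al. (2022).
    """
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

    chinchilla = chinchilla_optimal_allocation(training_budget_flops)
    N_chinchilla = chinchilla["optimal_params"]
    D_chinchilla = chinchilla["optimal_tokens"]
    target_loss = E + A / N_chinchilla**alpha + B / D_chinchilla**beta

    # Search model sizes from 100x smaller to 10x larger than Chinchilla-optimal
    N_range = np.logspace(np.log10(N_chinchilla) - 2, np.log10(N_chinchilla) + 1, 300)
    total_costs = []

    for N in N_range:
        # Loss left for the data term after accounting for the model-size term
        gap = target_loss - E - A / N**alpha
        if gap <= 0:
            # Too small to reach the target loss with any amount of data
            total_costs.append(np.inf)
            continue

        # Training tokens needed to hit the target loss with this model size
        D = (B / gap) ** (1 / beta)
        train_cost = 6 * N * D * cost_per_training_flop

        # Inference only needs a forward pass: about 2N FLOPs per token
        infer_cost = 2 * N * expected_inference_tokens * cost_per_inference_flop

        total_costs.append(train_cost + infer_cost)

    optimal_idx = np.argmin(total_costs)
    return N_range[optimal_idx]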

Summary

  • Kaplan scaling laws describe power-law relationships between loss and model/data/compute
  • Chinchilla established that optimal training uses ~20 tokens per parameter
  • This lesson models compute-optimal loss as L*(C) = (C_c*/C)^α_C*, with α_C* = 0.034 in the code examples
  • Chinchilla corrects Kaplan's finding that model size is more important than data
  • In practice, many teams overtrain smaller models for inference efficiency
  • Training budget planning should account for both training and inference costs
  • Emergent abilities appear at certain scale thresholds

Practice Exercises

  1. Scaling Law Prediction: Use Kaplan and Chinchilla scaling laws to predict performance for 1B, 10B, and 100B parameter models.

  2. Budget Planning: Given a $100K compute budget, determine the optimal model size and training tokens.

  3. Overtraining Analysis: Compare Chinchilla-optimal vs overtrained configurations for a 7B model. What is the inference cost savings?

  4. Compute Allocation: You have 1e23 FLOPs. Should you train a 70B model on 240B tokens or a 14B model on 1.2T tokens?

  5. Emergent Abilities: Research which capabilities emerge at different scales. Create a scale-ability matrix.

