Pre-training Language Models: Learning Language from the Internet

Pre-training is the foundational stage of LLM development, where models learn general language representations from massive text corpora. This guide covers training objectives, loss functions, and the scaling laws that govern this process.

CLM vs MLM — Causal Language Modeling dominates modern LLM training
Scaling Laws — Model size and data should scale equally with compute
Data Quality — Deduplication and curation matter as much as quantity

Scale is not just a feature—it is the strategy.

Language Modeling Objectives

Causal Language Modeling (CLM)

CLM is the objective used by GPT, LLaMA, and most modern LLMs. The model predicts the next token given all previous tokens.
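
Formally, CLM factorizes the probability of a sequence as P(x_1, ..., x_T) = product over t of P(x_t | x_1, ..., x_{t-1}), and training minimizes the negative log of that product. In practice the inputs and targets are the same sequence offset by one position. The sketch below illustrates that offset on a toy sequence; the function name `make_clm_pairs` and the token IDs are made up for illustration.

```python
from typing import List, Tuple


def make_clm_pairs(token_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Split one sequence into next-token inputs and targets.

    inputs[t] is what the model sees at position t (together with all
    earlier positions); targets[t] is the token it must predict there.
    """
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form a prediction target")
    inputs = token_ids[:-1]   # every token except the last
    targets = token_ids[1:]   # the same sequence shifted left by one
    return inputs, targets


# Toy token IDs (illustrative only, not from a real tokenizer).
inputs, targets = make_clm_pairs([11, 42, 7, 99])
for t, target in enumerate(targets):
    context = inputs[: t + 1]  # causal: only tokens at or before position t
    print(f"context={context} -> predict {target}")
```

Hugging Face causal LM models apply this shift internally when `labels` is passed, which is why training code usually supplies the unshifted sequence as both input and label.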

Masked Language Modeling (MLM)

MLM is the objective used by BERT. The model predicts randomly masked tokens given the bidirectional context.
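
BERT selects about 15% of positions for prediction. The sketch below shows a simplified version of that corruption step: every selected token is replaced by a mask ID, whereas the original BERT recipe replaces only 80% of selected tokens with the mask and leaves or randomizes the rest. The function name `mask_tokens`, the mask ID 103, and the toy token IDs are assumptions for illustration.

```python
import random
from typing import List, Tuple

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy skips by default


def mask_tokens(
    token_ids: List[int],
    mask_token_id: int,
    mask_prob: float = 0.15,
    seed: int = 0,
) -> Tuple[List[int], List[int]]:
    """Return (corrupted inputs, labels) for masked language modeling.

    Labels hold the original token at masked positions and IGNORE_INDEX
    elsewhere, so the loss is computed only where a token was hidden.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in token_ids:
        if rng.random() < mask_prob:
            inputs.append(mask_token_id)  # hide the token from the model
            labels.append(tok)            # but remember it as the target
        else:
            inputs.append(tok)
            labels.append(IGNORE_INDEX)
    return inputs, labels


# Toy IDs; 103 stands in for a [MASK] token ID and is an assumption here.
inputs, labels = mask_tokens(list(range(1000, 1020)), mask_token_id=103)
```

Unlike the causal objective, the model sees tokens on both sides of each masked position, which suits encoders used for classification and retrieval rather than generation.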

Cross-Entropy Loss

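The training objective for language models is typically the cross-entropy loss between the model's predicted distribution and the true token distribution. Because the true distribution puts all of its mass on the token that actually occurs, the loss reduces to the average negative log-probability the model assigns to the correct token.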
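
Written out, the loss over T positions is L = -(1/T) * sum over t of log p(x_t | x_1, ..., x_{t-1}). The following is a minimal pure-Python sketch of that computation from raw logits, intended to show the arithmetic rather than replace a framework implementation; the function names and the toy logits are illustrative.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cross_entropy(logits_per_position: List[List[float]], targets: List[int]) -> float:
    """Mean negative log-probability (in nats) of the correct tokens."""
    if not targets or len(logits_per_position) != len(targets):
        raise ValueError("need one target per position and at least one position")
    total = 0.0
    for logits, target in zip(logits_per_position, targets):
        total -= log_softmax(logits)[target]
    return total / len(targets)


# Toy example with a 3-token vocabulary and two positions.
loss = cross_entropy([[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]], targets=[0, 2])
print(f"mean loss: {loss:.4f} nats")
```

A model that outputs identical logits for every token is uniform over the vocabulary, so its loss is log(V) nats; this is a useful sanity check for the loss at the start of training.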

Perplexity

Perplexity is the standard evaluation metric for language models. It is the exponential of the average per-token cross-entropy on held-out text, so lower perplexity means the model predicts the test data better.

Intuitively, perplexity is the average branching factor: the number of equally likely next tokens the model is effectively choosing among at each position.
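
The branching-factor reading can be checked directly. The sketch below assumes the loss is measured in nats (natural log); for a loss in bits the base would be 2 instead of e. The helper names are illustrative.

```python
import math


def perplexity(mean_cross_entropy_nats: float) -> float:
    """Perplexity is exp of the mean per-token cross-entropy in nats."""
    return math.exp(mean_cross_entropy_nats)


def uniform_loss(vocab_size: int) -> float:
    """Cross-entropy of a model that spreads probability evenly over vocab_size tokens."""
    return math.log(vocab_size)


# A model that is always torn between 8 equally likely tokens has perplexity 8.
ppl_uniform = perplexity(uniform_loss(8))
```

Because the relationship is exponential, a small drop in loss corresponds to a proportionally larger drop in perplexity.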

Training Data

The quality and diversity of training data are critical for LLM performance.

Data Sources

Web crawls: Common Crawl, C4, RefinedWeb
Books: Books3, Gutenberg
Code: GitHub, StackOverflow
Academic: arXiv, S2ORC
Wikipedia: Multiple languages

Data Quality Pipeline

Deduplication: Remove duplicate documents (exact and fuzzy)
Filtering: Remove low-quality, toxic, or PII content
Re-weighting: Adjust domain proportions based on quality
Tokenization: Convert to token sequences
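
The deduplication and filtering steps above can be sketched in a few lines. This is a toy version under stated assumptions: exact duplicates are found by hashing normalized text, fuzzy similarity is measured with Jaccard overlap of word 3-grams, and the quality filter is a hypothetical minimum-length heuristic. Large-scale pipelines commonly approximate the Jaccard step with MinHash rather than comparing documents pairwise.

```python
import hashlib
import re
from typing import Iterable, List, Set


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.strip().lower())


def exact_dedup(docs: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each document, compared after normalization."""
    seen: Set[str] = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def shingles(text: str, n: int = 3) -> Set[tuple]:
    """Set of word n-grams used as the fingerprint for fuzzy matching."""
    words = normalize(text).split()
    return {tuple(words[i : i + n]) for i in range(max(len(words) - n + 1, 1))}


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of two documents' shingle sets (1.0 means identical)."""
    sa, sb = shingles(a, n), shingles(b, n)
    return len(sa & sb) / len(sa | sb)


def passes_filter(doc: str, min_words: int = 5) -> bool:
    """Hypothetical quality heuristic: drop very short documents."""
    return len(doc.split()) >= min_words


docs = [
    "The cat sat on the mat.",
    "the cat  sat on the mat.",      # exact duplicate after normalization
    "Too short",                     # fails the length filter
    "The cat sat on the red mat.",   # near duplicate of the first document
]
unique = exact_dedup(docs)
clean = [d for d in unique if passes_filter(d)]
similarity = jaccard(docs[0], docs[3])
```

A fuzzy-deduplication pass would then drop one document from every pair whose similarity exceeds a chosen threshold; the threshold is a tuning decision and is not specified here.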

Chinchilla Scaling Laws

The Chinchilla paper (Hoffmann et al., 2022) established optimal scaling relationships between model size and data size.

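Hoffmann et al. found that the compute-optimal parameter count N and token count D both grow approximately as the square root of the compute budget C. This implies that for optimal performance, model size and data size should scale equally with compute: doubling the model should be matched by doubling the training tokens, which works out to roughly 20 tokens per parameter. This challenged the prior paradigm of training very large models on insufficient data.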
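
A back-of-the-envelope version of this allocation can be sketched as follows. It rests on two approximations that are assumptions of the sketch rather than exact results: training compute C is taken to be about 6 * N * D FLOPs, and the optimal ratio is taken to be about 20 tokens per parameter. The function name `compute_optimal` is illustrative.

```python
import math
from typing import Tuple

FLOPS_PER_PARAM_TOKEN = 6.0   # common approximation: C is about 6 * N * D
TOKENS_PER_PARAM = 20.0       # rule of thumb associated with the Chinchilla results


def compute_optimal(flops: float) -> Tuple[float, float]:
    """Return (parameters N, training tokens D) for a FLOP budget.

    Solves C = 6 * N * D with D = 20 * N, giving N = sqrt(C / 120).
    Both N and D therefore grow as the square root of compute.
    """
    n_params = math.sqrt(flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


# Under C = 6ND, Chinchilla's own configuration corresponds to about 5.88e23 FLOPs.
n, d = compute_optimal(5.88e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```

With that budget the sketch recovers about 70B parameters and 1.4T tokens, and quadrupling the budget doubles both quantities.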

Chinchilla vs GPT-3

Model	Parameters	Tokens	Tokens/Param	FLOPs
GPT-3	175B	300B	1.7	3.1e23
Chinchilla	70B	1.4T	20	5.8e23

Chinchilla outperforms GPT-3 with fewer than half the parameters, having been trained on more than four times as many tokens. The comparison shows that scaling data matters as much as scaling parameter count.
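
The derived columns of the comparison can be estimated from parameters and tokens alone. The sketch below assumes the common approximation of 6 FLOPs per parameter per token; published FLOP counts use slightly different accounting, so the estimates may differ from reported figures by a few percent.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute as 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


# (parameters, training tokens) taken from the comparison table.
models = {
    "GPT-3": (175e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}
rows = {
    name: (tokens / params, training_flops(params, tokens))
    for name, (params, tokens) in models.items()
}
for name, (ratio, flops) in rows.items():
    print(f"{name}: {ratio:.1f} tokens/param, {flops:.2e} FLOPs")
```

Under this estimate Chinchilla used less than twice GPT-3's compute while seeing over four times as many tokens.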

Curriculum Learning

Curriculum learning involves presenting training data in a structured order, from easy to hard examples.
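
As a minimal sketch of the idea, the ordering can be produced by sorting examples with a difficulty score. The score used here (word count) is a hypothetical proxy chosen only for illustration; real curricula may rely on signals such as model loss or source quality, and the function names are made up.

```python
from typing import Callable, List, Sequence


def curriculum_order(
    examples: Sequence[str],
    difficulty: Callable[[str], float],
) -> List[str]:
    """Return examples sorted from lowest to highest difficulty score."""
    return sorted(examples, key=difficulty)


def length_difficulty(text: str) -> float:
    """Hypothetical proxy: longer texts are treated as harder."""
    return float(len(text.split()))


corpus = [
    "Transformers stack attention and feed-forward layers with residual connections.",
    "The cat sat.",
    "Models learn from text.",
]
ordered = curriculum_order(corpus, length_difficulty)
```

Passing the scoring function as an argument keeps the ordering logic independent of whichever difficulty signal is chosen.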

Practical Example: Pre-training with Hugging Face

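from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Gated checkpoint: requires accepting the Llama 2 license on the Hugging Face Hub.
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# The Llama tokenizer defines no pad token, which the collator needs for batching.
tokenizer.pad_token = tokenizer.eos_token

# from_pretrained loads trained weights, so this run continues pre-training.
# To start from random weights, use AutoModelForCausalLM.from_config(...) instead.
model = AutoModelForCausalLM.from_pretrained(model_name)

dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
# WikiText contains many blank lines; drop them so no training example is empty.
dataset = dataset.filter(lambda example: example["text"].strip() != "")

def tokenize(examples):
    # Truncate each line to the context length; no sequence packing is done here.
    return tokenizer(examples["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

training_args = TrainingArguments(
    output_dir="./llama2-pretrained",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # effective batch of 32 sequences per device
    learning_rate=2e-4,
    warmup_steps=1000,
    max_steps=100000,
    fp16=True,  # mixed precision; bf16=True is an alternative on GPUs that support it
    logging_steps=100,
    save_steps=10000,
    optim="adamw_torch",
    weight_decay=0.1,
    lr_scheduler_type="cosine",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    # mlm=False selects the causal objective: labels are a copy of input_ids.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()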
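
It helps to relate this configuration to the scaling discussion by estimating how many tokens it can process. The sketch below computes an upper bound from the batch settings above, assuming a single device and that every sequence fills the 2048-token limit (neither is stated in the example; most WikiText lines are far shorter). The function name `tokens_per_step` is illustrative.

```python
def tokens_per_step(per_device_batch: int, grad_accum: int, num_devices: int, seq_len: int) -> int:
    """Upper bound on tokens consumed per optimizer step (assumes full-length sequences)."""
    return per_device_batch * grad_accum * num_devices * seq_len


# Batch settings from the TrainingArguments above; num_devices=1 is an assumption.
step_tokens = tokens_per_step(per_device_batch=4, grad_accum=8, num_devices=1, seq_len=2048)
total_tokens = step_tokens * 100_000  # max_steps
print(f"{step_tokens:,} tokens/step, {total_tokens:.2e} tokens total")
```

Even this upper bound of roughly 6.6 billion tokens is far below the approximately 140 billion tokens that 20 tokens per parameter would suggest for a 7B model, so the example demonstrates the training mechanics rather than a compute-optimal run.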

Practice Exercises

Mathematical: Calculate the perplexity of a model with cross-entropy loss of 3.2 nats. How does this compare to a model with loss of 2.8 nats?
Analysis: Given a compute budget of 1e24 FLOPs, what is the Chinchilla-optimal model size and training data size? How does this compare to GPT-3's configuration?
Implementation: Implement a simple CLM training loop from scratch using PyTorch. Train a small model on a text file and track perplexity over training.
Research: Compare the data mixing proportions used in LLaMA 2, Mistral, and Qwen. How do their domain weights differ?

What to Learn Next

-> Fine-Tuning LLMs: Customizing language models for your specific tasks and domains.

-> LoRA and PEFT: Efficient fine-tuning without full retraining using low-rank adaptation.

-> QLoRA and Quantization: Running LLMs on consumer hardware with INT4 quantization.

-> RLHF and Alignment: Making LLMs safe and helpful through reinforcement learning from human feedback.

-> Constitutional AI: Reducing dependence on human annotation through AI self-alignment.

-> Scaling Laws and Chinchilla: Understanding the mathematical relationships governing model performance.
