Pre-training Language Models
Pre-training is the foundational stage of LLM development, where models learn general language representations from massive text corpora. This tutorial covers the objectives, loss functions, and scaling laws that govern this process.
Definition: Pre-training
Pre-training is the process of training a language model on a large unlabeled text corpus using self-supervised learning objectives. The model learns to predict tokens from context, acquiring general knowledge about language structure, semantics, and world knowledge.
Language Modeling Objectives
Causal Language Modeling (CLM)
CLM is the objective used by GPT, LLaMA, and most modern LLMs. The model predicts the next token given all previous tokens.
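Causal Language Modeling
$$\mathcal{L}_{\text{CLM}}(\theta) = -\frac{1}{T}\sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})$$
Here,
- $x_t$ = Token at position t
- $T$ = Sequence length
- $\theta$ = Model parameters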
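As a minimal sketch of the CLM objective, the snippet below averages the negative log-probability that a model assigns to each actual next token. The probability tables are made-up stand-ins for a model's output, and `clm_loss` is a hypothetical helper name, not a library function.

```python
import math

def clm_loss(next_token_probs, tokens):
    """Mean negative log-likelihood of tokens[1:] under a causal model.

    next_token_probs[t] is a (hypothetical) model distribution over the
    token that follows tokens[:t + 1]; it never sees later tokens.
    """
    total = 0.0
    num_predictions = len(tokens) - 1
    for t in range(num_predictions):
        target = tokens[t + 1]
        total -= math.log(next_token_probs[t][target])
    return total / num_predictions

tokens = ["the", "cat", "sat"]
probs = [
    {"cat": 0.5, "dog": 0.5},    # P(next | "the")
    {"sat": 0.25, "ran": 0.75},  # P(next | "the", "cat")
]
loss = clm_loss(probs, tokens)  # (ln 2 + ln 4) / 2 nats
```

Each position contributes one prediction, so a single sequence yields as many training signals as it has next tokens.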
Masked Language Modeling (MLM)
MLM is the objective used by BERT. The model predicts randomly masked tokens given the bidirectional context.
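Masked Language Modeling
$$\mathcal{L}_{\text{MLM}}(\theta) = -\frac{1}{|M|}\sum_{t \in M} \log P_\theta(x_t \mid x_{\setminus M})$$
Here,
- $M$ = Set of masked positions
- $x_{\setminus M}$ = Unmasked tokens (bidirectional context)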
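The masking step can be sketched as follows. This is a simplified illustration: `mask_tokens` is a hypothetical helper, the `[MASK]` string and masking rate are assumptions, and BERT's additional step of sometimes substituting random or unchanged tokens is omitted.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace a random subset of tokens with MASK.

    Returns the corrupted sequence and the list of masked positions.
    """
    rng = random.Random(seed)
    positions = [i for i in range(len(tokens)) if rng.random() < mask_prob]
    if not positions:
        # Guarantee at least one prediction target per sequence.
        positions = [rng.randrange(len(tokens))]
    corrupted = list(tokens)
    for i in positions:
        corrupted[i] = MASK
    return corrupted, positions

tokens = "the quick brown fox jumps over the lazy dog".split()
corrupted, positions = mask_tokens(tokens, mask_prob=0.3, seed=1)
# The loss is computed only at the masked positions, against the original tokens.
targets = {i: tokens[i] for i in positions}
```

Unlike CLM, only the masked positions produce a training signal, while the model may attend to context on both sides of each mask.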
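CLM is preferred for LLMs because (1) it directly enables autoregressive generation, (2) every position in a sequence contributes a training signal, so it scales naturally to large models and corpora, and (3) in-context learning has been observed to emerge in large models trained with this objective.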
Cross-Entropy Loss
The training objective for language models is typically the cross-entropy loss between the model's predicted distribution and the true token distribution.
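For a single position, the cross-entropy against a one-hot target reduces to the negative log-probability of the correct token. The sketch below computes it from raw logits with a numerically stable log-softmax; the function names and example logits are illustrative assumptions.

```python
import math

def log_softmax(logits):
    """Stable log-softmax: subtract the max before exponentiating."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def cross_entropy(logits, target_index):
    """Cross-entropy (in nats) between softmax(logits) and a one-hot target."""
    return -log_softmax(logits)[target_index]

# A model with no preference over 4 tokens pays ln(4) nats.
uniform = cross_entropy([0.0, 0.0, 0.0, 0.0], target_index=2)
# A model that strongly favors the correct token pays almost nothing.
confident = cross_entropy([0.0, 0.0, 10.0, 0.0], target_index=2)
```

Averaging this quantity over all predicted positions gives the training loss reported during pre-training.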
Perplexity
Perplexity is the standard evaluation metric for language models, measuring how well the model predicts the test data.
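Perplexity
$$\text{PPL} = \exp(\mathcal{L}) = \exp\left(-\frac{1}{N}\sum_{t=1}^{N} \log P_\theta(x_t \mid x_{<t})\right)$$
Here,
- $\text{PPL}$ = Perplexity
- $\mathcal{L}$ = Cross-entropy loss (mean negative log-likelihood per token, in nats)
- $N$ = Test sequence length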
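As a small worked example, perplexity can be computed directly from the probabilities a model assigned to the observed test tokens. The probabilities below are invented for illustration, and `perplexity` is a hypothetical helper.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability of the observed tokens."""
    n = len(token_probs)
    mean_nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(mean_nll)

# Probabilities the (hypothetical) model gave to each correct token.
ppl = perplexity([0.5, 0.25, 0.125, 0.25])  # 4.0, up to floating-point rounding
```

Equivalently, perplexity is the inverse geometric mean of the per-token probabilities, which is why a single very unlikely token can raise it sharply.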
Intuitively, perplexity represents the average branching factor: the number of equally likely next tokens the model considers at each position.
Perplexity as Branching Factor
$$\text{PPL} = 2^{H}$$
Here,
- $H$ = Entropy of the model in bits per token (the cross-entropy in nats divided by $\ln 2$)
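A perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 candidates. Perplexity depends on the tokenizer and the test set, so values are comparable only under matched conditions. GPT-3, for example, reported a zero-shot perplexity of about 20 on Penn Treebank; no comparable figure has been published for GPT-4, so the ~10-15 range sometimes quoted for it should be treated as a rough estimate.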
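The relationship between loss in nats, entropy in bits, and perplexity can be checked numerically. This is a minimal sketch; `nats_to_bits` is a hypothetical helper, and the loss value is chosen so that the perplexity comes out to 10.

```python
import math

def nats_to_bits(loss_nats):
    """Convert a per-token loss from nats (natural log) to bits (log base 2)."""
    return loss_nats / math.log(2)

loss_nats = math.log(10)          # loss of a uniform guess among 10 tokens
bits = nats_to_bits(loss_nats)    # about 3.32 bits per token

ppl_from_nats = math.exp(loss_nats)  # exp(L) with L in nats
ppl_from_bits = 2 ** bits            # 2^H with H in bits
```

Both routes give the same perplexity, so the choice of logarithm base only changes the unit of the loss, not the metric.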
Training Data
The quality and diversity of training data are critical for LLM performance.
Data Sources
- Web crawls: Common Crawl, C4, RefinedWeb
- Books: Books3, Gutenberg
- Code: GitHub, StackOverflow
- Academic: arXiv, S2ORC
- Wikipedia: Multiple languages
Data Quality Pipeline
- Deduplication: Remove duplicate documents (exact and fuzzy)
- Filtering: Remove low-quality, toxic, or PII content
- Re-weighting: Adjust domain proportions based on quality
- Tokenization: Convert to token sequences
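The first two pipeline stages can be sketched with the standard library alone. Everything here is an illustrative assumption: the helper names, the 3-word shingles, the 0.7 similarity threshold, and the minimum-length filter. Production pipelines typically replace the pairwise comparison with a scalable approximation such as MinHash.

```python
import hashlib
import re

def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def exact_dedup(docs):
    """Keep the first document for each normalized-content hash."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def shingles(text, n=3):
    """Set of overlapping n-word tuples used to compare documents."""
    words = normalize(text).split()
    return {tuple(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def jaccard(a, b):
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

def fuzzy_dedup(docs, threshold=0.7):
    """Drop documents too similar to one already kept (O(n^2) comparison)."""
    kept = []
    for doc in docs:
        if all(jaccard(doc, k) < threshold for k in kept):
            kept.append(doc)
    return kept

def quality_filter(docs, min_words=5):
    """Toy quality heuristic: discard very short documents."""
    return [d for d in docs if len(normalize(d).split()) >= min_words]

docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the  quick brown fox jumps over the lazy dog.",       # exact duplicate after normalization
    "The quick brown fox jumps over the lazy dog today.",  # near duplicate
    "Buy now!!!",                                          # too short
    "Pre-training uses large unlabeled text corpora for self-supervised learning.",
]
cleaned = quality_filter(fuzzy_dedup(exact_dedup(docs)))
```

Running exact deduplication first keeps the more expensive fuzzy pass small, which is a common ordering choice rather than a requirement.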
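Data Mixing Proportions
$$p_i = \frac{w_i \, n_i}{\sum_j w_j \, n_j}$$
Here,
- $p_i$ = Sampling probability of domain i
- $w_i$ = Weight for domain i
- $n_i$ = Size of domain i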
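One plausible reading of domain re-weighting, assumed here rather than taken from a specific paper, is that each domain is sampled with probability proportional to its weight times its size. The domain names, sizes, and weights below are hypothetical.

```python
def mixing_proportions(sizes, weights):
    """Sampling probability per domain, proportional to weight * size (assumed form)."""
    scores = {domain: weights[domain] * sizes[domain] for domain in sizes}
    total = sum(scores.values())
    return {domain: score / total for domain, score in scores.items()}

sizes = {"web": 800, "code": 150, "books": 50}     # hypothetical sizes, billions of tokens
weights = {"web": 0.5, "code": 1.0, "books": 2.0}  # hypothetical quality weights
probs = mixing_proportions(sizes, weights)
```

With these numbers, books make up 5% of the raw data but a little over 15% of the sampled stream, which illustrates how a weight above 1 upsamples a small, high-quality domain.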
Chinchilla Scaling Laws
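The Chinchilla paper (Hoffmann et al., 2022) established compute-optimal scaling relationships between model size and data size. For a compute budget $C$, the optimal parameter count $N_{\text{opt}}$ and training token count $D_{\text{opt}}$ both grow roughly as the square root of compute: $N_{\text{opt}} \propto C^{0.5}$ and $D_{\text{opt}} \propto C^{0.5}$.
This implies that, for optimal performance, model size and data size should scale in equal proportion as compute grows. This challenged the prior paradigm of training very large models on comparatively little data.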
Chinchilla vs GPT-3
| Model | Parameters | Training tokens | Tokens/Param | Training FLOPs (approx.) |
|---|---|---|---|---|
| GPT-3 | 175B | 300B | 1.7 | 3.1e23 |
| Chinchilla | 70B | 1.4T | 20 | 5.8e23 |
Chinchilla achieves better performance with fewer parameters but more data, demonstrating the importance of data scaling.
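A back-of-the-envelope version of this trade-off can be sketched with two common approximations, both assumptions here: training compute is about 6 FLOPs per parameter per token (C ≈ 6ND), and the compute-optimal ratio is about 20 tokens per parameter, as in the table. The function name and the example budget are illustrative.

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Solve C = 6 * N * D with D = tokens_per_param * N for N and D."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Hypothetical budget of 5.76e23 FLOPs.
n_params, n_tokens = chinchilla_optimal(5.76e23)
# n_params is roughly 6.9e10 (about 69B parameters),
# n_tokens is roughly 1.4e12 (about 1.4T tokens).
```

The result lands close to the Chinchilla row above, which is expected since the 20-tokens-per-parameter rule of thumb is derived from that configuration.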
For a detailed treatment of scaling laws, see our module on Scaling Laws and Chinchilla.
Curriculum Learning
Curriculum learning involves presenting training data in a structured order, from easy to hard examples.
Definition: Curriculum Learning
Curriculum learning is a training strategy where data is presented in order of increasing difficulty. For LLMs, this can mean: (1) starting with shorter sequences, (2) gradually increasing data complexity, or (3) focusing on higher-quality data later in training.
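Option (1), starting with shorter sequences, can be sketched as a schedule that maps the training step to a maximum sequence length. The linear ramp, the length bounds, and the rounding to multiples of 64 are all assumptions chosen for illustration, and `curriculum_seq_len` is a hypothetical helper.

```python
def curriculum_seq_len(step, total_steps, min_len=256, max_len=2048, ramp_fraction=0.5):
    """Linearly grow the sequence length over the first part of training."""
    ramp_steps = max(1, int(total_steps * ramp_fraction))
    progress = min(1.0, step / ramp_steps)
    length = min_len + progress * (max_len - min_len)
    # Round down to a multiple of 64 to keep tensor shapes hardware friendly.
    return int(length // 64 * 64)

schedule = [curriculum_seq_len(s, total_steps=1000) for s in (0, 250, 500, 1000)]
# schedule == [256, 1152, 2048, 2048]
```

A data loader would call this once per step (or per phase) and truncate or pack examples to the returned length.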
Practical Example: Pre-training with HuggingFace
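```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Gated checkpoint: requires accepting the Llama 2 license on the Hugging Face Hub.
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# The Llama tokenizer defines no pad token; the collator needs one to pad batches.
tokenizer.pad_token = tokenizer.eos_token

# from_pretrained loads existing weights, so this run continues pre-training.
# To start from random initialization, build the model from its config instead:
# AutoModelForCausalLM.from_config(AutoConfig.from_pretrained(model_name))
model = AutoModelForCausalLM.from_pretrained(model_name)

dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
# WikiText contains many blank lines; drop them so they do not become empty examples.
dataset = dataset.filter(lambda example: example["text"].strip() != "")

def tokenize(examples):
    # Truncate each line to the context length; no sequence packing is done here.
    return tokenizer(examples["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

training_args = TrainingArguments(
    output_dir="./llama2-pretrained",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # effective batch of 32 sequences per device
    learning_rate=2e-4,
    warmup_steps=1000,
    max_steps=100000,
    fp16=True,
    logging_steps=100,
    save_steps=10000,
    optim="adamw_torch",
    weight_decay=0.1,
    lr_scheduler_type="cosine",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    # mlm=False selects the causal LM objective: labels are the input tokens.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()
```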
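To put the hyperparameters above in perspective, the token budget they imply can be estimated with simple arithmetic. The single-device setting is an assumption, and the result is an upper bound because lines shorter than `max_length` are not packed together.

```python
per_device_batch = 4   # per_device_train_batch_size
grad_accum = 8         # gradient_accumulation_steps
max_length = 2048      # truncation length used in tokenize()
max_steps = 100_000
num_devices = 1        # assumption: single-GPU run

sequences_per_step = per_device_batch * grad_accum * num_devices
tokens_per_step = sequences_per_step * max_length   # at most 65,536 tokens
total_tokens = tokens_per_step * max_steps          # at most about 6.6e9 tokens
```

At most about 6.6 billion tokens is far below the roughly 140 billion that a 20-tokens-per-parameter rule would suggest for a 7B model, which is one reason this configuration is better viewed as a demonstration than as a full pre-training run.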
Pre-training from scratch requires massive compute resources (typically thousands of GPUs for months). For most practitioners, fine-tuning a pre-trained model is more practical. See our modules on Fine-tuning and LoRA.
Practice Exercises
- Mathematical: Calculate the perplexity of a model with cross-entropy loss of 3.2 nats. How does this compare to a model with loss of 2.8 nats?
- Analysis: Given a compute budget of 1e24 FLOPs, what is the Chinchilla-optimal model size and training data size? How does this compare to GPT-3's configuration?
- Implementation: Implement a simple CLM training loop from scratch using PyTorch. Train a small model on a text file and track perplexity over training.
- Research: Compare the data mixing proportions used in LLaMA 2, Mistral, and Qwen. How do their domain weights differ?
Key Takeaways:
- Causal Language Modeling (CLM) is the dominant pre-training objective for LLMs
- Cross-entropy loss measures the difference between predicted and true token distributions
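- Perplexity (the exponential of the cross-entropy loss in nats) measures model uncertainty as the effective number of equally likely next tokens
- Chinchilla scaling laws show that model size and training tokens should grow in roughly equal proportion as compute increases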
- Data quality and deduplication are as important as data quantity
- Curriculum learning can improve training efficiency