Pre-training Language Models


Pre-training is the foundational stage of LLM development, where models learn general language representations from massive text corpora. This tutorial covers the objectives, loss functions, and scaling laws that govern this process.

Definition: Pre-training

Pre-training is the process of training a language model on a large unlabeled text corpus using self-supervised learning objectives. The model learns to predict tokens from context, acquiring general knowledge of language structure and semantics along with facts about the world.

Language Modeling Objectives

Causal Language Modeling (CLM)

CLM is the objective used by GPT, LLaMA, and most modern LLMs. The model predicts the next token given all previous tokens.

Causal Language Modeling

$$\mathcal{L}_{\text{CLM}} = -\sum_{t=1}^{T} \log P(x_t \mid x_1, \ldots, x_{t-1}; \theta)$$

Here,

  • $x_t$ = Token at position $t$
  • $T$ = Sequence length
  • $\theta$ = Model parameters
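As a minimal sketch of the sum above, assume the model has already produced the probability it assigned to the correct token at each position. The function name `clm_loss` and the probability values are illustrative, not taken from any particular library or model.

```python
import math

def clm_loss(token_probs):
    """Negative log-likelihood summed over positions t = 1..T.

    token_probs[t] is P(x_t | x_1, ..., x_{t-1}; theta), i.e. the probability
    the model gave to the token that actually occurred at position t.
    """
    return -sum(math.log(p) for p in token_probs)

# Hypothetical probabilities for a sequence of T = 4 tokens.
probs = [0.5, 0.25, 0.8, 0.1]
loss = clm_loss(probs)
print(f"CLM loss: {loss:.4f} nats")
```

Because the log of a product is the sum of the logs, this loss equals the negative log-probability of the whole sequence under the model.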

Masked Language Modeling (MLM)

MLM is the objective used by BERT. The model predicts randomly masked tokens given the bidirectional context.

Masked Language Modeling

$$\mathcal{L}_{\text{MLM}} = -\sum_{t \in \mathcal{M}} \log P(x_t \mid x_{\setminus \mathcal{M}}; \theta)$$

Here,

  • $\mathcal{M}$ = Set of masked positions
  • $x_{\setminus \mathcal{M}}$ = Unmasked tokens (bidirectional context)
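The masking step that defines the set of masked positions can be sketched as follows. This is a simplified illustration: the 15% masking rate follows BERT's published recipe rather than anything stated in this lesson, and BERT's additional rule of sometimes keeping or randomly replacing a selected token is omitted. The names `mask_tokens` and `MASK` are hypothetical.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, targets) for one MLM training example.

    targets maps each masked position t to the original token x_t,
    which is what the model is trained to predict.
    """
    rng = random.Random(seed)
    masked = {i for i in range(len(tokens)) if rng.random() < mask_prob}
    if not masked:
        # Guarantee at least one prediction target per example.
        masked = {rng.randrange(len(tokens))}
    corrupted = [MASK if i in masked else tok for i, tok in enumerate(tokens)]
    targets = {i: tokens[i] for i in masked}
    return corrupted, targets

tokens = "the model predicts randomly masked tokens from context".split()
corrupted, targets = mask_tokens(tokens)
print(corrupted)
print(targets)
```

Only the positions in `targets` contribute to the MLM loss; every other position serves purely as context.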

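CLM is preferred for LLMs for three reasons: (1) it directly supports autoregressive generation, (2) it scales more naturally to large models, in part because every position is a prediction target whereas MLM learns only from the masked subset, and (3) in-context learning has been observed to emerge in models trained with the CLM objective.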

Cross-Entropy Loss

The training objective for language models is typically the cross-entropy loss between the model's predicted distribution and the observed next token (a one-hot target). Writing $z_v$ for the logit the model assigns to vocabulary item $v$ and $V$ for the vocabulary size, the per-token average loss is:

$$\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t}) = -\frac{1}{T} \sum_{t=1}^{T} \log \frac{\exp(z_{x_t})}{\sum_{v=1}^{V} \exp(z_v)}$$
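The right-hand side of the loss (a softmax over logits followed by a negative log) can be sketched in plain Python. The helper names and the toy logits are illustrative assumptions; real training code would use a framework's built-in, numerically stable implementation.

```python
import math

def log_softmax(logits):
    """Log of softmax, using the log-sum-exp trick for numerical stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def cross_entropy(logits_per_step, targets):
    """Average of -log P(x_t | x_<t) over T positions.

    logits_per_step[t] holds one logit per vocabulary item (length V);
    targets[t] is the index of the token that actually occurred.
    """
    total = 0.0
    for logits, target in zip(logits_per_step, targets):
        total -= log_softmax(logits)[target]
    return total / len(targets)

# Toy example with V = 3 and T = 2.
logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
targets = [0, 2]
print(f"cross-entropy: {cross_entropy(logits, targets):.4f} nats")
```

A useful sanity check is that uniform logits give a loss of exactly ln V, which is the loss of a model that has learned nothing.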

Perplexity

Perplexity is the standard evaluation metric for language models, measuring how well the model predicts the test data.

Perplexity

$$\text{PPL} = \exp(\mathcal{L}) = \exp\left(-\frac{1}{T} \sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})\right)$$

Here,

  • $\text{PPL}$ = Perplexity
  • $\mathcal{L}$ = Cross-entropy loss (in nats)
  • $T$ = Number of tokens in the test sequence

Intuitively, perplexity represents the average branching factor: the number of equally likely next tokens the model considers at each position.

Perplexity as Branching Factor

$$\text{PPL} = 2^{H} = 2^{-\frac{1}{T} \sum_t \log_2 P(x_t \mid x_{<t})}$$

Here,

  • $H$ = Cross-entropy of the model on the test data in bits per token, so $H = \mathcal{L} / \ln 2$ when $\mathcal{L}$ is measured in nats

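A perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 candidates. Reported values depend on the tokenizer and test set, so they are comparable only within the same benchmark. GPT-3 reports a zero-shot perplexity of about 20 on Penn Treebank; OpenAI has not published comparable figures for GPT-4, so the ~10-15 range sometimes quoted for it should be read as a rough estimate.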
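The two forms of perplexity (base e with the loss in nats, base 2 with entropy in bits) give the same number once the units are converted. A small sketch, with hypothetical function names:

```python
import math

def perplexity_from_nats(loss_nats):
    """PPL = exp(L) when the cross-entropy L is measured in nats."""
    return math.exp(loss_nats)

def perplexity_from_bits(entropy_bits):
    """PPL = 2^H when the cross-entropy H is measured in bits per token."""
    return 2.0 ** entropy_bits

# A model that is exactly as uncertain as a uniform choice among 10 tokens.
loss_nats = math.log(10)
entropy_bits = loss_nats / math.log(2)  # convert nats to bits

print(perplexity_from_nats(loss_nats))
print(perplexity_from_bits(entropy_bits))
```

Both calls return 10 up to floating-point rounding, which matches the branching-factor reading of perplexity.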

Training Data

The quality and diversity of training data are critical for LLM performance.

Data Sources

  • Web crawls: Common Crawl, C4, RefinedWeb
  • Books: Books3, Gutenberg
  • Code: GitHub, StackOverflow
  • Academic: arXiv, S2ORC
  • Wikipedia: Multiple languages

Data Quality Pipeline

  1. Deduplication: Remove duplicate documents (exact and fuzzy)
  2. Filtering: Remove low-quality, toxic, or PII content
  3. Re-weighting: Adjust domain proportions based on quality
  4. Tokenization: Convert to token sequences
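The pipeline stages can be sketched on a toy corpus. This is a deliberately simplified illustration: it covers exact deduplication only (fuzzy methods such as MinHash are omitted), uses a word-count threshold as a stand-in for a real quality, toxicity, or PII filter, skips re-weighting, and tokenizes by whitespace. All function names and thresholds are assumptions.

```python
import hashlib
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()

def deduplicate_exact(docs):
    """Keep the first occurrence of each document, compared after normalization."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def passes_quality_filter(doc, min_words=5):
    """Placeholder heuristic: drop very short documents."""
    return len(doc.split()) >= min_words

def tokenize(doc):
    """Placeholder tokenizer: split on whitespace."""
    return doc.split()

def run_pipeline(docs):
    docs = deduplicate_exact(docs)
    docs = [d for d in docs if passes_quality_filter(d)]
    return [tokenize(d) for d in docs]

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",  # duplicate after normalization
    "too short",
    "Language models learn from large text corpora.",
]
cleaned = run_pipeline(corpus)
print(len(cleaned), "documents kept")
```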

Data Mixing Proportions

$$P(\text{domain}_i) = \frac{w_i \cdot |D_i|}{\sum_j w_j \cdot |D_j|}$$

Here,

  • $w_i$ = Weight for domain $i$
  • $|D_i|$ = Size of domain $i$
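The mixing formula is a weighted normalization, which a few lines of Python make concrete. The domain names, sizes, and weights below are made-up values chosen only to show how up-weighting a small domain changes its sampling probability.

```python
def mixing_proportions(sizes, weights):
    """Sampling probability per domain: w_i * |D_i| / sum_j (w_j * |D_j|)."""
    weighted = {name: weights[name] * sizes[name] for name in sizes}
    total = sum(weighted.values())
    return {name: value / total for name, value in weighted.items()}

# Hypothetical domain sizes (billions of tokens) and quality weights.
sizes = {"web": 1000, "code": 200, "books": 100}
weights = {"web": 0.5, "code": 1.0, "books": 3.0}

proportions = mixing_proportions(sizes, weights)
for name, p in proportions.items():
    print(f"{name}: {p:.2f}")
```

In this toy setup, books make up under 8% of the raw tokens but 30% of the sampled training mix, because their weight is six times that of web text.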

Chinchilla Scaling Laws

The Chinchilla paper (Hoffmann et al., 2022) established optimal scaling relationships between model size and data size.

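$$N_{\text{opt}} \propto C^{0.5}, \quad D_{\text{opt}} \propto C^{0.5}$$

Here, $N_{\text{opt}}$ is the compute-optimal number of parameters, $D_{\text{opt}}$ is the compute-optimal number of training tokens, and $C$ is the training compute budget in FLOPs.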

This implies that for optimal performance, model size and data size should scale equally with compute. This challenged the prior paradigm of training very large models on insufficient data.

Chinchilla vs GPT-3

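| Model | Parameters | Tokens | Tokens/Param | Training FLOPs (≈ 6 × Parameters × Tokens) |
|---|---|---|---|---|
| GPT-3 | 175B | 300B | 1.7 | 3.1e23 |
| Chinchilla | 70B | 1.4T | 20 | 5.9e23 |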

Chinchilla achieves better performance with fewer parameters but more data, demonstrating the importance of data scaling.
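The numbers in the comparison can be reproduced with a back-of-the-envelope calculation. The sketch below relies on two assumptions that are not derived in this lesson: the common approximation that training cost is about 6 FLOPs per parameter per token, and a fixed ratio of 20 tokens per parameter taken from the Chinchilla row. Function names are illustrative.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0  # assumed approximation: C ≈ 6 * N * D
TOKENS_PER_PARAM = 20.0      # assumed ratio, from the Chinchilla configuration

def training_flops(n_params, n_tokens):
    """Approximate training compute for N parameters and D tokens."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens

def compute_optimal(compute_flops):
    """Solve C = 6 * N * D with D = 20 * N for N and D.

    Substituting gives C = 120 * N^2, so N grows as C^0.5 and so does D.
    """
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

budget = training_flops(70e9, 1.4e12)  # Chinchilla's configuration
n_opt, d_opt = compute_optimal(budget)
print(f"budget: {budget:.2e} FLOPs")
print(f"N_opt: {n_opt:.2e} parameters, D_opt: {d_opt:.2e} tokens")
```

Feeding Chinchilla's own budget back through `compute_optimal` recovers roughly 70B parameters and 1.4T tokens, which is a quick consistency check on the square-root scaling.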

For a detailed treatment of scaling laws, see our module on Scaling Laws and Chinchilla.

Curriculum Learning

Curriculum learning involves presenting training data in a structured order, from easy to hard examples.

Definition: Curriculum Learning

Curriculum learning is a training strategy where data is presented in order of increasing difficulty. For LLMs, this can mean: (1) starting with shorter sequences, (2) gradually increasing data complexity, or (3) focusing on higher-quality data later in training.
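Two of these strategies can be sketched in a few lines: ordering examples by a difficulty score, and growing the maximum sequence length over training. Using length as the difficulty proxy and a linear schedule from 128 to 2048 tokens are illustrative assumptions, not a recommendation from any specific paper.

```python
def curriculum_order(examples, difficulty=len):
    """Sort examples from easiest to hardest under the given difficulty score."""
    return sorted(examples, key=difficulty)

def max_length_at_step(step, total_steps, start_len=128, end_len=2048):
    """Linearly increase the allowed sequence length as training progresses."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return int(start_len + frac * (end_len - start_len))

docs = ["a b c d", "a", "a b"]
ordered = curriculum_order(docs, difficulty=lambda d: len(d.split()))
print(ordered)

for step in (0, 500, 1000):
    print(step, max_length_at_step(step, total_steps=1000))
```

In practice the difficulty score might instead come from a quality classifier or a reference model's loss; the ordering logic stays the same.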

Practical Example: Pre-training with HuggingFace

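```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
)
from datasets import load_dataset

# Note: from_pretrained loads existing weights, so this script continues
# pre-training. To start from random weights, build the model from its config:
#   AutoModelForCausalLM.from_config(AutoConfig.from_pretrained(model_name))
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# The Llama tokenizer has no pad token; the collator needs one to build batches.
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
# WikiText contains blank lines; drop them so no empty sequences are trained on.
dataset = dataset.filter(lambda example: example["text"].strip() != "")

def tokenize(examples):
    return tokenizer(examples["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

training_args = TrainingArguments(
    output_dir="./llama2-pretrained",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # effective batch of 32 sequences per device
    learning_rate=2e-4,
    warmup_steps=1000,
    max_steps=100000,
    fp16=True,
    logging_steps=100,
    save_steps=10000,
    optim="adamw_torch",
    weight_decay=0.1,
    lr_scheduler_type="cosine",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    # mlm=False selects the causal (next-token) objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()
```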

Pre-training from scratch requires massive compute resources (typically thousands of GPUs for months). For most practitioners, fine-tuning a pre-trained model is more practical. See our modules on Fine-tuning and LoRA.

Practice Exercises

  1. Mathematical: Calculate the perplexity of a model with cross-entropy loss of 3.2 nats. How does this compare to a model with loss of 2.8 nats?

  2. Analysis: Given a compute budget of 1e24 FLOPs, what is the Chinchilla-optimal model size and training data size? How does this compare to GPT-3's configuration?

  3. Implementation: Implement a simple CLM training loop from scratch using PyTorch. Train a small model on a text file and track perplexity over training.

  4. Research: Compare the data mixing proportions used in LLaMA 2, Mistral, and Qwen. How do their domain weights differ?

Key Takeaways:

  • Causal Language Modeling (CLM) is the dominant pre-training objective for LLMs
  • Cross-entropy loss measures the difference between predicted and true token distributions
  • Perplexity (exp of the loss) measures model uncertainty as an effective number of equally likely next tokens; its base-2 logarithm is the loss in bits per token
  • Chinchilla scaling laws show model size and data should scale equally with compute
  • Data quality and deduplication are as important as data quantity
  • Curriculum learning can improve training efficiency
