RLHF and Alignment — Making LLMs Safe and Helpful

Alignment aims to make LLMs behave in accordance with human values and intentions. This guide covers the reinforcement learning from human feedback (RLHF) pipeline, reward modeling, PPO, DPO, and the theoretical foundations for building safe and helpful AI systems.

RLHF Pipeline — Supervised fine-tuning followed by reward modeling and PPO
DPO — Direct preference optimization bypasses reward modeling for simpler alignment
Constitutional AI — Reduces dependence on human annotation through principles

Alignment is not a feature—it is a responsibility.

The Alignment Pipeline

The standard alignment pipeline consists of three stages:

Pre-training: Learn general language representations from large corpora
Supervised Fine-Tuning (SFT): Fine-tune on high-quality instruction-response pairs
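RLHF/DPO: Align with human preferences, either by training a reward model and optimizing the policy against it with RL (RLHF) or by optimizing the policy directly on preference pairs (DPO)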

Reward Modeling

What is a Reward Model?

A reward model is a neural network trained to predict human preferences. It assigns a scalar score to a prompt-response pair; given a prompt and two responses, the response with the higher score is the one it predicts a human would prefer.

Reward Model Training

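The reward model is trained on pairwise comparisons under the Bradley-Terry model: the probability that a human prefers one response over another is the sigmoid of the difference between their scalar rewards, and training minimizes the negative log-likelihood of the observed preferences.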
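
The Bradley-Terry training loss is usually written as below, where x is the prompt, y_w the preferred response, y_l the dispreferred response, r_φ the reward model, and σ the logistic function. The symbol names are the conventional ones and are an assumption here, not notation taken from this guide.

$$\mathcal{L}_{\text{RM}}(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(r_\phi(x, y_w) - r_\phi(x, y_l)\right)\right]$$

A minimal sketch of that loss in plain Python, operating on scalar rewards that are assumed to have been computed already (the function names and example values are hypothetical):

```python
import math

def neg_log_sigmoid(z: float) -> float:
    """Compute -log(sigmoid(z)) = log(1 + exp(-z)) in a numerically stable way."""
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

def preference_probability(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response is preferred."""
    return math.exp(-neg_log_sigmoid(r_chosen - r_rejected))

def bradley_terry_loss(pairs: list[tuple[float, float]]) -> float:
    """Mean negative log-likelihood over (r_chosen, r_rejected) reward pairs."""
    return sum(neg_log_sigmoid(rc - rl) for rc, rl in pairs) / len(pairs)

# Hypothetical scalar rewards for three preference pairs.
example_pairs = [(2.0, 0.5), (1.0, 1.0), (-0.5, 0.3)]
loss = bradley_terry_loss(example_pairs)
```

The loss shrinks as the reward model scores preferred responses further above dispreferred ones, and equals log 2 when it cannot tell them apart.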

PPO: Proximal Policy Optimization

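PPO is the RL algorithm most commonly used in RLHF to optimize the policy against the reward model. It limits how far each update can move the policy by clipping the probability ratio between the new and old policies.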

PPO Objective
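
In its common clipped form, the PPO objective is written as below, where r_t(θ) is the probability ratio between the current and previous policies, Â_t is an advantage estimate, and ε is the clip range. These are the conventional symbols and are assumed here.

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

A small sketch of that computation on per-token log-probabilities and advantages. The function name and the default clip range of 0.2 are illustrative choices, not values taken from this guide:

```python
import math

def clipped_surrogate(logp_new, logp_old, advantages, clip_range=0.2):
    """Mean PPO clipped surrogate objective (to be maximized) over sampled tokens."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio between the current and the previous policy.
        ratio = math.exp(lp_new - lp_old)
        # Restrict the ratio to [1 - clip_range, 1 + clip_range].
        clipped = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

With a positive advantage, the objective stops growing once the ratio exceeds 1 + ε, so a single update gains nothing from pushing the policy further.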

RLHF Objective with KL Penalty

The KL penalty keeps the policy from drifting too far from the reference model (typically the SFT model), which mitigates reward hacking and mode collapse.
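
The KL-penalized RLHF objective is commonly written as below, where π_θ is the policy, π_ref the reference model, r_φ the reward model, and β the penalty coefficient. The notation is the conventional one and is assumed here.

$$\max_{\pi_\theta}\ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\left[r_\phi(x, y)\right] - \beta\, \mathbb{D}_{\text{KL}}\left[\pi_\theta(y \mid x)\,\|\,\pi_{\text{ref}}(y \mid x)\right]$$

In practice the KL term is often estimated from the sampled response itself, as the sum of per-token log-probability differences. A minimal sketch under that assumption (the function name and the default β are hypothetical):

```python
def kl_penalized_reward(reward, logp_policy, logp_ref, beta=0.1):
    """Reward-model score minus beta times a sample-based KL estimate.

    logp_policy and logp_ref hold the per-token log-probabilities of the
    sampled response under the policy and the reference model.
    """
    kl_estimate = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return reward - beta * kl_estimate
```

When the policy and the reference model assign the same probabilities, the penalty vanishes and the reward-model score passes through unchanged.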

Reward Hacking
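
Reward hacking occurs when the policy exploits flaws in the learned reward model, earning high scores without actually producing better responses. The toy sketch below illustrates the idea with a deliberately flawed proxy reward that only measures length. The candidates, quality labels, and proxy are all hypothetical.

```python
# Hypothetical candidates: (response text, true quality on a 0-1 scale).
candidates = [
    ("Paris.", 1.0),
    ("The capital of France is Paris.", 1.0),
    ("Great question! " * 20 + "It might be Lyon.", 0.0),
]

def proxy_reward(text: str) -> float:
    """A deliberately flawed reward model that only rewards word count."""
    return float(len(text.split()))

# Optimizing against the proxy selects the long, wrong answer.
best_by_proxy = max(candidates, key=lambda c: proxy_reward(c[0]))
# Optimizing against the true quality selects a correct answer.
best_by_truth = max(candidates, key=lambda c: c[1])
```

Here the response that maximizes the proxy has the lowest true quality, which is the gap that stronger optimization pressure tends to widen.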

DPO: Direct Preference Optimization

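DPO (Rafailov et al., 2023) bypasses explicit reward modeling and PPO. It optimizes the policy directly on preference pairs with a classification-style loss, using the log-probability ratio between the policy and a frozen reference model as an implicit reward.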
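
The DPO loss from Rafailov et al. (2023) can be written as below, with x the prompt, y_w and y_l the preferred and dispreferred responses, π_θ the policy, π_ref the reference model, and β a temperature on the implicit reward.

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

A minimal per-example sketch, assuming the sequence log-probabilities under both models have already been computed (the function and argument names are hypothetical):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # Implicit rewards are the log-ratios between policy and reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) = log(1 + exp(-margin)), computed stably.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss starts at log 2 when the policy equals the reference and falls as the policy raises the likelihood of the preferred response relative to the dispreferred one.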

DPO vs RLHF Comparison

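Aspect	RLHF (PPO)	DPO
Reward model	Trained separately	Not needed (implicit in the policy)
Training stability	Sensitive to hyperparameters (RL)	More stable (supervised-style loss)
Compute cost	High (4 models: policy, reference, reward, value)	Lower (2 models: policy, reference)
Sample efficiency	Lower (needs fresh on-policy samples)	Higher (trains on a fixed preference dataset)
Performance	Strong	Competitive
Implementation	Complex	Simple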

Other Alignment Methods

Constitutional AI (CAI)
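
Constitutional AI replaces much of the human annotation with a written set of principles: in its supervised phase, the model drafts a response, critiques it against a principle, and revises it, and the revisions become fine-tuning data. The sketch below shows that loop in outline. The `model` callable, the prompt wording, and the function name are hypothetical stand-ins, not an actual API.

```python
from typing import Callable

def critique_and_revise(prompt: str, principles: list[str],
                        model: Callable[[str], str]) -> str:
    """Draft a response, then critique and revise it once per principle."""
    response = model(prompt)
    for principle in principles:
        # Ask the model to critique its own response against one principle.
        critique = model(
            f"Critique this response against the principle: {principle}\n"
            f"Prompt: {prompt}\nResponse: {response}"
        )
        # Ask the model to rewrite the response in light of the critique.
        response = model(
            "Rewrite the response to address the critique.\n"
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}"
        )
    return response
```

Each principle costs two extra model calls, so a draft revised against two principles takes five calls in total.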

RLAIF: AI Feedback
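
In RLAIF, preference labels come from an AI judge rather than from human annotators; the rest of the pipeline (reward modeling with PPO, or DPO) is unchanged. A minimal sketch of the labeling step follows, producing records with prompt, chosen, and rejected fields. The `judge` callable and the toy judge are hypothetical stand-ins for a real labeling model.

```python
def build_ai_preferences(samples, judge):
    """Turn (prompt, response_a, response_b) triples into preference records.

    judge(prompt, a, b) is assumed to return True when response a is preferred.
    """
    records = []
    for prompt, a, b in samples:
        chosen, rejected = (a, b) if judge(prompt, a, b) else (b, a)
        records.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return records

def toy_judge(prompt, a, b):
    """Stand-in judge for illustration only: prefers the shorter response."""
    return len(a) <= len(b)

records = build_ai_preferences(
    [("What is 2 + 2?", "4", "The answer could be many things.")],
    toy_judge,
)
```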

Practical Implementation

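import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# NOTE: this follows the PPOTrainer interface of older TRL releases;
# the interface differs in newer releases.
# `dataloader` (batches with a "query" list of prompt strings) and
# `reward_model` (a callable returning a scalar score for a prompt and
# response) are assumed to be defined elsewhere.

# Configure PPO
config = PPOConfig(
    learning_rate=1.41e-5,
    batch_size=64,
    mini_batch_size=16,
    ppo_epochs=4,
    kl_penalty="kl",
    init_kl_coef=0.2,
    target_kl=6.0,
)

# Load the policy (with a value head) and a frozen reference copy of the SFT model
model = AutoModelForCausalLMWithValueHead.from_pretrained("sft-model")
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("sft-model")
tokenizer = AutoTokenizer.from_pretrained("sft-model")

# Create trainer
ppo_trainer = PPOTrainer(
    config=config,
    model=model,
    ref_model=ref_model,
    tokenizer=tokenizer,
)

# Training loop
for batch in dataloader:
    # Tokenize each prompt into a 1-D tensor of token ids
    query_tensors = [tokenizer.encode(q, return_tensors="pt")[0] for q in batch["query"]]

    # Sample responses from the current policy (response tokens only) and decode them
    response_tensors = ppo_trainer.generate(query_tensors, return_prompt=False, max_new_tokens=256)
    responses = [tokenizer.decode(r, skip_special_tokens=True) for r in response_tensors]

    # Score each (prompt, response) pair; step() expects a list of scalar tensors
    rewards = [torch.tensor(float(reward_model(q, r))) for q, r in zip(batch["query"], responses)]

    # Run one PPO update on this batch
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)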

DPO Training

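from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# DPO trains a plain causal LM (no value head); the reference is a frozen copy of the SFT model
model = AutoModelForCausalLM.from_pretrained("sft-model")
ref_model = AutoModelForCausalLM.from_pretrained("sft-model")
tokenizer = AutoTokenizer.from_pretrained("sft-model")

# `preference_dataset` is assumed to be defined elsewhere, with
# "prompt", "chosen", and "rejected" columns.

dpo_config = DPOConfig(
    output_dir="dpo-model",  # hypothetical output path
    beta=0.1,  # strength of the implicit KL constraint to the reference model
    learning_rate=5e-7,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    max_length=1024,
    max_prompt_length=512,
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    tokenizer=tokenizer,  # some newer TRL releases take `processing_class` instead
    train_dataset=preference_dataset,
    args=dpo_config,
)

trainer.train()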

Practice Exercises

Mathematical: Derive the DPO loss from the RLHF objective. Show that the optimal policy can be expressed in closed form.
Implementation: Train a reward model on the Anthropic HH dataset. Evaluate its accuracy on held-out preference pairs.
Comparison: Compare PPO and DPO on the same task. Which is more stable? Which achieves better final performance?
Research: Investigate reward hacking in RLHF. Design a simple experiment that demonstrates the phenomenon.

What to Learn Next

-> Constitutional AI: Reducing dependence on human annotation through AI self-alignment.
-> LLM Safety and Red Teaming: Testing and hardening LLMs against adversarial attacks.
-> Fine-Tuning LLMs: Customizing language models for your specific tasks and domains.
-> Instruction Tuning: Teaching models to follow complex multi-step instructions reliably.
-> Pretraining Language Models: Learning language from the internet with CLM, scaling laws, and data curation.
-> Building Production LLM Apps: From prototype to production, deploying LLMs at scale.
