RLHF and Alignment
Alignment aims to make LLMs behave in accordance with human values and intentions. This tutorial covers the reinforcement learning from human feedback (RLHF) pipeline, reward modeling, PPO, DPO, and the theory behind them.
Definition: Alignment
Alignment is the process of ensuring that an AI system's behavior matches human values, intentions, and preferences. For LLMs, alignment means producing helpful, harmless, and honest outputs that satisfy user intent.
The Alignment Pipeline
The standard alignment pipeline consists of three stages:
- Pre-training: Learn general language representations from large corpora
- Supervised Fine-Tuning (SFT): Fine-tune on high-quality instruction-response pairs
- RLHF/DPO: Align with human preferences using reward modeling or direct optimization
For a detailed treatment of reinforcement learning fundamentals, see our module on Reinforcement Learning.
Reward Modeling
What is a Reward Model?
A reward model is a neural network trained to predict human preferences. Given a prompt and two candidate responses, the scores it assigns indicate which response a human would prefer.
Definition: Reward Model
A reward model R(x, y) assigns a scalar score to a (prompt, response) pair, representing how well the response satisfies human preferences. It is trained on pairwise comparison data from human annotators.
Reward Model Training
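Reward Model Loss
$$
\mathcal{L}(R) = -\mathbb{E}_{(x,\, y_w,\, y_l)} \left[ \log \sigma\big( R(x, y_w) - R(x, y_l) \big) \right]
$$
Here,
- $x$ = Prompt/input
- $y_w$ = Preferred (winning) response
- $y_l$ = Dispreferred (losing) response
- $\sigma$ = Sigmoid function
- $R$ = Reward model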
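The reward model loss is the negative log-likelihood of the Bradley-Terry model for pairwise comparisons:
Bradley-Terry Preference Model
$$
P(y_w \succ y_l \mid x) = \sigma\big( R(x, y_w) - R(x, y_l) \big)
$$
Here,
- $P(y_w \succ y_l \mid x)$ = Probability that y_w is preferred over y_l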
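To make the pairwise loss concrete, here is a minimal sketch in plain Python. It assumes the reward model has already produced scalar scores for each preferred and dispreferred response; the function names and the example scores are illustrative, not taken from any library.

```python
import math


def preference_probability(r_w: float, r_l: float) -> float:
    """Bradley-Terry probability that the response scored r_w is preferred."""
    diff = r_w - r_l
    # Numerically stable sigmoid of the score difference.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def reward_model_loss(pairs) -> float:
    """Mean of -log(sigmoid(r_w - r_l)) over (r_w, r_l) score pairs."""
    total = 0.0
    for r_w, r_l in pairs:
        diff = r_w - r_l
        # -log(sigmoid(d)) equals softplus(-d), computed in a stable form.
        total += max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
    return total / len(pairs)


def pairwise_accuracy(pairs) -> float:
    """Fraction of pairs where the preferred response gets the higher score."""
    return sum(r_w > r_l for r_w, r_l in pairs) / len(pairs)


# Hypothetical (preferred, dispreferred) scores for three annotated pairs.
scores = [(2.0, 0.5), (0.1, 0.3), (1.2, 1.2)]
print(reward_model_loss(scores), pairwise_accuracy(scores))
```

The loss shrinks as the margin between the preferred and dispreferred scores grows, and a tie contributes exactly log 2. The accuracy helper is the kind of held-out metric a reward model is usually evaluated with.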
PPO: Proximal Policy Optimization
PPO is the standard RL algorithm used in RLHF to optimize the policy against the reward model.
PPO Objective
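$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\big( r_t(\theta)\, \hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon)\, \hat{A}_t \big) \right], \quad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
$$
Here,
- $\pi_\theta$ = Current policy
- $\pi_{\theta_{\text{old}}}$ = Previous policy
- $\hat{A}_t$ = Estimated advantage at time t
- $\epsilon$ = Clip parameter (typically 0.2)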
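The clipping behavior is easiest to see on single samples. The sketch below evaluates the clipped surrogate for one (probability ratio, advantage) pair; the function names are illustrative, and a real implementation would average this quantity over a batch of tokens.

```python
import math


def probability_ratio(logp_new: float, logp_old: float) -> float:
    """Ratio of current to previous policy probability for one action."""
    return math.exp(logp_new - logp_old)


def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-sample PPO objective: the minimum of the unclipped and clipped terms."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: gains from raising the ratio above 1 + eps are cut off.
print(clipped_surrogate(1.5, 1.0))
# Negative advantage with a ratio below 1 - eps: the clipped term is used.
print(clipped_surrogate(0.5, -1.0))
# Negative advantage with a large ratio: the unclipped term stays as a penalty.
print(clipped_surrogate(1.5, -1.0))
```

Taking the minimum makes the objective a pessimistic bound: the policy gets no extra credit for moving far from the previous policy, but is still penalized when a large move makes things worse.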
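RLHF Objective with KL Penalty
$$
\max_{\pi_\theta}\ \mathbb{E}_{x,\ y \sim \pi_\theta(\cdot \mid x)} \big[ R(x, y) \big] - \beta\, D_{\text{KL}}\big[ \pi_\theta(y \mid x)\ \|\ \pi_{\text{ref}}(y \mid x) \big]
$$
Here, $\pi_{\text{ref}}$ is the reference model and $\beta$ controls the strength of the KL penalty.
The KL penalty keeps the policy close to the reference model (typically the SFT model), which limits reward hacking and mode collapse.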
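In practice the KL term is often estimated from the log-probabilities of the sampled response itself. The following sketch assumes per-token log-probabilities from the policy and the reference model are available and uses a penalty coefficient called `beta` here; the function name and the numbers are illustrative only.

```python
def kl_penalized_reward(reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward-model score minus beta times a single-sample KL estimate.

    The estimate is the summed per-token log-ratio between the policy and the
    reference model, evaluated on the sampled response.
    """
    kl_estimate = sum(p - q for p, q in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate


# Hypothetical two-token response that the policy finds more likely than the
# reference model does, so part of the reward is taken away.
shaped = kl_penalized_reward(
    reward=1.0,
    policy_logprobs=[-1.0, -0.5],
    ref_logprobs=[-1.5, -1.0],
    beta=0.2,
)
print(shaped)
```

When the policy and the reference model assign the same log-probabilities, the estimate is zero and the reward passes through unchanged. The further the policy drifts toward responses the reference model finds unlikely, the more reward it gives up.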
Reward Hacking
Theorem: Reward Hacking
Reward hacking occurs when the policy learns to exploit weaknesses in the reward model to obtain high rewards without actually satisfying human preferences. Formally, the policy finds y* = argmax_y R(x, y) with R(x, y*) >> R(x, y_human), where y_human is a response humans would actually choose, even though y* is not preferred by humans.
To mitigate reward hacking: (1) use a larger reward model, (2) train it on diverse preference data, (3) apply a KL constraint against the reference model, (4) use an ensemble of reward models, and (5) incorporate Constitutional AI principles.
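As one illustration of the ensemble idea, the sketch below combines scores from several reward models conservatively by subtracting a multiple of their disagreement from the mean. This aggregation rule and the `penalty` parameter are assumptions chosen for illustration; the text does not prescribe a particular way to combine ensemble members.

```python
import statistics


def conservative_reward(ensemble_scores, penalty=1.0):
    """Mean ensemble score minus `penalty` times the population std deviation.

    Responses that only some reward models rate highly are discounted, which
    makes it harder to exploit the blind spots of a single model.
    """
    mean = statistics.fmean(ensemble_scores)
    spread = statistics.pstdev(ensemble_scores)
    return mean - penalty * spread


# All members agree: the score is kept as is.
print(conservative_reward([1.0, 1.0, 1.0]))
# Members disagree strongly: the score is pulled down.
print(conservative_reward([0.0, 2.0]))
```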
DPO: Direct Preference Optimization
DPO (Rafailov et al., 2023) bypasses reward modeling and PPO, directly optimizing the policy on preference data.
DPO vs RLHF Comparison
| Aspect | RLHF (PPO) | DPO |
|---|---|---|
| Reward model | Required | Not needed |
| Training stability | Unstable (RL) | Stable (supervised) |
| Compute cost | High (4 models) | Low (2 models) |
| Sample efficiency | Low | High |
| Performance | Strong | Competitive |
| Implementation | Complex | Simple |
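DPO's key insight: under the KL-constrained RLHF objective, the optimal policy has a closed form in terms of the reward. Inverting that relationship expresses the reward through the policy itself, so the preference loss can be optimized directly on the policy without training an explicit reward model.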
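The closed-form relationship can be written out as follows. This follows the derivation in Rafailov et al. (2023); the symbols $\beta$ (KL coefficient), $\pi_{\text{ref}}$ (reference model), and $Z(x)$ (a normalizing constant that depends only on the prompt) are introduced here for the sketch.

$$
\begin{aligned}
\pi^*(y \mid x) &= \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x)\, \exp\!\left( \tfrac{1}{\beta} R(x, y) \right) \\
R(x, y) &= \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x) \\
\mathcal{L}_{\text{DPO}}(\theta) &= -\mathbb{E}_{(x,\, y_w,\, y_l)} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
\end{aligned}
$$

The last line comes from substituting the second line into the Bradley-Terry preference probability: the $\beta \log Z(x)$ term is the same for both responses to a prompt, so it cancels in the difference.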
Other Alignment Methods
Constitutional AI (CAI)
Definition: Constitutional AI
Constitutional AI uses a set of principles (a constitution) to guide the model's self-improvement. The model critiques and revises its own outputs according to these principles, then is trained on the revised outputs.
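The critique-and-revise loop can be sketched as below. This is a simplified illustration, not the published procedure: `generate` stands for any text-in, text-out model call, the prompt wording is invented, and `toy_model` is a stand-in so the example runs without a real model.

```python
from typing import Callable, Dict, List


def constitutional_revision(
    prompt: str,
    generate: Callable[[str], str],
    principles: List[str],
) -> Dict[str, str]:
    """Draft a response, then critique and revise it once per principle."""
    initial = generate(prompt)
    response = initial
    for principle in principles:
        critique = generate(
            f"Critique the response against this principle: {principle}\n"
            f"Prompt: {prompt}\nResponse: {response}"
        )
        response = generate(
            f"Rewrite the response to address the critique.\n"
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}"
        )
    # The (prompt, revised) pair is what would go into the fine-tuning set.
    return {"prompt": prompt, "initial": initial, "revised": response}


def toy_model(text: str) -> str:
    """Stand-in for a language model, keyed on the instruction prefix."""
    if text.startswith("Critique"):
        return "too vague"
    if text.startswith("Rewrite"):
        return "revised answer"
    return "draft answer"


example = constitutional_revision("Explain RLHF.", toy_model, ["Be specific."])
print(example)
```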
RLAIF: AI Feedback
Definition: RLAIF
RLAIF replaces human annotators with an AI system for generating preference data. A larger, more capable model provides feedback, reducing the cost and scaling limitations of human annotation.
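A minimal sketch of AI-generated preference labels is shown below. It assumes a `judge` callable that scores a (prompt, response) pair; `toy_judge` is a deliberately crude stand-in for the feedback model, and the `prompt`/`chosen`/`rejected` record layout is one common convention for preference data.

```python
from typing import Callable, Dict


def label_with_ai_feedback(
    prompt: str,
    response_a: str,
    response_b: str,
    judge: Callable[[str, str], float],
) -> Dict[str, str]:
    """Build one preference record by letting the judge pick the better response."""
    score_a = judge(prompt, response_a)
    score_b = judge(prompt, response_b)
    if score_a >= score_b:
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def toy_judge(prompt: str, response: str) -> float:
    """Stand-in judge that simply prefers longer responses."""
    return float(len(response))


record = label_with_ai_feedback(
    "What is RLHF?",
    "A method.",
    "Reinforcement learning from human feedback.",
    toy_judge,
)
print(record)
```

Records in this shape can be used wherever human-labeled preference pairs would be, for example to train a reward model or as input to DPO.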
Practical Implementation
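```python
# Written against the PPOTrainer interface of older TRL releases;
# recent releases changed this API.
import torch
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer

# Configure PPO
config = PPOConfig(
    learning_rate=1.41e-5,
    batch_size=64,
    mini_batch_size=16,
    ppo_epochs=4,
    kl_penalty="kl",
    init_kl_coef=0.2,
    target_kl=6.0,
)

# Load models (the frozen reference model starts from the same SFT checkpoint)
model = AutoModelForCausalLMWithValueHead.from_pretrained("sft-model")
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("sft-model")
tokenizer = AutoTokenizer.from_pretrained("sft-model")

# Create trainer
ppo_trainer = PPOTrainer(
    config=config,
    model=model,
    ref_model=ref_model,
    tokenizer=tokenizer,
)

# Training loop (`dataloader` and `reward_model` are assumed to be defined elsewhere)
for batch in dataloader:
    query_tensors = [tokenizer.encode(q, return_tensors="pt")[0] for q in batch["query"]]

    # Generate responses without the prompt and decode them for scoring
    response_tensors = ppo_trainer.generate(
        query_tensors, return_prompt=False, max_new_tokens=256
    )
    responses = tokenizer.batch_decode(response_tensors, skip_special_tokens=True)

    # step() expects one scalar tensor reward per sample
    rewards = [
        torch.tensor(float(reward_model(q, r)))
        for q, r in zip(batch["query"], responses)
    ]

    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```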
DPO Training
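```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer, DPOConfig

# DPO trains a plain causal LM (no value head); the reference model is a
# frozen copy of the same SFT checkpoint
model = AutoModelForCausalLM.from_pretrained("sft-model")
ref_model = AutoModelForCausalLM.from_pretrained("sft-model")
tokenizer = AutoTokenizer.from_pretrained("sft-model")

dpo_config = DPOConfig(
    output_dir="dpo-output",  # placeholder path for checkpoints
    beta=0.1,
    learning_rate=5e-7,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    max_length=1024,
    max_prompt_length=512,
)

# `preference_dataset` is assumed to be defined elsewhere, with
# "prompt", "chosen", and "rejected" columns
trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    tokenizer=tokenizer,
    train_dataset=preference_dataset,
    args=dpo_config,
)

trainer.train()
```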
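To show what such a trainer optimizes, here is a per-example sketch of the DPO loss in plain Python, following the form given by Rafailov et al. (2023). It assumes the summed log-probabilities of the chosen and rejected responses under the policy and the reference model are already computed; the function and argument names are illustrative and are not TRL's API.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals softplus(-m), computed in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: the loss equals log 2.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has shifted probability toward the chosen response: the loss is lower.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The `beta` argument plays the same role as `beta` in the DPO configuration: larger values penalize deviation from the reference model more strongly.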
Practice Exercises
- Mathematical: Derive the DPO loss from the RLHF objective. Show that the optimal policy can be expressed in closed form.
- Implementation: Train a reward model on the Anthropic HH dataset. Evaluate its accuracy on held-out preference pairs.
- Comparison: Compare PPO and DPO on the same task. Which is more stable? Which achieves better final performance?
- Research: Investigate reward hacking in RLHF. Design a simple experiment that demonstrates the phenomenon.
Key Takeaways:
- Alignment aims to make LLMs behave in accordance with human values
- RLHF uses reward modeling + PPO to optimize against human preferences
- DPO directly optimizes the policy on preference data, bypassing reward modeling
- KL penalties keep the policy close to the reference model, limiting reward hacking and mode collapse
- Reward hacking is a fundamental challenge in RLHF
- Constitutional AI and RLAIF reduce dependence on human annotation
- DPO is typically simpler to implement and more stable to train than PPO, with competitive performance