DPO and Preference Optimization — Alignment Without RL
RLHF requires complex reinforcement learning pipelines. DPO (Direct Preference Optimization) aligns language models directly from preference data, eliminating the need for a reward model and PPO training.
- Preference Data — Human comparisons of good vs bad outputs
- Direct Optimization — Skip the reward model, optimize preferences directly
- Simpler Training — No RL instability, just supervised learning
Why learn from rewards when you can learn from preferences?
DPO and Preference Optimization
RLHF (Reinforcement Learning from Human Feedback) has two main challenges: training a reward model and optimizing with PPO (which is unstable and expensive). DPO (Rafailov et al., 2023) provides a simpler alternative that directly optimizes the language model on preference data.
Definition: Direct Preference Optimization (DPO)
DPO is a method that directly optimizes a language model to satisfy human preferences without training a separate reward model. It reformulates the RLHF objective as a simple classification loss on preference pairs (preferred vs rejected responses).
The RLHF Problem
Standard RLHF Pipeline
1. Collect preference data: (prompt, chosen, rejected)
2. Train reward model: r(prompt, response)
3. Optimize with PPO: max E[r(prompt, y)] - beta * KL(pi || pi_ref)
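RLHF Objective
$$J(\pi_\theta) = \mathbb{E}_{x \sim \mathcal{D}}\Big[\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big] - \beta\, \mathrm{KL}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\big)\Big]$$
Here,
- $J(\pi_\theta)$ = Total reward
- $r(x, y)$ = Reward model score for prompt $x$ and response $y$
- $\beta$ = KL penalty weight
- $\pi_\theta$ = Current policy
- $\pi_{\text{ref}}$ = Reference policy (SFT model)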
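As a rough illustration of the quantity that step 3 of the pipeline maximizes, the sketch below estimates the KL-penalized reward for one sampled response. The function name and the numbers are hypothetical, and this is a single-sample estimate rather than a full PPO update.

```python
def kl_regularized_reward(reward, policy_logp, ref_logp, beta=0.1):
    """Single-sample estimate of r(x, y) - beta * KL(pi || pi_ref).

    For a response y sampled from the policy, log pi(y|x) - log pi_ref(y|x)
    is a one-sample estimate of the KL term.
    """
    kl_estimate = policy_logp - ref_logp
    return reward - beta * kl_estimate


# Hypothetical numbers: the reward model scores the response 2.0, and the
# policy has drifted so the response is 5 nats more likely than under the
# reference model.
print(kl_regularized_reward(reward=2.0, policy_logp=-10.0, ref_logp=-15.0))  # 1.5
```

A larger beta makes the same drift cost more, which is how the penalty keeps the policy close to the reference model.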
The DPO Solution
DPO Loss Function
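Definition: DPO Loss
The DPO loss trains the language model to raise the probability of preferred responses relative to rejected ones, with both measured against a frozen reference model. The reference term keeps the policy from drifting far from the SFT model, and beta sets how strongly that constraint is enforced.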
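DPO Loss
$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$
Here,
- $y_w$ = Preferred (chosen) response
- $y_l$ = Rejected response
- $\pi_\theta$ = Current policy being optimized
- $\pi_{\text{ref}}$ = Reference policy (frozen SFT model)
- $\beta$ = KL penalty weight (controls deviation from reference)
- $\sigma$ = Sigmoid function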
Why DPO Works
DPO shows that the optimal policy under the RLHF objective can be expressed in closed form as a function of the reward. By substituting this into the preference likelihood, we get a loss that directly optimizes the policy without needing to estimate rewards.
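The argument can be sketched in three steps, following Rafailov et al. (2023). The notation is an assumption of this sketch: $x$ is the prompt, $Z(x)$ is the normalizing partition function, and $y_w$, $y_l$ are the chosen and rejected responses. First, the optimal policy under the KL-regularized objective is:

$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x)\, \exp\!\left(\tfrac{1}{\beta}\, r(x, y)\right)$$

Solving for the reward gives:

$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$

Under the Bradley-Terry preference model, only the difference of rewards matters, so the intractable $\beta \log Z(x)$ term cancels:

$$p(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big) = \sigma\!\left(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)$$

Maximizing the log-likelihood of the observed preferences with $\pi_\theta$ in place of $\pi^*$ yields a loss that depends only on policy and reference log-probabilities.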
    import torch
    import torch.nn.functional as F


    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 reference_chosen_logps, reference_rejected_logps, beta=0.1):
        """Compute the DPO loss from per-sequence log-probabilities.

        Each argument is a tensor of shape (batch,) holding log pi(y | x),
        summed over the response tokens, under either the current policy or
        the frozen reference model.
        """
        # Implicit rewards: beta-scaled log-ratio of policy to reference
        chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps)
        rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps)

        # Negative log-likelihood that the chosen response beats the rejected one
        loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
        return loss
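To make the loss concrete, here is a minimal scalar sketch in plain Python for a single preference pair. The function name `dpo_pair_loss` and the log-probability values are hypothetical and chosen only for illustration.

```python
import math


def dpo_pair_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)) is the same as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# At initialization the policy equals the reference, so the margin is 0
# and the loss is log(2).
start = dpo_pair_loss(-20.0, -22.0, -20.0, -22.0)

# Later, the chosen response has gained 3 nats and the rejected response
# has lost 3 nats relative to the reference.
later = dpo_pair_loss(-17.0, -25.0, -20.0, -22.0)

print(round(start, 4))  # 0.6931
print(later < start)    # True
```

Only the change relative to the reference model matters: the absolute log-probabilities of the two responses cancel out of the margin.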
Preference Data Collection
How to Collect Preferences
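    import time


    class PreferenceDataCollector:
        """Samples candidate responses and turns human labels into preference pairs.

        Assumes a Hugging Face transformers-style causal LM and tokenizer.
        """

        def __init__(self, model, tokenizer):
            self.model = model
            self.tokenizer = tokenizer

        def collect_pair(self, prompt, num_samples=2):
            """Generate candidate responses to the same prompt for comparison."""
            inputs = self.tokenizer(prompt, return_tensors="pt")
            prompt_length = inputs["input_ids"].shape[1]
            responses = []
            for _ in range(num_samples):
                # Sampling must be enabled, otherwise temperature has no
                # effect and every response would be identical.
                output_ids = self.model.generate(
                    **inputs,
                    do_sample=True,
                    temperature=0.7,
                    max_length=512,
                )
                # Drop the prompt tokens and decode only the continuation
                response = self.tokenizer.decode(
                    output_ids[0][prompt_length:], skip_special_tokens=True
                )
                responses.append(response)
            return {
                "prompt": prompt,
                "responses": responses,
                "metadata": {"timestamp": time.time()},
            }

        def label_preference(self, prompt, response_a, response_b, human_label):
            """Convert a human label ("A" or "B") into a chosen/rejected record."""
            if human_label == "A":
                return {"prompt": prompt, "chosen": response_a, "rejected": response_b}
            elif human_label == "B":
                return {"prompt": prompt, "chosen": response_b, "rejected": response_a}
            else:
                return None  # Tie or unrecognized label: skip this pair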
Preference Data Quality
| Quality Factor | Impact | Best Practice |
|---|---|---|
| Inter-annotator agreement | High agreement = reliable signal | Use 3+ annotators, majority vote |
| Response diversity | More diversity = better learning | Sample with different temperatures |
| Prompt diversity | Broader coverage = better generalization | Mix task types and difficulty levels |
| Label noise | Noisy labels hurt alignment | Use confident labels only |
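The "3+ annotators, majority vote" and "confident labels only" practices in the table can be combined in a small aggregation step. The sketch below is one possible way to do it; the function name, the "tie" label, and the two-thirds agreement threshold are assumptions made for illustration.

```python
from collections import Counter


def aggregate_labels(labels, min_agreement=2 / 3):
    """Return the majority label ("A" or "B"), or None if agreement is too low.

    `labels` holds one vote per annotator; "tie" votes count against agreement.
    """
    if not labels:
        return None
    winner, count = Counter(labels).most_common(1)[0]
    if winner not in ("A", "B") or count / len(labels) < min_agreement:
        return None
    return winner


print(aggregate_labels(["A", "A", "B"]))    # A
print(aggregate_labels(["A", "B", "tie"]))  # None
```

Pairs that come back as None are dropped, trading dataset size for a cleaner preference signal.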
DPO Variants
IPO (Identity Preference Optimization)
Definition: IPO
IPO (Azar et al., 2023) addresses DPO's tendency to overfit to preference data. It uses a squared loss instead of log-sigmoid, providing more stable training.
    def ipo_loss(policy_chosen_logps, policy_rejected_logps,
                 reference_chosen_logps, reference_rejected_logps, beta=0.1):
        """IPO loss: squared error between the log-ratio margin and 1 / (2 * beta)."""
        # Inputs are tensors of per-sequence log-probabilities, shape (batch,)
        chosen_logratios = policy_chosen_logps - reference_chosen_logps
        rejected_logratios = policy_rejected_logps - reference_rejected_logps
        # Regress the margin onto a fixed target instead of pushing it to infinity
        loss = (chosen_logratios - rejected_logratios - 1 / (2 * beta)).pow(2).mean()
        return loss
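The difference between the two losses is easiest to see as a function of the margin h, the chosen log-ratio minus the rejected log-ratio. The plain-Python sketch below uses hypothetical helper names and margins: the DPO term keeps shrinking as h grows, so the optimizer is always rewarded for widening the margin, while the IPO term is smallest at h = 1/(2*beta) and penalizes overshoot.

```python
import math


def dpo_term(h, beta=0.1):
    """DPO loss as a function of the log-ratio margin h."""
    return math.log1p(math.exp(-beta * h))


def ipo_term(h, beta=0.1):
    """IPO loss as a function of the same margin h."""
    return (h - 1 / (2 * beta)) ** 2


# With beta = 0.1 the IPO target margin is 1 / (2 * 0.1) = 5.
for h in (0.0, 5.0, 50.0):
    print(h, round(dpo_term(h), 4), round(ipo_term(h), 4))
```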
KTO (Kahneman-Tversky Optimization)
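Definition: KTO
KTO (Ethayarajh et al., 2024) uses a loss inspired by prospect theory that is asymmetric around a reference point, with the aim of better modeling how humans evaluate gains and losses. It needs only a binary desirable or undesirable label for each response rather than paired comparisons.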
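A simplified per-example sketch of a KTO-style loss is shown below. It is not a full implementation: in the paper the reference point is a KL estimate computed over a batch, whereas here it is passed in as a plain number, and the function and parameter names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def kto_example_loss(log_ratio, desirable, kl_ref_point=0.0,
                     beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Per-example KTO-style loss.

    log_ratio is log pi(y|x) - log pi_ref(y|x). kl_ref_point stands in for
    the batch-level KL estimate that serves as the reference point.
    """
    if desirable:
        # Desirable outputs are rewarded for rising above the reference point
        return lambda_d * (1.0 - sigmoid(beta * (log_ratio - kl_ref_point)))
    # Undesirable outputs are rewarded for falling below it
    return lambda_u * (1.0 - sigmoid(beta * (kl_ref_point - log_ratio)))
```

Setting `lambda_d` and `lambda_u` to different values weights desirable and undesirable examples differently, which is where the asymmetry comes from.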
ORPO (Odds Ratio Preference Optimization)
Definition: ORPO
ORPO (Hong et al., 2024) adds an odds-ratio preference term to the SFT loss, combining SFT and preference optimization in a single training step. This eliminates the need for a separate SFT phase and for a reference model.
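A minimal scalar sketch of an ORPO-style objective follows, assuming each response is summarized by its average per-token log-probability (strictly negative). The function names and the weight `lam` are hypothetical choices for illustration.

```python
import math


def log_odds(avg_logp):
    """log(p / (1 - p)), where p = exp(avg_logp) is a length-normalized likelihood."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(chosen_avg_logp, rejected_avg_logp, lam=0.1):
    """SFT loss on the chosen response plus a weighted odds-ratio penalty."""
    sft_loss = -chosen_avg_logp
    log_odds_ratio = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    # -log(sigmoid(log_odds_ratio)), written as log(1 + exp(-x))
    or_loss = math.log1p(math.exp(-log_odds_ratio))
    return sft_loss + lam * or_loss
```

No reference-model log-probabilities appear anywhere in the computation, which is why ORPO needs only one model in memory.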
DPO vs RLHF Comparison
| Feature | RLHF (PPO) | DPO |
|---|---|---|
| Reward model | Required | Not needed |
| Training stability | Unstable (PPO) | Stable (classification) |
| Hyperparameters | Many (PPO) | Few (mainly beta) |
| Compute cost | High | Low |
| Memory | High (4 models) | Low (2 models) |
| Theoretical foundation | RL theory | Classification theory |
| Performance | Often slightly better when well tuned | Often slightly worse, task-dependent |
DPO is preferred for most use cases due to simplicity and stability. Use RLHF only when you need the absolute best performance and have the compute budget for PPO training.
Practice Exercises
- DPO Implementation: Implement DPO training for a small language model. How does the beta parameter affect the tradeoff between preference satisfaction and deviation from the reference model?
- Data Collection: Design a preference data collection pipeline. How many preference pairs are needed for good alignment?
- A/B Testing: Compare DPO vs SFT on a set of prompts. What types of prompts show the biggest improvement from alignment?
- Hyperparameter Analysis: How does beta affect the KL divergence from the reference model? Plot the tradeoff curve.
Key Takeaways
Summary: DPO and Preference Optimization
- DPO directly optimizes language models on preference data
- No separate reward model needed, which removes the reward-modeling and PPO stages of the RLHF pipeline
- Simpler and more stable than PPO-based RLHF
- Preference data consists of (prompt, chosen, rejected) triples
- Beta parameter controls the tradeoff between alignment and reference adherence
- DPO variants (IPO, KTO, ORPO) address specific limitations
- DPO is the default choice for alignment in most practical scenarios
What to Learn Next
-> RLHF and Alignment: The original RLHF approach to alignment.
-> Constitutional AI: Using AI feedback for alignment.
-> Fine-Tuning LLMs: Customizing models for specific tasks.
-> Instruction Tuning: Training models to follow instructions.
-> RLHF Alternatives: Other approaches to alignment.
-> Alignment Tax and Capabilities: How alignment affects model capabilities.