DPO and Preference Optimization — Alignment Without RL

RLHF requires complex reinforcement learning pipelines. DPO (Direct Preference Optimization) aligns language models directly from preference data, eliminating the need for a reward model and PPO training.

Preference Data — Human comparisons of good vs bad outputs
Direct Optimization — Skip the reward model, optimize preferences directly
Simpler Training — No RL instability, just supervised learning

Why learn from rewards when you can learn from preferences?

DPO and Preference Optimization

RLHF (Reinforcement Learning from Human Feedback) has two main pain points: it requires training a separate reward model, and it then optimizes the policy with PPO, which is sensitive to hyperparameters and expensive to run. DPO (Rafailov et al., 2023) provides a simpler alternative that directly optimizes the language model on preference data.

Definition: Direct Preference Optimization (DPO)

DPO is a method that directly optimizes a language model to satisfy human preferences without training a separate reward model. It reformulates the RLHF objective as a simple classification loss on preference pairs (preferred vs rejected responses).

The RLHF Problem

Standard RLHF Pipeline

1. Collect preference data: (prompt, chosen, rejected)
2. Train reward model: r(prompt, response)
3. Optimize with PPO: max E[r(prompt, y)] - beta * KL(pi || pi_ref)

RLHF Objective

$$R(y) = r(y) - \beta \cdot \log \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)}$$

Here,

$R(y)$ = Total reward for response $y$ to prompt $x$
$r(y)$ = Reward model score
$\beta$ = KL penalty weight
$\pi(y|x)$ = Current policy
$\pi_{\text{ref}}(y|x)$ = Reference policy (SFT model)
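
As a rough numeric illustration of the objective above, the sketch below computes the KL-penalized reward for a single sampled response. The function name `kl_shaped_reward` and the example numbers are illustrative assumptions and do not come from any library.

```python
def kl_shaped_reward(reward_score, policy_logp, ref_logp, beta=0.1):
    """Return R(y) = r(y) - beta * log(pi(y|x) / pi_ref(y|x)).

    policy_logp and ref_logp are log-probabilities of the same response y
    under the current policy and the frozen reference model.
    """
    log_ratio = policy_logp - ref_logp
    return reward_score - beta * log_ratio


# Hypothetical numbers: the policy is more confident in y than the reference,
# so part of the reward model score is paid back as a KL penalty.
total = kl_shaped_reward(reward_score=2.0, policy_logp=-10.0, ref_logp=-14.0, beta=0.1)
print(total)  # 2.0 - 0.1 * 4.0
```

A larger `beta` makes the same drift from the reference more costly, which is the lever PPO-based RLHF uses to keep the policy close to the SFT model.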

The DPO Solution

DPO Loss Function

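Definition: DPO Loss

The DPO loss trains the language model to raise the probability of preferred responses and lower the probability of rejected ones, both measured relative to a frozen reference model. The reference model anchors the policy so that it does not drift far from the SFT model while fitting the preferences.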

DPO Loss

$$L_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma \left( \beta \log \frac{\pi(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right) \right]$$

Here,

$y_w$ = Preferred (chosen) response
$y_l$ = Rejected response
$\pi$ = Current policy being optimized
$\pi_{\text{ref}}$ = Reference policy (frozen SFT model)
$\beta$ = KL penalty weight (controls deviation from reference)
$\sigma$ = Sigmoid function
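
To make the formula concrete, here is a small worked example in plain Python for a single preference pair. The helper name `dpo_loss_single` and all log-probability values are made up for illustration.

```python
import math


def dpo_loss_single(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: the margin is 0 and the loss is log(2).
loss_at_ref = dpo_loss_single(pi_w=-13.0, pi_l=-14.0, ref_w=-13.0, ref_l=-14.0)

# Policy has raised the chosen response and lowered the rejected one
# relative to the reference, so the loss drops below log(2).
loss_improved = dpo_loss_single(pi_w=-12.0, pi_l=-15.0, ref_w=-13.0, ref_l=-14.0)

print(loss_at_ref, loss_improved)
```

Only the differences from the reference matter: shifting both policy log-probabilities of a response by the same amount as the reference leaves the loss unchanged.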

Why DPO Works

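DPO rests on the observation that the optimal policy under the KL-regularized RLHF objective has a closed form in terms of the reward. Inverting that relationship expresses the reward through the policy and the reference model. Substituting it into the Bradley-Terry preference likelihood cancels the intractable normalizing constant and leaves a loss that depends only on the policy, with no separate reward model to fit.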
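
The substitution can be written out in three steps. This is a sketch of the standard argument, assuming the Bradley-Terry preference model; $Z(x)$ denotes the normalizing constant over all responses, and the reward is written $r(x, y)$ to make the prompt explicit (both are notation introduced here, not used elsewhere in this section).

```latex
% Step 1: closed-form optimum of the KL-regularized objective
\pi^{*}(y|x) = \frac{1}{Z(x)} \, \pi_{\text{ref}}(y|x) \, \exp\!\left( \frac{1}{\beta} r(x, y) \right)

% Step 2: invert to express the reward through the policy
r(x, y) = \beta \log \frac{\pi^{*}(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)

% Step 3: Bradley-Terry likelihood; the \beta \log Z(x) terms cancel in the difference
p(y_w \succ y_l \mid x) = \sigma\big( r(x, y_w) - r(x, y_l) \big)
  = \sigma\!\left( \beta \log \frac{\pi^{*}(y_w|x)}{\pi_{\text{ref}}(y_w|x)}
                 - \beta \log \frac{\pi^{*}(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right)
```

Taking the negative log of the last expression and replacing $\pi^{*}$ with the trainable policy $\pi$ gives the DPO loss.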

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen, policy_rejected, reference_chosen, reference_rejected, beta=0.1):
    """Compute the DPO loss for a batch of preference pairs.

    Each argument is a tensor of shape (batch,) holding the summed token
    log-probabilities of the chosen or rejected response under the current
    policy or the frozen reference model.
    """
    # Implicit rewards: beta-scaled log-ratios of policy to reference
    chosen_rewards = beta * (policy_chosen - reference_chosen)
    rejected_rewards = beta * (policy_rejected - reference_rejected)

    # Negative log-likelihood that the chosen response beats the rejected one
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    return loss
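
The loss above consumes log-probabilities of whole responses. One common way to obtain them from model logits is sketched below; the function name `sequence_log_probs` and the tensor layout (logits already shifted so that position `t` predicts `labels[:, t]`) are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F


def sequence_log_probs(logits, labels, mask):
    """Sum token log-probabilities over the response tokens of each sequence.

    logits: (batch, seq_len, vocab) model outputs, aligned so that
            logits[:, t] predicts labels[:, t].
    labels: (batch, seq_len) token ids.
    mask:   (batch, seq_len) 1.0 for response tokens, 0.0 for prompt or padding.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # Pick out the log-probability assigned to each actual token
    token_logps = torch.gather(log_probs, dim=-1, index=labels.unsqueeze(-1)).squeeze(-1)
    # Ignore prompt and padding positions, then sum per sequence
    return (token_logps * mask).sum(dim=-1)


# Toy check with random logits: batch of 2, 4 positions, vocabulary of 5.
torch.manual_seed(0)
logits = torch.randn(2, 4, 5)
labels = torch.randint(0, 5, (2, 4))
mask = torch.tensor([[0.0, 1.0, 1.0, 1.0], [0.0, 0.0, 1.0, 1.0]])
seq_logps = sequence_log_probs(logits, labels, mask)
print(seq_logps.shape)
```

In practice the same function would be run four times per batch: policy and reference model, each on the chosen and the rejected response, with the reference calls wrapped in `torch.no_grad()`.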

Preference Data Collection

How to Collect Preferences

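import time


class PreferenceDataCollector:
    """Collects response pairs for human labeling.

    Assumes a Hugging Face transformers-style causal LM and tokenizer.
    """

    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def collect_pair(self, prompt, num_samples=2):
        """Generate a pair of responses for comparison."""
        inputs = self.tokenizer(prompt, return_tensors="pt")
        prompt_length = inputs["input_ids"].shape[1]

        responses = []
        for _ in range(num_samples):
            output_ids = self.model.generate(
                **inputs,
                do_sample=True,  # sampling is required for temperature to take effect
                temperature=0.7,
                max_length=512  # total length, prompt tokens included
            )
            # Drop the prompt tokens and decode only the generated continuation
            response = self.tokenizer.decode(
                output_ids[0][prompt_length:], skip_special_tokens=True
            )
            responses.append(response)

        return {
            "prompt": prompt,
            "responses": responses,
            "metadata": {"timestamp": time.time()}
        }

    def label_preference(self, prompt, response_a, response_b, human_label):
        """Turn a human label ("A" or "B") into a (prompt, chosen, rejected) record."""
        if human_label == "A":
            return {"prompt": prompt, "chosen": response_a, "rejected": response_b}
        elif human_label == "B":
            return {"prompt": prompt, "chosen": response_b, "rejected": response_a}
        else:
            return None  # Tie - skip this pair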

Preference Data Quality

Quality Factor	Impact	Best Practice
Inter-annotator agreement	High agreement = reliable signal	Use 3+ annotators, majority vote
Response diversity	More diversity = better learning	Sample with different temperatures
Prompt diversity	Broader coverage = better generalization	Mix task types and difficulty levels
Label noise	Noisy labels hurt alignment	Use confident labels only
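
The "3+ annotators, majority vote" and "confident labels only" practices in the table can be combined in a small aggregation step. The sketch below is one possible way to do it; the function name `aggregate_labels`, the `"tie"` label, and the `min_agreement` threshold of 0.6 are assumptions chosen for illustration.

```python
from collections import Counter


def aggregate_labels(labels, min_agreement=0.6):
    """Majority-vote a list of annotator labels ("A", "B", or "tie").

    Returns the winning label, or None when the pair should be discarded:
    no labels, a tied vote, "tie" winning, or agreement below min_agreement.
    """
    if not labels:
        return None
    counts = Counter(labels).most_common()
    top_label, top_count = counts[0]
    # Discard exact ties between the two most frequent labels
    if len(counts) > 1 and counts[1][1] == top_count:
        return None
    if top_label == "tie" or top_count / len(labels) < min_agreement:
        return None
    return top_label


print(aggregate_labels(["A", "A", "B"]))    # two of three annotators agree
print(aggregate_labels(["A", "B", "tie"]))  # no majority, pair is dropped
```

Raising `min_agreement` trades dataset size for label reliability, which is the tension the table describes.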

DPO Variants

IPO (Identity Preference Optimization)

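Definition: IPO

IPO (Azar et al., 2023) addresses DPO's tendency to overfit to preference data. It replaces the log-sigmoid with a squared loss that pulls the log-ratio margin between chosen and rejected responses toward a fixed target, so the margin cannot grow without bound and training is more stable.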

def ipo_loss(policy_chosen, policy_rejected, reference_chosen, reference_rejected, beta=0.1):
    """IPO loss - more stable than DPO.

    Arguments are tensors of shape (batch,) holding sequence log-probabilities.
    """
    chosen_logratios = policy_chosen - reference_chosen
    rejected_logratios = policy_rejected - reference_rejected

    # Squared error between the log-ratio margin and the target 1/(2*beta)
    loss = (chosen_logratios - rejected_logratios - 1/(2*beta)).pow(2).mean()
    return loss

KTO (Kahneman-Tversky Optimization)

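Definition: KTO

KTO (Ethayarajh et al., 2024) uses a loss inspired by prospect theory: it is asymmetric around a reference point, reflecting how people weigh gains and losses differently. It learns from single responses labeled desirable or undesirable, so paired comparisons are not required.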
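
A simplified sketch of a KTO-style loss is shown below. It follows the general shape of the loss described by Ethayarajh et al. but takes the reference point as a precomputed argument instead of estimating it from the batch. The function name `kto_loss`, the argument `kl_estimate`, and the weights `lambda_d` and `lambda_u` are naming assumptions, so treat this as an illustration only.

```python
import torch


def kto_loss(policy_logps, ref_logps, is_desirable, kl_estimate,
             beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Simplified KTO-style loss on unpaired examples.

    policy_logps, ref_logps: (batch,) sequence log-probabilities.
    is_desirable: (batch,) boolean tensor, True for desirable examples.
    kl_estimate: scalar reference point (assumed precomputed, no gradient).
    """
    log_ratio = policy_logps - ref_logps
    # Desirable: value rises as the log-ratio climbs above the reference point
    desirable_value = lambda_d * torch.sigmoid(beta * (log_ratio - kl_estimate))
    # Undesirable: value rises as the log-ratio falls below the reference point
    undesirable_value = lambda_u * torch.sigmoid(beta * (kl_estimate - log_ratio))
    value = torch.where(is_desirable, desirable_value, undesirable_value)
    weight = torch.where(is_desirable,
                         torch.full_like(value, lambda_d),
                         torch.full_like(value, lambda_u))
    return (weight - value).mean()


# Hypothetical batch: one desirable and one undesirable response.
ref = torch.tensor([-10.0, -12.0])
flags = torch.tensor([True, False])
base = kto_loss(ref.clone(), ref, flags, kl_estimate=0.0)
better = kto_loss(torch.tensor([-8.0, -14.0]), ref, flags, kl_estimate=0.0)
print(base.item(), better.item())
```

Setting `lambda_d` and `lambda_u` to different values is what makes the loss asymmetric between desirable and undesirable examples.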

ORPO (Odds Ratio Preference Optimization)

Definition: ORPO

ORPO (Hong et al., 2024) combines SFT and preference optimization in a single training step by adding an odds-ratio penalty to the SFT loss. This removes the need for a separate SFT phase and for a reference model.
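
The idea of adding an odds-ratio term to the SFT loss can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `orpo_loss`, the weight `lam`, and the use of length-averaged log-probabilities as inputs are assumptions.

```python
import torch
import torch.nn.functional as F


def orpo_loss(chosen_avg_logps, rejected_avg_logps, lam=0.1):
    """Sketch of an ORPO-style objective: SFT loss plus an odds-ratio penalty.

    chosen_avg_logps, rejected_avg_logps: (batch,) length-averaged token
    log-probabilities under the model being trained (strictly negative).
    """
    # log odds(y) = log p - log(1 - p), with p = exp(average log-probability)
    log_odds_chosen = chosen_avg_logps - torch.log1p(-torch.exp(chosen_avg_logps))
    log_odds_rejected = rejected_avg_logps - torch.log1p(-torch.exp(rejected_avg_logps))
    # Penalize the model when the chosen response does not have higher odds
    odds_ratio_term = -F.logsigmoid(log_odds_chosen - log_odds_rejected)
    # Standard negative log-likelihood on the chosen response
    sft_term = -chosen_avg_logps
    return (sft_term + lam * odds_ratio_term).mean()


# Hypothetical values: equal odds versus a clearly less likely rejected response.
equal = orpo_loss(torch.tensor([-0.5]), torch.tensor([-0.5]))
separated = orpo_loss(torch.tensor([-0.5]), torch.tensor([-2.0]))
print(equal.item(), separated.item())
```

No reference-model log-probabilities appear anywhere in the function, which is why only one model has to be held in memory.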

DPO vs RLHF Comparison

Feature	RLHF (PPO)	DPO
Reward model	Required	Not needed
Training stability	Unstable (PPO)	Stable (classification)
Hyperparameters	Many (PPO)	Few (mainly beta)
Compute cost	High	Low
Memory	High (4 models)	Low (2 models)
Theoretical foundation	KL-regularized RL objective	Same objective, rewritten as a classification loss
Performance	Often slightly better when carefully tuned	Comparable, sometimes slightly worse

DPO is preferred for most use cases due to simplicity and stability. Use RLHF only when you need the absolute best performance and have the compute budget for PPO training.

Practice Exercises

DPO Implementation: Implement DPO training for a small language model. How does the beta parameter affect the tradeoff between preference satisfaction and deviation from the reference model?
Data Collection: Design a preference data collection pipeline. How many preference pairs are needed for good alignment?
A/B Testing: Compare DPO vs SFT on a set of prompts. What types of prompts show the biggest improvement from alignment?
Hyperparameter Analysis: How does beta affect the KL divergence from the reference model? Plot the tradeoff curve.

Key Takeaways

Summary: DPO and Preference Optimization

DPO directly optimizes language models on preference data
No reward model needed: the reward-model and PPO stages of the RLHF pipeline are removed
Simpler and more stable than PPO-based RLHF
Preference data consists of (prompt, chosen, rejected) triples
Beta parameter controls the tradeoff between alignment and reference adherence
DPO variants (IPO, KTO, ORPO) address specific limitations
DPO is the default choice for alignment in most practical scenarios

What to Learn Next

-> RLHF and Alignment The original RLHF approach to alignment.

-> Constitutional AI Using AI feedback for alignment.

-> Fine-Tuning LLMs Customizing models for specific tasks.

-> Instruction Tuning Training models to follow instructions.

-> RLHF Alternatives Other approaches to alignment.

-> Alignment Tax and Capabilities How alignment affects model capabilities.
