DPO and Preference Optimization

DPO and Preference Optimization — Alignment Without RL

RLHF requires complex reinforcement learning pipelines. DPO (Direct Preference Optimization) aligns language models directly from preference data, eliminating the need for a reward model and PPO training.

  • Preference Data — Human comparisons of good vs bad outputs
  • Direct Optimization — Skip the reward model, optimize preferences directly
  • Simpler Training — No RL instability, just supervised learning

Why learn from rewards when you can learn from preferences?

RLHF (Reinforcement Learning from Human Feedback) has two main challenges: training a reward model and optimizing with PPO (which is unstable and expensive). DPO (Rafailov et al., 2023) provides a simpler alternative that directly optimizes the language model on preference data.

Definition: Direct Preference Optimization (DPO)

DPO is a method that directly optimizes a language model to satisfy human preferences without training a separate reward model. It reformulates the RLHF objective as a simple classification loss on preference pairs (preferred vs rejected responses).

The RLHF Problem

Standard RLHF Pipeline

1. Collect preference data: (prompt, chosen, rejected)
2. Train reward model: r(prompt, response)
3. Optimize with PPO: max E[r(prompt, y)] - beta * KL(pi || pi_ref)

RLHF Objective

$$R(y) = r(y) - \beta \cdot \log \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)}$$

Here,

  • $R(y)$ = Total reward
  • $r(y)$ = Reward model score
  • $\beta$ = KL penalty weight
  • $\pi(y|x)$ = Current policy
  • $\pi_{\text{ref}}(y|x)$ = Reference policy (SFT model)
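To make the penalty concrete, here is a minimal sketch in plain Python that evaluates the formula for a single response. The function name `kl_penalized_reward` and the example numbers are illustrative assumptions, not part of any library.

```python
def kl_penalized_reward(reward, policy_logp, ref_logp, beta=0.1):
    """Return r(y) - beta * log(pi(y|x) / pi_ref(y|x)) for one response.

    policy_logp and ref_logp are log-probabilities of the whole response
    under the current policy and the frozen reference model.
    """
    log_ratio = policy_logp - ref_logp
    return reward - beta * log_ratio


# Hypothetical numbers: the reward model scores the response 2.0, and the
# policy has become 4 nats more confident in it than the reference model.
total = kl_penalized_reward(reward=2.0, policy_logp=-10.0, ref_logp=-14.0, beta=0.1)
print(total)
```

With these numbers the penalty is 0.1 × 4 = 0.4, so the total reward is roughly 1.6. A larger beta would charge more for the same drift away from the reference model.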

The DPO Solution

DPO Loss Function

Definition: DPO Loss

The DPO loss trains the language model to raise the log-probability of preferred responses relative to rejected ones, with both measured as log-ratios against a frozen reference model. The reference model keeps the policy from drifting far from its starting point (the SFT model).

DPO Loss

$$L_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma \left( \beta \log \frac{\pi(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right) \right]$$

Here,

  • $y_w$ = Preferred (chosen) response
  • $y_l$ = Rejected response
  • $\pi$ = Current policy being optimized
  • $\pi_{\text{ref}}$ = Reference policy (frozen SFT model)
  • $\beta$ = KL penalty weight (controls deviation from reference)
  • $\sigma$ = Sigmoid function
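Each probability in the loss is the probability of a whole response. For an autoregressive model that is the product of per-token probabilities, so the log-probability is a sum over the response tokens. The sketch below computes that sum in plain Python from per-step logits. The helper names `log_softmax` and `sequence_log_prob` and the toy logits are illustrative assumptions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def sequence_log_prob(step_logits, token_ids):
    """Sum of log pi(y_t | x, y_<t) over the response tokens.

    step_logits[t] holds the vocabulary logits the model produced at step t,
    and token_ids[t] is the token that actually appears in the response.
    """
    if len(step_logits) != len(token_ids):
        raise ValueError("need one logit vector per response token")
    return sum(log_softmax(logits)[tok] for logits, tok in zip(step_logits, token_ids))


# Toy example: a 3-token vocabulary and a 2-token response.
toy_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(sequence_log_prob(toy_logits, [0, 2]))
```

In practice the same quantity is computed once under the policy and once under the reference model for both the chosen and the rejected response, giving the four numbers the loss needs per pair.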

Why DPO Works

DPO shows that the optimal policy under the RLHF objective can be expressed in closed form as a function of the reward. By substituting this into the preference likelihood, we get a loss that directly optimizes the policy without needing to estimate rewards.
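The argument can be written out in three steps, following the derivation in Rafailov et al. (2023). Here $Z(x)$ is the partition function that normalizes the optimal policy, the Bradley-Terry model is the preference model assumed in that paper, and $r(x, y)$ makes the reward's dependence on the prompt explicit.

```latex
% 1. Closed-form optimum of the KL-regularized objective
\pi^*(y|x) = \frac{1}{Z(x)} \, \pi_{\text{ref}}(y|x) \exp\left( \frac{1}{\beta} r(x, y) \right)

% 2. Solve for the reward in terms of the policy
r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)

% 3. Bradley-Terry preference likelihood: the Z(x) terms cancel
p(y_w \succ y_l \mid x) = \sigma\left( r(x, y_w) - r(x, y_l) \right)
  = \sigma\left( \beta \log \frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)}
               - \beta \log \frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right)
```

Maximizing the log-likelihood of the observed preferences under this model, with the trainable policy $\pi$ in place of $\pi^*$, gives the DPO loss. Because $Z(x)$ cancels in step 3, the intractable normalizer never has to be computed.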

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor, policy_rejected_logps: torch.Tensor,
             reference_chosen_logps: torch.Tensor, reference_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Compute the DPO loss for a batch of preference pairs.

    Each *_logps argument is a tensor of shape (batch,) holding the
    log-probability of a full response, summed over its tokens. The reference
    values come from the frozen model and should carry no gradient.
    """
    # Implicit rewards: beta-scaled log-ratios of policy to reference
    chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps)

    # Negative log-likelihood that the chosen response beats the rejected one
    loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

    return loss
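The behaviour of the loss is easy to check by hand for a single pair. The sketch below uses only the standard library and made-up log-probabilities. `dpo_pair_loss` is a hypothetical scalar helper for illustration, not a replacement for a batched tensor implementation.

```python
import math


def dpo_pair_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for a single preference pair, given sequence log-probabilities."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# At the start of training the policy equals the reference, so the margin is 0
# and the loss is log(2) regardless of beta.
start = dpo_pair_loss(-12.0, -11.0, -12.0, -11.0)

# Hypothetical later checkpoint: the chosen response gained 3 nats relative to
# the reference and the rejected response lost 2 nats.
later = dpo_pair_loss(-9.0, -13.0, -12.0, -11.0)

print(start, later)
```

Note that the reference model initially preferred the rejected response (-11 versus -12). The loss does not care about that absolute ordering, only about how the policy's log-ratios move relative to the reference.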

Preference Data Collection

How to Collect Preferences

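import time

class PreferenceDataCollector:
    """Sample candidate responses and turn human labels into preference pairs.

    Assumes a Hugging Face transformers-style causal LM and tokenizer.
    """

    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def collect_pair(self, prompt, num_samples=2):
        """Generate responses to one prompt for side-by-side comparison."""
        inputs = self.tokenizer(prompt, return_tensors="pt")
        prompt_len = inputs["input_ids"].shape[1]
        responses = []
        for _ in range(num_samples):
            output_ids = self.model.generate(
                **inputs,
                do_sample=True,  # sampling is needed for the responses to differ
                temperature=0.7,
                max_length=512
            )
            # Keep only the newly generated tokens, not the echoed prompt
            response = self.tokenizer.decode(output_ids[0][prompt_len:], skip_special_tokens=True)
            responses.append(response)

        return {
            "prompt": prompt,
            "responses": responses,
            "metadata": {"timestamp": time.time()}
        }

    def label_preference(self, prompt, response_a, response_b, human_label):
        """Convert a human label ("A" or "B") into a (chosen, rejected) record."""
        if human_label == "A":
            return {"prompt": prompt, "chosen": response_a, "rejected": response_b}
        elif human_label == "B":
            return {"prompt": prompt, "chosen": response_b, "rejected": response_a}
        else:
            return None  # Tie or unusable label: skip this pair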

Preference Data Quality

| Quality Factor | Impact | Best Practice |
|---|---|---|
| Inter-annotator agreement | High agreement = reliable signal | Use 3+ annotators, majority vote |
| Response diversity | More diversity = better learning | Sample with different temperatures |
| Prompt diversity | Broader coverage = better generalization | Mix task types and difficulty levels |
| Label noise | Noisy labels hurt alignment | Use confident labels only |
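The "3+ annotators, majority vote" and "confident labels only" practices can be combined in one small aggregation step. The sketch below is one possible way to do it. The function name `majority_label`, the "tie" vote value, and the two-thirds agreement threshold are all assumptions chosen for illustration.

```python
from collections import Counter


def majority_label(votes, min_agreement=2 / 3):
    """Aggregate annotator votes ("A", "B", or "tie") into one label.

    Returns "A" or "B" when that option wins at least `min_agreement` of the
    votes, otherwise None so the pair can be dropped as low-confidence.
    """
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    if label not in ("A", "B"):
        return None
    if count / len(votes) < min_agreement:
        return None
    return label


print(majority_label(["A", "A", "B"]))    # A
print(majority_label(["A", "B", "tie"]))  # None
```

Dropping pairs without a clear majority shrinks the dataset but removes the examples most likely to carry label noise. The right threshold depends on how many annotators are available per pair.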

DPO Variants

IPO (Identity Preference Optimization)

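Definition: IPO

IPO (Azar et al., 2023) addresses DPO's tendency to overfit to preference data. It replaces the log-sigmoid with a squared loss that pulls the log-ratio margin toward a fixed target of 1/(2β) instead of pushing it ever higher, which regularizes training.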

def ipo_loss(policy_chosen_logps, policy_rejected_logps, reference_chosen_logps, reference_rejected_logps, beta=0.1):
    """IPO loss: squared error between the log-ratio margin and 1 / (2 * beta).

    Inputs are tensors of per-sequence log-probabilities, shape (batch,).
    """
    chosen_logratios = policy_chosen_logps - reference_chosen_logps
    rejected_logratios = policy_rejected_logps - reference_rejected_logps

    loss = (chosen_logratios - rejected_logratios - 1 / (2 * beta)).pow(2).mean()
    return loss
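The difference between the two losses shows up when they are viewed as functions of the margin (chosen log-ratio minus rejected log-ratio). The plain-Python sketch below evaluates both for a few margins. `dpo_term` and `ipo_term` are hypothetical scalar helpers written for this comparison.

```python
import math


def dpo_term(margin, beta=0.1):
    """Per-pair DPO loss as a function of the log-ratio margin."""
    return math.log1p(math.exp(-beta * margin))


def ipo_term(margin, beta=0.1):
    """Per-pair IPO loss: squared distance of the margin from 1 / (2 * beta)."""
    return (margin - 1.0 / (2.0 * beta)) ** 2


# margin = (chosen log-ratio) - (rejected log-ratio), in nats
for margin in (0.0, 5.0, 50.0):
    print(margin, dpo_term(margin), ipo_term(margin))
```

With beta = 0.1 the IPO target is a margin of 5. The DPO term keeps shrinking as the margin grows from 5 to 50, so the optimizer is always rewarded for separating the pair further. The IPO term is smallest at 5 and grows again beyond it, which discourages that kind of runaway separation.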

KTO (Kahneman-Tversky Optimization)

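Definition: KTO

KTO (Ethayarajh et al., 2024) uses a loss inspired by prospect theory that is asymmetric around a reference point, better modeling how humans evaluate gains and losses. It needs only a binary desirable/undesirable label per response rather than paired comparisons.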
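As a rough illustration of the asymmetry, the sketch below computes a per-example loss in the general form used by the KTO paper: a sigmoid of the log-ratio measured from a reference point, with separate weights for desirable and undesirable examples. This is a simplified sketch under stated assumptions. In the paper the reference point is an estimate of the KL divergence between policy and reference, whereas here `z_ref` is just a number passed in, and the names `kto_example_loss`, `lambda_d`, and `lambda_u` are chosen for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def kto_example_loss(log_ratio, desirable, z_ref=0.0, beta=0.1,
                     lambda_d=1.0, lambda_u=1.0):
    """Per-example KTO-style loss for one labeled response.

    log_ratio is log pi(y|x) - log pi_ref(y|x); z_ref is the reference point
    (a KL estimate in the paper, a plain number in this sketch).
    """
    if desirable:
        value = lambda_d * sigmoid(beta * (log_ratio - z_ref))
        return lambda_d - value
    value = lambda_u * sigmoid(beta * (z_ref - log_ratio))
    return lambda_u - value


# Raising the log-ratio helps a desirable example and hurts an undesirable one.
print(kto_example_loss(5.0, desirable=True), kto_example_loss(5.0, desirable=False))
```

Setting `lambda_u` higher than `lambda_d` makes undesirable examples count for more than desirable ones, which is one way to encode loss aversion or to rebalance a dataset with unequal numbers of each label.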

ORPO (Odds Ratio Preference Optimization)

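Definition: ORPO

ORPO (Hong et al., 2024) combines SFT and preference optimization in a single training step by adding an odds-ratio penalty to the SFT loss, eliminating the need for a separate SFT phase. It also trains without a reference model.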
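A minimal sketch of the ORPO objective for one pair, assuming the form given in the paper: the SFT negative log-likelihood on the chosen response plus a weighted odds-ratio term. The inputs are length-normalized (average per-token) log-probabilities. The function names and the default weight `lam=0.1` are assumptions made for this example.

```python
import math


def log_odds(avg_logp):
    """log(p / (1 - p)) where p = exp(avg_logp) is a length-normalized probability."""
    # Requires avg_logp < 0 so that p < 1.
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_pair_loss(chosen_avg_logp, rejected_avg_logp, lam=0.1):
    """SFT loss on the chosen response plus a weighted odds-ratio penalty."""
    sft_loss = -chosen_avg_logp
    log_odds_ratio = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    # -log(sigmoid(log_odds_ratio)) written as log(1 + exp(-x))
    or_loss = math.log1p(math.exp(-log_odds_ratio))
    return sft_loss + lam * or_loss


# Same chosen likelihood, but the second case has a much less likely rejected
# response, so its odds-ratio penalty (and total loss) is smaller.
print(orpo_pair_loss(-0.5, -0.6), orpo_pair_loss(-0.5, -2.0))
```

Because both terms depend only on the policy's own probabilities, no frozen reference copy has to be kept in memory during training.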

DPO vs RLHF Comparison

| Feature | RLHF (PPO) | DPO |
|---|---|---|
| Reward model | Required | Not needed |
| Training stability | Unstable (PPO) | Stable (classification) |
| Hyperparameters | Many (PPO) | Few (mainly beta) |
| Compute cost | High | Low |
| Memory | High (4 models) | Low (2 models) |
| Theoretical foundation | RL theory | Classification theory |
| Performance | Slightly better | Slightly worse |

DPO is preferred for most use cases due to simplicity and stability. Use RLHF only when you need the absolute best performance and have the compute budget for PPO training.

Practice Exercises

  1. DPO Implementation: Implement DPO training for a small language model. How does the beta parameter affect the tradeoff between preference satisfaction and deviation from the reference model?

  2. Data Collection: Design a preference data collection pipeline. How many preference pairs are needed for good alignment?

  3. A/B Testing: Compare DPO vs SFT on a set of prompts. What types of prompts show the biggest improvement from alignment?

  4. Hyperparameter Analysis: How does beta affect the KL divergence from the reference model? Plot the tradeoff curve.

Key Takeaways

Summary: DPO and Preference Optimization

  • DPO directly optimizes language models on preference data
  • No reward model needed: DPO removes the reward-modeling and PPO stages of RLHF
  • Simpler and more stable than PPO-based RLHF
  • Preference data consists of (prompt, chosen, rejected) triples
  • Beta parameter controls the tradeoff between alignment and reference adherence
  • DPO variants (IPO, KTO, ORPO) address specific limitations
  • DPO is the default choice for alignment in most practical scenarios

What to Learn Next

-> RLHF and Alignment The original RLHF approach to alignment.

-> Constitutional AI Using AI feedback for alignment.

-> Fine-Tuning LLMs Customizing models for specific tasks.

-> Instruction Tuning Training models to follow instructions.

-> RLHF Alternatives Other approaches to alignment.

-> Alignment Tax and Capabilities How alignment affects model capabilities.
