RLHF and Alignment


Alignment ensures that LLMs behave in accordance with human values and intentions. This tutorial covers the reinforcement learning from human feedback (RLHF) pipeline, reward modeling, PPO, DPO, and theoretical foundations.

Definition: Alignment

Alignment is the process of ensuring that an AI system's behavior matches human values, intentions, and preferences. For LLMs, alignment means producing helpful, harmless, and honest outputs that satisfy user intent.

The Alignment Pipeline

The standard alignment pipeline consists of three stages:

  1. Pre-training: Learn general language representations from large corpora
  2. Supervised Fine-Tuning (SFT): Fine-tune on high-quality instruction-response pairs
  3. RLHF/DPO: Align with human preferences using reward modeling or direct optimization

For a detailed treatment of reinforcement learning fundamentals, see our module on Reinforcement Learning.

Reward Modeling

What is a Reward Model?

A reward model is a neural network trained to predict human preferences. It scores each response to a prompt individually; comparing the scores of two responses to the same prompt predicts which one a human would prefer.

Definition: Reward Model

A reward model R(x, y) assigns a scalar score to a (prompt, response) pair, representing how well the response satisfies human preferences. It is trained on pairwise comparison data from human annotators.

Reward Model Training

Reward Model Loss

$$\mathcal{L}_{\text{RM}} = -\mathbb{E}_{(x, y_w, y_l)} \left[\log \sigma\left(R(x, y_w) - R(x, y_l)\right)\right]$$

Here,

  • $x$ = Prompt/input
  • $y_w$ = Preferred (winning) response
  • $y_l$ = Dispreferred (losing) response
  • $\sigma$ = Sigmoid function
  • $R$ = Reward model
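The loss above can be sketched in a few lines of plain Python. This is a minimal illustration, not a training implementation: the function names and the example scores are hypothetical, and a real reward model would produce the scores with a neural network and backpropagate through them.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def reward_model_loss(chosen_scores, rejected_scores):
    """Mean of -log sigmoid(R(x, y_w) - R(x, y_l)) over a batch of pairs."""
    losses = [
        -log_sigmoid(r_w - r_l)
        for r_w, r_l in zip(chosen_scores, rejected_scores)
    ]
    return sum(losses) / len(losses)


# Hypothetical reward scores for three preference pairs
chosen = [2.0, 0.5, 1.0]    # R(x, y_w)
rejected = [0.0, 1.5, 1.0]  # R(x, y_l)
loss = reward_model_loss(chosen, rejected)
print(f"pairwise loss: {loss:.4f}")
```

The loss is small when the preferred response scores well above the dispreferred one, equals log 2 when the two scores tie, and grows when the ranking is inverted.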

This is the Bradley-Terry model for pairwise comparisons:

Bradley-Terry Preference Model

$$P(y_w \succ y_l \mid x) = \sigma\left(R(x, y_w) - R(x, y_l)\right)$$

Here,

  • $P(y_w \succ y_l \mid x)$ = Probability that $y_w$ is preferred over $y_l$ given prompt $x$
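The preference model and the reward model loss are two views of the same thing: the loss is the negative log-likelihood of the observed human choices under the Bradley-Terry model. A short restatement with a worked number follows; the score gap of 1 is an illustrative value, not taken from any dataset.

```latex
\mathcal{L}_{\text{RM}}
  = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log P(y_w \succ y_l \mid x) \right]

% Worked example: if R(x, y_w) - R(x, y_l) = 1, then
P(y_w \succ y_l \mid x) = \sigma(1) = \frac{1}{1 + e^{-1}} \approx 0.731
```

Only the difference between the two scores matters, so adding the same constant to every reward leaves the predicted preferences unchanged.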

PPO: Proximal Policy Optimization

PPO is the standard RL algorithm used in RLHF to optimize the policy against the reward model.

PPO Objective

$$\mathcal{L}_{\text{PPO}} = \mathbb{E}_{t} \left[\min\left(\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)} \hat{A}_t,\ \text{clip}\left(\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}, 1-\epsilon, 1+\epsilon\right) \hat{A}_t\right)\right]$$

Here,

  • $\pi_\theta$ = Current policy
  • $\pi_{\theta_{\text{old}}}$ = Previous policy
  • $\hat{A}_t$ = Estimated advantage at time $t$
  • $\epsilon$ = Clip parameter (typically 0.2)
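To see what the clipping does, the term inside the expectation can be evaluated for a single action. This is a minimal sketch in plain Python; the function name and the example numbers are hypothetical, and the log-probabilities would normally come from the current and previous policies.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, epsilon=0.2):
    """Clipped PPO surrogate for one (state, action) sample."""
    # Probability ratio pi_theta(a|s) / pi_theta_old(a|s), from log-probs
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the minimum keeps the more pessimistic of the two estimates
    return min(ratio * advantage, clipped_ratio * advantage)


# Ratio 1.5 with a positive advantage: the gain is capped at (1 + epsilon) * A
print(clipped_surrogate(math.log(1.5), 0.0, advantage=1.0))   # ≈ 1.2
# Ratio 1.5 with a negative advantage: the unclipped penalty is kept
print(clipped_surrogate(math.log(1.5), 0.0, advantage=-1.0))  # ≈ -1.5
```

The asymmetry is the point of the design: the policy gets no extra credit for moving far from the previous policy in a good direction, but it is still fully penalized for moving far in a bad one.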

RLHF Objective with KL Penalty

$$\mathcal{J}_{\text{RLHF}}(\theta) = \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y|x)} \left[R(x, y) - \beta \cdot D_{\text{KL}}\left(\pi_\theta(y|x) \,\|\, \pi_{\text{ref}}(y|x)\right)\right]$$

The policy is trained to maximize this objective. The KL penalty, weighted by $\beta$, keeps the policy from diverging too far from the reference model (typically the SFT model), which limits reward hacking and mode collapse.
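In practice the KL term is usually estimated from the sampled response itself. The sketch below assumes a single-sample estimate, the sum over tokens of the log-probability gap between policy and reference; the function name, the value of `beta`, and the numbers are hypothetical.

```python
def kl_penalized_reward(reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Return R(x, y) - beta * KL_estimate for one sampled response.

    policy_logprobs / ref_logprobs: per-token log-probabilities of the
    sampled response under the policy and the reference model.
    """
    # For y sampled from the policy, log pi_theta(y|x) - log pi_ref(y|x)
    # is a single-sample estimate of the KL divergence.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate


# Hypothetical two-token response that the policy likes more than the reference
shaped = kl_penalized_reward(
    reward=1.0,
    policy_logprobs=[-0.5, -1.0],
    ref_logprobs=[-1.0, -1.5],
    beta=0.2,
)
print(shaped)  # the raw reward of 1.0 reduced by 0.2 * 1.0
```

A response that the policy finds much more likely than the reference does pays a larger penalty, so a high reward-model score alone is not enough to justify drifting away from the reference.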

Reward Hacking

Theorem: Reward Hacking

Reward hacking occurs when the policy learns to exploit weaknesses in the reward model to obtain high rewards without actually satisfying human preferences. Formally, the policy finds $y^* = \arg\max_y R(x, y)$ such that $R(x, y^*) \gg R(x, y_{\text{human}})$, where $y_{\text{human}}$ is a response humans would endorse, even though humans do not actually prefer $y^*$.

To mitigate reward hacking: (1) use a larger reward model, (2) train on diverse preference data, (3) apply KL constraints, (4) use an ensemble of reward models, and (5) incorporate Constitutional AI principles.
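One way an ensemble can help is by discounting responses on which the reward models disagree. The sketch below assumes a mean-minus-spread aggregation, which is only one of several possible choices; the function name, the `penalty` weight, and the scores are hypothetical.

```python
import statistics


def conservative_reward(scores, penalty=1.0):
    """Aggregate ensemble scores, discounting by their disagreement."""
    mean = statistics.fmean(scores)
    spread = statistics.pstdev(scores)  # population standard deviation
    return mean - penalty * spread


# Hypothetical scores from three reward models for two responses
agreed = conservative_reward([1.0, 1.0, 1.0])    # all models agree
disputed = conservative_reward([3.0, 0.0, 0.0])  # one model is an outlier
print(agreed, disputed)
```

Both responses have the same mean score, but the disputed one is ranked lower because its high score comes from a single model, which is a typical signature of an exploited weakness.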

DPO: Direct Preference Optimization

DPO (Rafailov et al., 2023) bypasses reward modeling and PPO, directly optimizing the policy on preference data.

$$\mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l)} \left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]$$
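For a single preference pair, the DPO loss needs only four sequence log-probabilities. The following is a minimal sketch in plain Python; the function and argument names are hypothetical, and in a real setup the log-probabilities would be summed over response tokens from the policy and the frozen reference model.

```python
import math


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) triple."""
    # Log-ratios log(pi_theta / pi_ref) for the chosen and rejected responses
    logratio_w = policy_logp_w - ref_logp_w
    logratio_l = policy_logp_l - ref_logp_l
    margin = beta * (logratio_w - logratio_l)
    # -log sigmoid(margin), written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is 0 and the loss is log 2
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Raising the chosen response's likelihood relative to the reference lowers the loss
improved = dpo_loss(-8.0, -12.0, -10.0, -12.0)
print(start, improved)
```

The loss depends only on how the policy's log-ratios differ between the two responses, which is why no separate reward model appears anywhere in the computation.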

DPO vs RLHF Comparison

| Aspect | RLHF (PPO) | DPO |
| --- | --- | --- |
| Reward model | Required | Not needed |
| Training stability | Unstable (RL) | Stable (supervised) |
| Compute cost | High (4 models) | Low (2 models) |
| Sample efficiency | Low | High |
| Performance | Strong | Competitive |
| Implementation | Complex | Simple |

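DPO's key insight: the optimal policy under the KL-regularized RLHF objective has a closed form in terms of the reward. Inverting that relationship expresses the reward through the policy itself, so the preference loss can be optimized directly without training a separate reward model.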
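The key steps behind this insight can be sketched as follows. The partition function $Z(x)$ is notation introduced here for the normalizer; it does not appear elsewhere in this lesson.

```latex
% 1. Optimal policy of the KL-regularized objective
\pi^*(y|x) = \frac{1}{Z(x)} \, \pi_{\text{ref}}(y|x) \exp\left(\frac{R(x, y)}{\beta}\right),
\qquad
Z(x) = \sum_{y} \pi_{\text{ref}}(y|x) \exp\left(\frac{R(x, y)}{\beta}\right)

% 2. Solve for the reward
R(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)

% 3. The intractable Z(x) cancels in the Bradley-Terry difference
R(x, y_w) - R(x, y_l)
  = \beta \log \frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)}
  - \beta \log \frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)}
```

Substituting step 3 into the reward model loss, with the trainable policy $\pi_\theta$ in place of $\pi^*$, gives the DPO loss.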

Other Alignment Methods

Constitutional AI (CAI)

Definition: Constitutional AI

Constitutional AI uses a set of principles (constitution) to guide the model's self-improvement. The model critiques and revises its own outputs based on the principles, then trains on the improved data.
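The critique-and-revise loop can be sketched as below. This is an illustration under stated assumptions, not the published method's exact procedure: `llm` is a hypothetical callable that maps a text prompt to a text completion, the prompt wording is invented, and a stub stands in for a real model so the example runs on its own.

```python
from typing import Callable, Dict, List


def critique_and_revise(
    prompt: str,
    draft: str,
    principles: List[str],
    llm: Callable[[str], str],
) -> Dict[str, str]:
    """Revise a draft once per principle and return a training example."""
    response = draft
    for principle in principles:
        critique = llm(
            f"Principle: {principle}\nPrompt: {prompt}\nResponse: {response}\n"
            "Critique the response against the principle."
        )
        response = llm(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return {"prompt": prompt, "response": response}


# Stub model (hypothetical): records each call and returns a placeholder
calls = []


def stub_llm(text: str) -> str:
    calls.append(text)
    return f"output {len(calls)}"


example = critique_and_revise(
    "Explain photosynthesis.",
    "Draft answer.",
    ["Be harmless.", "Be helpful."],
    stub_llm,
)
print(example)
```

Each principle triggers one critique call and one revision call, and the final revised response is what would be added to the fine-tuning data.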

RLAIF: AI Feedback

Definition: RLAIF

RLAIF replaces human annotators with an AI system for generating preference data. An LLM, often a larger and more capable one than the model being trained, judges which response is better, reducing the cost and scaling limitations of human annotation.
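A minimal sketch of AI-generated preference labels follows. The `judge` callable is hypothetical: it is assumed to return "A" or "B" for the better response, and a toy length-based rule stands in for a real judge model so the example is runnable.

```python
from typing import Callable, Dict


def label_with_ai_feedback(
    prompt: str,
    response_a: str,
    response_b: str,
    judge: Callable[[str, str, str], str],
) -> Dict[str, str]:
    """Turn a judge verdict into a (prompt, chosen, rejected) record."""
    verdict = judge(prompt, response_a, response_b)
    if verdict == "A":
        chosen, rejected = response_a, response_b
    elif verdict == "B":
        chosen, rejected = response_b, response_a
    else:
        raise ValueError(f"unexpected verdict: {verdict!r}")
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def toy_judge(prompt: str, a: str, b: str) -> str:
    # Placeholder rule for illustration only: prefer the longer response
    return "A" if len(a) >= len(b) else "B"


pair = label_with_ai_feedback(
    "What is RLHF?",
    "RL.",
    "Reinforcement learning from human feedback.",
    toy_judge,
)
print(pair["chosen"])
```

The resulting records have the same shape as human-labeled preference pairs, so they can feed either reward model training or DPO.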

Practical Implementation

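```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# NOTE: this uses the classic PPOTrainer.generate/step interface from older
# TRL releases; newer releases changed the PPO API.

# Configure PPO
config = PPOConfig(
    learning_rate=1.41e-5,
    batch_size=64,
    mini_batch_size=16,
    ppo_epochs=4,
    kl_penalty="kl",
    init_kl_coef=0.2,
    target_kl=6.0,
)

# Load models: the trainable policy and a frozen reference copy of the SFT model
model = AutoModelForCausalLMWithValueHead.from_pretrained("sft-model")
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("sft-model")
tokenizer = AutoTokenizer.from_pretrained("sft-model")

# Create trainer
ppo_trainer = PPOTrainer(
    config=config,
    model=model,
    ref_model=ref_model,
    tokenizer=tokenizer,
)

# Training loop. `dataloader` and `reward_model` are assumed to be defined
# elsewhere; `reward_model(query, response)` is assumed to return a float.
for batch in dataloader:
    query_tensors = [tokenizer.encode(q, return_tensors="pt")[0] for q in batch["query"]]

    # Sample responses from the current policy, without the prompt tokens
    response_tensors = ppo_trainer.generate(
        query_tensors, return_prompt=False, max_new_tokens=256
    )
    responses = tokenizer.batch_decode(response_tensors, skip_special_tokens=True)

    # Score each (query, response) pair; step() expects a list of scalar tensors
    rewards = [torch.tensor(reward_model(q, r)) for q, r in zip(batch["query"], responses)]

    # Run one PPO optimization step on this batch
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
```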

DPO Training

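```python
from trl import DPOConfig, DPOTrainer

# `model`, `ref_model`, `tokenizer`, and `preference_dataset` are assumed to be
# defined elsewhere. For DPO, `model` and `ref_model` should be plain causal
# language models, not the value-head models used for PPO.
dpo_config = DPOConfig(
    beta=0.1,
    learning_rate=5e-7,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    max_length=1024,
    max_prompt_length=512,
)

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    tokenizer=tokenizer,  # recent TRL releases name this argument `processing_class`
    train_dataset=preference_dataset,
    args=dpo_config,
)

trainer.train()
```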
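The `preference_dataset` passed to the trainer is expected to hold one record per comparison with `prompt`, `chosen`, and `rejected` text fields. The sketch below builds and checks such records in plain Python; the example texts and the `validate_record` helper are hypothetical, and the records would still need to be converted to a dataset object (for instance with `datasets.Dataset.from_list`) before training.

```python
REQUIRED_KEYS = {"prompt", "chosen", "rejected"}

# Hypothetical preference records for illustration
preference_records = [
    {
        "prompt": "Explain what a reward model does.",
        "chosen": "It assigns a scalar score to a (prompt, response) pair.",
        "rejected": "It is a model.",
    },
    {
        "prompt": "What does the KL penalty do in RLHF?",
        "chosen": "It keeps the policy close to the reference model.",
        "rejected": "It makes training faster.",
    },
]


def validate_record(record: dict) -> None:
    """Raise ValueError if a record is missing fields or is degenerate."""
    missing = REQUIRED_KEYS - set(record)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if record["chosen"] == record["rejected"]:
        raise ValueError("chosen and rejected responses are identical")


for record in preference_records:
    validate_record(record)
print(f"{len(preference_records)} valid preference records")
```

Checking for identical chosen and rejected responses is worthwhile because such pairs carry no preference signal and contribute a constant log 2 to the loss.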

Practice Exercises

  1. Mathematical: Derive the DPO loss from the RLHF objective. Show that the optimal policy can be expressed in closed form.
  2. Implementation: Train a reward model on the Anthropic HH dataset. Evaluate its accuracy on held-out preference pairs.
  3. Comparison: Compare PPO and DPO on the same task. Which is more stable? Which achieves better final performance?
  4. Research: Investigate reward hacking in RLHF. Design a simple experiment that demonstrates the phenomenon.

Key Takeaways:

  • Alignment ensures LLMs behave in accordance with human values
  • RLHF uses reward modeling + PPO to optimize against human preferences
  • DPO directly optimizes the policy on preference data, bypassing reward modeling
  • KL penalties limit reward hacking and mode collapse by keeping the policy close to the reference model
  • Reward hacking is a fundamental challenge in RLHF
  • Constitutional AI and RLAIF reduce dependence on human annotation
  • DPO is simpler to implement and generally more stable to train than PPO
