RLHF Alignment
The RLHF Process
RLHF aligns language models with human preferences through three stages:
Stage 1: Supervised Fine-tuning (SFT)
- Collect human demonstrations of desired behavior
- Fine-tune the base model on these examples
- Establishes baseline response quality
Stage 2: Reward Model (RM) Training
- Generate multiple responses for prompts
- Humans rank responses by quality
- Train a model to predict human preferences
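One common way to turn a human ranking into reward-model training data is to expand it into pairwise comparisons. The sketch below assumes each ranking is a list ordered from best to worst; the helper name `ranking_to_pairs` and the `chosen`/`rejected` keys are illustrative, not part of any library.

```python
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Expand one best-to-worst ranking into (chosen, rejected) pairs."""
    # combinations() preserves input order, so the first element of each
    # pair is always the higher-ranked response.
    return [
        {"chosen": better, "rejected": worse}
        for better, worse in combinations(ranked_responses, 2)
    ]

# Hypothetical ranking of three responses to a single prompt
pairs = ranking_to_pairs(["response A", "response B", "response C"])
for pair in pairs:
    print(pair)
```

A ranking of K responses yields K(K-1)/2 pairs, so a handful of ranked responses per prompt already gives the reward model several comparisons to learn from.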
Stage 3: Reinforcement Learning (PPO)
- Use the reward model to score new outputs
- Optimize the language model using PPO
- Balance reward maximization against a KL penalty that keeps the policy close to the SFT model
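The trade-off in the last bullet can be made concrete. A common formulation, sketched here under assumptions, gives every generated token a penalty proportional to the log-probability gap between the policy and the frozen reference model, and adds the reward model's score on the final token. The function name and the example numbers are hypothetical.

```python
def kl_shaped_rewards(reward_score, policy_logprobs, ref_logprobs, kl_coef=0.2):
    """Per-token rewards: a KL penalty on every token, plus the RM score on the last one."""
    # Sample-based estimate of the per-token KL: log pi(token) - log pi_ref(token)
    kl_per_token = [p - r for p, r in zip(policy_logprobs, ref_logprobs)]
    rewards = [-kl_coef * kl for kl in kl_per_token]
    # The reward model scores the whole response once, so its score lands on the final token
    rewards[-1] += reward_score
    return rewards

# Made-up log-probabilities for a three-token response
rewards = kl_shaped_rewards(
    reward_score=1.0,
    policy_logprobs=[-1.0, -0.5, -2.0],
    ref_logprobs=[-1.5, -0.5, -1.0],
)
print(rewards)
```

Tokens the policy makes more likely than the reference does are penalized, so a larger `kl_coef` pulls the policy back toward the reference at the cost of slower reward gains.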
Implementation
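from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from transformers import AutoTokenizer
import torch

# Note: this uses the legacy TRL PPO interface (PPOConfig with model_name,
# PPOTrainer.generate/step); newer TRL releases changed this API.
def setup_rlhf(base_model_name):
    config = PPOConfig(
        model_name=base_model_name,
        learning_rate=1.41e-5,
        batch_size=64,        # samples collected per PPO step
        mini_batch_size=16,   # samples per gradient update
        ppo_epochs=4,         # optimization passes over each batch
        kl_penalty="kl",      # per-token KL estimate against the reference model
        init_kl_coef=0.2,     # initial weight of the KL penalty
    )
    model = AutoModelForCausalLMWithValueHead.from_pretrained(base_model_name)
    tokenizer = AutoTokenizer.from_pretrained(base_model_name)
    # Some tokenizers (e.g. GPT-2) ship without a pad token
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    # With no ref_model passed, the trainer keeps a frozen copy of the model as the KL reference
    trainer = PPOTrainer(config=config, model=model, tokenizer=tokenizer)
    return trainer, tokenizer
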
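def rlhf_training_step(trainer, queries, tokenizer, get_reward):
    # get_reward(query_text, response_text) -> float is supplied by the caller,
    # e.g. a wrapper around the trained reward model.
    # trainer.step expects len(queries) to equal config.batch_size.
    query_tensors = [
        tokenizer.encode(q, return_tensors="pt")[0]
        for q in queries
    ]
    # Generate responses (return_prompt=False keeps only the newly generated tokens)
    response_tensors = trainer.generate(
        query_tensors,
        return_prompt=False,
        max_new_tokens=64,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Get rewards from reward model: one scalar tensor per (query, response) pair
    responses = [tokenizer.decode(r, skip_special_tokens=True) for r in response_tensors]
    rewards = [torch.tensor(float(get_reward(q, r))) for q, r in zip(queries, responses)]
    # Run PPO step
    stats = trainer.step(query_tensors, response_tensors, rewards)
    return stats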
Reward Model Training
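class RewardModel(torch.nn.Module):
    def __init__(self, base_model):
        super().__init__()
        # base_model: a bare transformer (e.g. loaded with AutoModel) whose
        # output exposes last_hidden_state
        self.base_model = base_model
        self.reward_head = torch.nn.Linear(base_model.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.base_model(input_ids, attention_mask=attention_mask)
        hidden = outputs.last_hidden_state
        # Read the last real token of each sequence rather than the last position,
        # which is padding for shorter sequences (assumes right padding)
        last_idx = attention_mask.long().sum(dim=1) - 1
        batch_idx = torch.arange(hidden.size(0), device=hidden.device)
        last_hidden = hidden[batch_idx, last_idx]
        reward = self.reward_head(last_hidden).squeeze(-1)  # shape: (batch,)
        return reward
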
def train_reward_model(model, dataset, epochs=1):
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    model.train()
    for epoch in range(epochs):
        for batch in dataset:
            # Positive response (preferred)
            reward_pos = model(batch['chosen_ids'], batch['chosen_mask'])
            # Negative response (not preferred)
            reward_neg = model(batch['rejected_ids'], batch['rejected_mask'])
            # Pairwise ranking loss; logsigmoid is numerically safer than log(sigmoid(...))
            loss = -torch.nn.functional.logsigmoid(reward_pos - reward_neg).mean()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
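To build intuition for the ranking loss in `train_reward_model`, it helps to evaluate it on a few reward margins. The dependency-free sketch below reimplements the same quantity, -log(sigmoid(margin)), in plain Python; the function name is illustrative.

```python
import math

def pairwise_ranking_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(margin)), written as softplus(-margin) for numerical stability."""
    margin = reward_chosen - reward_rejected
    # log(1 + exp(-margin)), computed without overflow for large |margin|
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Wrong ordering, a tie, and correct ordering
for margin in (-2.0, 0.0, 2.0):
    print(f"margin={margin:+.1f}  loss={pairwise_ranking_loss(margin, 0.0):.4f}")
```

The loss equals ln 2 when the two responses receive the same reward, approaches zero as the preferred response pulls ahead, and grows roughly linearly when the ordering is wrong.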
Key Considerations
| Aspect | Consideration |
|---|---|
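| KL Penalty | Keeps the policy from drifting too far from the reference (SFT) model |
| Reward Hacking | The policy may exploit flaws in the reward model, scoring highly without actually improving quality |
| Scalability | PPO is computationally expensive: training typically holds a policy with a value head, a reference model, and a reward model in memory |
| Alternatives | DPO and other RLHF variants may be simpler |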
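As an illustration of why DPO can be simpler, its loss for a single preference pair needs only sequence log-probabilities from the policy and a frozen reference model, with no separate reward model or PPO loop. This is a minimal sketch of the published DPO objective in plain Python; the function name, argument names, and example values are assumptions made for illustration.

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp, ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probabilities."""
    # Implicit reward of a response: how much more likely the policy makes it
    # than the reference model does
    chosen_logratio = policy_chosen_lp - ref_chosen_lp
    rejected_logratio = policy_rejected_lp - ref_rejected_lp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)) in a numerically stable form
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# Policy identical to the reference: no preference has been learned yet
untrained = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the chosen response
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)
print(untrained, improved)
```

Here `beta` plays a role similar to the KL coefficient in PPO: larger values penalize deviation from the reference model more strongly.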
Summary
RLHF is a widely used technique for steering language models toward helpful, harmless, and honest behavior. It bridges the gap between raw model capabilities and human-aligned behavior.
Next: We'll explore Constitutional AI as an alternative alignment approach.