Constitutional AI
Constitutional AI (CAI) is a method developed by Anthropic for aligning language models to human values without relying on extensive human feedback. It uses a set of principles (a "constitution") to guide the model's self-critique and revision process, making alignment more scalable and transparent.
An alignment framework where an AI system is trained to follow a set of explicit principles (a constitution) through a two-phase process: (1) supervised learning from self-critique and revision, and (2) reinforcement learning from AI feedback (RLAIF) rather than human feedback.
Motivation
Traditional RLHF has scalability limitations:
- Human feedback is expensive and slow to collect
- Human preferences can be inconsistent
- Red-teaming requires human effort
- Safety guidelines are difficult to codify in reward models
CAI addresses these by using AI itself to provide feedback based on explicit principles.
The Constitution
A constitution is a set of natural language principles that guide model behavior. Example principles:
1. Choose the response that is least likely to be considered harmful.
2. Choose the response that is most helpful and harmless.
3. Choose the response that is most ethical and least likely to cause harm.
4. Choose the response that is most aligned with the values of a helpful assistant.
Phase 1: Supervised Learning from Self-Critique (SL-CAI)
The SL-CAI phase generates training data through a self-critique loop:
Step 1: Generate initial responses
- Sample a prompt from the training data
- Generate an initial response from the base model
Step 2: Self-critique
- Ask the model to critique its own response against the constitution
- "Identify specific ways in which the response might violate the principle: [principle]"
Step 3: Revision
- Ask the model to revise its response based on the critique
- "Please rewrite the response to address the issues identified above"
Step 4: Collect revised responses
- Use the revised responses as supervised training data
SL-CAI Loss Function
Here,
- =
- =
- =
- =
The model is fine-tuned on the revised responses using standard supervised learning.
Phase 2: Reinforcement Learning from AI Feedback (RLAIF)
In the RLAIF phase, AI-generated preferences replace human preferences:
Step 1: Generate response pairs
- For each prompt, generate two candidate responses
Step 2: AI preference labeling
- Ask the model (or a separate model) to choose which response is better according to the constitution
- "Considering the following principles: [constitution], which response is better?"
Step 3: Train reward model
- Train a reward model on the AI-generated preferences
Step 4: PPO optimization
- Use PPO to optimize the language model against the learned reward
RL-CAI Objective
Here,
- =
- =
- =
- =
- =
RLAIF vs RLHF
RLAIF achieves comparable or superior alignment performance to RLHF while requiring zero human feedback labels. The AI critic provides consistent, scalable preference signals that can be aligned with explicit principles.
| Aspect | RLHF | RLAIF |
|---|---|---|
| Feedback source | Human annotators | AI model |
| Cost | High ($15-25/hour per annotator) | Low (compute only) |
| Consistency | Variable inter-annotator agreement | Consistent within model |
| Scalability | Limited by annotator pool | Virtually unlimited |
| Transparency | Implicit preferences | Explicit constitutional principles |
| Bias | Human biases | Model biases |
Implementation Example
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from dataclasses import dataclass
from typing import List, Dict
@dataclass
class ConstitutionalPrinciple:
name: str
critique_prompt: str
revision_prompt: str
class ConstitutionalAI:
def __init__(self, model_name: str, principles: List[ConstitutionalPrinciple]):
self.model = AutoModelForCausalLM.from_pretrained(model_name)
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.principles = principles
self.device = next(self.model.parameters()).device
def generate_response(self, prompt: str, max_new_tokens: int = 256) -> str:
inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
with torch.no_grad():
outputs = self.model.generate(
**inputs,
max_new_tokens=max_new_tokens,
temperature=0.7,
do_sample=True
)
return self.tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
def critique(self, prompt: str, response: str, principle: ConstitutionalPrinciple) -> str:
critique_prompt = f"""Human: {prompt}
Assistant: {response}
{principle.critique_prompt}
Human: Please provide your critique:"""
return self.generate_response(critique_prompt)
def revise(self, prompt: str, response: str, critique: str, principle: ConstitutionalPrinciple) -> str:
revision_prompt = f"""Human: {prompt}
Assistant: {response}
Critique: {critique}
{principle.revision_prompt}
Human: Please provide the revised response:"""
return self.generate_response(revision_prompt)
def self_critique_loop(self, prompt: str, max_iterations: int = 3) -> str:
response = self.generate_response(prompt)
for i in range(max_iterations):
principle = self.principles[i % len(self.principles)]
critique = self.critique(prompt, response, principle)
if "no issues" in critique.lower() or "response is appropriate" in critique.lower():
break
response = self.revise(prompt, response, critique, principle)
return response
def generate_preference_pair(self, prompt: str) -> Dict:
response_a = self.generate_response(prompt)
response_b = self.generate_response(prompt)
preference_prompt = f"""Consider the following constitutional principles:
{[p.name for p in self.principles]}
Human: {prompt}
Response A: {response_a}
Response B: {response_b}
Which response better aligns with the principles? Respond with "A" or "B" and explain why:"""
preference = self.generate_response(preference_prompt)
return {
"prompt": prompt,
"response_a": response_a,
"response_b": response_b,
"preference": preference
}
principles = [
ConstitutionalPrinciple(
name="Harmlessness",
critique_prompt="Identify any ways this response might be harmful, dangerous, or unethical.",
revision_prompt="Please revise the response to remove any harmful content while maintaining helpfulness."
),
ConstitutionalPrinciple(
name="Helpfulness",
critique_prompt="Identify any ways this response could be more helpful, accurate, or informative.",
revision_prompt="Please revise the response to be more helpful and informative."
),
ConstitutionalPrinciple(
name="Honesty",
critique_prompt="Identify any inaccuracies, misleading information, or uncertain claims.",
revision_prompt="Please revise the response to be more accurate and honest about uncertainties."
)
]
cai = ConstitutionalAI("gpt2", principles)
revised = cai.self_critique_loop("How do I pick a lock?")
Red-Teaming with CAI
Constitutional AI enables systematic red-teaming by defining adversarial principles:
red_team_principles = [
ConstitutionalPrinciple(
name="Robustness to jailbreaks",
critique_prompt="Does this response resist common jailbreak attempts? Identify any vulnerabilities.",
revision_prompt="Revise the response to be robust against adversarial inputs."
),
ConstitutionalPrinciple(
name="Refusal of harmful requests",
critique_prompt="Does this response appropriately refuse harmful requests?",
revision_prompt="Revise to clearly decline the harmful request while being polite."
)
]
Comparison with RLHF
CAI was introduced by Bai et al. (2022) in "Constitutional AI: Harmlessness from AI Feedback." The key insight is that explicit principles make alignment more transparent, reproducible, and scalable compared to implicit human preferences.
The relationship between CAI and RLHF:
- SL-CAI replaces SFT on human-written demonstrations
- RLAIF replaces RLHF with AI-generated preferences
- Constitution replaces implicit human preferences with explicit principles
- Self-critique replaces human red-teaming
When implementing CAI, the quality of the constitution is critical. Start with clear, specific principles and iterate based on observed failure modes. The AI's ability to self-critique improves with model capability—larger models produce better critiques.
Practical Implementation Considerations
Model Selection:
- Self-critique requires a capable base model (typically 7B+ parameters)
- The AI critic can be the same model or a larger, separate model
- Larger models produce more nuanced critiques
Constitution Design:
- Start with broad principles, then add specificity
- Include both "do" and "don't" principles
- Test the constitution against known failure modes
- Version control the constitution like code
Training Protocol:
- Alternate between SL-CAI and RL-CAI phases
- Monitor for reward hacking in the RLAIF phase
- Use KL penalties to prevent deviation from base capabilities
Evaluation:
- Compare against RLHF-trained models on safety benchmarks
- Test with red-teaming to identify remaining vulnerabilities
- Measure helpfulness-harmlessness tradeoffs
Summary
- Constitutional AI replaces human feedback with AI feedback guided by explicit principles
- The SL-CAI phase generates training data through self-critique and revision loops
- The RLAIF phase trains reward models on AI-generated preferences
- CAI is more scalable, transparent, and reproducible than RLHF
- The constitution defines alignment principles in natural language
- Implementation requires careful constitution design and iteration
Practice Exercises
-
Constitution Design: Write a constitution with 5 principles for a customer service chatbot. Test how different principles affect the model's behavior.
-
Self-Critique Loop: Implement a 3-iteration self-critique loop. How does the response quality change across iterations?
-
RLAIF Data Generation: Generate 100 preference pairs using AI feedback. Compare the agreement rate with human preferences on the same examples.
-
Comparison Study: Train two models—one with CAI and one with standard SFT. Evaluate both on helpfulness and harmlessness benchmarks.
-
Adversarial Testing: Test your CAI-trained model against common jailbreak prompts. Identify which constitutional principles are most effective at preventing misuse.