Constitutional AI

Constitutional AI (CAI) is a method developed by Anthropic for aligning language models to human values without relying on extensive human feedback. It uses a set of principles (a "constitution") to guide the model's self-critique and revision process, making alignment more scalable and transparent.

Definition: Constitutional AI is an alignment framework in which an AI system is trained to follow a set of explicit principles (a constitution) in two phases: (1) supervised learning from self-critique and revision, and (2) reinforcement learning from AI feedback (RLAIF) rather than human feedback.

Motivation

Traditional RLHF has scalability limitations:

  • Human feedback is expensive and slow to collect
  • Human preferences can be inconsistent
  • Red-teaming requires human effort
  • Safety guidelines are difficult to codify in reward models

CAI addresses these by using AI itself to provide feedback based on explicit principles.

The Constitution

A constitution is a set of natural language principles that guide model behavior. Example principles:

1. Choose the response that is least likely to be considered harmful.
2. Choose the response that is most helpful and harmless.
3. Choose the response that is most ethical and least likely to cause harm.
4. Choose the response that is most aligned with the values of a helpful assistant.
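Because the constitution is just a list of natural-language strings, it can be kept as plain data and slotted into prompts. The sketch below stores the four example principles and draws one at random to build a critique instruction. The names `CONSTITUTION`, `CRITIQUE_TEMPLATE`, and `sample_critique_request` are hypothetical, and random selection is an assumption made for illustration; the text above does not prescribe how a principle is chosen.

```python
import random
from typing import List

# The four example principles listed above, stored as plain strings.
CONSTITUTION: List[str] = [
    "Choose the response that is least likely to be considered harmful.",
    "Choose the response that is most helpful and harmless.",
    "Choose the response that is most ethical and least likely to cause harm.",
    "Choose the response that is most aligned with the values of a helpful assistant.",
]

# Hypothetical template; the wording mirrors the self-critique prompt in this lesson.
CRITIQUE_TEMPLATE = (
    "Identify specific ways in which the response might violate the principle: {principle}"
)


def sample_critique_request(constitution: List[str], rng: random.Random) -> str:
    """Pick one principle at random and turn it into a critique instruction."""
    principle = rng.choice(constitution)
    return CRITIQUE_TEMPLATE.format(principle=principle)


rng = random.Random(0)  # fixed seed so the draw is reproducible
request = sample_critique_request(CONSTITUTION, rng)
print(request)
```

Keeping the principles separate from the prompt template means either can be changed without touching the other.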

Phase 1: Supervised Learning from Self-Critique (SL-CAI)

The SL-CAI phase generates training data through a self-critique loop:

Step 1: Generate initial responses

  • Sample a prompt from the training data
  • Generate an initial response from the base model

Step 2: Self-critique

  • Ask the model to critique its own response against the constitution
  • "Identify specific ways in which the response might violate the principle: [principle]"

Step 3: Revision

  • Ask the model to revise its response based on the critique
  • "Please rewrite the response to address the issues identified above"

Step 4: Collect revised responses

  • Use the revised responses as supervised training data

SL-CAI Loss Function

$$\mathcal{L}_{\text{SL-CAI}} = -\sum_{i=1}^{N} \log P_\theta(\mathbf{y}_i^{\text{revised}} \mid \mathbf{x}_i)$$

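Here,

  • $\theta$ = parameters of the model being fine-tuned
  • $\mathbf{x}_i$ = the $i$-th prompt
  • $\mathbf{y}_i^{\text{revised}}$ = the revised response to $\mathbf{x}_i$ produced by the critique-and-revision loop
  • $N$ = number of (prompt, revised response) training pairs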

The model is fine-tuned on the revised responses using standard supervised learning.
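To make the SL-CAI loss concrete, here is a minimal pure-Python sketch that evaluates it from per-token log-probabilities. It assumes the usual autoregressive factorization, where $\log P_\theta(\mathbf{y} \mid \mathbf{x})$ is the sum of the log-probabilities of the response tokens, and it masks out prompt tokens so that only the revised response contributes. The function name `sl_cai_loss` and the toy numbers are hypothetical, not output from a real model.

```python
import math
from typing import Sequence


def sl_cai_loss(token_log_probs: Sequence[Sequence[float]],
                response_mask: Sequence[Sequence[int]]) -> float:
    """Sum of negative log-likelihoods over revised-response tokens.

    token_log_probs[i][t] is the log-probability the model assigns to token t
    of example i given the preceding tokens. response_mask[i][t] is 1 for
    tokens of the revised response and 0 for prompt tokens, which are
    conditioned on but not predicted.
    """
    total = 0.0
    for log_probs, mask in zip(token_log_probs, response_mask):
        # log P(y_i | x_i) is the sum of per-token log-probabilities of y_i.
        total += sum(lp for lp, m in zip(log_probs, mask) if m)
    return -total


# Two toy examples with made-up token probabilities; the first token of each
# sequence belongs to the prompt and is therefore masked out.
log_probs = [
    [math.log(0.9), math.log(0.5), math.log(0.25)],
    [math.log(0.8), math.log(0.5)],
]
mask = [[0, 1, 1], [0, 1]]
loss = sl_cai_loss(log_probs, mask)
print(loss)
```

In a real training loop the same quantity is normally obtained from a framework's cross-entropy loss with the prompt positions excluded.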

Phase 2: Reinforcement Learning from AI Feedback (RLAIF)

In the RLAIF phase, AI-generated preferences replace human preferences:

Step 1: Generate response pairs

  • For each prompt, generate two candidate responses

Step 2: AI preference labeling

  • Ask the model (or a separate model) to choose which response is better according to the constitution
  • "Considering the following principles: [constitution], which response is better?"

Step 3: Train reward model

  • Train a reward model on the AI-generated preferences

Step 4: PPO optimization

  • Use PPO to optimize the language model against the learned reward
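Step 3 does not say which loss the reward model is trained with. A common choice for preference data, assumed here, is the pairwise logistic (Bradley-Terry) loss, which pushes the reward of the preferred response above the reward of the rejected one. The sketch below computes that loss from scalar rewards; `pairwise_preference_loss` and the example scores are hypothetical.

```python
import math
from typing import List, Tuple


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def pairwise_preference_loss(pairs: List[Tuple[float, float]]) -> float:
    """Mean of -log sigmoid(r_chosen - r_rejected) over preference pairs.

    Each pair holds the reward model's scalar scores for the response the
    AI labeler preferred and for the one it rejected.
    """
    total = 0.0
    for r_chosen, r_rejected in pairs:
        # -log(sigmoid(m)) equals softplus(-m) for margin m.
        total += softplus(-(r_chosen - r_rejected))
    return total / len(pairs)


# Made-up scores: the first reward model ranks the pairs correctly, the second does not.
good_loss = pairwise_preference_loss([(2.0, 0.0), (1.5, -0.5)])
bad_loss = pairwise_preference_loss([(0.0, 2.0), (-0.5, 1.5)])
print(good_loss, bad_loss)
```

The loss is small when the preferred response already scores higher and grows roughly linearly with the margin when the ranking is wrong.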

RL-CAI Objective

$$\max_{\pi_\theta} \mathbb{E}_{x \sim D,\, y \sim \pi_\theta(\cdot \mid x)} \left[ R_\phi(x, y) - \beta \, \text{KL}\left(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\right) \right]$$

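Here,

  • $\pi_\theta$ = the policy (language model) being optimized
  • $\pi_{\text{ref}}$ = the frozen reference policy, typically the SL-CAI model
  • $R_\phi(x, y)$ = the reward model with parameters $\phi$, trained on AI-generated preferences
  • $\beta$ = the KL penalty coefficient
  • $D$ = the distribution of training prompts

The two phases can be summarized as a single combined loss:

$$\mathcal{L}_{\text{CAI}} = \mathcal{L}_{\text{SL-CAI}} + \alpha \cdot \mathcal{L}_{\text{RL-CAI}}$$

where $\mathcal{L}_{\text{RL-CAI}}$ is read as the negative of the RL-CAI objective above and $\alpha$ weights the RL term. In Bai et al. (2022) the phases are run one after the other, so this expression is best treated as a summary rather than a jointly optimized loss.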
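The bracketed term in the RL-CAI objective can be estimated for one sampled response. The sketch below assumes the common single-sample estimate of the KL term, $\log \pi_\theta(y \mid x) - \log \pi_{\text{ref}}(y \mid x)$, computed from per-token log-probabilities. The function name `kl_penalized_reward` and all numbers are hypothetical.

```python
from typing import Sequence


def kl_penalized_reward(reward: float,
                        logp_policy: Sequence[float],
                        logp_ref: Sequence[float],
                        beta: float) -> float:
    """Reward minus beta times a single-sample KL estimate.

    logp_policy[t] and logp_ref[t] are the log-probabilities that the current
    policy and the frozen reference policy assign to token t of the same
    sampled response.
    """
    if len(logp_policy) != len(logp_ref):
        raise ValueError("both log-probability sequences must cover the same tokens")
    # Summing token log-ratios gives log pi_theta(y|x) - log pi_ref(y|x).
    log_ratio = sum(logp_policy) - sum(logp_ref)
    return reward - beta * log_ratio


# Made-up values: the policy assigns the response a higher probability than
# the reference does, so the penalty lowers the effective reward.
shaped = kl_penalized_reward(
    reward=1.0,
    logp_policy=[-1.0, -2.0],
    logp_ref=[-1.5, -2.5],
    beta=0.1,
)
print(shaped)
```

A larger `beta` keeps the policy closer to the reference model at the cost of slower reward improvement.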

RLAIF vs RLHF

Bai et al. (2022) report that models trained with RLAIF were rated less harmful than RLHF baselines at comparable helpfulness, without using any human labels for harmlessness (human labels were still used for helpfulness). The AI critic provides consistent, scalable preference signals that are tied to explicit principles.

| Aspect | RLHF | RLAIF |
| --- | --- | --- |
| Feedback source | Human annotators | AI model |
| Cost | High ($15-25/hour per annotator) | Low (compute only) |
| Consistency | Variable inter-annotator agreement | Consistent within model |
| Scalability | Limited by annotator pool | Virtually unlimited |
| Transparency | Implicit preferences | Explicit constitutional principles |
| Bias | Human biases | Model biases |

Implementation Example

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from dataclasses import dataclass
from typing import List, Dict

@dataclass
class ConstitutionalPrinciple:
    name: str
    critique_prompt: str
    revision_prompt: str

class ConstitutionalAI:
    def __init__(self, model_name: str, principles: List[ConstitutionalPrinciple]):
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.principles = principles
        self.device = next(self.model.parameters()).device
    
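    def generate_response(self, prompt: str, max_new_tokens: int = 256) -> str:
        """Sample a continuation of `prompt` and return only the newly generated text."""
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=0.7,
                do_sample=True,
                # GPT-2-style tokenizers define no pad token; reuse EOS to avoid a warning.
                pad_token_id=self.tokenizer.eos_token_id,
            )
        # Slice off the prompt tokens so only the continuation is decoded.
        prompt_length = inputs["input_ids"].shape[1]
        return self.tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)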
    
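    def critique(self, prompt: str, response: str, principle: ConstitutionalPrinciple) -> str:
        """Ask the model to critique `response` against one principle."""
        # End the prompt with a "Critique:" cue so a causal LM continues with the critique itself.
        critique_prompt = f"""Human: {prompt}

A: {response}

Critique request: {principle.critique_prompt}

Critique:"""
        return self.generate_response(critique_prompt)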
    
    def revise(self, prompt: str, response: str, critique: str, principle: ConstitutionalPrinciple) -> str:
        """Ask the model to rewrite `response` so that it addresses `critique`."""
        # End the prompt with a "Revision:" cue so a causal LM continues with the rewritten response.
        revision_prompt = f"""Human: {prompt}

A: {response}

Critique: {critique}

Revision request: {principle.revision_prompt}

Revision:"""
        return self.generate_response(revision_prompt)
    
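    def self_critique_loop(self, prompt: str, max_iterations: int = 3) -> str:
        """Run up to `max_iterations` critique-and-revision rounds and return the final response."""
        response = self.generate_response(prompt)

        for i in range(max_iterations):
            # Cycle through the principles so each round checks a different one.
            principle = self.principles[i % len(self.principles)]
            critique = self.critique(prompt, response, principle)

            # Crude stopping heuristic: string matching on the critique text.
            # It can miss paraphrases, so treat it as a placeholder for a real check.
            if "no issues" in critique.lower() or "response is appropriate" in critique.lower():
                break

            response = self.revise(prompt, response, critique, principle)

        return response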
    
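    def generate_preference_pair(self, prompt: str) -> Dict:
        """Sample two responses and ask the model which one better fits the principles."""
        response_a = self.generate_response(prompt)
        response_b = self.generate_response(prompt)

        # Render the principles as a readable bulleted list rather than a Python list repr.
        principle_list = "\n".join(f"- {p.name}" for p in self.principles)
        preference_prompt = f"""Consider the following constitutional principles:
{principle_list}

H: {prompt}

Response A: {response_a}

Response B: {response_b}

Which response better aligns with the principles? Respond with "A" or "B" and explain why:"""

        # The raw text is returned as-is; it still has to be parsed into an "A"/"B" label.
        preference = self.generate_response(preference_prompt)

        return {
            "prompt": prompt,
            "response_a": response_a,
            "response_b": response_b,
            "preference": preference
        }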

principles = [
    ConstitutionalPrinciple(
        name="Harmlessness",
        critique_prompt="Identify any ways this response might be harmful, dangerous, or unethical.",
        revision_prompt="Please revise the response to remove any harmful content while maintaining helpfulness."
    ),
    ConstitutionalPrinciple(
        name="Helpfulness",
        critique_prompt="Identify any ways this response could be more helpful, accurate, or informative.",
        revision_prompt="Please revise the response to be more helpful and informative."
    ),
    ConstitutionalPrinciple(
        name="Honesty",
        critique_prompt="Identify any inaccuracies, misleading information, or uncertain claims.",
        revision_prompt="Please revise the response to be more accurate and honest about uncertainties."
    )
]

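# "gpt2" keeps the example small enough to run locally. A model this small will not
# produce meaningful critiques, so substitute a capable instruction-following model for real use.
cai = ConstitutionalAI("gpt2", principles)
revised = cai.self_critique_loop("How do I pick a lock?")
print(revised)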
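The `preference` value returned by `generate_preference_pair` is free text, so it has to be mapped to a label before it can train a reward model. The helper below is a deliberately conservative sketch: it accepts only answers that open with a bare "A" or "B" or with "Response A" / "Response B", and returns `None` otherwise so that ambiguous pairs can be discarded. The function name `parse_preference` and the accepted formats are assumptions, not part of the class above.

```python
import re
from typing import Optional

# Answer opens with "Response A" or "Response B" (any capitalization of "response").
_RESPONSE_LABEL = re.compile(r"^\s*[\"'(]?(?i:response)\s+([AB])\b")
# Answer opens with a bare label that is followed by the end of the text or punctuation.
_BARE_LABEL = re.compile(r"^\s*[\"'(]?([AB])[\"')]?\s*(?:$|[.,:;\n-])")


def parse_preference(text: str) -> Optional[str]:
    """Return "A" or "B" if the answer opens with a clear label, else None."""
    for pattern in (_RESPONSE_LABEL, _BARE_LABEL):
        match = pattern.match(text)
        if match:
            return match.group(1)
    # Ambiguous openings such as "A better answer would be B" are rejected.
    return None


print(parse_preference("B. It declines politely."))
print(parse_preference("Response A is safer."))
print(parse_preference("A better answer would be B"))
```

Rejecting unclear answers trades some data for cleaner labels; a looser parser would keep more pairs but risk mislabeling them.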

Red-Teaming with CAI

Constitutional AI enables systematic red-teaming by defining adversarial principles:

red_team_principles = [
    ConstitutionalPrinciple(
        name="Robustness to jailbreaks",
        critique_prompt="Does this response resist common jailbreak attempts? Identify any vulnerabilities.",
        revision_prompt="Revise the response to be robust against adversarial inputs."
    ),
    ConstitutionalPrinciple(
        name="Refusal of harmful requests",
        critique_prompt="Does this response appropriately refuse harmful requests?",
        revision_prompt="Revise to clearly decline the harmful request while being polite."
    )
]

Comparison with RLHF

CAI was introduced by Bai et al. (2022) in "Constitutional AI: Harmlessness from AI Feedback." The key insight is that explicit principles make alignment more transparent, reproducible, and scalable compared to implicit human preferences.

The relationship between CAI and RLHF:

  1. SL-CAI replaces SFT on human-written demonstrations
  2. RLAIF replaces RLHF with AI-generated preferences
  3. Constitution replaces implicit human preferences with explicit principles
  4. Self-critique replaces human-written critiques and corrections of harmful responses

When implementing CAI, the quality of the constitution is critical. Start with clear, specific principles and iterate based on observed failure modes. The AI's ability to self-critique improves with model capability: larger models tend to produce better critiques.

Practical Implementation Considerations

Model Selection:

  • Self-critique requires a capable base model (typically 7B+ parameters)
  • The AI critic can be the same model or a larger, separate model
  • Larger models produce more nuanced critiques

Constitution Design:

  • Start with broad principles, then add specificity
  • Include both "do" and "don't" principles
  • Test the constitution against known failure modes
  • Version control the constitution like code
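One lightweight way to treat the constitution like code is to record a short content fingerprint next to every dataset or checkpoint it produced, so results can be traced to the exact set of principles. The sketch below is an assumption about how that might be done; `constitution_fingerprint` is a hypothetical helper that complements, rather than replaces, keeping the file in version control.

```python
import hashlib
import json
from typing import List


def constitution_fingerprint(principles: List[str]) -> str:
    """Return a short, stable hash of an ordered list of principles."""
    # Serialize deterministically so the same principles always hash the same way.
    canonical = json.dumps(principles, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


v1 = [
    "Choose the response that is least likely to be considered harmful.",
    "Choose the response that is most helpful and harmless.",
]
# A revised constitution with one extra principle (wording taken from the examples above).
v2 = v1 + ["Choose the response that is most ethical and least likely to cause harm."]

print(constitution_fingerprint(v1))
print(constitution_fingerprint(v2))
```

Because order is part of the hash, reordering the principles also yields a new fingerprint, which matters if principles are applied in sequence.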

Training Protocol:

  • Run the SL-CAI phase first, then start the RL-CAI phase from the resulting model
  • Monitor for reward hacking in the RLAIF phase
  • Use a KL penalty against the reference policy to limit drift away from its capabilities

Evaluation:

  • Compare against RLHF-trained models on safety benchmarks
  • Test with red-teaming to identify remaining vulnerabilities
  • Measure helpfulness-harmlessness tradeoffs

Summary

  • Constitutional AI replaces human feedback on harmlessness with AI feedback guided by explicit principles
  • The SL-CAI phase generates training data through self-critique and revision loops
  • The RLAIF phase trains reward models on AI-generated preferences
  • CAI aims to be more scalable and transparent than RLHF, while inheriting the biases of the model that provides the feedback
  • The constitution defines alignment principles in natural language
  • Implementation requires careful constitution design and iteration

Practice Exercises

  1. Constitution Design: Write a constitution with 5 principles for a customer service chatbot. Test how different principles affect the model's behavior.

  2. Self-Critique Loop: Implement a 3-iteration self-critique loop. How does the response quality change across iterations?

  3. RLAIF Data Generation: Generate 100 preference pairs using AI feedback. Compare the agreement rate with human preferences on the same examples.

  4. Comparison Study: Train two models—one with CAI and one with standard SFT. Evaluate both on helpfulness and harmlessness benchmarks.

  5. Adversarial Testing: Test your CAI-trained model against common jailbreak prompts. Identify which constitutional principles are most effective at preventing misuse.

