LLM Evaluation

Bias and Fairness in LLMs

LLMs inherit and amplify societal biases present in their training data. Understanding, measuring, and mitigating these biases is essential for responsible deployment.

Sources — Training data, annotation, algorithmic amplification
Measurement — StereoSet, CrowS-Pairs, BBQ benchmarks
Mitigation — Data curation, fine-tuning, prompting, and decoding

Fairness is not an act of kindness—it is an act of justice.

Definition: Bias in LLMs

Bias in the context of LLMs refers to systematic patterns in model outputs that produce unfair, stereotypical, or discriminatory results across demographic groups. Formally, a model exhibits bias if P(y | x, g₁) ≠ P(y | x, g₂) for equivalent inputs differing only in demographic attribute g.

Sources of Bias

1. Training Data Bias

The most fundamental source: the training corpus reflects historical and societal biases.

Definition: Representation Bias

Representation bias occurs when certain demographic groups are under- or over-represented in training data. If the frequency of group g in the corpus, P_corpus(g), differs from its frequency in the target population, P_population(g), the model learns skewed associations.

Representation Imbalance Ratio

$$R(g) = \frac{P_{\text{corpus}}(g)}{P_{\text{population}}(g)}$$

Here,

$P_{\text{corpus}}(g)$ = Frequency of group g in training data
$P_{\text{population}}(g)$ = True population frequency of group g
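
As a rough illustration of the ratio above, the following Python sketch computes R(g) from per-document group tags. The function name, the group labels "A" and "B", and all counts are hypothetical and chosen only for the example; in practice, obtaining reliable group tags for a corpus is itself a hard problem.

```python
from collections import Counter

def representation_ratio(corpus_groups, population_share):
    """Return R(g) = P_corpus(g) / P_population(g) for each group g."""
    counts = Counter(corpus_groups)
    total = sum(counts.values())
    return {
        g: (counts.get(g, 0) / total) / share
        for g, share in population_share.items()
    }

# Hypothetical data: one group tag per training document.
corpus_groups = ["A"] * 80 + ["B"] * 20
population_share = {"A": 0.5, "B": 0.5}

ratios = representation_ratio(corpus_groups, population_share)
print(ratios)
```

A ratio above 1 means the group is over-represented relative to the population, and a ratio below 1 means it is under-represented. Here group A comes out at about 1.6 and group B at about 0.4.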

2. Annotation Bias

Human annotators bring their own biases when creating training data for instruction tuning or RLHF.

3. Algorithmic Amplification

Training objectives and architectural choices can amplify small data biases into large output biases.

Research has shown that GPT-3 associates "Muslim" with violence at a rate 50x higher than other religious groups. This demonstrates how training data biases can produce severe stereotypical associations.

Types of Bias in Detail

Understanding the different forms bias takes is essential for effective mitigation:

Definition: Bias Taxonomy

Stereotyping: Assigning fixed characteristics to groups (e.g., "women are nurturing")
Denigration: Negative associations with specific groups
Underrepresentation: Certain groups appear disproportionately rarely in model outputs
Interaction bias: Differential treatment based on group membership
Confirmation bias: Reinforcing existing societal prejudices

How Bias Manifests in LLMs

Bias can appear at multiple stages of the LLM pipeline:

Data collection: Web crawls overrepresent certain demographics and viewpoints
Pretraining: The model learns statistical associations from biased data
Alignment/RLHF: Human raters may have systematic biases in preference data
Deployment: Users may use outputs in ways that amplify existing biases

Bias is not always harmful. Some biases reflect genuine statistical regularities in the world. The challenge is distinguishing between legitimate statistical patterns and harmful stereotypes that should be corrected.

Measuring Bias

Benchmark-Based Evaluation

Several standardized benchmarks exist for measuring social bias:

Benchmark	Approach	Bias Categories	Method
StereoSet	Sentence completion	Gender, race, religion, profession	Probability comparison
CrowS-Pairs	Sentence pairs	9 categories	Relative likelihood
BBQ	Question answering	Ambiguous/disambiguated contexts	Accuracy parity
WinoBias	Coreference	Gender, occupation	Winograd schema
BBQ-Civil	QA	Intersectional	Context manipulation

StereoSet Implicit Bias Score

$$\text{IBS} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[P(\text{stereo}_i) > P(\text{anti-stereo}_i)]$$

Here,

$N$ = Number of test examples
$\text{stereo}_i$ = Stereotypical completion for example i
$\text{anti-stereo}_i$ = Anti-stereotypical completion for example i
$P(\cdot)$ = Model probability of the completion

Bias Score Calculation

If a model prefers stereotypical completions in 320 out of 500 examples, then IBS = 320/500 = 0.64. An unbiased model would have IBS ≈ 0.5 (no systematic preference between stereotypical and anti-stereotypical completions).
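
The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not the official StereoSet scoring code: the function name and the probability values are made up, and the probabilities would normally come from scoring each completion with the model under test.

```python
def implicit_bias_score(pairs):
    """Fraction of examples where the stereotypical completion is more likely.

    `pairs` is a list of (p_stereo, p_anti_stereo) model probabilities.
    Log-probabilities work too, since only the comparison matters.
    """
    if not pairs:
        raise ValueError("need at least one example")
    preferred = sum(1 for p_stereo, p_anti in pairs if p_stereo > p_anti)
    return preferred / len(pairs)

# Hypothetical scores reproducing the worked example:
# 320 of 500 examples prefer the stereotypical completion.
pairs = [(0.6, 0.4)] * 320 + [(0.4, 0.6)] * 180
ibs = implicit_bias_score(pairs)
print(ibs)  # 0.64
```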

Counterfactual Evaluation

The counterfactual method measures bias by comparing outputs when demographic attributes are swapped:

Counterfactual Token Bias

$$\text{CTB} = \mathbb{E}_{x, g_1, g_2} \left[ \frac{P(y | x, g_1)}{P(y | x, g_2)} \right]$$

Here,

$x$ = Input template (e.g., 'The {group} is a')
$g_1, g_2$ = Different demographic group words
$y$ = Target output token
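
A minimal Python sketch of the counterfactual ratio, assuming the per-template probabilities P(y | x, g₁) and P(y | x, g₂) have already been read off the model. The function name and the three probability pairs are hypothetical.

```python
def counterfactual_token_bias(prob_pairs):
    """Average of P(y | x, g1) / P(y | x, g2) over templates."""
    if not prob_pairs:
        raise ValueError("need at least one template")
    ratios = [p_g1 / p_g2 for p_g1, p_g2 in prob_pairs]
    return sum(ratios) / len(ratios)

# Hypothetical (P(y | x, g1), P(y | x, g2)) pairs for three templates.
prob_pairs = [(0.30, 0.15), (0.20, 0.20), (0.10, 0.20)]
ctb = counterfactual_token_bias(prob_pairs)
print(round(ctb, 3))
```

A value of 1 means the target token is equally likely under both group words. Because a plain average of ratios is asymmetric (2.0 and 0.5 do not cancel out), averaging log-ratios instead is one alternative worth considering.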

Debiasing Techniques

Data-Level Debiasing

Resampling: Balance representation of demographic groups
Data augmentation: Generate counterfactual examples
Toxicity filtering: Remove explicitly biased content
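
Counterfactual data augmentation from the list above can be sketched as follows. This is a toy illustration under strong assumptions: the swap list is deliberately tiny and hypothetical, it covers only a few gendered words, and it ignores ambiguous cases (such as "her") that a real pipeline would have to handle.

```python
import re

# Hypothetical, deliberately small swap list; real lists are much longer.
SWAPS = {"he": "she", "she": "he", "man": "woman", "woman": "man",
         "father": "mother", "mother": "father"}
PATTERN = re.compile(r"\b(" + "|".join(SWAPS) + r")\b", re.IGNORECASE)

def counterfactual(sentence):
    """Swap each listed word for its counterpart, preserving capitalization."""
    def swap(match):
        word = match.group(0)
        replacement = SWAPS[word.lower()]
        return replacement.capitalize() if word[0].isupper() else replacement
    return PATTERN.sub(swap, sentence)

def augment(corpus):
    """Return the corpus plus a counterfactual copy of each changed sentence."""
    augmented = list(corpus)
    for sentence in corpus:
        flipped = counterfactual(sentence)
        if flipped != sentence:
            augmented.append(flipped)
    return augmented

corpus = ["He is a doctor.", "The woman fixed the engine.", "It rained."]
augmented = augment(corpus)
print(augmented)
```

Sentences without any listed word are left alone, so the augmented corpus grows only by the sentences that actually carry a swappable attribute.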

Training-Level Debiasing

Adversarial Debiasing Objective

$$\mathcal{L}_{\text{debias}} = \mathcal{L}_{\text{LM}} - \lambda \cdot \mathcal{L}_{\text{adv}}(g)$$

Here,

$\mathcal{L}_{\text{LM}}$ = Standard language modeling loss
$\mathcal{L}_{\text{adv}}(g)$ = Adversarial loss: predicting demographic attribute from output
$\lambda$ = Debiasing strength hyperparameter

The adversarial component encourages the model to produce outputs from which the demographic attribute cannot be predicted, which pushes the outputs toward demographic parity.
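
One common way to read this objective is as a two-player game. The parameter symbols below are assumptions introduced for illustration and do not appear in the original: θ denotes the language model's parameters and φ the parameters of the adversary that tries to predict g.

```latex
\begin{aligned}
\phi^{*} &= \arg\min_{\phi} \; \mathcal{L}_{\text{adv}}(g;\, \theta, \phi) \\
\theta^{*} &= \arg\min_{\theta} \; \mathcal{L}_{\text{LM}}(\theta) - \lambda \cdot \mathcal{L}_{\text{adv}}(g;\, \theta, \phi)
\end{aligned}
```

The adversary minimizes its own prediction loss, while the language model is rewarded when that loss is high. Setting λ = 0 recovers standard language model training.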

Inference-Level Debiasing

Prompt-based debiasing: Instruct the model to be fair and unbiased
Constrained decoding: Block stereotypical token sequences
Output filtering: Post-hoc filtering of biased outputs
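
Constrained decoding is often implemented by masking logits before sampling. The sketch below assumes a toy four-word vocabulary and made-up logits, and it blocks single tokens only. Blocking multi-token sequences would additionally require tracking the generated prefix.

```python
import math

def masked_softmax(logits, blocked_ids):
    """Softmax over logits with blocked token ids forced to probability 0."""
    masked = [(-math.inf if i in blocked_ids else x) for i, x in enumerate(logits)]
    peak = max(masked)
    if peak == -math.inf:
        raise ValueError("all tokens are blocked")
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and next-token logits.
vocab = ["nurse", "doctor", "engineer", "teacher"]
logits = [2.0, 1.0, 0.5, 0.5]
blocked_ids = {0}  # block the token judged stereotypical in this context

probs = masked_softmax(logits, blocked_ids)
next_token = vocab[max(range(len(probs)), key=probs.__getitem__)]
print(next_token)  # doctor
```

Deciding which tokens to block in which context is the hard part, and this sketch does not address it.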

A practical approach: combine data filtering (remove top 1% most biased examples) with lightweight adversarial debiasing during fine-tuning. This achieves 70-85% bias reduction with minimal capability loss.
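
The data-filtering half of this recipe might look like the sketch below. It assumes each example already has a numeric bias score, where higher means more biased. How such a score is produced is not specified here, so the scores used are placeholders.

```python
import math

def drop_most_biased(examples, scores, fraction=0.01):
    """Drop the `fraction` of examples with the highest bias scores."""
    if len(examples) != len(scores):
        raise ValueError("examples and scores must have the same length")
    n_drop = math.ceil(fraction * len(examples))
    ranked = sorted(range(len(examples)), key=lambda i: scores[i], reverse=True)
    dropped = set(ranked[:n_drop])
    return [ex for i, ex in enumerate(examples) if i not in dropped]

# Placeholder data: 200 examples with hypothetical bias scores in [0, 1).
examples = [f"example {i}" for i in range(200)]
scores = [i / 200 for i in range(200)]

kept = drop_most_biased(examples, scores)
print(len(kept))  # 198
```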

Fairness Metrics

Metric	Definition	Property
Demographic Parity	P(ŷ=1 | g=0) = P(ŷ=1 | g=1)	Independence
Equalized Odds	P(ŷ=1 | g, y) equal across g for each y ∈ {0, 1}	Separation
Predictive Parity	P(y=1 | ŷ=1, g) equal across g	Sufficiency
Counterfactual Fairness	P(ŷ | do(g=0)) = P(ŷ | do(g=1))	Invariance
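
The first three metrics in the table can be estimated from binary labels and predictions. The following Python sketch is illustrative only: the function names and the eight data points are hypothetical, and counterfactual fairness is omitted because it requires a causal model rather than observed counts.

```python
def rate(numerator, denominator):
    """Safe division that returns NaN for an empty denominator."""
    return numerator / denominator if denominator else float("nan")

def group_metrics(y_true, y_pred, groups):
    """Per-group positive rate, TPR, FPR and PPV for binary labels."""
    out = {}
    for g in sorted(set(groups)):
        rows = [(t, p) for t, p, gg in zip(y_true, y_pred, groups) if gg == g]
        tp = sum(1 for t, p in rows if t == 1 and p == 1)
        fp = sum(1 for t, p in rows if t == 0 and p == 1)
        fn = sum(1 for t, p in rows if t == 1 and p == 0)
        tn = sum(1 for t, p in rows if t == 0 and p == 0)
        out[g] = {
            "positive_rate": rate(tp + fp, len(rows)),  # demographic parity
            "tpr": rate(tp, tp + fn),                   # equalized odds (with fpr)
            "fpr": rate(fp, fp + tn),
            "ppv": rate(tp, tp + fp),                   # predictive parity
        }
    return out

# Hypothetical labels, predictions, and group membership.
y_true = [1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]
groups = [0, 0, 0, 0, 1, 1, 1, 1]

metrics = group_metrics(y_true, y_pred, groups)
for g, values in metrics.items():
    print(g, values)
```

A criterion holds when the corresponding value matches across groups. In this toy data none of them do: group 0 has a positive rate of 0.5 against 0.25 for group 1, and the TPR, FPR, and PPV values differ as well.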

Note that demographic parity, equalized odds, and predictive parity cannot all be satisfied simultaneously when base rates differ across groups (Chouldechova, 2017). Choosing which fairness criterion to optimize is a value judgment, not a purely technical decision.
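
The tension between predictive parity and equalized odds can be made concrete with an identity that follows directly from the confusion matrix. The symbols are introduced here for illustration: p_g = P(y=1 | g) is the base rate of group g, PPV is the positive predictive value, and FPR and FNR are the false positive and false negative rates.

```latex
\text{FPR}_g = \frac{p_g}{1 - p_g} \cdot \frac{1 - \text{PPV}_g}{\text{PPV}_g} \cdot \left(1 - \text{FNR}_g\right)
```

If two groups share the same PPV and the same FNR but have different base rates p_g, the right-hand side differs between them, so their FPRs must differ as well. The only exceptions are degenerate cases such as a classifier with PPV = 1.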

Practice Exercises

Conceptual: Explain why equalized odds and demographic parity can conflict. Give a concrete example where satisfying one necessarily violates the other.
Mathematical: Given a model that predicts "doctor" with probability 0.8 for men and 0.3 for women in a template completion task, compute the counterfactual token bias. Is this model biased?
Practical: Using the StereoSet dataset, evaluate a small LLM's implicit bias across the four categories. Which category shows the most bias?
Research: Compare the effectiveness of data-level debiasing vs. inference-level debiasing. What are the tradeoffs in terms of bias reduction and capability preservation?

Key Takeaways:

Bias sources include training data representation, annotation, and algorithmic amplification
Standardized benchmarks (StereoSet, CrowS-Pairs, BBQ) enable systematic measurement
Counterfactual evaluation provides intuitive bias quantification
Debiasing techniques span data, training, and inference levels
Fairness metrics involve fundamental tradeoffs that require explicit value choices

What to Learn Next

-> Hallucination Detection and Mitigation: Detecting and reducing factual errors in LLM outputs.

-> LLM Safety and Red Teaming: Systematic adversarial testing for safety vulnerabilities.

-> Constitutional AI: Training models with explicit behavioral principles.

-> RLHF and Alignment: Aligning language models with human preferences and values.

-> LLM Benchmarking Suites: Comprehensive benchmarks including bias evaluation.

-> Copyright and Legal Issues: Legal frameworks governing AI fairness and discrimination.
