
Bias and Fairness in LLMs


LLMs inherit and amplify societal biases present in their training data. Understanding, measuring, and mitigating these biases is essential for responsible deployment.

  • Sources — Training data, annotation, algorithmic amplification
  • Measurement — StereoSet, CrowS-Pairs, BBQ benchmarks
  • Mitigation — Data curation, fine-tuning, prompting, and decoding

Fairness is not an act of kindness—it is an act of justice.

Definition: Bias in LLMs

Bias in the context of LLMs refers to systematic patterns in model outputs that produce unfair, stereotypical, or discriminatory results across demographic groups. Formally, a model exhibits bias if P(y | x, g₁) ≠ P(y | x, g₂) for equivalent inputs differing only in demographic attribute g.

Sources of Bias

1. Training Data Bias

The most fundamental source: the training corpus reflects historical and societal biases.

Definition: Representation Bias

Representation bias occurs when certain demographic groups are under- or over-represented in training data. If the frequency of group g in the corpus, P_corpus(g), differs from its frequency in the target population, P_population(g), the model learns skewed associations.

Representation Imbalance Ratio

R(g) = \frac{P_{\text{corpus}}(g)}{P_{\text{population}}(g)}

Here,

  • P_{\text{corpus}}(g) = Frequency of group g in the training data
  • P_{\text{population}}(g) = True population frequency of group g
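
The ratio can be computed directly from group counts. The following is a minimal sketch, assuming each training document has already been tagged with a group label; the function name `representation_ratio` and the toy data are hypothetical, not part of any benchmark.

```python
from collections import Counter

def representation_ratio(corpus_groups, population_share):
    """Return R(g) = P_corpus(g) / P_population(g) for each group."""
    counts = Counter(corpus_groups)
    total = sum(counts.values())
    ratios = {}
    for group, pop_p in population_share.items():
        corpus_p = counts.get(group, 0) / total  # empirical corpus frequency
        ratios[group] = corpus_p / pop_p
    return ratios

# Hypothetical toy data: group labels attached to 10 corpus documents.
corpus_groups = ["a"] * 8 + ["b"] * 2
population_share = {"a": 0.5, "b": 0.5}

ratios = representation_ratio(corpus_groups, population_share)
for group, r in ratios.items():
    status = "over" if r > 1 else "under" if r < 1 else "proportionally "
    print(f"{group}: R = {r:.2f} ({status}represented)")
```

A value of R(g) above 1 means the group is over-represented relative to the population, and a value below 1 means it is under-represented.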

2. Annotation Bias

Human annotators bring their own biases when creating training data for instruction tuning or RLHF.

3. Algorithmic Amplification

Training objectives and architectural choices can amplify small data biases into large output biases.

Research on GPT-3 (Abid et al., 2021) found that prompts mentioning "Muslim" were completed with violent content far more often than otherwise identical prompts naming other religious groups. This demonstrates how training data biases can produce severe stereotypical associations.

Types of Bias in Detail

Understanding the different forms bias takes is essential for effective mitigation:

Definition: Bias Taxonomy

  • Stereotyping: Assigning fixed characteristics to groups (e.g., "women are nurturing")
  • Denigration: Negative associations with specific groups
  • Underrepresentation: Certain groups appearing less often in outputs than their share of the population warrants
  • Interaction bias: Differential treatment based on group membership
  • Confirmation bias: Reinforcing existing societal prejudices

How Bias Manifests in LLMs

Bias can appear at multiple stages of the LLM pipeline:

  1. Data collection: Web crawls overrepresent certain demographics and viewpoints
  2. Pretraining: The model learns statistical associations from biased data
  3. Alignment/RLHF: Human raters may have systematic biases in preference data
  4. Deployment: Users may use outputs in ways that amplify existing biases

Bias is not always harmful. Some biases reflect genuine statistical regularities in the world. The challenge is distinguishing between legitimate statistical patterns and harmful stereotypes that should be corrected.

Measuring Bias

Benchmark-Based Evaluation

Several standardized benchmarks exist for measuring social bias:

| Benchmark | Approach | Bias Categories | Method |
|---|---|---|---|
| StereoSet | Sentence completion | Gender, race, religion, profession | Probability comparison |
| CrowS-Pairs | Sentence pairs | 9 categories | Relative likelihood |
| BBQ | Question answering | Ambiguous/disambiguated contexts | Accuracy parity |
| WinoBias | Coreference | Gender, occupation | Winograd schema |
| BBQ-Civil | QA | Intersectional | Context manipulation |

StereoSet Implicit Bias Score

\text{IBS} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[P(\text{stereo}_i) > P(\text{anti-stereo}_i)]

Here,

  • N = Number of test examples
  • \text{stereo}_i = Stereotypical completion for example i
  • \text{anti-stereo}_i = Anti-stereotypical completion for example i
  • P(\cdot) = Model probability of the completion

Bias Score Calculation

If a model prefers stereotypical completions in 320 out of 500 examples, then IBS = 320/500 = 0.64. An unbiased model would have IBS ≈ 0.5, meaning it shows no systematic preference between stereotypical and anti-stereotypical completions.
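
The same calculation can be sketched in code. This is an illustrative helper, assuming the model's probabilities for each stereotypical and anti-stereotypical completion have already been extracted; `implicit_bias_score` and the probability values are hypothetical, chosen only to reproduce the 320-of-500 case.

```python
def implicit_bias_score(pairs):
    """Fraction of examples where the stereotypical completion is more likely.

    pairs: iterable of (p_stereo, p_anti_stereo) model probabilities.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one example")
    preferred = sum(1 for p_stereo, p_anti in pairs if p_stereo > p_anti)
    return preferred / len(pairs)

# Hypothetical probabilities: 320 examples favor the stereotype, 180 do not.
pairs = [(0.6, 0.4)] * 320 + [(0.3, 0.7)] * 180
score = implicit_bias_score(pairs)
print(f"IBS = {score:.2f}")
```

Note that the indicator only records which completion is more likely, not by how much, so a model with many narrow preferences scores the same as one with many strong preferences.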

Counterfactual Evaluation

The counterfactual method measures bias by comparing outputs when demographic attributes are swapped:

Counterfactual Token Bias

\text{CTB} = \mathbb{E}_{x, g_1, g_2} \left[ \frac{P(y | x, g_1)}{P(y | x, g_2)} \right]

Here,

  • x = Input template (e.g., 'The {group} is a')
  • g_1, g_2 = Different demographic group words
  • y = Target output token
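
As a rough sketch of the counterfactual ratio, the snippet below averages P(y | x, g_1) / P(y | x, g_2) over templates for one fixed pair of group words. It assumes the per-template probabilities have already been read off the model; the function name, template strings, and numbers are all hypothetical.

```python
from statistics import mean

def counterfactual_token_bias(prob_table, g1, g2):
    """Average of P(y | x, g1) / P(y | x, g2) over all templates x.

    prob_table maps template -> {group word: probability of target token y}.
    """
    ratios = [probs[g1] / probs[g2] for probs in prob_table.values()]
    return mean(ratios)

# Hypothetical probabilities of one target token under two templates.
prob_table = {
    "The {group} works as a": {"man": 0.4, "woman": 0.2},
    "The {group} is a": {"man": 0.5, "woman": 0.25},
}

ctb = counterfactual_token_bias(prob_table, "man", "woman")
print(f"CTB = {ctb:.2f}")  # 1.0 would indicate parity for this pair
```

A value of 1 indicates that swapping the group word does not change the probability of the target token; values far from 1 in either direction indicate that the demographic term shifts the prediction.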

Debiasing Techniques

Data-Level Debiasing

  • Resampling: Balance representation of demographic groups
  • Data augmentation: Generate counterfactual examples
  • Toxicity filtering: Remove explicitly biased content
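
Counterfactual data augmentation, the second item above, can be sketched as a word-swap pass over the corpus. This is a deliberately small illustration: the `SWAPS` table is a hypothetical three-pair list, and real pipelines need far larger lexicons and must handle ambiguous words (for example "her", which can map to "his" or "him").

```python
import re

# Hypothetical, intentionally tiny swap table; each pair is listed both ways.
SWAPS = {
    "he": "she", "she": "he",
    "man": "woman", "woman": "man",
    "father": "mother", "mother": "father",
}
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, SWAPS)) + r")\b", re.IGNORECASE
)

def counterfactual(sentence):
    """Swap each listed group word for its counterpart, keeping capitalization."""
    def swap(match):
        word = match.group(0)
        replacement = SWAPS[word.lower()]
        return replacement.capitalize() if word[0].isupper() else replacement
    return _PATTERN.sub(swap, sentence)

def augment(corpus):
    """Return the corpus plus a counterfactual copy of every sentence that changes."""
    augmented = []
    for sentence in corpus:
        augmented.append(sentence)
        swapped = counterfactual(sentence)
        if swapped != sentence:
            augmented.append(swapped)
    return augmented

corpus = ["He is a father.", "The weather is nice."]
for line in augment(corpus):
    print(line)
```

Sentences without any listed group word are left alone, so the augmented corpus grows only where a demographic term appears.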

Training-Level Debiasing

Adversarial Debiasing Objective

\mathcal{L}_{\text{debias}} = \mathcal{L}_{\text{LM}} - \lambda \cdot \mathcal{L}_{\text{adv}}(g)

Here,

  • \mathcal{L}_{\text{LM}} = Standard language modeling loss
  • \mathcal{L}_{\text{adv}}(g) = Adversarial loss: predicting demographic attribute from output
  • \lambda = Debiasing strength hyperparameter

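The adversary is trained to predict the demographic attribute from the model's output. Because the adversarial loss enters the objective with a negative sign, minimizing the combined loss pushes the model to make that prediction fail, so its outputs carry little information about the attribute. This approximates demographic parity.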
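
To make the sign convention concrete, here is a toy numeric sketch of the combined objective for a single example. It assumes both losses are cross-entropies and that the probability vectors are given; the function names and all numbers are hypothetical, and a real setup would alternate gradient updates between the language model and the adversary.

```python
import math

def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[target_index])

def debias_loss(lm_probs, next_token, adv_probs, group, lam):
    """L_debias = L_LM - lambda * L_adv for one example."""
    l_lm = cross_entropy(lm_probs, next_token)
    l_adv = cross_entropy(adv_probs, group)
    return l_lm - lam * l_adv

lm_probs = [0.7, 0.2, 0.1]   # hypothetical next-token distribution
lam = 0.5                    # hypothetical debiasing strength

# Adversary recovers the group easily: output leaks demographic information.
leaky = debias_loss(lm_probs, 0, adv_probs=[0.9, 0.1], group=0, lam=lam)
# Adversary is at chance: output reveals nothing about the group.
hidden = debias_loss(lm_probs, 0, adv_probs=[0.5, 0.5], group=0, lam=lam)

print(f"leaky  output: {leaky:.3f}")
print(f"hidden output: {hidden:.3f}")
```

With the same language modeling loss, the example whose group is easy to predict receives the higher combined loss, which is the pressure that drives the model toward group-independent outputs.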

Inference-Level Debiasing

  • Prompt-based debiasing: Instruct the model to be fair and unbiased
  • Constrained decoding: Block stereotypical token sequences
  • Output filtering: Post-hoc filtering of biased outputs
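
Constrained decoding can be illustrated with a single greedy step that skips blocked tokens. This is a simplified sketch: it blocks individual tokens rather than multi-token sequences, and the function name, candidate tokens, and scores are hypothetical.

```python
def constrained_greedy_step(logits, blocked_tokens):
    """Return the highest-scoring token that is not on the block list.

    logits: dict mapping candidate token -> unnormalized score.
    blocked_tokens: set of tokens the decoder must not emit at this step.
    """
    allowed = {tok: score for tok, score in logits.items()
               if tok not in blocked_tokens}
    if not allowed:
        raise ValueError("all candidate tokens are blocked")
    return max(allowed, key=allowed.get)

# Hypothetical scores for the next token after a gendered prompt.
logits = {"nurse": 2.1, "engineer": 1.9, "teacher": 0.5}

unconstrained = constrained_greedy_step(logits, blocked_tokens=set())
constrained = constrained_greedy_step(logits, blocked_tokens={"nurse"})
print(unconstrained, "->", constrained)
```

Hard blocking is blunt: it removes a token in every context, including ones where it is appropriate, which is one reason it is usually combined with the other techniques in this section.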

A practical approach: combine data filtering (remove top 1% most biased examples) with lightweight adversarial debiasing during fine-tuning. This achieves 70-85% bias reduction with minimal capability loss.

Fairness Metrics

| Metric | Definition | Property |
|---|---|---|
| Demographic Parity | P(ŷ=1 \| g=0) = P(ŷ=1 \| g=1) | Independence |
| Equalized Odds | P(ŷ=1 \| g, y=1) equal across g | Separation |
| Predictive Parity | P(y=1 \| ŷ=1, g) equal across g | Sufficiency |
| Counterfactual Fairness | P(ŷ \| do(g=0)) = P(ŷ \| do(g=1)) | Invariance |
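
The first three metrics in the table can be estimated from labeled predictions. The sketch below computes the per-group quantities each metric compares, assuming binary labels and predictions; the helper names and the eight-row toy dataset are hypothetical. The data are chosen so that the two groups have different base rates.

```python
def _rate(numerator, denominator):
    return numerator / denominator if denominator else float("nan")

def group_metrics(y_true, y_pred, groups, group):
    """Per-group rates behind demographic parity, equalized odds, predictive parity."""
    rows = [(t, p) for t, p, g in zip(y_true, y_pred, groups) if g == group]
    tp = sum(1 for t, p in rows if t == 1 and p == 1)
    fp = sum(1 for t, p in rows if t == 0 and p == 1)
    fn = sum(1 for t, p in rows if t == 1 and p == 0)
    tn = sum(1 for t, p in rows if t == 0 and p == 0)
    return {
        "positive_rate": _rate(tp + fp, len(rows)),  # demographic parity
        "tpr": _rate(tp, tp + fn),                   # equalized odds (with fpr)
        "fpr": _rate(fp, fp + tn),
        "ppv": _rate(tp, tp + fp),                   # predictive parity
    }

# Hypothetical data: base rate is 0.5 in group 0 and 0.25 in group 1.
y_true = [1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
groups = [0, 0, 0, 0, 1, 1, 1, 1]

m0 = group_metrics(y_true, y_pred, groups, 0)
m1 = group_metrics(y_true, y_pred, groups, 1)
for name in m0:
    print(f"{name}: group0={m0[name]:.2f} group1={m1[name]:.2f}")
```

In this toy data the positive rate and the precision match across groups, so demographic parity and predictive parity hold, while the true and false positive rates differ, so equalized odds does not.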

Note that demographic parity, equalized odds, and predictive parity cannot all be satisfied simultaneously when base rates differ across groups (Chouldechova, 2017). Choosing which fairness criterion to optimize is a value judgment, not a purely technical decision.

Practice Exercises

  1. Conceptual: Explain why equalized odds and demographic parity can conflict. Give a concrete example where satisfying one necessarily violates the other.

  2. Mathematical: Given a model that predicts "doctor" with probability 0.8 for men and 0.3 for women in a template completion task, compute the counterfactual token bias. Is this model biased?

  3. Practical: Using the StereoSet dataset, evaluate a small LLM's implicit bias across the four categories. Which category shows the most bias?

  4. Research: Compare the effectiveness of data-level debiasing vs. inference-level debiasing. What are the tradeoffs in terms of bias reduction and capability preservation?

Key Takeaways:

  • Bias sources include training data representation, annotation, and algorithmic amplification
  • Standardized benchmarks (StereoSet, CrowS-Pairs, BBQ) enable systematic measurement
  • Counterfactual evaluation provides intuitive bias quantification
  • Debiasing techniques span data, training, and inference levels
  • Fairness metrics involve fundamental tradeoffs that require explicit value choices

What to Learn Next

-> Hallucination Detection and Mitigation: Detecting and reducing factual errors in LLM outputs.

-> LLM Safety and Red Teaming: Systematic adversarial testing for safety vulnerabilities.

-> Constitutional AI: Training models with explicit behavioral principles.

-> RLHF and Alignment: Aligning language models with human preferences and values.

-> LLM Benchmarking Suites: Comprehensive benchmarks including bias evaluation.

-> Copyright and Legal Issues: Legal frameworks governing AI fairness and discrimination.
