Hallucination Detection and Mitigation
LLM hallucinations—plausible-sounding yet factually incorrect outputs—remain one of the most critical failure modes in production systems. This guide covers the full pipeline from detection to mitigation.
- Taxonomy — Intrinsic vs extrinsic, factual vs faithfulness hallucinations
- Detection — Reference-based, reference-free, and model-based approaches
- Mitigation — Retrieval augmentation, training-time fixes, and decoding strategies
Understanding the taxonomy, detection, and mitigation of hallucinations is essential for building trustworthy AI applications.
Definition: Hallucination
A hallucination in the context of LLMs is a generated output that is fluent and grammatically correct but factually incorrect, unfaithful to the source context, or fabricated. Formally, given input x and model output y, a hallucination occurs when the model's distribution P(y | x) assigns high probability to an output y that violates world knowledge or source fidelity.
Taxonomy of Hallucinations
Hallucinations can be classified along two axes: what is hallucinated and where the information should have come from.
Definition: Hallucination Taxonomy
- Intrinsic hallucination: The output contradicts the source content (faithfulness violation)
- Extrinsic hallucination: The output introduces information not present in or inferable from the source (groundedness violation)
- Factual hallucination: The output violates established world knowledge
- Faithfulness hallucination: The output is unfaithful to the provided context (relevant in RAG scenarios)
| Type | Source of Error | Example |
|---|---|---|
| Intrinsic | Contradicts source | "The paper was published in 2021" when source says 2020 |
| Extrinsic | Fabricated detail | Inventing a citation that does not exist |
| Factual | World knowledge violation | "The Earth orbits the Sun in 400 days" |
| Faithfulness | Context ignored | Summarizing a document but adding unsupported claims |
Reported hallucination rates for current LLMs range from roughly 3% to 27% of queries, depending on the model and task. For high-stakes applications (medical, legal), even rates at the low end of that range are unacceptable.
The Hallucination Problem in Practice
Why Hallucinations Occur
Hallucinations arise from several fundamental properties of how LLMs are trained and how they generate text:
Definition: Root Causes of Hallucination
- Training data gaps: The model encounters topics underrepresented in training
- Objective mismatch: Next-token prediction optimizes fluency, not factual accuracy
- Distributional drift: During generation the model conditions on its own earlier outputs, so the text can drift away from the training distribution and early errors compound
- Ambiguity resolution: The model must choose between equally likely completions
- Confabulation: The model fills gaps with plausible but invented details
The fundamental tension is that LLMs are trained to predict what a human would write, not what is true. This fluency-truth gap is the root of the hallucination problem.
Impact Across Domains
| Domain | Hallucination Rate | Risk Level | Consequence |
|---|---|---|---|
| Medical Q&A | 15-25% | Critical | Misdiagnosis, wrong treatment |
| Legal advice | 10-20% | High | Invalid arguments, liability |
| Financial analysis | 8-15% | High | Incorrect projections, losses |
| Education | 5-12% | Medium | Misinformation propagation |
| Creative writing | 20-40% | Low | Acceptable (fiction) |
In creative writing, what we call "hallucination" is actually a feature—generating novel content is the goal. The distinction between hallucination and creativity depends entirely on the task context and factual requirements.
Detection Methods
Reference-Based Detection
When a ground-truth reference exists, we can compute factual overlap metrics:
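Factual Consistency Score
$$\mathrm{FCS}(y, y^*) = \frac{|\mathrm{claims}(y) \cap \mathrm{claims}(y^*)|}{|\mathrm{claims}(y)|}$$
Here,
- $y$ = Model-generated output
- $y^*$ = Ground-truth reference
- $\mathrm{claims}(\cdot)$ = Set of atomic factual claims extracted from text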
FCS Calculation
Given model output y = "Paris is the capital of France and has 2 million people" and reference y* = "Paris is the capital of France with 2.1 million people":
- claims(y) = {"Paris is capital of France", "Paris has 2M people"}
- claims(y*) = {"Paris is capital of France", "Paris has 2.1M people"}
- Intersection = {"Paris is capital of France"}
- FCS = 1/2 = 0.5
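The calculation above can be sketched in a few lines of Python. This is a minimal illustration, assuming that FCS is the fraction of the output's claims that also appear in the reference (which reproduces the 1/2 in the example) and that claims have already been extracted as normalized strings. The function name `factual_consistency_score` is hypothetical, and the exact string matching is a simplification: it treats "2M" and "2.1M" as different claims, just as the worked example does.

```python
def factual_consistency_score(output_claims, reference_claims):
    """Fraction of the output's claims that also appear in the reference.

    Assumption: claims are pre-extracted, normalized strings and are
    compared by exact match.
    """
    output_set = set(output_claims)
    if not output_set:
        return 0.0  # convention chosen for this sketch: no claims, score 0
    supported = output_set & set(reference_claims)
    return len(supported) / len(output_set)


claims_y = ["Paris is capital of France", "Paris has 2M people"]
claims_ref = ["Paris is capital of France", "Paris has 2.1M people"]

fcs = factual_consistency_score(claims_y, claims_ref)
print(fcs)  # 0.5
```

In practice the hard part is the claim extraction and the matching step, which usually needs fuzzy or model-based comparison rather than string equality.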
Reference-Free Detection (Model-Based)
For open-ended generation where no reference exists, we use sampling-based consistency checks, LLM-as-judge, or trained classifiers:
SelfCheckGPT Score
The score is defined in terms of:
- $x$ = Input prompt
- $y$ = Model output to check
- $y^{(k)}$ = k-th sampled alternative response
- $K$ = Number of alternative samples
- $P(\cdot)$ = Token-level probabilities under the model
The intuition: if a claim in y is factual, independent samples from the model should agree. If it is hallucinated, samples will diverge on the details.
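The consistency idea can be illustrated with a toy sketch. It is not the SelfCheckGPT implementation: the agreement test `supports` is a hypothetical stand-in based on token overlap, where a real system would use a stronger comparison (for example an NLI model or a question-answering check), and the 0.9 threshold is an arbitrary choice for this example. The samples are hard-coded here; in practice they would be drawn from the model for the same prompt.

```python
import re


def _tokens(text):
    """Lowercased word and number tokens; a deliberately crude normalizer."""
    return set(re.findall(r"[a-z0-9]+(?:\.[0-9]+)?", text.lower()))


def supports(sentence, sample, threshold=0.9):
    """Hypothetical agreement test: nearly all sentence tokens occur in the sample."""
    sentence_tokens = _tokens(sentence)
    if not sentence_tokens:
        return True
    overlap = len(sentence_tokens & _tokens(sample)) / len(sentence_tokens)
    return overlap >= threshold


def inconsistency_score(sentence, samples):
    """Fraction of sampled responses that do NOT support the sentence.

    0.0 means every sample agrees; 1.0 means none do.
    """
    if not samples:
        raise ValueError("need at least one sampled response")
    disagreements = sum(not supports(sentence, s) for s in samples)
    return disagreements / len(samples)


# Stand-ins for K = 3 responses sampled from the model for the same prompt.
samples = [
    "Marie Curie won the Nobel Prize in Physics in 1903.",
    "In 1903 Marie Curie won the Nobel Prize in Physics.",
    "Marie Curie won the 1903 Nobel Prize in Physics.",
]

stable = inconsistency_score("Marie Curie won the Nobel Prize in Physics in 1903.", samples)
shaky = inconsistency_score("Marie Curie won the Nobel Prize in Physics in 1911.", samples)
print(stable, shaky)
```

A sentence that every sample supports gets a score near 0, while a sentence whose details the samples do not reproduce gets a score near 1 and is a candidate hallucination.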
NLI-Based Detection
Natural Language Inference models can check whether a source supports a generated claim:
NLI Hallucination Detection
The check is defined in terms of:
- $s$ = Source or context document
- $c$ = Generated claim
- $\mathrm{NLI}(s, c)$ = Pretrained NLI model
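A claim-level check along these lines might look like the sketch below. The NLI model is abstracted as a callable `entail_prob(premise, hypothesis)` that returns the probability of entailment; that interface, the 0.5 threshold, and the function names are all assumptions made for illustration. The `toy_entail_prob` stub only does substring matching so that the example runs without downloading a model; a real detector would plug in an actual NLI classifier here.

```python
from typing import Callable, List, Tuple

# (premise, hypothesis) -> probability that the premise entails the hypothesis
EntailFn = Callable[[str, str], float]


def flag_unsupported_claims(
    source: str,
    claims: List[str],
    entail_prob: EntailFn,
    threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Return (claim, score) pairs whose entailment probability is below threshold."""
    flagged = []
    for claim in claims:
        score = entail_prob(source, claim)
        if score < threshold:
            flagged.append((claim, score))
    return flagged


def toy_entail_prob(premise: str, hypothesis: str) -> float:
    """Stand-in for a real NLI model: 1.0 if the hypothesis is a substring, else 0.0."""
    return 1.0 if hypothesis.lower() in premise.lower() else 0.0


source = "The paper was published in 2020 and reports results on three benchmarks."
claims = [
    "the paper was published in 2020",
    "the paper was published in 2021",
]

flagged = flag_unsupported_claims(source, claims, toy_entail_prob)
print(flagged)  # [('the paper was published in 2021', 0.0)]
```

Keeping the model behind a plain callable makes it easy to swap the stub for a real classifier and to tune the threshold on labeled data.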
Mitigation Strategies
Retrieval-Augmented Generation (RAG)
Grounding generation in retrieved evidence is one of the most effective mitigations:
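RAG Hallucination Reduction
$$P(y \mid x) = \sum_{z \in \mathcal{Z}} P(z \mid x)\, P(y \mid x, z)$$
Here,
- $\mathcal{Z}$ = Retrieved evidence set
- $P(z \mid x)$ = Retrieval probability
- $P(y \mid x, z)$ = Generation conditioned on evidence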
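To make the role of the three quantities concrete, here is a small numeric sketch. It assumes the standard RAG marginalization, in which the probability of an answer is the sum over retrieved documents of the retrieval probability times the generation probability given that document. The function name and the probability values are invented for illustration.

```python
def marginal_answer_prob(retrieval_probs, answer_probs_given_doc):
    """P(y | x) = sum over z of P(z | x) * P(y | x, z) for the retrieved set."""
    if len(retrieval_probs) != len(answer_probs_given_doc):
        raise ValueError("need one generation probability per retrieved document")
    return sum(pz * py for pz, py in zip(retrieval_probs, answer_probs_given_doc))


# Three retrieved documents (illustrative numbers, not measurements).
retrieval_probs = [0.6, 0.3, 0.1]   # P(z | x): how relevant each document looks
answer_probs = [0.9, 0.5, 0.1]      # P(y | x, z): how strongly each supports answer y

p_answer = marginal_answer_prob(retrieval_probs, answer_probs)
print(round(p_answer, 2))  # 0.7
```

The answer ends up well supported mainly because the most relevant document supports it strongly; an answer backed only by low-ranked documents would receive little probability mass.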
Training-Time Mitigation
- Knowledge-grounded training: Train on (context, grounded-response) pairs
- Contrastive learning: Penalize hallucinated outputs, reward faithful ones
- RLHF with factuality reward: Reward model specifically scores factual accuracy
Decoding-Time Strategies
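Factual Nucleus Sampling
$$P'(w \mid w_{<t}) = \begin{cases} \dfrac{P(w \mid w_{<t})}{\sum_{w' \in V_p} P(w' \mid w_{<t})} & \text{if } w \in V_p \\ 0 & \text{otherwise} \end{cases}$$
Here,
- $V_p$ = Top-p nucleus: the smallest set of highest-probability tokens whose cumulative probability is at least $p$
- $p$ = Nucleus sampling threshold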
A key insight: lowering the temperature and shrinking the nucleus both reduce hallucination, but they also reduce diversity and creativity. This is the faithfulness-creativity tradeoff.
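The effect of the nucleus size can be seen in a short sketch of plain top-p filtering. It uses the common convention that the nucleus is the smallest set of highest-probability tokens whose cumulative probability reaches p, and then renormalizes over that set. The token distribution is made up, and `top_p_filter` is a hypothetical helper that works on a token-to-probability dictionary rather than a real vocabulary tensor.

```python
def top_p_filter(probs, p):
    """Keep the smallest top-ranked token set whose mass reaches p, then renormalize."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = {}
    cumulative = 0.0
    for token, prob in ranked:
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}


# Made-up next-token distribution for "The paper was published in ____".
next_token_probs = {"2020": 0.55, "2021": 0.25, "2019": 0.15, "1920": 0.05}

narrow = top_p_filter(next_token_probs, p=0.5)  # keeps only "2020"
wide = top_p_filter(next_token_probs, p=0.9)    # keeps "2020", "2021", "2019"
print(narrow)
print(sorted(wide))
```

With the small nucleus only the most likely token can be sampled; with the large one, less likely years become possible outputs, which adds variety but also room for error.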
Combining RAG with constrained decoding (e.g., logit biasing toward tokens present in retrieved evidence) can reduce hallucination rates by 50-80% compared to baseline generation.
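Logit biasing toward evidence tokens can be sketched as follows. This is a toy illustration under stated assumptions: logits are kept in a token-to-value dictionary instead of a vocabulary-sized tensor, the bonus of 2.0 is an arbitrary value, and `bias_logits_toward_evidence` is a hypothetical helper rather than an API of any particular library.

```python
import math


def bias_logits_toward_evidence(logits, evidence_tokens, bias=2.0):
    """Add a constant bonus to the logit of every token that occurs in the evidence."""
    return {
        token: (logit + bias if token in evidence_tokens else logit)
        for token, logit in logits.items()
    }


def softmax(logits):
    """Numerically stable softmax over a token -> logit dictionary."""
    peak = max(logits.values())
    exps = {token: math.exp(logit - peak) for token, logit in logits.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}


# The unbiased model slightly prefers "2021", but the retrieved passage says 2020.
logits = {"2020": 1.0, "2021": 1.2, "2019": 0.3}
evidence_tokens = {"2020"}

before = softmax(logits)
after = softmax(bias_logits_toward_evidence(logits, evidence_tokens))
print(max(before, key=before.get), max(after, key=after.get))  # 2021 2020
```

The bias strength is a tuning knob: too small and the evidence is ignored, too large and the model can only copy from the retrieved text.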
Evaluation Frameworks
| Framework | Method | Metric | Reference Required |
|---|---|---|---|
| FActScore | Atomic fact decomposition | Precision/Recall of facts | Optional |
| SelfCheckGPT | Multi-sample consistency | Agreement score | No |
| G-Eval | LLM-as-judge | Likert scale | Optional |
| TruLens | Chain of verification | Claim verification rate | Yes |
| HaluEval | Hallucination detection QA | Binary classification | Yes |
Practice Exercises
- Conceptual: Explain why SelfCheckGPT works for factual claims but may fail for subjective opinions. What properties of a claim make it amenable to consistency-based detection?
- Mathematical: Given a hallucination rate of 5% per sentence and an average output of 10 sentences, and assuming sentences hallucinate independently, compute the probability that at least one hallucination occurs in a generated response.
- Practical: Implement a simple hallucination detector using an NLI model (e.g., DeBERTa-v3 fine-tuned on MNLI) that checks whether a source document entails a generated summary.
- Research: Compare RAG-based mitigation with RLHF-based mitigation. Under what conditions does each approach dominate?
Key Takeaways:
- Hallucinations are classified as intrinsic (contradicts source), extrinsic (fabricated), factual (world knowledge), and faithfulness (ignores context)
- Detection methods include reference-based (FCS), reference-free (SelfCheckGPT), and NLI-based approaches
- RAG is among the most effective mitigations; combined with constrained decoding it can reduce hallucination rates by 50-80% compared to baseline generation
- Decoding strategies (temperature, nucleus size) control the faithfulness-creativity tradeoff
- Combining multiple mitigation strategies generally gives stronger protection than any single one