Hallucination Detection and Mitigation

LLM hallucinations—plausible-sounding yet factually incorrect outputs—remain one of the most critical failure modes in production systems. This guide covers the full pipeline from detection to mitigation.

Taxonomy — Intrinsic vs extrinsic, factual vs faithfulness hallucinations
Detection — Reference-based, reference-free, and model-based approaches
Mitigation — Retrieval augmentation, training-time fixes, and decoding strategies

The truth is rarely pure and never simple.

Understanding the taxonomy, detection, and mitigation of hallucinations is essential for building trustworthy AI applications.

Definition: Hallucination

A hallucination in the context of LLMs is a generated output that is fluent and grammatically correct but factually incorrect, unfaithful to the source context, or fabricated. Formally, given input x and model output y, a hallucination occurs when the model assigns high probability P(y | x) to an output y that violates world knowledge or source fidelity.

Taxonomy of Hallucinations

Hallucinations can be classified along two axes: what is hallucinated and where the information should have come from.

Definition: Hallucination Taxonomy

Intrinsic hallucination: The output contradicts the source content (faithfulness violation)
Extrinsic hallucination: The output introduces information not present in or inferable from the source (groundedness violation)
Factual hallucination: The output violates established world knowledge
Faithfulness hallucination: The output is unfaithful to the provided context (relevant in RAG scenarios)

Type	Source of Error	Example
Intrinsic	Contradicts source	"The paper was published in 2021" when source says 2020
Extrinsic	Fabricated detail	Inventing a citation that does not exist
Factual	World knowledge violation	"The Earth orbits the Sun in 400 days"
Faithfulness	Context ignored	Summarizing a document but adding unsupported claims

Reported hallucination rates for current LLMs range from roughly 3% to 27% of queries, depending on the task and model. For high-stakes applications (medical, legal), even small hallucination rates are unacceptable.

The Hallucination Problem in Practice

Why Hallucinations Occur

Hallucinations arise from several fundamental properties of how LLMs are trained and how they generate text:

Definition: Root Causes of Hallucination

Training data gaps: The model encounters topics underrepresented in training
Objective mismatch: Next-token prediction optimizes fluency, not factual accuracy
Distributional drift: Generated text drifts away from the training distribution
Ambiguity resolution: The model must choose between equally likely completions
Confabulation: The model fills gaps with plausible but invented details

The fundamental tension is that LLMs are trained to predict what a human would write, not what is true. This fluency-truth gap is the root of the hallucination problem.

Impact Across Domains

Domain	Hallucination Rate	Risk Level	Consequence
Medical Q&A	15-25%	Critical	Misdiagnosis, wrong treatment
Legal advice	10-20%	High	Invalid arguments, liability
Financial analysis	8-15%	High	Incorrect projections, losses
Education	5-12%	Medium	Misinformation propagation
Creative writing	20-40%	Low	Acceptable (fiction)

In creative writing, what we call "hallucination" is actually a feature—generating novel content is the goal. The distinction between hallucination and creativity depends entirely on the task context and factual requirements.

Detection Methods

Reference-Based Detection

When a ground-truth reference exists, we can compute factual overlap metrics:

Factual Consistency Score

$$\text{FCS}(y, y^*) = \frac{|\text{claims}(y) \cap \text{claims}(y^*)|}{|\text{claims}(y)|}$$

Here,

$y$ = Model-generated output
$y^*$ = Ground-truth reference
$\text{claims}(\cdot)$ = Set of atomic factual claims extracted from text

FCS Calculation

Given model output y = "Paris is the capital of France and has 2 million people" and reference y* = "Paris is the capital of France with 2.1 million people":

claims(y) = {"Paris is capital of France", "Paris has 2M people"}
claims(y*) = {"Paris is capital of France", "Paris has 2.1M people"}
Intersection = {"Paris is capital of France"}
FCS = 1/2 = 0.5
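The calculation above can be sketched in a few lines of Python. This is a minimal illustration that assumes the atomic claims have already been extracted and normalized into comparable strings; claim extraction itself is the hard part and is not shown. The function name and the zero score for an empty output are assumptions made for this example.

```python
def factual_consistency_score(output_claims: set[str], reference_claims: set[str]) -> float:
    """Fraction of the output's claims that also appear in the reference."""
    if not output_claims:
        # Convention chosen for this sketch; the formula is undefined for empty output.
        return 0.0
    return len(output_claims & reference_claims) / len(output_claims)


claims_y = {"Paris is capital of France", "Paris has 2M people"}
claims_ref = {"Paris is capital of France", "Paris has 2.1M people"}

fcs = factual_consistency_score(claims_y, claims_ref)
print(fcs)  # 0.5
```

Because the denominator is the number of claims in the output, FCS behaves like a precision measure: it penalizes unsupported claims but not omissions.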

Reference-Free Detection (Model-Based)

For open-ended generation where no reference exists, we can use LLM-as-judge scoring, trained classifiers, or sampling-based consistency checks such as SelfCheckGPT:

SelfCheckGPT Score

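$$\text{SelfCheck}(x, y) = 1 - \frac{1}{K} \sum_{k=1}^{K} \text{Agree}(y, y_k), \quad y_k \sim P(\cdot \mid x)$$

Here,

$x$ = Input prompt
$y$ = Model output to check
$y_k$ = k-th alternative response sampled from the model for the same prompt
$K$ = Number of alternative samples
$\text{Agree}(y, y_k)$ = Agreement score in [0, 1] between the output and a sample (for example BERTScore, NLI, or QA-based)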

The intuition: if a claim in y is factual, independent samples from the model should agree. If it is hallucinated, samples will diverge on the details.
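The consistency idea can be sketched as one minus the average agreement between the output and the sampled alternatives. The agreement function below is a token-level Jaccard overlap, chosen only to keep the example self-contained; it is an assumption of this sketch, and SelfCheckGPT itself relies on stronger scorers such as BERTScore, question answering, or NLI. The function names are hypothetical.

```python
import re


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two strings (a crude stand-in for agreement)."""
    tokens_a = set(re.findall(r"\w+", a.lower()))
    tokens_b = set(re.findall(r"\w+", b.lower()))
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def inconsistency_score(output: str, samples: list[str]) -> float:
    """1 - mean agreement with the samples; higher suggests hallucination."""
    if not samples:
        raise ValueError("need at least one sampled response")
    mean_agreement = sum(jaccard(output, s) for s in samples) / len(samples)
    return 1.0 - mean_agreement


consistent = inconsistency_score(
    "Paris is the capital of France",
    ["Paris is the capital of France", "The capital of France is Paris"],
)
divergent = inconsistency_score(
    "The paper was published in 2021",
    ["The paper appeared in 2019", "It came out around 2017"],
)
print(consistent, divergent)
```

In practice the samples would come from repeated calls to the same model at a nonzero temperature, and the score would usually be computed per sentence rather than per response.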

NLI-Based Detection

Natural Language Inference models can check whether a source supports a generated claim:

NLI Hallucination Detection

$$P(\text{entail} \mid x, y) = \text{NLI}_{\theta}(x \rightarrow y)$$

Here,

$x$ = Source or context document
$y$ = Generated claim
$\text{NLI}_{\theta}$ = Pretrained NLI model
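A detector built on this score typically splits the generated text into claims, scores each against the source, and flags those whose entailment probability falls below a threshold. The sketch below shows only that control flow. The `entail_prob` callable is a hypothetical interface standing in for a real NLI model, the 0.5 threshold is arbitrary, and `toy_entail_prob` is a word-overlap placeholder that is not an NLI model.

```python
import re
from typing import Callable

# Hypothetical interface: (premise, hypothesis) -> entailment probability in [0, 1].
EntailFn = Callable[[str, str], float]


def flag_unsupported(source: str, summary: str, entail_prob: EntailFn,
                     threshold: float = 0.5) -> list[str]:
    """Return summary sentences whose entailment score is below the threshold."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", summary) if s.strip()]
    return [s for s in sentences if entail_prob(source, s) < threshold]


def toy_entail_prob(premise: str, hypothesis: str) -> float:
    """Placeholder scorer: share of hypothesis words found in the premise."""
    words = re.findall(r"\w+", hypothesis.lower())
    premise_words = set(re.findall(r"\w+", premise.lower()))
    if not words:
        return 0.0
    return sum(w in premise_words for w in words) / len(words)


source = "The study was published in 2020 and enrolled 120 patients."
summary = "The study was published in 2020. It won a national award."

flagged = flag_unsupported(source, summary, toy_entail_prob)
print(flagged)  # ['It won a national award.']
```

Swapping in a real model only requires a function with the same signature that returns the model's entailment probability for the (source, claim) pair.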

Mitigation Strategies

Retrieval-Augmented Generation (RAG)

Grounding generation in retrieved evidence is among the most effective mitigations:

RAG Hallucination Reduction

$$P_{\text{RAG}}(y \mid x) = \sum_{e \in \mathcal{E}} P(y \mid x, e) \cdot P(e \mid x)$$

Here,

$\mathcal{E}$ = Retrieved evidence set
$P(e \mid x)$ = Retrieval probability
$P(y \mid x, e)$ = Generation conditioned on evidence
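To make the marginalization concrete, the sketch below evaluates the sum for one candidate answer over three retrieved documents. All probabilities and document names are made-up illustrative values, not measurements, and the function name is hypothetical.

```python
def rag_probability(retrieval_probs: dict[str, float],
                    generation_probs: dict[str, float]) -> float:
    """Sum over evidence e of P(y | x, e) * P(e | x)."""
    return sum(retrieval_probs[e] * generation_probs[e] for e in retrieval_probs)


# P(e | x): how likely each document is to be retrieved for the query (illustrative).
retrieval = {"doc_a": 0.7, "doc_b": 0.2, "doc_c": 0.1}
# P(y | x, e): how likely the answer y is given each document (illustrative).
generation = {"doc_a": 0.9, "doc_b": 0.4, "doc_c": 0.05}

p_rag = rag_probability(retrieval, generation)
print(round(p_rag, 3))  # 0.715
```

The answer's overall probability is dominated by the documents the retriever ranks highly, which is why retrieval quality bounds how much grounding can help.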

Training-Time Mitigation

Knowledge-grounded training: Train on (context, grounded-response) pairs
Contrastive learning: Penalize hallucinated outputs, reward faithful ones
RLHF with factuality reward: Reward model specifically scores factual accuracy

Decoding-Time Strategies

Factual Nucleus Sampling

$$\tilde{p}(y_t \mid x, y_{<t}) = \begin{cases} \frac{p(y_t \mid x, y_{<t})}{\sum_{y' \in V_p} p(y' \mid x, y_{<t})} & \text{if } y_t \in V_p \\ 0 & \text{otherwise} \end{cases}$$

Here,

$\tilde{p}$ = Renormalized distribution that tokens are sampled from
$V_p$ = Top-p nucleus: the smallest set of highest-probability tokens whose cumulative probability is at least p
$p$ = Nucleus sampling threshold
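The truncate-and-renormalize step can be sketched in plain Python. This example uses the common convention that the nucleus is the smallest set of top-ranked tokens whose cumulative probability reaches p; the token distribution is invented for illustration and the function name is hypothetical.

```python
def nucleus_distribution(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the top-p nucleus, renormalize it, and zero out everything else."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in nucleus)
    kept = {token: prob / total for token, prob in nucleus}
    return {token: kept.get(token, 0.0) for token in probs}


next_token = {"Paris": 0.6, "Lyon": 0.25, "Rome": 0.1, "Mars": 0.05}
filtered = nucleus_distribution(next_token, p=0.8)
print(filtered)
```

With p = 0.8 only "Paris" and "Lyon" survive, so the low-probability tail, where implausible continuations tend to live, can no longer be sampled.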

A key insight: lowering the temperature and shrinking the nucleus tend to reduce hallucination, but they also reduce output diversity and creativity. This is the faithfulness-creativity tradeoff.

Combining RAG with constrained decoding (e.g., logit biasing toward tokens present in retrieved evidence) can reduce hallucination rates by 50-80% compared to baseline generation.
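Logit biasing toward evidence tokens can be sketched as adding a constant to the logits of tokens that appear in the retrieved evidence before applying the softmax. The bias value, the token vocabulary, and the function names below are assumptions for illustration; real systems operate on tokenizer IDs and tune the bias empirically.

```python
import math


def softmax(logits: dict[str, float]) -> dict[str, float]:
    """Numerically stable softmax over a token -> logit mapping."""
    peak = max(logits.values())
    exps = {token: math.exp(value - peak) for token, value in logits.items()}
    norm = sum(exps.values())
    return {token: e / norm for token, e in exps.items()}


def bias_toward_evidence(logits: dict[str, float], evidence_tokens: set[str],
                         bias: float = 2.0) -> dict[str, float]:
    """Add a fixed bonus to every token that occurs in the retrieved evidence."""
    return {t: v + (bias if t in evidence_tokens else 0.0) for t, v in logits.items()}


# Illustrative logits for the next token after "The paper was published in".
logits = {"2020": 1.0, "2021": 1.2, "2019": 0.3}
evidence = {"2020"}  # the retrieved source mentions 2020

before = softmax(logits)
after = softmax(bias_toward_evidence(logits, evidence))
print(max(before, key=before.get), max(after, key=after.get))  # 2021 2020
```

A soft bias like this nudges the model toward the evidence without forbidding other tokens, so it cannot by itself guarantee faithful output.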

Evaluation Frameworks

Framework	Method	Metric	Reference Required
FActScore	Atomic fact decomposition	Precision/Recall of facts	Optional
SelfCheckGPT	Multi-sample consistency	Agreement score	No
G-Eval	LLM-as-judge	Likert scale	Optional
TruLens	Chain of verification	Claim verification rate	Yes
HaluEval	Hallucination detection QA	Binary classification	Yes

Practice Exercises

Conceptual: Explain why SelfCheckGPT works for factual claims but may fail for subjective opinions. What properties of a claim make it amenable to consistency-based detection?
Mathematical: Given a hallucination rate of 5% per sentence and an average output of 10 sentences, compute the probability that at least one hallucination occurs in a generated response.
Practical: Implement a simple hallucination detector using an NLI model (e.g., DeBERTa-v3 fine-tuned on MNLI) that checks whether a source document entails a generated summary.
Research: Compare RAG-based mitigation with RLHF-based mitigation. Under what conditions does each approach dominate?

Key Takeaways:

Hallucinations are classified as intrinsic (contradicts source), extrinsic (fabricated), factual (world knowledge), and faithfulness (ignores context)
Detection methods include reference-based (FCS), reference-free (SelfCheckGPT), and NLI-based approaches
RAG is among the most effective mitigations; combined with constrained decoding, it can reduce hallucination rates by 50-80% compared to baseline generation
Decoding strategies (temperature, nucleus size) control the faithfulness-creativity tradeoff
Combining multiple mitigation strategies provides the strongest protection

What to Learn Next

-> Bias and Fairness in LLMs: Measuring and mitigating biases in language model outputs.

-> LLM Evaluation Frameworks: Comprehensive evaluation methodologies for language models.

-> Automated LLM Evaluation: Using models to evaluate models at scale.

-> Red Teaming Methodologies: Systematic adversarial testing of language models.

-> RAG System Design: Building retrieval-augmented generation systems for factual grounding.

-> LLM Benchmarking Suites: Comprehensive benchmarks including hallucination evaluation.
