Advanced RAG
RAG Evaluation — Measuring What Matters
RAG systems have multiple failure modes: irrelevant retrieval, hallucinated answers, unsupported claims. Proper evaluation requires measuring retrieval quality, generation faithfulness, and answer relevance simultaneously.
- Faithfulness — Does the answer stay true to the retrieved context?
- Answer Relevance — Does the answer actually address the question?
- Context Precision — Are the retrieved documents relevant and well-ranked?
You cannot improve what you cannot measure.
RAG Evaluation and Benchmarking
Evaluating RAG systems requires assessing multiple components: the retriever, the generator, and their interaction. A system might retrieve perfect documents but generate hallucinations, or retrieve irrelevant documents but generate plausible-sounding answers.
Definition: RAG Evaluation
RAG evaluation measures the quality of retrieval-augmented generation across three dimensions: (1) retrieval quality — are the right documents found? (2) faithfulness — does the answer align with retrieved context? (3) answer relevance — does the answer address the question?
Core Evaluation Metrics
Retrieval Metrics
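Context Precision@k
$$\text{Precision@}k = \frac{|\{\text{relevant documents}\} \cap \{\text{top-}k \text{ retrieved documents}\}|}{k}$$
Here,
- $k$ = Number of retrieved documents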
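As a minimal sketch, precision@k can be computed as the fraction of the top-k retrieved documents that are relevant. The function name and the document IDs below are hypothetical, chosen only for illustration, and relevance is assumed to be given as a set of known-relevant IDs.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved_ids[:k]
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    return hits / k


# Hypothetical document IDs, for illustration only
retrieved = ["d3", "d7", "d1", "d9"]
relevant = {"d1", "d3", "d5"}
print(precision_at_k(retrieved, relevant, k=4))  # 2 of the 4 retrieved are relevant -> 0.5
```

Dividing by k even when fewer than k documents come back is a convention; some implementations divide by the number actually retrieved instead.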
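Context Recall
$$\text{Recall} = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Relevant}|}$$
Here,
- $\text{Relevant}$ = Documents that contain information needed to answer the question
- $\text{Retrieved}$ = Documents returned by the retriever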
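A small sketch of context recall at the document level, assuming relevance labels are available as a set of document IDs. The function name and IDs are hypothetical; returning 0.0 when no relevant documents exist is a choice made here, not a standard.

```python
def context_recall(retrieved_ids, relevant_ids):
    """Fraction of the relevant documents that the retriever returned."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # convention assumed for this sketch
    found = relevant & set(retrieved_ids)
    return len(found) / len(relevant)


# Hypothetical document IDs, for illustration only
retrieved = ["d3", "d7", "d1", "d9"]
relevant = {"d1", "d3", "d5"}
print(context_recall(retrieved, relevant))  # 2 of the 3 relevant documents were retrieved
```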
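Mean Reciprocal Rank (MRR)
$$\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}$$
Here,
- $\text{rank}_i$ = Rank of the first relevant document for query $i$
- $|Q|$ = Number of queries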
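MRR averages, over all queries, the reciprocal of the rank at which the first relevant document appears. The sketch below assumes each query comes with a ranked list of document IDs and a set of relevant IDs (all hypothetical), and that a query with no relevant document in its list contributes 0.

```python
def mean_reciprocal_rank(ranked_results, relevant_sets):
    """Average of 1/rank of the first relevant document per query."""
    if not ranked_results:
        return 0.0
    total = 0.0
    for ranked, relevant in zip(ranked_results, relevant_sets):
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_results)


# Three hypothetical queries: first hit at rank 1, at rank 3, and no hit
ranked_results = [["a", "b"], ["x", "y", "z"], ["m", "n"]]
relevant_sets = [{"a"}, {"z"}, {"q"}]
print(mean_reciprocal_rank(ranked_results, relevant_sets))  # (1 + 1/3 + 0) / 3
```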
Generation Metrics
Definition: Faithfulness
Faithfulness measures whether the generated answer is supported by the retrieved context. An answer is faithful if every claim it makes can be attributed to the retrieved documents.
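Faithfulness Score
$$\text{Faithfulness} = \frac{|\text{Supported claims}|}{|\text{Total claims}|}$$
Here,
- $\text{Supported claims}$ = Claims in the answer that are backed by retrieved context
- $\text{Total claims}$ = All claims made in the answer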
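Once each claim in an answer has been judged, the faithfulness score is the share of supported claims. The sketch below assumes the judging step has already produced one label per claim; the label strings and the function name are assumptions for illustration, and treating an answer with no claims as 0.0 is a choice made here.

```python
def faithfulness_score(claim_labels):
    """Share of claims labeled SUPPORTED among all claims in the answer."""
    if not claim_labels:
        return 0.0  # convention assumed for this sketch
    supported = sum(1 for label in claim_labels if label == "SUPPORTED")
    return supported / len(claim_labels)


# Hypothetical per-claim verdicts from a judge
labels = ["SUPPORTED", "SUPPORTED", "UNSUPPORTED", "CONTRADICTED"]
print(faithfulness_score(labels))  # 2 supported out of 4 claims -> 0.5
```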
Definition: Answer Relevance
Answer Relevance measures whether the generated answer actually addresses the question. An answer can be faithful but irrelevant (correctly citing documents that don't answer the question).
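Answer Relevance Score
$$\text{Answer Relevance} = \frac{1}{N} \sum_{i=1}^{N} \text{sim}(q, a_i)$$
Here,
- $q$ = Original question
- $a_i$ = Generated answer (or paraphrase)
- $N$ = Number of generated answers/paraphrases
- $\text{sim}$ = A similarity function, commonly cosine similarity between embeddings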
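To make the averaging concrete, here is a sketch that scores answer relevance as the mean cosine similarity between a question embedding and the embeddings of the generated answers or paraphrases. The two-dimensional vectors are toy values standing in for real embeddings, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def answer_relevance(question_vec, answer_vecs):
    """Mean similarity between the question and each answer/paraphrase."""
    if not answer_vecs:
        return 0.0
    sims = [cosine_similarity(question_vec, vec) for vec in answer_vecs]
    return sum(sims) / len(sims)


# Toy embeddings: one answer points the same way as the question, one is orthogonal
question_vec = [1.0, 0.0]
answer_vecs = [[1.0, 0.0], [0.0, 1.0]]
print(answer_relevance(question_vec, answer_vecs))  # (1 + 0) / 2 -> 0.5
```

In practice the vectors would come from an embedding model, which this sketch deliberately leaves out.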
Hallucination Detection
Definition: Hallucination in RAG
A hallucination occurs when the generated answer contains information not present in the retrieved context. In RAG, this is particularly dangerous because users trust the system to ground answers in retrieved documents.
def detect_hallucinations(question, answer, context, llm):
    """Detect claims in the answer that are not supported by the context."""
    # Ask the judge LLM to label every claim in the answer
    prompt = f"""Analyze this answer for hallucinations.

Question: {question}
Context: {context}
Answer: {answer}

Identify each claim in the answer and mark it as:
- SUPPORTED: The claim is supported by the context
- UNSUPPORTED: The claim is not found in the context
- CONTRADICTED: The claim contradicts the context

Claims:"""
    response = llm.generate(prompt)
    # parse_claims is a helper that turns the judge's reply into structured claims
    return parse_claims(response)
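The `parse_claims` helper called above is not defined in this section. One possible sketch is shown here, under the assumption that the judge replies with one claim per line ending in its label, separated by a colon or hyphen (for example `- Some claim: SUPPORTED`). The reply format is an assumption; a real judge may need a stricter prompt or structured output.

```python
import re

# One claim per line: optional bullet or number, claim text, separator, label.
_CLAIM_LINE = re.compile(
    r"^\s*(?:[-*]|\d+[.)])?\s*(?P<claim>.+?)\s*[:\-]\s*"
    r"(?P<label>UNSUPPORTED|SUPPORTED|CONTRADICTED)\s*$"
)


def parse_claims(response):
    """Extract (claim, label) pairs from the judge's reply; skip other lines."""
    claims = []
    for line in response.splitlines():
        match = _CLAIM_LINE.match(line)
        if match:
            claims.append({"claim": match.group("claim"), "label": match.group("label")})
    return claims


# Hypothetical judge reply, for illustration only
reply = """Claims:
- Paris is the capital of France: SUPPORTED
2. The tower is 500 m tall - CONTRADICTED
* It opened in 1889: UNSUPPORTED"""
for item in parse_claims(reply):
    print(item["label"], "|", item["claim"])
```

Lines that do not match the expected shape are silently dropped, so it is worth logging the raw reply when the parsed list comes back empty.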
RAGAS Framework
Definition: RAGAS
RAGAS (Retrieval Augmented Generation Assessment) is a framework that evaluates RAG systems using four core metrics: faithfulness, answer relevance, context precision, and context recall. It uses an LLM as a judge to evaluate each component.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)


def evaluate_rag_system(dataset, rag_pipeline):
    """Evaluate a RAG system using RAGAS metrics."""
    # Generate an answer for every sample in the evaluation set
    results = []
    for sample in dataset:
        answer = rag_pipeline.generate(sample["question"])
        results.append({
            "question": sample["question"],
            "answer": answer["response"],
            "contexts": answer["retrieved_contexts"],
            "ground_truth": sample["ground_truth"],
        })

    # RAGAS expects a Hugging Face Dataset rather than a plain list of dicts.
    # These column names follow the ragas 0.1.x schema; newer releases may
    # use different names, so check the version you have installed.
    evaluation = evaluate(
        Dataset.from_list(results),
        metrics=[
            faithfulness,
            answer_relevancy,
            context_precision,
            context_recall,
        ],
    )
    return evaluation
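RAGAS scores range from 0 to 1, where 1 is the best possible score. As a rough guide, production RAG systems often land around faithfulness 0.7-0.9, answer relevance 0.6-0.8, context precision 0.5-0.7, and context recall 0.6-0.8, though the exact numbers depend on the domain, the dataset, and the judge model.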
Evaluation Pipeline
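import re


class RAGEvaluationPipeline:
    def __init__(self, rag_system, llm_judge):
        self.rag_system = rag_system
        self.llm_judge = llm_judge

    def evaluate_single(self, question, ground_truth=None):
        """Evaluate a single query."""
        # Get RAG output
        output = self.rag_system.generate(question)

        # Evaluate faithfulness
        faithfulness = self.evaluate_faithfulness(
            output["response"],
            output["retrieved_contexts"],
        )

        # Evaluate answer relevance
        relevance = self.evaluate_answer_relevance(
            question,
            output["response"],
        )

        # Evaluate context precision
        precision = self.evaluate_context_precision(
            question,
            output["retrieved_contexts"],
            ground_truth,
        )

        return {
            "question": question,
            "answer": output["response"],
            "faithfulness": faithfulness,
            "answer_relevance": relevance,
            "context_precision": precision,
            "num_retrieved": len(output["retrieved_contexts"]),
        }

    def _judge_score(self, prompt):
        """Query the judge and parse the first number in its reply as a score."""
        reply = str(self.llm_judge.generate(prompt))
        match = re.search(r"\d*\.?\d+", reply)
        if match is None:
            raise ValueError(f"Judge reply contains no score: {reply!r}")
        # Clamp to [0, 1] in case the judge strays outside the requested range
        return min(1.0, max(0.0, float(match.group())))

    def evaluate_faithfulness(self, answer, contexts):
        """Evaluate if answer is faithful to context."""
        prompt = f"""Rate how faithful this answer is to the provided context.
Context: {' '.join(contexts)}
Answer: {answer}
Rate from 0 to 1 (0 = completely hallucinated, 1 = fully supported):"""
        return self._judge_score(prompt)

    def evaluate_answer_relevance(self, question, answer):
        """Evaluate if answer addresses the question."""
        prompt = f"""Rate how relevant this answer is to the question.
Question: {question}
Answer: {answer}
Rate from 0 to 1 (0 = completely irrelevant, 1 = perfectly relevant):"""
        return self._judge_score(prompt)

    def evaluate_context_precision(self, question, contexts, ground_truth=None):
        """Mean judge-rated relevance of the retrieved contexts.

        One possible definition (an assumption): each context is rated
        separately and the ratings are averaged.
        """
        if not contexts:
            return 0.0
        reference = f"\nReference answer: {ground_truth}" if ground_truth else ""
        scores = []
        for context in contexts:
            prompt = f"""Rate how relevant this context is for answering the question.
Question: {question}{reference}
Context: {context}
Rate from 0 to 1 (0 = completely irrelevant, 1 = perfectly relevant):"""
            scores.append(self._judge_score(prompt))
        return sum(scores) / len(scores)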
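Per-query results are usually rolled up into one number per metric before systems are compared. The sketch below assumes each result is a dictionary with the keys returned by `evaluate_single` (`faithfulness`, `answer_relevance`, `context_precision`); the function name and sample values are hypothetical.

```python
from statistics import mean

METRIC_KEYS = ("faithfulness", "answer_relevance", "context_precision")


def aggregate_results(per_query_results, metric_keys=METRIC_KEYS):
    """Average each metric over all queries, skipping missing values."""
    summary = {}
    for key in metric_keys:
        values = [r[key] for r in per_query_results if r.get(key) is not None]
        summary[key] = mean(values) if values else None
    return summary


# Hypothetical per-query scores, for illustration only
results = [
    {"faithfulness": 0.9, "answer_relevance": 0.8, "context_precision": 0.5},
    {"faithfulness": 0.7, "answer_relevance": 0.6, "context_precision": 1.0},
]
print(aggregate_results(results))
```

Averages hide the distribution, so it often helps to also inspect the lowest-scoring queries directly.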
Benchmarking RAG Systems
Benchmark Datasets
| Dataset | Domain | Questions | Context | Type |
|---|---|---|---|---|
| HotpotQA | Wikipedia | 113K | Multi-doc | Multi-hop |
| Natural Questions | Google Search | 307K | Wikipedia | Open-domain |
| TriviaQA | Web | 95K | Web + Wikipedia | Open-domain |
| MS MARCO | Web | 1M | Web passages | Passage ranking |
| FiQA | Finance | 6K | Financial docs | Domain-specific |
Evaluation Results Comparison
| System | Faithfulness | Relevance | Context Precision | Latency |
|---|---|---|---|---|
| Naive RAG | 0.72 | 0.65 | 0.58 | 200ms |
| Hybrid Search RAG | 0.78 | 0.71 | 0.65 | 250ms |
| Re-ranked RAG | 0.82 | 0.75 | 0.72 | 300ms |
| Self-RAG | 0.85 | 0.78 | 0.70 | 350ms |
| Graph RAG | 0.80 | 0.82 | 0.68 | 400ms |
Practice Exercises
- Metric Comparison: Compare faithfulness scores using different LLM judges (GPT-4 vs Claude vs Llama). How much do scores vary across judges?
- Failure Analysis: Analyze 50 RAG failures. Categorize them as retrieval failures, generation failures, or integration failures. What is the most common failure mode?
- A/B Testing: Design an A/B test to compare two RAG configurations. What metrics would you track and how long would you run the test?
- Custom Metrics: Design a custom evaluation metric for RAG in a specific domain (e.g., medical, legal). What unique aspects would you measure?
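As a possible starting point for the A/B testing exercise, the sketch below uses a paired bootstrap to estimate how often configuration B outscores configuration A when the same queries are resampled. It assumes both configurations were scored on the same queries in the same order; the function name, parameters, and scores are hypothetical, and this is one option among several rather than a prescribed method.

```python
import random


def paired_bootstrap_win_rate(scores_a, scores_b, n_resamples=2000, seed=0):
    """Share of resampled query sets on which B's total score exceeds A's."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("scores must be non-empty and paired per query")
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    wins = 0
    for _ in range(n_resamples):
        # Resample queries with replacement, keeping A/B scores paired
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) > 0:
            wins += 1
    return wins / n_resamples


# Hypothetical per-query faithfulness scores for two configurations
config_a = [0.70, 0.65, 0.80, 0.60, 0.75]
config_b = [0.72, 0.70, 0.78, 0.66, 0.80]
print(paired_bootstrap_win_rate(config_a, config_b))
```

A win rate near 1.0 suggests B's advantage is stable across query samples, while a value near 0.5 suggests the difference may be noise.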
Key Takeaways
Summary: RAG Evaluation and Benchmarking
- Three pillars: faithfulness, answer relevance, context precision
- Faithfulness measures if answer is grounded in retrieved context
- Answer relevance measures if answer addresses the question
- Context precision measures if retrieval finds the right documents
- RAGAS provides automated evaluation using LLM-as-judge
- Hallucination detection identifies unsupported claims
- Benchmark datasets enable standardized comparison
- A/B testing validates improvements in production
What to Learn Next
-> RAG System Design: Advanced RAG architecture and design patterns.
-> LLM Evaluation Benchmarks: Evaluating LLMs on standard benchmarks.
-> Self-RAG and Adaptive Retrieval: When to retrieve and when to rely on knowledge.
-> Retrieval-Augmented Generation: RAG fundamentals and basic implementation.
-> Model Evaluation: ML evaluation metrics and techniques.
-> Agentic RAG Systems: Agent-based approaches to retrieval.