RAG Evaluation — Measuring What Matters

RAG systems have multiple failure modes: irrelevant retrieval, hallucinated answers, unsupported claims. Proper evaluation requires measuring retrieval quality, generation faithfulness, and answer relevance simultaneously.

Faithfulness — Does the answer stay true to the retrieved context?
Answer Relevance — Does the answer actually address the question?
Context Precision — Are the retrieved documents relevant and well-ranked?

You cannot improve what you cannot measure.

RAG Evaluation and Benchmarking

Evaluating RAG systems requires assessing multiple components: the retriever, the generator, and their interaction. A system might retrieve perfect documents but generate hallucinations, or retrieve irrelevant documents but generate plausible-sounding answers.

Definition: RAG Evaluation

RAG evaluation measures the quality of retrieval-augmented generation across three dimensions: (1) retrieval quality — are the right documents found? (2) faithfulness — does the answer align with retrieved context? (3) answer relevance — does the answer address the question?

Core Evaluation Metrics

Retrieval Metrics

Context Precision@k

$$\text{Precision@k} = \frac{|\{\text{relevant docs in top-k}\}|}{k}$$

Here,

$k$ = number of retrieved documents
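
As a minimal sketch of the formula above, the function below counts how many of the top-k retrieved document IDs are in a labeled set of relevant IDs. The function name, the document IDs, and the relevance labels are illustrative assumptions, not part of any library.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Divide by k (not by len(top_k)), matching the Precision@k formula
    return hits / k


# Hypothetical ranked retriever output and hand-labeled relevant documents
retrieved = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d3", "d5"}

# d3 and d1 are relevant, d7 is not: 2 of the top 3
print(precision_at_k(retrieved, relevant, k=3))
```

Dividing by k even when fewer than k documents come back penalizes a retriever that returns too few results, which is usually the intended behavior.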

Context Recall

$$\text{Recall} = \frac{|\{\text{relevant docs retrieved}\}|}{|\{\text{total relevant docs}\}|}$$

Here,

$\text{relevant docs}$ = documents that contain information needed to answer the question
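
A short sketch of context recall under the same assumptions (hypothetical document IDs and hand-labeled relevance). Unlike precision, the denominator is the number of relevant documents, so the score is undefined when nothing is labeled relevant.

```python
def context_recall(retrieved_ids, relevant_ids):
    """Fraction of all relevant documents that appear in the retrieved set."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("recall is undefined when there are no relevant documents")
    found = relevant & set(retrieved_ids)
    return len(found) / len(relevant)


# Hypothetical retriever output and relevance labels
retrieved = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d3", "d5"}

# d1 and d3 were retrieved, d5 was missed: 2 of 3 relevant documents
print(context_recall(retrieved, relevant))
```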

Mean Reciprocal Rank (MRR)

$$\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}$$

Here,

$|Q|$ = number of queries in the evaluation set
$\text{rank}_i$ = rank of the first relevant document for query $i$
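
The sketch below computes MRR over a small batch of queries. It assumes one common convention that the formula leaves open: a query whose results contain no relevant document contributes 0 to the sum. The function name and the sample data are illustrative.

```python
def mean_reciprocal_rank(ranked_results, relevant_sets):
    """Average of 1/rank of the first relevant document over all queries.

    Assumption: a query with no relevant document in its results adds 0.
    """
    if not ranked_results:
        raise ValueError("need at least one query")
    total = 0.0
    for retrieved, relevant in zip(ranked_results, relevant_sets):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant document counts
    return total / len(ranked_results)


# Three hypothetical queries: first hit at rank 1, at rank 3, and no hit
ranked_results = [["d3", "d7", "d1"], ["d2", "d8", "d5"], ["d6", "d4", "d0"]]
relevant_sets = [{"d3"}, {"d5"}, {"d9"}]

# (1 + 1/3 + 0) / 3
print(mean_reciprocal_rank(ranked_results, relevant_sets))
```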

Generation Metrics

Definition: Faithfulness

Faithfulness measures whether the generated answer is supported by the retrieved context. An answer is faithful if every claim it makes can be attributed to the retrieved documents.

Faithfulness Score

$$F = \frac{|\{\text{supported claims}\}|}{|\{\text{total claims}\}|}$$

Here,

$\text{supported claims}$ = claims in the answer that are backed by retrieved context
$\text{total claims}$ = all claims made in the answer
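
Once each claim in an answer has a verdict (from a human annotator or an LLM judge), the score is a simple ratio. The sketch below assumes verdicts arrive as a list of string labels where "SUPPORTED" marks a backed claim; the label names and the function are illustrative.

```python
def faithfulness_score(claim_labels):
    """Share of claims labeled SUPPORTED among all claims in an answer."""
    if not claim_labels:
        raise ValueError("answer contains no claims to score")
    supported = sum(1 for label in claim_labels if label == "SUPPORTED")
    return supported / len(claim_labels)


# Hypothetical verdicts for an answer that makes four claims
labels = ["SUPPORTED", "SUPPORTED", "UNSUPPORTED", "SUPPORTED"]

# 3 supported claims out of 4
print(faithfulness_score(labels))
```

Any label other than "SUPPORTED" lowers the score, so unsupported and contradicted claims are treated alike here; a stricter variant could weight contradictions more heavily.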

Definition: Answer Relevance

Answer Relevance measures whether the generated answer actually addresses the question. An answer can be faithful but irrelevant (correctly citing documents that don't answer the question).

Answer Relevance Score

$$\text{AR} = \frac{1}{N} \sum_{i=1}^{N} \text{sim}(q, a_i)$$

Here,

$q$ = original question
$a_i$ = generated answer (or paraphrase)
$N$ = number of generated answers/paraphrases
$\text{sim}$ = a similarity function, typically cosine similarity between embeddings
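
The sketch below shows the shape of the computation with a deliberately simple stand-in for the embedding step: a bag-of-words count vector compared by cosine similarity. This toy `embed` function is an assumption made to keep the example self-contained; a real evaluation would use a sentence-embedding model, and the scores here only reflect word overlap.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine_similarity(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[token] * v[token] for token in u)
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def answer_relevance(question, answers):
    """Mean similarity between the question and each answer or paraphrase."""
    if not answers:
        raise ValueError("need at least one answer")
    q_vec = embed(question)
    return sum(cosine_similarity(q_vec, embed(a)) for a in answers) / len(answers)


question = "what does context precision measure"
on_topic = ["context precision measures whether retrieved documents are relevant"]
off_topic = ["the weather is sunny today"]

print(answer_relevance(question, on_topic))   # shares "context" and "precision"
print(answer_relevance(question, off_topic))  # no shared words
```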

Hallucination Detection

Definition: Hallucination in RAG

A hallucination occurs when the generated answer contains information not present in the retrieved context. In RAG, this is particularly dangerous because users trust the system to ground answers in retrieved documents.

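import re

CLAIM_LINE = re.compile(r"^\W*(SUPPORTED|UNSUPPORTED|CONTRADICTED)\W*:\s*(.+)$")

def parse_claims(response):
    """Parse "LABEL: claim" lines from the judge output (format assumed from the prompt)."""
    matches = (CLAIM_LINE.match(line.strip()) for line in response.splitlines())
    return [{"label": m.group(1), "claim": m.group(2).strip()} for m in matches if m]

def detect_hallucinations(answer, context, llm):
    """Detect claims in the answer not supported by context."""
    prompt = f"""Analyze this answer for hallucinations.

Question: [hidden]
Context: {context}
Answer: {answer}

Identify each claim in the answer, one per line as "LABEL: claim", using:
- SUPPORTED: The claim is supported by the context
- UNSUPPORTED: The claim is not found in the context
- CONTRADICTED: The claim contradicts the context

Claims:"""

    # llm is any client object exposing generate(prompt) -> str
    response = llm.generate(prompt)
    return parse_claims(response)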

RAGAS Framework

Definition: RAGAS

RAGAS (Retrieval Augmented Generation Assessment) is a framework that evaluates RAG systems using four core metrics: faithfulness, answer relevance, context precision, and context recall. It uses an LLM as a judge to score each component.

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

def evaluate_rag_system(dataset, rag_pipeline):
    """Evaluate a RAG system using RAGAS metrics."""
    # Run the pipeline on every sample and collect the fields RAGAS scores
    results = []
    for sample in dataset:
        answer = rag_pipeline.generate(sample["question"])
        results.append({
            "question": sample["question"],
            "answer": answer["response"],
            "contexts": answer["retrieved_contexts"],  # list of strings
            "ground_truth": sample["ground_truth"],
        })

    # evaluate() expects a dataset object rather than a plain list of dicts.
    # The column names above follow the ragas 0.1.x schema; other releases
    # may expect different names, so check the version you have installed.
    evaluation = evaluate(
        Dataset.from_list(results),
        metrics=[
            faithfulness,
            answer_relevancy,
            context_precision,
            context_recall,
        ],
    )

    return evaluation

RAGAS scores range from 0 to 1, where 1 is best. As a rough guide, production RAG systems often land around faithfulness 0.7-0.9, answer relevance 0.6-0.8, context precision 0.5-0.7, and context recall 0.6-0.8, though results depend heavily on the domain, the dataset, and the judge model.
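
One way to put such ranges to work is a simple regression gate that flags any metric falling below a floor. The sketch below uses the lower end of each range quoted above as the floor; those thresholds, the metric key names, and the `failing_metrics` helper are assumptions to adapt to your own system, not RAGAS functionality.

```python
# Assumed minimum acceptable scores, taken from the low end of the ranges above
MIN_SCORES = {
    "faithfulness": 0.7,
    "answer_relevancy": 0.6,
    "context_precision": 0.5,
    "context_recall": 0.6,
}


def failing_metrics(scores, minimums=MIN_SCORES):
    """Return the metrics whose score is missing or below its minimum."""
    return {
        name: scores.get(name)
        for name, minimum in minimums.items()
        if scores.get(name) is None or scores[name] < minimum
    }


# Hypothetical aggregate scores from one evaluation run
scores = {
    "faithfulness": 0.84,
    "answer_relevancy": 0.58,
    "context_precision": 0.61,
    "context_recall": 0.72,
}

print(failing_metrics(scores))  # {'answer_relevancy': 0.58}
```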

Evaluation Pipeline

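import re

class RAGEvaluationPipeline:
    def __init__(self, rag_system, llm_judge):
        # rag_system.generate(question) must return a dict with "response"
        # and "retrieved_contexts"; llm_judge.generate(prompt) returns text.
        self.rag_system = rag_system
        self.llm_judge = llm_judge

    def evaluate_single(self, question, ground_truth=None):
        """Evaluate a single query."""
        # Get RAG output
        output = self.rag_system.generate(question)

        # Evaluate faithfulness
        faithfulness = self.evaluate_faithfulness(
            output["response"],
            output["retrieved_contexts"]
        )

        # Evaluate answer relevance
        relevance = self.evaluate_answer_relevance(
            question,
            output["response"]
        )

        # Evaluate context precision
        precision = self.evaluate_context_precision(
            question,
            output["retrieved_contexts"],
            ground_truth
        )

        return {
            "question": question,
            "answer": output["response"],
            "faithfulness": faithfulness,
            "answer_relevance": relevance,
            "context_precision": precision,
            "num_retrieved": len(output["retrieved_contexts"])
        }

    @staticmethod
    def _parse_score(text):
        """Extract the first number from the judge's reply and clamp it to [0, 1]."""
        # Judges often wrap the number in extra words, so float(text) alone is fragile
        match = re.search(r"\d*\.?\d+", str(text))
        if match is None:
            raise ValueError(f"No numeric score in judge output: {text!r}")
        return min(1.0, max(0.0, float(match.group())))

    def evaluate_faithfulness(self, answer, contexts):
        """Evaluate if answer is faithful to context."""
        context_text = "\n\n".join(contexts)
        prompt = f"""Rate how faithful this answer is to the provided context.

Context: {context_text}
Answer: {answer}

Rate from 0 to 1 (0 = completely hallucinated, 1 = fully supported):"""

        return self._parse_score(self.llm_judge.generate(prompt))

    def evaluate_answer_relevance(self, question, answer):
        """Evaluate if answer addresses the question."""
        prompt = f"""Rate how relevant this answer is to the question.

Question: {question}
Answer: {answer}

Rate from 0 to 1 (0 = completely irrelevant, 1 = perfectly relevant):"""

        return self._parse_score(self.llm_judge.generate(prompt))

    def evaluate_context_precision(self, question, contexts, ground_truth=None):
        """Fraction of retrieved contexts the judge rates as useful.

        One possible implementation (an assumption): each context is judged
        independently and ranking is ignored.
        """
        if not contexts:
            return 0.0
        reference = f"\nReference answer: {ground_truth}" if ground_truth else ""
        relevant = 0
        for context in contexts:
            prompt = f"""Is this context useful for answering the question?

Question: {question}{reference}
Context: {context}

Reply with 1 if useful or 0 if not:"""
            if self._parse_score(self.llm_judge.generate(prompt)) >= 0.5:
                relevant += 1
        return relevant / len(contexts)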
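
Per-query scores become useful once they are aggregated over a test set. The sketch below assumes a list of result dicts with the same keys that `evaluate_single` returns and reports the mean of each metric plus the least faithful question as a starting point for failure analysis. The `summarize` helper and the sample values are illustrative.

```python
from statistics import mean

METRICS = ("faithfulness", "answer_relevance", "context_precision")


def summarize(results, metrics=METRICS):
    """Average each metric over per-query results and flag the weakest query."""
    if not results:
        raise ValueError("no results to summarize")
    averages = {m: mean(r[m] for r in results) for m in metrics}
    worst = min(results, key=lambda r: r["faithfulness"])
    return {"averages": averages, "least_faithful_question": worst["question"]}


# Hypothetical per-query results in the shape produced by evaluate_single
results = [
    {"question": "What is RAG?", "faithfulness": 0.9,
     "answer_relevance": 0.8, "context_precision": 1.0},
    {"question": "Who maintains the index?", "faithfulness": 0.5,
     "answer_relevance": 0.6, "context_precision": 0.5},
]

summary = summarize(results)
print(summary["averages"])
print(summary["least_faithful_question"])  # Who maintains the index?
```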

Benchmarking RAG Systems

Benchmark Datasets

Dataset	Domain	Questions	Context	Type
HotpotQA	Wikipedia	113K	Multi-doc	Multi-hop
Natural Questions	Google Search	307K	Wikipedia	Open-domain
TriviaQA	Web	95K	Web + Wikipedia	Open-domain
MS MARCO	Web	1M	Web passages	Passage ranking
FiQA	Finance	6K	Financial docs	Domain-specific

Evaluation Results Comparison

System	Faithfulness	Relevance	Context Precision	Latency
Naive RAG	0.72	0.65	0.58	200ms
Hybrid Search RAG	0.78	0.71	0.65	250ms
Re-ranked RAG	0.82	0.75	0.72	300ms
Self-RAG	0.85	0.78	0.70	350ms
Graph RAG	0.80	0.82	0.68	400ms

Practice Exercises

Metric Comparison: Compare faithfulness scores using different LLM judges (GPT-4 vs Claude vs Llama). How much do scores vary across judges?
Failure Analysis: Analyze 50 RAG failures. Categorize them as retrieval failures, generation failures, or integration failures. What is the most common failure mode?
A/B Testing: Design an A/B test to compare two RAG configurations. What metrics would you track and how long would you run the test?
Custom Metrics: Design a custom evaluation metric for RAG in a specific domain (e.g., medical, legal). What unique aspects would you measure?

Key Takeaways

Summary: RAG Evaluation and Benchmarking

Three pillars: faithfulness, answer relevance, context precision
Faithfulness measures if answer is grounded in retrieved context
Answer relevance measures if answer addresses the question
Context precision measures if retrieval finds the right documents
RAGAS provides automated evaluation using LLM-as-judge
Hallucination detection identifies unsupported claims
Benchmark datasets enable standardized comparison
A/B testing validates improvements in production

What to Learn Next

-> RAG System Design: Advanced RAG architecture and design patterns.

-> LLM Evaluation Benchmarks: Evaluating LLMs on standard benchmarks.

-> Self-RAG and Adaptive Retrieval: When to retrieve and when to rely on knowledge.

-> Retrieval-Augmented Generation: RAG fundamentals and basic implementation.

-> Model Evaluation: ML evaluation metrics and techniques.

-> Agentic RAG Systems: Agent-based approaches to retrieval.
