RAG Evaluation and Benchmarking


RAG Evaluation — Measuring What Matters

RAG systems have multiple failure modes: irrelevant retrieval, hallucinated answers, unsupported claims. Proper evaluation requires measuring retrieval quality, generation faithfulness, and answer relevance simultaneously.

  • Faithfulness — Does the answer stay true to the retrieved context?
  • Answer Relevance — Does the answer actually address the question?
  • Context Precision — Are the retrieved documents relevant and well-ranked?

You cannot improve what you cannot measure.

Evaluating RAG systems requires assessing multiple components: the retriever, the generator, and their interaction. A system might retrieve perfect documents but generate hallucinations, or retrieve irrelevant documents but generate plausible-sounding answers.

Definition: RAG Evaluation

RAG evaluation measures the quality of retrieval-augmented generation across three dimensions: (1) retrieval quality — are the right documents found? (2) faithfulness — does the answer align with retrieved context? (3) answer relevance — does the answer address the question?

Core Evaluation Metrics

Retrieval Metrics

Context Precision@k

$$\text{Precision@k} = \frac{|\{\text{relevant docs in top-k}\}|}{k}$$

Here,

  • k = Number of retrieved documents
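The Precision@k formula can be sketched in a few lines of Python. This is a minimal illustration, assuming documents are identified by string IDs and that a labeled set of relevant IDs is available; the function name and the sample IDs are hypothetical.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_ids)
    # Count how many of the first k retrieved documents are labeled relevant.
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant)
    return hits / k


# Hypothetical ranking and relevance labels for a single query.
retrieved = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d3", "d8"}
print(precision_at_k(retrieved, relevant, k=3))
```

Note that the denominator is always k, so a retriever that returns fewer than k documents is penalized for the empty slots.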

Context Recall

$$\text{Recall} = \frac{|\{\text{relevant docs retrieved}\}|}{|\{\text{total relevant docs}\}|}$$

Here,

  • relevant docs = Documents that contain information needed to answer the question
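A matching sketch for context recall, again assuming ID-based relevance labels. The function name is hypothetical, and returning 0.0 when no relevant documents exist is a convention chosen here, not something the formula prescribes.

```python
def context_recall(retrieved_ids, relevant_ids):
    """Fraction of all relevant documents that appear in the retrieved set."""
    relevant = set(relevant_ids)
    if not relevant:
        # Assumption: treat recall as 0.0 when there is nothing to find.
        return 0.0
    found = relevant & set(retrieved_ids)
    return len(found) / len(relevant)


# Hypothetical example: two of the three relevant documents were retrieved.
retrieved = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d3", "d8"}
print(context_recall(retrieved, relevant))
```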

Mean Reciprocal Rank (MRR)

$$\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}$$

Here,

  • |Q| = Number of queries
  • rank_i = Rank of the first relevant document for query i
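MRR can be computed as follows. This sketch assumes one ranked list and one set of relevant IDs per query, and it uses the common convention that a query with no relevant document in its ranking contributes 0 to the sum. The function name and data are illustrative.

```python
def mean_reciprocal_rank(ranked_results, relevant_sets):
    """Average of 1/rank of the first relevant document across queries."""
    if not ranked_results:
        return 0.0
    total = 0.0
    for retrieved, relevant in zip(ranked_results, relevant_sets):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant document counts
    return total / len(ranked_results)


# Hypothetical rankings for three queries.
rankings = [["a", "b", "c"], ["x", "y", "z"], ["p", "q", "r"]]
relevant = [{"a"}, {"z"}, {"missing"}]
print(mean_reciprocal_rank(rankings, relevant))
```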

Generation Metrics

Definition: Faithfulness

Faithfulness measures whether the generated answer is supported by the retrieved context. An answer is faithful if every claim it makes can be attributed to the retrieved documents.

Faithfulness Score

$$F = \frac{|\{\text{supported claims}\}|}{|\{\text{total claims}\}|}$$

Here,

  • supported claims = Claims in the answer that are backed by retrieved context
  • total claims = All claims made in the answer
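Once each claim has been judged, the faithfulness score is a simple ratio. The sketch below assumes the claims have already been extracted and labeled (by a human or an LLM judge) as `(claim_text, is_supported)` pairs; that input format and the example claims are hypothetical.

```python
def faithfulness_score(judged_claims):
    """Share of claims marked as supported by the retrieved context.

    judged_claims: list of (claim_text, is_supported) pairs.
    """
    if not judged_claims:
        raise ValueError("At least one claim is required to compute faithfulness")
    supported = sum(1 for _, is_supported in judged_claims if is_supported)
    return supported / len(judged_claims)


# Hypothetical judged claims for one answer: 3 of 4 are supported.
claims = [
    ("The library was released in 2019.", True),
    ("It supports Python 3.", True),
    ("It is written in Rust.", False),
    ("It is open source.", True),
]
print(faithfulness_score(claims))
```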

Definition: Answer Relevance

Answer Relevance measures whether the generated answer actually addresses the question. An answer can be faithful but irrelevant (correctly citing documents that don't answer the question).

Answer Relevance Score

$$\text{AR} = \frac{1}{N} \sum_{i=1}^{N} \text{sim}(q, a_i)$$

Here,

  • q = Original question
  • a_i = Generated answer (or paraphrase)
  • N = Number of generated answers/paraphrases
  • sim = Similarity function, for example cosine similarity between embeddings
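The averaging step in the answer relevance formula can be illustrated with cosine similarity over embedding vectors. This is a sketch under assumptions: the vectors below are hand-made stand-ins, whereas a real system would obtain them from an embedding model, and both function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: zero vectors are treated as dissimilar
    return dot / (norm_u * norm_v)


def answer_relevance(question_vec, answer_vecs):
    """Mean similarity between the question and each answer/paraphrase."""
    if not answer_vecs:
        raise ValueError("At least one answer vector is required")
    sims = [cosine_similarity(question_vec, a) for a in answer_vecs]
    return sum(sims) / len(sims)


# Toy 2-D "embeddings": one answer aligned with the question, one orthogonal.
q = [1.0, 0.0]
answers = [[1.0, 0.0], [0.0, 1.0]]
print(answer_relevance(q, answers))
```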

Hallucination Detection

Definition: Hallucination in RAG

A hallucination occurs when the generated answer contains information not present in the retrieved context. In RAG, this is particularly dangerous because users trust the system to ground answers in retrieved documents.

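import re

def parse_claims(response):
    """Parse judge output, assuming one claim per line as 'LABEL: claim'."""
    pattern = re.compile(r"^\W*(SUPPORTED|UNSUPPORTED|CONTRADICTED)\s*:\s*(.+)$")
    claims = []
    for line in response.splitlines():
        match = pattern.match(line.strip())
        if match:
            claims.append({"label": match.group(1), "claim": match.group(2)})
    return claims

def detect_hallucinations(question, answer, context, llm):
    """Detect claims in the answer not supported by context."""
    prompt = f"""Analyze this answer for hallucinations.

Question: {question}
Context: {context}
Answer: {answer}

Identify each claim in the answer and mark it as:
- SUPPORTED: The claim is supported by the context
- UNSUPPORTED: The claim is not found in the context
- CONTRADICTED: The claim contradicts the context

Write one claim per line as "LABEL: claim".

Claims:"""

    # llm is assumed to expose generate(prompt) -> str
    response = llm.generate(prompt)
    return parse_claims(response)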

RAGAS Framework

Definition: RAGAS

RAGAS (Retrieval Augmented Generation Assessment) is a framework that evaluates RAG systems using four core metrics: faithfulness, answer relevance, context precision, and context recall. It uses an LLM as a judge to score each component.

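from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

def evaluate_rag_system(dataset, rag_pipeline):
    """Evaluate a RAG system using RAGAS metrics."""
    # Generate answers. rag_pipeline.generate is assumed to return a dict
    # with "response" (str) and "retrieved_contexts" (list of str).
    results = []
    for sample in dataset:
        answer = rag_pipeline.generate(sample["question"])
        results.append({
            "question": sample["question"],
            "answer": answer["response"],
            "contexts": answer["retrieved_contexts"],
            "ground_truth": sample["ground_truth"]
        })

    # evaluate() expects a dataset object rather than a plain list of dicts.
    # These column names follow the ragas 0.1.x schema; newer releases may
    # expect different names, so check the docs for your installed version.
    eval_dataset = Dataset.from_list(results)

    # Evaluate with RAGAS
    evaluation = evaluate(
        eval_dataset,
        metrics=[
            faithfulness,
            answer_relevancy,
            context_precision,
            context_recall,
        ]
    )

    return evaluation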

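RAGAS scores range from 0 to 1, where 1 is perfect. As a rough guide, production RAG systems often land in these ranges: faithfulness 0.7-0.9, answer relevance 0.6-0.8, context precision 0.5-0.7, context recall 0.6-0.8. Treat these as indicative only, since scores shift with the dataset, the domain, and the judge model.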

Evaluation Pipeline

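import re


class RAGEvaluationPipeline:
    """Scores RAG outputs with an LLM judge.

    Assumes rag_system.generate(question) returns a dict with "response" and
    "retrieved_contexts", and llm_judge.generate(prompt) returns a string.
    """

    def __init__(self, rag_system, llm_judge):
        self.rag_system = rag_system
        self.llm_judge = llm_judge

    def evaluate_single(self, question, ground_truth=None):
        """Evaluate a single query."""
        # Get RAG output
        output = self.rag_system.generate(question)

        # Evaluate faithfulness
        faithfulness = self.evaluate_faithfulness(
            output["response"],
            output["retrieved_contexts"]
        )

        # Evaluate answer relevance
        relevance = self.evaluate_answer_relevance(
            question,
            output["response"]
        )

        # Evaluate context precision
        precision = self.evaluate_context_precision(
            question,
            output["retrieved_contexts"],
            ground_truth
        )

        return {
            "question": question,
            "answer": output["response"],
            "faithfulness": faithfulness,
            "answer_relevance": relevance,
            "context_precision": precision,
            "num_retrieved": len(output["retrieved_contexts"])
        }

    def _judge_score(self, prompt):
        """Ask the judge and parse the first number in its reply, clamped to [0, 1]."""
        reply = str(self.llm_judge.generate(prompt))
        match = re.search(r"\d*\.?\d+", reply)
        if match is None:
            raise ValueError(f"Judge returned no numeric score: {reply!r}")
        return min(1.0, max(0.0, float(match.group())))

    def evaluate_faithfulness(self, answer, contexts):
        """Evaluate if answer is faithful to context."""
        joined_context = " ".join(contexts)
        prompt = f"""Rate how faithful this answer is to the provided context.

Context: {joined_context}
Answer: {answer}

Rate from 0 to 1 (0 = completely hallucinated, 1 = fully supported):"""

        return self._judge_score(prompt)

    def evaluate_answer_relevance(self, question, answer):
        """Evaluate if answer addresses the question."""
        prompt = f"""Rate how relevant this answer is to the question.

Question: {question}
Answer: {answer}

Rate from 0 to 1 (0 = completely irrelevant, 1 = perfectly relevant):"""

        return self._judge_score(prompt)

    def evaluate_context_precision(self, question, contexts, ground_truth=None):
        """Fraction of retrieved contexts the judge rates as relevant.

        Assumed implementation: the judge labels each context separately,
        which corresponds to Precision@k with k = number of retrieved contexts.
        """
        if not contexts:
            return 0.0
        reference = f"\nReference answer: {ground_truth}" if ground_truth else ""
        relevant = 0
        for context in contexts:
            prompt = f"""Is this context useful for answering the question?

Question: {question}{reference}
Context: {context}

Reply with 1 if relevant, 0 if not:"""
            if self._judge_score(prompt) >= 0.5:
                relevant += 1
        return relevant / len(contexts)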
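Per-query scores are usually rolled up into dataset-level averages before systems are compared. The helper below is a minimal sketch of that step, assuming each result is a dict with the metric keys shown in `evaluate_single`; the function name `summarize_results` is hypothetical, and skipping `None` values is a design choice made here.

```python
from statistics import mean


def summarize_results(results, metrics=("faithfulness", "answer_relevance", "context_precision")):
    """Average each metric over a list of per-query result dicts."""
    summary = {}
    for name in metrics:
        # Ignore queries where a metric is missing or could not be scored.
        values = [r[name] for r in results if r.get(name) is not None]
        summary[name] = mean(values) if values else None
    return summary


# Hypothetical per-query results.
results = [
    {"faithfulness": 1.0, "answer_relevance": 0.5, "context_precision": 0.5},
    {"faithfulness": 0.5, "answer_relevance": 0.5, "context_precision": None},
]
print(summarize_results(results))
```

Averages hide the distribution, so it is worth also inspecting the lowest-scoring queries individually.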

Benchmarking RAG Systems

Benchmark Datasets

| Dataset | Domain | Questions | Context | Type |
|---|---|---|---|---|
| HotpotQA | Wikipedia | 113K | Multi-doc | Multi-hop |
| Natural Questions | Google Search | 307K | Wikipedia | Open-domain |
| TriviaQA | Web | 95K | Web + Wikipedia | Open-domain |
| MS MARCO | Web | 1M | Web passages | Passage ranking |
| FiQA | Finance | 6K | Financial docs | Domain-specific |

Evaluation Results Comparison

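| System | Faithfulness | Relevance | Context Precision | Latency |
|---|---|---|---|---|
| Naive RAG | 0.72 | 0.65 | 0.58 | 200 ms |
| Hybrid Search RAG | 0.78 | 0.71 | 0.65 | 250 ms |
| Re-ranked RAG | 0.82 | 0.75 | 0.72 | 300 ms |
| Self-RAG | 0.85 | 0.78 | 0.70 | 350 ms |
| Graph RAG | 0.80 | 0.82 | 0.68 | 400 ms |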

Practice Exercises

  1. Metric Comparison: Compare faithfulness scores using different LLM judges (GPT-4 vs Claude vs Llama). How much do scores vary across judges?

  2. Failure Analysis: Analyze 50 RAG failures. Categorize them as retrieval failures, generation failures, or integration failures. What is the most common failure mode?

  3. A/B Testing: Design an A/B test to compare two RAG configurations. What metrics would you track and how long would you run the test?

  4. Custom Metrics: Design a custom evaluation metric for RAG in a specific domain (e.g., medical, legal). What unique aspects would you measure?
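For the A/B testing exercise, one possible starting point is a paired bootstrap over per-query score differences, which gives a confidence interval for the improvement of configuration B over A. This is a sketch, not a prescribed method: it assumes both configurations were scored on the same queries, and the function name, resample count, and seed are arbitrary choices.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Return (mean_diff, lower, upper) for the mean of B - A per query."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("Need two non-empty score lists of equal length")
    rng = random.Random(seed)  # seeded so repeated runs give the same interval
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    resampled_means = []
    for _ in range(n_resamples):
        # Resample queries with replacement and record the mean difference.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        resampled_means.append(sum(sample) / n)
    resampled_means.sort()
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


# Hypothetical faithfulness scores for the same ten queries under A and B.
a = [0.70, 0.65, 0.80, 0.75, 0.60, 0.72, 0.68, 0.77, 0.81, 0.66]
b = [0.78, 0.70, 0.79, 0.82, 0.71, 0.75, 0.74, 0.80, 0.85, 0.73]
mean_diff, low, high = paired_bootstrap_ci(a, b)
print(mean_diff, low, high)
```

If the interval excludes 0, the difference is unlikely to be resampling noise alone; with only a handful of queries the interval will be wide, which is itself a useful signal about how long to run the test.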

Key Takeaways

Summary: RAG Evaluation and Benchmarking

  • Three pillars: faithfulness, answer relevance, context precision
  • Faithfulness measures whether the answer is grounded in the retrieved context
  • Answer relevance measures whether the answer addresses the question
  • Context precision measures whether retrieval finds the right documents
  • RAGAS provides automated evaluation using an LLM as a judge
  • Hallucination detection identifies unsupported claims
  • Benchmark datasets enable standardized comparison
  • A/B testing validates improvements in production

