RAG System Design

Building production-quality retrieval-augmented generation (RAG) systems requires careful choices about retrieval strategy, evaluation, and architecture. This tutorial covers advanced retrieval strategies, chunking, evaluation metrics, and production architecture.

Definition: Advanced RAG

Advanced RAG goes beyond basic retrieve-and-generate by incorporating hybrid search, re-ranking, query expansion, and iterative retrieval to improve answer quality and faithfulness.

Advanced Retrieval Strategies

Hybrid Search

Combine sparse (keyword) and dense (semantic) retrieval:

Hybrid Search Score

$$\text{score}(q, d) = \alpha \cdot \text{dense\_sim}(q, d) + (1 - \alpha) \cdot \text{sparse\_sim}(q, d)$$

Here,

  • $\alpha$ = Weight for dense retrieval (typically 0.5-0.8)
  • $q$ = Query
  • $d$ = Document
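
A minimal sketch of the weighted combination above. It assumes that both score sets are min-max normalized to [0, 1] before mixing; the formula itself does not specify a normalization, but keyword scores such as BM25 and cosine similarities usually live on different scales. The function names and the example scores are illustrative only.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} dict to [0, 1]; constant scores map to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_scores(dense, sparse, alpha=0.7):
    """Combine scores as alpha * dense_sim + (1 - alpha) * sparse_sim."""
    dense_n = min_max_normalize(dense)
    sparse_n = min_max_normalize(sparse)
    doc_ids = set(dense_n) | set(sparse_n)
    # Assumption: a document missing from one retriever scores 0 for that part.
    combined = {
        doc_id: alpha * dense_n.get(doc_id, 0.0)
        + (1 - alpha) * sparse_n.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Made-up scores for four documents.
dense = {"doc_a": 0.82, "doc_b": 0.75, "doc_c": 0.40}
sparse = {"doc_b": 12.1, "doc_c": 9.3, "doc_d": 3.0}
ranking = hybrid_scores(dense, sparse, alpha=0.7)
print([doc_id for doc_id, _ in ranking])
```

Here `doc_b` ranks first because it scores well in both retrievers, even though `doc_a` has the highest dense score on its own.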

Re-ranking

After initial retrieval, re-rank results using a cross-encoder:

Definition: Cross-Encoder Re-ranking

A cross-encoder takes the query and document as a single input and outputs a relevance score. Unlike bi-encoders, cross-encoders can model fine-grained query-document interactions, but they are slower: document embeddings cannot be pre-computed, so every query-document pair requires a full forward pass at query time.

Re-ranking Score

$$\text{score}(q, d) = \text{cross\_encoder}(\text{concat}(q, d))$$

Here,

  • $q$ = Query
  • $d$ = Document
  • $\text{concat}$ = Concatenation with separator
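
The re-ranking step can be sketched independently of any particular model. In the sketch below, `score_fn` is a placeholder for the cross-encoder, and `toy_overlap_score` is a deliberately simple stand-in (word overlap, not a neural model) so the example runs without extra dependencies. In a real pipeline, `score_fn` would wrap a trained cross-encoder, for example one loaded through the sentence-transformers `CrossEncoder` class.

```python
def rerank(query, candidates, score_fn, top_k=3):
    """Score each (query, document) pair jointly and return the top_k pairs."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def toy_overlap_score(query, doc):
    """Stand-in scorer (NOT a cross-encoder): fraction of query words in doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


candidates = [
    "Bi-encoders embed queries and documents separately.",
    "A cross-encoder scores the query and document together.",
    "Chunk size affects retrieval granularity.",
]
top = rerank(
    "how does a cross-encoder score a document",
    candidates,
    toy_overlap_score,
    top_k=2,
)
print(top[0][0])
```

Because the scorer sees the query and the document together, the first-stage retriever can stay cheap and return a generous candidate list, with the expensive scoring applied only to that short list.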

Query Expansion

Expand the query to capture more relevant documents:

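```python
def expand_query(query, llm, n_expansions=3):
    """Return the original query plus up to n_expansions LLM rephrasings."""
    prompt = f"""Generate {n_expansions} alternative phrasings of this query, one per line:

Original: {query}

Alternatives:"""

    # `llm` is any client object that exposes generate(prompt) -> str.
    response = llm.generate(prompt)
    # Drop blank lines and surrounding whitespace from the model output.
    alternatives = [line.strip() for line in response.split("\n") if line.strip()]
    return [query] + alternatives[:n_expansions]
```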
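
Once a query has been expanded, each variant still has to be run against the index and the results merged. One simple option, sketched below, keeps the best score seen for each document across all variants. The names `retrieve_with_expansion`, `fake_search`, and `FAKE_INDEX` are hypothetical; `search_fn` stands in for whatever retriever the system uses and is assumed to return `(doc_id, score)` pairs.

```python
def retrieve_with_expansion(queries, search_fn, top_k=5):
    """Run every query variant and keep each document's best score."""
    best = {}
    for q in queries:
        for doc_id, score in search_fn(q):
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Hypothetical search backend: a fixed lookup table of (doc_id, score) results.
FAKE_INDEX = {
    "reset password": [("doc_1", 0.9), ("doc_2", 0.4)],
    "recover account access": [("doc_3", 0.8), ("doc_1", 0.6)],
}


def fake_search(query):
    return FAKE_INDEX.get(query, [])


results = retrieve_with_expansion(
    ["reset password", "recover account access"], fake_search
)
print(results)
```

The second phrasing surfaces `doc_3`, which the original query alone would have missed.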

Multi-Step Retrieval

Iteratively refine retrieval based on intermediate reasoning:

Definition: Iterative RAG

Iterative RAG performs multiple rounds of retrieval and generation. The model generates intermediate reasoning, identifies knowledge gaps, retrieves additional information, and refines its answer.
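
A minimal sketch of that loop, with stub components so it runs on its own. It assumes the generator returns both a draft answer and an optional follow-up query describing the remaining knowledge gap; this interface, and the names `iterative_rag`, `stub_retrieve`, and `stub_generate`, are illustrative rather than a standard API.

```python
def iterative_rag(question, retrieve_fn, generate_fn, max_rounds=3):
    """Alternate retrieval and generation until no follow-up query remains."""
    context = []
    query = question
    answer = ""
    for _ in range(max_rounds):
        # Add newly retrieved passages, skipping ones already in the context.
        for passage in retrieve_fn(query):
            if passage not in context:
                context.append(passage)
        answer, follow_up = generate_fn(question, context)
        if follow_up is None:  # the generator reports no remaining gap
            break
        query = follow_up
    return answer, context


# Stub components, for illustration only.
KB = {
    "What is the capital of the country where the Eiffel Tower is?": [
        "The Eiffel Tower is in France."
    ],
    "What is the capital of France?": ["The capital of France is Paris."],
}


def stub_retrieve(query):
    return KB.get(query, [])


def stub_generate(question, context):
    if any("capital of France is" in p for p in context):
        return "Paris", None
    return "Unknown so far", "What is the capital of France?"


answer, context = iterative_rag(
    "What is the capital of the country where the Eiffel Tower is?",
    stub_retrieve,
    stub_generate,
)
print(answer, len(context))
```

The `max_rounds` cap matters in practice: each round adds retrieval and generation latency, and a generator that keeps proposing follow-up queries would otherwise never terminate.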

Chunking Strategies

Fixed-Size Chunking

Split documents into fixed-size chunks with optional overlap:

Fixed Chunking

$$c_i = \text{text}[i \cdot s : i \cdot s + w]$$

Here,

  • $s$ = Stride (step size)
  • $w$ = Window (chunk size)
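
A short sketch of the formula in code, measured in characters (a token-based variant would slice a token list instead). When the stride is smaller than the window, consecutive chunks overlap by window minus stride characters. The function name and the stopping rule at the end of the text are choices made for this example.

```python
def fixed_size_chunks(text, window=200, stride=150):
    """Return chunks c_i = text[i*stride : i*stride + window]."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + window])
        if start + window >= len(text):
            break  # the last chunk already reaches the end of the text
        start += stride
    return chunks


chunks = fixed_size_chunks("abcdefghij", window=4, stride=3)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

With a window of 4 and a stride of 3, each chunk shares one character with its neighbor.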

Semantic Chunking

Split documents at semantic boundaries (paragraphs, sections, topics):

Definition: Semantic Chunking

Semantic chunking uses the document's natural structure (headers, paragraphs, topic shifts) to create meaningful chunks that preserve context and coherence.
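
As a rough illustration of the structural part of this idea, the sketch below splits a Markdown-like document at blank lines and attaches the most recent header to each paragraph. It assumes headers start with `#`. Detecting topic shifts inside a section would need additional machinery (for example, comparing embeddings of adjacent sentences) and is not covered here.

```python
import re


def structural_chunks(document):
    """Split a Markdown-like document into chunks tagged with their header."""
    chunks = []
    header = ""
    # Blank lines separate blocks; a block starting with '#' opens a section.
    for block in re.split(r"\n\s*\n", document.strip()):
        block = block.strip()
        if block.startswith("#"):
            # First line is the header; any remaining lines are section text.
            first_line, _, block = block.partition("\n")
            header = first_line.lstrip("#").strip()
            block = block.strip()
        if block:
            chunks.append({"header": header, "text": block})
    return chunks


doc = """# Setup

Install the package.

Configure the API key.

# Usage

Call the client with a query."""

chunks = structural_chunks(doc)
for chunk in chunks:
    print(chunk)
```

Keeping the header alongside each paragraph preserves some of the surrounding context that a fixed-size split would discard.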

Recursive Chunking

Split documents hierarchically, using larger separators first:

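```python
def recursive_split(text, separators=("\n\n", "\n", ". ", " "), chunk_size=512):
    """Split text into chunks of at most chunk_size characters, larger separators first."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # No separators left: fall back to a hard character split.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= chunk_size:
            current = candidate  # part still fits in the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # part alone is too long: recurse with the finer separators
            chunks.extend(recursive_split(part, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks
```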

Evaluation Metrics

Retrieval Metrics

Precision@k

$$\text{Precision@k} = \frac{|\text{relevant docs in top-k}|}{k}$$

Here,

  • $k$ = Number of retrieved documents

Recall@k

$$\text{Recall@k} = \frac{|\text{relevant docs in top-k}|}{|\text{total relevant docs}|}$$

Here,

  • $k$ = Number of retrieved documents

Mean Reciprocal Rank (MRR)

$$\text{MRR} = \frac{1}{Q} \sum_{i=1}^{Q} \frac{1}{\text{rank}_i}$$

Here,

  • $Q$ = Number of queries
  • $\text{rank}_i$ = Rank of the first relevant document for query $i$
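
The three retrieval metrics are short enough to implement directly. The sketch below assumes each ranked list contains no duplicate document IDs and that a query with no relevant hit contributes 0 to MRR; the document IDs are made up.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / len(relevant)


def mean_reciprocal_rank(results):
    """results: list of (retrieved_list, relevant_set) pairs, one per query."""
    total = 0.0
    for retrieved, relevant in results:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(results) if results else 0.0


retrieved = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}
p = precision_at_k(retrieved, relevant, k=5)  # 2 relevant docs in the top 5
r = recall_at_k(retrieved, relevant, k=5)     # 2 of the 3 relevant docs found
mrr = mean_reciprocal_rank([(retrieved, relevant), (["d5", "d4"], {"d5"})])
print(p, r, mrr)
```

In this example the first query finds its first relevant document at rank 2 and the second query at rank 1, so MRR is (1/2 + 1) / 2 = 0.75.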

Generation Metrics

Faithfulness

$$\text{Faithfulness} = \frac{\text{statements supported by context}}{\text{total statements}}$$

Here,

  • statements = Factual claims in the generated answer
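
A worked example of the ratio. Deciding whether a statement is supported is the hard part and is usually delegated to a human annotator or an LLM judge; the sketch below assumes those judgments already exist and only computes the score. The statements and labels are invented for illustration.

```python
def faithfulness(judgments):
    """judgments: list of (statement, is_supported) pairs for one answer."""
    if not judgments:
        return 0.0
    supported = sum(1 for _, ok in judgments if ok)
    return supported / len(judgments)


judgments = [
    ("The warranty lasts two years.", True),
    ("Returns are free within 30 days.", True),
    ("Shipping is always overnight.", False),  # not found in the context
    ("Support is available by email.", True),
]
score = faithfulness(judgments)  # 3 of 4 statements supported
print(score)  # 0.75
```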

Answer Relevance

$$\text{Relevance} = \text{sim}(\text{answer\_embedding}, \text{question\_embedding})$$

Here,

  • $\text{sim}$ = Semantic similarity between answer and question
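
The formula leaves the similarity function open. A common choice is cosine similarity, which the sketch below assumes. The three-dimensional vectors are toy values; real embeddings would come from an embedding model and have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


question_embedding = [1.0, 0.0, 1.0]
answer_embedding = [1.0, 1.0, 1.0]
relevance = cosine_similarity(answer_embedding, question_embedding)
print(round(relevance, 4))
```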

Use RAGAS (Retrieval Augmented Generation Assessment) for comprehensive RAG evaluation: faithfulness, answer relevance, context precision, and context recall.

Production RAG Architecture

Components

  1. Document Processing Pipeline: Ingestion, chunking, embedding, indexing
  2. Query Processing: Query understanding, expansion, routing
  3. Retrieval Layer: Hybrid search with re-ranking
  4. Generation Layer: LLM with context integration
  5. Post-processing: Answer validation, citation generation
  6. Monitoring: Latency, quality, drift detection
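
To show how the query-time components (items 2 to 6) fit together, here is a small skeleton in which each stage is a pluggable callable. The class name `RAGPipeline`, its field names, and the lambda stubs are all hypothetical; the offline document processing pipeline (item 1) is left out, and monitoring is reduced to recording per-request latency.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RAGPipeline:
    expand: Callable      # query processing: query -> list of query variants
    retrieve: Callable    # retrieval layer: query -> list of passages
    generate: Callable    # generation layer: (query, passages) -> draft answer
    validate: Callable    # post-processing: (draft, passages) -> final result
    latencies_ms: List[float] = field(default_factory=list)  # monitoring

    def answer(self, query):
        start = time.perf_counter()
        passages = []
        for variant in self.expand(query):
            for passage in self.retrieve(variant):
                if passage not in passages:  # de-duplicate, keep first-seen order
                    passages.append(passage)
        draft = self.generate(query, passages)
        result = self.validate(draft, passages)
        self.latencies_ms.append((time.perf_counter() - start) * 1000)
        return result


# Stub stages, for illustration only.
pipeline = RAGPipeline(
    expand=lambda q: [q, q.lower()],
    retrieve=lambda q: ["passage about " + q.lower()],
    generate=lambda q, passages: f"Answer based on {len(passages)} passage(s)",
    validate=lambda draft, passages: {"answer": draft, "citations": passages},
)
result = pipeline.answer("Chunking")
print(result)
```

Keeping each stage behind a plain callable makes it straightforward to swap a component, such as replacing the retriever, without touching the rest of the pipeline.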

Scaling Considerations

| Component | Small Scale | Production Scale |
|---|---|---|
| Documents | < 100K | > 10M |
| Queries/sec | < 10 | > 1000 |
| Latency target | < 5 s | < 500 ms |
| Vector DB | ChromaDB | Pinecone/Qdrant |
| Embedding | Local model | API (OpenAI/Cohere) |
| LLM | Self-hosted | API with fallback |
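
The "API with fallback" entry can be sketched as a loop over providers in priority order. The function and provider names below are hypothetical, and the two provider functions are stubs that simulate a timeout and a success; a real implementation would call actual client libraries and catch their specific error types.

```python
def generate_with_fallback(prompt, providers):
    """Try each (name, call_fn) provider in order; return the first success."""
    errors = []
    for name, call_fn in providers:
        try:
            return name, call_fn(prompt)
        except Exception as exc:  # in practice, catch the client's own errors
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def flaky_primary(prompt):
    raise TimeoutError("primary timed out")


def stable_backup(prompt):
    return "backup answer to: " + prompt


used, text = generate_with_fallback(
    "What is RAG?",
    [("primary", flaky_primary), ("backup", stable_backup)],
)
print(used, "->", text)
```

Returning the provider name along with the text makes it possible to monitor how often the fallback path is taken.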

Practice Exercises

  1. Hybrid Search: Implement hybrid search combining BM25 and FAISS. Compare with dense-only retrieval.
  2. Re-ranking: Add a cross-encoder re-ranker to your RAG pipeline. Measure improvement in precision@5.
  3. Evaluation: Build a gold standard evaluation set with 100 queries. Measure precision, recall, MRR, and faithfulness.
  4. Production: Design a RAG system for a knowledge base with 1M documents. Address latency, scalability, and cost.

Key Takeaways:

  • Hybrid search combines keyword and semantic retrieval for better coverage
  • Cross-encoder re-ranking can improve the quality of the top results, at the cost of extra latency per query
  • Chunking strategy affects retrieval granularity and context quality
  • Precision@k, recall@k, and MRR measure retrieval quality
  • Faithfulness and answer relevance measure generation quality
  • Production RAG requires careful attention to latency, scalability, and monitoring
