RAG System Design
Building production-quality RAG systems requires careful consideration of retrieval strategy, evaluation, and architecture. This tutorial covers advanced RAG design patterns and evaluation metrics.
Definition: Advanced RAG
Advanced RAG goes beyond basic retrieve-and-generate by incorporating hybrid search, re-ranking, query expansion, and iterative retrieval to improve answer quality and faithfulness.
Advanced Retrieval Strategies
Hybrid Search
Combine sparse (keyword) and dense (semantic) retrieval:
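Hybrid Search Score
$$\text{score}(q, d) = \alpha \cdot \text{score}_{\text{dense}}(q, d) + (1 - \alpha) \cdot \text{score}_{\text{sparse}}(q, d)$$
Here,
- $\alpha$ = Weight for dense retrieval (typically 0.5-0.8)
- $q$ = Query
- $d$ = Document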
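The weighted combination of dense and sparse scores can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the min-max normalization step, the function names, and the sample scores are all assumptions added here, since raw BM25 and cosine scores live on different scales and need some rescaling before they can be blended.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to [0, 1] so both retrievers are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores identical: treat every document as equally relevant.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_scores(dense, sparse, alpha=0.7):
    """Blend normalized scores as alpha * dense + (1 - alpha) * sparse."""
    dense_n = min_max_normalize(dense)
    sparse_n = min_max_normalize(sparse)
    doc_ids = set(dense_n) | set(sparse_n)
    # A document missing from one retriever contributes 0 for that retriever.
    return {
        doc_id: alpha * dense_n.get(doc_id, 0.0) + (1 - alpha) * sparse_n.get(doc_id, 0.0)
        for doc_id in doc_ids
    }


# Hypothetical raw scores: cosine similarities (dense) and BM25 scores (sparse).
dense = {"d1": 0.82, "d2": 0.75, "d3": 0.40}
sparse = {"d2": 12.0, "d3": 3.0, "d4": 9.0}
scores = hybrid_scores(dense, sparse, alpha=0.7)
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

In this toy data, `d2` ranks first because both retrievers score it highly, while `d1` and `d4` are each found by only one retriever.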
Re-ranking
After initial retrieval, re-rank results using a cross-encoder:
Definition: Cross-Encoder Re-ranking
A cross-encoder takes the query and document as a single input and outputs a relevance score. Unlike bi-encoders, cross-encoders can model fine-grained query-document interactions, but they are slower: document embeddings cannot be pre-computed, so every query-document pair requires its own forward pass.
Re-ranking Score
$$\text{score}(q, d) = \text{CrossEncoder}(q \oplus d)$$
Here,
- $q$ = Query
- $d$ = Document
- $\oplus$ = Concatenation with separator
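The re-ranking step can be sketched as a function that scores every (query, document) pair jointly and keeps the best results. The names `rerank` and `toy_score_fn` are hypothetical, and the token-overlap scorer is only a stand-in so the example runs without a model. In practice `score_fn` would wrap a real cross-encoder, for example the `predict` method of a sentence-transformers `CrossEncoder`, which accepts a list of (query, document) pairs.

```python
def rerank(query, docs, score_fn, top_k=5):
    """Score each (query, doc) pair jointly and return the top_k (doc, score) tuples."""
    pairs = [(query, doc) for doc in docs]
    scores = score_fn(pairs)  # one relevance score per pair
    ranked = sorted(zip(docs, scores), key=lambda ds: ds[1], reverse=True)
    return ranked[:top_k]


def toy_score_fn(pairs):
    """Stand-in scorer (NOT a real cross-encoder): share of query tokens found in the doc."""
    scores = []
    for query, doc in pairs:
        q_tokens = set(query.lower().split())
        d_tokens = set(doc.lower().split())
        scores.append(len(q_tokens & d_tokens) / max(len(q_tokens), 1))
    return scores


candidates = [
    "Vector databases store embeddings for similarity search.",
    "Cross-encoders score a query and document together.",
    "BM25 is a sparse keyword ranking function.",
]
top = rerank("how do cross-encoders score a document", candidates, toy_score_fn, top_k=2)
```

Because the scorer runs once per candidate, re-ranking is usually applied only to the few dozen documents returned by the first-stage retriever.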
Query Expansion
Expand the query to capture more relevant documents:
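```python
def expand_query(query, llm, n_expansions=3):
    """Return the original query plus LLM-generated alternative phrasings."""
    prompt = f"""Generate {n_expansions} alternative phrasings of this query:
Original: {query}
Alternatives:"""
    response = llm.generate(prompt)
    # Drop blank lines and surrounding whitespace from the model output.
    alternatives = [line.strip() for line in response.split("\n") if line.strip()]
    return [query] + alternatives[:n_expansions]
```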
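Once a query has been expanded, each variant still has to be run through the retriever and the results merged. One simple way to do that, sketched below, is to keep each document's best score across all variants. The function `retrieve_with_expansion`, the `FAKE_INDEX` lookup table, and the scores are hypothetical and stand in for a real index.

```python
def retrieve_with_expansion(queries, retrieve_fn, top_k=3):
    """Run retrieval for every query variant and keep each document's best score."""
    best = {}
    for q in queries:
        for doc_id, score in retrieve_fn(q):
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


# Hypothetical retriever: a lookup table standing in for a vector or keyword index.
FAKE_INDEX = {
    "reset password": [("doc_a", 0.9), ("doc_b", 0.4)],
    "recover account access": [("doc_c", 0.8), ("doc_a", 0.6)],
}


def fake_retrieve(query):
    return FAKE_INDEX.get(query, [])


merged = retrieve_with_expansion(
    ["reset password", "recover account access"], fake_retrieve
)
```

Here `doc_c` is only reachable through the second phrasing, which is the coverage gain that expansion is meant to provide.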
Multi-Step Retrieval
Iteratively refine retrieval based on intermediate reasoning:
Definition: Iterative RAG
Iterative RAG performs multiple rounds of retrieval and generation. The model generates intermediate reasoning, identifies knowledge gaps, retrieves additional information, and refines its answer.
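The retrieve, reason, and refine loop can be sketched as follows. This is an illustrative outline under stated assumptions: `retrieve_fn` and `generate_fn` are hypothetical callables, and `generate_fn` is assumed to return both a draft answer and a follow-up query (or `None` when it sees no remaining knowledge gap). The knowledge base and stubs are fictional and exist only so the loop can be run.

```python
def iterative_rag(question, retrieve_fn, generate_fn, max_rounds=3):
    """Alternate retrieval and generation until the generator reports no knowledge gap."""
    context = []
    query = question
    answer = ""
    for _ in range(max_rounds):
        # Add newly retrieved passages, skipping ones already in the context.
        for passage in retrieve_fn(query):
            if passage not in context:
                context.append(passage)
        # generate_fn returns (draft_answer, follow_up_query or None).
        answer, follow_up = generate_fn(question, context)
        if follow_up is None:
            break  # no remaining knowledge gap
        query = follow_up
    return answer, context


# Fictional stubs so the loop runs without a real index or LLM.
KB = {
    "Who founded the company that makes Rocket Skates?": ["Acme makes Rocket Skates."],
    "Who founded Acme?": ["Acme was founded by Jane Doe."],
}


def stub_retrieve(query):
    return KB.get(query, [])


def stub_generate(question, context):
    if any("founded by" in passage for passage in context):
        return "Jane Doe", None
    return "unknown", "Who founded Acme?"


answer, context = iterative_rag(
    "Who founded the company that makes Rocket Skates?", stub_retrieve, stub_generate
)
```

The `max_rounds` cap matters in practice: each round adds retrieval and generation latency, so the loop needs a hard stop even when the model keeps asking follow-up questions.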
Chunking Strategies
Fixed-Size Chunking
Split documents into fixed-size chunks with optional overlap:
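Fixed Chunking
$$\text{chunk}_i = \text{doc}[\, i \cdot s : i \cdot s + w \,]$$
Here,
- $s$ = Stride (step size)
- $w$ = Window (chunk size)
Consecutive chunks overlap by $w - s$ tokens when $s < w$.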
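A sliding-window chunker with a stride and a window size might look like the sketch below. The function name and the whitespace tokenization in the example are assumptions. A production pipeline would normally count tokens with the embedding model's own tokenizer.

```python
def fixed_size_chunks(tokens, window=4, stride=2):
    """Slide a window of `window` tokens over the sequence, advancing `stride` each step."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


tokens = "a b c d e f g".split()
chunks = fixed_size_chunks(tokens, window=4, stride=2)
# chunks == [['a', 'b', 'c', 'd'], ['c', 'd', 'e', 'f'], ['e', 'f', 'g']]
```

With a window of 4 and a stride of 2, neighbouring chunks share two tokens. Setting the stride equal to the window gives non-overlapping chunks.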
Semantic Chunking
Split documents at semantic boundaries (paragraphs, sections, topics):
Definition: Semantic Chunking
Semantic chunking uses the document's natural structure (headers, paragraphs, topic shifts) to create meaningful chunks that preserve context and coherence.
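As a rough sketch of structure-aware chunking, the function below splits on blank lines, starts a new chunk at every heading, and merges short paragraphs up to a size limit. It assumes Markdown-style `#` headings and a character budget, both of which are illustrative choices. Detecting topic shifts (for example with embedding similarity between adjacent sentences) is not covered here.

```python
import re


def semantic_chunks(text, max_chars=500):
    """Split on blank lines, start a new chunk at each heading, merge short paragraphs."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        is_heading = para.startswith("#")
        fits = len(current) + len(para) + 2 <= max_chars  # +2 for the joining blank line
        if current and (is_heading or not fits):
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    # Note: a single paragraph longer than max_chars is kept whole as one oversized chunk.
    return chunks


doc = "# Setup\n\nInstall the package.\n\nRun the tests.\n\n# Usage\n\nCall the API."
chunks = semantic_chunks(doc)
```

In this example each heading stays attached to the paragraphs beneath it, so a retrieved chunk carries its own section context.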
Recursive Chunking
Split documents hierarchically, using larger separators first:
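```python
def recursive_split(text, separators=("\n\n", "\n", ". ", " "), chunk_size=512):
    """Split text with the coarsest separator that keeps every chunk within chunk_size."""
    for sep in separators:
        chunks = []
        current = ""
        for part in text.split(sep):
            # Count the separator too, so a merged chunk never exceeds chunk_size.
            if current and len(current) + len(sep) + len(part) <= chunk_size:
                current += sep + part
            else:
                if current:
                    chunks.append(current)
                current = part
        if current:
            chunks.append(current)
        if all(len(c) <= chunk_size for c in chunks):
            return chunks
    # No separator worked (e.g. one very long word): fall back to hard character slices.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```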
Evaluation Metrics
Retrieval Metrics
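Precision@k
$$\text{Precision@}k = \frac{|\{\text{relevant documents}\} \cap \{\text{top-}k \text{ retrieved documents}\}|}{k}$$
Here,
- $k$ = Number of retrieved documents
Recall@k
$$\text{Recall@}k = \frac{|\{\text{relevant documents}\} \cap \{\text{top-}k \text{ retrieved documents}\}|}{|\{\text{relevant documents}\}|}$$
Here,
- $k$ = Number of retrieved documents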
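The retrieval metrics can be computed directly from a ranked list of document IDs and a set of relevant IDs. The sketch below uses hypothetical function names and sample data, and assumes the ranked list contains no duplicate IDs. It also includes the reciprocal rank of a single query, since mean reciprocal rank (MRR) is this value averaged over all queries in an evaluation set.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


retrieved = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}
p = precision_at_k(retrieved, relevant, k=5)  # 2 of the 5 retrieved are relevant
r = recall_at_k(retrieved, relevant, k=5)     # 2 of the 3 relevant are retrieved
rr = reciprocal_rank(retrieved, relevant)     # first relevant hit is at rank 2
```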
Generation Metrics
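Faithfulness
$$\text{Faithfulness} = \frac{|\{c \in C : c \text{ is supported by the retrieved context}\}|}{|C|}$$
Here,
- $C$ = Factual claims in the generated answer
Answer Relevance
$$\text{Answer Relevance} = \text{sim}(a, q)$$
Here,
- $\text{sim}(a, q)$ = Semantic similarity between answer and question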
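Both generation metrics reduce to simple arithmetic once the hard parts (extracting claims, judging support, embedding text) are done by a model. The sketch below shows only that arithmetic. The `is_supported` check is a naive substring test and the embedding vectors are made-up numbers, both standing in for an LLM judge and a real embedding model.

```python
import math


def faithfulness(claims, is_supported):
    """Share of answer claims that the retrieved context supports."""
    if not claims:
        return 0.0
    return sum(1 for claim in claims if is_supported(claim)) / len(claims)


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical inputs: claims extracted from an answer, and a naive support check.
context = "The Eiffel Tower is in Paris. It was completed in 1889."
claims = [
    "The Eiffel Tower is in Paris",
    "It was completed in 1889",
    "It is 500 m tall",
]
faith = faithfulness(claims, lambda claim: claim in context)  # 2 of 3 claims supported

# Made-up embeddings of the answer and the question.
relevance = cosine_similarity([1.0, 0.0, 1.0], [1.0, 1.0, 0.0])
```

A low faithfulness score with high answer relevance is the typical signature of a fluent but hallucinated answer.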
Use RAGAS (Retrieval Augmented Generation Assessment) for comprehensive RAG evaluation: faithfulness, answer relevance, context precision, and context recall.
Production RAG Architecture
Components
- Document Processing Pipeline: Ingestion, chunking, embedding, indexing
- Query Processing: Query understanding, expansion, routing
- Retrieval Layer: Hybrid search with re-ranking
- Generation Layer: LLM with context integration
- Post-processing: Answer validation, citation generation
- Monitoring: Latency, quality, drift detection
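The online part of this component list (query processing through monitoring) can be wired together as in the sketch below. The `RAGPipeline` class, its field names, and the lambda stand-ins are hypothetical and only show how the stages hand data to each other. The offline document processing pipeline is assumed to have already built the index behind `retrieve`.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RAGPipeline:
    """Wires the query, retrieval, generation, and post-processing stages together."""
    expand: Callable[[str], List[str]]
    retrieve: Callable[[str], List[str]]
    generate: Callable[[str, List[str]], str]
    validate: Callable[[str, List[str]], bool]
    latencies_ms: List[float] = field(default_factory=list)  # simple monitoring hook

    def answer(self, question: str) -> dict:
        start = time.perf_counter()
        passages: List[str] = []
        for q in self.expand(question):           # query processing
            for passage in self.retrieve(q):      # retrieval layer
                if passage not in passages:
                    passages.append(passage)
        text = self.generate(question, passages)  # generation layer
        ok = self.validate(text, passages)        # post-processing
        self.latencies_ms.append((time.perf_counter() - start) * 1000)
        return {"answer": text, "sources": passages, "validated": ok}


# Hypothetical stand-ins for each stage.
pipeline = RAGPipeline(
    expand=lambda q: [q, q.lower()],
    retrieve=lambda q: ["RAG combines retrieval with generation."],
    generate=lambda q, ctx: ctx[0] if ctx else "I don't know.",
    validate=lambda ans, ctx: ans in ctx,
)
result = pipeline.answer("What is RAG?")
```

Keeping each stage behind a plain callable makes it straightforward to swap a component (for example, adding a re-ranker inside `retrieve`) without touching the rest of the pipeline.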
Scaling Considerations
| Component | Small Scale | Production Scale |
|---|---|---|
| Documents | < 100K | > 10M |
| Queries/sec | < 10 | > 1000 |
| Latency target | < 5s | < 500ms |
| Vector DB | ChromaDB | Pinecone/Qdrant |
| Embedding | Local model | API (OpenAI/Cohere) |
| LLM | Self-hosted | API with fallback |
Practice Exercises
- Hybrid Search: Implement hybrid search combining BM25 and FAISS. Compare with dense-only retrieval.
- Re-ranking: Add a cross-encoder re-ranker to your RAG pipeline. Measure improvement in precision@5.
- Evaluation: Build a gold-standard evaluation set with 100 queries. Measure precision, recall, mean reciprocal rank (MRR), and faithfulness.
- Production: Design a RAG system for a knowledge base with 1M documents. Address latency, scalability, and cost.
Key Takeaways:
- Hybrid search combines keyword and semantic retrieval for better coverage
- Cross-encoder re-ranking can improve the precision of the top results, at the cost of extra latency per query
- Chunking strategy affects retrieval granularity and context quality
- Precision@k, recall@k, and MRR measure retrieval quality
- Faithfulness and answer relevance measure generation quality
- Production RAG requires careful attention to latency, scalability, and monitoring