Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) combines the power of large language models with external knowledge retrieval. This tutorial covers the architecture, components, and practical implementation of RAG systems.
Definition: Retrieval-Augmented Generation (RAG)
RAG is a paradigm that enhances language model outputs by retrieving relevant documents from an external knowledge base and incorporating them into the generation context. The model generates answers conditioned on both the query and the retrieved documents.
Why RAG Over Fine-tuning?
| Factor | RAG | Fine-tuning |
|---|---|---|
| Knowledge updates | Near real-time (update the index) | Requires retraining |
| Hallucination | Reduced by grounding in retrieved docs | More likely for facts outside the training data |
| Explainability | Citations to sources | Black box |
| Cost | Lower (no retraining) | Higher (compute + data) |
| Freshness | As current as the index | Snapshot at training time |
RAG is preferred when knowledge changes frequently, when accuracy is critical, or when you need source attribution. Fine-tuning is better for learning new behaviors or styles.
RAG Architecture
Basic RAG Pipeline
- Indexing: Process documents into chunks and create embeddings
- Retrieval: Find relevant chunks given a query
- Augmentation: Combine retrieved chunks with the query
- Generation: Generate an answer using the LLM
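The four stages above can be sketched end to end in plain Python. This is a toy illustration, not a production design: retrieval here scores documents by word overlap instead of embeddings, and `generate` is a hypothetical placeholder standing in for a real LLM call. All function names are assumptions made for this sketch.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Indexing: store each document with a bag-of-words representation."""
    return [(doc, Counter(tokenize(doc))) for doc in docs]


def retrieve(index, query, top_k=2):
    """Retrieval: rank documents by word overlap with the query (toy scoring)."""
    query_bag = Counter(tokenize(query))
    scored = [(sum((query_bag & bag).values()), doc) for doc, bag in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]


def build_prompt(query, chunks):
    """Augmentation: combine the retrieved chunks with the query."""
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


def generate(prompt):
    """Generation: hypothetical placeholder for a call to an LLM."""
    return f"[LLM output for a prompt of {len(prompt)} characters]"


docs = [
    "BERT is an encoder-only transformer model.",
    "GPT is a decoder-only transformer model.",
    "Python is a high-level programming language.",
]
index = build_index(docs)
question = "Is BERT an encoder-only model?"
chunks = retrieve(index, question)
prompt = build_prompt(question, chunks)
answer = generate(prompt)
print(prompt)
print(answer)
```

In a real system, the word-overlap scoring would be replaced by embedding similarity and `generate` by an actual model call; the flow of data between the four stages stays the same.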
Document Processing
Document Chunking
$$D \rightarrow \{c_1, c_2, \ldots, c_N\}$$
Here,
- $D$ = Original document collection
- $c_i$ = Text chunk $i$
- $N$ = Total number of chunks
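One common way to split a document into chunks is fixed-size chunking with overlap, so that text cut at a boundary still appears whole in a neighbouring chunk. The sketch below is a minimal version under simplifying assumptions: it counts whitespace-separated words rather than model tokens, and the function name `chunk_text` and its parameters are illustrative, not from any library.

```python
def chunk_text(text, chunk_size=5, overlap=2):
    """Split text into chunks of `chunk_size` words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


document = "one two three four five six seven eight nine ten"
chunks = chunk_text(document, chunk_size=4, overlap=1)
for i, chunk in enumerate(chunks, start=1):
    print(f"chunk {i}: {chunk}")
```

Each chunk repeats the last word of the previous one. Larger overlaps reduce the chance of splitting a relevant passage but increase the number of chunks to embed and store.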
Embedding Models
Embedding models convert text into dense vector representations for similarity search.
Text Embedding
$$\mathbf{e} = f_\theta(\text{text}) \in \mathbb{R}^d$$
Here,
- $\mathbf{e}$ = Dense embedding vector
- $d$ = Embedding dimension (typically 384-1536)
- $f_\theta$ = Embedding model
Popular Embedding Models
| Model | Dimension | Max Tokens | Performance |
|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 256 | Good, fast |
| BGE-large | 1024 | 512 | Excellent |
| OpenAI text-embedding-3-small | 1536 | 8191 | Excellent |
| Cohere embed-v3 | 1024 | 512 | Excellent |
Example: Cosine Similarity Calculation
Given two embedding vectors a = [0.5, 0.3, 0.8] and b = [0.4, 0.6, 0.7]:
- Dot product: 0.5 × 0.4 + 0.3 × 0.6 + 0.8 × 0.7 = 0.20 + 0.18 + 0.56 = 0.94
- Norm of a: sqrt(0.25 + 0.09 + 0.64) = sqrt(0.98) ≈ 0.990
- Norm of b: sqrt(0.16 + 0.36 + 0.49) = sqrt(1.01) ≈ 1.005
- Cosine similarity: 0.94 / (0.990 × 1.005) ≈ 0.945

This high score means the two vectors point in nearly the same direction, which suggests the texts they represent are semantically related.
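The worked example above can be checked in a few lines of Python. This is a small verification sketch that uses only the standard library; the variable names are chosen for illustration.

```python
import math

a = [0.5, 0.3, 0.8]
b = [0.4, 0.6, 0.7]

# Dot product: sum of element-wise products
dot_ab = sum(x * y for x, y in zip(a, b))

# Euclidean (L2) norms of each vector
norm_a = math.sqrt(sum(x * x for x in a))
norm_b = math.sqrt(sum(y * y for y in b))

# Cosine similarity: dot product divided by the product of the norms
cos_ab = dot_ab / (norm_a * norm_b)

print(f"dot={dot_ab:.2f} norm_a={norm_a:.3f} norm_b={norm_b:.3f} cosine={cos_ab:.4f}")
```

Computed without intermediate rounding, the cosine similarity comes out at about 0.9448.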
Similarity Search
Dot Product Similarity
$$\text{sim}(\mathbf{a}, \mathbf{b}) = \mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{d} a_i b_i$$
Here,
- $d$ = Embedding dimension
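Cosine similarity is preferred when embeddings are not normalized, because it compares direction only and ignores vector length. The dot product is cheaper to compute and produces the same ranking when embeddings are L2-normalized to unit length. Many embedding models output normalized vectors or offer an option to do so; check the model's documentation before relying on it.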
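To illustrate why normalization matters, the sketch below uses two made-up document vectors: one is long but only partly aligned with the query, the other is shorter but perfectly aligned. The vectors and names are hypothetical, chosen so the two measures disagree on unnormalized input.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


query = [1.0, 0.0]
docs = {"long": [3.0, 3.0], "aligned": [2.0, 0.0]}

# On raw vectors, the dot product rewards vector length
best_by_dot = max(docs, key=lambda name: dot(query, docs[name]))
# Cosine similarity looks only at direction
best_by_cosine = max(docs, key=lambda name: cosine(query, docs[name]))

# After L2 normalization, the dot product agrees with cosine similarity
unit_query = l2_normalize(query)
unit_docs = {name: l2_normalize(vec) for name, vec in docs.items()}
best_by_unit_dot = max(unit_docs, key=lambda name: dot(unit_query, unit_docs[name]))

print(best_by_dot, best_by_cosine, best_by_unit_dot)
```

The raw dot product picks the longer vector, while cosine similarity and the dot product on normalized vectors both pick the aligned one.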
Vector Databases
Vector databases optimize storage and retrieval of embedding vectors.
| Database | Type | Features |
|---|---|---|
| FAISS | Library | Fast, local, CPU/GPU |
| Pinecone | Cloud | Managed, scalable |
| Weaviate | Self-hosted | Hybrid search, GraphQL |
| ChromaDB | Library | Simple, local |
| Qdrant | Self-hosted | Filtering, high performance |
HuggingFace + FAISS Example
```python
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Load embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Sample documents
documents = [
    "Python is a high-level programming language.",
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning uses neural networks with many layers.",
    "Transformers use self-attention mechanisms.",
    "BERT is an encoder-only transformer model.",
    "GPT is a decoder-only transformer model.",
    "Fine-tuning adapts pre-trained models to specific tasks.",
    "LoRA reduces the number of trainable parameters.",
]

# Create embeddings
embeddings = model.encode(documents)
embeddings = np.array(embeddings).astype("float32")

# Build FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)  # Inner product (cosine for normalized vectors)
faiss.normalize_L2(embeddings)  # Normalize in place for cosine similarity
index.add(embeddings)

# Query
query = "How do transformers work?"
query_embedding = model.encode([query])
faiss.normalize_L2(query_embedding)

# Retrieve top-k results
k = 3
distances, indices = index.search(query_embedding, k)

for i, (dist, idx) in enumerate(zip(distances[0], indices[0])):
    print(f"Rank {i+1} (score={dist:.4f}): {documents[idx]}")
```
RAG Pipeline with LLM
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reuses `model`, `index`, `documents`, and `faiss` from the FAISS example above.
# The Llama 2 weights are gated on the Hugging Face Hub and require approved access.
model_name = "meta-llama/Llama-2-7b-chat-hf"
llm = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)


def rag_query(question, top_k=3):
    # Retrieve relevant documents
    query_emb = model.encode([question])
    faiss.normalize_L2(query_emb)
    _, indices = index.search(query_emb, top_k)

    # Build context
    context = "\n".join(documents[i] for i in indices[0])

    # Generate answer
    prompt = (
        "Based on the following context, answer the question.\n\n"
        f"Context: {context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt", max_length=2048, truncation=True)
    # do_sample=True is required for temperature to take effect
    output = llm.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.3)
    # Decode only the newly generated tokens, not the prompt
    new_tokens = output[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


answer = rag_query("What is the difference between BERT and GPT?")
print(answer)
```
Practice Exercises
- Implementation: Build a RAG system over a collection of 100 Wikipedia articles. Measure retrieval accuracy and answer quality.
- Chunking: Compare fixed-size chunking (256, 512, 1024 tokens) with semantic chunking. Which produces better retrieval?
- Embeddings: Compare 3 different embedding models on your RAG task. Which gives the best retrieval quality?
- Evaluation: Implement precision@k and recall@k metrics for your retrieval system. How does k affect performance?
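As a starting point for the evaluation exercise, precision@k and recall@k can be sketched as below. The document IDs and relevance labels are made up for illustration, and the functions follow the common convention of dividing precision by k even when fewer than k results are returned; other conventions exist.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant items that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


# Hypothetical ranked results and ground-truth labels for one query
retrieved = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d4"}

for k in (1, 3, 5):
    p = precision_at_k(retrieved, relevant, k)
    r = recall_at_k(retrieved, relevant, k)
    print(f"k={k}: precision={p:.2f} recall={r:.2f}")
```

Recall can only stay the same or grow as k increases, while precision often falls as lower-ranked results are included; averaging both over many queries gives a more stable picture than a single query.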
Key Takeaways:
- RAG combines retrieval with generation for grounded, up-to-date outputs
- RAG is preferred over fine-tuning for frequently changing knowledge
- Cosine similarity and dot product are the most common similarity measures for retrieval
- Embedding models convert text to dense vectors for similarity search
- FAISS provides fast local vector search; cloud options available for scale
- Document chunking strategy significantly affects retrieval quality