Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) combines the power of large language models with external knowledge retrieval. This tutorial covers the architecture, components, and practical implementation of RAG systems.

Definition: Retrieval-Augmented Generation (RAG)

RAG is a paradigm that enhances language model outputs by retrieving relevant documents from an external knowledge base and incorporating them into the generation context. The model generates answers conditioned on both the query and the retrieved documents.

Why RAG Over Fine-tuning?

| Factor | RAG | Fine-tuning |
|---|---|---|
| Knowledge updates | Real-time (update index) | Requires retraining |
| Hallucination | Grounded in retrieved docs | May hallucinate |
| Explainability | Citations to sources | Black box |
| Cost | Lower (no retraining) | Higher (compute + data) |
| Freshness | Always current | Snapshot at training time |

RAG is preferred when knowledge changes frequently, when accuracy is critical, or when you need source attribution. Fine-tuning is better for learning new behaviors or styles.

RAG Architecture

Basic RAG Pipeline

  1. Indexing: Process documents into chunks and create embeddings
  2. Retrieval: Find relevant chunks given a query
  3. Augmentation: Combine retrieved chunks with the query
  4. Generation: Generate an answer using the LLM
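The four stages can be sketched end to end in plain Python. This is a toy illustration, not a production design: the `embed` function is a bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector index, and the generation step is left out (the prompt is only printed). All function names here are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# 1. Indexing: embed each chunk once, up front
chunks = [
    "FAISS is a library for vector similarity search.",
    "Fine-tuning adapts a pre-trained model to a task.",
    "Chunking splits documents into smaller passages.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query, top_k=2):
    # 2. Retrieval: rank chunks by similarity to the query
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

def build_prompt(query, retrieved):
    # 3. Augmentation: place the retrieved chunks in front of the question
    context = "\n".join(retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

# 4. Generation: the prompt would be passed to an LLM (omitted in this sketch)
question = "What is FAISS?"
prompt = build_prompt(question, retrieve(question))
print(prompt)
```

Swapping `embed` for a neural embedding model and the list for a vector index changes the quality of retrieval, but the control flow stays the same.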

Document Processing

Document Chunking

$$D = \{d_1, d_2, \ldots, d_n\} \rightarrow \text{chunks}(D) = \{c_1, c_2, \ldots, c_m\}$$

Here,

  • $D$ = Original document collection
  • $c_i$ = Text chunk $i$
  • $m$ = Total number of chunks
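One common way to realize this mapping is fixed-size chunking with overlap. The sketch below is an assumption for illustration, not a method prescribed by this lesson: it counts whitespace-separated words, whereas real systems usually count tokens with the embedding model's tokenizer. The names `chunk_text` and `chunk_collection` and their default sizes are hypothetical.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping chunks of at most `chunk_size` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

def chunk_collection(documents, **kwargs):
    """Map a collection D = {d_1..d_n} to a flat list of chunks {c_1..c_m}."""
    return [c for doc in documents for c in chunk_text(doc, **kwargs)]

sample = " ".join(f"w{i}" for i in range(10))
print(chunk_text(sample, chunk_size=4, overlap=1))
```

The overlap repeats a few words across neighboring chunks so that a sentence cut at a boundary still appears whole in at least one chunk.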

Embedding Models

Embedding models convert text into dense vector representations for similarity search.

Text Embedding

$$\mathbf{e} = f_{\text{embed}}(\text{text}) \in \mathbb{R}^d$$

Here,

  • $\mathbf{e}$ = Dense embedding vector
  • $d$ = Embedding dimension (typically 384-1536)
  • $f_{\text{embed}}$ = Embedding model

Popular Embedding Models

| Model | Dimension | Max Tokens | Performance |
|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | 256 | Good, fast |
| BGE-large | 1024 | 512 | Excellent |
| OpenAI text-embedding-3 | 1536 | 8191 | Excellent |
| Cohere embed-v3 | 1024 | 512 | Excellent |

Example: Cosine Similarity Calculation

Given two embedding vectors a = [0.5, 0.3, 0.8] and b = [0.4, 0.6, 0.7]:

  • Dot product: 0.5*0.4 + 0.3*0.6 + 0.8*0.7 = 0.20 + 0.18 + 0.56 = 0.94
  • Norm a: sqrt(0.25 + 0.09 + 0.64) = sqrt(0.98) ≈ 0.990
  • Norm b: sqrt(0.16 + 0.36 + 0.49) = sqrt(1.01) ≈ 1.005
  • Cosine similarity: 0.94 / (0.990 * 1.005) ≈ 0.945

This high score means the two vectors point in nearly the same direction. For real embeddings, that indicates the underlying texts are semantically related.
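The arithmetic in this example can be checked with a few lines of standard-library Python. The snippet is only a verification aid using the same two vectors. The unrounded cosine value is about 0.9448.

```python
import math

a = [0.5, 0.3, 0.8]
b = [0.4, 0.6, 0.7]

dot = sum(x * y for x, y in zip(a, b))     # sum of element-wise products
norm_a = math.sqrt(sum(x * x for x in a))  # Euclidean (L2) length of a
norm_b = math.sqrt(sum(y * y for y in b))  # Euclidean (L2) length of b
cosine = dot / (norm_a * norm_b)

print(f"dot={dot:.2f}  |a|={norm_a:.3f}  |b|={norm_b:.3f}  cosine={cosine:.4f}")
```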

Similarity Search

Cosine Similarity

$$\text{sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}$$

Dot Product Similarity

$$\text{sim}(\mathbf{a}, \mathbf{b}) = \mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{d} a_i b_i$$

Here,

  • $d$ = Embedding dimension

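Cosine similarity is preferred when embeddings are not normalized, because it ignores vector length and compares direction only. Dot product is cheaper to compute and gives identical scores when both vectors are normalized to unit length. Some embedding models and APIs return unit-length vectors, but not all do, so verify this or normalize explicitly before relying on the dot product.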
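The equivalence for unit-length vectors can be seen in a short sketch. The vectors and helper names below are arbitrary and chosen for illustration only.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def normalize(v):
    """Scale v to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

a = [3.0, 4.0]
b = [1.0, 2.0]

raw_dot = dot(a, b)                         # depends on the vector lengths
cos = cosine_similarity(a, b)               # depends on direction only
unit_dot = dot(normalize(a), normalize(b))  # dot product after normalization

print(raw_dot, cos, unit_dot)
assert math.isclose(cos, unit_dot)
```

Because normalization can be done once at indexing time, an inner-product index over unit vectors returns cosine scores without recomputing norms on every query.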

Vector Databases

Vector databases optimize storage and retrieval of embedding vectors.

| Database | Type | Features |
|---|---|---|
| FAISS | Library | Fast, local, CPU/GPU |
| Pinecone | Cloud | Managed, scalable |
| Weaviate | Self-hosted | Hybrid search, GraphQL |
| ChromaDB | Library | Simple, local |
| Qdrant | Self-hosted | Filtering, high performance |

HuggingFace + FAISS Example

```python
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Load embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Sample documents
documents = [
    "Python is a high-level programming language.",
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning uses neural networks with many layers.",
    "Transformers use self-attention mechanisms.",
    "BERT is an encoder-only transformer model.",
    "GPT is a decoder-only transformer model.",
    "Fine-tuning adapts pre-trained models to specific tasks.",
    "LoRA reduces the number of trainable parameters.",
]

# Create embeddings (FAISS expects float32 arrays)
embeddings = model.encode(documents)
embeddings = np.array(embeddings).astype("float32")

# Build FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)  # Inner product (cosine for normalized vectors)
faiss.normalize_L2(embeddings)  # Normalize in place for cosine similarity
index.add(embeddings)

# Query
query = "How do transformers work?"
query_embedding = np.array(model.encode([query])).astype("float32")
faiss.normalize_L2(query_embedding)

# Retrieve top-k results
# With IndexFlatIP the returned "distances" are similarity scores (higher is better)
k = 3
distances, indices = index.search(query_embedding, k)

for i, (dist, idx) in enumerate(zip(distances[0], indices[0])):
    print(f"Rank {i+1} (score={dist:.4f}): {documents[idx]}")
```

RAG Pipeline with LLM

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reuses `model`, `index`, `documents`, `faiss`, and `np` from the FAISS example.
# Llama 2 weights are gated on the Hugging Face Hub and require approved access.
llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

def rag_query(question, top_k=3):
    # Retrieve relevant documents
    query_emb = np.array(model.encode([question])).astype("float32")
    faiss.normalize_L2(query_emb)
    _, indices = index.search(query_emb, top_k)

    # Build context
    context = "\n".join([documents[i] for i in indices[0]])

    # Generate answer
    prompt = f"""Based on the following context, answer the question.

Context: {context}

Question: {question}
Answer:"""

    inputs = tokenizer(prompt, return_tensors="pt", max_length=2048, truncation=True)
    # do_sample=True is needed for temperature to take effect
    output = llm.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.3)
    # Decode only the newly generated tokens, skipping the prompt
    return tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)

answer = rag_query("What is the difference between BERT and GPT?")
print(answer)
```

Practice Exercises

  1. Implementation: Build a RAG system over a collection of 100 Wikipedia articles. Measure retrieval accuracy and answer quality.
  2. Chunking: Compare fixed-size chunking (256, 512, 1024 tokens) with semantic chunking. Which produces better retrieval?
  3. Embeddings: Compare 3 different embedding models on your RAG task. Which gives the best retrieval quality?
  4. Evaluation: Implement precision@k and recall@k metrics for your retrieval system. How does k affect performance?

Key Takeaways:

  • RAG combines retrieval with generation for grounded, up-to-date outputs
  • RAG is preferred over fine-tuning for frequently changing knowledge
  • Cosine similarity and dot product are the primary similarity measures used for retrieval
  • Embedding models convert text to dense vectors for similarity search
  • FAISS provides fast local vector search; cloud options available for scale
  • Document chunking strategy significantly affects retrieval quality
