RAG — Combining LLMs with External Knowledge

Retrieval-Augmented Generation (RAG) pairs a large language model with a retrieval step: relevant documents are fetched from an external knowledge source at query time and passed to the model as context. This guide covers the architecture, the main components, and a practical implementation of a RAG system that produces grounded, up-to-date answers.

Knowledge Grounding — Reduce hallucination by anchoring answers in retrieved documents
Vector Search — Embedding models and FAISS enable fast semantic retrieval
Chunking Strategy — Document segmentation significantly affects retrieval quality

The best answers come from knowing where to look.

Why RAG Over Fine-tuning?

Factor	RAG	Fine-tuning
Knowledge updates	Near real-time (re-index the documents)	Requires retraining
Hallucination	Reduced by grounding answers in retrieved docs	More likely for facts missing from the training data
Explainability	Answers can cite retrieved sources	Hard to trace answers to sources
Cost	Lower upfront (no retraining)	Higher (compute + training data)
Freshness	As current as the index	Snapshot at training time

RAG Architecture

Basic RAG Pipeline

Indexing: Split documents into chunks, embed each chunk, and store the vectors in an index
Retrieval: Embed the query and find the most similar chunks
Augmentation: Insert the retrieved chunks into the prompt alongside the query
Generation: Have the LLM answer from the augmented prompt
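
The four stages above can be sketched end to end in a few lines. This is a toy illustration under stated assumptions, not a production design: `embed` is a bag-of-words counter standing in for a real embedding model, `llm` is any callable you supply, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: bag-of-words counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(chunks):
    # 1. Indexing: store each chunk next to its embedding
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(index, query, top_k=2):
    # 2. Retrieval: rank chunks by similarity to the query
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

def augment(query, chunks):
    # 3. Augmentation: put the retrieved chunks into the prompt
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

def generate(prompt, llm):
    # 4. Generation: `llm` is any callable mapping a prompt string to text
    return llm(prompt)

toy_chunks = [
    "BERT is an encoder-only transformer model.",
    "GPT is a decoder-only transformer model.",
    "FAISS is a library for vector similarity search.",
]
toy_index = build_index(toy_chunks)
question = "Which model is encoder-only?"
top_chunks = retrieve(toy_index, question, top_k=1)
prompt = augment(question, top_chunks)
answer = generate(prompt, llm=lambda p: "(model output goes here)")
```

Swapping `embed` for a real embedding model and `llm` for a real model call leaves the four-stage structure unchanged.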

Document Processing
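
Before documents can be embedded, they are split into chunks small enough for the embedding model. A minimal sketch of fixed-size chunking with overlap follows. It counts whitespace-separated words rather than model tokens, which is a simplifying assumption, and `chunk_text` with its default sizes is a hypothetical helper rather than part of any library.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words, so a sentence that straddles
    a boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
sample_chunks = chunk_text(sample, chunk_size=4, overlap=1)
```

Larger chunks carry more context per retrieved item but dilute the embedding; smaller chunks are more precise but may lose surrounding context. The overlap is one common way to soften that trade-off.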

Embedding Models

Embedding models convert text into dense vectors such that semantically similar texts end up close together, which is what makes similarity search possible.

Popular Embedding Models

Model	Dimension	Max Tokens	Performance
all-MiniLM-L6-v2	384	256	Good, fast
BGE-large	1024	512	Excellent
OpenAI text-embedding-3-small	1536	8191	Excellent
Cohere embed-v3	1024	512	Excellent
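
The dimension column matters for storage as well as quality. As a rough back-of-the-envelope sketch, assuming uncompressed float32 vectors (4 bytes per value) and ignoring any index overhead, raw storage grows linearly with dimension. The helper name `flat_index_bytes` is hypothetical.

```python
def flat_index_bytes(num_vectors, dimension, bytes_per_value=4):
    """Raw storage for uncompressed vectors (float32 by default)."""
    return num_vectors * dimension * bytes_per_value

# Dimensions taken from the table above
for dim in (384, 1024, 1536):
    mib = flat_index_bytes(1_000_000, dim) / 2**20
    print(f"{dim} dims: {mib:.0f} MiB per million vectors")
```

Texts longer than a model's maximum token count are typically truncated, so the chunk size should be chosen with that limit in mind.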

Similarity Search
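
Similarity search ranks stored vectors by how close they are to the query vector, most often using cosine similarity. The sketch below is a brute-force version in plain Python, meant only to show the idea; `normalize` and `top_k_cosine` are hypothetical helpers. After L2 normalization, the inner product of two vectors equals their cosine similarity, which is why the FAISS example in this guide normalizes vectors before using an inner-product index.

```python
import heapq
import math

def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def top_k_cosine(query, vectors, k=3):
    """Return up to k (score, position) pairs, best match first."""
    q = normalize(query)
    scored = []
    for position, vec in enumerate(vectors):
        v = normalize(vec)
        # Inner product of unit vectors == cosine similarity
        scored.append((sum(a * b for a, b in zip(q, v)), position))
    return heapq.nlargest(k, scored)

stored = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
results = top_k_cosine([2.0, 0.1], stored, k=2)
```

Brute-force search compares the query with every stored vector, so its cost grows linearly with the collection size.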

Vector Databases

Vector databases store embedding vectors and index them for fast nearest-neighbor search, often using approximate methods to scale beyond brute-force comparison.

Database	Type	Features
FAISS	Library	Fast, local, CPU/GPU
Pinecone	Managed cloud	Managed, scalable
Weaviate	Self-hosted or managed	Hybrid search, GraphQL
ChromaDB	Library	Simple, local
Qdrant	Self-hosted or managed	Filtering, high performance
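
To make concrete what these systems provide beyond raw search, here is a minimal in-memory sketch of three common operations: upsert, delete, and metadata-filtered search. `TinyVectorStore` is a hypothetical toy class, not the API of any database in the table; real systems add persistence, approximate indexes, and scaling.

```python
import math

class TinyVectorStore:
    """Hypothetical in-memory store: brute-force search, no persistence."""

    def __init__(self):
        self._items = {}  # item_id -> (unit vector, metadata dict)

    def upsert(self, item_id, vector, metadata=None):
        """Insert a vector, or replace the one stored under the same ID."""
        norm = math.sqrt(sum(x * x for x in vector))
        if norm == 0:
            raise ValueError("zero vector cannot be normalized")
        self._items[item_id] = ([x / norm for x in vector], metadata or {})

    def delete(self, item_id):
        """Remove an item; unknown IDs are ignored."""
        self._items.pop(item_id, None)

    def search(self, query, k=3, where=None):
        """Return up to k (score, item_id) pairs whose metadata matches `where`."""
        norm = math.sqrt(sum(x * x for x in query)) or 1.0
        q = [x / norm for x in query]
        hits = []
        for item_id, (vec, meta) in self._items.items():
            if where and any(meta.get(key) != value for key, value in where.items()):
                continue  # skip items that fail the metadata filter
            hits.append((sum(a * b for a, b in zip(q, vec)), item_id))
        return sorted(hits, reverse=True)[:k]

store = TinyVectorStore()
store.upsert("a", [1.0, 0.0], {"lang": "en"})
store.upsert("b", [0.9, 0.1], {"lang": "de"})
store.upsert("c", [0.0, 1.0], {"lang": "en"})
english_hits = store.search([1.0, 0.0], k=2, where={"lang": "en"})
store.delete("a")  # knowledge update: the item is gone from the next query on
after_delete = store.search([1.0, 0.0], k=1)
```

Upsert and delete are what make the "update the index instead of retraining" workflow possible.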

HuggingFace + FAISS Example

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# Load embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Sample documents
documents = [
    "Python is a high-level programming language.",
    "Machine learning is a subset of artificial intelligence.",
    "Deep learning uses neural networks with many layers.",
    "Transformers use self-attention mechanisms.",
    "BERT is an encoder-only transformer model.",
    "GPT is a decoder-only transformer model.",
    "Fine-tuning adapts pre-trained models to specific tasks.",
    "LoRA reduces the number of trainable parameters.",
]

# Create embeddings
embeddings = model.encode(documents)
embeddings = np.array(embeddings).astype("float32")

# Build FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)  # Inner product (cosine for normalized vectors)
faiss.normalize_L2(embeddings)  # Normalize for cosine similarity
index.add(embeddings)

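# Query: embed and normalize the same way as the documents
query = "How do transformers work?"
query_embedding = np.asarray(model.encode([query]), dtype="float32")
faiss.normalize_L2(query_embedding)

# Retrieve top-k results; with IndexFlatIP the returned values are
# similarity scores (higher is better), not distances
k = 3
scores, indices = index.search(query_embedding, k)

for rank, (score, idx) in enumerate(zip(scores[0], indices[0]), start=1):
    print(f"Rank {rank} (score={score:.4f}): {documents[idx]}")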

RAG Pipeline with LLM

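from transformers import AutoModelForCausalLM, AutoTokenizer

# Note: this checkpoint is gated on the Hugging Face Hub; you need to accept
# the license and authenticate before it will download.
model_name = "meta-llama/Llama-2-7b-chat-hf"
llm = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)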

def rag_query(question, top_k=3):
    # Retrieve relevant documents
    query_emb = model.encode([question])
    faiss.normalize_L2(query_emb)
    _, indices = index.search(query_emb, top_k)
    
    # Build context
    context = "\n".join([documents[i] for i in indices[0]])
    
    # Generate answer
    prompt = f"""Based on the following context, answer the question.

Context:
{context}

Question: {question}
Answer:"""
    
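    # Tokenize (truncating over-long prompts) and move tensors to the model's device
    inputs = tokenizer(prompt, return_tensors="pt", max_length=2048, truncation=True).to(llm.device)
    # temperature only takes effect when sampling is enabled
    output = llm.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.3)
    # Decode only the newly generated tokens, skipping the prompt
    new_tokens = output[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()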

answer = rag_query("What is the difference between BERT and GPT?")
print(answer)
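
The comparison table lists citations as an advantage of RAG, but the prompt has to ask for them. One possible way to do that, sketched here under stated assumptions, is to number each retrieved chunk and instruct the model to cite those numbers. `build_cited_prompt` and its character budget are hypothetical; a real system would usually budget in tokens rather than characters.

```python
def build_cited_prompt(question, chunks, max_context_chars=2000):
    """Number each chunk so the model can cite it as [1], [2], ...

    Chunks are added in ranked order until the character budget is used up.
    """
    lines, used = [], 0
    for number, chunk in enumerate(chunks, start=1):
        entry = f"[{number}] {chunk}"
        if used + len(entry) > max_context_chars:
            break  # keep the best-ranked chunks, drop the rest
        lines.append(entry)
        used += len(entry)
    context = "\n".join(lines)
    return (
        "Answer the question using only the numbered sources below. "
        "Cite sources like [1]. If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

cited_prompt = build_cited_prompt(
    "What is the difference between BERT and GPT?",
    ["BERT is an encoder-only transformer model.",
     "GPT is a decoder-only transformer model."],
)
```

Telling the model to admit when the sources do not contain the answer is a simple guard against answers that ignore the retrieved context.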

Practice Exercises

Implementation: Build a RAG system over a collection of 100 Wikipedia articles. Measure retrieval accuracy and answer quality.
Chunking: Compare fixed-size chunking (256, 512, 1024 tokens) with semantic chunking. Which produces better retrieval?
Embeddings: Compare 3 different embedding models on your RAG task. Which gives the best retrieval quality?
Evaluation: Implement precision@k and recall@k metrics for your retrieval system. How does k affect performance?
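
As a possible starting point for the evaluation exercise, the two metrics can be defined as follows: precision@k is the share of the top-k retrieved items that are relevant, and recall@k is the share of all relevant items that appear in the top k. The sketch assumes each query has a known set of relevant document IDs and that retrieved IDs are unique. The function names are hypothetical.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved IDs that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant IDs that appear in the top-k results."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0  # convention chosen here for queries with no relevant docs
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

retrieved_ids = ["d3", "d1", "d7", "d2"]   # ranked retrieval output
relevant_ids = {"d1", "d2", "d5"}          # ground-truth labels

p_at_2 = precision_at_k(retrieved_ids, relevant_ids, k=2)
r_at_2 = recall_at_k(retrieved_ids, relevant_ids, k=2)
r_at_4 = recall_at_k(retrieved_ids, relevant_ids, k=4)
```

Increasing k can only keep recall the same or raise it, while precision may fall as lower-ranked, less relevant items enter the top k.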
