Multi-Modal RAG — Beyond Text Retrieval
Real-world documents contain text, images, tables, charts, and code. Multi-modal RAG retrieves and reasons across all these modalities to answer questions that no single modality can address alone.
- Vision-Language Retrieval — Find relevant images and charts
- Table Understanding — Extract and reason over structured tables
- Document Layout — Use layout information for better retrieval
The answer is not always in the text — sometimes it is in the image, the table, or the chart.
Multi-Modal RAG
Documents are not just text. Research papers contain figures, financial reports contain tables, technical documentation contains code. Multi-modal RAG extends retrieval to handle all these modalities, providing comprehensive context for generation.
Definition: Multi-Modal RAG
Multi-modal RAG retrieves relevant information from multiple data types (text, images, tables, audio, video) and combines them into context for LLM generation. It uses modality-specific encoders and cross-modal alignment to find relevant content regardless of its format.
Multi-Modal Embeddings
Cross-Modal Alignment
Definition: Cross-Modal Embedding
Cross-modal embeddings map different modalities into a shared vector space where semantically similar content is close together, regardless of its modality. An image of a cat and the text "cat" should have similar embeddings.
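Contrastive Loss for Cross-Modal Alignment
A standard form of this loss (InfoNCE, shown in the text-to-image direction) is:
$$
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\left(\mathrm{sim}(t_i, v_i) / \tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(t_i, v_j) / \tau\right)}
$$
Here,
- $t_i$ = Text embedding for sample i
- $v_i$ = Image embedding for sample i
- $\tau$ = Temperature parameter
- $\mathrm{sim}(\cdot, \cdot)$ = Cosine similarity
- $N$ = Number of text-image pairs in the batch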
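To make the contrastive objective concrete, here is a minimal pure-Python sketch of a text-to-image contrastive loss over a small batch. The function names, the toy 2-D embeddings, and the default temperature are illustrative assumptions, not part of the original lesson; real training code would use a tensor library and usually adds the symmetric image-to-text term.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(text_embs, image_embs, temperature=0.07):
    """Text-to-image contrastive loss averaged over matched pairs.

    text_embs[i] and image_embs[i] are assumed to describe the same sample;
    every other image in the batch acts as a negative for text i.
    """
    n = len(text_embs)
    total = 0.0
    for i in range(n):
        logits = [
            cosine_similarity(text_embs[i], image_embs[j]) / temperature
            for j in range(n)
        ]
        # Log-sum-exp with max subtraction for numerical stability
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += -(logits[i] - log_denominator)
    return total / n


# Toy 2-D embeddings (made up for illustration)
texts = [[1.0, 0.0], [0.0, 1.0]]
aligned_images = [[0.9, 0.1], [0.1, 0.9]]   # image i points the same way as text i
swapped_images = [[0.1, 0.9], [0.9, 0.1]]   # pairs deliberately mismatched

loss_aligned = contrastive_loss(texts, aligned_images)
loss_swapped = contrastive_loss(texts, swapped_images)
```

The loss is close to zero when each text embedding is nearest to its own image and grows when the pairs are mismatched, which is the pressure that pulls matching text and images together in the shared space.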
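import clip
import torch
from sentence_transformers import SentenceTransformer


class MultiModalEncoder:
    def __init__(self, device="cpu"):
        self.device = device
        # Sentence encoder for text-to-text retrieval (384-dimensional vectors)
        self.text_encoder = SentenceTransformer("all-MiniLM-L6-v2")
        # CLIP for images; its image and text towers share one embedding space
        self.clip_model, self.clip_preprocess = clip.load("ViT-B/32", device=device)

    def encode_text(self, text):
        # These vectors live in the sentence encoder's space and are NOT
        # comparable with CLIP image embeddings. For text-to-image search,
        # embed the query with clip.tokenize + clip_model.encode_text instead.
        return self.text_encoder.encode(text)

    def encode_image(self, image):
        # `image` is a PIL image; move the tensor to the model's device
        image_input = self.clip_preprocess(image).unsqueeze(0).to(self.device)
        with torch.no_grad():
            embedding = self.clip_model.encode_image(image_input)
        # .cpu() is required before .numpy() when the model runs on a GPU
        return embedding.cpu().numpy()

    def encode_table(self, table_df):
        # Convert the table to a text representation, then embed it as text
        text_repr = table_df.to_string(index=False)
        return self.encode_text(text_repr)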
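The `encode_table` method above serializes a table as a whitespace-aligned grid. One alternative, offered here only as a sketch and not as the lesson's method, is to write each row as `header: value` pairs so that every cell stays next to its column name in the embedded text. The function name `linearize_table` and the sample data are made up for illustration.

```python
def linearize_table(headers, rows, caption=""):
    """Serialize a table row by row as 'header: value' pairs.

    headers: list of column names
    rows:    list of rows, each a list of cell values in header order
    caption: optional caption placed on the first line
    """
    lines = []
    if caption:
        lines.append(f"Table: {caption}")
    for row in rows:
        cells = [f"{header}: {value}" for header, value in zip(headers, row)]
        lines.append("; ".join(cells))
    return "\n".join(lines)


# Hypothetical table used only to show the output format
text_repr = linearize_table(
    headers=["Quarter", "Revenue (USD M)"],
    rows=[["Q1", 120], ["Q2", 135]],
    caption="Quarterly revenue",
)
print(text_repr)
```

The resulting string can be passed to any text encoder. Whether this beats the grid layout depends on the encoder and the tables involved, so it is worth comparing both on your own data.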
Document Processing Pipeline
Layout Extraction
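# Image and pytesseract are presumably used by the extraction helpers, which are not shown
from PIL import Image
import pytesseract


def extract_document_layout(pdf_path):
    """Extract layout elements from a document.

    extract_text_blocks, extract_images, and extract_tables are helper
    functions assumed to be defined elsewhere.
    """
    elements = []
    # Extract text blocks with positions
    text_blocks = extract_text_blocks(pdf_path)
    for block in text_blocks:
        elements.append({
            "type": "text",
            "content": block.text,
            "bbox": block.bbox,
            "page": block.page,
        })
    # Extract images
    images = extract_images(pdf_path)
    for img in images:
        elements.append({
            "type": "image",
            "content": img.image,
            "bbox": img.bbox,
            "page": img.page,
            "caption": img.caption,
        })
    # Extract tables
    tables = extract_tables(pdf_path)
    for table in tables:
        elements.append({
            "type": "table",
            "content": table.dataframe,
            "bbox": table.bbox,
            "page": table.page,
            "caption": table.caption,
        })
    # Restore reading order so images and tables sit next to the text around them.
    # Assumes bbox is (x0, y0, x1, y1) with the origin at the top-left of the page.
    elements.sort(key=lambda e: (e["page"], e["bbox"][1], e["bbox"][0]))
    return elements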
Multi-Modal Chunking
Definition: Multi-Modal Chunking
Multi-modal chunking creates chunks that preserve the relationship between text and associated visuals. Instead of separating text from images, chunks include both with explicit linking.
def multimodal_chunking(elements, chunk_size=512):
    """Create multi-modal chunks preserving text-image relationships.

    `elements` must be in reading order; `chunk_size` is measured in characters.
    """
    def new_chunk():
        return {"text": "", "images": [], "tables": []}

    def is_empty(chunk):
        return not (chunk["text"] or chunk["images"] or chunk["tables"])

    chunks = []
    current_chunk = new_chunk()
    for element in elements:
        if element["type"] == "text":
            # Start a new chunk when this block would exceed the budget, but
            # never emit an empty chunk (e.g. when the first block is oversized)
            too_long = len(current_chunk["text"]) + len(element["content"]) > chunk_size
            if too_long and not is_empty(current_chunk):
                chunks.append(current_chunk)
                current_chunk = new_chunk()
            current_chunk["text"] += element["content"] + "\n"
        elif element["type"] == "image":
            current_chunk["images"].append({
                "image": element["content"],
                "caption": element.get("caption", ""),
                "context": current_chunk["text"][-200:],  # Recent text as context
            })
        elif element["type"] == "table":
            current_chunk["tables"].append({
                "data": element["content"],
                "caption": element.get("caption", ""),
                "context": current_chunk["text"][-200:],
            })
    if not is_empty(current_chunk):
        chunks.append(current_chunk)
    return chunks
Multi-Modal Retrieval
Modality-Specific Retrieval
import clip
import torch


class MultiModalRetriever:
    def __init__(self, encoder, index):
        self.encoder = encoder
        # `index` is assumed to expose search(query, modality=..., top_k=...)
        self.index = index

    def retrieve(self, query, modalities=("text", "image", "table"), top_k=5):
        """Retrieve from multiple modalities."""
        results = []
        # Text retrieval
        if "text" in modalities:
            text_results = self.index.search(query, modality="text", top_k=top_k)
            results.extend(text_results)
        # Image retrieval: embed the query with CLIP's text tower so it is
        # comparable with CLIP image embeddings (sentence-encoder vectors are not)
        if "image" in modalities:
            device = next(self.encoder.clip_model.parameters()).device
            tokens = clip.tokenize([query], truncate=True).to(device)
            with torch.no_grad():
                query_embedding = self.encoder.clip_model.encode_text(tokens).cpu().numpy()
            image_results = self.index.search(query_embedding, modality="image", top_k=top_k)
            results.extend(image_results)
        # Table retrieval
        if "table" in modalities:
            table_results = self.index.search(query, modality="table", top_k=top_k)
            results.extend(table_results)
        # Rank by relevance across modalities
        ranked = self.cross_modal_rank(query, results)
        return ranked[:top_k]

    def cross_modal_rank(self, query, results):
        """Rank results from different modalities.

        text_relevance, image_relevance, and table_relevance are not defined
        here; they must return scores on a scale comparable across modalities.
        """
        scored = []
        for result in results:
            if result["modality"] == "text":
                score = self.text_relevance(query, result["content"])
            elif result["modality"] == "image":
                score = self.image_relevance(query, result["content"])
            elif result["modality"] == "table":
                score = self.table_relevance(query, result["content"])
            else:
                continue  # Skip unknown modalities instead of reusing a stale score
            scored.append((result, score))
        scored.sort(key=lambda x: x[1], reverse=True)
        return [r for r, s in scored]
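The ranking step above depends on per-modality relevance functions whose scores are comparable with one another, which is hard to guarantee when each modality uses a different encoder. A commonly used alternative, swapped in here as a sketch rather than taken from the lesson, is reciprocal rank fusion (RRF): it ignores raw scores and combines results by their rank within each list. The item ids and the constant `k=60` are illustrative assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60, top_k=5):
    """Fuse several ranked lists of item ids into a single ranking.

    Each item receives the sum over lists of 1 / (k + rank), with rank
    starting at 1. Only positions matter, so scores produced by different
    encoders never have to be compared directly.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, item_id in enumerate(ranked, start=1):
            scores[item_id] += 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return [item_id for item_id, _ in fused[:top_k]]


# Hypothetical per-modality results, best match first
text_hits = ["text:12", "text:7", "text:3"]
image_hits = ["image:4", "image:9"]
table_hits = ["table:2"]

fused = reciprocal_rank_fusion([text_hits, image_hits, table_hits], top_k=4)
print(fused)
```

When the lists share no items, as in this example, RRF reduces to interleaving the lists by rank, so the top hit of every modality is kept. Items that appear in more than one list (for example, a figure found through both its caption and its image embedding) accumulate score and move up.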
Multi-Modal Generation
Context Assembly
def assemble_multimodal_context(retrieved_items):
    """Format multi-modal context as text for LLM input."""
    context_parts = []
    for item in retrieved_items:
        if item["modality"] == "text":
            context_parts.append(f"[Text]: {item['content']}")
        elif item["modality"] == "image":
            # Interpolating an image object into a string only yields its repr.
            # Emit a placeholder here and pass the image itself through the
            # vision-language model's image input.
            context_parts.append("[Image]: <attached separately>")
            if item.get("caption"):
                context_parts.append(f"[Caption]: {item['caption']}")
        elif item["modality"] == "table":
            # Convert table to markdown (DataFrame.to_markdown needs the tabulate package)
            table_md = item["content"].to_markdown()
            context_parts.append(f"[Table]:\n{table_md}")
    return "\n\n".join(context_parts)
For vision-language models (GPT-4V, Claude 3), images can be included directly in the prompt. For text-only models, images must be converted to text descriptions using image captioning models.
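The branching described above can be sketched as a function that builds a list of prompt parts. The part schema (`{"type": "text", ...}` / `{"type": "image", ...}`), the function name, and the `describe_image` callback are hypothetical; each model provider defines its own message format, and the captioning model is left to the caller.

```python
def build_prompt_parts(items, supports_vision, describe_image=None):
    """Turn retrieved items into prompt parts for a vision or text-only model.

    items:           dicts with "modality", "content", and optional "caption"
    supports_vision: True if the target model accepts images directly
    describe_image:  optional callable(image) -> str, e.g. a captioning model
    """
    parts = []
    for item in items:
        if item["modality"] != "image":
            parts.append({"type": "text", "text": str(item["content"])})
        elif supports_vision:
            # Vision-language model: pass the image through unchanged
            parts.append({"type": "image", "image": item["content"]})
            if item.get("caption"):
                parts.append({"type": "text", "text": f"Caption: {item['caption']}"})
        else:
            # Text-only model: fall back to a textual description
            description = item.get("caption", "")
            if not description and describe_image is not None:
                description = describe_image(item["content"])
            parts.append({
                "type": "text",
                "text": f"[Image description]: {description or 'unavailable'}",
            })
    return parts


# Placeholder content; a real item would hold image bytes or a PIL image
items = [
    {"modality": "text", "content": "Revenue grew in Q2."},
    {"modality": "image", "content": b"<png bytes>", "caption": "Quarterly revenue bar chart"},
]
vision_parts = build_prompt_parts(items, supports_vision=True)
text_parts = build_prompt_parts(items, supports_vision=False)
```

Keeping this decision in one place means the retrieval code does not need to know which kind of model will consume the context.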
Multi-Modal RAG Architecture
User Query
|
v
[Query Analysis] -> Identify required modalities
|
v
[Multi-Modal Retrieval] -> Search text, images, tables
|
v
[Cross-Modal Ranking] -> Rank results by relevance
|
v
[Context Assembly] -> Format multi-modal context
|
v
[Generation] -> LLM generates answer with multi-modal context
|
v
[Response] -> Text answer with references to source modalities
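The first stage of the pipeline, query analysis, can be as simple as a keyword heuristic. The sketch below is one assumed approach, not the lesson's implementation: the keyword lists and the function name `identify_modalities` are invented for illustration, and a production system might use a classifier or an LLM instead.

```python
import re

# Hypothetical keyword lists; extend them for your own documents
MODALITY_KEYWORDS = {
    "image": {"figure", "image", "chart", "diagram", "photo", "plot", "x-ray"},
    "table": {"table", "row", "column"},
}


def identify_modalities(query):
    """Return the modalities a query appears to need; text is always searched."""
    # Match whole words so that "growth" does not trigger the keyword "row"
    tokens = set(re.findall(r"[a-z0-9-]+", query.lower()))
    # Also match simple plurals such as "tables" or "charts"
    tokens |= {t[:-1] for t in tokens if t.endswith("s")}
    modalities = ["text"]
    for modality, keywords in MODALITY_KEYWORDS.items():
        if tokens & keywords:
            modalities.append(modality)
    return modalities


figure_query = identify_modalities("What does Figure 3 show about model performance?")
chart_query = identify_modalities("What was the revenue growth shown in the bar chart?")
plain_query = identify_modalities("Summarize the introduction.")
```

The returned list can be passed as the `modalities` argument of a retriever, so image and table indexes are searched only when the query calls for them.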
Use Cases
| Use Case | Modalities | Query Example |
|---|---|---|
| Research papers | Text, Figures, Tables | "What does Figure 3 show about model performance?" |
| Financial reports | Text, Tables, Charts | "What was the revenue growth shown in the bar chart?" |
| Medical records | Text, Images (X-rays), Labs | "Compare the X-ray findings with the lab results" |
| Technical docs | Text, Code, Diagrams | "How does the architecture diagram relate to the code?" |
| News articles | Text, Images, Videos | "What is shown in the image accompanying this article?" |
Practice Exercises
- Layout Extraction: Build a document processing pipeline that extracts text blocks, images, and tables from a PDF while preserving their spatial relationships.
- Cross-Modal Retrieval: Implement CLIP-based image retrieval that finds images relevant to a text query. Compare with text-only retrieval.
- Table Reasoning: Build a system that retrieves relevant tables and answers questions about their contents.
- Multi-Modal Generation: Implement a multi-modal RAG system that answers questions using both text and images as context.
Key Takeaways
Summary: Multi-Modal RAG
- Documents contain multiple modalities — text, images, tables, charts
- Cross-modal embeddings map different modalities to shared vector space
- Multi-modal chunking preserves text-image relationships
- Modality-specific retrieval handles each type appropriately
- Cross-modal ranking combines results from different modalities
- Vision-language models can process images directly in context
- Table understanding requires specialized parsing and reasoning
- Layout information improves retrieval relevance
What to Learn Next
-> Multimodal LLMs: Models that understand text, images, and more.
-> Graph RAG and Knowledge Graphs: Structured knowledge for better reasoning.
-> RAG System Design: Advanced RAG architecture and design patterns.
-> Self-RAG and Adaptive Retrieval: When to retrieve and when to rely on knowledge.
-> Diffusion Models: Understanding image generation for multi-modal AI.
-> Vision Transformers: Transformer architectures for image understanding.