Multi-Modal RAG

Multi-Modal RAG — Beyond Text Retrieval

Real-world documents contain text, images, tables, charts, and code. Multi-modal RAG retrieves and reasons across all these modalities to answer questions that no single modality can address alone.

  • Vision-Language Retrieval — Find relevant images and charts
  • Table Understanding — Extract and reason over structured tables
  • Document Layout — Use layout information for better retrieval

The answer is not always in the text — sometimes it is in the image, the table, or the chart.

Multi-Modal RAG

Documents are not just text. Research papers contain figures, financial reports contain tables, technical documentation contains code. Multi-modal RAG extends retrieval to handle all these modalities, providing comprehensive context for generation.

Definition: Multi-Modal RAG

Multi-modal RAG retrieves relevant information from multiple data types (text, images, tables, audio, video) and combines them into context for LLM generation. It uses modality-specific encoders and cross-modal alignment to find relevant content regardless of its format.

Multi-Modal Embeddings

Cross-Modal Alignment

Definition: Cross-Modal Embedding

Cross-modal embeddings map different modalities into a shared vector space where semantically similar content is close together, regardless of its modality. An image of a cat and the text "cat" should have similar embeddings.
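To make the shared-space idea concrete, here is a minimal sketch of nearest-neighbour lookup across modalities. The 3-dimensional vectors and file names are made up for illustration; real encoders produce vectors with hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up embeddings standing in for encoder outputs in one shared space.
items = {
    ("image", "cat_photo.jpg"): [0.9, 0.1, 0.0],
    ("image", "bar_chart.png"): [0.0, 0.2, 0.9],
    ("text", "quarterly revenue summary"): [0.1, 0.3, 0.8],
}
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of the text query "cat"

def nearest(query, candidates):
    """Return the key of the candidate most similar to the query vector."""
    return max(candidates, key=lambda key: cosine(query, candidates[key]))

best = nearest(query_vec, items)
print(best)
```

Because every item lives in the same space, the lookup does not need to know whether a candidate is an image or a text passage.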

Contrastive Loss for Cross-Modal Alignment

$$\mathcal{L} = -\log \frac{\exp(\text{sim}(z_i^{\text{text}}, z_i^{\text{img}}) / \tau)}{\sum_{j=1}^{N} \exp(\text{sim}(z_i^{\text{text}}, z_j^{\text{img}}) / \tau)}$$

Here,

  • $z_i^{\text{text}}$ = Text embedding for sample i
  • $z_i^{\text{img}}$ = Image embedding for sample i
  • $\tau$ = Temperature parameter
  • $\text{sim}$ = Cosine similarity
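
The loss above can be sketched in plain Python. This is an illustrative version under a few assumptions: N is taken to be the number of text-image pairs in the batch, the per-sample loss is averaged over the batch, only the text-to-image direction from the formula is computed, and the default `tau=0.07` is an arbitrary choice for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def contrastive_loss(text_embs, img_embs, tau=0.07):
    """Mean text-to-image contrastive loss; text_embs[i] pairs with img_embs[i]."""
    n = len(text_embs)
    total = 0.0
    for i in range(n):
        logits = [cosine(text_embs[i], img_embs[j]) / tau for j in range(n)]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += -(logits[i] - log_denominator)
    return total / n

texts = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(texts, [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss(texts, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Matching pairs that sit close together drive the loss toward zero, while mismatched pairs are penalized, which is what pulls the two encoders into a shared space during training.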
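
from sentence_transformers import SentenceTransformer
import clip
import torch

class MultiModalEncoder:
    """Text and table embeddings from MiniLM; image embeddings from CLIP."""

    def __init__(self, device=None):
        # Keep both models and all inputs on the same device.
        self.device = device or ("cuda" if torch.cuda.is_available() else "cpu")
        self.text_encoder = SentenceTransformer("all-MiniLM-L6-v2", device=self.device)
        self.clip_model, self.clip_preprocess = clip.load("ViT-B/32", device=self.device)
    
    def encode_text(self, text):
        # 384-dim MiniLM vector: compare only against other MiniLM vectors.
        return self.text_encoder.encode(text)
    
    def encode_text_clip(self, text):
        # 512-dim CLIP text vector: shares a space with encode_image outputs,
        # so use this (not encode_text) for text-to-image search.
        tokens = clip.tokenize([text], truncate=True).to(self.device)
        with torch.no_grad():
            embedding = self.clip_model.encode_text(tokens)
        return embedding.cpu().numpy()
    
    def encode_image(self, image):
        # `image` is a PIL image; the preprocess transform resizes and normalizes it.
        image_input = self.clip_preprocess(image).unsqueeze(0).to(self.device)
        with torch.no_grad():
            embedding = self.clip_model.encode_image(image_input)
        return embedding.cpu().numpy()
    
    def encode_table(self, table_df):
        # Convert the table to plain text, then embed it like any other text.
        text_repr = table_df.to_string(index=False)
        return self.encode_text(text_repr)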
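
One alternative to a single whitespace-aligned dump of the table is to serialize it row by row, so each value stays next to its column name. The sketch below is an assumption-laden illustration, not the lesson's method: `serialize_table` is a hypothetical helper, it takes a list of row dicts rather than a DataFrame to stay dependency-free, and the revenue figures are made up.

```python
def serialize_table(rows, caption=""):
    """Turn a table (list of row dicts) into one 'column: value' line per row."""
    lines = []
    if caption:
        lines.append(f"Table: {caption}")
    for row in rows:
        cells = [f"{col}: {val}" for col, val in row.items()]
        lines.append("; ".join(cells))
    return "\n".join(lines)

# Made-up example data.
rows = [
    {"Quarter": "Q1", "Revenue (USD M)": 12.5},
    {"Quarter": "Q2", "Revenue (USD M)": 14.1},
]
table_text = serialize_table(rows, caption="Revenue by quarter")
print(table_text)
```

With pandas, `table_df.to_dict("records")` yields row dicts in this shape, and the resulting string can be passed to the text encoder like any other passage.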

Document Processing Pipeline

Layout Extraction

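from PIL import Image
import pytesseract

# extract_text_blocks, extract_images, and extract_tables are placeholder helpers
# that this lesson does not define. Each is expected to return objects exposing
# the attributes read below (text/image/dataframe, bbox, page, caption).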

def extract_document_layout(pdf_path):
    """Extract layout elements from a document."""
    elements = []
    
    # Extract text blocks with positions
    text_blocks = extract_text_blocks(pdf_path)
    for block in text_blocks:
        elements.append({
            "type": "text",
            "content": block.text,
            "bbox": block.bbox,
            "page": block.page
        })
    
    # Extract images
    images = extract_images(pdf_path)
    for img in images:
        elements.append({
            "type": "image",
            "content": img.image,
            "bbox": img.bbox,
            "page": img.page,
            "caption": img.caption
        })
    
    # Extract tables
    tables = extract_tables(pdf_path)
    for table in tables:
        elements.append({
            "type": "table",
            "content": table.dataframe,
            "bbox": table.bbox,
            "page": table.page,
            "caption": table.caption
        })
    
    return elements
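
Extracted elements usually arrive grouped by type rather than in reading order, which matters because later chunking relies on text and visuals appearing near each other. The following sketch sorts elements into a rough reading order. It assumes `bbox` is `(x0, y0, x1, y1)` with the origin at the top-left of the page and a single-column layout; `reading_order` and `line_tolerance` are hypothetical names, and the `id` field exists only to make the example readable.

```python
def reading_order(elements, line_tolerance=5.0):
    """Sort layout elements by page, then top to bottom, then left to right."""
    def sort_key(element):
        x0, y0, _x1, _y1 = element["bbox"]
        # Snap y to coarse bands so elements on roughly the same line sort by x.
        return (element["page"], round(y0 / line_tolerance), x0)
    return sorted(elements, key=sort_key)

# Made-up elements in arbitrary extraction order.
elements = [
    {"id": "side_note", "page": 1, "bbox": (300, 102, 500, 120)},
    {"id": "figure", "page": 1, "bbox": (50, 300, 250, 450)},
    {"id": "heading", "page": 1, "bbox": (50, 100, 250, 120)},
    {"id": "intro", "page": 0, "bbox": (50, 80, 500, 200)},
]
ordered_ids = [element["id"] for element in reading_order(elements)]
print(ordered_ids)
```

Multi-column pages would need column detection before this sort, which is outside the scope of the sketch.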

Multi-Modal Chunking

Definition: Multi-Modal Chunking

Multi-modal chunking creates chunks that preserve the relationship between text and associated visuals. Instead of separating text from images, chunks include both with explicit linking.

def multimodal_chunking(elements, chunk_size=512):
    """Create multi-modal chunks preserving text-image relationships."""
    chunks = []
    current_chunk = {"text": "", "images": [], "tables": []}
    
    for element in elements:
        if element["type"] == "text":
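            # chunk_size is measured in characters. Flush before the text would
            # exceed it, but never emit a chunk that is still empty.
            has_content = current_chunk["text"] or current_chunk["images"] or current_chunk["tables"]
            if has_content and len(current_chunk["text"]) + len(element["content"]) > chunk_size:
                chunks.append(current_chunk)
                current_chunk = {"text": "", "images": [], "tables": []}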
            current_chunk["text"] += element["content"] + "\n"
        
        elif element["type"] == "image":
            current_chunk["images"].append({
                "image": element["content"],
                "caption": element.get("caption", ""),
                "context": current_chunk["text"][-200:]  # Recent text as context
            })
        
        elif element["type"] == "table":
            current_chunk["tables"].append({
                "data": element["content"],
                "caption": element.get("caption", ""),
                "context": current_chunk["text"][-200:]
            })
    
    if current_chunk["text"] or current_chunk["images"] or current_chunk["tables"]:
        chunks.append(current_chunk)
    
    return chunks
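
One way to keep the text-image link alive after indexing is to store a parent chunk id on every record. The sketch below flattens chunks in the shape produced above into per-item index records; `chunks_to_records`, `expand_hit`, and the `chunk_id` field are hypothetical names introduced for illustration, and the sample chunk is made up.

```python
def chunks_to_records(chunks):
    """Flatten multi-modal chunks into one index record per text, image, or table."""
    records = []
    for chunk_id, chunk in enumerate(chunks):
        if chunk["text"]:
            records.append({"chunk_id": chunk_id, "modality": "text",
                            "content": chunk["text"]})
        for img in chunk["images"]:
            records.append({"chunk_id": chunk_id, "modality": "image",
                            "content": img["image"], "caption": img["caption"]})
        for tbl in chunk["tables"]:
            records.append({"chunk_id": chunk_id, "modality": "table",
                            "content": tbl["data"], "caption": tbl["caption"]})
    return records

def expand_hit(record, chunks):
    """Return the parent chunk's text for any retrieved record."""
    return chunks[record["chunk_id"]]["text"]

# Made-up chunk; the image value is a stand-in for a real image object.
chunks = [{
    "text": "Figure 3 compares accuracy across model sizes.\n",
    "images": [{"image": "<image object>", "caption": "Figure 3",
                "context": "Figure 3 compares accuracy across model sizes.\n"}],
    "tables": [],
}]
records = chunks_to_records(chunks)
image_hit = records[1]
print(expand_hit(image_hit, chunks))
```

When an image record is retrieved, `expand_hit` pulls the surrounding text back in so the generator sees both.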

Multi-Modal Retrieval

Modality-Specific Retrieval

class MultiModalRetriever:
    def __init__(self, encoder, index):
        self.encoder = encoder
        self.index = index
    
    def retrieve(self, query, modalities=("text", "image", "table"), top_k=5):
        """Retrieve from multiple modalities."""
        results = []
        
        # Text retrieval
        if "text" in modalities:
            text_results = self.index.search(query, modality="text", top_k=top_k)
            results.extend(text_results)
        
        # Image retrieval: embed the query with the CLIP text encoder so it lives
        # in the same space as the image vectors. encode_text_clip is assumed to
        # wrap clip_model.encode_text on the encoder.
        if "image" in modalities:
            query_embedding = self.encoder.encode_text_clip(query)
            image_results = self.index.search(query_embedding, modality="image", top_k=top_k)
            results.extend(image_results)
        
        # Table retrieval
        if "table" in modalities:
            table_results = self.index.search(query, modality="table", top_k=top_k)
            results.extend(table_results)
        
        # Rank by relevance across modalities
        ranked = self.cross_modal_rank(query, results)
        return ranked[:top_k]
    
    def cross_modal_rank(self, query, results):
        """Rank results from different modalities.

        text_relevance, image_relevance, and table_relevance are scoring hooks
        that this lesson does not define; they must return scores on a
        comparable scale for the merged ranking to be meaningful.
        """
        scored = []
        for result in results:
            if result["modality"] == "text":
                score = self.text_relevance(query, result["content"])
            elif result["modality"] == "image":
                score = self.image_relevance(query, result["content"])
            elif result["modality"] == "table":
                score = self.table_relevance(query, result["content"])
            else:
                continue  # Skip unknown modalities instead of reusing a stale score
            scored.append((result, score))
        
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return [result for result, _score in scored]
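
Scores produced by different encoders are often not directly comparable, so sorting them in one list can favor whichever modality happens to score on a larger scale. One common workaround, swapped in here as an alternative to the score-based ranking above, is reciprocal rank fusion (RRF), which uses only each item's rank within its own list. The item ids are made up, and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(ranked_lists, k=60, top_k=5):
    """Fuse several best-first id lists into one ranking using rank positions only."""
    scores = {}
    for ranked in ranked_lists:
        for rank, item_id in enumerate(ranked, start=1):
            # Each list contributes 1 / (k + rank); items found by several
            # retrievers accumulate a higher total.
            scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores, key=lambda item_id: scores[item_id], reverse=True)
    return fused[:top_k]

# Made-up per-modality results, best first. "fig3" is found both through its
# caption text and through its image embedding.
text_hits = ["t1", "t2", "fig3"]
image_hits = ["fig3", "img7"]
table_hits = ["tab2", "t1"]
fused = reciprocal_rank_fusion([text_hits, image_hits, table_hits])
print(fused)
```

Items surfaced by more than one retriever rise to the top, without any assumption about how the underlying similarity scores are scaled.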

Multi-Modal Generation

Context Assembly

def assemble_multimodal_context(retrieved_items):
    """Format multi-modal context for LLM input."""
    context_parts = []
    
    image_count = 0
    for item in retrieved_items:
        if item["modality"] == "text":
            context_parts.append(f"[Text]: {item['content']}")
        
        elif item["modality"] == "image":
            # An image object cannot be inlined into a string prompt. Emit a
            # numbered placeholder here and pass the image itself to a
            # vision-language model through its image input.
            image_count += 1
            context_parts.append(f"[Image {image_count}]")
            if item.get("caption"):
                context_parts.append(f"[Caption]: {item['caption']}")
        
        elif item["modality"] == "table":
            # Convert table to markdown (DataFrame.to_markdown needs the tabulate package)
            table_md = item["content"].to_markdown()
            context_parts.append(f"[Table]:\n{table_md}")
    
    return "\n\n".join(context_parts)

For vision-language models (GPT-4V, Claude 3), images can be included directly in the prompt. For text-only models, images must be converted to text descriptions using image captioning models.
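
The two cases can be sketched as a single function that builds a list of prompt parts. Everything here is an assumption made for illustration: the `{"type": ..., ...}` part layout is a generic placeholder and not any provider's API, `describe_image` is a hypothetical captioning callable, and an existing caption is reused before falling back to it.

```python
def build_prompt_parts(question, retrieved_items, supports_images, describe_image=None):
    """Build prompt parts, passing images through or replacing them with text."""
    parts = []
    for item in retrieved_items:
        if item["modality"] == "image":
            if supports_images:
                # Vision-language model: hand over the image itself.
                parts.append({"type": "image", "image": item["content"]})
            else:
                # Text-only model: fall back to a caption or generated description.
                description = item.get("caption")
                if not description and describe_image is not None:
                    description = describe_image(item["content"])
                description = description or "no description available"
                parts.append({"type": "text",
                              "text": f"[Image description]: {description}"})
        else:
            label = item["modality"].title()
            parts.append({"type": "text", "text": f"[{label}]: {item['content']}"})
    parts.append({"type": "text", "text": f"Question: {question}"})
    return parts

fake_image = object()  # stand-in for a real image object
retrieved = [
    {"modality": "text", "content": "Revenue grew 12% year over year."},
    {"modality": "image", "content": fake_image, "caption": ""},
]
vision_parts = build_prompt_parts("What does the chart show?", retrieved, supports_images=True)
text_parts = build_prompt_parts(
    "What does the chart show?", retrieved, supports_images=False,
    describe_image=lambda image: "bar chart of revenue by year",  # fake captioner
)
print(text_parts[1]["text"])
```

The parts list would then be translated into whatever message format the chosen model expects.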

Multi-Modal RAG Architecture

Architecture Diagram
User Query
    |
    v
[Query Analysis] -> Identify required modalities
    |
    v
[Multi-Modal Retrieval] -> Search text, images, tables
    |
    v
[Cross-Modal Ranking] -> Rank results by relevance
    |
    v
[Context Assembly] -> Format multi-modal context
    |
    v
[Generation] -> LLM generates answer with multi-modal context
    |
    v
[Response] -> Text answer with references to source modalities
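
The first stage of the diagram, query analysis, can be approximated with a keyword heuristic. This is only a sketch: `MODALITY_CUES` and `required_modalities` are hypothetical names, the cue words are an arbitrary starting list, and a production system might use a classifier or an LLM for this step instead.

```python
import re

# Hypothetical cue words that suggest a non-text modality is needed.
MODALITY_CUES = {
    "image": ("figure", "image", "photo", "diagram", "chart", "x-ray", "plot"),
    "table": ("table", "row", "column"),
}

def required_modalities(query):
    """Always search text; add image or table search when the query mentions a cue."""
    words = set(re.findall(r"[a-z\-]+", query.lower()))
    modalities = ["text"]
    for modality, cues in MODALITY_CUES.items():
        if words & set(cues):
            modalities.append(modality)
    return modalities

print(required_modalities("What does Figure 3 show about model performance?"))
print(required_modalities("Summarize the table of lab results"))
```

The returned list can be passed as the `modalities` argument of a retriever so that only the relevant indexes are searched.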

Use Cases

| Use Case | Modalities | Query Example |
|---|---|---|
| Research papers | Text, Figures, Tables | "What does Figure 3 show about model performance?" |
| Financial reports | Text, Tables, Charts | "What was the revenue growth shown in the bar chart?" |
| Medical records | Text, Images (X-rays), Labs | "Compare the X-ray findings with the lab results" |
| Technical docs | Text, Code, Diagrams | "How does the architecture diagram relate to the code?" |
| News articles | Text, Images, Videos | "What is shown in the image accompanying this article?" |

Practice Exercises

  1. Layout Extraction: Build a document processing pipeline that extracts text blocks, images, and tables from a PDF while preserving their spatial relationships.

  2. Cross-Modal Retrieval: Implement CLIP-based image retrieval that finds images relevant to a text query. Compare with text-only retrieval.

  3. Table Reasoning: Build a system that retrieves relevant tables and answers questions about their contents.

  4. Multi-Modal Generation: Implement a multi-modal RAG system that answers questions using both text and images as context.

Key Takeaways

Summary: Multi-Modal RAG

  • Documents contain multiple modalities — text, images, tables, charts
  • Cross-modal embeddings map different modalities to a shared vector space
  • Multi-modal chunking preserves text-image relationships
  • Modality-specific retrieval handles each type appropriately
  • Cross-modal ranking combines results from different modalities
  • Vision-language models can process images directly in context
  • Table understanding requires specialized parsing and reasoning
  • Layout information improves retrieval relevance

What to Learn Next

-> Multimodal LLMs: Models that understand text, images, and more.

-> Graph RAG and Knowledge Graphs: Structured knowledge for better reasoning.

-> RAG System Design: Advanced RAG architecture and design patterns.

-> Self-RAG and Adaptive Retrieval: When to retrieve and when to rely on knowledge.

-> Diffusion Models: Understanding image generation for multi-modal AI.

-> Vision Transformers: Transformer architectures for image understanding.
