LLM for Question Answering — Machines That Understand and Answer

Question answering is a fundamental NLP task where LLMs excel, enabling systems to provide accurate, relevant answers to natural language questions. This guide covers the theoretical foundations, practical implementations, and evaluation of QA systems.

Open-Domain QA — Answering questions without restricting to a specific document
Extractive QA — Extracting answers directly from context passages
Conversational QA — Multi-turn question answering with context

The question is the key to understanding; the answer unlocks the knowledge.

LLM for Question Answering

Question answering (QA) systems aim to provide accurate answers to natural language questions. LLMs have transformed QA by enabling open-domain question answering, conversational QA, and complex reasoning over multiple information sources.

Definition: Question Answering

Question Answering (QA) is the task of automatically generating an answer to a natural language question. QA systems can be categorized by the source of answers (open-domain vs. closed-domain) and the type of answers generated (extractive vs. abstractive).

QA Taxonomy

By Answer Source

Definition: Open-Domain QA

Open-domain question answering answers questions using knowledge from the entire web or large document collections, without restricting to a specific context passage.

Definition: Closed-Domain QA

Closed-domain question answering answers questions based on a specific document or set of documents provided as context.

By Answer Type

Definition: Extractive QA

Extractive QA extracts a span of text directly from the context as the answer. The answer is always a substring of the source document.

Definition: Abstractive QA

Abstractive QA generates an answer that may not appear verbatim in the source document. The model synthesizes information to produce a natural language response.

Type	Answer Source	Answer Generation	Example
Open-Domain	Entire web	Abstractive	"What is the capital of France?" → "Paris"
Closed-Domain	Specific document	Extractive	Span extraction from context
Conversational	Multi-turn context	Abstractive	Follow-up questions in dialogue

Mathematical Formulation

Extractive QA

Extractive QA as Span Prediction

$$P(\text{start}, \text{end} \mid q, c) = P(\text{start} \mid q, c) \cdot P(\text{end} \mid \text{start}, q, c)$$

Here,

$q$ = Question
$c$ = Context passage
$\text{start}$ = Start token index of answer span
$\text{end}$ = End token index of answer span

The model predicts probability distributions over start and end positions in the context.
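The span selection step can be sketched as follows. This is a minimal illustration, not a specific model's decoding routine: it assumes the common simplification that start and end are scored independently given $q$ and $c$, and enforces start ≤ end during the search. The logits, the `best_span` helper, and the `max_answer_len` parameter are made up for the example.

```python
import math

def softmax(scores):
    """Convert raw scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, prob) maximizing P(start) * P(end) with start <= end."""
    p_start = softmax(start_logits)
    p_end = softmax(end_logits)
    best = (0, 0, 0.0)
    for i, ps in enumerate(p_start):
        # Only consider end positions at or after the start, within the length cap
        for j in range(i, min(i + max_answer_len, len(p_end))):
            score = ps * p_end[j]
            if score > best[2]:
                best = (i, j, score)
    return best

# Toy per-token logits (assumed values, not produced by a real model)
tokens = ["It", "was", "built", "in", "1889", "."]
start_logits = [0.1, 0.0, 0.2, 0.5, 4.0, 0.0]
end_logits = [0.0, 0.1, 0.0, 0.2, 4.5, 0.3]

i, j, prob = best_span(start_logits, end_logits)
print(tokens[i:j + 1], round(prob, 3))
```

A model that truly conditions the end position on the start would replace `p_end[j]` with a start-dependent distribution; the search loop stays the same.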

Open-Domain QA

Open-Domain QA Pipeline

$$P(a \mid q) = \sum_{d \in D} P(a \mid q, d) \, P(d \mid q)$$

Here,

$q$ = Question
$a$ = Answer
$D$ = Collection of documents
$P(d \mid q)$ = Document relevance probability
$P(a \mid q, d)$ = Answer given question and document

The pipeline retrieves relevant documents and generates answers conditioned on them.
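A small worked example of this marginalization, assuming three documents and hand-picked probabilities (all numbers and the `marginal_answer_probs` helper are illustrative, not outputs of a real retriever or reader):

```python
def marginal_answer_probs(doc_probs, answer_probs_per_doc):
    """Compute P(a|q) = sum over d of P(a|q,d) * P(d|q) for a small document set."""
    totals = {}
    for doc_id, p_doc in doc_probs.items():
        for answer, p_ans in answer_probs_per_doc[doc_id].items():
            totals[answer] = totals.get(answer, 0.0) + p_ans * p_doc
    return totals

# P(d|q): assumed relevance of each document to the question
doc_probs = {"d1": 0.6, "d2": 0.3, "d3": 0.1}

# P(a|q,d): assumed answer distribution when reading each document
answer_probs_per_doc = {
    "d1": {"Paris": 0.9, "Lyon": 0.1},
    "d2": {"Paris": 0.7, "Lyon": 0.3},
    "d3": {"Paris": 0.2, "Lyon": 0.8},
}

probs = marginal_answer_probs(doc_probs, answer_probs_per_doc)
best_answer = max(probs, key=probs.get)
print(best_answer, round(probs[best_answer], 2))  # Paris 0.77
```

Here "Paris" gets 0.9·0.6 + 0.7·0.3 + 0.2·0.1 = 0.77, so a weakly relevant document that prefers "Lyon" barely moves the final answer.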

Retrieval-Augmented Generation (RAG)

Definition: RAG for QA

Retrieval-Augmented Generation (RAG) combines retrieval of relevant documents with generation of answers. The model retrieves context documents and generates answers conditioned on them.

RAG Pipeline

Query Processing: Reformulate the question for retrieval
Document Retrieval: Find relevant documents from knowledge base
Context Integration: Combine retrieved documents with the question
Answer Generation: Generate an answer using the LLM
Answer Verification: Validate the answer against source documents
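The five steps above can be sketched end to end with standard-library Python. This is a toy under stated assumptions: retrieval is plain word overlap rather than a vector search, `fake_generate` is a placeholder standing in for an LLM call, and verification is a naive substring check. All function names are hypothetical.

```python
import re

def tokenize(text):
    """Query processing: lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def retrieve(question, documents, k=2):
    """Document retrieval: rank by word overlap (stand-in for a real retriever)."""
    q_tokens = set(tokenize(question))
    scored = [(len(q_tokens & set(tokenize(doc))), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]

def build_prompt(question, passages):
    """Context integration: number the passages and append the question."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

def is_supported(answer, passages):
    """Answer verification (naive): the answer must appear in a retrieved passage."""
    return any(answer.lower() in p.lower() for p in passages)

def rag_answer(question, documents, generate, k=2):
    passages = retrieve(question, documents, k)
    prompt = build_prompt(question, passages)
    answer = generate(prompt).strip()  # answer generation
    return answer, passages, is_supported(answer, passages)

documents = [
    "The Eiffel Tower is a wrought-iron lattice tower in Paris, France.",
    "The Eiffel Tower was built in 1889 for the World's Fair.",
    "Mount Everest is the highest mountain above sea level.",
]

def fake_generate(prompt):
    # Placeholder for an LLM call; always returns a fixed string
    return "1889"

answer, passages, supported = rag_answer(
    "When was the Eiffel Tower built?", documents, fake_generate
)
print(answer, supported)
```

Keeping each step as its own function makes it straightforward to swap the overlap retriever for an embedding search, or the substring check for a stricter entailment test, without touching the rest of the pipeline.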

RAG Score

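$$P(a \mid q) \approx \sum_{d \in \text{top-}k(q)} P(d \mid q) \, P(a \mid q, d)$$

Here,

$q$ = User question
$a$ = Generated answer
$d$ = Retrieved document
$k$ = Number of retrieved documents

Restricting the sum to the top-$k$ retrieved documents approximates the full marginalization over the document collection.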

Conversational QA

Definition: Conversational QA

Conversational question answering handles multi-turn dialogue where questions may depend on previous turns. The system must maintain context and resolve coreferences across the conversation.

Challenges in Conversational QA

Coreference Resolution: "Who is he?" → resolving to a previously mentioned entity
Ellipsis: "What about in 2020?" → implicit reference to previous topic
Topic Shift: Questions that change the topic mid-conversation
Context Maintenance: Tracking relevant information across turns

Conversational QA Example

Turn 1: "Who founded Tesla?" → "Elon Musk, Martin Eberhard, and Marc Tarpenning"
Turn 2: "When did they start it?" → "2003" (resolving "they" to the founders and "it" to Tesla)
Turn 3: "What about SpaceX?" → "Founded in 2002" (ellipsis plus a topic shift to a new company)

Evaluation Metrics

Exact Match (EM)

Exact Match Score

$$\text{EM} = \frac{1}{|Q|} \sum_{q \in Q} \mathbb{1}[\text{normalize}(\hat{a}_q) = \text{normalize}(a_q)]$$

Here,

$Q$ = Set of questions
$\hat{a}_q$ = Predicted answer
$a_q$ = Ground truth answer
$\text{normalize}$ = Normalization function (lowercase, strip punctuation)
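A minimal sketch of this metric, assuming the normalization is exactly lowercase plus punctuation stripping (with whitespace collapsed). Benchmark scripts often apply further steps, so treat `normalize` here as one possible choice:

```python
import string

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def exact_match(predictions, references):
    """Fraction of questions whose normalized prediction equals the normalized reference."""
    matches = [normalize(p) == normalize(r) for p, r in zip(predictions, references)]
    return sum(matches) / len(matches)

predictions = ["Paris.", "in 1889", "Elon Musk"]
references = ["paris", "1889", "Elon Musk"]
print(exact_match(predictions, references))  # 2 of 3 match
```

The second pair shows the metric's strictness: "in 1889" contains the right year but still scores zero.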

F1 Score

Token-Level F1 for QA

$$\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Here,

$\text{Precision}$ = Fraction of predicted tokens that also appear in the ground truth answer
$\text{Recall}$ = Fraction of ground truth tokens that also appear in the prediction
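Token-level F1 can be computed with a multiset intersection. The sketch below assumes whitespace tokenization after lowercasing; the `token_f1` name is ours:

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often as it occurs in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("built in 1889", "1889"))  # 0.5
```

For "built in 1889" against "1889", precision is 1/3 and recall is 1, giving F1 = 0.5. Exact Match would score the same pair as 0, which is the partial credit the comparison table refers to.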

Evaluation Comparison

Metric	Measures	Strengths	Weaknesses
Exact Match	Binary correctness	Simple, interpretable	Misses partial credit
F1 Score	Token overlap	Partial credit	Ignores word order
BLEU	N-gram precision	Captures fluency	Misses semantics
BERTScore	Semantic similarity	Meaning-aware	Expensive
Human Evaluation	Overall quality	Gold standard	Expensive, subjective

For open-domain QA, EM and F1 are standard metrics. For conversational QA, consider turn-level and dialogue-level metrics that capture context maintenance.

Practical Implementation

Basic QA with LLMs

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # gated model; requires accepting the license on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

context = """The Eiffel Tower is a wrought-iron lattice tower in Paris, France.
It was built in 1889 as the centerpiece of the 1889 World's Fair."""

question = "When was the Eiffel Tower built?"

prompt = f"""Context: {context}

Question: {question}

Answer:"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
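# Decode only the newly generated tokens (skip the prompt tokens)
prompt_len = inputs["input_ids"].shape[-1]
answer = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
print(answer.strip())  # e.g. "1889" (exact output depends on the model)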

RAG Implementation with LangChain

from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter

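# Split documents into chunks and create the vector store.
# `your_documents` is a placeholder: a list of LangChain Document objects you have already loaded.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
documents = text_splitter.split_documents(your_documents)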

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(documents, embeddings)

# Create QA chain
llm = HuggingFacePipeline.from_model_id(
    model_id="meta-llama/Meta-Llama-3-8B-Instruct",
    task="text-generation",
    model_kwargs={"device_map": "auto"}
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True
)

result = qa_chain.invoke({"query": "What is the capital of France?"})
print(result["result"])

Conversational QA

class ConversationalQA:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.history = []
    
    def answer(self, question):
        context = "\n".join([f"Q: {q}\nA: {a}" for q, a in self.history])
        
        prompt = f"""Context:\n{context}

Q: {question}
A:"""
        
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        outputs = self.model.generate(**inputs, max_new_tokens=100)
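        # Slice off the prompt tokens so only the generated answer is decoded
        prompt_len = inputs["input_ids"].shape[-1]
        answer = self.tokenizer.decode(
            outputs[0][prompt_len:], skip_special_tokens=True
        )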
        
        self.history.append((question, answer.strip()))
        return answer.strip()

For conversational QA, implement proper context windowing to handle long conversations. Keep the most recent turns and summarize older turns to maintain relevant context.
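One way to sketch that windowing, assuming the summary of older turns is produced elsewhere (for instance by a separate summarization call) and passed in as a string. The `build_history_context` helper and its parameters are hypothetical:

```python
def build_history_context(history, max_turns=3, summary=None):
    """Keep the most recent turns verbatim; prepend an optional summary of older turns."""
    recent = history[-max_turns:] if max_turns > 0 else []
    lines = []
    # Only mention a summary when some turns were actually dropped
    if summary and len(history) > max_turns:
        lines.append(f"Summary of earlier conversation: {summary}")
    lines.extend(f"Q: {q}\nA: {a}" for q, a in recent)
    return "\n".join(lines)

history = [("Q1", "A1"), ("Q2", "A2"), ("Q3", "A3"), ("Q4", "A4")]
context = build_history_context(
    history, max_turns=2, summary="The user asked two earlier questions."
)
print(context)
```

The returned string can replace the full-history join in a class like `ConversationalQA`, so the prompt length stays bounded as the conversation grows. Counting tokens instead of turns would be a natural refinement.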

QA Challenges

Ambiguity

Questions can be ambiguous in multiple ways:

Lexical Ambiguity: "What is a bank?" (financial vs. river)
Structural Ambiguity: "I saw the man with the telescope" (who has the telescope?)
Referential Ambiguity: "Who is he?" (which person?)

Evidence Reasoning

Complex questions require reasoning over multiple pieces of evidence:

Multi-Hop Reasoning

Question: "Who was the president of the country where the 2016 Olympics were held?"

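Step 1: "Where were the 2016 Olympics held?" → "Rio de Janeiro, Brazil"
Step 2: "Who was president of Brazil in 2016?" → "Dilma Rousseff" (she was suspended from office in May 2016, and Michel Temer was acting president during the Games, so the answer depends on how the question is dated)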

This requires multi-hop reasoning across multiple facts.
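The chaining of hops can be sketched with a tiny lookup table. The `FACTS` dictionary and relation names below are hypothetical stand-ins for retrieved evidence; a real system would issue a retrieval or LLM call per hop.

```python
# Hypothetical fact table; a real system would retrieve these facts instead
FACTS = {
    ("2016 Olympics", "host_country"): "Brazil",
    ("Brazil", "president_in_2016"): "Dilma Rousseff",  # answer as given in the example
}

def multi_hop(start_entity, relations, facts=FACTS):
    """Follow a chain of relations, feeding each hop's answer into the next hop."""
    entity = start_entity
    trace = []
    for relation in relations:
        key = (entity, relation)
        if key not in facts:
            return None, trace  # chain breaks: abstain rather than guess
        entity = facts[key]
        trace.append((key, entity))
    return entity, trace

answer, trace = multi_hop("2016 Olympics", ["host_country", "president_in_2016"])
for (subject, relation), value in trace:
    print(f"{subject} --{relation}--> {value}")
print("Answer:", answer)
```

Returning the trace alongside the answer makes each intermediate hop inspectable, which helps when an early hop is wrong and silently corrupts the final answer.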

Calibration

Definition: QA Calibration

QA calibration ensures that the model's confidence scores accurately reflect the probability of being correct. Well-calibrated models are crucial for knowing when to abstain from answering.
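Calibration is commonly summarized with expected calibration error (ECE), which bins predictions by confidence and compares average confidence with accuracy in each bin. The sketch below assumes equal-width bins and made-up confidence values; the abstention threshold is likewise an arbitrary illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between mean confidence and accuracy within each bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(avg_conf - accuracy)
    return ece

def answer_or_abstain(answer, confidence, threshold=0.5):
    """Return the answer only when confidence clears the threshold."""
    return answer if confidence >= threshold else "I don't know"

# Assumed model confidences and whether each answer was actually correct
confidences = [0.95, 0.9, 0.85, 0.3, 0.25]
correct = [True, True, False, False, True]

ece = expected_calibration_error(confidences, correct)
print(round(ece, 2))
print(answer_or_abstain("Paris", 0.3))
```

An ECE near zero means stated confidence tracks accuracy, which is what makes a fixed abstention threshold meaningful.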

Best Practices

Prompt Engineering

Clear instructions: "Answer based only on the provided context"
Format specification: "Provide a concise answer in one sentence"
Uncertainty handling: "If unsure, say 'I don't know'"
Citation requirements: "Cite the relevant passage"
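The four instructions above can be combined into one prompt template. The exact wording and the `build_qa_prompt` helper are illustrative. Numbering the passages is one assumed way to make citations checkable.

```python
def build_qa_prompt(question, passages):
    """Assemble a QA prompt with grounding, format, uncertainty, and citation rules."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer based only on the provided context.\n"
        "Provide a concise answer in one sentence.\n"
        "If unsure, say 'I don't know'.\n"
        "Cite the relevant passage by its number, e.g. [1].\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_qa_prompt(
    "When was the Eiffel Tower built?",
    ["The Eiffel Tower was built in 1889.", "It stands in Paris, France."],
)
print(prompt)
```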

Context Optimization

Relevant context: Retrieve the most relevant documents
Appropriate length: Balance context richness with model limits
Quality filtering: Remove noisy or irrelevant passages
Deduplication: Remove redundant information
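These four steps can be folded into a single selection pass. The sketch assumes each passage arrives with a retriever relevance score, measures length in words rather than tokens, and treats passages as duplicates only when they match after lowercasing and whitespace normalization. The thresholds are arbitrary.

```python
def select_context(scored_passages, min_score=0.3, max_words=50):
    """Filter, deduplicate, and pack passages into a word budget, best first."""
    seen = set()
    selected = []
    used = 0
    ranked = sorted(scored_passages, key=lambda sp: sp[0], reverse=True)
    for score, passage in ranked:
        key = " ".join(passage.lower().split())
        if score < min_score or key in seen:
            continue  # quality filtering and deduplication
        n_words = len(key.split())
        if used + n_words > max_words:
            continue  # skip passages that would exceed the length budget
        seen.add(key)
        selected.append(passage)
        used += n_words
    return selected

scored = [
    (0.9, "The Eiffel Tower was built in 1889."),
    (0.8, "the eiffel tower was built in 1889."),  # near-verbatim duplicate
    (0.5, "It stands in Paris, France."),
    (0.1, "Bananas are yellow."),                  # irrelevant, low score
]
print(select_context(scored))
```

Exact-match deduplication misses paraphrases; embedding similarity with a cutoff is a common next step when that matters.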

Always validate QA system outputs with domain experts. LLMs can generate plausible-sounding but incorrect answers, especially for complex or domain-specific questions.

Practice Exercises

Evaluation: Compare EM and F1 scores for a QA system on a dataset. When does EM give a misleading picture of performance?
Implementation: Build a simple RAG system using a vector store and LLM. Evaluate the impact of chunk size on answer quality.
Analysis: Analyze failure modes of a QA system on a benchmark dataset. What types of questions does it struggle with?
Research: Investigate the impact of retrieval quality on QA performance. How does the number of retrieved documents affect answer accuracy?

Key Takeaways:

QA systems can be extractive (span extraction) or abstractive (generated answers)
RAG combines retrieval and generation for open-domain QA
Conversational QA requires context maintenance and coreference resolution
EM and F1 are standard metrics; consider task-specific evaluation
Prompt engineering and context optimization are crucial for performance

What to Learn Next

-> LLM for Information Extraction Named entity extraction, relation extraction, and structured output generation.

-> LLM for Sentiment Analysis Aspect-based sentiment, emotion detection, and opinion mining.

-> LLM for Recommendation Systems Conversational recommenders, preference learning, and cold start solutions.

-> LLM for Content Creation Creative writing, marketing copy, and content generation at scale.

-> LLM Compliance and Governance Regulatory compliance, audit trails, and data governance for LLMs.

-> LLM Testing Strategies Unit testing, integration testing, and regression testing for LLM systems.
