LLM for Question Answering — Machines That Understand and Answer
Question answering is a fundamental NLP task where LLMs excel, enabling systems to provide accurate, relevant answers to natural language questions. This guide covers the theoretical foundations, practical implementations, and evaluation of QA systems.
- Open-Domain QA — Answering questions without restricting to a specific document
- Extractive QA — Extracting answers directly from context passages
- Conversational QA — Multi-turn question answering with context
The question is the key to understanding; the answer unlocks the knowledge.
LLM for Question Answering
Question answering (QA) systems aim to provide accurate answers to natural language questions. LLMs have transformed QA by enabling open-domain question answering, conversational QA, and complex reasoning over multiple information sources.
Definition: Question Answering
Question Answering (QA) is the task of automatically generating an answer to a natural language question. QA systems can be categorized by the source of answers (open-domain vs. closed-domain) and the type of answers generated (extractive vs. abstractive).
QA Taxonomy
By Answer Source
Definition: Open-Domain QA
Open-domain question answering answers questions using knowledge from the entire web or large document collections, without restricting to a specific context passage.
Definition: Closed-Domain QA
Closed-domain question answering answers questions based on a specific document or set of documents provided as context.
By Answer Type
Definition: Extractive QA
Extractive QA extracts a span of text directly from the context as the answer. The answer is always a substring of the source document.
Definition: Abstractive QA
Abstractive QA generates an answer that may not appear verbatim in the source document. The model synthesizes information to produce a natural language response.
| Type | Answer Source | Answer Generation | Example |
|---|---|---|---|
| Open-Domain | Entire web | Abstractive | "What is the capital of France?" → "Paris" |
| Closed-Domain | Specific document | Extractive | Span extraction from context |
| Conversational | Multi-turn context | Abstractive | Follow-up questions in dialogue |
Mathematical Formulation
Extractive QA
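Extractive QA as Span Prediction
$$P(s, e \mid q, c) = P_{\text{start}}(s \mid q, c) \cdot P_{\text{end}}(e \mid q, c)$$
Here,
- $q$ = Question
- $c$ = Context passage
- $s$ = Start token index of answer span
- $e$ = End token index of answer span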
The model predicts probability distributions over start and end positions in the context.
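As a rough illustration of how a span can be read off the start and end distributions, the sketch below scores every valid span and keeps the most probable one. The scores are invented for the example, and `best_span` and `max_answer_len` are hypothetical names, not part of any library; a real extractive model would produce the scores from the question and context.

```python
import math

def softmax(scores):
    """Convert raw scores into a probability distribution."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def best_span(start_scores, end_scores, max_answer_len=5):
    """Return (start, end, probability) of the most likely valid span.

    A span is valid when start <= end and it covers at most
    max_answer_len tokens.
    """
    p_start = softmax(start_scores)
    p_end = softmax(end_scores)
    best = (0, 0, 0.0)
    for s, ps in enumerate(p_start):
        for e in range(s, min(s + max_answer_len, len(p_end))):
            prob = ps * p_end[e]
            if prob > best[2]:
                best = (s, e, prob)
    return best

# Toy example: made-up scores for a six-token context.
tokens = ["It", "was", "built", "in", "1889", "."]
start_scores = [0.1, 0.2, 0.3, 1.0, 4.0, 0.1]
end_scores = [0.1, 0.1, 0.2, 0.3, 4.5, 0.5]

start, end, prob = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(answer)
```

Restricting the search to `start <= end` and a maximum length is a common way to rule out spans that the two independent distributions would otherwise allow.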
Open-Domain QA
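Open-Domain QA Pipeline
$$P(a \mid q) = \sum_{d \in D} P(d \mid q) \cdot P(a \mid q, d)$$
Here,
- $q$ = Question
- $a$ = Answer
- $D$ = Collection of documents
- $P(d \mid q)$ = Document relevance probability
- $P(a \mid q, d)$ = Probability of the answer given question and document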
The pipeline retrieves relevant documents and generates answers conditioned on them.
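To make the retrieve-then-answer combination concrete, here is a small numeric sketch that weights each document's answer distribution by that document's relevance and sums the results per answer. All probabilities are invented for illustration, and `marginalize_answers` is a hypothetical helper name.

```python
from collections import defaultdict

def marginalize_answers(doc_probs, answer_probs):
    """Combine per-document answer probabilities into one score per answer.

    doc_probs:    {doc_id: relevance probability of the document}
    answer_probs: {doc_id: {answer: probability of the answer given that document}}
    """
    totals = defaultdict(float)
    for doc_id, p_doc in doc_probs.items():
        for answer, p_answer in answer_probs[doc_id].items():
            totals[answer] += p_doc * p_answer
    return dict(totals)

# Made-up numbers for the question "What is the capital of France?"
doc_probs = {"d1": 0.6, "d2": 0.3, "d3": 0.1}
answer_probs = {
    "d1": {"Paris": 0.9, "Lyon": 0.1},
    "d2": {"Paris": 0.7, "Marseille": 0.3},
    "d3": {"Lyon": 1.0},
}

scores = marginalize_answers(doc_probs, answer_probs)
best_answer = max(scores, key=scores.get)
print(best_answer, round(scores[best_answer], 2))
```

An answer supported by several relevant documents accumulates probability mass, which is why it can beat an answer that a single low-relevance document states with certainty.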
Retrieval-Augmented Generation (RAG)
Definition: RAG for QA
Retrieval-Augmented Generation (RAG) combines retrieval of relevant documents with generation of answers. The model retrieves context documents and generates answers conditioned on them.
RAG Pipeline
- Query Processing: Reformulate the question for retrieval
- Document Retrieval: Find relevant documents from knowledge base
- Context Integration: Combine retrieved documents with the question
- Answer Generation: Generate an answer using the LLM
- Answer Verification: Validate the answer against source documents
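The five steps can be sketched end to end with toy components. This is a minimal sketch under strong assumptions: retrieval is plain word overlap rather than a vector search, the `llm` argument is any callable that maps a prompt to text (a stand-in is used here), and verification is a simple substring check. All function names are hypothetical.

```python
import re

# Small stopword list chosen only for this example.
STOPWORDS = {"a", "an", "the", "is", "of", "in", "what", "who", "when", "was"}

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def process_query(question):
    """Step 1: keep only the content words of the question."""
    return [w for w in tokenize(question) if w not in STOPWORDS]

def retrieve(query_terms, knowledge_base, k=2):
    """Step 2: rank documents by how many query terms they contain."""
    def overlap(doc):
        doc_words = set(tokenize(doc))
        return sum(term in doc_words for term in query_terms)
    ranked = sorted(knowledge_base, key=overlap, reverse=True)
    return [doc for doc in ranked[:k] if overlap(doc) > 0]

def build_prompt(question, documents):
    """Step 3: combine the retrieved documents with the question."""
    context = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(documents, start=1))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

def verify(answer, documents):
    """Step 5: accept the answer only if it appears in a retrieved document."""
    return bool(answer) and any(answer.lower() in doc.lower() for doc in documents)

def answer_question(question, knowledge_base, llm, k=2):
    terms = process_query(question)
    documents = retrieve(terms, knowledge_base, k)
    prompt = build_prompt(question, documents)
    answer = llm(prompt).strip()  # Step 4: generation (any prompt -> text callable)
    return {"answer": answer, "sources": documents, "verified": verify(answer, documents)}

knowledge_base = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "The Eiffel Tower was built in 1889.",
]

def fake_llm(prompt):
    """Stand-in for a real model call; always returns the same text."""
    return " Paris"

result = answer_question("What is the capital of France?", knowledge_base, fake_llm)
print(result)
```

Passing the model in as a callable keeps the pipeline testable without loading an LLM; swapping in a real model only changes the `llm` argument.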
RAG Score
$$P(a \mid q) \approx \sum_{i=1}^{k} P(d_i \mid q) \cdot P(a \mid q, d_i)$$
Here,
- $q$ = User question
- $a$ = Generated answer
- $d_i$ = Retrieved document
- $k$ = Number of retrieved documents
Conversational QA
Definition: Conversational QA
Conversational question answering handles multi-turn dialogue where questions may depend on previous turns. The system must maintain context and resolve coreferences across the conversation.
Challenges in Conversational QA
- Coreference Resolution: "Who is he?" → resolving to a previously mentioned entity
- Ellipsis: "What about in 2020?" → implicit reference to previous topic
- Topic Shift: Questions that change the topic mid-conversation
- Context Maintenance: Tracking relevant information across turns
Conversational QA Example
Turn 1: "Who founded Tesla?" → "Martin Eberhard and Marc Tarpenning"
Turn 2: "When did they start it?" → "2003" (resolving "they" to the founders and "it" to Tesla)
Turn 3: "What about SpaceX?" → "Founded by Elon Musk in 2002" (ellipsis plus a topic shift to a new company)
Evaluation Metrics
Exact Match (EM)
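Exact Match Score
$$\text{EM} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \mathbb{1}\left[\text{norm}(\hat{a}_i) = \text{norm}(a_i)\right]$$
Here,
- $Q$ = Set of questions
- $\hat{a}_i$ = Predicted answer for question $i$
- $a_i$ = Ground truth answer for question $i$
- $\text{norm}$ = Normalization function (lowercase, strip punctuation)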
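A minimal sketch of Exact Match, assuming normalization consists only of the steps named above (lowercasing and stripping punctuation) plus whitespace collapsing. Benchmark scripts often normalize further, for example by removing articles; that is not done here. The function names and sample answers are illustrative.

```python
import string

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())

def exact_match(predictions, references):
    """Fraction of questions whose normalized prediction equals the reference."""
    if len(predictions) != len(references) or not predictions:
        raise ValueError("predictions and references must be non-empty and aligned")
    matches = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return matches / len(predictions)

predictions = ["Paris.", "in 1889", "Dilma Rousseff"]
references = ["paris", "1889", "Dilma Rousseff"]

em = exact_match(predictions, references)
print(f"EM = {em:.3f}")
```

The second prediction contains the right year but still scores zero, which is the "misses partial credit" weakness of EM.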
F1 Score
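Token-Level F1 for QA
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
Here,
- $\text{Precision}$ = Fraction of predicted tokens in ground truth
- $\text{Recall}$ = Fraction of ground truth tokens in prediction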
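Token-level F1 can be sketched as follows, assuming whitespace tokenization after lowercasing and punctuation stripping. Overlap is counted with multiplicity, so a token repeated in the prediction is only credited as often as it occurs in the ground truth. `token_f1` is an illustrative name.

```python
import string
from collections import Counter

def tokenize(text):
    """Lowercase, strip punctuation, and split on whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return text.split()

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall for one answer pair."""
    pred_tokens = tokenize(prediction)
    ref_tokens = tokenize(reference)
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)  # predicted tokens found in the reference
    recall = overlap / len(ref_tokens)      # reference tokens found in the prediction
    return 2 * precision * recall / (precision + recall)

score = token_f1("in 1889", "1889")
print(f"F1 = {score:.3f}")
```

Here precision is 1/2 and recall is 1/1, so the prediction earns partial credit even though it is not an exact match.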
Evaluation Comparison
| Metric | Measures | Strengths | Weaknesses |
|---|---|---|---|
| Exact Match | Binary correctness | Simple, interpretable | Misses partial credit |
| F1 Score | Token overlap | Partial credit | Ignores word order |
| BLEU | N-gram precision | Captures fluency | Misses semantics |
| BERTScore | Semantic similarity | Meaning-aware | Expensive |
| Human Evaluation | Overall quality | Gold standard | Expensive, subjective |
For open-domain QA, EM and F1 are standard metrics. For conversational QA, consider turn-level and dialogue-level metrics that capture context maintenance.
Practical Implementation
Basic QA with LLMs
from transformers import AutoTokenizer, AutoModelForCausalLM

# Gated model: access must be granted on the Hugging Face Hub before downloading.
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

context = """The Eiffel Tower is a wrought-iron lattice tower in Paris, France.
It was built in 1889 as the centerpiece of the 1889 World's Fair."""
question = "When was the Eiffel Tower built?"

# Closed-domain prompt: the model is given the passage and the question.
prompt = f"""Context: {context}
Question: {question}
Answer:"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)

# Decode only the newly generated tokens, skipping the prompt tokens.
prompt_length = inputs["input_ids"].shape[-1]
answer = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print(answer.strip())  # should mention 1889; exact wording depends on the model
RAG Implementation with LangChain
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load documents and create vector store.
# `your_documents` is a placeholder for a list of LangChain Document objects.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
documents = text_splitter.split_documents(your_documents)
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(documents, embeddings)

# Create QA chain
llm = HuggingFacePipeline.from_model_id(
    model_id="meta-llama/Meta-Llama-3-8B-Instruct",
    task="text-generation",
    model_kwargs={"device_map": "auto"},
)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # place all retrieved chunks into a single prompt
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),  # top 3 chunks
    return_source_documents=True,
)

result = qa_chain.invoke({"query": "What is the capital of France?"})
print(result["result"])
Conversational QA
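class ConversationalQA:
    """Multi-turn QA that feeds earlier question-answer pairs back as context."""

    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.history = []  # list of (question, answer) tuples, oldest first

    def answer(self, question):
        # Serialize the previous turns so the model can resolve references.
        context = "\n".join([f"Q: {q}\nA: {a}" for q, a in self.history])
        prompt = f"Context:\n{context}\nQ: {question}\nA:"
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        outputs = self.model.generate(**inputs, max_new_tokens=100)
        # Decode only the newly generated tokens, skipping the prompt tokens.
        prompt_length = inputs["input_ids"].shape[-1]
        answer = self.tokenizer.decode(
            outputs[0][prompt_length:], skip_special_tokens=True
        )
        self.history.append((question, answer.strip()))
        return answer.strip()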
For conversational QA, implement proper context windowing to handle long conversations. Keep the most recent turns and summarize older turns to maintain relevant context.
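One way to sketch this windowing: keep the last few turns verbatim and collapse everything older into a single summary line. The summarizer here is a deliberately crude placeholder that only lists the earlier questions; in practice it would more likely be an LLM call. `build_context`, `default_summary`, and `max_recent_turns` are hypothetical names.

```python
def default_summary(turns):
    """Placeholder summarizer: list the earlier questions (not a real summary)."""
    return "Earlier questions: " + "; ".join(question for question, _ in turns)

def build_context(history, max_recent_turns=3, summarize=default_summary):
    """Build a prompt context from (question, answer) turns, oldest first.

    The most recent turns are kept verbatim; older turns are replaced
    by one summary line.
    """
    if max_recent_turns < 1:
        raise ValueError("max_recent_turns must be at least 1")
    recent = history[-max_recent_turns:]
    older = history[:-max_recent_turns]
    lines = []
    if older:
        lines.append(f"Summary: {summarize(older)}")
    lines.extend(f"Q: {q}\nA: {a}" for q, a in recent)
    return "\n".join(lines)

history = [
    ("Who wrote Hamlet?", "William Shakespeare"),
    ("When was it written?", "Around 1600"),
    ("Where is it set?", "Denmark"),
    ("Who is the main character?", "Prince Hamlet"),
]

context = build_context(history, max_recent_turns=2)
print(context)
```

The window size trades prompt length against the risk of losing a referent: a pronoun in the next question can only be resolved if its antecedent survives in the recent turns or in the summary.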
QA Challenges
Ambiguity
Questions can be ambiguous in multiple ways:
- Lexical Ambiguity: "What is a bank?" (financial vs. river)
- Structural Ambiguity: "I saw the man with the telescope" (who has the telescope?)
- Referential Ambiguity: "Who is he?" (which person?)
Evidence Reasoning
Complex questions require reasoning over multiple pieces of evidence:
Multi-Hop Reasoning
Question: "Who was the president of the country where the 2016 Olympics were held?"
Step 1: "Where were the 2016 Olympics held?" → "Rio de Janeiro, Brazil"
Step 2: "Who was president of Brazil in 2016?" → "Dilma Rousseff" (until her removal from office on 31 August 2016, after which Michel Temer succeeded her)
This requires multi-hop reasoning across multiple facts.
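The chaining itself can be sketched with a toy fact table, where the answer to each hop becomes the input to the next. The table, relation names, and `answer_multi_hop` are hypothetical; in a real system each hop would be a retrieval step followed by an LLM call rather than a dictionary lookup.

```python
# Hand-written fact table, for illustration only.
FACTS = {
    ("host_country", "2016 Summer Olympics"): "Brazil",
    ("president_in_2016", "Brazil"): "Dilma Rousseff",
}

def lookup(relation, entity):
    """Single-hop lookup; returns None when the fact is unknown."""
    return FACTS.get((relation, entity))

def answer_multi_hop(relations, start_entity):
    """Follow a chain of relations, feeding each answer into the next hop.

    Returns (final_answer, trace); final_answer is None if any hop fails.
    """
    entity = start_entity
    trace = []
    for relation in relations:
        result = lookup(relation, entity)
        if result is None:
            return None, trace
        trace.append((relation, entity, result))
        entity = result
    return entity, trace

final_answer, trace = answer_multi_hop(
    ["host_country", "president_in_2016"], "2016 Summer Olympics"
)
for relation, source, target in trace:
    print(f"{relation}({source}) -> {target}")
print(final_answer)
```

Keeping the trace makes it possible to see which hop failed, which matters because an error in the first hop silently corrupts every later one.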
Calibration
Definition: QA Calibration
A QA model is well calibrated when its confidence scores match the empirical probability that its answers are correct: among answers given with 80% confidence, about 80% should be right. Calibration is what makes confidence usable for deciding when to abstain from answering.
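Expected calibration error (ECE) is one common way to quantify this; it is brought in here as an illustration rather than something the text above prescribes. The sketch bins answers by confidence, compares average confidence with accuracy in each bin, and adds a simple threshold rule for abstaining. The data, bin count, and threshold are made up.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between mean confidence and accuracy per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and aligned")
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, is_correct))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += len(bucket) / total * abs(avg_conf - accuracy)
    return ece

def answer_or_abstain(answer, confidence, threshold=0.5):
    """Return the answer only when the model is confident enough."""
    return answer if confidence >= threshold else "I don't know"

# Made-up evaluation results: model confidence and whether the answer was right.
confidences = [0.9, 0.9, 0.9, 0.9, 0.3, 0.3]
correct = [True, True, True, False, False, True]

ece = expected_calibration_error(confidences, correct)
print(f"ECE = {ece:.3f}")
print(answer_or_abstain("Paris", 0.3))
```

In this toy data the model is overconfident in the high bin (0.9 confidence, 0.75 accuracy) and underconfident in the low bin (0.3 confidence, 0.5 accuracy); an ECE of zero would mean confidence and accuracy agree in every bin.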
Best Practices
Prompt Engineering
- Clear instructions: "Answer based only on the provided context"
- Format specification: "Provide a concise answer in one sentence"
- Uncertainty handling: "If unsure, say 'I don't know'"
- Citation requirements: "Cite the relevant passage"
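The four guidelines can be combined into a single prompt template. This is a sketch of one possible layout, not a recommended wording; `build_qa_prompt` and the numbered-passage citation format are assumptions made for the example.

```python
def build_qa_prompt(question, passages, max_sentences=1):
    """Assemble instructions, numbered context passages, and the question."""
    instructions = [
        "Answer based only on the provided context.",
        f"Provide a concise answer in at most {max_sentences} sentence(s).",
        "If unsure, say 'I don't know'.",
        "Cite the relevant passage by its number, for example [1].",
    ]
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    instruction_block = "\n".join(f"- {line}" for line in instructions)
    return (
        f"Instructions:\n{instruction_block}\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_qa_prompt(
    "When was the Eiffel Tower built?",
    [
        "The Eiffel Tower is a wrought-iron lattice tower in Paris, France.",
        "It was built in 1889 as the centerpiece of the 1889 World's Fair.",
    ],
)
print(prompt)
```

Numbering the passages gives the model something concrete to cite and makes the citation easy to check against the retrieved text afterwards.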
Context Optimization
- Relevant context: Retrieve the most relevant documents
- Appropriate length: Balance context richness with model limits
- Quality filtering: Remove noisy or irrelevant passages
- Deduplication: Remove redundant information
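These four steps can be sketched as one pass over scored passages: sort by relevance, drop low-scoring passages, skip duplicates, and stop adding once a length budget is used up. The scores, the `min_score` threshold, and the word-count budget are illustrative assumptions; a real system would typically budget in model tokens and use retriever scores.

```python
def optimize_context(scored_passages, min_score=0.3, max_words=50):
    """Select passages for the prompt from (passage, relevance_score) pairs."""
    seen = set()
    selected = []
    used_words = 0
    # Most relevant passages are considered first.
    for passage, score in sorted(scored_passages, key=lambda item: item[1], reverse=True):
        if score < min_score:
            continue  # quality filter: drop noisy or irrelevant passages
        key = " ".join(passage.lower().split())
        if key in seen:
            continue  # deduplication: same text up to case and whitespace
        n_words = len(passage.split())
        if used_words + n_words > max_words:
            continue  # length budget: skip, a shorter passage may still fit
        seen.add(key)
        selected.append(passage)
        used_words += n_words
    return selected

scored_passages = [
    ("The Eiffel Tower is in Paris.", 0.9),
    ("the eiffel tower is in  paris.", 0.8),
    ("Bananas are rich in potassium.", 0.1),
    ("It was built in 1889 for the World's Fair.", 0.7),
]

selected = optimize_context(scored_passages, max_words=20)
for passage in selected:
    print(passage)
```

Exact-match deduplication only catches trivial repeats; near-duplicate passages would need a similarity measure, which is left out to keep the sketch short.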
Always validate QA system outputs with domain experts. LLMs can generate plausible-sounding but incorrect answers, especially for complex or domain-specific questions.
Practice Exercises
- Evaluation: Compare EM and F1 scores for a QA system on a dataset. When does EM give a misleading picture of performance?
- Implementation: Build a simple RAG system using a vector store and LLM. Evaluate the impact of chunk size on answer quality.
- Analysis: Analyze failure modes of a QA system on a benchmark dataset. What types of questions does it struggle with?
- Research: Investigate the impact of retrieval quality on QA performance. How does the number of retrieved documents affect answer accuracy?
Key Takeaways:
- QA systems can be extractive (span extraction) or abstractive (generated answers)
- RAG combines retrieval and generation for open-domain QA
- Conversational QA requires context maintenance and coreference resolution
- EM and F1 are standard metrics; consider task-specific evaluation
- Prompt engineering and context optimization are crucial for performance
What to Learn Next
- LLM for Information Extraction: Named entity extraction, relation extraction, and structured output generation.
- LLM for Sentiment Analysis: Aspect-based sentiment, emotion detection, and opinion mining.
- LLM for Recommendation Systems: Conversational recommenders, preference learning, and cold start solutions.
- LLM for Content Creation: Creative writing, marketing copy, and content generation at scale.
- LLM Compliance and Governance: Regulatory compliance, audit trails, and data governance for LLMs.
- LLM Testing Strategies: Unit testing, integration testing, and regression testing for LLM systems.