LLM for Question Answering


LLM for Question Answering — Machines That Understand and Answer

Question answering is a fundamental NLP task where LLMs excel, enabling systems to provide accurate, relevant answers to natural language questions. This guide covers the theoretical foundations, practical implementations, and evaluation of QA systems.

  • Open-Domain QA — Answering questions without restricting to a specific document
  • Extractive QA — Extracting answers directly from context passages
  • Conversational QA — Multi-turn question answering with context

The question is the key to understanding; the answer unlocks the knowledge.

Question answering (QA) systems aim to provide accurate answers to natural language questions. LLMs have transformed QA by enabling open-domain question answering, conversational QA, and complex reasoning over multiple information sources.

Definition: Question Answering

Question Answering (QA) is the task of automatically generating an answer to a natural language question. QA systems can be categorized by the source of answers (open-domain vs. closed-domain) and the type of answers generated (extractive vs. abstractive).

QA Taxonomy

By Answer Source

Definition: Open-Domain QA

Open-domain question answering draws on knowledge from the entire web or a large document collection, rather than on a single context passage supplied with the question.

Definition: Closed-Domain QA

Closed-domain question answering restricts answers to a specific document or set of documents provided as context.

By Answer Type

Definition: Extractive QA

Extractive QA extracts a span of text directly from the context as the answer. The answer is always a substring of the source document.

Definition: Abstractive QA

Abstractive QA generates an answer that may not appear verbatim in the source document. The model synthesizes information to produce a natural language response.

| Type | Answer Source | Typical Answer Generation | Example |
|---|---|---|---|
| Open-Domain | Entire web | Abstractive | "What is the capital of France?" → "Paris" |
| Closed-Domain | Specific document | Extractive | Span extraction from context |
| Conversational | Multi-turn context | Abstractive | Follow-up questions in dialogue |

Mathematical Formulation

Extractive QA

Extractive QA as Span Prediction

$$P(\text{start}, \text{end} \mid q, c) = P(\text{start} \mid q, c) \cdot P(\text{end} \mid \text{start}, q, c)$$

Here,

  • $q$ = Question
  • $c$ = Context passage
  • $\text{start}$ = Start token index of answer span
  • $\text{end}$ = End token index of answer span

The model predicts probability distributions over start and end positions in the context.
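A minimal sketch of span selection from start and end scores. It assumes the end position is scored independently of the start, a common simplification of the conditional term above. The tokens and scores are made up for illustration, and `best_span` and `max_len` are hypothetical names.

```python
import math

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def best_span(start_scores, end_scores, max_len=10):
    """Return (start, end, probability) for the most likely span with start <= end."""
    p_start = softmax(start_scores)
    p_end = softmax(end_scores)
    best = (0, 0, 0.0)
    for i, ps in enumerate(p_start):
        # Only consider spans that end at or after the start, up to max_len tokens
        for j in range(i, min(i + max_len, len(p_end))):
            prob = ps * p_end[j]
            if prob > best[2]:
                best = (i, j, prob)
    return best

# Illustrative (invented) scores over the context tokens
tokens = ["It", "was", "built", "in", "1889", "."]
start_scores = [0.1, 0.0, 0.2, 1.0, 4.0, 0.0]
end_scores = [0.0, 0.1, 0.1, 0.5, 4.5, 0.3]

start, end, prob = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(answer, round(prob, 3))
```

Because the answer is read off the context tokens, it is a substring of the source by construction.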

Open-Domain QA

Open-Domain QA Pipeline

$$P(a \mid q) = \sum_{d \in D} P(a \mid q, d)\, P(d \mid q)$$

Here,

  • $q$ = Question
  • $a$ = Answer
  • $D$ = Collection of documents
  • $P(d \mid q)$ = Document relevance probability
  • $P(a \mid q, d)$ = Answer given question and document

The pipeline retrieves relevant documents and generates answers conditioned on them.

Retrieval-Augmented Generation (RAG)

Definition: RAG for QA

Retrieval-Augmented Generation (RAG) combines retrieval of relevant documents with generation of answers. The model retrieves context documents and generates answers conditioned on them.

RAG Pipeline

  1. Query Processing: Reformulate the question for retrieval
  2. Document Retrieval: Find relevant documents from knowledge base
  3. Context Integration: Combine retrieved documents with the question
  4. Answer Generation: Generate an answer using the LLM
  5. Answer Verification: Validate the answer against source documents
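The five steps above can be sketched end to end with toy components. This is an illustration only: the word-overlap retriever stands in for a real vector search, `generate` is a hypothetical callable wrapping whatever LLM is used, and the substring check is a deliberately crude form of verification.

```python
import re

def process_query(question):
    """Step 1: normalize the question into a set of lowercase word tokens."""
    return set(re.findall(r"[a-z0-9]+", question.lower()))

def retrieve(question, documents, k=2):
    """Step 2: rank documents by word overlap with the question (toy retriever)."""
    q_tokens = process_query(question)
    ranked = sorted(
        documents,
        key=lambda d: len(q_tokens & process_query(d)),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question, passages):
    """Step 3: combine the retrieved passages with the question."""
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

def verify(answer, passages):
    """Step 5: crude check that the answer text appears in a retrieved passage."""
    return any(answer.lower() in p.lower() for p in passages)

def answer_question(question, documents, generate, k=2):
    passages = retrieve(question, documents, k)
    prompt = build_prompt(question, passages)
    answer = generate(prompt).strip()          # Step 4: call the LLM
    return answer, passages, verify(answer, passages)

documents = [
    "The Eiffel Tower is in Paris, France.",
    "Mount Fuji is the highest mountain in Japan.",
    "Paris is the capital of France.",
]

# Stand-in for a real model call; always returns the same string
def fake_llm(prompt):
    return "Paris"

answer, passages, supported = answer_question(
    "What is the capital of France?", documents, fake_llm
)
print(answer, supported)
```

Swapping `retrieve` for an embedding search and `fake_llm` for a real model call keeps the same structure.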

RAG Score

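$$P(a \mid q) \approx \sum_{d \in \text{top-}k} P(d \mid q)\, P(a \mid q, d)$$

Here,

  • $q$ = User question
  • $a$ = Generated answer
  • $d$ = Retrieved document
  • $k$ = Number of retrieved documents

Restricting the sum to the top-$k$ retrieved documents approximates the full sum over the document collection.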
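A small worked example of the score, using invented probabilities. Renormalizing the retrieval probabilities over the top-k documents is an assumption made here so that the answer scores still sum to one; `marginalize` is a hypothetical helper name.

```python
from collections import defaultdict

def marginalize(retrieved, k):
    """Sum P(d|q) * P(a|q,d) over the k most relevant documents.

    retrieved: list of (p_doc, {answer: p_answer_given_doc}) pairs.
    """
    top_k = sorted(retrieved, key=lambda item: item[0], reverse=True)[:k]
    norm = sum(p_doc for p_doc, _ in top_k)  # renormalize over the top-k (assumption)
    scores = defaultdict(float)
    for p_doc, answer_probs in top_k:
        for candidate, p_ans in answer_probs.items():
            scores[candidate] += (p_doc / norm) * p_ans
    return dict(scores)

# Invented numbers for three retrieved documents
retrieved = [
    (0.5, {"Paris": 0.9, "Lyon": 0.1}),
    (0.3, {"Paris": 0.6, "Lyon": 0.4}),
    (0.2, {"Paris": 0.2, "Lyon": 0.8}),
]

scores = marginalize(retrieved, k=2)
best_answer = max(scores, key=scores.get)
print(best_answer, scores)
```

With k=2 the third document is dropped, so its preference for "Lyon" no longer influences the result.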

Conversational QA

Definition: Conversational QA

Conversational question answering handles multi-turn dialogue where questions may depend on previous turns. The system must maintain context and resolve coreferences across the conversation.

Challenges in Conversational QA

  1. Coreference Resolution: "Who is he?" → resolving to a previously mentioned entity
  2. Ellipsis: "What about in 2020?" → implicit reference to previous topic
  3. Topic Shift: Questions that change the topic mid-conversation
  4. Context Maintenance: Tracking relevant information across turns

Conversational QA Example

Turn 1: "Who founded Tesla?" → "Elon Musk, Martin Eberhard, and Marc Tarpenning"
Turn 2: "When did he start it?" → "2003" (resolving "he" to Elon Musk)
Turn 3: "What about SpaceX?" → "Founded in 2002" (topic shift to new company)

Evaluation Metrics

Exact Match (EM)

Exact Match Score

$$\text{EM} = \frac{1}{|Q|} \sum_{q \in Q} \mathbb{1}\left[\text{normalize}(\hat{a}_q) = \text{normalize}(a_q)\right]$$

Here,

  • $Q$ = Set of questions
  • $\hat{a}_q$ = Predicted answer
  • $a_q$ = Ground truth answer
  • $\text{normalize}$ = Normalization function (lowercase, strip punctuation)
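A minimal sketch of the EM computation, assuming the normalization described above (lowercasing and stripping punctuation) plus whitespace collapsing. Some benchmark scripts also drop articles such as "the", which is not done here. The predictions and references are illustrative.

```python
import string

def normalize(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())

def exact_match(predictions, references):
    """Fraction of predictions that equal their reference after normalization."""
    matches = [normalize(p) == normalize(r) for p, r in zip(predictions, references)]
    return sum(matches) / len(matches)

predictions = ["Paris.", "in 1889", "Dilma Rousseff"]
references = ["paris", "1889", "Dilma Rousseff"]

em = exact_match(predictions, references)
print(f"EM = {em:.3f}")
```

The second prediction contains the right year but scores zero, which is the partial-credit gap that token-level F1 addresses.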

F1 Score

Token-Level F1 for QA

$$\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Here,

  • $\text{Precision}$ = Fraction of predicted tokens in ground truth
  • $\text{Recall}$ = Fraction of ground truth tokens in prediction
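A minimal sketch of token-level F1 for a single prediction, assuming whitespace tokenization after lowercasing and punctuation stripping. Overlap is counted with multiplicity, so a repeated token is only credited as often as it appears in both strings.

```python
import string
from collections import Counter

def to_tokens(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return text.split()

def token_f1(prediction, reference):
    pred, ref = to_tokens(prediction), to_tokens(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

f1_partial = token_f1("in 1889", "1889")   # precision 0.5, recall 1.0
f1_exact = token_f1("Paris", "paris.")
f1_miss = token_f1("Lyon", "Paris")
print(round(f1_partial, 3), f1_exact, f1_miss)
```

A dataset-level score is typically the mean of this value over all questions, taking the maximum over references when several are available.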

Evaluation Comparison

| Metric | Measures | Strengths | Weaknesses |
|---|---|---|---|
| Exact Match | Binary correctness | Simple, interpretable | Misses partial credit |
| F1 Score | Token overlap | Partial credit | Ignores word order |
| BLEU | N-gram precision | Captures fluency | Misses semantics |
| BERTScore | Semantic similarity | Meaning-aware | Expensive |
| Human Evaluation | Overall quality | Gold standard | Expensive, subjective |

For open-domain QA, EM and F1 are standard metrics. For conversational QA, consider turn-level and dialogue-level metrics that capture context maintenance.

Practical Implementation

Basic QA with LLMs

from transformers import AutoTokenizer, AutoModelForCausalLM

# Gated model: access must be granted on the Hugging Face Hub before downloading
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

context = """The Eiffel Tower is a wrought-iron lattice tower in Paris, France.
It was built in 1889 as the centerpiece of the 1889 World's Fair."""

question = "When was the Eiffel Tower built?"

prompt = f"""Context: {context}

Question: {question}

Answer:"""

# Tokenize the prompt and move the tensors to the model's device
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)

# Decode only the newly generated tokens, skipping the prompt tokens
prompt_length = inputs["input_ids"].shape[-1]
answer = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print(answer.strip())  # should mention 1889; exact wording varies by model

RAG Implementation with LangChain

from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter

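# Split documents into overlapping chunks and index them in a vector store.
# `your_documents` is a placeholder for a list of LangChain Document objects loaded beforehand.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
documents = text_splitter.split_documents(your_documents)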

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma.from_documents(documents, embeddings)

# Create QA chain
llm = HuggingFacePipeline.from_model_id(
    model_id="meta-llama/Meta-Llama-3-8B-Instruct",
    task="text-generation",
    model_kwargs={"device_map": "auto"}
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
    return_source_documents=True
)

result = qa_chain.invoke({"query": "What is the capital of France?"})
print(result["result"])

Conversational QA

class ConversationalQA:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.history = []
    
    def answer(self, question):
        context = "\n".join([f"Q: {q}\nA: {a}" for q, a in self.history])
        
        prompt = f"""Context:\n{context}

Q: {question}
A:"""
        
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        outputs = self.model.generate(**inputs, max_new_tokens=100)
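        # Decode only the newly generated tokens, skipping the prompt tokens
        prompt_length = inputs["input_ids"].shape[-1]
        answer = self.tokenizer.decode(
            outputs[0][prompt_length:], skip_special_tokens=True
        )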
        
        self.history.append((question, answer.strip()))
        return answer.strip()

For conversational QA, implement proper context windowing to handle long conversations. Keep the most recent turns and summarize older turns to maintain relevant context.
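One way to sketch this windowing: keep the last few turns verbatim and collapse everything older into a single summary line. The `naive_summary` helper below is a placeholder that only lists earlier questions; a real system would more likely ask an LLM for the summary. All names and the `max_turns` default are assumptions for illustration.

```python
def naive_summary(turns):
    """Placeholder summarizer: lists the earlier questions only."""
    return "Earlier questions: " + "; ".join(q for q, _ in turns)

def build_context(history, max_turns=2, summarize=naive_summary):
    """Build the prompt context from (question, answer) pairs.

    The most recent max_turns turns are kept verbatim; older turns are
    replaced by one summary line.
    """
    cut = max(len(history) - max_turns, 0)
    older, recent = history[:cut], history[cut:]
    lines = []
    if older:
        lines.append(summarize(older))
    lines.extend(f"Q: {q}\nA: {a}" for q, a in recent)
    return "\n".join(lines)

history = [
    ("Who founded Tesla?", "Elon Musk, Martin Eberhard, and Marc Tarpenning"),
    ("When did he start it?", "2003"),
    ("What about SpaceX?", "Founded in 2002"),
]

context = build_context(history, max_turns=2)
print(context)
```

In the `ConversationalQA` class, this function could replace the line that joins the full history, so the prompt length stays bounded as the dialogue grows.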

QA Challenges

Ambiguity

Questions can be ambiguous in multiple ways:

  1. Lexical Ambiguity: "What is a bank?" (financial vs. river)
  2. Structural Ambiguity: "I saw the man with the telescope" (who has the telescope?)
  3. Referential Ambiguity: "Who is he?" (which person?)

Evidence Reasoning

Complex questions require reasoning over multiple pieces of evidence:

Multi-Hop Reasoning

Question: "Who was the president of the country where the 2016 Olympics were held?"

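Step 1: "Where were the 2016 Olympics held?" → "Rio de Janeiro, Brazil"
Step 2: "Who was president of Brazil in 2016?" → "Dilma Rousseff" (she was suspended in May 2016 and removed from office that August, with Michel Temer serving as acting president during the Games, so a careful system should also pin down the date in question)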

This requires multi-hop reasoning across multiple facts.
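The decomposition can be sketched as a chain of lookups in which each intermediate answer becomes the subject of the next hop. The tiny fact table below simply encodes the example's two answers, and the relation names are hypothetical. In a real system each hop would be a retrieval or LLM call rather than a dictionary lookup.

```python
# Toy fact table keyed by (entity, relation); values mirror the example above
FACTS = {
    ("2016 Olympics", "host_country"): "Brazil",
    ("Brazil", "president_in_2016"): "Dilma Rousseff",
}

def multi_hop(entity, relations, facts):
    """Follow a chain of relations, feeding each answer into the next lookup."""
    trace = []
    current = entity
    for relation in relations:
        current = facts[(current, relation)]
        trace.append((relation, current))
    return current, trace

final_answer, trace = multi_hop(
    "2016 Olympics", ["host_country", "president_in_2016"], FACTS
)
for relation, value in trace:
    print(f"{relation} -> {value}")
```

Keeping the trace makes it possible to see which hop went wrong when the final answer is incorrect.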

Calibration

Definition: QA Calibration

A QA model is well calibrated when its confidence scores match the actual probability that its answers are correct. Good calibration is crucial for deciding when the system should abstain from answering.
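Expected calibration error (ECE) is one common way to quantify this: group predictions into confidence bins and compare each bin's average confidence with its accuracy. The sketch below uses invented confidences and outcomes; the bin count and the abstention threshold are arbitrary assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Weighted average gap between confidence and accuracy across equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

def answer_or_abstain(answer, confidence, threshold=0.7):
    """Return the answer only when confidence clears the threshold."""
    return answer if confidence >= threshold else "I don't know"

# Invented model confidences and whether each answer was correct
confidences = [0.9, 0.9, 0.9, 0.9, 0.3, 0.3]
correct = [True, True, True, False, False, True]

ece = expected_calibration_error(confidences, correct)
print(f"ECE = {ece:.3f}")
print(answer_or_abstain("Paris", 0.92), "|", answer_or_abstain("Lyon", 0.41))
```

A threshold such as the one in `answer_or_abstain` is only meaningful if the confidences are reasonably calibrated, which is why the two ideas are usually considered together.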

Best Practices

Prompt Engineering

  1. Clear instructions: "Answer based only on the provided context"
  2. Format specification: "Provide a concise answer in one sentence"
  3. Uncertainty handling: "If unsure, say 'I don't know'"
  4. Citation requirements: "Cite the relevant passage"
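The four practices above can be combined into a single prompt template. This is one possible arrangement, not a prescribed format; the function name and the bracketed passage numbering are assumptions.

```python
def build_qa_prompt(question, passages):
    """Assemble a prompt with instructions, numbered passages, and the question."""
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer based only on the provided context.\n"          # clear instructions
        "Provide a concise answer in one sentence.\n"           # format specification
        "If unsure, say 'I don't know'.\n"                      # uncertainty handling
        "Cite the relevant passage by its number, e.g. [1].\n"  # citation requirement
        "\n"
        f"Context:\n{numbered}\n"
        "\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_qa_prompt(
    "When was the Eiffel Tower built?",
    [
        "The Eiffel Tower is a wrought-iron lattice tower in Paris, France.",
        "It was built in 1889 as the centerpiece of the 1889 World's Fair.",
    ],
)
print(prompt)
```

Numbering the passages gives the model something concrete to cite and makes the citation easy to check afterwards.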

Context Optimization

  1. Relevant context: Retrieve the most relevant documents
  2. Appropriate length: Balance context richness with model limits
  3. Quality filtering: Remove noisy or irrelevant passages
  4. Deduplication: Remove redundant information
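A rough sketch of these four steps applied to a list of candidate passages. Word overlap is used as a stand-in relevance score and a word count as a stand-in for a token budget; both are simplifying assumptions, as are the function name and default thresholds.

```python
import re

def words(text):
    """Lowercased word tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def select_context(question, passages, min_overlap=1, max_words=50):
    """Deduplicate, filter, rank, and trim passages to fit a word budget."""
    q_words = words(question)
    seen = set()
    scored = []
    for passage in passages:
        key = " ".join(passage.lower().split())
        if key in seen:                      # deduplication
            continue
        seen.add(key)
        overlap = len(q_words & words(passage))
        if overlap >= min_overlap:           # quality filtering
            scored.append((overlap, passage))
    scored.sort(key=lambda item: item[0], reverse=True)  # most relevant first
    selected, used = [], 0
    for _, passage in scored:
        length = len(passage.split())
        if used + length > max_words:        # respect the length budget
            continue
        selected.append(passage)
        used += length
    return selected

passages = [
    "Paris is the capital of France.",
    "paris is the  capital of france.",
    "Bananas are yellow.",
    "France is in Europe.",
]

selected = select_context("What is the capital of France?", passages)
print(selected)
```

The duplicate and the unrelated passage are dropped, and the remaining passages are ordered by their overlap with the question.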

Always validate QA system outputs with domain experts. LLMs can generate plausible-sounding but incorrect answers, especially for complex or domain-specific questions.

Practice Exercises

  1. Evaluation: Compare EM and F1 scores for a QA system on a dataset. When does EM give a misleading picture of performance?

  2. Implementation: Build a simple RAG system using a vector store and LLM. Evaluate the impact of chunk size on answer quality.

  3. Analysis: Analyze failure modes of a QA system on a benchmark dataset. What types of questions does it struggle with?

  4. Research: Investigate the impact of retrieval quality on QA performance. How does the number of retrieved documents affect answer accuracy?

Key Takeaways:

  • QA systems can be extractive (span extraction) or abstractive (generated answers)
  • RAG combines retrieval and generation for open-domain QA
  • Conversational QA requires context maintenance and coreference resolution
  • EM and F1 are standard metrics; consider task-specific evaluation
  • Prompt engineering and context optimization are crucial for performance

What to Learn Next

-> LLM for Information Extraction Named entity extraction, relation extraction, and structured output generation.

-> LLM for Sentiment Analysis Aspect-based sentiment, emotion detection, and opinion mining.

-> LLM for Recommendation Systems Conversational recommenders, preference learning, and cold start solutions.

-> LLM for Content Creation Creative writing, marketing copy, and content generation at scale.

-> LLM Compliance and Governance Regulatory compliance, audit trails, and data governance for LLMs.

-> LLM Testing Strategies Unit testing, integration testing, and regression testing for LLM systems.
