LLM for Summarization — Condensing Information Intelligently
Summarization is one of the most valuable applications of LLMs, enabling automatic condensation of long documents into concise, coherent summaries. This guide covers the theoretical foundations, evaluation methodologies, and practical techniques for building effective summarization systems.
- Abstractive vs Extractive — Two fundamental summarization paradigms
- Evaluation Metrics — ROUGE, BERTScore, and human evaluation
- Long-Document Summarization — Handling documents that exceed model context windows
The art of summarization is knowing what to leave out.
LLM for Summarization
Summarization aims to produce a concise version of a longer text while preserving key information and overall meaning. LLMs have achieved state-of-the-art performance on summarization tasks through large-scale pretraining and instruction tuning.
Definition: Text Summarization
Text summarization is the process of generating a shorter version of a text document that captures the most important information. A good summary is fluent, coherent, and maintains the essential meaning of the source document.
Summarization Paradigms
Extractive Summarization
Definition: Extractive Summarization
Extractive summarization selects a subset of sentences from the source document to form a summary. The model learns to identify important sentences without generating new text.
Extractive approaches have the advantage of preserving original phrasing and factual accuracy, but may produce less coherent summaries.
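As a rough illustration of sentence selection, the sketch below scores each sentence by the average frequency of its words in the document and returns the top sentences in their original order. The function names and the frequency heuristic are assumptions chosen for brevity; a trained extractive model would learn its scoring function, and this sketch does no stopword filtering.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_summary(text, num_sentences=2):
    """Select the highest-scoring sentences and return them in source order."""
    sentences = split_sentences(text)
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average word frequency, so long sentences are not favored automatically.
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])  # restore document order
    return " ".join(sentences[i] for i in chosen)


text = "Cats sleep a lot. Cats eat fish. Dogs bark."
print(extractive_summary(text, num_sentences=1))
```

Because every output sentence is copied verbatim, the summary cannot misstate the source, but adjacent output sentences may not connect smoothly.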
Abstractive Summarization
Definition: Abstractive Summarization
Abstractive summarization generates new text that captures the essential meaning of the source document. The model may use different words and sentence structures than the original.
Abstractive approaches can produce more natural and coherent summaries but may introduce factual errors or hallucinations.
Hybrid Approaches
Hybrid Summarization
A hybrid approach combines extractive and abstractive methods:
- Extract key sentences from the document (extractive)
- Rewrite and condense using an LLM (abstractive)
- Verify factual accuracy against source (verification)
This approach balances factual accuracy with fluency.
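The three steps can be wired together as a small pipeline. In the sketch below, `extract`, `rewrite`, and `verify` are hypothetical callables supplied by the caller (in practice `rewrite` would wrap an LLM call), and falling back to the extracted sentences when verification fails is one possible policy, not the only one.

```python
def hybrid_summarize(document, extract, rewrite, verify):
    """Run extract -> rewrite -> verify; fall back to the extract if the check fails."""
    key_sentences = extract(document)   # step 1: extractive
    draft = rewrite(key_sentences)      # step 2: abstractive
    if verify(draft, document):         # step 3: verification
        return draft
    return " ".join(key_sentences)      # assumed fallback: verbatim sentences


# Stand-in components for demonstration only.
def first_two_sentences(document):
    return [s.strip() + "." for s in document.split(".") if s.strip()][:2]


def join_rewrite(sentences):
    # Placeholder for an LLM rewrite; here it only joins the sentences.
    return " ".join(sentences)


def words_supported(draft, document):
    """Crude check: every word of the draft must occur in the source."""
    source_words = set(document.lower().replace(".", " ").split())
    return all(w in source_words for w in draft.lower().replace(".", " ").split())


doc = "Profits rose. The CEO retired. Shares jumped."
print(hybrid_summarize(doc, first_two_sentences, join_rewrite, words_supported))
```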
Mathematical Formulation
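Summarization as Conditional Generation
$$P(y \mid x) = \prod_{t=1}^{m} P(y_t \mid y_{<t}, x)$$
Here,
- $x = (x_1, \dots, x_n)$ = Source document
- $y = (y_1, \dots, y_m)$ = Generated summary
- $n$ = Source document length
- $m$ = Summary length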
The model learns to generate a summary y that maximizes the conditional probability given the source document x.
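Autoregressive models typically score a summary token by token, so the conditional probability of the whole summary is the product of the per-token probabilities, usually accumulated as a sum of logs for numerical stability. The sketch below assumes the per-token probabilities have already been obtained from a model; the values are made up for illustration.

```python
import math


def sequence_log_prob(token_probs):
    """Sum of log-probabilities of each summary token given the source and prior tokens."""
    return sum(math.log(p) for p in token_probs)


# Hypothetical probabilities assigned to a three-token summary.
probs = [0.5, 0.25, 0.5]
log_p = sequence_log_prob(probs)
print(log_p, math.exp(log_p))  # exp(log_p) equals the product of the probabilities
```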
Evaluation Metrics
ROUGE Scores
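ROUGE-N Score
$$\text{ROUGE-N} = \frac{\sum_{S \in \text{Ref}} \sum_{\text{gram}_n \in S} \text{Count}_{\text{match}}(\text{gram}_n)}{\sum_{S \in \text{Ref}} \sum_{\text{gram}_n \in S} \text{Count}(\text{gram}_n)}$$
Here,
- $n$ = N-gram order (1 for unigrams, 2 for bigrams)
- $\text{Ref}$ = Reference summary
- $\text{Count}_{\text{match}}(\text{gram}_n)$ = Number of n-grams in both candidate and reference
- $\text{Count}(\text{gram}_n)$ = Number of n-grams in the reference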
ROUGE-1 measures unigram overlap (individual words), ROUGE-2 measures bigram overlap (word pairs), and ROUGE-L measures longest common subsequence.
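To make the overlap counts concrete, here is a minimal pure-Python sketch of recall-oriented ROUGE-N and ROUGE-L for a single reference. It assumes whitespace tokenization and lowercasing only; established implementations add stemming, multiple references, and precision/F-measure variants, so scores from this sketch may differ from theirs.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Matched reference n-grams divided by total reference n-grams (recall)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clip each match by how often the n-gram occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate, reference):
    """LCS length divided by reference length."""
    ref = reference.lower().split()
    if not ref:
        return 0.0
    return lcs_length(candidate.lower().split(), ref) / len(ref)


cand = "the cat sat on the mat"
ref = "the cat lay on the mat"
print(rouge_n(cand, ref, 1), rouge_n(cand, ref, 2), rouge_l_recall(cand, ref))
```

In this example five of the six reference unigrams and three of the five reference bigrams are matched, and the longest common subsequence has five tokens.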
BERTScore
Definition: BERTScore
BERTScore computes semantic similarity between candidate and reference summaries using contextual embeddings from pre-trained language models. It captures semantic meaning beyond surface-level word matching.
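BERTScore Calculation
$$R_{\text{BERT}} = \frac{1}{|r|} \sum_{r_j \in r} \max_{\hat{y}_i \in \hat{y}} \cos(\hat{y}_i, r_j), \qquad P_{\text{BERT}} = \frac{1}{|\hat{y}|} \sum_{\hat{y}_i \in \hat{y}} \max_{r_j \in r} \cos(\hat{y}_i, r_j), \qquad F_{\text{BERT}} = \frac{2\, P_{\text{BERT}}\, R_{\text{BERT}}}{P_{\text{BERT}} + R_{\text{BERT}}}$$
Here,
- $\hat{y}_i$ = Candidate summary tokens
- $r_j$ = Reference summary tokens
- $\cos(\cdot, \cdot)$ = Cosine similarity between embeddings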
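The greedy matching step can be shown without a language model. The sketch below takes token embeddings as plain lists of floats and computes precision, recall, and F1 from maximum cosine similarities. The toy two-dimensional vectors are placeholders; an actual BERTScore computation would obtain contextual embeddings from a pre-trained model and may apply IDF weighting and baseline rescaling.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_scores(cand_vecs, ref_vecs):
    """Return (precision, recall, f1) using greedy max-cosine matching."""
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy embeddings: the candidate has one token matching the reference and one unrelated token.
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0]]
print(greedy_match_scores(candidate, reference))
```

Here recall is perfect because the single reference token has an exact match, while precision is halved by the unrelated candidate token.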
Evaluation Comparison
| Metric | Measures | Strengths | Weaknesses |
|---|---|---|---|
| ROUGE-1 | Word overlap | Fast, interpretable | Misses semantics |
| ROUGE-2 | Phrase overlap | Captures fluency | Limited context |
| ROUGE-L | Sentence structure | Flexible matching | Limited semantics |
| BERTScore | Semantic similarity | Captures meaning | Computationally expensive |
| Human Evaluation | Overall quality | Gold standard | Expensive, subjective |
For production systems, use a combination of automatic metrics (ROUGE, BERTScore) and human evaluation. Automatic metrics are useful for development and monitoring, but human evaluation is essential for final quality assessment.
Long-Document Summarization
Many real-world documents exceed the context window of LLMs (typically 4K-128K tokens). Several strategies address this challenge.
Chunking Strategies
Definition: Document Chunking
Document chunking involves splitting a long document into smaller segments, summarizing each segment, and then combining the results. The challenge is maintaining coherence across chunks.
- Fixed-size chunking: Split document into equal-sized segments
- Semantic chunking: Split at paragraph or section boundaries
- Sliding window: Overlapping chunks to capture context
- Hierarchical summarization: Summarize sections, then summarize section summaries
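As one example of these strategies, semantic chunking at paragraph boundaries might look like the sketch below. It assumes paragraphs are separated by blank lines and measures size in whitespace-separated words rather than model tokens; `semantic_chunks` and `max_words` are illustrative names.

```python
def semantic_chunks(text, max_words=500):
    """Group whole paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words is kept intact as its own chunk.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


text = "a b c\n\nd e\n\nf g h i"
print(semantic_chunks(text, max_words=5))
```

Compared with fixed-size chunking, this never cuts a paragraph in half, at the cost of uneven chunk sizes.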
Hierarchical Summarization
$$S = g\big(f(c_1), f(c_2), \dots, f(c_k)\big)$$
Here,
- $c_i$ = Document chunk i
- $f$ = Chunk summarization function
- $g$ = Aggregation function
- $k$ = Number of chunks
The hierarchical approach first summarizes each chunk independently, then aggregates the chunk summaries into a final summary.
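When the chunk summaries are themselves too long to aggregate in one call, the aggregation can be applied repeatedly. The sketch below assumes a caller-supplied `summarize` callable (for instance a wrapper around an LLM) and uses it both for the chunk summarization and for the aggregation; `group_size` is an illustrative parameter.

```python
def hierarchical_summary(chunks, summarize, group_size=4):
    """Summarize each chunk, then summarize groups of summaries until one remains."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so each level shrinks")
    if not chunks:
        return ""
    level = [summarize(c) for c in chunks]  # chunk-level summaries
    while len(level) > 1:
        groups = [level[i:i + group_size] for i in range(0, len(level), group_size)]
        level = [summarize("\n".join(g)) for g in groups]  # aggregation step
    return level[0]


# Stand-in summarizer for demonstration: keep the first three words.
def first_words(text, n=3):
    return " ".join(text.split()[:n])


chunks = ["alpha beta gamma delta", "one two three four", "red green blue cyan"]
print(hierarchical_summary(chunks, first_words, group_size=2))
```

Each extra level costs additional model calls and can lose detail, so a larger `group_size` is preferable whenever the grouped summaries still fit in the context window.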
Map-Reduce Approach
Map-Reduce Summarization
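For a 10,000-word document with 500-word chunks:
- Map phase: Summarize each of the 20 chunks into a summary of about 50 words
- Reduce phase: Combine the 20 chunk summaries into a final summary (~200 words)
Map-reduce does not reduce the total amount of text processed: the map phase still reads 20 × 500 = 10,000 words. What changes is the size of each call. Every map call sees only 500 words, and the reduce call sees only 20 × 50 = 1,000 words, so no single call needs the whole document in its context window.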
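The arithmetic for the example above can be checked with a small helper. The function name and the returned keys are illustrative, and sizes are counted in words rather than model tokens.

```python
import math


def map_reduce_budget(doc_words, chunk_words, summary_words):
    """Estimate input sizes for the map and reduce phases, in words."""
    num_chunks = math.ceil(doc_words / chunk_words)
    return {
        "num_chunks": num_chunks,
        "map_input_words": doc_words,                       # total read across all map calls
        "largest_map_call": min(chunk_words, doc_words),    # what a single map call sees
        "reduce_input_words": num_chunks * summary_words,   # what the reduce call sees
    }


print(map_reduce_budget(doc_words=10_000, chunk_words=500, summary_words=50))
```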
Practical Implementation
Basic Summarization with LLMs
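```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Gated checkpoint on the Hugging Face Hub; any instruction-tuned causal LM can be substituted.
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# device_map="auto" requires the accelerate package.
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

document = """Your long document text here..."""

prompt = f"""Please provide a concise summary of the following document:
{document}
Summary:"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=300,
    temperature=0.3,  # low temperature keeps the summary close to the source
    do_sample=True,
)

# generate() returns prompt + continuation; slice off the prompt tokens before decoding.
prompt_length = inputs["input_ids"].shape[-1]
summary = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print(summary)
```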
Long-Document Summarization with Chunking
```python
def chunk_document(text, chunk_size=2000, overlap=200):
    """Split text into overlapping chunks of up to chunk_size words."""
    words = text.split()
    step = chunk_size - overlap  # overlap must be smaller than chunk_size
    chunks = []
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + chunk_size]))
    return chunks


def generate_text(prompt, model, tokenizer, max_new_tokens):
    """Run the model on a prompt and return only the newly generated text."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Slice off the prompt tokens so only the continuation is decoded.
    prompt_length = inputs["input_ids"].shape[-1]
    return tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)


def hierarchical_summarize(document, model, tokenizer):
    """Summarize each chunk, then combine the chunk summaries into one summary."""
    chunks = chunk_document(document, chunk_size=2000, overlap=200)

    # Map step: summarize every chunk independently.
    chunk_summaries = [
        generate_text(f"Summarize: {chunk}\nSummary:", model, tokenizer, max_new_tokens=150)
        for chunk in chunks
    ]

    # Reduce step: merge the chunk summaries into a final summary.
    combined = " ".join(chunk_summaries)
    final_prompt = f"Combine these summaries into a coherent summary:\n{combined}\nFinal Summary:"
    return generate_text(final_prompt, model, tokenizer, max_new_tokens=300)
```
When summarizing long documents, consider the trade-off between detail and brevity. For technical documents, preserve key findings and methodology. For narrative documents, focus on main plot points and character development.
Summarization Challenges
Hallucination
Definition: Summarization Hallucination
Summarization hallucination occurs when the model generates information not present in the source document. This is a critical issue for factual summarization tasks.
Mitigation strategies:
- Factual verification: Cross-reference generated claims with source
- Constrained decoding: Limit generation to phrases from the source
- Citation generation: Require citations for generated claims
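A very lightweight form of factual verification is to check that every number in the summary also appears in the source. The sketch below is a narrow heuristic under that assumption: it compares numeric strings literally, so it will flag reformatted figures such as "1,000" versus "1000" and will miss non-numeric errors. `unsupported_numbers` is an illustrative name.

```python
import re

NUMBER = r"\d+(?:[.,]\d+)*%?"  # integers, decimals, thousands separators, optional %


def unsupported_numbers(summary, source):
    """Return numbers that appear in the summary but nowhere in the source."""
    source_numbers = set(re.findall(NUMBER, source))
    return [n for n in re.findall(NUMBER, summary) if n not in source_numbers]


source = "Revenue grew 12% to $4.5 million in 2023."
summary = "Revenue grew 15% to $4.5 million in 2023."
print(unsupported_numbers(summary, source))
```

A non-empty result is a signal to regenerate the summary or route it to a reviewer; an empty result does not prove the summary is faithful.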
Coherence and Flow
Ensuring that the summary reads naturally and maintains logical flow is challenging, especially for long documents.
Coherence Issues
Source: "The company reported record profits. However, the CEO announced plans to retire. The stock price rose following the announcement."
Poor summary: "Record profits were reported. The CEO will retire. Stock rose."
Better summary: "Following record profits and the CEO's retirement announcement, the company's stock price increased."
Faithfulness
Definition: Summary Faithfulness
Faithfulness measures whether the summary accurately represents the content of the source document. Unfaithful summaries may distort facts, omit important information, or add unsupported claims.
Best Practices
Prompt Engineering for Summarization
- Specify length: "Summarize in 3-5 sentences"
- Define focus: "Focus on the methodology and key findings"
- Set style: "Write a formal, objective summary"
- Provide examples: Include example summaries for style reference
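These four levers can be collected in a small prompt builder. The sketch below is one possible template; the function name, parameter names, and wording of the instructions are assumptions and should be tuned for the model in use.

```python
def build_summary_prompt(document, length="3-5 sentences", focus=None, style=None, examples=()):
    """Assemble a summarization prompt from length, focus, style, and example summaries."""
    parts = [f"Summarize the following document in {length}."]
    if focus:
        parts.append(f"Focus on {focus}.")
    if style:
        parts.append(f"Write in a {style} style.")
    for i, example in enumerate(examples, start=1):
        parts.append(f"Example summary {i}:\n{example}")
    parts.append(f"Document:\n{document}")
    parts.append("Summary:")
    return "\n\n".join(parts)


prompt = build_summary_prompt(
    "Your document text here...",
    focus="the methodology and key findings",
    style="formal, objective",
)
print(prompt)
```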
Quality Control
- Multiple passes: Generate multiple summaries and select the best
- Fact-checking: Verify key claims against source document
- Readability testing: Ensure the summary is clear and concise
- User testing: Gather feedback from target audience
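The "multiple passes" idea needs a selection rule. The sketch below assumes the candidate summaries have already been generated and picks the one whose words are best supported by the source, subject to a length limit. The support ratio is a crude stand-in for a proper quality score; a metric such as BERTScore or a human rating could be substituted.

```python
import re


def support_ratio(summary, source):
    """Fraction of summary words that also occur in the source."""
    source_words = set(re.findall(r"\w+", source.lower()))
    tokens = re.findall(r"\w+", summary.lower())
    return sum(t in source_words for t in tokens) / len(tokens) if tokens else 0.0


def select_best(candidates, source, max_words=60):
    """Return the best-supported candidate within the length limit, or None."""
    eligible = [c for c in candidates if 0 < len(c.split()) <= max_words]
    if not eligible:
        return None
    return max(eligible, key=lambda c: support_ratio(c, source))


source = "The company reported record profits. The CEO announced plans to retire."
candidates = [
    "The company reported record losses.",
    "The company reported record profits.",
]
print(select_best(candidates, source))
```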
Always validate summarization quality with domain experts for high-stakes applications like medical, legal, or financial summarization. Automated metrics may not capture domain-specific accuracy requirements.