LLM for Information Extraction — Structuring Unstructured Data
Information extraction transforms unstructured text into structured data. LLMs excel at this task by understanding context, recognizing entities, and extracting relationships with minimal task-specific training.
- Named Entity Recognition — Identifying and classifying entities in text
- Relation Extraction — Discovering relationships between entities
- Structured Output — Generating structured data from natural language
Data is the new oil; information extraction is the refinery.
LLM for Information Extraction
Information extraction (IE) is the task of extracting structured information from unstructured text. LLMs have transformed IE by enabling zero-shot and few-shot extraction, cross-domain generalization, and complex schema understanding.
Definition: Information Extraction
Information extraction is the task of automatically extracting structured information from unstructured text. This includes identifying entities, extracting relationships between entities, and filling predefined templates.
Information Extraction Tasks
Named Entity Recognition (NER)
Definition: Named Entity Recognition
Named Entity Recognition (NER) identifies and classifies named entities in text into predefined categories such as person names, organizations, locations, dates, and quantities.
Standard entity types:
- PER: Person names
- ORG: Organizations
- LOC: Locations
- DATE: Dates and times
- MISC: Miscellaneous entities
Relation Extraction
Definition: Relation Extraction
Relation extraction identifies semantic relationships between entities in text. For example, in "Apple was founded by Steve Jobs," the relation is "founded_by" between "Apple" and "Steve Jobs."
Event Extraction
Definition: Event Extraction
Event extraction identifies events in text, including the event trigger, event type, and participating arguments (who, what, when, where).
Template Filling
Definition: Template Filling
Template filling extracts information to fill a predefined template with slots for specific types of information. For example, a job posting template might have slots for company, position, salary, and location.
Mathematical Formulation
Sequence Labeling for NER
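NER as Sequence Labeling
$$\hat{y}_{1:n} = \arg\max_{y_1, \dots, y_n} P(y_1, \dots, y_n \mid x_1, \dots, x_n)$$
Here,
- $x_i$ = Input token at position $i$
- $y_i$ = Label at position $i$ (e.g., B-PER, I-PER, O)
- $n$ = Sequence length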
NER uses BIO tagging: B- (beginning), I- (inside), O (outside) for entity boundaries.
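To make the BIO scheme concrete, the following minimal sketch decodes a tagged token sequence back into entities. The function name `bio_to_spans` and the example sentence tags are illustrative assumptions, not part of any library.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Convert parallel token / BIO-tag lists into (entity_text, entity_type) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity; flush any open one first.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # An I- tag of the same type extends the open entity.
            current_tokens.append(token)
        else:
            # An O tag (or an inconsistent I- tag) closes any open entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


tokens = ["Barack", "Obama", "was", "born", "in", "Hawaii"]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC"]
print(bio_to_spans(tokens, tags))  # [('Barack Obama', 'PER'), ('Hawaii', 'LOC')]
```

This sketch silently drops an I- tag that does not continue an open entity of the same type; other decoders choose to treat such a tag as the start of a new entity instead.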
Relation Extraction
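Relation Classification
$$\hat{r} = \arg\max_{r} P(r \mid e_1, e_2, \mathbf{h}; \theta)$$
Here,
- $r$ = Relation type
- $(e_1, e_2)$ = Entity pair
- $\mathbf{h}$ = Contextual representation
- $\theta$ = Classification parameters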
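As an illustration of relation classification, the sketch below scores each relation type against a contextual representation of the entity pair and normalizes the scores with a softmax. A linear scoring layer is one common choice and is an assumption here; the vector `h`, the weights in `theta`, and the relation labels are made-up toy values.

```python
import math
from typing import Dict, List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_relation(h: List[float], theta: Dict[str, List[float]]) -> Dict[str, float]:
    """Score each relation type with a dot product against h, then normalize."""
    labels = list(theta)
    scores = [sum(w * x for w, x in zip(theta[label], h)) for label in labels]
    return dict(zip(labels, softmax(scores)))


# Hypothetical 3-dimensional representation of the pair ("Apple", "Steve Jobs").
h = [1.0, 0.5, -0.5]
# Hypothetical parameters: one weight vector per relation type.
theta = {
    "founded_by": [2.0, 1.0, 0.0],
    "located_in": [0.0, 0.0, 1.0],
    "no_relation": [0.0, 0.0, 0.0],
}

probs = classify_relation(h, theta)
best = max(probs, key=probs.get)
print(best)  # founded_by
```

In a real system the representation would come from an encoder and the weights would be learned; only the final scoring-and-argmax step is shown.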
LLM Approaches to IE
Zero-Shot Extraction
LLMs can extract information without task-specific training by using prompt engineering.
Zero-Shot NER with LLMs
Prompt: "Extract all person names from the following text: 'Barack Obama was born in Hawaii and served as the 44th President of the United States.'"
LLM Response: "Barack Obama"
The LLM recognizes "Barack Obama" as a person name based on its pretraining knowledge.
Few-Shot Extraction
Providing a few labeled examples in the prompt usually improves extraction quality, mainly by showing the model the expected entity types and output format.
Few-Shot NER
Prompt: "Extract entities from text.
Text: 'Apple Inc. was founded by Steve Jobs in California.'
Entities: Apple Inc. (ORG), Steve Jobs (PER), California (LOC)
Text: 'Microsoft announced a partnership with OpenAI in San Francisco.'
Entities:"
LLM Response: "Microsoft (ORG), OpenAI (ORG), San Francisco (LOC)"
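A prompt like the one above can be assembled programmatically from a list of labeled demonstrations. The following is a minimal sketch; `build_few_shot_prompt` and `format_entities` are hypothetical helper names, and the layout simply mirrors the example prompt.

```python
from typing import List, Tuple

# A demonstration is (text, [(entity, type), ...]).
Demo = Tuple[str, List[Tuple[str, str]]]


def format_entities(entities: List[Tuple[str, str]]) -> str:
    """Render entities as 'Name (TYPE), Name (TYPE), ...'."""
    return ", ".join(f"{name} ({etype})" for name, etype in entities)


def build_few_shot_prompt(demos: List[Demo], target_text: str) -> str:
    """Join an instruction, the solved demonstrations, and the unsolved target."""
    parts = ["Extract entities from text."]
    for text, entities in demos:
        parts.append(f"Text: '{text}'\nEntities: {format_entities(entities)}")
    # The final block ends with an open 'Entities:' for the model to complete.
    parts.append(f"Text: '{target_text}'\nEntities:")
    return "\n\n".join(parts)


demos = [
    (
        "Apple Inc. was founded by Steve Jobs in California.",
        [("Apple Inc.", "ORG"), ("Steve Jobs", "PER"), ("California", "LOC")],
    ),
]
prompt = build_few_shot_prompt(
    demos, "Microsoft announced a partnership with OpenAI in San Francisco."
)
print(prompt)
```

Keeping the demonstrations as data rather than as a hard-coded string makes it easy to swap them per input or per domain.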
Structured Output Generation
LLMs can generate structured output formats like JSON, YAML, or XML.
JSON Extraction
Prompt: "Extract information from the job posting and output as JSON:
Text: 'Senior ML Engineer at Google in Mountain View, CA. Salary: $180,000-$220,000.'"
LLM Output:
{
"position": "Senior ML Engineer",
"company": "Google",
"location": "Mountain View, CA",
"salary_range": {"min": 180000, "max": 220000}
}
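In practice a model may surround the JSON with extra text, so the reply often cannot be passed straight to a JSON parser. The sketch below scans the reply for the first decodable JSON object using the standard library's `json.JSONDecoder.raw_decode`; the function name `parse_first_json_object` and the sample reply are illustrative assumptions.

```python
import json
from typing import Any, Optional


def parse_first_json_object(reply: str) -> Optional[Any]:
    """Return the first decodable JSON object found in reply, or None."""
    decoder = json.JSONDecoder()
    start = reply.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at the given index
            # and ignores whatever text follows it.
            obj, _end = decoder.raw_decode(reply, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = reply.find("{", start + 1)
    return None


reply = (
    "Here is the extracted information:\n"
    '{"position": "Senior ML Engineer", "company": "Google", '
    '"location": "Mountain View, CA", '
    '"salary_range": {"min": 180000, "max": 220000}}\n'
    "Let me know if you need anything else."
)
record = parse_first_json_object(reply)
print(record["salary_range"]["max"])  # 220000
```

Returning `None` instead of raising leaves the retry or fallback policy to the caller.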
Evaluation Metrics
For NER
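NER Evaluation (Entity-Level F1)
$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
Here,
- Precision = Correctly extracted entities / Total extracted entities
- Recall = Correctly extracted entities / Total ground truth entities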
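Entity-level precision, recall, and F1 can be computed directly from the predicted and gold entity sets. This is a minimal sketch that treats an entity as a `(text, type)` pair and counts it as correct only on an exact match of both; it ignores positions and duplicate mentions, which a full evaluator would track. The gold and predicted sets are made-up examples.

```python
from typing import Set, Tuple

Entity = Tuple[str, str]  # (surface text, entity type)


def entity_prf(predicted: Set[Entity], gold: Set[Entity]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact match of text and type."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = {("Apple Inc.", "ORG"), ("Tim Cook", "PER"), ("Dublin", "LOC"), ("Ireland", "LOC")}
# "Dublin" is found but typed as ORG, so it does not count as correct.
predicted = {("Apple Inc.", "ORG"), ("Tim Cook", "PER"), ("Dublin", "ORG")}

p, r, f1 = entity_prf(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# precision=0.667 recall=0.500 f1=0.571
```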
For Relation Extraction
| Metric | Description | Use Case |
|---|---|---|
| Precision | Correct relations / Extracted relations | High-precision applications |
| Recall | Correct relations / Total relations | Comprehensive extraction |
| F1 | Harmonic mean of precision and recall | Balanced evaluation |
| AUC-PR | Area under precision-recall curve | Imbalanced datasets |
For IE tasks, entity-level evaluation (exact match) is stricter than token-level evaluation. Consider the task requirements when choosing evaluation metrics.
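A small worked case shows why exact-match scoring is stricter. In the sketch below the prediction truncates "New York City" to "New York": most token tags still agree, yet no entity matches exactly. The helper `bio_spans` and the tag sequences are illustrative assumptions.

```python
from typing import List, Set, Tuple


def bio_spans(tags: List[str]) -> Set[Tuple[int, int, str]]:
    """Collect (start, end_exclusive, type) spans from a BIO tag sequence."""
    found: Set[Tuple[int, int, str]] = set()
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing O flushes the last span
        if tag.startswith("I-") and start is not None and tag[2:] == etype:
            continue  # still inside the open entity
        if start is not None:
            found.add((start, i, etype))  # close the open entity
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return found


# Tokens: ["New", "York", "City", "grew"]
gold_tags = ["B-LOC", "I-LOC", "I-LOC", "O"]
pred_tags = ["B-LOC", "I-LOC", "O", "O"]  # boundary error: "City" is missed

token_accuracy = sum(g == p for g, p in zip(gold_tags, pred_tags)) / len(gold_tags)
exact_matches = len(bio_spans(gold_tags) & bio_spans(pred_tags))

print(token_accuracy)  # 0.75
print(exact_matches)   # 0
```

Whether such a near miss should count as an error depends on the downstream use, which is why the choice of metric should follow the task requirements.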
Practical Implementation
NER with LLMs
from transformers import AutoTokenizer, AutoModelForCausalLM
import json

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

text = """Apple Inc. announced today that CEO Tim Cook will visit
the European headquarters in Dublin, Ireland next week."""

prompt = f"""Extract all named entities from the following text and
return as JSON with entity type:
Text: {text}
Output:"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
# Decode only the newly generated tokens, skipping the prompt tokens.
prompt_length = inputs["input_ids"].shape[-1]
result = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)

# The model may wrap the JSON in extra text, so parsing can fail.
try:
    print(json.loads(result))
except json.JSONDecodeError:
    print("Could not parse model output as JSON:", result)
Relation Extraction
def extract_relations(text, entity1, entity2, model, tokenizer):
    """Ask the model to name the relationship between two entities in text."""
    prompt = f"""What is the relationship between {entity1} and {entity2}
in the following text?
Text: {text}
Relationship:"""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=50)
    # Slice off the prompt tokens so only the generated answer is decoded.
    prompt_length = inputs["input_ids"].shape[-1]
    answer = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    return answer.strip()

# Example
text = "Elon Musk founded SpaceX in 2002."
relation = extract_relations(text, "Elon Musk", "SpaceX", model, tokenizer)
# Possible output: "founded" or "founder_of" (free-form text; exact wording varies)
Structured Extraction with Pydantic
from typing import Optional

from pydantic import BaseModel

class CompanyInfo(BaseModel):
    name: str
    # Optional fields default to None so that missing keys pass validation.
    founded_year: Optional[int] = None
    headquarters: Optional[str] = None
    ceo: Optional[str] = None
    industry: Optional[str] = None

def extract_company_info(text: str, model, tokenizer) -> CompanyInfo:
    prompt = f"""Extract company information from the text and return as JSON:
Text: {text}
CompanyInfo:"""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=200)
    # Decode only the newly generated tokens, skipping the prompt tokens.
    prompt_length = inputs["input_ids"].shape[-1]
    result = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    # Pydantic v2 API; on Pydantic v1 use CompanyInfo.parse_raw(result) instead.
    # Raises pydantic.ValidationError if the output does not match the model.
    return CompanyInfo.model_validate_json(result)
For structured output generation, provide explicit JSON schemas in the prompt and validate the output against the schema. Use Pydantic models for type validation.
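The two steps in this advice, putting the schema in the prompt and checking the output against it, can be sketched with the standard library alone. This is a deliberately small stand-in for Pydantic or a full JSON Schema validator: it only checks required fields and a few primitive types. `SCHEMA`, `build_prompt`, and `validation_errors` are hypothetical names.

```python
import json
from typing import Any, Dict, List

# Hypothetical schema, written in JSON Schema style.
SCHEMA: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "founded_year": {"type": ["integer", "null"]},
        "headquarters": {"type": ["string", "null"]},
    },
    "required": ["name"],
}

_PY_TYPES = {"string": str, "integer": int, "null": type(None)}


def build_prompt(text: str) -> str:
    """Embed the schema in the prompt so the model sees the exact target shape."""
    return (
        "Extract company information from the text. "
        "Return only JSON that conforms to this JSON Schema:\n"
        f"{json.dumps(SCHEMA, indent=2)}\n\nText: {text}\nJSON:"
    )


def validation_errors(record: Dict[str, Any]) -> List[str]:
    """Check required fields and primitive types; an empty list means valid."""
    errors = [f"missing required field: {f}" for f in SCHEMA["required"] if f not in record]
    for field, value in record.items():
        spec = SCHEMA["properties"].get(field)
        if spec is None:
            errors.append(f"unexpected field: {field}")
            continue
        allowed = spec["type"] if isinstance(spec["type"], list) else [spec["type"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        ok = not isinstance(value, bool) and any(
            isinstance(value, _PY_TYPES[t]) for t in allowed
        )
        if not ok:
            errors.append(f"wrong type for {field}: {type(value).__name__}")
    return errors


print(validation_errors({"name": "SpaceX", "founded_year": 2002, "headquarters": None}))  # []
print(validation_errors({"founded_year": "2002"}))
```

Returning a list of error messages, rather than raising on the first problem, makes it straightforward to feed all of them back to the model in a retry prompt.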
Advanced Techniques
Nested NER
Some entities contain other entities:
Nested NER
"University of California, Berkeley" contains:
- "University of California, Berkeley" (ORG)
- "California" (LOC)
- "Berkeley" (LOC)
Nested NER requires models that can handle overlapping entity spans.
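Flat BIO tagging assigns one label per token, so it cannot express an entity inside another entity. A span-based representation can. The sketch below stores each entity as a character span and lists which spans are nested in which; the helper names and the example sentence are illustrative assumptions.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char_exclusive, entity type)


def find_span(text: str, mention: str, etype: str) -> Span:
    """Locate the first occurrence of mention in text as a character span."""
    start = text.index(mention)
    return (start, start + len(mention), etype)


def nested_pairs(spans: List[Span]) -> List[Tuple[Span, Span]]:
    """Return (outer, inner) pairs where inner lies inside a longer outer span."""
    pairs = []
    for outer in spans:
        for inner in spans:
            inside = outer[0] <= inner[0] and inner[1] <= outer[1]
            shorter = (inner[1] - inner[0]) < (outer[1] - outer[0])
            if inside and shorter:
                pairs.append((outer, inner))
    return pairs


text = "She studied at University of California, Berkeley."
spans = [
    find_span(text, "University of California, Berkeley", "ORG"),
    find_span(text, "California", "LOC"),
    find_span(text, "Berkeley", "LOC"),
]

for outer, inner in nested_pairs(spans):
    print(f"{text[inner[0]:inner[1]]} ({inner[2]}) is nested in "
          f"{text[outer[0]:outer[1]]} ({outer[2]})")
```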
Few-Shot Demonstration Selection
Selecting the right demonstrations significantly impacts performance:
- Similar examples: Choose demonstrations similar to the target text
- Diverse examples: Cover different entity types and patterns
- Balanced examples: Ensure equal representation of entity types
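The first criterion, choosing demonstrations similar to the target text, can be sketched with a simple lexical similarity. Jaccard overlap on word sets is used here as a lightweight stand-in for whatever similarity measure a real system uses; the function names and the demonstration pool are illustrative assumptions.

```python
import re
from typing import List, Set


def token_set(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_demonstrations(pool: List[str], target: str, k: int) -> List[str]:
    """Return the k pool texts with the highest token overlap with target."""
    target_tokens = token_set(target)
    ranked = sorted(
        pool,
        key=lambda demo: jaccard(token_set(demo), target_tokens),
        reverse=True,
    )
    return ranked[:k]


pool = [
    "Apple Inc. was founded by Steve Jobs in California.",
    "The patient was prescribed 20 mg of atorvastatin daily.",
    "Microsoft announced a partnership with OpenAI in San Francisco.",
]
target = "Google announced a partnership with Anthropic in London."
print(select_demonstrations(pool, target, k=1))
# ['Microsoft announced a partnership with OpenAI in San Francisco.']
```

Selecting purely by similarity can crowd out the diversity and balance criteria, so a fuller selector would also cap how many demonstrations share the same entity types.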
Schema-Guided Extraction
Definition: Schema-Guided Extraction
Schema-guided extraction uses a predefined schema to guide extraction. The schema defines entity types, relation types, and constraints, ensuring consistent and complete extraction.
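Constrained Decoding
$$p_t = \mathrm{softmax}\left(\frac{z_t}{T} + m_t\right)$$
Here,
- $p_t$ = Next-token distribution at position $t$
- $z_t$ = Logits at position $t$
- $T$ = Temperature
- $m_t$ = Mask based on schema constraints ($0$ for tokens the schema allows, $-\infty$ for tokens it forbids)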
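The effect of a schema mask on one decoding step can be sketched without any model: scale the logits by the temperature, discard every token the schema forbids (equivalent to adding negative infinity to its logit), and renormalize. The vocabulary, logits, and allowed set below are made-up toy values, and `constrained_distribution` is a hypothetical helper name.

```python
import math
from typing import Dict, Set


def constrained_distribution(
    logits: Dict[str, float], allowed: Set[str], temperature: float = 1.0
) -> Dict[str, float]:
    """Softmax over temperature-scaled logits, restricted to allowed tokens."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Keep only tokens the schema permits; forbidden tokens get probability 0.
    scaled = {tok: z / temperature for tok, z in logits.items() if tok in allowed}
    if not scaled:
        raise ValueError("schema mask removed every candidate token")
    peak = max(scaled.values())
    exps = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(exps.values())
    probs = {tok: 0.0 for tok in logits}
    probs.update({tok: e / total for tok, e in exps.items()})
    return probs


# The unconstrained model prefers "banana", which is not a valid entity type.
logits = {"PER": 1.0, "ORG": 2.0, "LOC": 0.5, "banana": 3.0}
allowed = {"PER", "ORG", "LOC"}

probs = constrained_distribution(logits, allowed, temperature=0.7)
print(max(probs, key=probs.get))  # ORG
print(probs["banana"])            # 0.0
```

In a real decoder the allowed set changes at every position, typically driven by a grammar or JSON schema that tracks what can legally come next.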
Challenges and Solutions
Ambiguity in Entity Boundaries
Entity Boundary Ambiguity
- "New York City" vs. "New York"
- "Dr. John Smith" vs. "John Smith"
- "IBM Corporation" vs. "IBM"
Boundary decisions depend on the task requirements and annotation guidelines.
Cross-Domain Generalization
LLMs often generalize better across domains than fine-tuned models, but may still struggle with domain-specific terminology.
Hallucination in Extraction
LLMs may extract information not present in the text or misclassify entities.
Always validate extracted information against the source text. Implement confidence scoring and allow the model to abstain when uncertain.
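One inexpensive validation step is to check that every extracted string actually occurs in the source text. The sketch below does a case-insensitive, whitespace-normalized substring check; `split_grounded` is a hypothetical helper, and the "extracted" list is a made-up model output containing one hallucinated entity.

```python
from typing import List, Tuple

Entity = Tuple[str, str]  # (surface text, entity type)


def split_grounded(text: str, entities: List[Entity]) -> Tuple[List[Entity], List[Entity]]:
    """Separate entities found verbatim in text from those that are not."""
    haystack = " ".join(text.split()).lower()
    grounded: List[Entity] = []
    ungrounded: List[Entity] = []
    for name, etype in entities:
        needle = " ".join(name.split()).lower()
        if needle and needle in haystack:
            grounded.append((name, etype))
        else:
            ungrounded.append((name, etype))
    return grounded, ungrounded


text = "Elon Musk founded SpaceX in 2002."
extracted = [("Elon Musk", "PER"), ("SpaceX", "ORG"), ("Tesla", "ORG")]

grounded, ungrounded = split_grounded(text, extracted)
print(grounded)    # [('Elon Musk', 'PER'), ('SpaceX', 'ORG')]
print(ungrounded)  # [('Tesla', 'ORG')]
```

This check catches invented entities but not misclassified ones, and it will wrongly flag values the model has normalized (for example a reformatted date), so it complements rather than replaces confidence scoring and human review.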
Best Practices
Prompt Design
- Clear entity definitions: Define each entity type clearly
- Output format specification: Specify exact output format
- Boundary guidelines: Clarify how to handle ambiguous boundaries
- Error handling: Specify behavior for missing or unclear information
Quality Assurance
- Human review: Sample and review extractions
- Consistency checks: Validate extractions against schema
- Confidence scoring: Track extraction confidence
- Active learning: Use uncertain cases to improve the system
For production IE systems, consider a hybrid approach: use LLMs for initial extraction and rule-based systems for validation and post-processing.
Practice Exercises
- NER Evaluation: Compare zero-shot and few-shot NER performance on a benchmark dataset. How many examples are needed for competitive performance?
- Relation Extraction: Build a relation extraction system for a specific domain (e.g., biomedical, financial). How does domain specificity affect performance?
- Schema Design: Design a schema for extracting information from news articles. What entity and relation types are needed?
- Error Analysis: Analyze common extraction errors in your system. What patterns emerge?
Key Takeaways:
- LLMs enable zero-shot and few-shot information extraction
- Structured output generation (JSON, YAML) enables integration with downstream systems
- Entity-level evaluation is stricter than token-level evaluation
- Schema-guided extraction ensures consistency and completeness
- Always validate extracted information against source text
What to Learn Next
-> LLM for Sentiment Analysis: Aspect-based sentiment, emotion detection, and opinion mining.
-> LLM for Recommendation Systems: Conversational recommenders, preference learning, and cold start solutions.
-> LLM for Content Creation: Creative writing, marketing copy, and content generation at scale.
-> LLM Compliance and Governance: Regulatory compliance, audit trails, and data governance for LLMs.
-> LLM Testing Strategies: Unit testing, integration testing, and regression testing for LLM systems.
-> LLM Capstone Project: End-to-end LLM application project with design decisions and deployment.