LLM for Information Extraction — Structuring Unstructured Data

Information extraction transforms unstructured text into structured data. LLMs excel at this task by understanding context, recognizing entities, and extracting relationships with minimal task-specific training.

Named Entity Recognition — Identifying and classifying entities in text
Relation Extraction — Discovering relationships between entities
Structured Output — Generating structured data from natural language

Data is the new oil; information extraction is the refinery.

LLM for Information Extraction

Information extraction (IE) is the task of extracting structured information from unstructured text. LLMs have transformed IE by enabling zero-shot and few-shot extraction, cross-domain generalization, and complex schema understanding.

Definition: Information Extraction

Information extraction is the task of automatically extracting structured information from unstructured text. This includes identifying entities, extracting relationships between entities, and filling predefined templates.

Information Extraction Tasks

Named Entity Recognition (NER)

Definition: Named Entity Recognition

Named Entity Recognition (NER) identifies and classifies named entities in text into predefined categories such as person names, organizations, locations, dates, and quantities.

Standard entity types:

PER: Person names
ORG: Organizations
LOC: Locations
DATE: Dates and times
MISC: Miscellaneous entities

Relation Extraction

Definition: Relation Extraction

Relation extraction identifies semantic relationships between entities in text. For example, in "Apple was founded by Steve Jobs," the relation is "founded_by" between "Apple" and "Steve Jobs."

Event Extraction

Definition: Event Extraction

Event extraction identifies events in text, including the event trigger, event type, and participating arguments (who, what, when, where).

Template Filling

Definition: Template Filling

Template filling extracts information to fill a predefined template with slots for specific types of information. For example, a job posting template might have slots for company, position, salary, and location.

Mathematical Formulation

Sequence Labeling for NER

NER as Sequence Labeling

$$P(y_1, \ldots, y_n \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} P(y_i \mid y_{i-1}, x_1, \ldots, x_n)$$

Here,

$x_i$ = Input token at position $i$
$y_i$ = Label at position $i$ (e.g., B-PER, I-PER, O)
$n$ = Sequence length

NER uses BIO tagging: B- (beginning), I- (inside), O (outside) for entity boundaries.
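
To make the BIO scheme concrete, the following minimal sketch decodes a tagged token sequence back into entities. The function name `bio_to_spans` and the policy of discarding a stray I- tag that does not continue an open entity are illustrative assumptions, not part of any particular library.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Convert parallel token and BIO-tag lists into (entity_text, entity_type) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, closing any open one first.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens = [token]
            current_type = tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # An I- tag of the same type extends the open entity.
            current_tokens.append(token)
        else:
            # "O" (or an inconsistent I- tag) closes any open entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens = []
            current_type = None

    # Flush an entity that runs to the end of the sequence.
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


tokens = ["Barack", "Obama", "was", "born", "in", "Hawaii"]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC"]
entities = bio_to_spans(tokens, tags)
print(entities)  # [('Barack Obama', 'PER'), ('Hawaii', 'LOC')]
```

The B- prefix is what lets two adjacent entities of the same type be told apart, since the second one starts with a fresh B- tag instead of continuing with I-.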

Relation Extraction

Relation Classification

$$P(r \mid e_1, e_2, \text{context}) = \text{softmax}(W h + b)$$

Here,

$r$ = Relation type
$e_1, e_2$ = Entity pair
$h$ = Contextual representation
$W, b$ = Classification parameters
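
As a worked illustration of the classification step, the sketch below computes $\text{softmax}(W h + b)$ in plain Python for a toy two-dimensional representation. The label set, weights, and values of `h` are made up for the example; in a real system `h` would come from an encoder and `W`, `b` would be learned.

```python
import math
from typing import Dict, List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_relation(
    h: List[float], W: List[List[float]], b: List[float], labels: List[str]
) -> Dict[str, float]:
    """Return P(r | e1, e2, context) for each relation label as softmax(W h + b)."""
    logits = [
        sum(w_j * h_j for w_j, h_j in zip(row, h)) + bias
        for row, bias in zip(W, b)
    ]
    return dict(zip(labels, softmax(logits)))


# Hypothetical toy values, chosen only to make the arithmetic easy to follow.
labels = ["founded_by", "located_in", "no_relation"]
h = [1.0, 0.5]
W = [[2.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b = [0.0, 0.0, 0.0]

relation_probs = classify_relation(h, W, b, labels)
predicted = max(relation_probs, key=relation_probs.get)
print(predicted, relation_probs)
```

With these numbers the logits are 2.0, 0.5, and 0.75, so `founded_by` receives the highest probability.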

LLM Approaches to IE

Zero-Shot Extraction

LLMs can extract information without task-specific training by using prompt engineering.

Zero-Shot NER with LLMs

Prompt: "Extract all person names from the following text: 'Barack Obama was born in Hawaii and served as the 44th President of the United States.'"

LLM Response: "Barack Obama"

The LLM recognizes "Barack Obama" as a person name based on its pretraining knowledge.

Few-Shot Extraction

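Providing a few labeled examples in the prompt usually improves extraction quality, mainly because the examples fix the label set and the output format the model is expected to follow.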

Few-Shot NER

Prompt: "Extract entities from text.

Text: 'Apple Inc. was founded by Steve Jobs in California.' Entities: Apple Inc. (ORG), Steve Jobs (PER), California (LOC)

Text: 'Microsoft announced a partnership with OpenAI in San Francisco.' Entities:"

LLM Response: "Microsoft (ORG), OpenAI (ORG), San Francisco (LOC)"
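
A prompt in this format can be assembled programmatically from a list of labeled demonstrations. The sketch below is one way to do it; the helper name `build_few_shot_prompt` and the `Demo` type alias are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

# A demonstration is a text plus its (entity, label) annotations.
Demo = Tuple[str, List[Tuple[str, str]]]


def build_few_shot_prompt(demos: List[Demo], query: str) -> str:
    """Render labeled demonstrations followed by the unlabeled query text."""
    lines = ["Extract entities from text.", ""]
    for text, entities in demos:
        formatted = ", ".join(f"{name} ({label})" for name, label in entities)
        lines.append(f"Text: '{text}' Entities: {formatted}")
        lines.append("")
    # The prompt ends on an open "Entities:" slot for the model to complete.
    lines.append(f"Text: '{query}' Entities:")
    return "\n".join(lines)


demos = [
    (
        "Apple Inc. was founded by Steve Jobs in California.",
        [("Apple Inc.", "ORG"), ("Steve Jobs", "PER"), ("California", "LOC")],
    )
]
prompt = build_few_shot_prompt(
    demos, "Microsoft announced a partnership with OpenAI in San Francisco."
)
print(prompt)
```

Keeping every demonstration in exactly the same layout as the final query line makes it more likely that the completion follows the same `name (LABEL)` format.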

Structured Output Generation

LLMs can generate structured output formats like JSON, YAML, or XML.

JSON Extraction

Prompt: "Extract information from the job posting and output as JSON:

Text: 'Senior ML Engineer at Google in Mountain View, CA. Salary: $180,000-$220,000.'"

LLM Output:

{
  "position": "Senior ML Engineer",
  "company": "Google",
  "location": "Mountain View, CA",
  "salary_range": {"min": 180000, "max": 220000}
}
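
In practice the model may surround the JSON with extra text, so downstream code usually has to locate the object before parsing it. The sketch below uses the standard library's `json.JSONDecoder.raw_decode` for that; the helper name `parse_first_json_object` and the sample response string are assumptions made for the example.

```python
import json
from typing import Any, Optional


def parse_first_json_object(raw: str) -> Optional[Any]:
    """Return the first JSON object embedded in raw, or None if there is none."""
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            # and ignores whatever text follows it.
            obj, _end = decoder.raw_decode(raw, start)
            return obj
        except json.JSONDecodeError:
            # Not a valid object at this brace; try the next one.
            start = raw.find("{", start + 1)
    return None


# Hypothetical model response with text before and after the JSON payload.
response = (
    "Here is the extracted information:\n"
    '{"position": "Senior ML Engineer", "company": "Google", '
    '"location": "Mountain View, CA", '
    '"salary_range": {"min": 180000, "max": 220000}}\n'
    "Let me know if you need anything else."
)
job = parse_first_json_object(response)
print(job["company"], job["salary_range"]["max"])
```

Returning `None` instead of raising lets the caller decide whether to retry the prompt or route the document to manual review.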

Evaluation Metrics

For NER

NER Evaluation (Entity-Level F1)

$$\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Here,

$\text{Precision}$ = Correctly extracted entities / Total extracted entities
$\text{Recall}$ = Correctly extracted entities / Total ground truth entities
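
The definitions above translate directly into a few lines of code. This sketch treats each entity as a `(text, type)` pair and compares sets, which is a simplification: a full evaluator would also match on position so that repeated mentions are counted separately. The function name `entity_prf` and the example predictions are illustrative.

```python
from typing import Set, Tuple

Entity = Tuple[str, str]  # (entity text, entity type)


def entity_prf(predicted: Set[Entity], gold: Set[Entity]) -> Tuple[float, float, float]:
    """Entity-level precision, recall, and F1 using exact (text, type) match."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = {("Microsoft", "ORG"), ("OpenAI", "ORG"), ("San Francisco", "LOC")}
predicted = {
    ("Microsoft", "ORG"),
    ("OpenAI", "ORG"),
    ("San Francisco", "ORG"),  # right span, wrong type: counts as an error
    ("partnership", "MISC"),   # spurious entity
}

precision, recall, f1 = entity_prf(predicted, gold)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")  # P=0.500 R=0.667 F1=0.571
```

Two of four predictions are correct (precision 0.5) and two of three gold entities are found (recall 2/3), giving F1 = 4/7.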

For Relation Extraction

Metric	Description	Use Case
Precision	Correct relations / Extracted relations	High-precision applications
Recall	Correct relations / Total relations	Comprehensive extraction
F1	Harmonic mean of precision and recall	Balanced evaluation
AUC-PR	Area under precision-recall curve	Imbalanced datasets

For IE tasks, entity-level evaluation (exact match) is stricter than token-level evaluation. Consider the task requirements when choosing evaluation metrics.
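
A small example shows how far the two views can diverge. In the sketch below the prediction truncates "New York City" to "New York": three of four tags are right at the token level, yet no entity matches exactly. The helper `tag_spans` is a hypothetical name, and token accuracy is used here as a simple stand-in for token-level scoring.

```python
from typing import List, Set, Tuple


def tag_spans(tags: List[str]) -> Set[Tuple[int, int, str]]:
    """Collect (start, end_exclusive, type) spans from a BIO tag sequence."""
    spans: Set[Tuple[int, int, str]] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing entity
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


# Tokens: "in", "New", "York", "City"
gold_tags = ["O", "B-LOC", "I-LOC", "I-LOC"]
pred_tags = ["O", "B-LOC", "I-LOC", "O"]

token_accuracy = sum(g == p for g, p in zip(gold_tags, pred_tags)) / len(gold_tags)
exact_matches = len(tag_spans(gold_tags) & tag_spans(pred_tags))

print(token_accuracy)  # 0.75
print(exact_matches)   # 0
```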

Practical Implementation

NER with LLMs

from transformers import AutoTokenizer, AutoModelForCausalLM
import json

# Gated checkpoint on the Hugging Face Hub; access must be requested first.
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

text = """Apple Inc. announced today that CEO Tim Cook will visit 
the European headquarters in Dublin, Ireland next week."""

prompt = f"""Extract all named entities from the following text and 
return as JSON with entity type:

Text: {text}

Output:"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)

# Decode only the newly generated tokens, skipping the echoed prompt.
prompt_length = inputs["input_ids"].shape[-1]
result = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)

# The model may wrap the JSON in extra text, so parsing can fail.
try:
    print(json.loads(result))
except json.JSONDecodeError:
    print("Model output was not valid JSON:", result)

Relation Extraction

def extract_relations(text, entity1, entity2, model, tokenizer):
    """Ask the model to name the relationship between two entities in text."""
    prompt = f"""What is the relationship between {entity1} and {entity2} 
in the following text?

Text: {text}

Relationship:"""
    
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=50)
    # Slice off the prompt tokens so only the generated answer is decoded.
    prompt_length = inputs["input_ids"].shape[-1]
    answer = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    return answer.strip()

# Example
text = "Elon Musk founded SpaceX in 2002."
relation = extract_relations(text, "Elon Musk", "SpaceX", model, tokenizer)
# Typically returns something like "founded" or "founder_of"; the wording is not guaranteed.

Structured Extraction with Pydantic

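from pydantic import BaseModel
from typing import Optional
from transformers import AutoTokenizer, AutoModelForCausalLM

class CompanyInfo(BaseModel):
    name: str
    # Optional fields default to None so that missing keys do not fail validation.
    founded_year: Optional[int] = None
    headquarters: Optional[str] = None
    ceo: Optional[str] = None
    industry: Optional[str] = None

def extract_company_info(text: str, model, tokenizer) -> CompanyInfo:
    prompt = f"""Extract company information from the text and return as JSON:

Text: {text}

CompanyInfo:"""
    
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=200)
    # Slice off the prompt tokens so only the generated continuation is decoded.
    prompt_length = inputs["input_ids"].shape[-1]
    result = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    
    # Pydantic v2 API; on Pydantic v1 use CompanyInfo.parse_raw(result) instead.
    # Raises pydantic.ValidationError if the output does not match the model.
    return CompanyInfo.model_validate_json(result.strip())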

For structured output generation, provide explicit JSON schemas in the prompt and validate the output against the schema. Use Pydantic models for type validation.
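
The idea of stating the schema in the prompt and checking the response against it can be sketched without any dependencies. The example below is a simplified stand-in for what a Pydantic model does: `COMPANY_SCHEMA`, `describe_schema`, and `validate_company` are hypothetical names, and the schema format (a type plus a required flag per field) is an assumption made for brevity.

```python
import json
from typing import List

# field name -> (expected Python type, required?)
COMPANY_SCHEMA = {
    "name": (str, True),
    "founded_year": (int, False),
    "headquarters": (str, False),
    "ceo": (str, False),
    "industry": (str, False),
}


def describe_schema() -> str:
    """Render the schema as plain text that can be pasted into the prompt."""
    lines = []
    for name, (expected_type, required) in COMPANY_SCHEMA.items():
        suffix = "required" if required else "optional, use null if absent"
        lines.append(f"- {name}: {expected_type.__name__} ({suffix})")
    return "\n".join(lines)


def validate_company(raw_json: str) -> List[str]:
    """Return a list of validation errors; an empty list means the output is valid."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]

    errors = []
    for field, (expected_type, required) in COMPANY_SCHEMA.items():
        value = data.get(field)
        if value is None:
            if required:
                errors.append(f"missing required field: {field}")
        elif isinstance(value, bool) or not isinstance(value, expected_type):
            # bool is excluded explicitly because it is a subclass of int.
            errors.append(f"{field}: expected {expected_type.__name__}")
    for field in sorted(data.keys() - COMPANY_SCHEMA.keys()):
        errors.append(f"unexpected field: {field}")
    return errors


print(describe_schema())
print(validate_company('{"name": "SpaceX", "founded_year": 2002}'))
print(validate_company('{"founded_year": "2002"}'))
```

The error list can be fed back to the model in a retry prompt, which tends to be more useful than a bare parse failure.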

Advanced Techniques

Nested NER

Some entities contain other entities:

Nested NER

"University of California, Berkeley" contains:

"University of California, Berkeley" (ORG)
"California" (LOC)
"Berkeley" (LOC)

Nested NER requires models that can handle overlapping entity spans.
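
One common way to represent overlapping entities is as character-offset spans, which, unlike a single BIO tag per token, allows one span to sit inside another. The sketch below encodes the example above this way and lists which spans are nested; the helper names `find_span` and `nested_pairs` are illustrative.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, type)


def find_span(text: str, mention: str, label: str) -> Span:
    """Locate the first occurrence of mention in text as a character span."""
    start = text.index(mention)
    return (start, start + len(mention), label)


def nested_pairs(spans: List[Span]) -> List[Tuple[Span, Span]]:
    """Return (outer, inner) pairs where inner lies entirely within outer."""
    pairs = []
    for i, outer in enumerate(spans):
        for j, inner in enumerate(spans):
            if i != j and outer[0] <= inner[0] and inner[1] <= outer[1]:
                pairs.append((outer, inner))
    return pairs


text = "University of California, Berkeley"
spans = [
    find_span(text, "University of California, Berkeley", "ORG"),
    find_span(text, "California", "LOC"),
    find_span(text, "Berkeley", "LOC"),
]

for outer, inner in nested_pairs(spans):
    print(f"{text[inner[0]:inner[1]]!r} ({inner[2]}) is nested in "
          f"{text[outer[0]:outer[1]]!r} ({outer[2]})")
```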

Few-Shot with Demonstrations

Selecting the right demonstrations significantly impacts performance:

Similar examples: Choose demonstrations similar to the target text
Diverse examples: Cover different entity types and patterns
Balanced examples: Ensure equal representation of entity types
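
The first criterion, similarity to the target text, can be approximated very cheaply with word overlap. The sketch below ranks a pool of candidate demonstrations by Jaccard similarity to the query; this is only one possible similarity measure (embedding-based retrieval is a common alternative), and it does not address diversity or balance. The function names and the example pool are assumptions.

```python
from typing import List


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two texts (case-insensitive)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def select_demonstrations(pool: List[str], query: str, k: int = 2) -> List[str]:
    """Return the k pool texts most similar to the query."""
    return sorted(pool, key=lambda demo: jaccard(demo, query), reverse=True)[:k]


pool = [
    "Apple Inc. was founded by Steve Jobs in California.",
    "The patient was prescribed 20mg of atorvastatin daily.",
    "Microsoft announced a partnership with OpenAI in San Francisco.",
]
query = "Google announced a partnership with Anthropic in London."

selected = select_demonstrations(pool, query, k=2)
print(selected)
```

Here the Microsoft sentence ranks first because it shares most of its wording with the query, while the clinical sentence shares no words and is left out.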

Schema-Guided Extraction

Definition: Schema-Guided Extraction

Schema-guided extraction uses a predefined schema to guide extraction. The schema defines entity types, relation types, and constraints, ensuring consistent and complete extraction.

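Constrained Decoding

$$P(x_t \mid x_{<t}) \propto \text{softmax}(z_t / T) \odot m_t$$

Here,

$z_t$ = Logits at position $t$
$T$ = Temperature
$m_t$ = Binary mask over the vocabulary based on schema constraints (1 for tokens the schema allows at position $t$, 0 otherwise); the masked probabilities are renormalized to sum to one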
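
The masking step can be sketched in a few lines of plain Python. The example assumes a toy four-token vocabulary and a point in the output where the schema only permits an entity label; the function name `constrained_distribution` and all values are illustrative. Multiplying probabilities by a 0/1 mask leaves a distribution that no longer sums to one, so the sketch renormalizes afterwards.

```python
import math
from typing import List


def constrained_distribution(
    logits: List[float], mask: List[int], temperature: float = 1.0
) -> List[float]:
    """Apply temperature softmax, zero out disallowed tokens, and renormalize."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Zero out tokens the schema forbids at this step.
    masked = [p * m for p, m in zip(probs, mask)]
    norm = sum(masked)
    if norm == 0:
        raise ValueError("mask excludes every token")
    return [p / norm for p in masked]


# Hypothetical vocabulary: only the three entity labels are allowed here.
vocab = ["PER", "ORG", "LOC", "banana"]
logits = [1.0, 2.0, 0.5, 3.0]
mask = [1, 1, 1, 0]

probs = constrained_distribution(logits, mask, temperature=1.0)
best = vocab[probs.index(max(probs))]
print(best, probs)
```

Although "banana" has the highest raw logit, it gets probability zero, and the most likely remaining token is "ORG".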

Challenges and Solutions

Ambiguity in Entity Boundaries

Entity Boundary Ambiguity

"New York City" vs. "New York"
"Dr. John Smith" vs. "John Smith"
"IBM Corporation" vs. "IBM"

Boundary decisions depend on the task requirements and annotation guidelines.

Cross-Domain Generalization

LLMs often generalize better across domains than fine-tuned models, but may still struggle with domain-specific terminology.

Hallucination in Extraction

LLMs may extract information not present in the text or misclassify entities.

Always validate extracted information against the source text. Implement confidence scoring and allow the model to abstain when uncertain.
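
The simplest form of source validation is checking that every extracted mention actually occurs in the input text. The sketch below does this with a case-insensitive, whitespace-normalized substring test; it covers only the validation step, not confidence scoring or abstention, and it will reject legitimately normalized values (for example a reformatted date). The function name `verify_against_source` is an assumption.

```python
from typing import List, Tuple


def verify_against_source(source: str, extracted: List[str]) -> Tuple[List[str], List[str]]:
    """Split extracted mentions into those found in the source and those not found."""
    normalized_source = " ".join(source.lower().split())
    supported, unsupported = [], []
    for mention in extracted:
        normalized = " ".join(mention.lower().split())
        if normalized and normalized in normalized_source:
            supported.append(mention)
        else:
            unsupported.append(mention)
    return supported, unsupported


source = (
    "Apple Inc. announced today that CEO Tim Cook will visit "
    "the European headquarters in Dublin, Ireland next week."
)
# Hypothetical model output that includes one entity not present in the text.
extracted = ["Apple Inc.", "Tim Cook", "Dublin", "Steve Jobs"]

supported, unsupported = verify_against_source(source, extracted)
print("supported:", supported)
print("unsupported:", unsupported)
```

Mentions that land in the unsupported list are candidates for removal or for human review.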

Best Practices

Prompt Design

Clear entity definitions: Define each entity type clearly
Output format specification: Specify exact output format
Boundary guidelines: Clarify how to handle ambiguous boundaries
Error handling: Specify behavior for missing or unclear information

Quality Assurance

Human review: Sample and review extractions
Consistency checks: Validate extractions against schema
Confidence scoring: Track extraction confidence
Active learning: Use uncertain cases to improve the system

For production IE systems, consider a hybrid approach: use LLMs for initial extraction and rule-based systems for validation and post-processing.

Practice Exercises

NER Evaluation: Compare zero-shot and few-shot NER performance on a benchmark dataset. How many examples are needed for competitive performance?
Relation Extraction: Build a relation extraction system for a specific domain (e.g., biomedical, financial). How does domain specificity affect performance?
Schema Design: Design a schema for extracting information from news articles. What entity and relation types are needed?
Error Analysis: Analyze common extraction errors in your system. What patterns emerge?

Key Takeaways:

LLMs enable zero-shot and few-shot information extraction
Structured output generation (JSON, YAML) enables integration with downstream systems
Entity-level evaluation is stricter than token-level evaluation
Schema-guided extraction ensures consistency and completeness
Always validate extracted information against source text

What to Learn Next

-> LLM for Sentiment Analysis: Aspect-based sentiment, emotion detection, and opinion mining.

-> LLM for Recommendation Systems: Conversational recommenders, preference learning, and cold start solutions.

-> LLM for Content Creation: Creative writing, marketing copy, and content generation at scale.

-> LLM Compliance and Governance: Regulatory compliance, audit trails, and data governance for LLMs.

-> LLM Testing Strategies: Unit testing, integration testing, and regression testing for LLM systems.

-> LLM Capstone Project: End-to-end LLM application project with design decisions and deployment.
