LLM for Information Extraction

LLM for Information Extraction — Structuring Unstructured Data

Information extraction transforms unstructured text into structured data. LLMs excel at this task by understanding context, recognizing entities, and extracting relationships with minimal task-specific training.

  • Named Entity Recognition — Identifying and classifying entities in text
  • Relation Extraction — Discovering relationships between entities
  • Structured Output — Generating structured data from natural language

Data is the new oil; information extraction is the refinery.

Information extraction (IE) is the task of extracting structured information from unstructured text. LLMs have transformed IE by enabling zero-shot and few-shot extraction, cross-domain generalization, and complex schema understanding.

Definition: Information Extraction

Information extraction is the task of automatically extracting structured information from unstructured text. This includes identifying entities, extracting relationships between entities, and filling predefined templates.

Information Extraction Tasks

Named Entity Recognition (NER)

Definition: Named Entity Recognition

Named Entity Recognition (NER) identifies and classifies named entities in text into predefined categories such as person names, organizations, locations, dates, and quantities.

Standard entity types:

  • PER: Person names
  • ORG: Organizations
  • LOC: Locations
  • DATE: Dates and times
  • MISC: Miscellaneous entities

Relation Extraction

Definition: Relation Extraction

Relation extraction identifies semantic relationships between entities in text. For example, in "Apple was founded by Steve Jobs," the relation is "founded_by" between "Apple" and "Steve Jobs."
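
An extracted relation is often stored as a (head, relation, tail) triple. The sketch below shows one possible representation of the example sentence; the `Relation` class and its field names are illustrative assumptions, not a standard API.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    """A (head, relation, tail) triple; the field names are illustrative."""
    head: str
    relation: str
    tail: str


# The sentence "Apple was founded by Steve Jobs" expressed as a triple.
triple = Relation(head="Apple", relation="founded_by", tail="Steve Jobs")
print(triple)
```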

Event Extraction

Definition: Event Extraction

Event extraction identifies events in text, including the event trigger, event type, and participating arguments (who, what, when, where).

Template Filling

Definition: Template Filling

Template filling extracts information to fill a predefined template with slots for specific types of information. For example, a job posting template might have slots for company, position, salary, and location.
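
The job posting template can be sketched as a small data class whose slots stay empty when nothing was extracted. `JobPostingTemplate` and `fill_template` are hypothetical names chosen for this illustration.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class JobPostingTemplate:
    """Slots from the job posting example; every slot may stay unfilled."""
    company: Optional[str] = None
    position: Optional[str] = None
    salary: Optional[str] = None
    location: Optional[str] = None


def fill_template(extracted: dict) -> JobPostingTemplate:
    """Copy known slots from `extracted`; unknown keys are ignored."""
    slot_names = {f.name for f in fields(JobPostingTemplate)}
    known = {k: v for k, v in extracted.items() if k in slot_names}
    return JobPostingTemplate(**known)


filled = fill_template(
    {"company": "Google", "position": "Senior ML Engineer", "team": "Search"}
)
print(filled)
```

Slots with no extracted value remain `None`, which keeps "not mentioned in the text" distinct from an empty string.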

Mathematical Formulation

Sequence Labeling for NER

NER as Sequence Labeling

$$P(y_1, \ldots, y_n \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} P(y_i \mid y_{i-1}, x_1, \ldots, x_n)$$

Here,

  • $x_i$ = Input token at position $i$
  • $y_i$ = Label at position $i$ (e.g., B-PER, I-PER, O)
  • $n$ = Sequence length

NER uses BIO tagging: B- (beginning), I- (inside), O (outside) for entity boundaries.
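
To make the tagging scheme concrete, here is a minimal sketch that converts a BIO-tagged token sequence back into entity spans. The function name `bio_to_spans` is an assumption for this example, and the sketch simply drops an I- tag that does not continue an open entity of the same type.

```python
def bio_to_spans(tokens, tags):
    """Convert parallel token and BIO-tag lists into (text, type) pairs."""
    spans = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity; close any open one first.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


tokens = ["Barack", "Obama", "was", "born", "in", "Hawaii"]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC"]
print(bio_to_spans(tokens, tags))
# [('Barack Obama', 'PER'), ('Hawaii', 'LOC')]
```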

Relation Extraction

Relation Classification

$$P(r \mid e_1, e_2, \text{context}) = \text{softmax}(W h + b)$$

Here,

  • $r$ = Relation type
  • $e_1, e_2$ = Entity pair
  • $h$ = Contextual representation
  • $W, b$ = Classification parameters
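
The classification step can be sketched in plain Python. The weights, the two-dimensional representation `h`, and the label set below are made-up toy values for illustration only; a real system would learn `W` and `b` and obtain `h` from an encoder.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_relation(h, W, b, labels):
    """Compute softmax(W h + b) and pair each probability with its label."""
    logits = [
        sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b_i
        for row, b_i in zip(W, b)
    ]
    return dict(zip(labels, softmax(logits)))


# Toy values (assumptions): 3 relation types, 2-dimensional representation.
labels = ["founded_by", "located_in", "no_relation"]
h = [1.0, 0.0]
W = [[2.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
b = [0.0, 0.0, 0.0]

probs = classify_relation(h, W, b, labels)
print(max(probs, key=probs.get))
```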

LLM Approaches to IE

Zero-Shot Extraction

LLMs can extract information without task-specific training by using prompt engineering.

Zero-Shot NER with LLMs

Prompt: "Extract all person names from the following text: 'Barack Obama was born in Hawaii and served as the 44th President of the United States.'"

LLM Response: "Barack Obama"

The LLM recognizes "Barack Obama" as a person name based on its pretraining knowledge.

Few-Shot Extraction

Providing a few labeled examples in the prompt usually improves extraction quality, in particular adherence to the expected label set and output format.

Few-Shot NER

Prompt: "Extract entities from text.

Text: 'Apple Inc. was founded by Steve Jobs in California.' Entities: Apple Inc. (ORG), Steve Jobs (PER), California (LOC)

Text: 'Microsoft announced a partnership with OpenAI in San Francisco.' Entities:"

LLM Response: "Microsoft (ORG), OpenAI (ORG), San Francisco (LOC)"
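
A prompt in this format can be assembled programmatically from a list of demonstrations. The helper below is a sketch; the name `build_few_shot_prompt` and the (text, entities) data layout are assumptions made for this example.

```python
def build_few_shot_prompt(demonstrations, target_text):
    """Build a few-shot NER prompt from (text, [(mention, label), ...]) pairs."""
    lines = ["Extract entities from text.", ""]
    for text, entities in demonstrations:
        formatted = ", ".join(f"{mention} ({label})" for mention, label in entities)
        lines.append(f"Text: '{text}'")
        lines.append(f"Entities: {formatted}")
        lines.append("")
    # The target text ends with an open "Entities:" for the model to complete.
    lines.append(f"Text: '{target_text}'")
    lines.append("Entities:")
    return "\n".join(lines)


demos = [
    (
        "Apple Inc. was founded by Steve Jobs in California.",
        [("Apple Inc.", "ORG"), ("Steve Jobs", "PER"), ("California", "LOC")],
    )
]
few_shot_prompt = build_few_shot_prompt(
    demos, "Microsoft announced a partnership with OpenAI in San Francisco."
)
print(few_shot_prompt)
```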

Structured Output Generation

LLMs can generate structured output formats like JSON, YAML, or XML.

JSON Extraction

Prompt: "Extract information from the job posting and output as JSON:

Text: 'Senior ML Engineer at Google in Mountain View, CA. Salary: $180,000-$220,000.'"

LLM Output:

{
  "position": "Senior ML Engineer",
  "company": "Google",
  "location": "Mountain View, CA",
  "salary_range": {"min": 180000, "max": 220000}
}
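
In practice a model may surround the JSON with extra text, so the output usually needs defensive parsing. The sketch below scans for the first parseable JSON object using only the standard library; `parse_json_output` is a hypothetical helper name, and the raw string is a made-up model response.

```python
import json


def parse_json_output(raw: str):
    """Return the first JSON object found in `raw`, or None if none parses."""
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`.
            obj, _end = decoder.raw_decode(raw, start)
            return obj
        except json.JSONDecodeError:
            start = raw.find("{", start + 1)
    return None


raw_output = (
    "Here is the extracted information:\n"
    '{"position": "Senior ML Engineer", "company": "Google", '
    '"salary_range": {"min": 180000, "max": 220000}}\n'
    "Let me know if you need anything else."
)
parsed = parse_json_output(raw_output)
print(parsed["salary_range"])
```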

Evaluation Metrics

For NER

NER Evaluation (Entity-Level F1)

$$\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Here,

  • Precision = Correctly extracted entities / Total extracted entities
  • Recall = Correctly extracted entities / Total ground truth entities
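
The definitions above translate directly into code. This sketch treats each entity as a (mention, type) pair and counts only exact matches; it is a simplification, since benchmark scorers normally also compare span positions. The function name `entity_f1` is an assumption.

```python
def entity_f1(predicted, gold):
    """Entity-level precision, recall, and F1 over exact (mention, type) matches."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = {("Microsoft", "ORG"), ("OpenAI", "ORG"), ("San Francisco", "LOC")}
# "San" has a wrong boundary, so it earns no credit at the entity level.
pred = {("Microsoft", "ORG"), ("OpenAI", "ORG"), ("San", "LOC")}

precision, recall, f1 = entity_f1(pred, gold)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```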

For Relation Extraction

| Metric | Description | Use Case |
|---|---|---|
| Precision | Correct relations / Extracted relations | High-precision applications |
| Recall | Correct relations / Total relations | Comprehensive extraction |
| F1 | Harmonic mean of precision and recall | Balanced evaluation |
| AUC-PR | Area under precision-recall curve | Imbalanced datasets |

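For IE tasks, entity-level evaluation (exact match) is stricter than token-level evaluation: an entity counts as correct only if its full span and its type both match, so a prediction with one wrong boundary token earns no credit. Consider the task requirements when choosing evaluation metrics.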

Practical Implementation

NER with LLMs

from transformers import AutoTokenizer, AutoModelForCausalLM
import json

# Hugging Face Hub ID for Llama 3 8B Instruct (gated: requires accepting the license).
model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

text = """Apple Inc. announced today that CEO Tim Cook will visit 
the European headquarters in Dublin, Ireland next week."""

prompt = f"""Extract all named entities from the following text and 
return as JSON with entity type:

Text: {text}

Output:"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
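# Decode only the newly generated tokens, skipping the prompt tokens.
prompt_length = inputs["input_ids"].shape[-1]
result = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)

# The model may wrap the JSON in extra text, so guard the parse.
try:
    print(json.loads(result))
except json.JSONDecodeError:
    print("Model output was not valid JSON:", result)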

Relation Extraction

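def extract_relations(text, entity1, entity2, model, tokenizer):
    """Ask the model for the relation between two entities mentioned in text."""
    prompt = f"""What is the relationship between {entity1} and {entity2} 
in the following text?

Text: {text}

Relationship:"""
    
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=50)
    # Decode only the generated continuation, not the prompt tokens.
    prompt_length = inputs["input_ids"].shape[-1]
    answer = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    return answer.strip()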

# Example
text = "Elon Musk founded SpaceX in 2002."
relation = extract_relations(text, "Elon Musk", "SpaceX", model, tokenizer)
# The output is free-form text; it might be "founded" or "founder_of".

Structured Extraction with Pydantic

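from pydantic import BaseModel
from typing import Optional
from transformers import AutoTokenizer, AutoModelForCausalLM

class CompanyInfo(BaseModel):
    name: str
    # In Pydantic v2, Optional fields are still required unless they
    # have an explicit default.
    founded_year: Optional[int] = None
    headquarters: Optional[str] = None
    ceo: Optional[str] = None
    industry: Optional[str] = None

def extract_company_info(text: str, model, tokenizer) -> CompanyInfo:
    prompt = f"""Extract company information from the text and return as JSON
with the keys name, founded_year, headquarters, ceo, industry:

Text: {text}

CompanyInfo:"""
    
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=200)
    # Decode only the generated continuation, not the prompt tokens.
    prompt_length = inputs["input_ids"].shape[-1]
    result = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    
    # Pydantic v2 API (parse_raw is deprecated). Raises ValidationError if
    # the output is not valid JSON or does not match the schema.
    return CompanyInfo.model_validate_json(result)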

For structured output generation, provide explicit JSON schemas in the prompt and validate the output against the schema. Use Pydantic models for type validation.

Advanced Techniques

Nested NER

Some entities contain other entities:

Nested NER

"University of California, Berkeley" contains:

  • "University of California, Berkeley" (ORG)
  • "California" (LOC)
  • "Berkeley" (LOC)

Nested NER requires models that can handle overlapping entity spans.
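
One way to represent overlapping entities is with character offsets, which allow one span to sit inside another. The sketch below locates the example mentions and lists which spans are nested; `find_span` and `nested_pairs` are hypothetical helper names.

```python
def find_span(text, mention, label):
    """Return (start, end, label) for the first occurrence of `mention`."""
    start = text.find(mention)
    if start == -1:
        raise ValueError(f"{mention!r} not found in text")
    return (start, start + len(mention), label)


def nested_pairs(spans):
    """Return (outer, inner) pairs where inner lies inside a larger outer span."""
    pairs = []
    for outer in spans:
        for inner in spans:
            contains = outer[0] <= inner[0] and inner[1] <= outer[1]
            same_extent = (outer[0], outer[1]) == (inner[0], inner[1])
            if contains and not same_extent:
                pairs.append((outer, inner))
    return pairs


text = "University of California, Berkeley"
spans = [
    find_span(text, "University of California, Berkeley", "ORG"),
    find_span(text, "California", "LOC"),
    find_span(text, "Berkeley", "LOC"),
]

for outer, inner in nested_pairs(spans):
    print(text[inner[0]:inner[1]], "is nested inside", text[outer[0]:outer[1]])
```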

Few-Shot with Demonstrations

Selecting the right demonstrations significantly impacts performance:

  1. Similar examples: Choose demonstrations similar to the target text
  2. Diverse examples: Cover different entity types and patterns
  3. Balanced examples: Ensure equal representation of entity types
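
The first criterion, similarity to the target text, can be sketched with a simple word-overlap score. This is only an illustration: Jaccard similarity over lowercased whitespace tokens stands in for whatever similarity measure a real system would use (embeddings are a common choice), and the sketch does not address diversity or balance. The pool of demonstrations is made up.

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two texts."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def select_demonstrations(pool, target_text, k=2):
    """Pick the k demonstrations whose text is most similar to the target."""
    ranked = sorted(
        pool, key=lambda demo: jaccard(demo["text"], target_text), reverse=True
    )
    return ranked[:k]


# Hypothetical demonstration pool.
pool = [
    {"text": "Pfizer reported strong quarterly earnings in New York.",
     "entities": "Pfizer (ORG), New York (LOC)"},
    {"text": "The patient was given 50 mg of aspirin.",
     "entities": "(none)"},
    {"text": "Google opened a new office in Berlin.",
     "entities": "Google (ORG), Berlin (LOC)"},
]

selected = select_demonstrations(pool, "Amazon opened a new warehouse in Madrid.")
for demo in selected:
    print(demo["text"])
```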

Schema-Guided Extraction

Definition: Schema-Guided Extraction

Schema-guided extraction uses a predefined schema to guide extraction. The schema defines entity types, relation types, and constraints, ensuring consistent and complete extraction.

Constrained Decoding

$$P(x_t \mid x_{<t}) \propto \text{softmax}(z_t / T) \odot m_t$$

Here,

  • $z_t$ = Logits at position $t$
  • $T$ = Temperature
  • $m_t$ = Binary mask based on schema constraints (1 for allowed tokens, 0 otherwise); the masked distribution is renormalized so that it sums to 1
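
The masking step can be sketched in plain Python. The vocabulary, logits, and mask below are toy values assumed for illustration; the probabilities are renormalized after masking so that they sum to 1 over the allowed tokens.

```python
import math


def masked_softmax(logits, mask, temperature=1.0):
    """Softmax over logits / T restricted to tokens whose mask entry is truthy."""
    if not any(mask):
        raise ValueError("mask must allow at least one token")
    scaled = [z / temperature for z in logits]
    # Subtract the largest allowed score for numerical stability.
    top = max(s for s, m in zip(scaled, mask) if m)
    weights = [math.exp(s - top) if m else 0.0 for s, m in zip(scaled, mask)]
    total = sum(weights)
    return [w / total for w in weights]


# Toy setup (assumptions): the schema allows only "{" or "}" at this step,
# even though the unconstrained model prefers "hello".
vocab = ["{", "}", "hello", "42"]
logits = [2.0, 1.0, 5.0, 0.5]
mask = [1, 1, 0, 0]

probs = masked_softmax(logits, mask)
print(dict(zip(vocab, probs)))
```

Disallowed tokens receive probability exactly zero, so sampling from `probs` can never produce output that violates the schema at this step.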

Challenges and Solutions

Ambiguity in Entity Boundaries

Entity Boundary Ambiguity

  • "New York City" vs. "New York"
  • "Dr. John Smith" vs. "John Smith"
  • "IBM Corporation" vs. "IBM"

Boundary decisions depend on the task requirements and annotation guidelines.

Cross-Domain Generalization

LLMs often generalize better across domains than fine-tuned models, but may still struggle with domain-specific terminology.

Hallucination in Extraction

LLMs may extract information not present in the text or misclassify entities.

Always validate extracted information against the source text. Implement confidence scoring and allow the model to abstain when uncertain.
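
A basic form of this validation is to check that every extracted mention actually occurs in the source text. The sketch below uses case-insensitive substring matching after whitespace normalization, which is a deliberately simple assumption; it would not catch a real mention attached to the wrong type. `ground_entities` is a hypothetical helper name.

```python
def ground_entities(source_text, extracted):
    """Split (mention, label) pairs into those found in the source and those not."""
    normalized_source = " ".join(source_text.lower().split())
    grounded, ungrounded = [], []
    for mention, label in extracted:
        needle = " ".join(mention.lower().split())
        if needle and needle in normalized_source:
            grounded.append((mention, label))
        else:
            ungrounded.append((mention, label))
    return grounded, ungrounded


source = (
    "Barack Obama was born in Hawaii and served as the 44th President "
    "of the United States."
)
# "Michelle Obama" is not in the source, so it should be flagged.
extracted = [("Barack Obama", "PER"), ("Hawaii", "LOC"), ("Michelle Obama", "PER")]

grounded, ungrounded = ground_entities(source, extracted)
print("Grounded:", grounded)
print("Flagged for review:", ungrounded)
```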

Best Practices

Prompt Design

  1. Clear entity definitions: Define each entity type clearly
  2. Output format specification: Specify exact output format
  3. Boundary guidelines: Clarify how to handle ambiguous boundaries
  4. Error handling: Specify behavior for missing or unclear information

Quality Assurance

  1. Human review: Sample and review extractions
  2. Consistency checks: Validate extractions against schema
  3. Confidence scoring: Track extraction confidence
  4. Active learning: Use uncertain cases to improve the system

For production IE systems, consider a hybrid approach: use LLMs for initial extraction and rule-based systems for validation and post-processing.

Practice Exercises

  1. NER Evaluation: Compare zero-shot and few-shot NER performance on a benchmark dataset. How many examples are needed for competitive performance?

  2. Relation Extraction: Build a relation extraction system for a specific domain (e.g., biomedical, financial). How does domain specificity affect performance?

  3. Schema Design: Design a schema for extracting information from news articles. What entity and relation types are needed?

  4. Error Analysis: Analyze common extraction errors in your system. What patterns emerge?

Key Takeaways:

  • LLMs enable zero-shot and few-shot information extraction
  • Structured output generation (JSON, YAML) enables integration with downstream systems
  • Entity-level evaluation is stricter than token-level evaluation
  • Schema-guided extraction ensures consistency and completeness
  • Always validate extracted information against source text

