Introduction
Google's LangExtract is an open-source Python library that transforms unstructured text into structured, actionable data using large language models such as Gemini and OpenAI's GPT models. Released in July 2025, it tackles the challenge of extracting reliable, traceable information from complex documents, from clinical notes and legal contracts to research papers and customer feedback.
Unlike traditional NER tools that require extensive training data, LangExtract uses LLMs to adapt to any domain with just 3-5 examples, while maintaining precise source grounding for every extraction.
In this comprehensive guide, we'll explore how to implement LangExtract in production environments, optimize performance, and leverage its capabilities for various AI applications.
What is LangExtract?
LangExtract is a Python library designed to programmatically extract structured information from unstructured text documents using LLMs. Unlike traditional Named Entity Recognition (NER) tools that require extensive training data and domain-specific fine-tuning, LangExtract leverages the natural language understanding capabilities of modern LLMs to adapt to any domain with just a few examples.
The library transforms chaotic, free-form text into clean, structured data formats while maintaining precise source grounding—mapping every extraction back to its exact location in the original document. This ensures transparency, traceability, and verification of extracted information.
- No training required - works with just 3-5 examples
- Character-level source grounding for verification
- Supports 100+ languages out of the box
How LangExtract Works
LangExtract operates through a sophisticated pipeline that combines prompt engineering, few-shot learning, and controlled generation to extract structured information from text.
Core Architecture
The extraction pipeline consists of several key steps:
- Input Processing: Accepts text documents, URLs, or file paths as input
- Prompt Engineering: Uses developer-defined extraction prompts with clear instructions
- Few-Shot Learning: Leverages example data to guide the model's understanding
- LLM Processing: Employs advanced language models (Gemini, GPT, or local models via Ollama) for extraction
- Source Grounding: Maps each extracted entity to its precise location in the source text
- Structured Output: Generates JSONL-format data with a consistent schema (see the sketch after this list)
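The last two steps are worth seeing in code. A minimal sketch, assuming a result object returned by lx.extract (as in the examples later in this guide); each extraction carries a char_interval with its character offsets into the source text:
import langextract as lx
# `result` is assumed to come from an earlier lx.extract(...) call
for extraction in result.extractions:
    print(extraction.extraction_class, "->", extraction.extraction_text)
    interval = extraction.char_interval  # Character offsets used for source grounding
    if interval is not None:
        print(f"  grounded at characters {interval.start_pos}-{interval.end_pos}")
# Persist the annotated document as JSONL for downstream use
lx.io.save_annotated_documents([result], output_name="extractions.jsonl", output_dir=".")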
Key Features
LangExtract provides several powerful features that set it apart from traditional extraction tools:
- Precise Source Grounding: Every extraction includes character-level mapping to the original text
- Controlled Generation: Uses schema constraints and few-shot examples to ensure consistent outputs (see the sketch after this list)
- Long Document Processing: Handles extensive documents through intelligent text chunking
- Multi-Model Support: Works with cloud-based models (Gemini, OpenAI) and local models via Ollama
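Controlled generation is surfaced as parameters on lx.extract. A minimal sketch, assuming prompt, examples, and input_text are defined as in the Basic Entity Extraction example later in this guide; the two flags shown here exist in the library, but treat their exact defaults as version-dependent:
import langextract as lx
# prompt, examples, and input_text are assumed defined as in the basic example below
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,            # Few-shot examples also act as an implicit output schema
    model_id="gemini-2.5-flash",
    use_schema_constraints=True,  # Derive structured-output constraints from the examples
    fence_output=False            # Expect raw JSON rather than fenced output
)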
Installation and Setup
Getting started with LangExtract is straightforward. First, install the library using pip:
# Standard installation
pip install langextract
# For OpenAI models
pip install "langextract[openai]"
# For development (run from a cloned source checkout)
pip install -e ".[dev]"
For cloud-based models, you'll need to configure API access. Set up your API key using environment variables:
# Option 1: Environment variable
export LANGEXTRACT_API_KEY="your-api-key-here"
# Option 2: .env file (recommended)
echo "LANGEXTRACT_API_KEY=your-api-key-here" > .env
echo ".env" >> .gitignore
Complete Code Examples
Basic Entity Extraction
Here's a simple example of extracting entities from text using LangExtract:
import langextract as lx
import textwrap

# Define extraction prompt
prompt = textwrap.dedent("""\
    Extract characters, emotions, and relationships in order of appearance.
    Use exact text for extractions. Do not paraphrase or overlap entities.
    Provide meaningful attributes for each entity to add context.""")

# Provide few-shot examples
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks?",
        extractions=[
            lx.data.Extraction(
                extraction_class="character",
                extraction_text="ROMEO",
                attributes={"emotional_state": "wonder"}
            ),
            lx.data.Extraction(
                extraction_class="emotion",
                extraction_text="But soft!",
                attributes={"feeling": "gentle awe"}
            ),
        ]
    )
]

# Input text to process
input_text = "Lady Juliet gazed longingly at the stars"

# Run extraction
result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash"
)

# Display results
for extraction in result.extractions:
    print(f"Class: {extraction.extraction_class}")
    print(f"Text: {extraction.extraction_text}")
    print(f"Attributes: {extraction.attributes}")
    print("---")
Advanced Document Processing
For more complex extraction tasks, you can optimize the extraction process with multiple passes and parallel processing:
import langextract as lx
import textwrap

# Complex extraction for business documents
prompt = textwrap.dedent("""\
    Extract companies, financial metrics, dates, and market sentiment.
    Use exact text for extractions. Include specific values and context.""")

examples = [
    lx.data.ExampleData(
        text="TechCorp reported Q3 revenue of $2.5B on October 15, 2024",
        extractions=[
            lx.data.Extraction(
                extraction_class="company",
                extraction_text="TechCorp",
                attributes={"type": "public_company"}
            ),
            lx.data.Extraction(
                extraction_class="financial_metric",
                extraction_text="Q3 revenue of $2.5B",
                attributes={"metric_type": "revenue", "period": "Q3", "value": "$2.5B"}
            ),
        ]
    )
]

# Process a large document with optimization
# (read the file yourself; lx.extract takes text or documents, not file paths)
with open("path/to/large_document.txt") as f:
    document_text = f.read()

result = lx.extract(
    text_or_documents=document_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    extraction_passes=3,    # Multiple passes for better recall
    max_workers=20,         # Parallel processing
    max_char_buffer=1000    # Chunk size for long documents
)

print(f"Extracted {len(result.extractions)} entities")
Real-World Applications
LangExtract excels in various real-world applications where structured information extraction is critical. Here are some practical implementations:
- Healthcare: Extract medications, dosages, symptoms, and diagnoses from clinical notes with precise accuracy.
- Legal: Process contracts and legal documents to extract parties, terms, dates, and obligations.
- Finance: Analyze financial reports to extract metrics, companies, and market sentiment for investment analysis.
- Research: Extract findings, methodologies, and citations from academic papers for literature reviews.
- Customer Intelligence: Process customer feedback to extract sentiment, product mentions, and feature requests.
Interactive HTML Visualization
One of LangExtract's most powerful features is its ability to generate interactive HTML visualizations that highlight extracted entities directly in the source text:
import langextract as lx

# Run extraction (read the file first; lx.extract takes text, not file paths)
with open("path/to/document.txt") as f:
    document_text = f.read()

result = lx.extract(
    text_or_documents=document_text,
    prompt_description="Extract key entities...",
    examples=examples,
    model_id="gemini-2.5-flash"
)

# Save the annotated results to JSONL, then render the interactive visualization
lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl", output_dir=".")

html_content = lx.visualize("extraction_results.jsonl")
with open("extraction_visualization.html", "w") as f:
    # In notebook environments, lx.visualize returns an HTML display object
    f.write(html_content.data if hasattr(html_content, "data") else html_content)

# The HTML file provides:
# - Color-coded entity highlighting directly in the source text
# - Extraction classes and attributes shown alongside each highlight
# - Controls for stepping through extractions in order

# lx.visualize returns plain HTML, so you can also post-process or
# restyle the markup before saving it
Healthcare Example: Clinical Note Extraction
The healthcare use case above deserves a closer look. Here's a domain-specific pipeline for clinical notes:
import langextract as lx
import textwrap
# Healthcare-specific extraction
prompt = textwrap.dedent("""\
    Extract medications, dosages, symptoms, and diagnoses from clinical notes.
    Include administration routes and frequencies where mentioned.
    Use exact medical terminology from the text.""")

examples = [
    lx.data.ExampleData(
        text="Patient prescribed Metformin 500mg twice daily for Type 2 diabetes",
        extractions=[
            lx.data.Extraction(
                extraction_class="medication",
                extraction_text="Metformin",
                attributes={"dosage": "500mg", "frequency": "twice daily"}
            ),
            lx.data.Extraction(
                extraction_class="diagnosis",
                extraction_text="Type 2 diabetes",
                attributes={"status": "ongoing_management"}
            ),
        ]
    )
]

clinical_note = """
Patient presents with chest pain and shortness of breath.
Prescribed Lisinopril 10mg once daily for hypertension.
Follow-up recommended in 2 weeks.
"""

result = lx.extract(
    text_or_documents=clinical_note,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash"
)

# Process results
medications = [e for e in result.extractions if e.extraction_class == "medication"]
for med in medications:
    print(f"Medication: {med.extraction_text}")
    print(f"Details: {med.attributes}")
Using Local Models with Ollama
For privacy-sensitive applications or when you need to process data offline, LangExtract supports running local models through Ollama integration:
# First, install and start Ollama
# brew install ollama   # macOS
# ollama serve          # Start the Ollama server

# Pull a model
# ollama pull llama2
# ollama pull mistral

import langextract as lx

# Read the sensitive document locally (lx.extract takes text, not file paths)
with open("sensitive_document.txt") as f:
    sensitive_text = f.read()

# Configure for a local model served by Ollama
result = lx.extract(
    text_or_documents=sensitive_text,
    prompt_description="Extract PII and sensitive information",
    examples=examples,
    model_id="llama2",                    # Local Llama 2 pulled via Ollama
    # model_id="mistral",                 # Or Mistral
    model_url="http://localhost:11434",   # Default Ollama endpoint
    fence_output=False,                   # Local models return raw, unfenced JSON
    use_schema_constraints=False,         # Schema constraints are not supported for Ollama
    extraction_passes=2,
    max_workers=5  # Adjust based on local resources
)
# Benefits of local models:
# - Complete data privacy - no data leaves your infrastructure
# - No API costs or rate limits
# - Consistent latency without network dependencies
# - Compliance with strict data residency requirements
# Trade-offs:
# - Typically lower accuracy than the latest cloud Gemini models
# - Requires local compute resources
# - Model management and updates are manual
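Before running a batch against a local model, it can help to verify that the Ollama server is actually reachable. A minimal sketch using the third-party requests package (an assumption, not a LangExtract dependency); the URL below is Ollama's default endpoint:
import requests

def ollama_available(base_url="http://localhost:11434"):
    """Return True if the local Ollama server responds."""
    try:
        return requests.get(base_url, timeout=2).status_code == 200
    except requests.RequestException:
        return False

if not ollama_available():
    raise RuntimeError("Ollama server not reachable; run `ollama serve` first")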
Performance Optimization
Optimize LangExtract performance for large-scale deployments with these techniques:
import langextract as lx
from concurrent.futures import ThreadPoolExecutor

# 1. Batch Processing for Multiple Documents
def batch_extract(documents, prompt, examples):
    """Process multiple documents in parallel"""
    with ThreadPoolExecutor(max_workers=10) as executor:
        futures = [
            executor.submit(
                lx.extract,
                text_or_documents=doc,
                prompt_description=prompt,
                examples=examples,
                model_id="gemini-2.5-flash",
                extraction_passes=2
            )
            for doc in documents
        ]
        return [f.result() for f in futures]
# 2. Optimize Chunk Size for Long Documents
with open("very_long_document.txt") as f:
    long_text = f.read()

result = lx.extract(
    text_or_documents=long_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    max_char_buffer=2000,   # Chunk size: smaller chunks favor accuracy, larger favor cost
    max_workers=20,         # Parallel chunk processing
    extraction_passes=3     # Multiple passes for completeness
)
# 3. Cache Results for Repeated Extractions
from functools import lru_cache

@lru_cache(maxsize=100)
def cached_extract(text, prompt, model_id):
    """Cache extraction results for repeated queries (arguments must be hashable)"""
    return lx.extract(
        text_or_documents=text,
        prompt_description=prompt,
        examples=examples,  # Module-level list; deliberately not part of the cache key
        model_id=model_id
    )
# 4. Model Selection by Task Complexity
def smart_model_selection(text_length, complexity):
    """Choose a model based on task requirements"""
    if text_length < 1000 and complexity == "simple":
        return "gemini-2.5-flash"  # Fast and cost-efficient
    elif complexity == "complex":
        return "gemini-2.5-pro"    # Strongest reasoning for complex extraction
    else:
        return "mistral"           # Local processing via Ollama (pass model_url as shown earlier)
# 5. Monitor and Log Performance Metrics
import time
import logging

logger = logging.getLogger(__name__)

def timed_extraction(text, prompt, examples):
    """Extract with performance monitoring"""
    start_time = time.time()
    result = lx.extract(
        text_or_documents=text,
        prompt_description=prompt,
        examples=examples,
        model_id="gemini-2.5-flash"
    )
    elapsed = time.time() - start_time
    words_processed = len(text.split())  # Rough throughput proxy, not a true token count
    logger.info(f"Extraction completed in {elapsed:.2f}s")
    logger.info(f"Words/sec: {words_processed/elapsed:.0f}")
    logger.info(f"Entities extracted: {len(result.extractions)}")
    return result
Best Practices
Here are key best practices for implementing LangExtract in production environments:
- Prompt Engineering: Invest time in crafting clear, specific prompts with high-quality examples that cover edge cases.
- Model Selection: Use gemini-2.5-flash for speed and cost efficiency, or gemini-2.5-pro for complex extraction tasks requiring advanced reasoning.
- Error Handling: Implement robust retry logic and validation to handle API failures and ensure extraction quality.
- Performance Optimization: Use multiple extraction passes and parallel processing for large documents while managing costs.
- Monitoring: Track extraction performance, costs, and quality metrics over time to identify areas for improvement.
import langextract as lx
from tenacity import retry, stop_after_attempt, wait_exponential
import logging

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10)
)
def robust_extraction(text, prompt, examples):
    """Production-ready extraction with retry logic and monitoring"""
    try:
        result = lx.extract(
            text_or_documents=text,
            prompt_description=prompt,
            examples=examples,
            model_id="gemini-2.5-flash",  # Recommended for speed and cost efficiency
            extraction_passes=2,          # Multiple passes for better recall
            max_workers=10                # Parallel processing
        )

        # Validate results
        if not result.extractions:
            logger.warning("No extractions found")
            raise ValueError("No extractions found")

        # Log extraction metrics
        logger.info(f"Extracted {len(result.extractions)} entities")

        # Save results for inspection and visualization
        lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl", output_dir=".")

        return result
    except Exception as e:
        logger.error(f"Extraction failed: {e}")
        raise
Conclusion
LangExtract represents a paradigm shift in information extraction, democratizing access to sophisticated NLP capabilities while maintaining the precision and traceability required for production applications. For AI developers, it offers an unprecedented combination of simplicity, power, and reliability that makes structured data extraction accessible and scalable across diverse domains and use cases.
The library's key advantages include requiring no training data (just 3-5 examples), precise source grounding for every extraction, support for 100+ languages, and compatibility with both cloud and local models. This makes it an ideal choice for organizations looking to extract valuable insights from unstructured data efficiently.
As you implement LangExtract in your projects, remember to focus on clear prompt engineering, choose the right model for your use case, and implement proper error handling and monitoring for production deployments. With these practices in place, you'll be able to transform your unstructured data into actionable insights at scale.
Further Reading
Additional resources to deepen your understanding of LangExtract:
Key Resources
- LangExtract GitHub repository (https://github.com/google/langextract): official repository with documentation, examples, and source code
- Google AI Studio (https://aistudio.google.com): get your Gemini API key and explore model capabilities
- LangExtract on PyPI (https://pypi.org/project/langextract/): install LangExtract and explore the Python package documentation