Data Extraction Module¶
This module provides a flexible framework for extracting structured data from text using language models. It supports both basic and augmented data extraction capabilities.
Overview¶
The module consists of two main components:
- Basic Data Extraction: Uses language models to extract structured data from text based on a provided schema.
- Augmented Data Extraction: Extends basic extraction by adding extra fields to the output schema while maintaining the original data structure.
Components¶
Base Classes¶
BaseDataExtractor: Abstract base class that defines the interface for data extractionBaseDataExtractionResult: Generic base class for extraction results
Basic Extraction¶
StructuredOutputDataExtractor: Implements basic data extraction using language modelsStructuredOutputDataExtractorConfig: Configuration for the basic extractor
Augmented Extraction¶
AugmentedDataExtractor: Extends basic extraction with additional fieldsAugmentedDataExtractionResult: Result type for augmented extraction
Usage Examples¶
Basic Data Extraction¶
#data-extraction-basic-usage
from pydantic import BaseModel
from unique_toolkit.data_extraction import StructuredOutputDataExtractor, StructuredOutputDataExtractorConfig
from unique_toolkit import LanguageModelService
# Define your schema
class PersonInfo(BaseModel):
name: str
age: int
occupation: str
# Create the extractor
config = StructuredOutputDataExtractorConfig()
lm_service = LanguageModelService() # Configure as needed
extractor = StructuredOutputDataExtractor(config, lm_service)
# Extract data
text = "John is 30 years old and works as a software engineer."
result = await extractor.extract_data_from_text(text, PersonInfo)
print(result.data) # PersonInfo(name="John", age=30, occupation="software engineer")
Augmented Data Extraction¶
#data-extraction-augmented-usage
from pydantic import BaseModel, Field
from _common.data_extraction import AugmentedDataExtractor, StructuredOutputDataExtractor
# Define your base schema
class PersonInfo(BaseModel):
name: str
age: int
# Create base extractor
base_extractor = StructuredOutputDataExtractor(...)
# Create augmented extractor with confidence scores
augmented_extractor = AugmentedDataExtractor(
base_extractor,
confidence=float,
source=("extracted", Field(description="Source of the information"))
)
# Extract data
text = "John is 30 years old."
result = await augmented_extractor.extract_data_from_text(text, PersonInfo)
print(result.data) # Original PersonInfo
print(result.augmented_data) # Contains additional fields
Configuration¶
The StructuredOutputDataExtractorConfig allows customization of:
- Language model selection
- System and user prompt templates
- Schema enforcement settings
Best Practices¶
- Always define clear Pydantic models for your extraction schemas
- Use augmented extraction when you need additional metadata
- Consider using strict mode for augmented extraction when you want to enforce schema compliance
- Customize prompts for better extraction results in specific domains