Data Extraction Module¶
This module provides a flexible framework for extracting structured data from text using language models. It supports both basic and augmented data extraction capabilities.
Overview¶
The module consists of two main components:
- Basic Data Extraction: Uses language models to extract structured data from text based on a provided schema.
- Augmented Data Extraction: Extends basic extraction by adding extra fields to the output schema while maintaining the original data structure.
Components¶
Base Classes¶
- BaseDataExtractor: Abstract base class that defines the interface for data extraction
- BaseDataExtractionResult: Generic base class for extraction results
Basic Extraction¶
- StructuredOutputDataExtractor: Implements basic data extraction using language models
- StructuredOutputDataExtractorConfig: Configuration for the basic extractor
Augmented Extraction¶
- AugmentedDataExtractor: Extends basic extraction with additional fields
- AugmentedDataExtractionResult: Result type for augmented extraction
Usage Examples¶
Basic Data Extraction¶
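As a minimal sketch of schema-driven extraction, the example below asks a model for JSON and validates the reply against a Pydantic model. It does not use this module's classes, whose exact signatures are not shown here: `Person`, `extract`, and `fake_llm` are hypothetical names, and `fake_llm` is a stub standing in for a real language model call so the example runs offline.

```python
import json
from typing import Callable, Type, TypeVar

from pydantic import BaseModel

T = TypeVar("T", bound=BaseModel)


class Person(BaseModel):
    """Target schema: the fields we want pulled out of the text."""
    name: str
    age: int


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a language model call; returns JSON text."""
    return json.dumps({"name": "Ada Lovelace", "age": 36})


def extract(text: str, schema: Type[T], llm: Callable[[str], str]) -> T:
    """Ask the model for JSON matching the schema, then validate the reply."""
    field_names = ", ".join(schema.__annotations__)
    prompt = f"Extract the fields [{field_names}] as JSON from:\n{text}"
    raw = llm(prompt)
    # Constructing the model raises pydantic.ValidationError if the reply
    # is missing a field or has a value of the wrong type.
    return schema(**json.loads(raw))


person = extract("Ada Lovelace died at the age of 36.", Person, fake_llm)
print(person.name, person.age)
```

Validating the reply through the schema, rather than trusting the raw JSON, is what turns free-form model output into a typed result the rest of the program can rely on.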
Augmented Data Extraction¶
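The idea of adding fields while keeping the original structure can be sketched with Pydantic's `create_model`, which builds a subclass of an existing schema. This is an illustration of the technique, not the implementation of AugmentedDataExtractor: `Invoice`, `augment`, the extra field names, and the sample reply are all hypothetical.

```python
from typing import Type

from pydantic import BaseModel, create_model


class Invoice(BaseModel):
    """Original schema supplied by the caller."""
    number: str
    total: float


def augment(schema: Type[BaseModel], **extra_fields: type) -> Type[BaseModel]:
    """Build a subclass of `schema` with additional required fields.

    Each keyword argument maps a new field name to its type.
    """
    definitions = {name: (tp, ...) for name, tp in extra_fields.items()}
    return create_model(
        f"Augmented{schema.__name__}", __base__=schema, **definitions
    )


AugmentedInvoice = augment(Invoice, confidence=float, source_quote=str)

# A hypothetical model reply that fills both the original and the extra fields.
reply = {
    "number": "INV-7",
    "total": 120.5,
    "confidence": 0.9,
    "source_quote": "Invoice INV-7, total 120.50",
}
result = AugmentedInvoice(**reply)

# The original structure is still recoverable from the augmented result.
original = Invoice(number=result.number, total=result.total)
print(original.number, result.confidence)
```

Because the augmented model subclasses the original one, code written against `Invoice` keeps working on augmented results, and the extra fields can be dropped when only the original data is needed.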
Configuration¶
The StructuredOutputDataExtractorConfig allows customization of:
- Language model selection
- System and user prompt templates
- Schema enforcement settings
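To make the three kinds of settings concrete, here is a hypothetical configuration object written as a plain dataclass. The field names (`model_name`, `system_prompt`, `user_prompt_template`, `strict`) are assumptions chosen for illustration and may not match the real fields of StructuredOutputDataExtractorConfig.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ExtractorConfigSketch:
    """Hypothetical stand-in for an extractor configuration."""

    # Language model selection
    model_name: str = "example-model"
    # System and user prompt templates
    system_prompt: str = "You extract structured data and reply with JSON only."
    user_prompt_template: str = "Schema fields: {fields}\n\nText:\n{text}"
    # Schema enforcement setting
    strict: bool = True

    def render_user_prompt(self, fields: List[str], text: str) -> str:
        """Fill the user prompt template with the schema fields and input text."""
        return self.user_prompt_template.format(
            fields=", ".join(fields), text=text
        )


config = ExtractorConfigSketch(model_name="my-domain-model")
prompt = config.render_user_prompt(["name", "age"], "Ada Lovelace died at 36.")
print(prompt)
```

Keeping prompts as templates in the configuration, rather than hard-coding them in the extractor, is what makes domain-specific prompt customization a configuration change instead of a code change.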
Best Practices¶
- Always define clear Pydantic models for your extraction schemas
- Use augmented extraction when you need additional metadata
- Consider using strict mode for augmented extraction when you want to enforce schema compliance
- Customize prompts for better extraction results in specific domains
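What strict mode does inside this module is not spelled out here. One plausible reading of schema enforcement, shown below as an assumption, is rejecting any reply that contains fields the schema does not declare. The sketch uses Pydantic's `extra="forbid"` setting; `Ticket` and `validate_reply` are hypothetical names.

```python
from pydantic import BaseModel, ValidationError


class Ticket(BaseModel, extra="forbid"):
    """Schema that rejects any field it does not declare."""
    title: str
    priority: int


def validate_reply(reply: dict) -> str:
    """Return "ok" if the reply fits the schema exactly, else "rejected"."""
    try:
        Ticket(**reply)
        return "ok"
    except ValidationError:
        return "rejected"


print(validate_reply({"title": "Login fails", "priority": 1}))
# prints "ok"
print(validate_reply({"title": "Login fails", "priority": 1, "mood": "angry"}))
# prints "rejected"
```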