BaseValidator - Code Extractor

class BaseValidator

Maturity: 52

Abstract base class for validating extracted invoice data with entity-specific validation rules. Provides common validation functionality for required fields, field types, date consistency, and amount calculations.

File:
/tf/active/vicechatdev/invoice_extraction/validators/base_validator.py

Lines:
62 - 320

Complexity:
moderate

Purpose

BaseValidator serves as the foundation for entity-specific invoice validators (UK, BE, AU). It implements common validation logic including required field checks, type validation, date consistency verification, and amount calculations. Subclasses must implement entity-specific validation rules through the _entity_specific_validation abstract method. The class uses a configurable approach with customizable required fields, field types, date formats, and numeric thresholds.

Source Code

class BaseValidator(ABC):
    """
    Abstract base class for entity-specific invoice data validators.
    
    This class defines the common interface and shared validation
    functionality for all entity validators (UK, BE, AU).
    """
    
    def __init__(self, config=None):
        """
        Initialize the validator with configuration.
        
        Args:
            config: Dictionary containing configuration parameters
        """
        self.config = config or {}
        
        # Configure required fields and validation strictness
        self.required_fields = self.config.get('required_fields', {})
        if not self.required_fields:
            # Default required fields if not configured
            self.required_fields = {
                'invoice.number': 'critical',
                'invoice.issue_date': 'critical',
                'vendor.name': 'critical',
                'amounts.total': 'critical'
            }
            
        # Field type definitions for validation
        self.field_types = {
            'invoice.number': 'string',
            'invoice.issue_date': 'date',
            'invoice.due_date': 'date',
            'invoice.po_number': 'string',
            'vendor.name': 'string',
            'vendor.vat_number': 'string',
            'vendor.address': 'string',
            'amounts.subtotal': 'decimal',
            'amounts.total': 'decimal',
            'amounts.tax': 'decimal',
            'amounts.vat': 'decimal',
            'amounts.vat_rate': 'decimal',
            'amounts.currency': 'string'
        }
        
        # Date format for validation
        self.date_format = self.config.get('date_format', '%Y-%m-%d')
        
        # Customizable threshold for numeric comparison
        self.numeric_threshold = self.config.get('numeric_threshold', 0.01)
    
    def validate(self, extraction_result: Dict[str, Any]) -> ValidationResult:
        """
        Validate extracted invoice data.
        
        Args:
            extraction_result: Dictionary of extracted invoice fields
            
        Returns:
            ValidationResult object with issues and validity status
        """
        logger.info("Validating extracted invoice data")
        
        # Create new validation result
        result = ValidationResult()
        
        # Perform base validations
        self._validate_required_fields(extraction_result, result)
        self._validate_field_types(extraction_result, result)
        self._validate_date_consistency(extraction_result, result)
        self._validate_amount_consistency(extraction_result, result)
        
        # Entity-specific validation
        self._entity_specific_validation(extraction_result, result)
        
        # Log validation results
        if not result.is_valid:
            logger.warning(f"Validation failed with {len(result.issues)} issues")
            for issue in result.issues:
                logger.warning(f"  - {issue}")
        else:
            logger.info("Validation passed successfully")
            if result.warnings:
                logger.info(f"Validation passed with {len(result.warnings)} warnings")
                
        return result
    
    @abstractmethod
    def _entity_specific_validation(self, extraction_result: Dict[str, Any], 
                                   result: ValidationResult) -> None:
        """
        Perform entity-specific validation.
        
        Each entity validator must implement this method to check
        specific rules for that entity.
        
        Args:
            extraction_result: Dictionary of extracted invoice fields
            result: ValidationResult to add issues to
        """
        pass
    
    def _validate_required_fields(self, extraction_result: Dict[str, Any], 
                                 result: ValidationResult) -> None:
        """
        Validate that all required fields are present.
        
        Args:
            extraction_result: Dictionary of extracted invoice fields
            result: ValidationResult to add issues to
        """
        for field_path, importance in self.required_fields.items():
            # Skip validation fields
            if field_path.startswith('validation.'):
                continue
                
            # Get the field value using the path
            value = self._get_nested_field(extraction_result, field_path)
            
            if value is None or (isinstance(value, str) and not value.strip()):
                severity = 'error' if importance == 'critical' else 'warning'
                result.add_issue(field_path, "Required field is missing or empty", severity)
    
    def _validate_field_types(self, extraction_result: Dict[str, Any], 
                             result: ValidationResult) -> None:
        """
        Validate field types (string, number, date).
        
        Args:
            extraction_result: Dictionary of extracted invoice fields
            result: ValidationResult to add issues to
        """
        for field_path, expected_type in self.field_types.items():
            value = self._get_nested_field(extraction_result, field_path)
            
            # Skip None values (these are handled by required field validation)
            if value is None:
                continue
                
            if expected_type == 'string':
                if not isinstance(value, str):
                    result.add_issue(
                        field_path, 
                        f"Expected string, got {type(value).__name__}",
                        'warning'
                    )
            elif expected_type == 'decimal':
                if not isinstance(value, (int, float)):
                    try:
                        # Try to convert to float
                        float(value)
                    except (ValueError, TypeError):
                        result.add_issue(
                            field_path, 
                            f"Expected decimal number, got {type(value).__name__}",
                            'warning'
                        )
            elif expected_type == 'date':
                if isinstance(value, str):
                    try:
                        # Try to parse the date
                        datetime.strptime(value, self.date_format)
                    except ValueError:
                        result.add_issue(
                            field_path, 
                            f"Invalid date format, expected {self.date_format}",
                            'warning'
                        )
                else:
                    result.add_issue(
                        field_path,
                        f"Expected date string, got {type(value).__name__}",
                        'warning'
                    )
    
    def _validate_date_consistency(self, extraction_result: Dict[str, Any], 
                                  result: ValidationResult) -> None:
        """
        Validate date consistency (issue date before due date).
        
        Args:
            extraction_result: Dictionary of extracted invoice fields
            result: ValidationResult to add issues to
        """
        issue_date = self._get_nested_field(extraction_result, 'invoice.issue_date')
        due_date = self._get_nested_field(extraction_result, 'invoice.due_date')
        
        if issue_date and due_date:
            try:
                issue_date_obj = datetime.strptime(issue_date, self.date_format)
                due_date_obj = datetime.strptime(due_date, self.date_format)
                
                if issue_date_obj > due_date_obj:
                    result.add_issue(
                        'invoice.due_date',
                        f"Due date ({due_date}) is before issue date ({issue_date})",
                        'error'
                    )
            except ValueError:
                # Date format issues are reported by _validate_field_types
                pass
    
    def _validate_amount_consistency(self, extraction_result: Dict[str, Any], 
                                    result: ValidationResult) -> None:
        """
        Validate numeric amounts consistency.
        
        Args:
            extraction_result: Dictionary of extracted invoice fields
            result: ValidationResult to add issues to
        """
        subtotal = self._get_nested_field(extraction_result, 'amounts.subtotal')
        tax = self._get_nested_field(extraction_result, 'amounts.tax') or \
              self._get_nested_field(extraction_result, 'amounts.vat')
        total = self._get_nested_field(extraction_result, 'amounts.total')
        
        # Check if we have all values for comparison
        if subtotal is not None and tax is not None and total is not None:
            try:
                subtotal_val = float(subtotal)
                tax_val = float(tax)
                total_val = float(total)
                
                # Calculate expected total
                expected_total = subtotal_val + tax_val
                
                # Compare with threshold
                if abs(expected_total - total_val) > self.numeric_threshold:
                    result.add_issue(
                        'amounts.total',
                        f"Amount inconsistency: subtotal ({subtotal_val}) + tax ({tax_val}) " +
                        f"≠ total ({total_val})",
                        'warning'
                    )
            except (ValueError, TypeError):
                # Type conversion issues are reported by _validate_field_types
                pass
    
    def _get_nested_field(self, data: Dict[str, Any], field_path: str) -> Any:
        """
        Get a nested field value using a dot-notation path.
        
        Args:
            data: The data dictionary
            field_path: Dot-notation path to the field
            
        Returns:
            Field value or None if not found
        """
        keys = field_path.split('.')
        value = data
        
        for key in keys:
            if isinstance(value, dict) and key in value:
                value = value[key]
            else:
                return None
                
        return value

Parameters

Name	Type	Default	Kind
`bases`	ABC	-

Parameter Details

config: Optional dictionary containing configuration parameters. Supports 'required_fields' (dict mapping field paths to importance levels: 'critical' or other), 'date_format' (string, default '%Y-%m-%d'), and 'numeric_threshold' (float, default 0.01 for amount comparison tolerance). If not provided or empty, uses sensible defaults.

Return Value

Instantiation returns a BaseValidator instance (or subclass instance). The main validate() method returns a ValidationResult object containing validation issues, warnings, and an is_valid boolean status. Helper methods return None (they modify the ValidationResult in-place) or specific values (_get_nested_field returns Any or None).

Class Interface

Methods

`init(self, config=None)`

Purpose: Initialize the validator with configuration parameters and set up default field definitions

Parameters:

config: Optional dictionary with 'required_fields', 'date_format', and 'numeric_threshold' keys

Returns: None (constructor)

`validate(self, extraction_result: Dict[str, Any]) -> ValidationResult`

Purpose: Main validation method that orchestrates all validation checks and returns comprehensive results

Parameters:

extraction_result: Dictionary containing extracted invoice fields with nested structure (e.g., {'invoice': {...}, 'vendor': {...}, 'amounts': {...}})

Returns: ValidationResult object with issues list, warnings list, and is_valid boolean status

`_entity_specific_validation(self, extraction_result: Dict[str, Any], result: ValidationResult) -> None`

Purpose: Abstract method that subclasses must implement to perform entity-specific validation rules

Parameters:

extraction_result: Dictionary of extracted invoice fields
result: ValidationResult object to add issues to (modified in-place)