class TestBaseExtractor

Maturity: 48

Unit test class for BaseExtractor. It covers initialization, the abstract extract method, structure extraction, bounding box text retrieval, and confidence calculation.

File: /tf/active/vicechatdev/invoice_extraction/tests/test_extractors.py
Lines: 17 - 112
Complexity: moderate

Purpose

This test class validates the behavior of the BaseExtractor class. It checks that the extractor stores the configuration it is given, that the base extract method raises NotImplementedError, that extract_structure returns a dictionary, that get_text_in_bbox returns text and page indices for a bounding box, and that calculate_confidence returns a score between 0 and 1 that increases as more fields are extracted. It serves as a quality assurance component for the document extraction system.

Source Code

class TestBaseExtractor(unittest.TestCase):
    """Test cases for the BaseExtractor class."""
    
    def setUp(self):
        """Set up test environment before each test."""
        self.config = {
            'confidence_threshold': 0.7
        }
        self.extractor = BaseExtractor(self.config)
        
        # Sample document for testing
        self.sample_doc = {
            'text': 'Invoice #12345\nIssue Date: 2023-01-15\nTotal: $500.00',
            'pages': [
                {
                    'text': 'Invoice #12345\nIssue Date: 2023-01-15',
                    'width': 800,
                    'height': 1000,
                    'tables': []
                },
                {
                    'text': 'Total: $500.00',
                    'width': 800,
                    'height': 1000,
                    'tables': []
                }
            ]
        }
        
    def test_init(self):
        """Test initialization of BaseExtractor."""
        self.assertEqual(self.extractor.config['confidence_threshold'], 0.7)
        self.assertIsInstance(self.extractor, BaseExtractor)
    
    def test_extract_not_implemented(self):
        """Test that the base extract method raises NotImplementedError."""
        with self.assertRaises(NotImplementedError):
            self.extractor.extract(self.sample_doc, 'en')
    
    def test_extract_structure(self):
        """Test extracting structure from document."""
        structure = self.extractor.extract_structure(self.sample_doc)
        
        # Check that structure is a dictionary
        self.assertIsInstance(structure, dict)
        
        # Default should be unstructured
        self.assertEqual(structure.get('is_structured', False), False)
    
    def test_get_text_in_bbox(self):
        """Test getting text inside a bounding box."""
        # Define a bounding box [x0, y0, x1, y1]
        bbox = [0, 0, 800, 1000]
        
        text, pages = self.extractor.get_text_in_bbox(self.sample_doc, bbox)
        
        # Should include text from first page
        self.assertIn('Invoice #12345', text)
        self.assertEqual(pages, [0])
    
    def test_calculate_confidence(self):
        """Test confidence calculation based on extraction completeness."""
        # Create a sample extraction result with various fields
        extraction_result = {
            'invoice': {
                'number': '12345',
                'issue_date': '2023-01-15',
                # missing due_date
            },
            'vendor': {
                'name': 'Test Vendor',
                # missing address
                'vat_number': '123456789'
            },
            'amounts': {
                'total': 500.00,
                # missing subtotal
                'tax': 100.00
            }
        }
        
        confidence = self.extractor.calculate_confidence(extraction_result)
        
        # Check that confidence is a float between 0 and 1
        self.assertIsInstance(confidence, float)
        self.assertGreaterEqual(confidence, 0.0)
        self.assertLessEqual(confidence, 1.0)
        
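        # Higher confidence with more fields
        # Note: {**extraction_result} is a shallow copy, so the nested dicts are
        # shared with extraction_result. This is safe here only because
        # `confidence` was already computed above.
        more_complete = {**extraction_result}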
        more_complete['invoice']['due_date'] = '2023-02-15'
        more_complete['vendor']['address'] = '123 Test St'
        more_complete['amounts']['subtotal'] = 400.00
        
        higher_confidence = self.extractor.calculate_confidence(more_complete)
        self.assertGreater(higher_confidence, confidence)

Parameters

Name  | Type              | Default | Kind
bases | unittest.TestCase | -       |

Parameter Details

bases: Inherits from unittest.TestCase, which provides the testing framework and assertion methods for writing and running unit tests

Return Value

As a test class, it does not return values directly. Each test method performs assertions that either pass or fail. The unittest framework collects results and reports test outcomes (pass/fail/error) when the test suite is executed.
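
To make the reporting concrete, here is a minimal sketch that runs a small stand-in test case and inspects the TestResult object returned by the runner. DemoTest is a hypothetical placeholder, not part of the project; the same attributes should be available when the suite is built from TestBaseExtractor.

```python
import io
import unittest


def run_demo():
    """Run a tiny, hypothetical test case and return its TestResult."""

    class DemoTest(unittest.TestCase):
        def test_passes(self):
            self.assertEqual(1 + 1, 2)

        def test_fails(self):
            # Deliberately wrong, to show how a failure is recorded.
            self.assertEqual(1 + 1, 3)

    suite = unittest.TestLoader().loadTestsFromTestCase(DemoTest)
    # Write the runner's report to a buffer instead of stderr.
    runner = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0)
    return runner.run(suite)


result = run_demo()
print("tests run:", result.testsRun)
print("failures:", len(result.failures))
print("errors:", len(result.errors))
print("successful:", result.wasSuccessful())
```

A failed assertion is recorded in result.failures, while an unexpected exception inside a test is recorded in result.errors.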

Class Interface

Methods

setUp(self) -> None

Purpose: Initializes test environment before each test method runs, creating a BaseExtractor instance with configuration and sample document data

Returns: None - sets up instance attributes self.config, self.extractor, and self.sample_doc
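
The isolation that setUp provides can be illustrated with a small sketch. FixtureDemo and its doc attribute are hypothetical stand-ins for TestBaseExtractor and self.sample_doc; the point is only that setUp runs once per test method, so a mutation made in one test is not visible in the next.

```python
import io
import unittest


def run_fixture_demo():
    """Return (number of setUp calls, whether all tests passed)."""
    setup_calls = []

    class FixtureDemo(unittest.TestCase):
        def setUp(self):
            # Runs before every test method, rebuilding the fixture.
            setup_calls.append(self.id())
            self.doc = {'pages': []}

        def test_a_mutates_fixture(self):
            self.doc['pages'].append({'text': 'Invoice #12345'})
            self.assertEqual(len(self.doc['pages']), 1)

        def test_b_sees_fresh_fixture(self):
            # The mutation made in test_a is not visible here.
            self.assertEqual(self.doc['pages'], [])

    suite = unittest.TestLoader().loadTestsFromTestCase(FixtureDemo)
    result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
    return len(setup_calls), result.wasSuccessful()


setup_count, all_passed = run_fixture_demo()
print(setup_count, all_passed)
```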

test_init(self) -> None

Purpose: Tests that BaseExtractor initializes correctly with the provided configuration and is an instance of the correct class

Returns: None - performs assertions on initialization

test_extract_not_implemented(self) -> None

Purpose: Verifies that the base extract method raises NotImplementedError, ensuring subclasses must implement their own extraction logic

Returns: None - asserts that NotImplementedError is raised
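
The pattern this test guards can be sketched as follows. SketchBaseExtractor and SketchUKExtractor are hypothetical classes written for illustration; the real BaseExtractor and its subclasses may be structured differently.

```python
class SketchBaseExtractor:
    """Hypothetical stand-in for a base class with an abstract extract()."""

    def __init__(self, config=None):
        self.config = config or {}

    def extract(self, document, language):
        # The base class defines the interface but no behavior.
        raise NotImplementedError("Subclasses must implement extract()")


class SketchUKExtractor(SketchBaseExtractor):
    """Hypothetical subclass that supplies its own extraction logic."""

    def extract(self, document, language):
        return {'raw_text': document['text'], 'language': language}


doc = {'text': 'Invoice #12345'}

try:
    SketchBaseExtractor({'confidence_threshold': 0.7}).extract(doc, 'en')
    base_raised = False
except NotImplementedError:
    base_raised = True

subclass_result = SketchUKExtractor().extract(doc, 'en')
print(base_raised, subclass_result)
```

Raising NotImplementedError in the base method means a subclass that forgets to override extract fails loudly on first use rather than silently returning nothing.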

test_extract_structure(self) -> None

Purpose: Tests that extract_structure returns a dictionary and that the sample document is reported as unstructured

Returns: None - asserts structure is a dict and that structure.get('is_structured', False) equals False, so a missing is_structured key also passes
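
Because the test reads the flag with structure.get('is_structured', False), it cannot distinguish a missing key from an explicit False. The sketch below contrasts that lenient check with a stricter alternative; both helper functions are hypothetical and exist only for illustration.

```python
def passes_lenient_check(structure):
    """Same comparison the test makes: a missing key counts as unstructured."""
    return structure.get('is_structured', False) == False  # noqa: E712


def passes_strict_check(structure):
    """Stricter variant: the key must be present and explicitly False."""
    return 'is_structured' in structure and structure['is_structured'] is False


for candidate in ({}, {'is_structured': False}, {'is_structured': True}):
    print(candidate, passes_lenient_check(candidate), passes_strict_check(candidate))
```

If extract_structure is expected to always report the flag, the stricter form would catch an implementation that returns an empty dictionary.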

test_get_text_in_bbox(self) -> None

Purpose: Tests that get_text_in_bbox returns the text inside the given bounding box together with the indices of the pages it came from

Returns: None - asserts the returned text contains 'Invoice #12345' and the returned page list equals [0]

test_calculate_confidence(self) -> None

Purpose: Tests the calculate_confidence method to ensure it returns a valid confidence score (0.0-1.0) and that more complete extractions yield higher confidence

Returns: None - asserts confidence is a float between 0 and 1, and increases with completeness
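
One way a completeness-based score could behave is sketched below. This is not the actual calculate_confidence implementation, which is not shown here; EXPECTED_FIELDS and completeness_confidence are assumptions, with the field names taken from the test data and its "missing" comments.

```python
# Hypothetical list of fields a complete extraction would contain.
EXPECTED_FIELDS = {
    'invoice': ['number', 'issue_date', 'due_date'],
    'vendor': ['name', 'address', 'vat_number'],
    'amounts': ['total', 'subtotal', 'tax'],
}


def completeness_confidence(result, expected=EXPECTED_FIELDS):
    """Return the fraction of expected fields that are present and non-empty."""
    total = sum(len(fields) for fields in expected.values())
    found = sum(
        1
        for section, fields in expected.items()
        for field in fields
        if result.get(section, {}).get(field) not in (None, '')
    )
    return found / total if total else 0.0


partial = {
    'invoice': {'number': '12345', 'issue_date': '2023-01-15'},
    'vendor': {'name': 'Test Vendor', 'vat_number': '123456789'},
    'amounts': {'total': 500.00, 'tax': 100.00},
}
complete = {
    'invoice': {**partial['invoice'], 'due_date': '2023-02-15'},
    'vendor': {**partial['vendor'], 'address': '123 Test St'},
    'amounts': {**partial['amounts'], 'subtotal': 400.00},
}

partial_score = completeness_confidence(partial)
complete_score = completeness_confidence(complete)
print(partial_score, complete_score)
```

Under this sketch the partial result has 6 of 9 fields and the complete one has all 9, which satisfies the same two properties the test asserts: a float in the range 0 to 1, and a higher score for the more complete result.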

Attributes

Name | Type | Description | Scope
config | dict | Configuration dictionary containing settings for the BaseExtractor, including confidence_threshold set to 0.7 | instance
extractor | BaseExtractor | Instance of BaseExtractor initialized with self.config, used as the system under test | instance
sample_doc | dict | Sample document structure containing text and pages with invoice data, used for testing extraction methods. Includes full document text and page-level details with dimensions and tables | instance

Dependencies

  • unittest
  • unittest.mock
  • os
  • json
  • logging
  • pathlib
  • datetime

Required Imports

import unittest
from unittest.mock import patch
from unittest.mock import MagicMock
import os
import json
import logging
from pathlib import Path
import datetime
from extractors.base_extractor import BaseExtractor
from extractors.uk_extractor import UKExtractor
from extractors.be_extractor import BEExtractor
from extractors.au_extractor import AUExtractor

Usage Example

import unittest

# Assumes the project root (invoice_extraction) is the working directory so
# that the tests directory is importable; adjust the module path to your layout.
from tests.test_extractors import TestBaseExtractor

# Run every test method in the class
test_suite = unittest.TestLoader().loadTestsFromTestCase(TestBaseExtractor)
unittest.TextTestRunner(verbosity=2).run(test_suite)

# Or run the class from the command line (same path assumption):
# python -m unittest tests.test_extractors.TestBaseExtractor

# Run a specific test method:
# python -m unittest tests.test_extractors.TestBaseExtractor.test_init

# The setUp method runs before each test, creating:
# - A config dictionary with confidence_threshold
# - A BaseExtractor instance
# - A sample document structure for testing

Best Practices

  • Each test method is independent and isolated, with setUp() creating fresh instances before each test
  • Test methods follow the naming convention test_<feature_being_tested> for clarity
  • Each test method has a descriptive docstring that explains what is being tested
  • Sample data is created in setUp() to ensure consistency across tests
  • Tests verify both positive cases (correct behavior) and negative cases (error handling)
  • Assertions check multiple aspects: type checking, value checking, and comparative checking
  • Run the test class together with the other test classes in test_extractors.py so that regressions in BaseExtractor are caught alongside its subclasses
  • When extending tests, maintain the pattern of testing one specific behavior per test method
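
One caveat when extending test_calculate_confidence: the expression {**extraction_result} makes a shallow copy, so the nested dictionaries are shared and adding fields to more_complete also changes extraction_result. The existing test still passes because the first confidence value is computed before the mutation. A minimal sketch of the difference, using copy.deepcopy from the standard library:

```python
import copy

# Shallow copy: the top-level dict is new, but nested dicts are shared.
original = {'invoice': {'number': '12345'}}
shallow = {**original}
shallow['invoice']['due_date'] = '2023-02-15'
shallow_leaked = 'due_date' in original['invoice']

# Deep copy: nested dicts are duplicated, so the source stays untouched.
source = {'invoice': {'number': '12345'}}
deep = copy.deepcopy(source)
deep['invoice']['due_date'] = '2023-02-15'
deep_leaked = 'due_date' in source['invoice']

print(shallow_leaked, deep_leaked)
```

If a later assertion needs the original, less complete result again, building more_complete with copy.deepcopy avoids the shared state.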

Similar Components

AI-powered semantic similarity - components with related functionality:

  • class TestBaseValidator 77.7% similar

    Unit test class for testing the BaseValidator class functionality, including validation of extraction results, field types, date consistency, amount consistency, and entity-specific validation rules.

    From: /tf/active/vicechatdev/invoice_extraction/tests/test_validators.py
  • function test_document_extractor 69.6% similar

    A test function that validates the DocumentExtractor class by testing file type support detection, text extraction from various document formats, and error handling.

    From: /tf/active/vicechatdev/leexi/test_document_extractor.py
  • class BaseExtractor 67.7% similar

    Abstract base class that defines the interface and shared functionality for entity-specific invoice data extractors (UK, BE, AU), providing a multi-stage extraction pipeline for invoice processing.

    From: /tf/active/vicechatdev/invoice_extraction/extractors/base_extractor.py
  • class TestBEExtractor 66.7% similar

    Unit test class for testing the BEExtractor class, which extracts structured data from Belgian invoices using LLM-based extraction.

    From: /tf/active/vicechatdev/invoice_extraction/tests/test_extractors.py
  • class TestUKExtractor 62.3% similar

    Unit test class for testing the UKExtractor class, which extracts structured data from UK invoices including VAT numbers, dates, amounts, and line items.

    From: /tf/active/vicechatdev/invoice_extraction/tests/test_extractors.py