class TestBaseExtractor
Unit test class for BaseExtractor. It covers initialization, the unimplemented extract method, structure extraction, bounding box text retrieval, and confidence calculation.
/tf/active/vicechatdev/invoice_extraction/tests/test_extractors.py
17 - 112
moderate
Purpose
This test class validates the BaseExtractor class. It checks that the extractor stores the configuration it is initialized with, raises NotImplementedError from the base extract method, returns a document structure dictionary, retrieves text within a bounding box, and computes a confidence score that rises as more fields are extracted.
Source Code
class TestBaseExtractor(unittest.TestCase):
    """Test cases for the BaseExtractor class."""

    def setUp(self):
        """Set up test environment before each test."""
        self.config = {
            'confidence_threshold': 0.7
        }
        self.extractor = BaseExtractor(self.config)

        # Sample document for testing
        self.sample_doc = {
            'text': 'Invoice #12345\nIssue Date: 2023-01-15\nTotal: $500.00',
            'pages': [
                {
                    'text': 'Invoice #12345\nIssue Date: 2023-01-15',
                    'width': 800,
                    'height': 1000,
                    'tables': []
                },
                {
                    'text': 'Total: $500.00',
                    'width': 800,
                    'height': 1000,
                    'tables': []
                }
            ]
        }

    def test_init(self):
        """Test initialization of BaseExtractor."""
        self.assertEqual(self.extractor.config['confidence_threshold'], 0.7)
        self.assertIsInstance(self.extractor, BaseExtractor)

    def test_extract_not_implemented(self):
        """Test that the base extract method raises NotImplementedError."""
        with self.assertRaises(NotImplementedError):
            self.extractor.extract(self.sample_doc, 'en')

    def test_extract_structure(self):
        """Test extracting structure from document."""
        structure = self.extractor.extract_structure(self.sample_doc)

        # Check that structure is a dictionary
        self.assertIsInstance(structure, dict)

        # Default should be unstructured
        self.assertEqual(structure.get('is_structured', False), False)

    def test_get_text_in_bbox(self):
        """Test getting text inside a bounding box."""
        # Define a bounding box [x0, y0, x1, y1]
        bbox = [0, 0, 800, 1000]
        text, pages = self.extractor.get_text_in_bbox(self.sample_doc, bbox)

        # Should include text from first page
        self.assertIn('Invoice #12345', text)
        self.assertEqual(pages, [0])

    def test_calculate_confidence(self):
        """Test confidence calculation based on extraction completeness."""
        # Create a sample extraction result with various fields
        extraction_result = {
            'invoice': {
                'number': '12345',
                'issue_date': '2023-01-15',
                # missing due_date
            },
            'vendor': {
                'name': 'Test Vendor',
                # missing address
                'vat_number': '123456789'
            },
            'amounts': {
                'total': 500.00,
                # missing subtotal
                'tax': 100.00
            }
        }
        confidence = self.extractor.calculate_confidence(extraction_result)

        # Check that confidence is a float between 0 and 1
        self.assertIsInstance(confidence, float)
        self.assertGreaterEqual(confidence, 0.0)
        self.assertLessEqual(confidence, 1.0)

        # Higher confidence with more fields
        more_complete = {**extraction_result}
        more_complete['invoice']['due_date'] = '2023-02-15'
        more_complete['vendor']['address'] = '123 Test St'
        more_complete['amounts']['subtotal'] = 400.00
        higher_confidence = self.extractor.calculate_confidence(more_complete)
        self.assertGreater(higher_confidence, confidence)
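A note on test_calculate_confidence: more_complete is built with {**extraction_result}, which is a shallow copy, so the three assignments that follow also modify the nested dictionaries of extraction_result. The test still behaves as intended because the first confidence value is computed before the mutation. The sketch below uses illustrative variable names that are not part of the test file and contrasts the shallow copy with copy.deepcopy, which may be preferable if the original result is needed again later.

```python
import copy

original = {'invoice': {'number': '12345'}}

# Dict unpacking copies only the top level; nested dicts are shared.
shallow = {**original}
shallow['invoice']['due_date'] = '2023-02-15'
shallow_shares_nested = 'due_date' in original['invoice']

# deepcopy duplicates the nested dicts, leaving the source untouched.
source = {'invoice': {'number': '12345'}}
deep = copy.deepcopy(source)
deep['invoice']['due_date'] = '2023-02-15'
deep_shares_nested = 'due_date' in source['invoice']

print(shallow_shares_nested, deep_shares_nested)
```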
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| bases | unittest.TestCase | - | |
Parameter Details
bases: Inherits from unittest.TestCase, which provides the testing framework and assertion methods for writing and running unit tests
Return Value
As a test class, it does not return values directly. Each test method performs assertions that either pass or fail. The unittest framework collects results and reports test outcomes (pass/fail/error) when the test suite is executed.
Class Interface
Methods
setUp(self) -> None
Purpose: Initializes test environment before each test method runs, creating a BaseExtractor instance with configuration and sample document data
Returns: None - sets up instance attributes self.config, self.extractor, and self.sample_doc
test_init(self) -> None
Purpose: Tests that BaseExtractor initializes correctly with the provided configuration and is an instance of the correct class
Returns: None - performs assertions on initialization
test_extract_not_implemented(self) -> None
Purpose: Verifies that the base extract method raises NotImplementedError, ensuring subclasses must implement their own extraction logic
Returns: None - asserts that NotImplementedError is raised
test_extract_structure(self) -> None
Purpose: Tests the extract_structure method to ensure it returns a dictionary with proper structure information, defaulting to unstructured
Returns: None - asserts structure is a dict whose is_structured value is False (a missing key also passes, because the test reads it with a default of False)
test_get_text_in_bbox(self) -> None
Purpose: Tests the get_text_in_bbox method to verify it correctly extracts text within specified bounding box coordinates and returns the correct page numbers
Returns: None - asserts text content and page numbers are correct
test_calculate_confidence(self) -> None
Purpose: Tests the calculate_confidence method to ensure it returns a valid confidence score (0.0-1.0) and that more complete extractions yield higher confidence
Returns: None - asserts confidence is a float between 0 and 1, and increases with completeness
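The source of BaseExtractor is not shown on this page, so the following is only a sketch of one way a class could satisfy the extract and calculate_confidence assertions described above. SketchExtractor, EXPECTED_FIELDS, and the equal-weight scoring rule are all assumptions made for illustration; the real implementation may weigh fields differently.

```python
# Hypothetical field list; not taken from the project's BaseExtractor.
EXPECTED_FIELDS = {
    'invoice': ['number', 'issue_date', 'due_date'],
    'vendor': ['name', 'address', 'vat_number'],
    'amounts': ['total', 'subtotal', 'tax'],
}


class SketchExtractor:
    """Illustrative stand-in, not the project's BaseExtractor."""

    def __init__(self, config=None):
        self.config = config or {}

    def extract(self, document, language):
        # The base class offers no extraction logic; subclasses override this.
        raise NotImplementedError("Subclasses must implement extract()")

    def calculate_confidence(self, result):
        """Return the fraction of expected fields that have a value."""
        total = sum(len(fields) for fields in EXPECTED_FIELDS.values())
        found = sum(
            1
            for section, fields in EXPECTED_FIELDS.items()
            for field in fields
            if result.get(section, {}).get(field) not in (None, '')
        )
        return found / total


extractor = SketchExtractor({'confidence_threshold': 0.7})
partial = {
    'invoice': {'number': '12345', 'issue_date': '2023-01-15'},
    'vendor': {'name': 'Test Vendor', 'vat_number': '123456789'},
    'amounts': {'total': 500.00, 'tax': 100.00},
}
complete = {
    'invoice': {**partial['invoice'], 'due_date': '2023-02-15'},
    'vendor': {**partial['vendor'], 'address': '123 Test St'},
    'amounts': {**partial['amounts'], 'subtotal': 400.00},
}
low = extractor.calculate_confidence(partial)    # 6 of 9 fields present
high = extractor.calculate_confidence(complete)  # 9 of 9 fields present
print(low < high)
```

Under this assumed rule the score is always a float between 0 and 1 and grows with each additional field, which is the behaviour test_calculate_confidence checks.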
Attributes
| Name | Type | Description | Scope |
|---|---|---|---|
| config | dict | Configuration dictionary containing settings for the BaseExtractor, including confidence_threshold set to 0.7 | instance |
| extractor | BaseExtractor | Instance of BaseExtractor initialized with self.config, used as the system under test | instance |
| sample_doc | dict | Sample document structure containing text and pages with invoice data, used for testing extraction methods. Includes full document text and page-level details with dimensions and tables | instance |
Dependencies
- unittest
- unittest.mock
- os
- json
- logging
- pathlib
- datetime
Required Imports
import unittest
from unittest.mock import patch
from unittest.mock import MagicMock
import os
import json
import logging
from pathlib import Path
import datetime
from extractors.base_extractor import BaseExtractor
from extractors.uk_extractor import UKExtractor
from extractors.be_extractor import BEExtractor
from extractors.au_extractor import AUExtractor
Usage Example
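import unittest
# Assumption: the invoice_extraction directory is the working directory, so
# tests/test_extractors.py is importable as tests.test_extractors
from tests.test_extractors import TestBaseExtractor
# Run all tests in the class
test_suite = unittest.TestLoader().loadTestsFromTestCase(TestBaseExtractor)
unittest.TextTestRunner(verbosity=2).run(test_suite)
# Or run from the command line (test_module stands for the module path):
# python -m unittest test_module.TestBaseExtractor
# Run a specific test method:
# python -m unittest test_module.TestBaseExtractor.test_init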
# The setUp method runs before each test, creating:
# - A config dictionary with confidence_threshold
# - A BaseExtractor instance
# - A sample document structure for testing
Best Practices
- Each test method is independent and isolated, with setUp() creating fresh instances before each test
- Test methods follow the naming convention test_<feature_being_tested> for clarity
- Each test method has a descriptive docstring that explains what is being tested
- Sample data is created in setUp() to ensure consistency across tests
- Tests verify both positive cases (correct behavior) and negative cases (error handling)
- Assertions check multiple aspects: type checking, value checking, and comparative checking
- Run this class together with the rest of the extractor test suite so that regressions in BaseExtractor are caught alongside the other extractor tests
- When extending tests, maintain the pattern of testing one specific behavior per test method
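The isolation that setUp() provides can be seen in a small sketch. IsolationExample is a hypothetical test case written for this example: one test mutates the fixture, and the other still sees the original value because unittest calls setUp() again before it runs.

```python
import io
import unittest


class IsolationExample(unittest.TestCase):
    """Hypothetical test case showing that setUp rebuilds fixtures per test."""

    def setUp(self):
        # Runs before every test method, so each test gets a new list.
        self.items = ['a']

    def test_first_mutates(self):
        self.items.append('b')
        self.assertEqual(self.items, ['a', 'b'])

    def test_second_sees_fresh_fixture(self):
        # Passes regardless of test order because setUp ran again.
        self.assertEqual(self.items, ['a'])


suite = unittest.TestLoader().loadTestsFromTestCase(IsolationExample)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
print(result.wasSuccessful())
```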
Similar Components
AI-powered semantic similarity - components with related functionality:
- class TestBaseValidator 77.7% similar
- function test_document_extractor 69.6% similar
- class BaseExtractor 67.7% similar
- class TestBEExtractor 66.7% similar
- class TestUKExtractor 62.3% similar