TestBaseExtractor - Code Extractor

class TestBaseExtractor

Maturity: 48

Unit test class for the BaseExtractor class. It covers initialization, the NotImplementedError raised by the base extract method, structure extraction, bounding box text retrieval, and confidence calculation.

File:
/tf/active/vicechatdev/invoice_extraction/tests/test_extractors.py

Lines:
17 - 112

Complexity:
moderate

Purpose

This test class validates the behavior of the BaseExtractor class, ensuring that it properly initializes with configuration, raises NotImplementedError for abstract methods, correctly extracts document structure, retrieves text within bounding boxes, and calculates confidence scores based on extraction completeness. It serves as a quality assurance component for the document extraction system.

Source Code

class TestBaseExtractor(unittest.TestCase):
    """Test cases for the BaseExtractor class."""
    
    def setUp(self):
        """Set up test environment before each test."""
        self.config = {
            'confidence_threshold': 0.7
        }
        self.extractor = BaseExtractor(self.config)
        
        # Sample document for testing
        self.sample_doc = {
            'text': 'Invoice #12345\nIssue Date: 2023-01-15\nTotal: $500.00',
            'pages': [
                {
                    'text': 'Invoice #12345\nIssue Date: 2023-01-15',
                    'width': 800,
                    'height': 1000,
                    'tables': []
                },
                {
                    'text': 'Total: $500.00',
                    'width': 800,
                    'height': 1000,
                    'tables': []
                }
            ]
        }
        
    def test_init(self):
        """Test initialization of BaseExtractor."""
        self.assertEqual(self.extractor.config['confidence_threshold'], 0.7)
        self.assertIsInstance(self.extractor, BaseExtractor)
    
    def test_extract_not_implemented(self):
        """Test that the base extract method raises NotImplementedError."""
        with self.assertRaises(NotImplementedError):
            self.extractor.extract(self.sample_doc, 'en')
    
    def test_extract_structure(self):
        """Test extracting structure from document."""
        structure = self.extractor.extract_structure(self.sample_doc)
        
        # Check that structure is a dictionary
        self.assertIsInstance(structure, dict)
        
        # Default should be unstructured
        self.assertEqual(structure.get('is_structured', False), False)
    
    def test_get_text_in_bbox(self):
        """Test getting text inside a bounding box."""
        # Define a bounding box [x0, y0, x1, y1]
        bbox = [0, 0, 800, 1000]
        
        text, pages = self.extractor.get_text_in_bbox(self.sample_doc, bbox)
        
        # Should include text from first page
        self.assertIn('Invoice #12345', text)
        self.assertEqual(pages, [0])
    
    def test_calculate_confidence(self):
        """Test confidence calculation based on extraction completeness."""
        # Create a sample extraction result with various fields
        extraction_result = {
            'invoice': {
                'number': '12345',
                'issue_date': '2023-01-15',
                # missing due_date
            },
            'vendor': {
                'name': 'Test Vendor',
                # missing address
                'vat_number': '123456789'
            },
            'amounts': {
                'total': 500.00,
                # missing subtotal
                'tax': 100.00
            }
        }
        
        confidence = self.extractor.calculate_confidence(extraction_result)
        
        # Check that confidence is a float between 0 and 1
        self.assertIsInstance(confidence, float)
        self.assertGreaterEqual(confidence, 0.0)
        self.assertLessEqual(confidence, 1.0)
        
        # Higher confidence with more fields
        more_complete = {**extraction_result}
        more_complete['invoice']['due_date'] = '2023-02-15'
        more_complete['vendor']['address'] = '123 Test St'
        more_complete['amounts']['subtotal'] = 400.00
        
        higher_confidence = self.extractor.calculate_confidence(more_complete)
        self.assertGreater(higher_confidence, confidence)
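
A note on `more_complete = {**extraction_result}` in the last test: dict unpacking makes a shallow copy, so the nested 'invoice', 'vendor', and 'amounts' dicts are shared with the original and are updated along with it. The test still passes because `confidence` is computed before the mutation. The sketch below shows the difference with a reduced dictionary; `copy.deepcopy` is one possible alternative and is not part of the original test.

```python
import copy

extraction_result = {'invoice': {'number': '12345'}}

# Shallow copy: the outer dict is new, but the nested dict is the same object.
shallow = {**extraction_result}
shallow['invoice']['due_date'] = '2023-02-15'
shared_mutation = 'due_date' in extraction_result['invoice']

# Deep copy: nested dicts are copied too, so the original stays untouched.
original = {'invoice': {'number': '12345'}}
deep = copy.deepcopy(original)
deep['invoice']['due_date'] = '2023-02-15'
isolated = 'due_date' not in original['invoice']

print(shared_mutation, isolated)
```

If a later assertion ever needs the original, less complete dictionary again, a deep copy keeps it intact.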

Parameters

Name	Type	Default	Kind
`bases`	unittest.TestCase	-

Parameter Details

bases: Inherits from unittest.TestCase, which provides the testing framework and assertion methods for writing and running unit tests

Return Value

As a test class, it does not return values directly. Each test method performs assertions that either pass or fail. The unittest framework collects results and reports test outcomes (pass/fail/error) when the test suite is executed.

Class Interface

Methods

`setUp(self) -> None`

Purpose: Initializes test environment before each test method runs, creating a BaseExtractor instance with configuration and sample document data

Returns: None - sets up instance attributes self.config, self.extractor, and self.sample_doc
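
The per-test behavior of setUp can be checked with a small stand-in class. `FreshFixtureDemo` and its test names are hypothetical and only reuse the config value from this page; the point is that a change made in one test is not visible in the next.

```python
import io
import unittest


class FreshFixtureDemo(unittest.TestCase):
    """Hypothetical stand-in showing that setUp rebuilds fixtures per test."""

    def setUp(self):
        self.config = {'confidence_threshold': 0.7}

    def test_a_mutates_config(self):
        self.config['confidence_threshold'] = 0.1
        self.assertEqual(self.config['confidence_threshold'], 0.1)

    def test_b_sees_fresh_config(self):
        # Runs after test_a (the loader sorts names alphabetically),
        # yet setUp has rebuilt the config, so the value is 0.7 again.
        self.assertEqual(self.config['confidence_threshold'], 0.7)


suite = unittest.TestLoader().loadTestsFromTestCase(FreshFixtureDemo)
# Capture runner output in memory to keep the example quiet.
result = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(suite)
print(result.testsRun, result.wasSuccessful())
```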

`test_init(self) -> None`

Purpose: Tests that BaseExtractor initializes correctly with the provided configuration and is an instance of the correct class

Returns: None - performs assertions on initialization

`test_extract_not_implemented(self) -> None`

Purpose: Verifies that the base extract method raises NotImplementedError, ensuring subclasses must implement their own extraction logic

Returns: None - asserts that NotImplementedError is raised
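
The pattern this test guards can be sketched as follows. The real BaseExtractor source is not shown on this page, so `SketchBaseExtractor` and `SketchUKExtractor` are hypothetical stand-ins that only mirror the `extract(document, language)` call used in the test.

```python
class SketchBaseExtractor:
    """Hypothetical base class; not the real BaseExtractor."""

    def __init__(self, config=None):
        self.config = config or {}

    def extract(self, document, language):
        # The base class offers no extraction logic of its own.
        raise NotImplementedError("Subclasses must implement extract()")


class SketchUKExtractor(SketchBaseExtractor):
    """Hypothetical subclass that supplies its own extract()."""

    def extract(self, document, language):
        return {'invoice': {}, 'language': language}


try:
    SketchBaseExtractor({'confidence_threshold': 0.7}).extract({}, 'en')
    base_raised = False
except NotImplementedError:
    base_raised = True

sub_result = SketchUKExtractor().extract({}, 'en')
print(base_raised, sub_result)
```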

`test_extract_structure(self) -> None`

Purpose: Tests the extract_structure method to ensure it returns a dictionary with proper structure information, defaulting to unstructured

Returns: None - asserts structure is a dict and that is_structured is False or absent (the test reads it with .get('is_structured', False))
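
Because the assertion reads the key through `.get('is_structured', False)`, a missing key and an explicit False look the same to it. A small sketch of that behavior, with a stricter helper as an optional variant (`is_explicitly_unstructured` is a hypothetical name, not part of the test file):

```python
missing_key = {}
explicit_false = {'is_structured': False}

# Both lookups yield False, so the original assertion accepts either dict.
lenient_missing = missing_key.get('is_structured', False)
lenient_explicit = explicit_false.get('is_structured', False)


def is_explicitly_unstructured(structure):
    """Return True only when the key exists and is exactly False."""
    return 'is_structured' in structure and structure['is_structured'] is False


strict_missing = is_explicitly_unstructured(missing_key)
strict_explicit = is_explicitly_unstructured(explicit_false)
print(lenient_missing, lenient_explicit, strict_missing, strict_explicit)
```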

`test_get_text_in_bbox(self) -> None`

Purpose: Tests the get_text_in_bbox method with a bounding box [0, 0, 800, 1000] that spans a full page, verifying that the returned text includes the first page's content and that the returned page list is [0]

Returns: None - asserts 'Invoice #12345' is in the returned text and that pages equals [0]
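
The `[x0, y0, x1, y1]` convention can be illustrated with a containment check. This is a sketch under loud assumptions: the real get_text_in_bbox is not shown on this page, and the per-page 'words' list with word-level boxes is hypothetical (the sample document in setUp carries only page-level text). It does not try to reproduce whatever rule makes the real method return only page 0 for a full-page box.

```python
def text_in_bbox(document, bbox):
    """Return (text, page_indices) for words fully inside bbox.

    Assumes each page has a hypothetical 'words' list of
    {'text': str, 'bbox': [x0, y0, x1, y1]} entries.
    """
    x0, y0, x1, y1 = bbox
    found, pages = [], []
    for index, page in enumerate(document.get('pages', [])):
        for word in page.get('words', []):
            wx0, wy0, wx1, wy1 = word['bbox']
            # Keep the word only if its box lies entirely inside the query box.
            if wx0 >= x0 and wy0 >= y0 and wx1 <= x1 and wy1 <= y1:
                found.append(word['text'])
                if index not in pages:
                    pages.append(index)
    return ' '.join(found), pages


doc = {'pages': [
    {'words': [{'text': 'Invoice', 'bbox': [50, 40, 120, 60]},
               {'text': '#12345', 'bbox': [125, 40, 190, 60]}]},
    {'words': [{'text': 'Total:', 'bbox': [50, 900, 100, 920]}]},
]}

# Query only the top strip of the page.
text, pages = text_in_bbox(doc, [0, 0, 800, 100])
print(text, pages)
```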

`test_calculate_confidence(self) -> None`

Purpose: Tests the calculate_confidence method to ensure it returns a valid confidence score (0.0-1.0) and that more complete extractions yield higher confidence

Returns: None - asserts confidence is a float between 0 and 1, and increases with completeness
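
The real calculate_confidence is not shown on this page; the test only requires a float in [0, 1] that grows as fields are filled in. One scoring rule that would satisfy those assertions, purely as an illustration, is the fraction of expected fields that are present. The `EXPECTED_FIELDS` mapping and the function name are assumptions built from the field names used in the test.

```python
# Hypothetical list of fields, taken from the keys the test fills in.
EXPECTED_FIELDS = {
    'invoice': ['number', 'issue_date', 'due_date'],
    'vendor': ['name', 'address', 'vat_number'],
    'amounts': ['total', 'subtotal', 'tax'],
}


def completeness_confidence(result):
    """Return the share of expected fields that hold a non-empty value."""
    total = sum(len(fields) for fields in EXPECTED_FIELDS.values())
    present = sum(
        1
        for section, fields in EXPECTED_FIELDS.items()
        for field in fields
        if result.get(section, {}).get(field) not in (None, '')
    )
    return present / total


partial = {
    'invoice': {'number': '12345', 'issue_date': '2023-01-15'},
    'vendor': {'name': 'Test Vendor', 'vat_number': '123456789'},
    'amounts': {'total': 500.00, 'tax': 100.00},
}
complete = {
    'invoice': {**partial['invoice'], 'due_date': '2023-02-15'},
    'vendor': {**partial['vendor'], 'address': '123 Test St'},
    'amounts': {**partial['amounts'], 'subtotal': 400.00},
}

partial_score = completeness_confidence(partial)
complete_score = completeness_confidence(complete)
print(partial_score, complete_score)
```

A real implementation may weight fields differently or factor in the confidence_threshold setting; the test would accept any rule with these two properties.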

Attributes

Name	Type	Description	Scope
`config`	dict	Configuration dictionary containing settings for the BaseExtractor, including confidence_threshold set to 0.7	instance
`extractor`	BaseExtractor	Instance of BaseExtractor initialized with self.config, used as the system under test	instance
`sample_doc`	dict	Sample document structure containing text and pages with invoice data, used for testing extraction methods. Includes full document text and page-level details with dimensions and tables	instance

Dependencies

unittest
unittest.mock
os
json
logging
pathlib
datetime

Required Imports

import unittest
from unittest.mock import patch
from unittest.mock import MagicMock
import os
import json
import logging
from pathlib import Path
import datetime
from extractors.base_extractor import BaseExtractor
from extractors.uk_extractor import UKExtractor
from extractors.be_extractor import BEExtractor
from extractors.au_extractor import AUExtractor

Usage Example

import unittest
# Assumes the tests/ directory and the project root are importable
from test_extractors import TestBaseExtractor

# Run every test in the class
test_suite = unittest.TestLoader().loadTestsFromTestCase(TestBaseExtractor)
unittest.TextTestRunner(verbosity=2).run(test_suite)

# Or run from command line (from the tests/ directory):
# python -m unittest test_extractors.TestBaseExtractor

# Run a specific test method:
# python -m unittest test_extractors.TestBaseExtractor.test_init

# The setUp method runs before each test, creating:
# - A config dictionary with confidence_threshold
# - A BaseExtractor instance
# - A sample document structure for testing
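
A single test method can also be selected programmatically and the outcome inspected through the returned result object. The sketch below uses a hypothetical stand-in class, `DemoTests`, so that it runs without the invoice_extraction package installed.

```python
import io
import unittest


class DemoTests(unittest.TestCase):
    """Hypothetical stand-in for TestBaseExtractor."""

    def setUp(self):
        self.config = {'confidence_threshold': 0.7}

    def test_init(self):
        self.assertEqual(self.config['confidence_threshold'], 0.7)

    def test_other(self):
        self.assertIsInstance(self.config, dict)


# A TestCase instance is bound to one method name, so this suite holds
# only test_init, much like `python -m unittest <module>.DemoTests.test_init`.
suite = unittest.TestSuite([DemoTests('test_init')])
stream = io.StringIO()
result = unittest.TextTestRunner(stream=stream, verbosity=2).run(suite)

print(result.testsRun, result.wasSuccessful())
print(stream.getvalue())
```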

Best Practices

Each test method is independent and isolated, with setUp() creating fresh instances before each test
Test methods follow the naming convention test_<feature_being_tested> for clarity
Each test method has a descriptive docstring explaining what is being tested
Sample data is created in setUp() to ensure consistency across tests
Tests verify both positive cases (correct behavior) and a negative case (extract raising NotImplementedError on the base class)
Assertions check multiple aspects: type checking, value checking, and comparative checking
Run this class together with the other test classes in test_extractors.py so that regressions in BaseExtractor surface alongside the entity-specific extractor tests
When extending tests, maintain the pattern of testing one specific behavior per test method

Similar Components

AI-powered semantic similarity - components with related functionality:

class TestBaseValidator 77.7% similar

Unit test class for testing the BaseValidator class functionality, including validation of extraction results, field types, date consistency, amount consistency, and entity-specific validation rules.
From: /tf/active/vicechatdev/invoice_extraction/tests/test_validators.py

function test_document_extractor 69.6% similar

A test function that validates the DocumentExtractor class by testing file type support detection, text extraction from various document formats, and error handling.
From: /tf/active/vicechatdev/leexi/test_document_extractor.py

class BaseExtractor 67.7% similar

Abstract base class that defines the interface and shared functionality for entity-specific invoice data extractors (UK, BE, AU), providing a multi-stage extraction pipeline for invoice processing.
From: /tf/active/vicechatdev/invoice_extraction/extractors/base_extractor.py

class TestBEExtractor 66.7% similar

Unit test class for testing the BEExtractor class, which extracts structured data from Belgian invoices using LLM-based extraction.
From: /tf/active/vicechatdev/invoice_extraction/tests/test_extractors.py

class TestUKExtractor 62.3% similar

Unit test class for testing the UKExtractor class, which extracts structured data from UK invoices including VAT numbers, dates, amounts, and line items.
From: /tf/active/vicechatdev/invoice_extraction/tests/test_extractors.py
