TestDocumentProcessor - Code Extractor

class TestDocumentProcessor

Maturity: 46

A test subclass of DocumentProcessor that simulates llmsherpa PDF processing failures and triggers OCR fallback mechanisms for testing purposes.

File:
/tf/active/vicechatdev/contract_validity_analyzer/test_ocr_fallback.py

Lines:
17 - 57

Complexity:
moderate

Purpose

This class extends DocumentProcessor to provide controlled testing of error handling and fallback mechanisms. It allows developers to simulate llmsherpa failures without actually breaking the llmsherpa service, enabling testing of OCR fallback logic (using pytesseract and easyocr) when the primary PDF processing method fails. This is particularly useful for integration testing, error handling validation, and ensuring graceful degradation of PDF processing capabilities.

Source Code

class TestDocumentProcessor(DocumentProcessor):
    """Test version that can simulate llmsherpa failures"""
    
    def __init__(self, config, force_llmsherpa_fail=False):
        super().__init__(config)
        self.force_llmsherpa_fail = force_llmsherpa_fail
    
    def _process_pdf_document(self, file_path):
        """Override to simulate llmsherpa failures for testing"""
        if self.force_llmsherpa_fail:
            logger = logging.getLogger(__name__)
            logger.info(f"Simulating llmsherpa failure for: {file_path}")
            # Simulate the exact error we've been seeing
            logger.error(f"Error processing PDF {file_path} with llmsherpa: 'return_dict'")
            
            # Now trigger the OCR fallback logic
            if hasattr(self, '_extract_text_with_ocr'):
                try:
                    from pdf2image import convert_from_path
                    from PIL import Image
                    import pytesseract
                    import easyocr
                    OCR_AVAILABLE = True
                except ImportError as e:
                    OCR_AVAILABLE = False
                
                if OCR_AVAILABLE:
                    logger.info("llmsherpa processing failed, attempting OCR as fallback")
                    ocr_text = self._extract_text_with_ocr(file_path)
                    if ocr_text:
                        logger.info("OCR fallback successful")
                        return ocr_text
                    else:
                        logger.warning("OCR fallback also failed")
                else:
                    logger.warning("OCR fallback needed but OCR libraries not available")
            
            return None
        else:
            # Use normal processing
            return super()._process_pdf_document(file_path)

Parameters

Name	Type	Default	Kind
`bases`	DocumentProcessor	-	-
`config`	object	-	positional or keyword
`force_llmsherpa_fail`	bool	False	positional or keyword

Parameter Details

config: Configuration object required by the parent DocumentProcessor class. Contains settings for document processing, including paths, API keys, and processing options. This is passed directly to the parent class constructor.

force_llmsherpa_fail: Boolean flag that controls whether an llmsherpa failure is simulated. When True, _process_pdf_document skips llmsherpa processing, logs an error ending in 'return_dict' (no exception is raised), and then attempts the OCR fallback. When False, processing is delegated to the parent class. Defaults to False.

Return Value

Instantiation returns a TestDocumentProcessor object. The _process_pdf_document method returns one of: (1) the OCR-extracted text as a string, when force_llmsherpa_fail is True, the OCR libraries import successfully, and _extract_text_with_ocr returns non-empty text; (2) None, when the failure is forced and the OCR libraries are unavailable, the _extract_text_with_ocr method is missing, or OCR returns no text; or (3) the result of the parent class's _process_pdf_document method, when force_llmsherpa_fail is False.
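The branches described above can be condensed into a small decision function. This is only an illustrative model of the return logic, not code from the project: the function name `simulated_result` and all of its parameters are hypothetical and stand for conditions that the real method checks at runtime.

```python
from typing import Optional


def simulated_result(
    force_fail: bool,
    has_ocr_method: bool,
    ocr_importable: bool,
    ocr_text: Optional[str],
    parent_result: Optional[str],
) -> Optional[str]:
    """Model which value _process_pdf_document would hand back.

    All parameters are hypothetical stand-ins:
      force_fail      -> self.force_llmsherpa_fail
      has_ocr_method  -> hasattr(self, '_extract_text_with_ocr')
      ocr_importable  -> all four OCR imports succeeded
      ocr_text        -> what _extract_text_with_ocr returned
      parent_result   -> what the parent class would return
    """
    if not force_fail:
        # Case 3: normal processing is delegated to the parent class.
        return parent_result
    if has_ocr_method and ocr_importable and ocr_text:
        # Case 1: OCR fallback produced non-empty text.
        return ocr_text
    # Case 2: forced failure with no usable OCR result.
    return None
```

Note that an empty string from OCR is treated the same as no result, because the source tests `if ocr_text:` rather than `is not None`.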

Class Interface

Methods

`__init__(self, config, force_llmsherpa_fail=False)`

Purpose: Initializes the TestDocumentProcessor with configuration and optional failure simulation flag

Parameters:

config: Configuration object for the parent DocumentProcessor class
force_llmsherpa_fail: Boolean flag to control whether to simulate llmsherpa failures (default: False)

Returns: None (constructor)

`_process_pdf_document(self, file_path) -> str | None`

Purpose: Overrides parent method to optionally simulate llmsherpa failures and trigger OCR fallback for testing

Parameters:

file_path: String path to the PDF file to be processed

Returns: String containing extracted text from OCR fallback if successful, None if processing fails, or result from parent class if not simulating failure

Attributes

Name	Type	Description	Scope
`force_llmsherpa_fail`	bool	Flag that determines whether to simulate llmsherpa processing failures. When True, causes _process_pdf_document to skip normal processing and trigger OCR fallback	instance
`config`	object	Configuration object inherited from parent DocumentProcessor class, contains settings for document processing	instance

Dependencies

logging
pdf2image
PIL
pytesseract
easyocr
utils.document_processor

Required Imports

from utils.document_processor import DocumentProcessor
import logging

Conditional/Optional Imports

These imports are only needed under specific conditions:

from pdf2image import convert_from_path

Condition: only when force_llmsherpa_fail is True and OCR fallback is triggered

Optional

from PIL import Image

Condition: only when force_llmsherpa_fail is True and OCR fallback is triggered

Optional

import pytesseract

Condition: only when force_llmsherpa_fail is True and OCR fallback is triggered

Optional

import easyocr

Condition: only when force_llmsherpa_fail is True and OCR fallback is triggered

Optional
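Because the fallback is skipped if any one of these four imports fails, it can help to check for them before running a forced-failure test. The following is a minimal sketch, assuming only the standard library; the helper name `missing_modules` is hypothetical and not part of the project.

```python
from importlib.util import find_spec

# Top-level module names that the fallback branch tries to import.
OCR_MODULES = ("pdf2image", "PIL", "pytesseract", "easyocr")


def missing_modules(names=OCR_MODULES):
    """Return the names from `names` that cannot be located by the import system."""
    # find_spec locates a top-level module without importing it and
    # returns None when the module is not installed.
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("OCR fallback would be skipped; missing:", ", ".join(missing))
    else:
        print("All OCR modules can be located")
```

This only confirms that the Python packages can be located. It does not verify external programs that some of them depend on, such as the Tesseract engine used by pytesseract or the Poppler utilities used by pdf2image.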

Usage Example

# Assumption: this runs from the contract_validity_analyzer directory, so that
# test_ocr_fallback.py (where the class is defined) is importable.
import logging

from test_ocr_fallback import TestDocumentProcessor

# Show the simulated-failure and fallback messages on the console
logging.basicConfig(level=logging.INFO)

# Placeholder: supply whatever configuration DocumentProcessor expects
config = {'some': 'configuration'}

# Normal behavior: delegates to DocumentProcessor._process_pdf_document
processor = TestDocumentProcessor(config, force_llmsherpa_fail=False)
result = processor._process_pdf_document('/path/to/document.pdf')

# Simulated llmsherpa failure: logs the error, then attempts the OCR fallback
test_processor = TestDocumentProcessor(config, force_llmsherpa_fail=True)
result_with_fallback = test_processor._process_pdf_document('/path/to/document.pdf')

if result_with_fallback:
    print('OCR fallback succeeded')
else:
    # None means OCR was unavailable, missing, or returned no text
    print('OCR fallback unavailable or returned no text')

Best Practices

Use force_llmsherpa_fail=True only in test environments to validate error handling and fallback behavior; the class is not intended for production use
Install the OCR libraries (pdf2image, Pillow, pytesseract, easyocr) before testing with force_llmsherpa_fail=True, or be prepared to handle None returns
The fallback relies on the parent DocumentProcessor providing a _extract_text_with_ocr method; if it is missing, the forced-failure path returns None without logging any fallback message, so verify that it exists before use
Monitor the logs when using this class: it reports the simulated failure at ERROR level and each fallback outcome at INFO or WARNING level
The simulated error message 'return_dict' mirrors the llmsherpa error observed in practice, so the log output resembles a real failure
Pass the path of a real PDF file to _process_pdf_document; in forced-failure mode the path is handed to _extract_text_with_ocr unchanged
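One way to act on the logging advice is to assert on the log output in a unit test while replacing the OCR call with a mock, so the result does not depend on installed OCR libraries. The sketch below uses a hypothetical `StandInProcessor` that imitates only the forced-failure branch so the example runs in isolation; it is not the real class, and the logger name `ocr_fallback_demo` is likewise an assumption. In the actual project the same pattern would be applied to TestDocumentProcessor.

```python
import logging
import unittest
from unittest import mock

# Hypothetical logger name used only by this sketch.
logger = logging.getLogger("ocr_fallback_demo")


class StandInProcessor:
    """Hypothetical stand-in that imitates the forced-failure branch."""

    def __init__(self, force_llmsherpa_fail=False):
        self.force_llmsherpa_fail = force_llmsherpa_fail

    def _extract_text_with_ocr(self, file_path):
        raise NotImplementedError("replaced by a mock in the tests")

    def _process_pdf_document(self, file_path):
        if not self.force_llmsherpa_fail:
            return "text from primary parser"
        logger.error("Error processing PDF %s with llmsherpa: 'return_dict'", file_path)
        ocr_text = self._extract_text_with_ocr(file_path)
        if ocr_text:
            logger.info("OCR fallback successful")
            return ocr_text
        logger.warning("OCR fallback also failed")
        return None


class FallbackLoggingTest(unittest.TestCase):
    def run_with_ocr_result(self, ocr_result):
        """Run the forced-failure path with a mocked OCR result."""
        proc = StandInProcessor(force_llmsherpa_fail=True)
        with mock.patch.object(proc, "_extract_text_with_ocr", return_value=ocr_result):
            with self.assertLogs("ocr_fallback_demo", level="INFO") as captured:
                result = proc._process_pdf_document("sample.pdf")
        return result, captured.output

    def test_success_is_logged(self):
        result, lines = self.run_with_ocr_result("ocr text")
        self.assertEqual(result, "ocr text")
        self.assertTrue(any("OCR fallback successful" in line for line in lines))

    def test_empty_ocr_returns_none(self):
        result, lines = self.run_with_ocr_result("")
        self.assertIsNone(result)
        self.assertTrue(any("OCR fallback also failed" in line for line in lines))
```

Saved as a module, these tests can be run with `python -m unittest`. Mocking `_extract_text_with_ocr` keeps the test fast and deterministic, at the cost of not exercising the real OCR libraries.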

Similar Components

AI-powered semantic similarity - components with related functionality:

function test_document_processor 81.7% similar

A test function that validates the DocumentProcessor component's ability to extract text from PDF files with improved error handling and llmsherpa integration.
From: /tf/active/vicechatdev/contract_validity_analyzer/test_improved_processor.py

function test_ocr_fallback 77.6% similar

A test function that validates OCR fallback functionality when the primary llmsherpa PDF text extraction method fails.
From: /tf/active/vicechatdev/contract_validity_analyzer/test_ocr_fallback.py

class DocumentProcessor_v1 76.5% similar

A document processing class that extracts text from PDF and Word documents using llmsherpa as the primary method with fallback support for PyPDF2, pdfplumber, and python-docx.
From: /tf/active/vicechatdev/contract_validity_analyzer/utils/document_processor_new.py

class DocumentProcessor_v2 76.4% similar

A document processing class that extracts text from PDF and Word documents using llmsherpa as the primary method with fallback support for PyPDF2, pdfplumber, and python-docx.
From: /tf/active/vicechatdev/contract_validity_analyzer/utils/document_processor_old.py

function test_document_processing 72.7% similar

A test function that validates document processing functionality by creating a test PDF file, processing it through a DocumentProcessor, and verifying the extraction results or error handling.
From: /tf/active/vicechatdev/contract_validity_analyzer/test_implementation.py
