class TestDocumentProcessor

Maturity: 46

A test subclass of DocumentProcessor that simulates llmsherpa PDF processing failures and triggers OCR fallback mechanisms for testing purposes.

File: /tf/active/vicechatdev/contract_validity_analyzer/test_ocr_fallback.py
Lines: 17 - 57
Complexity: moderate

Purpose

This class extends DocumentProcessor to provide controlled testing of error handling and fallback mechanisms. It lets developers simulate an llmsherpa failure without breaking the llmsherpa service: when the flag is set, the override logs a simulated error, checks that the OCR libraries (pdf2image, PIL, pytesseract, easyocr) can be imported, and then calls the inherited _extract_text_with_ocr method directly. This is useful for integration testing, for validating error handling, and for confirming that PDF processing degrades gracefully when the primary method fails.

Source Code

class TestDocumentProcessor(DocumentProcessor):
    """Test version that can simulate llmsherpa failures"""
    
    def __init__(self, config, force_llmsherpa_fail=False):
        super().__init__(config)
        self.force_llmsherpa_fail = force_llmsherpa_fail
    
    def _process_pdf_document(self, file_path):
        """Override to simulate llmsherpa failures for testing"""
        if self.force_llmsherpa_fail:
            logger = logging.getLogger(__name__)
            logger.info(f"Simulating llmsherpa failure for: {file_path}")
            # Simulate the exact error we've been seeing
            logger.error(f"Error processing PDF {file_path} with llmsherpa: 'return_dict'")
            
            # Now trigger the OCR fallback logic
            if hasattr(self, '_extract_text_with_ocr'):
                try:
                    from pdf2image import convert_from_path
                    from PIL import Image
                    import pytesseract
                    import easyocr
                    OCR_AVAILABLE = True
                except ImportError as e:
                    OCR_AVAILABLE = False
                
                if OCR_AVAILABLE:
                    logger.info("llmsherpa processing failed, attempting OCR as fallback")
                    ocr_text = self._extract_text_with_ocr(file_path)
                    if ocr_text:
                        logger.info("OCR fallback successful")
                        return ocr_text
                    else:
                        logger.warning("OCR fallback also failed")
                else:
                    logger.warning("OCR fallback needed but OCR libraries not available")
            
            return None
        else:
            # Use normal processing
            return super()._process_pdf_document(file_path)

Parameters

Name Type Default Kind
bases DocumentProcessor - base class
config - - positional or keyword
force_llmsherpa_fail bool False positional or keyword

Parameter Details

config: Configuration object required by the parent DocumentProcessor class. Contains settings for document processing, including paths, API keys, and processing options. This is passed directly to the parent class constructor.

force_llmsherpa_fail: Boolean flag that controls whether to simulate an llmsherpa failure. When True, _process_pdf_document skips llmsherpa processing, logs a simulated error containing 'return_dict' (no exception is raised), and then attempts the OCR fallback. When False, processing is delegated to the parent class. Defaults to False.

Return Value

Instantiation returns a TestDocumentProcessor object. The _process_pdf_document method returns one of: (1) the text returned by _extract_text_with_ocr, when force_llmsherpa_fail is True, the OCR libraries import successfully, and OCR yields a non-empty result; (2) None, when the failure is forced and the parent has no _extract_text_with_ocr method, the OCR libraries are unavailable, or OCR returns an empty result; or (3) the result of the parent class's _process_pdf_document method, when force_llmsherpa_fail is False.
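The three return paths can be illustrated with a minimal, self-contained sketch. StubDocumentProcessor and SketchProcessor are hypothetical stand-ins invented for this illustration, not the real DocumentProcessor or TestDocumentProcessor; the sketch also omits the logging and the OCR import check so that it runs without any third-party libraries.

```python
class StubDocumentProcessor:
    """Hypothetical stand-in for DocumentProcessor (assumption, not the real class)."""

    def __init__(self, config):
        self.config = config

    def _process_pdf_document(self, file_path):
        # Pretend the primary extraction path succeeded.
        return f"parent text for {file_path}"

    def _extract_text_with_ocr(self, file_path):
        # Pretend OCR returns whatever the test configuration supplies.
        return self.config.get("ocr_text")


class SketchProcessor(StubDocumentProcessor):
    """Simplified override: same branching as the documented class, no import check."""

    def __init__(self, config, force_llmsherpa_fail=False):
        super().__init__(config)
        self.force_llmsherpa_fail = force_llmsherpa_fail

    def _process_pdf_document(self, file_path):
        if not self.force_llmsherpa_fail:
            # Path 3: delegate to the parent implementation.
            return super()._process_pdf_document(file_path)
        ocr_text = self._extract_text_with_ocr(file_path)
        # Path 1 (truthy OCR text) or path 2 (None).
        return ocr_text if ocr_text else None


normal = SketchProcessor({}, force_llmsherpa_fail=False)._process_pdf_document("a.pdf")
ocr_ok = SketchProcessor({"ocr_text": "scanned text"}, True)._process_pdf_document("a.pdf")
ocr_failed = SketchProcessor({"ocr_text": ""}, True)._process_pdf_document("a.pdf")
```

Because an empty OCR result and a missing OCR library both collapse to None, callers that need to tell these cases apart have to rely on the log messages rather than the return value.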

Class Interface

Methods

__init__(self, config, force_llmsherpa_fail=False)

Purpose: Initializes the TestDocumentProcessor with configuration and optional failure simulation flag

Parameters:

  • config: Configuration object for the parent DocumentProcessor class
  • force_llmsherpa_fail: Boolean flag to control whether to simulate llmsherpa failures (default: False)

Returns: None (constructor)

_process_pdf_document(self, file_path) -> str | None

Purpose: Overrides parent method to optionally simulate llmsherpa failures and trigger OCR fallback for testing

Parameters:

  • file_path: String path to the PDF file to be processed

Returns: String containing extracted text from OCR fallback if successful, None if processing fails, or result from parent class if not simulating failure

Attributes

Name Type Description Scope
force_llmsherpa_fail bool Flag that determines whether to simulate llmsherpa processing failures. When True, causes _process_pdf_document to skip normal processing and trigger OCR fallback instance
config object Configuration object inherited from parent DocumentProcessor class, contains settings for document processing instance

Dependencies

  • logging
  • pdf2image
  • PIL
  • pytesseract
  • easyocr
  • utils.document_processor

Required Imports

from utils.document_processor import DocumentProcessor
import logging

Conditional/Optional Imports

These imports are optional. They are attempted only when force_llmsherpa_fail is True and the OCR fallback is triggered:

from pdf2image import convert_from_path
from PIL import Image
import pytesseract
import easyocr

If any of them raises ImportError, the OCR fallback is skipped and a warning is logged.
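Before running a forced-failure test, it can help to check which of the optional OCR modules are installed. The sketch below is one possible way to do that with the standard library's importlib.util.find_spec; the helper names missing_ocr_modules and ocr_available are hypothetical. Note that this differs from the try/except import used in the class: find_spec only locates a module and does not execute it, so it will not catch a module that is installed but fails at import time.

```python
import importlib.util

# Top-level module names matching the optional imports listed above.
OCR_MODULES = ("pdf2image", "PIL", "pytesseract", "easyocr")


def missing_ocr_modules(modules=OCR_MODULES):
    """Return the top-level module names that cannot be located."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def ocr_available(modules=OCR_MODULES):
    """True only when every listed module can be located."""
    return not missing_ocr_modules(modules)


if __name__ == "__main__":
    missing = missing_ocr_modules()
    if missing:
        print("OCR fallback will be skipped; missing:", ", ".join(missing))
    else:
        print("All OCR modules found")
```

A test suite could use the result to skip OCR-dependent tests instead of letting them silently return None.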

Usage Example

# Basic usage: normal processing versus simulated failure
# Assumption: the class is imported from the test module named under "File" above,
# and that module's directory is on sys.path
from test_ocr_fallback import TestDocumentProcessor

# Placeholder; pass whatever configuration DocumentProcessor expects
config = {'some': 'configuration'}

# Create processor with normal behavior (delegates to DocumentProcessor)
processor = TestDocumentProcessor(config, force_llmsherpa_fail=False)
result = processor._process_pdf_document('/path/to/document.pdf')

# Create processor that simulates llmsherpa failure
test_processor = TestDocumentProcessor(config, force_llmsherpa_fail=True)
result_with_fallback = test_processor._process_pdf_document('/path/to/document.pdf')

# The second call logs a simulated failure and attempts OCR fallback
if result_with_fallback:
    print('OCR fallback succeeded')
else:
    print('OCR fallback unavailable or returned no text')
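In a unit test, running real OCR is slow and depends on installed libraries, so one option is to stub the OCR method. The sketch below uses unittest.mock.patch.object from the standard library. FakeProcessor is a hypothetical stand-in that only mirrors the method names used above; it is not the real TestDocumentProcessor.

```python
from unittest.mock import patch


class FakeProcessor:
    """Hypothetical stand-in exposing the same method names (assumption)."""

    def _extract_text_with_ocr(self, file_path):
        raise RuntimeError("real OCR should not run in a unit test")

    def _process_pdf_document(self, file_path):
        # Mirrors the forced-failure branch: return OCR text, or None if empty.
        return self._extract_text_with_ocr(file_path) or None


processor = FakeProcessor()

# Replace the OCR method on this instance for the duration of the block.
with patch.object(
    processor, "_extract_text_with_ocr", return_value="stubbed OCR text"
) as mock_ocr:
    result = processor._process_pdf_document("/path/to/document.pdf")
    # Confirm the fallback was invoked exactly once with the expected path.
    mock_ocr.assert_called_once_with("/path/to/document.pdf")
    call_count = mock_ocr.call_count
```

Setting return_value to an empty string instead would exercise the "OCR fallback also failed" path without touching any PDF file.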

Best Practices

  • Use force_llmsherpa_fail=True only in test environments to validate error handling and fallback mechanisms
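  • Ensure all four OCR libraries (pdf2image, PIL, pytesseract, easyocr) are installed before testing with force_llmsherpa_fail=True; if any one fails to import, the OCR fallback is skipped and the method returns None
  • The class relies on the parent DocumentProcessor having a _extract_text_with_ocr method; if it is missing, the override returns None without logging a warning, so verify that it exists before use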
  • Monitor logs when using this class as it generates informative logging messages about failure simulation and fallback attempts
  • This class is designed for testing purposes only and should not be used in production environments
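  • The simulated error message 'return_dict' mirrors an error previously observed with llmsherpa (per the source comment); it is only logged, not raised as an exception
  • Pass the path of a real PDF file to _process_pdf_document, because the same path is handed to _extract_text_with_ocr even when the llmsherpa failure is simulated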
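To act on the advice about monitoring logs, a test can capture log records and assert on them. The sketch below is a minimal illustration using only the standard logging module. ListHandler and simulate_failure are hypothetical names; simulate_failure simply emits the same two messages the override logs at the start of a forced failure, so the example runs without the real class.

```python
import logging


class ListHandler(logging.Handler):
    """Collects (level name, message) pairs in memory for later assertions."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append((record.levelname, record.getMessage()))


def simulate_failure(file_path, logger):
    """Hypothetical stand-in emitting the two messages logged on forced failure."""
    logger.info(f"Simulating llmsherpa failure for: {file_path}")
    logger.error(f"Error processing PDF {file_path} with llmsherpa: 'return_dict'")


logger = logging.getLogger("ocr_fallback_sketch")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep the example's output out of the root logger
handler = ListHandler()
logger.addHandler(handler)

simulate_failure("contract.pdf", logger)
logger.removeHandler(handler)

levels = [level for level, _ in handler.messages]
```

With pytest, the built-in caplog fixture offers similar capture without a custom handler.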

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function test_document_processor 81.7% similar

    A test function that validates the DocumentProcessor component's ability to extract text from PDF files with improved error handling and llmsherpa integration.

    From: /tf/active/vicechatdev/contract_validity_analyzer/test_improved_processor.py
  • function test_ocr_fallback 77.6% similar

    A test function that validates OCR fallback functionality when the primary llmsherpa PDF text extraction method fails.

    From: /tf/active/vicechatdev/contract_validity_analyzer/test_ocr_fallback.py
  • class DocumentProcessor_v1 76.5% similar

    A document processing class that extracts text from PDF and Word documents using llmsherpa as the primary method with fallback support for PyPDF2, pdfplumber, and python-docx.

    From: /tf/active/vicechatdev/contract_validity_analyzer/utils/document_processor_new.py
  • class DocumentProcessor_v2 76.4% similar

    A document processing class that extracts text from PDF and Word documents using llmsherpa as the primary method with fallback support for PyPDF2, pdfplumber, and python-docx.

    From: /tf/active/vicechatdev/contract_validity_analyzer/utils/document_processor_old.py
  • function test_document_processing 72.7% similar

    A test function that validates document processing functionality by creating a test PDF file, processing it through a DocumentProcessor, and verifying the extraction results or error handling.

    From: /tf/active/vicechatdev/contract_validity_analyzer/test_implementation.py