class TestDocumentProcessor
A test subclass of DocumentProcessor that simulates llmsherpa PDF processing failures and triggers OCR fallback mechanisms for testing purposes.
/tf/active/vicechatdev/contract_validity_analyzer/test_ocr_fallback.py
17 - 57
moderate
Purpose
This class extends DocumentProcessor to provide controlled testing of error handling and fallback mechanisms. It allows developers to simulate llmsherpa failures without actually breaking the llmsherpa service, enabling testing of OCR fallback logic (using pytesseract and easyocr) when the primary PDF processing method fails. This is particularly useful for integration testing, error handling validation, and ensuring graceful degradation of PDF processing capabilities.
Source Code
class TestDocumentProcessor(DocumentProcessor):
    """Test version that can simulate llmsherpa failures"""

    def __init__(self, config, force_llmsherpa_fail=False):
        super().__init__(config)
        self.force_llmsherpa_fail = force_llmsherpa_fail

    def _process_pdf_document(self, file_path):
        """Override to simulate llmsherpa failures for testing"""
        if self.force_llmsherpa_fail:
            logger = logging.getLogger(__name__)
            logger.info(f"Simulating llmsherpa failure for: {file_path}")

            # Simulate the exact error we've been seeing
            logger.error(f"Error processing PDF {file_path} with llmsherpa: 'return_dict'")

            # Now trigger the OCR fallback logic
            if hasattr(self, '_extract_text_with_ocr'):
                try:
                    from pdf2image import convert_from_path
                    from PIL import Image
                    import pytesseract
                    import easyocr
                    OCR_AVAILABLE = True
                except ImportError as e:
                    OCR_AVAILABLE = False

                if OCR_AVAILABLE:
                    logger.info("llmsherpa processing failed, attempting OCR as fallback")
                    ocr_text = self._extract_text_with_ocr(file_path)
                    if ocr_text:
                        logger.info("OCR fallback successful")
                        return ocr_text
                    else:
                        logger.warning("OCR fallback also failed")
                else:
                    logger.warning("OCR fallback needed but OCR libraries not available")

            return None
        else:
            # Use normal processing
            return super()._process_pdf_document(file_path)
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| bases | DocumentProcessor | - | |
Parameter Details
config: Configuration object required by the parent DocumentProcessor class. Contains settings for document processing, including paths, API keys, and processing options. This is passed directly to the parent class constructor.
force_llmsherpa_fail: Boolean flag that controls whether to simulate llmsherpa failures. When set to True, the class skips normal llmsherpa processing, logs a simulated failure with the error message 'return_dict' (no exception is raised), and then attempts the OCR fallback. When False, normal processing through the parent class is used. Defaults to False.
Return Value
Instantiation returns a TestDocumentProcessor object. The _process_pdf_document method returns one of: (1) the extracted text as a string from the OCR fallback when force_llmsherpa_fail is True, the OCR libraries import successfully, and OCR yields non-empty text; (2) None when the failure is forced and the parent has no _extract_text_with_ocr method, any OCR library fails to import, or OCR returns an empty result; or (3) the result of the parent class's _process_pdf_document method when force_llmsherpa_fail is False.
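The return paths described above can be traced with a small stand-in. This is a minimal sketch, not the real classes: StubDocumentProcessor and StubTestProcessor are hypothetical names, the stub return strings are invented for illustration, and the ocr_available argument replaces the try/except ImportError probe so the example runs without any OCR libraries installed.

```python
import logging

logger = logging.getLogger(__name__)


class StubDocumentProcessor:
    """Hypothetical stand-in for DocumentProcessor (not the real class)."""

    def __init__(self, config):
        self.config = config

    def _process_pdf_document(self, file_path):
        # Pretend the primary (llmsherpa) path succeeded.
        return f"parsed:{file_path}"

    def _extract_text_with_ocr(self, file_path):
        # Pretend OCR produced text.
        return f"ocr:{file_path}"


class StubTestProcessor(StubDocumentProcessor):
    """Mirrors the branching of TestDocumentProcessor, minus the import probe."""

    def __init__(self, config, force_llmsherpa_fail=False, ocr_available=True):
        super().__init__(config)
        self.force_llmsherpa_fail = force_llmsherpa_fail
        # Assumption: a plain flag stands in for the ImportError check.
        self.ocr_available = ocr_available

    def _process_pdf_document(self, file_path):
        if not self.force_llmsherpa_fail:
            # Outcome 3: delegate to the parent implementation.
            return super()._process_pdf_document(file_path)
        logger.error("Simulated llmsherpa failure for %s", file_path)
        if self.ocr_available:
            ocr_text = self._extract_text_with_ocr(file_path)
            if ocr_text:
                # Outcome 1: OCR fallback produced text.
                return ocr_text
        # Outcome 2: forced failure and no usable OCR result.
        return None


normal = StubTestProcessor({})._process_pdf_document("a.pdf")
fallback = StubTestProcessor({}, force_llmsherpa_fail=True)._process_pdf_document("a.pdf")
no_ocr = StubTestProcessor(
    {}, force_llmsherpa_fail=True, ocr_available=False
)._process_pdf_document("a.pdf")
```

Because None covers several distinct failure causes, callers that need to tell them apart would have to inspect the log output rather than the return value.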
Class Interface
Methods
__init__(self, config, force_llmsherpa_fail=False)
Purpose: Initializes the TestDocumentProcessor with configuration and optional failure simulation flag
Parameters:
config: Configuration object for the parent DocumentProcessor class
force_llmsherpa_fail: Boolean flag to control whether to simulate llmsherpa failures (default: False)
Returns: None (constructor)
_process_pdf_document(self, file_path) -> str | None
Purpose: Overrides parent method to optionally simulate llmsherpa failures and trigger OCR fallback for testing
Parameters:
file_path: String path to the PDF file to be processed
Returns: String containing extracted text from OCR fallback if successful, None if processing fails, or result from parent class if not simulating failure
Attributes
| Name | Type | Description | Scope |
|---|---|---|---|
| force_llmsherpa_fail | bool | Flag that determines whether to simulate llmsherpa processing failures. When True, causes _process_pdf_document to skip normal processing and trigger OCR fallback | instance |
| config | object | Configuration object inherited from parent DocumentProcessor class, contains settings for document processing | instance |
Dependencies
- logging
- pdf2image
- PIL
- pytesseract
- easyocr
- utils.document_processor
Required Imports
from utils.document_processor import DocumentProcessor
import logging
Conditional/Optional Imports
These imports are only needed under specific conditions:
from pdf2image import convert_from_path
Condition: only when force_llmsherpa_fail is True and OCR fallback is triggered
Optional
from PIL import Image
Condition: only when force_llmsherpa_fail is True and OCR fallback is triggered
Optional
import pytesseract
Condition: only when force_llmsherpa_fail is True and OCR fallback is triggered
Optional
import easyocr
Condition: only when force_llmsherpa_fail is True and OCR fallback is triggered
Optional
Usage Example
# Basic instantiation for normal testing
# TestDocumentProcessor is defined in test_ocr_fallback.py; import or define it before running this
from utils.document_processor import DocumentProcessor
import logging

# Surface the INFO/WARNING/ERROR messages emitted during the fallback
logging.basicConfig(level=logging.INFO)

# Placeholder config; replace with the configuration DocumentProcessor expects
config = {'some': 'configuration'}

# Create processor with normal behavior
processor = TestDocumentProcessor(config, force_llmsherpa_fail=False)
result = processor._process_pdf_document('/path/to/document.pdf')

# Create processor that simulates llmsherpa failure
test_processor = TestDocumentProcessor(config, force_llmsherpa_fail=True)
result_with_fallback = test_processor._process_pdf_document('/path/to/document.pdf')

# The second call simulates the failure and attempts OCR fallback
if result_with_fallback:
    print('OCR fallback succeeded')
else:
    print('Simulated llmsherpa failure; OCR fallback returned no text')
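Before exercising the forced-failure path, it can help to know whether the OCR imports will resolve at all, since a single missing module makes the method return None. The following is a minimal sketch using the standard library's importlib.util.find_spec; the helper name missing_ocr_modules is hypothetical and not part of the documented class. Note that locating a module does not guarantee the import succeeds, and pytesseract and pdf2image additionally depend on external binaries (Tesseract and Poppler) that this check does not cover.

```python
import importlib.util

# Top-level module names behind the four imports probed by the override.
OCR_MODULES = ("pdf2image", "PIL", "pytesseract", "easyocr")


def missing_ocr_modules(modules=OCR_MODULES):
    """Return the names of modules that cannot be located by this interpreter."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = missing_ocr_modules()
if missing:
    print("OCR fallback would return None; missing:", ", ".join(missing))
else:
    print("All OCR modules can be located")
```

A test suite could use the same helper to skip OCR-dependent tests instead of letting them fail on a bare None.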
Best Practices
- Use force_llmsherpa_fail=True only in test environments to validate error handling and fallback mechanisms
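- Install all four OCR imports (pdf2image, PIL, pytesseract, easyocr) before testing with force_llmsherpa_fail=True; if any one of them fails to import, the method logs a warning and returns None without attempting OCR
- The OCR fallback calls the parent's _extract_text_with_ocr method; if DocumentProcessor does not define it, the forced-failure path returns None without logging any further warning, so verify the method exists before use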
- Monitor logs when using this class as it generates informative logging messages about failure simulation and fallback attempts
- This class is designed for testing purposes only and should not be used in production environments
- The simulated error message 'return_dict' is only logged, not raised; according to the source comment it reproduces the llmsherpa error observed in practice, which keeps the log output realistic
- Call _process_pdf_document method with valid PDF file paths to trigger the processing logic
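Several of the practices above (patching out real OCR, checking log messages, handling None) can be combined in a unit test. The sketch below is an illustration under stated assumptions: FakeProcessor is a hypothetical stand-in that copies only the forced-failure branch, and the logger name "fake_processor" is invented for the example. With the real class, the same mock.patch.object and assertLogs calls would target a TestDocumentProcessor instance and its module logger instead.

```python
import logging
import unittest
from unittest import mock


class FakeProcessor:
    """Hypothetical stand-in reproducing only the forced-failure branch."""

    def __init__(self, force_llmsherpa_fail=False):
        self.force_llmsherpa_fail = force_llmsherpa_fail

    def _extract_text_with_ocr(self, file_path):
        raise RuntimeError("real OCR should be patched out in unit tests")

    def _process_pdf_document(self, file_path):
        logger = logging.getLogger("fake_processor")
        if self.force_llmsherpa_fail:
            logger.error("Error processing PDF %s with llmsherpa: 'return_dict'", file_path)
            ocr_text = self._extract_text_with_ocr(file_path)
            if ocr_text:
                logger.info("OCR fallback successful")
                return ocr_text
            logger.warning("OCR fallback also failed")
        return None


class ForcedFailureTest(unittest.TestCase):
    def test_fallback_returns_ocr_text(self):
        proc = FakeProcessor(force_llmsherpa_fail=True)
        # Replace the OCR call so no OCR library or PDF file is needed.
        with mock.patch.object(proc, "_extract_text_with_ocr", return_value="text") as ocr:
            with self.assertLogs("fake_processor", level="INFO") as logs:
                result = proc._process_pdf_document("sample.pdf")
        self.assertEqual(result, "text")
        ocr.assert_called_once_with("sample.pdf")
        self.assertTrue(any("OCR fallback successful" in line for line in logs.output))

    def test_empty_ocr_result_returns_none(self):
        proc = FakeProcessor(force_llmsherpa_fail=True)
        with mock.patch.object(proc, "_extract_text_with_ocr", return_value=""):
            with self.assertLogs("fake_processor", level="WARNING") as logs:
                result = proc._process_pdf_document("sample.pdf")
        self.assertIsNone(result)
        self.assertTrue(any("OCR fallback also failed" in line for line in logs.output))


# Run the suite in-process so the example does not call sys.exit().
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ForcedFailureTest)
outcome = unittest.TextTestRunner(verbosity=1).run(suite)
```

Patching the OCR method keeps the test fast and deterministic; a separate integration test with a real PDF would still be needed to confirm that the OCR libraries themselves work.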
Similar Components
AI-powered semantic similarity - components with related functionality:
- function test_document_processor (81.7% similar)
- function test_ocr_fallback (77.6% similar)
- class DocumentProcessor_v1 (76.5% similar)
- class DocumentProcessor_v2 (76.4% similar)
- function test_document_processing (72.7% similar)