function test_ocr_fallback
A test function that validates OCR fallback functionality when the primary llmsherpa PDF text extraction method fails.
/tf/active/vicechatdev/contract_validity_analyzer/test_ocr_fallback.py
59 - 100
moderate
Purpose
This function checks the robustness of a document processing system: when llmsherpa (the primary PDF text extraction tool) fails, the system should fall back to OCR-based text extraction. It runs two checks, one with normal llmsherpa processing and one with a simulated llmsherpa failure that should trigger the OCR fallback. The function looks for PDF files directly inside the '/tf/active/' directory (the glob is not recursive) and uses the first match as the test file. Results are printed rather than asserted, so a failed check does not raise an error.
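The primary-then-fallback behaviour described above can be sketched as follows. This is a minimal illustration, not the project's `DocumentProcessor` code: `extract_with_fallback`, `working_primary`, `failing_primary`, and `fake_ocr` are hypothetical names standing in for llmsherpa and OCR extraction.

```python
from typing import Callable, Optional

Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(path: str, primary: Extractor, fallback: Extractor) -> Optional[str]:
    """Try the primary extractor; use the fallback if it raises or returns no text."""
    try:
        text = primary(path)
    except Exception:
        # Any primary failure is treated as "no text" (assumption for this sketch).
        text = None
    if text and text.strip():
        return text
    return fallback(path)


# Hypothetical stand-ins for llmsherpa and OCR extraction.
def working_primary(path: str) -> Optional[str]:
    return f"primary text from {path}"


def failing_primary(path: str) -> Optional[str]:
    raise RuntimeError("simulated llmsherpa failure")


def fake_ocr(path: str) -> Optional[str]:
    return f"ocr text from {path}"
```

Passing `failing_primary` plays the same role as `force_llmsherpa_fail=True` in the test: the primary path is made to fail so that only the fallback can produce text.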
Source Code
def test_ocr_fallback():
    """Test OCR fallback when llmsherpa fails"""
    processor_config = {
        'supported_extensions': ['.pdf', '.doc', '.docx'],
        'max_file_size_mb': 50,
        'text_extraction_timeout': 300
    }
    print("Testing OCR fallback when llmsherpa fails")
    print("=" * 50)
    # Find a test PDF
    import glob
    pdf_files = glob.glob('/tf/active/*.pdf')
    if not pdf_files:
        print("No PDF files found for testing")
        return
    test_file = pdf_files[0]
    print(f"Using test file: {os.path.basename(test_file)}")
    # Test 1: Normal processing (should use llmsherpa)
    print("\n1. Testing normal processing:")
    processor_normal = TestDocumentProcessor(processor_config, force_llmsherpa_fail=False)
    text_normal = processor_normal.extract_text(test_file)
    if text_normal:
        print(f"✓ Normal processing extracted {len(text_normal)} characters")
    else:
        print("✗ Normal processing failed")
    # Test 2: Simulated llmsherpa failure (should trigger OCR)
    print("\n2. Testing OCR fallback after simulated llmsherpa failure:")
    processor_ocr = TestDocumentProcessor(processor_config, force_llmsherpa_fail=True)
    text_ocr = processor_ocr.extract_text(test_file)
    if text_ocr:
        print(f"✓ OCR fallback extracted {len(text_ocr)} characters")
        print("✓ OCR fallback is working correctly!")
    else:
        print("✗ OCR fallback failed")
    print("\nTest completed")
Return Value
This function does not return any value (implicitly returns None). It prints test results to stdout, including success/failure indicators and character counts of extracted text.
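Because the outcome is only printed, a test runner cannot assert on it. One possible approach, shown here as a sketch rather than part of the original function, is to collect the two extraction results in a dictionary. The helper name `summarize_extraction` and its keys are hypothetical.

```python
from typing import Dict, Optional, Union


def summarize_extraction(
    text_normal: Optional[str], text_ocr: Optional[str]
) -> Dict[str, Union[bool, int]]:
    """Collect the two outcomes the test prints into a dictionary a caller can check."""
    return {
        "normal_ok": bool(text_normal),        # True if normal processing returned text
        "normal_chars": len(text_normal or ""),
        "ocr_ok": bool(text_ocr),              # True if the OCR fallback returned text
        "ocr_chars": len(text_ocr or ""),
    }
```

A caller could then write, for example, `assert summarize_extraction(text_normal, text_ocr)["ocr_ok"]` instead of reading stdout.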
Dependencies
- sys
- os
- logging
- tempfile
- shutil
- glob
- pdf2image
- PIL
- pytesseract
- easyocr
Required Imports
import sys
import os
from utils.document_processor import DocumentProcessor
import logging
import tempfile
import shutil
import glob
from pdf2image import convert_from_path
from PIL import Image
import pytesseract
import easyocr
Conditional/Optional Imports
These imports are only needed under specific conditions:
import glob
Condition: imported inside the function to search for PDF test files
Required (conditional)
Usage Example
# Ensure TestDocumentProcessor class is defined before calling
# Example TestDocumentProcessor implementation:
class TestDocumentProcessor(DocumentProcessor):
    def __init__(self, config, force_llmsherpa_fail=False):
        super().__init__(config)
        self.force_llmsherpa_fail = force_llmsherpa_fail

    def extract_text(self, file_path):
        if self.force_llmsherpa_fail:
            # Simulate llmsherpa failure, trigger OCR
            return self._ocr_fallback(file_path)
        return super().extract_text(file_path)
# Place test PDF files in /tf/active/ directory
# Run the test
test_ocr_fallback()
# Expected output (file name and character counts are illustrative):
# Testing OCR fallback when llmsherpa fails
# ==================================================
# Using test file: example.pdf
#
# 1. Testing normal processing:
# ✓ Normal processing extracted 5432 characters
#
# 2. Testing OCR fallback after simulated llmsherpa failure:
# ✓ OCR fallback extracted 5123 characters
# ✓ OCR fallback is working correctly!
#
# Test completed
Best Practices
- Ensure TestDocumentProcessor class is properly implemented with force_llmsherpa_fail parameter support
- Place test PDF files in the expected '/tf/active/' directory before running the test
- Install and configure Tesseract OCR system-wide before running OCR tests
- Monitor disk space as OCR processing may create temporary image files from PDF pages
- This is a test function and should not be used in production code - it's meant for validation during development
- The function uses hardcoded paths ('/tf/active/') which may need adjustment for different environments
- The function already returns early with a message when no PDF files are found; consider adding similar handling for missing OCR dependencies
- The function prints results to stdout - consider using logging for better integration with test frameworks
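Several of the points above (the hardcoded path, missing OCR dependencies, logging instead of printing) could be addressed with small helpers. The sketch below is one way to do it under stated assumptions: `find_test_pdf`, `tesseract_available`, and the `OCR_TEST_PDF_DIR` environment variable are hypothetical and not part of the original module.

```python
import glob
import logging
import os
import shutil
from typing import Optional

logger = logging.getLogger("ocr_fallback_test")


def find_test_pdf(search_dir: Optional[str] = None) -> Optional[str]:
    """Return the first PDF in the directory (sorted for determinism), or None."""
    # OCR_TEST_PDF_DIR is a hypothetical override; '/tf/active/' is the original default.
    directory = search_dir or os.environ.get("OCR_TEST_PDF_DIR", "/tf/active/")
    pdf_files = sorted(glob.glob(os.path.join(directory, "*.pdf")))
    if not pdf_files:
        logger.warning("No PDF files found in %s", directory)
        return None
    return pdf_files[0]


def tesseract_available() -> bool:
    """Report whether a 'tesseract' executable is on PATH (needed by pytesseract)."""
    return shutil.which("tesseract") is not None
```

A test could skip the OCR check when `tesseract_available()` is false, and skip entirely when `find_test_pdf()` returns `None`, rather than reporting a misleading failure.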
Similar Components
AI-powered semantic similarity - components with related functionality:
- class TestDocumentProcessor (77.6% similar)
- function test_ocr_retry_logic (73.1% similar)
- function test_document_processor (71.8% similar)
- function test_extraction_methods (68.3% similar)
- function test_enhanced_pdf_processing (66.6% similar)