test_ocr_fallback - Code Extractor

function test_ocr_fallback

Maturity: 46

A test function that validates OCR fallback functionality when the primary llmsherpa PDF text extraction method fails.

File:
/tf/active/vicechatdev/contract_validity_analyzer/test_ocr_fallback.py

Lines:
59 - 100

Complexity:
moderate

Purpose

This function exercises a document processing system by checking that when llmsherpa (the primary PDF text extraction tool) fails, the system falls back to OCR-based text extraction. It runs two checks against the same file: one with normal llmsherpa processing and one with a simulated llmsherpa failure that should trigger the OCR fallback. The function looks for PDF files directly in the '/tf/active/' directory (the glob is not recursive) and uses the first match as the test file.
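
The fallback behaviour described above can be sketched independently of the real processor. This is a minimal illustration, not the DocumentProcessor implementation: `extract_with_fallback`, `failing_primary`, and `fake_ocr` are hypothetical names introduced here, and no llmsherpa or OCR call is made.

```python
from typing import Callable, Optional

Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(path: str, primary: Extractor, fallback: Extractor) -> Optional[str]:
    """Return text from `primary`; use `fallback` if primary raises or returns nothing."""
    try:
        text = primary(path)
    except Exception:
        # Any error in the primary extractor counts as a failed extraction.
        text = None
    if text:
        return text
    return fallback(path)


def failing_primary(path: str) -> Optional[str]:
    # Hypothetical stand-in for an llmsherpa call that fails.
    raise RuntimeError("simulated llmsherpa failure")


def fake_ocr(path: str) -> Optional[str]:
    # Hypothetical stand-in for an OCR pass; no real OCR is run.
    return f"ocr text for {path}"


result = extract_with_fallback("contract.pdf", failing_primary, fake_ocr)
print(result)  # prints: ocr text for contract.pdf
```

Passing the extractors in as callables keeps the fallback logic testable without a PDF, a running llmsherpa service, or an OCR engine.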

Source Code

def test_ocr_fallback():
    """Test OCR fallback when llmsherpa fails"""
    
    processor_config = {
        'supported_extensions': ['.pdf', '.doc', '.docx'],
        'max_file_size_mb': 50,
        'text_extraction_timeout': 300
    }
    
    print("Testing OCR fallback when llmsherpa fails")
    print("=" * 50)
    
    # Find a test PDF
    import glob
    pdf_files = glob.glob('/tf/active/*.pdf')
    if not pdf_files:
        print("No PDF files found for testing")
        return
    
    test_file = pdf_files[0]
    print(f"Using test file: {os.path.basename(test_file)}")
    
    # Test 1: Normal processing (should use llmsherpa)
    print("\n1. Testing normal processing:")
    processor_normal = TestDocumentProcessor(processor_config, force_llmsherpa_fail=False)
    text_normal = processor_normal.extract_text(test_file)
    if text_normal:
        print(f"✓ Normal processing extracted {len(text_normal)} characters")
    else:
        print("✗ Normal processing failed")
    
    # Test 2: Simulated llmsherpa failure (should trigger OCR)
    print("\n2. Testing OCR fallback after simulated llmsherpa failure:")
    processor_ocr = TestDocumentProcessor(processor_config, force_llmsherpa_fail=True)
    text_ocr = processor_ocr.extract_text(test_file)
    if text_ocr:
        print(f"✓ OCR fallback extracted {len(text_ocr)} characters")
        print("✓ OCR fallback is working correctly!")
    else:
        print("✗ OCR fallback failed")
    
    print("\nTest completed")

Return Value

This function does not return any value (implicitly returns None). It prints test results to stdout, including success/failure indicators and character counts of extracted text.
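
Because the outcome is only printed, a caller that wants to check it programmatically has to capture stdout. The sketch below shows one way to do that with the standard library. `run_and_capture` and `demo_test` are hypothetical names, and `demo_test` only imitates the reporting style of test_ocr_fallback rather than calling it.

```python
import contextlib
import io


def run_and_capture(func):
    """Call func() and return (return_value, text_written_to_stdout)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        value = func()
    return value, buffer.getvalue()


def demo_test():
    # Hypothetical stand-in that prints a result line and returns None.
    print("✓ OCR fallback extracted 5123 characters")


value, output = run_and_capture(demo_test)
# The success marker has to be searched for in the captured text,
# since the return value carries no information.
ocr_ok = "✓ OCR fallback extracted" in output
```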

Dependencies

sys
os
logging
tempfile
shutil
glob
pdf2image
PIL
pytesseract
easyocr
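
Several of these dependencies are third-party packages, and pytesseract additionally needs the Tesseract executable. A small pre-flight check along the following lines may help explain a failing run. It is a sketch under assumptions: `missing_modules` and `tesseract_on_path` are hypothetical helpers, and the check only confirms that the modules and the binary can be located, not that they work.

```python
import importlib.util
import shutil

# Top-level module names taken from the dependency list above.
OCR_MODULES = ["pdf2image", "PIL", "pytesseract", "easyocr"]


def missing_modules(names):
    """Return the module names that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def tesseract_on_path():
    """Return True if a 'tesseract' executable is found on PATH."""
    return shutil.which("tesseract") is not None


missing = missing_modules(OCR_MODULES)
if missing:
    print("Missing Python packages:", ", ".join(missing))
if not tesseract_on_path():
    print("Tesseract executable not found on PATH")
```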

Required Imports

import sys
import os
from utils.document_processor import DocumentProcessor
import logging
import tempfile
import shutil
import glob
from pdf2image import convert_from_path
from PIL import Image
import pytesseract
import easyocr

Conditional/Optional Imports

These imports are only needed under specific conditions:

import glob

Condition: imported inside the function to search for PDF test files

Required (conditional)

Usage Example

# Ensure TestDocumentProcessor class is defined before calling
# Example TestDocumentProcessor implementation:
class TestDocumentProcessor(DocumentProcessor):
    def __init__(self, config, force_llmsherpa_fail=False):
        super().__init__(config)
        self.force_llmsherpa_fail = force_llmsherpa_fail
    
    def extract_text(self, file_path):
        if self.force_llmsherpa_fail:
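            # Simulate llmsherpa failure and go straight to OCR.
            # NOTE: _ocr_fallback is an assumed method name; substitute the
            # OCR entry point that DocumentProcessor actually provides.
            return self._ocr_fallback(file_path)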
        return super().extract_text(file_path)

# Place test PDF files in /tf/active/ directory
# Run the test
test_ocr_fallback()

# Example output (file name and character counts will vary):
# Testing OCR fallback when llmsherpa fails
# ==================================================
# Using test file: example.pdf
# 
# 1. Testing normal processing:
# ✓ Normal processing extracted 5432 characters
# 
# 2. Testing OCR fallback after simulated llmsherpa failure:
# ✓ OCR fallback extracted 5123 characters
# ✓ OCR fallback is working correctly!
# 
# Test completed

Best Practices

Ensure TestDocumentProcessor class is properly implemented with force_llmsherpa_fail parameter support
Place test PDF files in the expected '/tf/active/' directory before running the test
Install and configure Tesseract OCR system-wide before running OCR tests
Monitor disk space as OCR processing may create temporary image files from PDF pages
This is a test function and should not be used in production code - it's meant for validation during development
The function uses hardcoded paths ('/tf/active/') which may need adjustment for different environments
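When no PDF files are found, the function prints a message and returns early, so the run can look like a pass; consider failing or skipping explicitly in that case, and add error handling for missing OCR dependencies
The function prints results to stdout and never asserts, so a test runner will report it as passed even when extraction returns nothing; consider using assertions and logging for better integration with test frameworks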
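
The points above about hardcoded paths and print-based reporting can be sketched as two small helpers. This is an illustration only: `find_test_pdf`, `check_extraction`, and the `OCR_TEST_PDF_DIR` environment variable are hypothetical names that do not exist in the original module.

```python
import glob
import logging
import os

logger = logging.getLogger("test_ocr_fallback")


def find_test_pdf(directory=None):
    """Return the first PDF (sorted by name) in the directory, or None."""
    # OCR_TEST_PDF_DIR is an assumed override; '/tf/active' is the original default.
    directory = directory or os.environ.get("OCR_TEST_PDF_DIR", "/tf/active")
    matches = sorted(glob.glob(os.path.join(directory, "*.pdf")))
    if not matches:
        logger.warning("No PDF files found in %s", directory)
        return None
    return matches[0]


def check_extraction(label, text):
    """Log the outcome and return True only when some text was extracted."""
    if text:
        logger.info("%s extracted %d characters", label, len(text))
        return True
    logger.error("%s failed", label)
    return False
```

A test could then write `assert check_extraction("OCR fallback", text_ocr)`, which makes a failed extraction visible to the test runner instead of only to whoever reads the console. Sorting the glob results also makes the choice of test file repeatable.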

Similar Components

AI-powered semantic similarity - components with related functionality:

class TestDocumentProcessor 77.6% similar

A test subclass of DocumentProcessor that simulates llmsherpa PDF processing failures and triggers OCR fallback mechanisms for testing purposes.
From: /tf/active/vicechatdev/contract_validity_analyzer/test_ocr_fallback.py

function test_ocr_retry_logic 73.1% similar

Tests the OCR retry logic for extracting contract end dates by first attempting normal text extraction, then falling back to OCR-based extraction if the end date is not found.
From: /tf/active/vicechatdev/contract_validity_analyzer/test_ocr_retry.py

function test_document_processor 71.8% similar

A test function that validates the DocumentProcessor component's ability to extract text from PDF files with improved error handling and llmsherpa integration.
From: /tf/active/vicechatdev/contract_validity_analyzer/test_improved_processor.py

function test_extraction_methods 68.3% similar

A test function that compares two PDF text extraction methods (regular llmsherpa and OCR-based Tesseract) on a specific purchase order document from FileCloud, checking for vendor name detection.
From: /tf/active/vicechatdev/contract_validity_analyzer/test_extraction_methods.py

function test_enhanced_pdf_processing 66.6% similar

A comprehensive test function that validates PDF processing capabilities, including text extraction, cleaning, chunking, and table detection across multiple PDF processing libraries.
From: /tf/active/vicechatdev/vice_ai/test_enhanced_pdf.py
