function test_ocr_fallback

Maturity: 46

A test function that validates OCR fallback functionality when the primary llmsherpa PDF text extraction method fails.

File: /tf/active/vicechatdev/contract_validity_analyzer/test_ocr_fallback.py
Lines: 59 - 100
Complexity: moderate

Purpose

This function checks the robustness of a document processing system: when llmsherpa (the primary PDF text extraction tool) fails, the system should fall back to OCR-based text extraction. It runs two checks on the same file: one with normal llmsherpa processing and one with a simulated llmsherpa failure that should trigger the OCR fallback. The function searches the top level of the '/tf/active/' directory for PDF files and uses the first match returned by glob as the test file; if no PDF is found, it prints a message and returns.
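The fallback behavior described above can be sketched in isolation. This is a minimal illustration of the primary-then-fallback pattern, not the DocumentProcessor implementation; the names extract_with_fallback, failing_primary, and stub_ocr are hypothetical stand-ins for the llmsherpa and OCR extraction steps.

```python
from typing import Callable, Optional


def extract_with_fallback(
    path: str,
    primary: Callable[[str], Optional[str]],
    fallback: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Try the primary extractor; use the fallback if it raises or returns nothing."""
    try:
        text = primary(path)
    except Exception:
        # Treat any failure of the primary extractor as a signal to fall back.
        text = None
    if text:
        return text
    return fallback(path)


# Hypothetical stand-in for a failing llmsherpa call.
def failing_primary(path: str) -> str:
    raise RuntimeError("simulated llmsherpa failure")


# Hypothetical stand-in for OCR-based extraction.
def stub_ocr(path: str) -> str:
    return f"OCR text from {path}"


result = extract_with_fallback("example.pdf", failing_primary, stub_ocr)
print(result)
```

Passing the extractors in as callables keeps the fallback logic testable without a real PDF or OCR engine.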

Source Code

def test_ocr_fallback():
    """Test OCR fallback when llmsherpa fails"""
    
    processor_config = {
        'supported_extensions': ['.pdf', '.doc', '.docx'],
        'max_file_size_mb': 50,
        'text_extraction_timeout': 300
    }
    
    print("Testing OCR fallback when llmsherpa fails")
    print("=" * 50)
    
    # Find a test PDF
    import glob
    pdf_files = glob.glob('/tf/active/*.pdf')
    if not pdf_files:
        print("No PDF files found for testing")
        return
    
    test_file = pdf_files[0]
    print(f"Using test file: {os.path.basename(test_file)}")
    
    # Test 1: Normal processing (should use llmsherpa)
    print("\n1. Testing normal processing:")
    processor_normal = TestDocumentProcessor(processor_config, force_llmsherpa_fail=False)
    text_normal = processor_normal.extract_text(test_file)
    if text_normal:
        print(f"✓ Normal processing extracted {len(text_normal)} characters")
    else:
        print("✗ Normal processing failed")
    
    # Test 2: Simulated llmsherpa failure (should trigger OCR)
    print("\n2. Testing OCR fallback after simulated llmsherpa failure:")
    processor_ocr = TestDocumentProcessor(processor_config, force_llmsherpa_fail=True)
    text_ocr = processor_ocr.extract_text(test_file)
    if text_ocr:
        print(f"✓ OCR fallback extracted {len(text_ocr)} characters")
        print("✓ OCR fallback is working correctly!")
    else:
        print("✗ OCR fallback failed")
    
    print("\nTest completed")

Return Value

This function does not return a value (it implicitly returns None). It prints test results to stdout, including success/failure indicators and character counts of the extracted text. It makes no assertions, so a failed extraction is visible only in the printed output.
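Because the results are only printed, a caller that wants to inspect them programmatically has to capture stdout. The sketch below shows one way to do that with the standard library; run_and_capture and fake_test are hypothetical names, and fake_test merely imitates the output style of test_ocr_fallback.

```python
import contextlib
import io
from typing import Callable


def run_and_capture(func: Callable[[], None]) -> str:
    """Call func() and return everything it printed to stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        func()
    return buffer.getvalue()


# Hypothetical stand-in that prints in the same style as test_ocr_fallback.
def fake_test() -> None:
    print("✓ OCR fallback extracted 120 characters")
    print("Test completed")


output = run_and_capture(fake_test)
# The success marker in the captured text indicates the fallback branch passed.
fallback_ok = "✓ OCR fallback extracted" in output
```

In practice, test_ocr_fallback itself would be passed to run_and_capture in place of fake_test.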

Dependencies

  • sys
  • os
  • logging
  • tempfile
  • shutil
  • glob
  • pdf2image
  • PIL
  • pytesseract
  • easyocr
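Several of the dependencies listed above are third-party OCR packages that also rely on external programs. The following sketch, which is not part of the original test file, checks whether they are available before the test runs; missing_ocr_dependencies is a hypothetical helper name.

```python
import importlib.util
import shutil


def missing_ocr_dependencies(modules=("pdf2image", "PIL", "pytesseract", "easyocr")):
    """Return the names of the given modules that cannot be found for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = missing_ocr_dependencies()
if missing:
    print(f"Missing Python packages: {', '.join(missing)}")

# pytesseract is a wrapper; it also needs the Tesseract executable on PATH.
if shutil.which("tesseract") is None:
    print("Tesseract executable not found on PATH")

# pdf2image relies on Poppler's pdftoppm to render PDF pages to images.
if shutil.which("pdftoppm") is None:
    print("Poppler (pdftoppm) not found on PATH")
```

find_spec only locates a module without importing it, so the check stays cheap even for heavy packages such as easyocr.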

Required Imports

import sys
import os
from utils.document_processor import DocumentProcessor
import logging
import tempfile
import shutil
import glob
from pdf2image import convert_from_path
from PIL import Image
import pytesseract
import easyocr

Conditional/Optional Imports

These imports are only needed under specific conditions:

import glob

Condition: imported inside the function body to search for PDF test files; required whenever the function runs.

Usage Example

# TestDocumentProcessor must be defined before test_ocr_fallback() is called.
# Illustrative implementation only; the real class is defined in test_ocr_fallback.py:
class TestDocumentProcessor(DocumentProcessor):
    def __init__(self, config, force_llmsherpa_fail=False):
        super().__init__(config)
        self.force_llmsherpa_fail = force_llmsherpa_fail
    
    def extract_text(self, file_path):
        if self.force_llmsherpa_fail:
            # Simulate llmsherpa failure and go straight to OCR.
            # _ocr_fallback is a placeholder name; the actual DocumentProcessor method may differ.
            return self._ocr_fallback(file_path)
        return super().extract_text(file_path)

# Place at least one test PDF directly in /tf/active/ (subdirectories are not searched)
# Run the test
test_ocr_fallback()

# Example output (file name and character counts will vary):
# Testing OCR fallback when llmsherpa fails
# ==================================================
# Using test file: example.pdf
# 
# 1. Testing normal processing:
# ✓ Normal processing extracted 5432 characters
# 
# 2. Testing OCR fallback after simulated llmsherpa failure:
# ✓ OCR fallback extracted 5123 characters
# ✓ OCR fallback is working correctly!
# 
# Test completed

Best Practices

  • Ensure TestDocumentProcessor class is properly implemented with force_llmsherpa_fail parameter support
  • Place at least one test PDF directly in the '/tf/active/' directory before running the test; the glob pattern is not recursive, and the first match is used
  • Install and configure Tesseract OCR system-wide before running OCR tests
  • Monitor disk space as OCR processing may create temporary image files from PDF pages
  • This is a test function and should not be used in production code - it's meant for validation during development
  • The function uses hardcoded paths ('/tf/active/') which may need adjustment for different environments
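  • The function already returns early when no PDF files are found; consider adding explicit error handling for missing OCR dependencies
  • The function prints results to stdout and makes no assertions, so a test runner will typically report it as passing even when extraction fails; consider adding assertions and using logging for better integration with test frameworks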
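The last few suggestions can be sketched as follows. This is one possible shape under stated assumptions, not the project's code: find_test_pdf and check_extraction are hypothetical helpers that make file selection deterministic, take the directory as a parameter instead of hardcoding it, and replace printed status lines with an assertion.

```python
import glob
import os
from typing import Callable, Optional


def find_test_pdf(directory: str) -> Optional[str]:
    """Return the alphabetically first PDF in directory, or None if there is none."""
    matches = sorted(glob.glob(os.path.join(directory, "*.pdf")))
    return matches[0] if matches else None


def check_extraction(extract: Callable[[str], Optional[str]], pdf_path: str) -> int:
    """Assert that extract() returns non-empty text and return its length."""
    text = extract(pdf_path)
    assert text, f"no text extracted from {os.path.basename(pdf_path)}"
    return len(text)
```

In the real test, extract would be the extract_text method of a TestDocumentProcessor created with force_llmsherpa_fail=True, and the directory could be supplied by the test runner or an environment variable rather than fixed to '/tf/active/'.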

Similar Components

AI-powered semantic similarity - components with related functionality:

  • class TestDocumentProcessor 77.6% similar

    A test subclass of DocumentProcessor that simulates llmsherpa PDF processing failures and triggers OCR fallback mechanisms for testing purposes.

    From: /tf/active/vicechatdev/contract_validity_analyzer/test_ocr_fallback.py
  • function test_ocr_retry_logic 73.1% similar

    Tests the OCR retry logic for extracting contract end dates by first attempting normal text extraction, then falling back to OCR-based extraction if the end date is not found.

    From: /tf/active/vicechatdev/contract_validity_analyzer/test_ocr_retry.py
  • function test_document_processor 71.8% similar

    A test function that validates the DocumentProcessor component's ability to extract text from PDF files with improved error handling and llmsherpa integration.

    From: /tf/active/vicechatdev/contract_validity_analyzer/test_improved_processor.py
  • function test_extraction_methods 68.3% similar

    A test function that compares two PDF text extraction methods (regular llmsherpa and OCR-based Tesseract) on a specific purchase order document from FileCloud, checking for vendor name detection.

    From: /tf/active/vicechatdev/contract_validity_analyzer/test_extraction_methods.py
  • function test_enhanced_pdf_processing 66.6% similar

    A comprehensive test function that validates PDF processing capabilities, including text extraction, cleaning, chunking, and table detection across multiple PDF processing libraries.

    From: /tf/active/vicechatdev/vice_ai/test_enhanced_pdf.py