
function test_extraction_methods

Maturity: 49

A diagnostic test function that runs two PDF text extraction methods (regular llmsherpa extraction and Tesseract-based OCR) on one specific purchase order stored in FileCloud, and checks whether each method's output contains the vendor name 'Promega'.

File: /tf/active/vicechatdev/contract_validity_analyzer/test_extraction_methods.py
Lines: 14 - 134
Complexity: moderate

Purpose

This function is a diagnostic tool for comparing two PDF text extraction approaches. It downloads one hardcoded purchase order PDF from FileCloud, saves it to a temporary file, and processes it with both regular text extraction (llmsherpa, via process_document) and OCR extraction (Tesseract, via process_document_with_ocr). For each method it prints the character count, the first 2000 and last 500 characters of the extracted text, and whether the string 'Promega' appears in it. If both extractions succeed, it prints a summary and a recommendation: use OCR when only the OCR text contains 'Promega', keep regular extraction when the regular text contains it, and flag the document for manual review when neither does.
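
The recommendation step can be isolated as a small pure function, which makes the branch order easier to see and to test without FileCloud access. This is a minimal sketch, not part of the original module: the name recommend_method and its return labels are hypothetical, and it assumes both extractions have already succeeded.

```python
def recommend_method(regular_text: str, ocr_text: str, vendor: str = "Promega") -> str:
    """Mirror the branch order of the comparison step in test_extraction_methods.

    Hypothetical helper: returns a short label instead of printing.
    """
    in_regular = vendor in regular_text  # case-sensitive substring check
    in_ocr = vendor in ocr_text

    if in_ocr and not in_regular:
        return "ocr"            # only OCR recovered the vendor name
    if in_regular:
        return "regular"        # regular extraction is sufficient
    return "manual_review"      # neither method found the vendor name
```

Note that when both texts contain the vendor name the sketch returns "regular", matching the original's preference for the cheaper method.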

Source Code

def test_extraction_methods():
    """Test both regular and OCR extraction on a sample contract"""
    print("="*80)
    print("Testing PDF Text Extraction Methods")
    print("="*80)
    print()
    
    # Initialize
    config = Config()
    doc_processor = DocumentProcessor(config)
    fc_client = FileCloudClient(config.get_section('filecloud'))
    
    # Connect to FileCloud
    print("Connecting to FileCloud...")
    if not fc_client.connect():
        print("❌ Failed to connect to FileCloud")
        return
    print("✅ Connected")
    print()
    
    # Download the purchase order
    file_path = "/SHARED/vicebio_shares/00_Company_Governance/08_Third Parties Management/Third Parties/Promega Benelux B.V/4. PO/2025-09-05_PO-0027 - DNA extraction kit/Purchase Order PO-0027_signed.pdf"
    
    print(f"Downloading: {file_path}")
    content = fc_client.download_document(file_path)
    
    if not content:
        print("❌ Failed to download")
        return
    
    # Save to temp file
    import tempfile
    with tempfile.NamedTemporaryFile(mode='wb', suffix='.pdf', delete=False) as tmp:
        tmp.write(content)
        local_path = tmp.name
    
    print(f"✅ Downloaded to: {local_path}")
    print()
    
    # Test Method 1: Regular extraction (llmsherpa)
    print("-"*80)
    print("Method 1: Regular Extraction (llmsherpa)")
    print("-"*80)
    result1 = doc_processor.process_document(local_path)
    
    if result1.get('success'):
        text1 = result1.get('text', '')
        print(f"✅ Extracted {len(text1)} characters")
        print()
        print("First 2000 characters:")
        print(text1[:2000])
        print()
        print("Last 500 characters:")
        print(text1[-500:])
        print()
        
        # Check for vendor name
        if 'Promega' in text1:
            print("✅ VENDOR NAME FOUND: 'Promega' appears in text")
        else:
            print("❌ VENDOR NAME NOT FOUND: 'Promega' not in extracted text")
        print()
    else:
        print(f"❌ Extraction failed: {result1.get('error')}")
        print()
    
    # Test Method 2: OCR extraction
    print("-"*80)
    print("Method 2: OCR Extraction (Tesseract)")
    print("-"*80)
    result2 = doc_processor.process_document_with_ocr(local_path)
    
    if result2.get('success'):
        text2 = result2.get('text', '')
        print(f"✅ Extracted {len(text2)} characters")
        print()
        print("First 2000 characters:")
        print(text2[:2000])
        print()
        print("Last 500 characters:")
        print(text2[-500:])
        print()
        
        # Check for vendor name
        if 'Promega' in text2:
            print("✅ VENDOR NAME FOUND: 'Promega' appears in OCR text")
        else:
            print("❌ VENDOR NAME NOT FOUND: 'Promega' not in OCR text")
        print()
    else:
        print(f"❌ OCR extraction failed: {result2.get('error')}")
        print()
    
    # Comparison
    print("="*80)
    print("Comparison Summary")
    print("="*80)
    if result1.get('success') and result2.get('success'):
        text1 = result1.get('text', '')
        text2 = result2.get('text', '')
        print(f"Regular extraction: {len(text1)} chars")
        print(f"OCR extraction:     {len(text2)} chars")
        print()
        print(f"'Promega' in regular: {'YES ✅' if 'Promega' in text1 else 'NO ❌'}")
        print(f"'Promega' in OCR:     {'YES ✅' if 'Promega' in text2 else 'NO ❌'}")
        print()
        
        # Recommend method
        if 'Promega' in text2 and 'Promega' not in text1:
            print("🎯 RECOMMENDATION: Use OCR extraction for better vendor detection")
        elif 'Promega' in text1:
            print("✅ Regular extraction is sufficient")
        else:
            print("⚠️  Neither method found vendor name - may need manual review")
    
    # Cleanup
    import os
    os.remove(local_path)
    fc_client.disconnect()
    print()
    print("="*80)

Return Value

This function returns None in every case. It returns early, after printing an error message, if the FileCloud connection or the download fails. All diagnostic information is written to stdout: connection status, extraction results, text samples, vendor name detection results, and the method recommendation.
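
Because the function only prints, a caller that wants to inspect the results has to capture stdout. The following is a minimal sketch of one way to do that with the standard library; run_and_capture is a hypothetical helper, and fake_diagnostic is a stand-in for test_extraction_methods so the example runs without FileCloud.

```python
import contextlib
import io
from typing import Callable


def run_and_capture(func: Callable[[], None]) -> str:
    """Call func and return everything it printed to stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        func()
    return buffer.getvalue()


def fake_diagnostic() -> None:
    # Stand-in for test_extraction_methods, which needs live FileCloud access.
    print("✅ VENDOR NAME FOUND: 'Promega' appears in text")


output = run_and_capture(fake_diagnostic)
vendor_found = "VENDOR NAME FOUND" in output
```

With the real function, run_and_capture(test_extraction_methods) would return the full diagnostic text, which can then be searched for the status lines shown in the source code.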

Dependencies

  • sys
  • pathlib
  • tempfile
  • os
  • json

Required Imports

import sys
from pathlib import Path
from utils.document_processor import DocumentProcessor
from utils.filecloud_client import FileCloudClient
from config.config import Config
import json
import tempfile
import os

Usage Example

# This is a standalone test function with no parameters
# Simply call it directly after ensuring all dependencies are configured

from test_extraction_methods import test_extraction_methods

# Run the test
test_extraction_methods()

# The function will:
# 1. Connect to FileCloud
# 2. Download the specified purchase order PDF
# 3. Extract text using regular method (llmsherpa)
# 4. Extract text using OCR method (Tesseract)
# 5. Compare results and check for vendor name 'Promega'
# 6. Print detailed diagnostic output
# 7. Provide recommendation on which method to use
# 8. Clean up temporary files and disconnect
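
In the source code, step 8 runs only when the function reaches its last lines. One way to guarantee that the temporary file is removed even if an extraction call raises is a small context manager. This is a sketch under that assumption; temporary_pdf is a hypothetical name and is not part of the original module.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def temporary_pdf(content: bytes):
    """Write content to a temporary .pdf file and always delete it afterwards."""
    # delete=False so the file can be reopened by path after it is closed.
    with tempfile.NamedTemporaryFile(mode="wb", suffix=".pdf", delete=False) as tmp:
        tmp.write(content)
        local_path = tmp.name
    try:
        yield local_path
    finally:
        # Runs on normal exit and when the body raises.
        if os.path.exists(local_path):
            os.remove(local_path)
```

Inside test_extraction_methods, the two extraction calls would sit in a `with temporary_pdf(content) as local_path:` block in place of the manual tempfile and os.remove steps.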

Best Practices

  • The file path is hardcoded; modify the file_path variable to test a different document
  • Ensure FileCloud credentials are properly configured in the Config object before running
  • Verify that both llmsherpa and Tesseract OCR services are available and properly configured
  • The function creates temporary files - ensure sufficient disk space and write permissions
  • This is a diagnostic/testing function and should not be used in production code
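  • Cleanup (removing the temporary file and disconnecting from FileCloud) runs only at the end of the normal path: a failed download returns without disconnecting, and an exception during extraction leaves the temporary file on disk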
  • Output is verbose and designed for human review - redirect stdout if running in automated tests
  • The vendor name check is case-sensitive and looks for the exact string 'Promega', so 'PROMEGA' or 'promega' in the extracted text would be reported as not found
  • Consider wrapping the function call in try-except blocks to handle potential connection or processing failures
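
If the exact, case-sensitive match proves too strict for a given document, a more tolerant check is one option. The sketch below is an assumption about what such a check could look like, not the original behaviour: contains_vendor is a hypothetical helper that ignores case and collapses whitespace, including line breaks, before matching.

```python
import re


def contains_vendor(text: str, vendor: str = "Promega") -> bool:
    """Case-insensitive vendor check that tolerates extra whitespace."""
    # Collapse runs of whitespace (including line breaks) to single spaces.
    normalized = re.sub(r"\s+", " ", text).casefold()
    needle = re.sub(r"\s+", " ", vendor).strip().casefold()
    if not needle:
        return False  # an empty vendor name should never count as a match
    return needle in normalized
```

For example, contains_vendor(text2, "Promega Benelux") would match even when the extracted text splits the two words across lines.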

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function test_ocr_fallback 68.3% similar

    A test function that validates OCR fallback functionality when the primary llmsherpa PDF text extraction method fails.

    From: /tf/active/vicechatdev/contract_validity_analyzer/test_ocr_fallback.py
  • function test_single_document 68.1% similar

    Tests end date extraction from a specific PDF document by downloading it from FileCloud, extracting text, and using LLM-based analysis to identify contract expiry dates.

    From: /tf/active/vicechatdev/contract_validity_analyzer/test_single_document.py
  • function test_ocr_retry_logic 66.5% similar

    Tests the OCR retry logic for extracting contract end dates by first attempting normal text extraction, then falling back to OCR-based extraction if the end date is not found.

    From: /tf/active/vicechatdev/contract_validity_analyzer/test_ocr_retry.py
  • function test_llm_extraction 64.3% similar

    A test function that validates LLM-based contract data extraction by processing a sample contract and verifying the extracted fields against expected values.

    From: /tf/active/vicechatdev/contract_validity_analyzer/test_extractor.py
  • function test_document_processor 64.3% similar

    A test function that validates the DocumentProcessor component's ability to extract text from PDF files with improved error handling and llmsherpa integration.

    From: /tf/active/vicechatdev/contract_validity_analyzer/test_improved_processor.py