function test_extraction_methods
A test function that compares two PDF text extraction methods (regular llmsherpa and OCR-based Tesseract) on a specific purchase order document from FileCloud, checking for vendor name detection.
File: /tf/active/vicechatdev/contract_validity_analyzer/test_extraction_methods.py
Lines: 14 - 134
Complexity: moderate
Purpose
This function serves as a diagnostic tool to evaluate and compare the effectiveness of two different PDF text extraction approaches. It downloads a specific purchase order PDF from FileCloud, processes it using both regular text extraction (llmsherpa) and OCR extraction (Tesseract), then compares the results to determine which method better extracts vendor information. The function provides detailed output including character counts, text samples, and vendor name detection results to help determine the optimal extraction method for contract processing.
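The comparison step reduces to a small decision rule over the two extracted texts. Below is a minimal sketch of that rule, assuming a hypothetical helper named `recommend_method` (it is not part of the source module, which prints its recommendation instead of returning it). The labels it returns are also illustrative.

```python
def recommend_method(regular_text: str, ocr_text: str, vendor: str = "Promega") -> str:
    """Mirror the recommendation rule from the comparison summary.

    Hypothetical helper for illustration only: the documented function
    prints its recommendation rather than returning a label.
    """
    in_regular = vendor in regular_text  # case-sensitive substring check
    in_ocr = vendor in ocr_text

    if in_ocr and not in_regular:
        return "ocr"            # only OCR found the vendor name
    if in_regular:
        return "regular"        # regular extraction is sufficient
    return "manual_review"      # neither method found the vendor name


if __name__ == "__main__":
    print(recommend_method("", "Promega Benelux B.V"))
```

Note that the documented function applies this rule only when both extractions report success; a failed extraction skips the recommendation entirely.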
Source Code
def test_extraction_methods():
    """Test both regular and OCR extraction on a sample contract"""
    print("="*80)
    print("Testing PDF Text Extraction Methods")
    print("="*80)
    print()

    # Initialize
    config = Config()
    doc_processor = DocumentProcessor(config)
    fc_client = FileCloudClient(config.get_section('filecloud'))

    # Connect to FileCloud
    print("Connecting to FileCloud...")
    if not fc_client.connect():
        print("❌ Failed to connect to FileCloud")
        return
    print("✅ Connected")
    print()

    # Download the purchase order
    file_path = "/SHARED/vicebio_shares/00_Company_Governance/08_Third Parties Management/Third Parties/Promega Benelux B.V/4. PO/2025-09-05_PO-0027 - DNA extraction kit/Purchase Order PO-0027_signed.pdf"
    print(f"Downloading: {file_path}")
    content = fc_client.download_document(file_path)
    if not content:
        print("❌ Failed to download")
        return

    # Save to temp file
    import tempfile
    with tempfile.NamedTemporaryFile(mode='wb', suffix='.pdf', delete=False) as tmp:
        tmp.write(content)
        local_path = tmp.name
    print(f"✅ Downloaded to: {local_path}")
    print()

    # Test Method 1: Regular extraction (llmsherpa)
    print("-"*80)
    print("Method 1: Regular Extraction (llmsherpa)")
    print("-"*80)
    result1 = doc_processor.process_document(local_path)
    if result1.get('success'):
        text1 = result1.get('text', '')
        print(f"✅ Extracted {len(text1)} characters")
        print()
        print("First 2000 characters:")
        print(text1[:2000])
        print()
        print("Last 500 characters:")
        print(text1[-500:])
        print()
        # Check for vendor name
        if 'Promega' in text1:
            print("✅ VENDOR NAME FOUND: 'Promega' appears in text")
        else:
            print("❌ VENDOR NAME NOT FOUND: 'Promega' not in extracted text")
        print()
    else:
        print(f"❌ Extraction failed: {result1.get('error')}")
        print()

    # Test Method 2: OCR extraction
    print("-"*80)
    print("Method 2: OCR Extraction (Tesseract)")
    print("-"*80)
    result2 = doc_processor.process_document_with_ocr(local_path)
    if result2.get('success'):
        text2 = result2.get('text', '')
        print(f"✅ Extracted {len(text2)} characters")
        print()
        print("First 2000 characters:")
        print(text2[:2000])
        print()
        print("Last 500 characters:")
        print(text2[-500:])
        print()
        # Check for vendor name
        if 'Promega' in text2:
            print("✅ VENDOR NAME FOUND: 'Promega' appears in OCR text")
        else:
            print("❌ VENDOR NAME NOT FOUND: 'Promega' not in OCR text")
        print()
    else:
        print(f"❌ OCR extraction failed: {result2.get('error')}")
        print()

    # Comparison
    print("="*80)
    print("Comparison Summary")
    print("="*80)
    if result1.get('success') and result2.get('success'):
        text1 = result1.get('text', '')
        text2 = result2.get('text', '')
        print(f"Regular extraction: {len(text1)} chars")
        print(f"OCR extraction: {len(text2)} chars")
        print()
        print(f"'Promega' in regular: {'YES ✅' if 'Promega' in text1 else 'NO ❌'}")
        print(f"'Promega' in OCR: {'YES ✅' if 'Promega' in text2 else 'NO ❌'}")
        print()
        # Recommend method
        if 'Promega' in text2 and 'Promega' not in text1:
            print("🎯 RECOMMENDATION: Use OCR extraction for better vendor detection")
        elif 'Promega' in text1:
            print("✅ Regular extraction is sufficient")
        else:
            print("⚠️ Neither method found vendor name - may need manual review")

    # Cleanup
    import os
    os.remove(local_path)
    fc_client.disconnect()
    print()
    print("="*80)
Return Value
This function does not return any value (implicitly returns None). It outputs diagnostic information directly to stdout, including connection status, extraction results, text samples, vendor name detection results, and method comparison recommendations.
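Because all results go to stdout, a caller that wants to inspect them programmatically has to capture the printed text. Below is a minimal sketch using the standard library's `contextlib.redirect_stdout`. The `capture_output` helper and the `fake_diagnostic` stand-in are hypothetical names introduced here so the example runs without FileCloud access; in practice `test_extraction_methods` would be passed in their place.

```python
import contextlib
import io


def capture_output(func, *args, **kwargs):
    """Run func and return (return_value, text_printed_to_stdout)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        result = func(*args, **kwargs)
    return result, buffer.getvalue()


def fake_diagnostic():
    """Stand-in for test_extraction_methods: prints and returns None."""
    print("Comparison Summary")


value, output = capture_output(fake_diagnostic)
# value is None because the function only prints; output holds the printed text.
```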
Dependencies
- sys
- pathlib
- tempfile
- os
- json
Required Imports
import sys
from pathlib import Path
from utils.document_processor import DocumentProcessor
from utils.filecloud_client import FileCloudClient
from config.config import Config
import json
import tempfile
import os
Usage Example
# This is a standalone test function with no parameters
# Simply call it directly after ensuring all dependencies are configured
from test_extraction_methods import test_extraction_methods
# Run the test
test_extraction_methods()
# The function will:
# 1. Connect to FileCloud
# 2. Download the specified purchase order PDF
# 3. Extract text using regular method (llmsherpa)
# 4. Extract text using OCR method (Tesseract)
# 5. Compare results and check for vendor name 'Promega'
# 6. Print detailed diagnostic output
# 7. Provide recommendation on which method to use
# 8. Clean up temporary files and disconnect
Best Practices
- This function is hardcoded to test a specific file path - modify the file_path variable to test different documents
- Ensure FileCloud credentials are properly configured in the Config object before running
- Verify that both llmsherpa and Tesseract OCR services are available and properly configured
- The function creates temporary files - ensure sufficient disk space and write permissions
- This is a diagnostic/testing function and should not be used in production code
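- Cleanup (removing the temporary file and disconnecting from FileCloud) runs only at the end of the function: the early return on download failure skips fc_client.disconnect(), and an exception raised during extraction leaves the temporary file on disk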
- Output is verbose and designed for human review - redirect stdout if running in automated tests
- The vendor name check is case-sensitive and looks for exact string 'Promega'
- Consider wrapping the function call in try-except blocks to handle potential connection or processing failures
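Two of the points above, temporary-file cleanup and the case-sensitive vendor check, can be sketched as small helpers. Both `temporary_pdf` and `contains_vendor` are hypothetical names introduced for illustration and do not exist in the source module. The first guarantees the temporary file is removed even if processing raises; the second matches the vendor name regardless of case.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_pdf(content: bytes):
    """Write content to a temporary .pdf file and always delete it afterwards."""
    with tempfile.NamedTemporaryFile(mode="wb", suffix=".pdf", delete=False) as tmp:
        tmp.write(content)
        path = tmp.name
    try:
        yield path
    finally:
        # Runs on normal exit and when the body raises.
        if os.path.exists(path):
            os.remove(path)


def contains_vendor(text: str, vendor: str = "Promega") -> bool:
    """Case-insensitive substring check for the vendor name."""
    return vendor.casefold() in text.casefold()
```

Inside the `with temporary_pdf(content) as local_path:` body, the two extraction calls could run as before; the file is removed when the block exits, whether or not an exception occurred.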
Similar Components
AI-powered semantic similarity - components with related functionality:
- function test_ocr_fallback 68.3% similar
- function test_single_document 68.1% similar
- function test_ocr_retry_logic 66.5% similar
- function test_llm_extraction 64.3% similar
- function test_document_processor 64.3% similar