function test_document_processor

Maturity: 47

A test function that validates the DocumentProcessor component's ability to extract text from PDF files with improved error handling and llmsherpa integration.

File: /tf/active/vicechatdev/contract_validity_analyzer/test_improved_processor.py
Lines: 16 - 67
Complexity: moderate

Purpose

This function is an integration smoke test for the DocumentProcessor class. It builds a processor from a hand-written configuration dictionary (supported extensions, a 50 MB file size limit, and an extraction timeout value of 300), looks for PDF files in /tf/active/contract_validity_analyzer/output/ and, if none are found there, directly under /tf/active/, and then attempts to extract text from at most two of them. It makes no assertions. Instead, it prints a success or failure line, a character count, and a text preview for each file so that the document processing pipeline can be checked by eye.
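The discovery step (a preferred directory, then a broader fallback, then a cap on the number of files) can be sketched in isolation. This is a minimal sketch under stated assumptions, not the project's code: `find_pdfs` and its parameters are hypothetical names, and the matches are sorted here for repeatable results, which the original function does not do.

```python
import glob
import os
import tempfile


def find_pdfs(primary_pattern, fallback_pattern, limit=2):
    """Return up to `limit` paths matching the primary pattern, else the fallback.

    Hypothetical helper; the original function inlines this logic.
    """
    matches = sorted(glob.glob(primary_pattern))
    if not matches:
        # Nothing in the preferred location, so widen the search.
        matches = sorted(glob.glob(fallback_pattern))
    return matches[:limit]


if __name__ == "__main__":
    # Demonstrate the fallback with throwaway files in a temporary directory.
    with tempfile.TemporaryDirectory() as root:
        for name in ("a.pdf", "b.pdf", "c.pdf"):
            open(os.path.join(root, name), "w").close()
        found = find_pdfs(os.path.join(root, "output", "*.pdf"),
                          os.path.join(root, "*.pdf"))
        print([os.path.basename(p) for p in found])
```

Because glob.glob returns an empty list when a directory does not exist, the fallback is reached without any extra existence check.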

Source Code

def test_document_processor():
    """Test document processor with improved error handling"""
    
    # Load config (the instance is not used further in this function)
    config = Config()
    
    # Initialize document processor with manual config
    processor_config = {
        'supported_extensions': ['.pdf', '.doc', '.docx'],
        'max_file_size_mb': 50,
        'text_extraction_timeout': 300
    }
    processor = DocumentProcessor(processor_config)
    
    print("Testing Document Processor with improved llmsherpa error handling")
    print("=" * 70)
    
    # Find some PDF files to test
    test_files = []
    
    # Look for test files
    import glob
    pdf_files = glob.glob('/tf/active/contract_validity_analyzer/output/*.pdf')
    if not pdf_files:
        # Try to find any PDF files in the workspace
        pdf_files = glob.glob('/tf/active/*.pdf')[:3]  # Keep at most 3 candidates; only the first 2 are tested below
    
    if not pdf_files:
        print("No PDF files found for testing")
        return
    
    for pdf_file in pdf_files[:2]:  # Test with first 2 files
        print(f"\nTesting file: {os.path.basename(pdf_file)}")
        print("-" * 50)
        
        try:
            # Test text extraction
            text = processor.extract_text(pdf_file)
            
            if text:
                print(f"✓ Successfully extracted {len(text)} characters")
                if len(text) > 200:
                    print(f"Preview: {text[:200]}...")
                else:
                    print(f"Full text: {text}")
            else:
                print("✗ Failed to extract text")
                
        except Exception as e:
            print(f"✗ Error testing file: {e}")
    
    print("\nTest completed")

Return Value

This function does not return any value (implicitly returns None). It performs testing operations and outputs results directly to the console via print statements.
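Since the function reports only through print, an automated check has to read its console output. The following is a minimal sketch of one way to do that with the standard library; `capture_output` and `demo_test` are hypothetical names, and `demo_test` is only a stand-in for test_document_processor so the example runs without the project's modules.

```python
import contextlib
import io


def capture_output(func, *args, **kwargs):
    """Run `func` and return everything it printed to stdout as a string."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        func(*args, **kwargs)
    return buffer.getvalue()


def demo_test():
    # Stand-in for test_document_processor(): prints results, returns None.
    print("✓ Successfully extracted 12 characters")
    print("\nTest completed")


if __name__ == "__main__":
    output = capture_output(demo_test)
    # A caller can now search the captured text for failure markers.
    print("failures found:", "✗" in output)
```

With the real function, the same wrapper would let a caller search the captured text for the ✗ marker instead of reading the console by hand.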

Dependencies

  • sys
  • os
  • glob
  • logging
  • utils.document_processor
  • config.config

Required Imports

import sys
import os
import glob
import logging
from utils.document_processor import DocumentProcessor
from config.config import Config

Usage Example

# Direct execution
test_document_processor()

# Or when the module is run directly as a script
if __name__ == '__main__':
    test_document_processor()

# Example output (file name, character count, and preview text are illustrative):
# Testing Document Processor with improved llmsherpa error handling
# ======================================================================
#
# Testing file: example.pdf
# --------------------------------------------------
# ✓ Successfully extracted 1523 characters
# Preview: This is the beginning of the document text...
#
# Test completed

Best Practices

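  • Place at least one PDF in /tf/active/contract_validity_analyzer/output/ or directly under /tf/active/ before running; if neither location contains a PDF, the function prints "No PDF files found for testing" and returns without testing anything. Note that these search paths do not include the vicechatdev directory in which this test file resides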
  • The function tests only the first 2 PDF files found to avoid long execution times
  • Error handling is implemented at the file level, so one failing file won't stop testing of subsequent files
  • The function provides visual feedback with checkmarks (✓) and crosses (✗) for easy result interpretation
  • Text previews are limited to 200 characters to keep console output manageable
  • This is a standalone test function and should not be used in production code
  • Consider running this in a test environment with known PDF files for consistent results
  • The processor_config dictionary can be modified to test different configuration scenarios
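The per-file error isolation and the 200-character preview rule listed above can be exercised without real PDFs or llmsherpa. This is a minimal sketch under stated assumptions: `FakeProcessor`, `preview`, and `check_files` are hypothetical names introduced for illustration, and only the `extract_text(path)` call mirrors the interface used in the source code.

```python
def preview(text, limit=200):
    """Return short text unchanged, else the first `limit` characters plus '...'."""
    return text if len(text) <= limit else text[:limit] + "..."


class FakeProcessor:
    """Hypothetical stand-in exposing the same extract_text(path) call."""

    def __init__(self, texts):
        # Maps a path to the text to return, or to an Exception to raise.
        self.texts = texts

    def extract_text(self, path):
        result = self.texts[path]
        if isinstance(result, Exception):
            raise result
        return result


def check_files(processor, paths, max_files=2):
    """Try each file independently and record one outcome per path."""
    results = {}
    for path in paths[:max_files]:
        try:
            text = processor.extract_text(path)
            results[path] = ("ok", preview(text)) if text else ("empty", "")
        except Exception as exc:
            # A failing file is recorded, and the loop moves on to the next one.
            results[path] = ("error", str(exc))
    return results


if __name__ == "__main__":
    processor = FakeProcessor({
        "broken.pdf": RuntimeError("parse failed"),
        "long.pdf": "x" * 250,
    })
    for path, outcome in check_files(processor, ["broken.pdf", "long.pdf"]).items():
        print(path, outcome[0])
```

Returning a dictionary instead of printing is a design choice made for this sketch so the outcomes can be asserted on; the original function only prints.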

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function test_document_processing 85.4% similar

    A test function that validates document processing functionality by creating a test PDF file, processing it through a DocumentProcessor, and verifying the extraction results or error handling.

    From: /tf/active/vicechatdev/contract_validity_analyzer/test_implementation.py
  • class TestDocumentProcessor 81.7% similar

    A test subclass of DocumentProcessor that simulates llmsherpa PDF processing failures and triggers OCR fallback mechanisms for testing purposes.

    From: /tf/active/vicechatdev/contract_validity_analyzer/test_ocr_fallback.py
  • function test_enhanced_pdf_processing 81.2% similar

    A comprehensive test function that validates PDF processing capabilities, including text extraction, cleaning, chunking, and table detection across multiple PDF processing libraries.

    From: /tf/active/vicechatdev/vice_ai/test_enhanced_pdf.py
  • class DocumentProcessor_v1 76.4% similar

    A document processing class that extracts text from PDF and Word documents using llmsherpa as the primary method with fallback support for PyPDF2, pdfplumber, and python-docx.

    From: /tf/active/vicechatdev/contract_validity_analyzer/utils/document_processor_new.py
  • class DocumentProcessor_v2 76.3% similar

    A document processing class that extracts text from PDF and Word documents using llmsherpa as the primary method with fallback support for PyPDF2, pdfplumber, and python-docx.

    From: /tf/active/vicechatdev/contract_validity_analyzer/utils/document_processor_old.py