test_document_processor - Code Extractor

function test_document_processor

Maturity: 47

A test function that exercises the DocumentProcessor component's PDF text extraction, including its improved error handling around llmsherpa. It reports results on the console and makes no assertions.

File: /tf/active/vicechatdev/contract_validity_analyzer/test_improved_processor.py

Lines: 16 - 67

Complexity: moderate

Purpose

This function is a manual smoke test for the DocumentProcessor class. It builds the processor from a hand-written configuration dictionary (supported_extensions, max_file_size_mb of 50, text_extraction_timeout of 300), looks for PDF files in /tf/active/contract_validity_analyzer/output/ and, if none are found there, directly under /tf/active/, then attempts to extract text from at most 2 of them. For each file it prints a success or failure status, the character count, and a preview of the text, so the document processing pipeline can be checked by eye.
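The file discovery step described above (try one location, fall back to another, cap the number of files) can be sketched as a small helper. This is only an illustration under assumptions: find_pdf_files and its parameters are hypothetical names that do not appear in the original file, and the sorting is an added choice to make the order stable, since glob itself returns matches in arbitrary order.

```python
import glob
import os
import tempfile


def find_pdf_files(patterns, limit=2):
    """Return up to `limit` paths from the first glob pattern that matches anything.

    Hypothetical helper: the original function inlines this logic.
    """
    for pattern in patterns:
        matches = sorted(glob.glob(pattern))  # sorted for a stable order
        if matches:
            return matches[:limit]
    return []


# Demonstration with throwaway directories instead of the real /tf/active paths.
with tempfile.TemporaryDirectory() as root:
    primary = os.path.join(root, "output")  # exists but holds no PDFs
    os.makedirs(primary)
    for name in ("a.pdf", "b.pdf", "c.pdf"):
        open(os.path.join(root, name), "wb").close()

    found = find_pdf_files(
        [os.path.join(primary, "*.pdf"), os.path.join(root, "*.pdf")]
    )
    found_names = [os.path.basename(p) for p in found]
    print(found_names)
```

Passing the patterns in as arguments, rather than hard-coding them, would let the same test run against a directory of known PDFs.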

Source Code

def test_document_processor():
    """Test document processor with improved error handling"""
    
    # Load config
    config = Config()
    
    # Initialize document processor with manual config
    processor_config = {
        'supported_extensions': ['.pdf', '.doc', '.docx'],
        'max_file_size_mb': 50,
        'text_extraction_timeout': 300
    }
    processor = DocumentProcessor(processor_config)
    
    print("Testing Document Processor with improved llmsherpa error handling")
    print("=" * 70)
    
    # Find some PDF files to test
    test_files = []
    
    # Look for test files
    import glob
    pdf_files = glob.glob('/tf/active/contract_validity_analyzer/output/*.pdf')
    if not pdf_files:
        # Try to find any PDF files in the workspace
        pdf_files = glob.glob('/tf/active/*.pdf')[:3]  # Test with first 3 PDFs
    
    if not pdf_files:
        print("No PDF files found for testing")
        return
    
    for pdf_file in pdf_files[:2]:  # Test with first 2 files
        print(f"\nTesting file: {os.path.basename(pdf_file)}")
        print("-" * 50)
        
        try:
            # Test text extraction
            text = processor.extract_text(pdf_file)
            
            if text:
                print(f"✓ Successfully extracted {len(text)} characters")
                if len(text) > 200:
                    print(f"Preview: {text[:200]}...")
                else:
                    print(f"Full text: {text}")
            else:
                print("✗ Failed to extract text")
                
        except Exception as e:
            print(f"✗ Error testing file: {e}")
    
    print("\nTest completed")

Return Value

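This function returns None in every case, including the early return when no PDF files are found. All results are written to the console via print statements. Because per-file exceptions are caught and printed, a failed extraction does not signal failure to a test runner.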
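If the outcome needs to be checked programmatically rather than read from the console, one possible variant records a result per file and returns it. The sketch below is an assumption-laden illustration, not the author's code: run_extraction_checks and fake_extract are hypothetical names, and fake_extract merely stands in for processor.extract_text so the example runs without the real DocumentProcessor.

```python
def run_extraction_checks(extract, paths):
    """Call `extract` on each path and record the outcome instead of printing it."""
    results = {}
    for path in paths:
        try:
            text = extract(path)
            # Empty or None text counts as a failed extraction, as in the original.
            results[path] = ("ok", len(text)) if text else ("empty", 0)
        except Exception as exc:  # same broad per-file catch as the original
            results[path] = ("error", str(exc))
    return results


def fake_extract(path):
    """Hypothetical stand-in for processor.extract_text, for demonstration only."""
    if path == "broken.pdf":
        raise ValueError("cannot parse")
    return "" if path == "blank.pdf" else "sample text"


results = run_extraction_checks(
    fake_extract, ["good.pdf", "blank.pdf", "broken.pdf"]
)
for path, outcome in results.items():
    print(path, outcome)
```

A caller could then assert on the returned dictionary, which a test runner can act on.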

Dependencies

sys
os
glob
logging
utils.document_processor
config.config

Required Imports

import sys
import os
import glob
import logging
from utils.document_processor import DocumentProcessor
from config.config import Config

Usage Example

# Direct execution
test_document_processor()

# Or when the file is run as a script
if __name__ == '__main__':
    test_document_processor()

# Example output (file name, character count, and preview text will vary):
# Testing Document Processor with improved llmsherpa error handling
# ======================================================================
# 
# Testing file: example.pdf
# --------------------------------------------------
# ✓ Successfully extracted 1523 characters
# Preview: This is the beginning of the document text...
# 
# Test completed

Best Practices

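Make sure at least one of the two search locations contains PDF files; otherwise the function prints "No PDF files found for testing" and returns without testing anything
The function tests at most 2 PDF files to avoid long execution times, even though the fallback search keeps up to 3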
Error handling is implemented at the file level, so one failing file won't stop testing of subsequent files
The function provides visual feedback with checkmarks (✓) and crosses (✗) for easy result interpretation
Text previews are limited to 200 characters to keep console output manageable
This is a standalone test function and should not be used in production code
Consider running this in a test environment with known PDF files for consistent results
The processor_config dictionary can be modified to test different configuration scenarios; the Config() object loaded at the top of the function is not passed to the processor
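The reporting rules in the list above (a ✓ or ✗ marker, a character count, and a preview capped at 200 characters) can be isolated in one small function, which makes them easy to check on their own. This is a sketch only: format_result and preview_chars are hypothetical names, and the original function prints these lines inline instead of building a string.

```python
def format_result(text, preview_chars=200):
    """Build the console message for one extraction result."""
    if not text:
        return "✗ Failed to extract text"
    header = f"✓ Successfully extracted {len(text)} characters"
    if len(text) > preview_chars:
        # Long text: show only the first `preview_chars` characters.
        return f"{header}\nPreview: {text[:preview_chars]}..."
    return f"{header}\nFull text: {text}"


short_message = format_result("abc")
long_message = format_result("x" * 250)
empty_message = format_result("")
print(short_message)
```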

Similar Components

AI-powered semantic similarity - components with related functionality:

function test_document_processing 85.4% similar

A test function that validates document processing functionality by creating a test PDF file, processing it through a DocumentProcessor, and verifying the extraction results or error handling.
From: /tf/active/vicechatdev/contract_validity_analyzer/test_implementation.py

class TestDocumentProcessor 81.7% similar

A test subclass of DocumentProcessor that simulates llmsherpa PDF processing failures and triggers OCR fallback mechanisms for testing purposes.
From: /tf/active/vicechatdev/contract_validity_analyzer/test_ocr_fallback.py

function test_enhanced_pdf_processing 81.2% similar

A comprehensive test function that validates PDF processing capabilities, including text extraction, cleaning, chunking, and table detection across multiple PDF processing libraries.
From: /tf/active/vicechatdev/vice_ai/test_enhanced_pdf.py

class DocumentProcessor_v1 76.4% similar

A document processing class that extracts text from PDF and Word documents using llmsherpa as the primary method with fallback support for PyPDF2, pdfplumber, and python-docx.
From: /tf/active/vicechatdev/contract_validity_analyzer/utils/document_processor_new.py

class DocumentProcessor_v2 76.3% similar

A document processing class that extracts text from PDF and Word documents using llmsherpa as the primary method with fallback support for PyPDF2, pdfplumber, and python-docx.
From: /tf/active/vicechatdev/contract_validity_analyzer/utils/document_processor_old.py
