test_document_extractor - Code Extractor

function test_document_extractor

Maturity: 45

A manual test function that exercises the DocumentExtractor class: it lists the supported extensions, extracts text from a fixed set of Markdown files, and checks file type detection for seven extensions, printing the results to the console.

File:
/tf/active/vicechatdev/leexi/test_document_extractor.py

Lines:
15 - 55

Complexity:
simple

Purpose

This function is a manual smoke test for the DocumentExtractor class. It prints the list returned by get_supported_extensions(), then calls extract_text() on three hard-coded Markdown files, skipping any that do not exist in the current working directory and reporting extraction errors without aborting the run. Finally, it calls is_supported_file() for seven extensions (.docx, .pdf, .pptx, .txt, .md, .doc, .ppt). Word, PDF, and PowerPoint formats are covered only by this extension check; no file of those types is actually extracted. The function prints a ✓ or ✗ status for each operation and makes no assertions.

Source Code

def test_document_extractor():
    """Test the document extractor with various file types"""
    
    # Initialize extractor
    extractor = DocumentExtractor()
    
    print("Document Extractor Test")
    print("=" * 50)
    
    # Test supported extensions
    supported_extensions = extractor.get_supported_extensions()
    print(f"Supported extensions: {supported_extensions}")
    print()
    
    # Test with existing files in the directory
    test_files = [
        "enhanced_meeting_minutes_2025-06-18.md",
        "leexi-20250618-transcript-development_team_meeting.md",
        "powerpoint_content_summary.md"
    ]
    
    for file_path in test_files:
        if os.path.exists(file_path):
            print(f"Testing file: {file_path}")
            try:
                content = extractor.extract_text(file_path)
                if content:
                    print(f"✓ Successfully extracted {len(content)} characters")
                    print(f"Preview: {content[:200]}...")
                else:
                    print("✗ No content extracted")
            except Exception as e:
                print(f"✗ Error: {str(e)}")
            print("-" * 40)
    
    # Test file type detection
    test_extensions = ['.docx', '.pdf', '.pptx', '.txt', '.md', '.doc', '.ppt']
    print("\nFile type detection test:")
    for ext in test_extensions:
        is_supported = extractor.is_supported_file(f"test{ext}")
        print(f"{ext}: {'✓ Supported' if is_supported else '✗ Not supported'}")

Return Value

This function does not return any value (implicitly returns None). It outputs test results directly to the console, including supported extensions, extraction success/failure status, character counts, content previews, and file type detection results.
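Because the results exist only as console output, a caller that wants to inspect them has to capture stdout. The sketch below shows one way to do that with contextlib.redirect_stdout from the standard library. The function print_report is a hypothetical stand-in for test_document_extractor (which needs the real DocumentExtractor and its test files), and capture_output is an illustrative helper; neither is part of the original module.

```python
import contextlib
import io


def print_report():
    """Hypothetical stand-in: prints a header and returns None, like the real test."""
    print("Document Extractor Test")
    print("=" * 50)


def capture_output(func):
    """Run func with stdout redirected; return (return_value, printed_text)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        result = func()
    return result, buffer.getvalue()


# The return value is None; everything of interest is in the captured text.
result, text = capture_output(print_report)
lines = text.splitlines()
```

Passing test_document_extractor instead of print_report should work the same way, assuming its imports resolve in the calling environment.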

Dependencies

os
sys
pathlib
document_extractor

Required Imports

import os
import sys
from pathlib import Path
from document_extractor import DocumentExtractor

Usage Example

# Ensure DocumentExtractor is available and test files exist
# Run the test function
test_document_extractor()

# Example output (the extension list and character count shown are illustrative):
# Document Extractor Test
# ==================================================
# Supported extensions: ['.md', '.txt', '.docx', '.pdf', '.pptx']
# 
# Testing file: enhanced_meeting_minutes_2025-06-18.md
# ✓ Successfully extracted 1234 characters
# Preview: # Meeting Minutes...
# ----------------------------------------
# ...

Best Practices

Ensure the DocumentExtractor class is properly implemented before running this test
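The file names are relative paths, so they resolve against the current working directory; run the script from the directory that contains the test files or update the paths accordingly
The function checks os.path.exists() before attempting extraction, so missing files do not raise errors; they are skipped silently, and a run with no test files present prints no extraction results
Only the extract_text() call is wrapped in try-except; consider guarding the other DocumentExtractor calls as well if the function must not abort on error
This is a manual test function that prints to the console and makes no assertions; consider converting it to pytest or unittest for automated testing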
The function tests specific file names; modify the test_files list to match your actual test data
Character preview is limited to 200 characters to avoid cluttering console output
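The suggestion above to convert the function to unittest could look roughly like the following sketch. StubExtractor is a hypothetical stand-in that mirrors only the three method names used in the source code (get_supported_extensions, is_supported_file, extract_text); its behavior is an assumption, not the real DocumentExtractor implementation. The test creates its own input file in a temporary directory, so it does not depend on the working directory or on files being present.

```python
import os
import tempfile
import unittest


class StubExtractor:
    """Hypothetical stand-in for DocumentExtractor (assumed behavior, not the real class)."""

    SUPPORTED = {".md", ".txt"}

    def get_supported_extensions(self):
        return sorted(self.SUPPORTED)

    def is_supported_file(self, path):
        # Decide support from the lower-cased file extension only.
        return os.path.splitext(path)[1].lower() in self.SUPPORTED

    def extract_text(self, path):
        with open(path, encoding="utf-8") as handle:
            return handle.read()


class ExtractorTests(unittest.TestCase):
    def setUp(self):
        # Swap in the real DocumentExtractor() here when it is importable.
        self.extractor = StubExtractor()

    def test_supported_extensions_not_empty(self):
        self.assertTrue(self.extractor.get_supported_extensions())

    def test_file_type_detection(self):
        self.assertTrue(self.extractor.is_supported_file("test.md"))
        self.assertFalse(self.extractor.is_supported_file("test.xyz"))

    def test_extract_text_from_markdown(self):
        # Create the input inside the test so a missing file cannot be skipped silently.
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "sample.md")
            with open(path, "w", encoding="utf-8") as handle:
                handle.write("# Meeting Minutes\n")
            self.assertEqual(self.extractor.extract_text(path), "# Meeting Minutes\n")


def run_suite():
    """Run the tests programmatically and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExtractorTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Saved in a test module, these cases can be run with python -m unittest; pytest also collects unittest.TestCase classes, so the same file works under either runner. Unlike the printed ✓/✗ markers, a failed assertion here fails the run.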

Similar Components

AI-powered semantic similarity - components with related functionality:

function test_multiple_files 81.7% similar

A test function that validates the extraction of text content from multiple document files using a DocumentExtractor instance, displaying extraction results and simulating combined content processing.
From: /tf/active/vicechatdev/leexi/test_multiple_files.py

function test_extraction_debugging 76.1% similar

A test function that validates the extraction debugging functionality of a DocumentProcessor by creating test files, simulating document extraction, and verifying debug log creation.
From: /tf/active/vicechatdev/vice_ai/test_extraction_debug.py

class DocumentExtractor 72.9% similar

A document text extraction class that supports multiple file formats including Word, PowerPoint, PDF, and plain text files, with automatic format detection and conversion capabilities.
From: /tf/active/vicechatdev/leexi/document_extractor.py

function test_mixed_previous_reports 72.6% similar

A test function that validates the DocumentExtractor's ability to extract text content from multiple file formats (TXT and Markdown) and combine them into a unified previous reports summary.
From: /tf/active/vicechatdev/leexi/test_enhanced_reports.py

function test_document_processor 72.5% similar

A test function that validates the DocumentProcessor component's ability to extract text from PDF files with improved error handling and llmsherpa integration.
From: /tf/active/vicechatdev/contract_validity_analyzer/test_improved_processor.py
