function test_document_extractor
A test function that exercises the DocumentExtractor class by listing its supported extensions, extracting text from a set of Markdown files, and checking file type detection for several extensions, with per-file error handling.
/tf/active/vicechatdev/leexi/test_document_extractor.py
15 - 55
simple
Purpose
This function is a manual smoke test for the DocumentExtractor class. It prints the extensions the extractor reports as supported, attempts text extraction on three hard-coded Markdown files (skipping any that do not exist in the current working directory), and checks whether is_supported_file() accepts a set of Word, PDF, PowerPoint, plain-text, and Markdown extensions. Extraction from non-Markdown formats is not exercised; those formats are covered only by the extension check. Exceptions raised during extraction are caught and printed. Each step reports a ✓ or ✗ status on the console, and nothing is asserted.
Source Code
def test_document_extractor():
    """Test the document extractor with various file types"""
    # Initialize extractor
    extractor = DocumentExtractor()

    print("Document Extractor Test")
    print("=" * 50)

    # Test supported extensions
    supported_extensions = extractor.get_supported_extensions()
    print(f"Supported extensions: {supported_extensions}")
    print()

    # Test with existing files in the directory
    test_files = [
        "enhanced_meeting_minutes_2025-06-18.md",
        "leexi-20250618-transcript-development_team_meeting.md",
        "powerpoint_content_summary.md"
    ]

    for file_path in test_files:
        if os.path.exists(file_path):
            print(f"Testing file: {file_path}")
            try:
                content = extractor.extract_text(file_path)
                if content:
                    print(f"✓ Successfully extracted {len(content)} characters")
                    print(f"Preview: {content[:200]}...")
                else:
                    print("✗ No content extracted")
            except Exception as e:
                print(f"✗ Error: {str(e)}")
            print("-" * 40)

    # Test file type detection
    test_extensions = ['.docx', '.pdf', '.pptx', '.txt', '.md', '.doc', '.ppt']
    print("\nFile type detection test:")
    for ext in test_extensions:
        is_supported = extractor.is_supported_file(f"test{ext}")
        print(f"{ext}: {'✓ Supported' if is_supported else '✗ Not supported'}")
Return Value
This function does not return any value (implicitly returns None). It outputs test results directly to the console, including supported extensions, extraction success/failure status, character counts, content previews, and file type detection results.
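Because the function returns None and reports everything through print(), a caller that wants to inspect the results has to capture standard output. The following is a minimal sketch of one way to do that with the standard library. The helper capture_stdout and the stand-in demo_report are hypothetical names introduced for illustration; demo_report only mimics the first two lines that test_document_extractor() prints.

```python
import contextlib
import io


def capture_stdout(func, *args, **kwargs):
    """Run func and return (return_value, everything_it_printed)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        result = func(*args, **kwargs)
    return result, buffer.getvalue()


def demo_report():
    # Hypothetical stand-in for test_document_extractor():
    # it prints a header and implicitly returns None.
    print("Document Extractor Test")
    print("=" * 50)


result, output = capture_stdout(demo_report)
print(result is None)
print("Document Extractor Test" in output)
```

With the real function, the call would be capture_stdout(test_document_extractor), and the captured text could then be searched for the ✓ and ✗ markers.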
Dependencies
- os
- sys
- pathlib
- document_extractor
Required Imports
import os
import sys
from pathlib import Path
from document_extractor import DocumentExtractor
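The function body shown in Source Code calls only os.path.exists() and DocumentExtractor, and it looks up the test files relative to the current working directory. As a sketch of how pathlib could make that lookup independent of where the script is launched, the hypothetical helper below (existing_test_files is not part of the original module) filters a list of file names against an explicit base directory. A temporary directory is used here only so the example runs anywhere.

```python
import tempfile
from pathlib import Path


def existing_test_files(base_dir, names):
    """Return the paths under base_dir that exist, preserving input order."""
    base = Path(base_dir)
    return [base / name for name in names if (base / name).is_file()]


# Demonstration with a throwaway directory and made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "notes.md").write_text("# Notes\n", encoding="utf-8")
    found = existing_test_files(tmp, ["notes.md", "missing.md"])
    found_names = [path.name for path in found]

print(found_names)  # ['notes.md']
```

In a test script, base_dir would typically be Path(__file__).resolve().parent, so the files are located next to the script regardless of the working directory.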
Usage Example
# Ensure DocumentExtractor is available and test files exist
# Run the test function
test_document_extractor()
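# Example output (the extension list, character counts, and previews
# depend on your DocumentExtractor implementation and test files):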
# Document Extractor Test
# ==================================================
# Supported extensions: ['.md', '.txt', '.docx', '.pdf', '.pptx']
#
# Testing file: enhanced_meeting_minutes_2025-06-18.md
# ✓ Successfully extracted 1234 characters
# Preview: # Meeting Minutes...
# ----------------------------------------
# ...
Best Practices
- Ensure the DocumentExtractor class is properly implemented before running this test
- Place test files in the same directory as the test script or update file paths accordingly
- The function checks os.path.exists() before attempting extraction, so missing files are skipped without any message rather than causing a crash or being reported as failures
- Consider wrapping the entire test in a try-except block for production use
- This is a manual test function that prints to console; consider converting to pytest or unittest for automated testing
- The function tests specific file names; modify the test_files list to match your actual test data
- Character preview is limited to 200 characters to avoid cluttering console output
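The suggestion above to convert the manual test into an automated one could look roughly like the following unittest sketch. It is not the project's actual test suite. StubExtractor is a hypothetical stand-in that mirrors only the three methods the original function calls (get_supported_extensions, is_supported_file, extract_text), and its extension list is arbitrary. The sample file is created in a temporary directory so the tests do not depend on specific files being present.

```python
import os
import tempfile
import unittest


class StubExtractor:
    """Hypothetical stand-in exposing the three methods the test calls."""

    _EXTENSIONS = [".md", ".txt"]  # arbitrary choice for this sketch

    def get_supported_extensions(self):
        return list(self._EXTENSIONS)

    def is_supported_file(self, file_path):
        return os.path.splitext(file_path)[1].lower() in self._EXTENSIONS

    def extract_text(self, file_path):
        with open(file_path, encoding="utf-8") as handle:
            return handle.read()


class ExtractorChecks(unittest.TestCase):
    def setUp(self):
        # Swap in the real DocumentExtractor here when it is importable.
        self.extractor = StubExtractor()

    def test_supported_extensions_is_non_empty(self):
        self.assertTrue(self.extractor.get_supported_extensions())

    def test_file_type_detection(self):
        self.assertTrue(self.extractor.is_supported_file("test.md"))
        self.assertFalse(self.extractor.is_supported_file("test.xyz"))

    def test_extract_text_returns_content(self):
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, "sample.md")
            with open(path, "w", encoding="utf-8") as handle:
                handle.write("# Meeting Minutes\n")
            self.assertEqual(
                self.extractor.extract_text(path), "# Meeting Minutes\n"
            )


# Run the suite programmatically so the sketch works when pasted into a script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExtractorChecks)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

Unlike the print-based version, each check here fails loudly through an assertion, so a test runner or CI job can detect regressions without anyone reading the console output.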
Similar Components
AI-powered semantic similarity - components with related functionality:
- function test_multiple_files 81.7% similar
- function test_extraction_debugging 76.1% similar
- class DocumentExtractor 72.9% similar
- function test_mixed_previous_reports 72.6% similar
- function test_document_processor 72.5% similar