
function test_docx_file

Maturity: 46

Tests the ability to open and read a Microsoft Word (.docx) document file, validating file existence, size, and content extraction capabilities.

File: /tf/active/vicechatdev/docchat/test_problematic_files.py
Lines: 57 - 97
Complexity: simple

Purpose

This diagnostic function verifies that a Word document can be opened and parsed with the python-docx library. It checks that the file exists, reports its size, counts the paragraphs, and prints up to 100 characters of the first non-empty paragraph found among the first five. Each step prints a status line marked ✓ or ❌, and the function returns a boolean indicating overall success or failure. It is intended for testing document accessibility and debugging file-related issues.
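The existence and size steps can be tried without python-docx. The sketch below is illustrative only: `describe_file` is a hypothetical helper, not part of the original module, and it returns the size summary rather than printing it. It uses the same format string as the function's size report.

```python
from pathlib import Path
import tempfile

def describe_file(file_path):
    """Return a size summary for an existing file, or None if it is missing.

    Hypothetical helper: mirrors the existence and size checks, but returns
    the text instead of printing it.
    """
    path = Path(file_path)  # accepts both str and Path
    if not path.exists():
        return None
    size = path.stat().st_size
    return f"{size:,} bytes ({size / 1024 / 1024:.2f} MB)"

# Demonstrate on a temporary 1,536,000-byte file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"\0" * 1_536_000)
    print(describe_file(sample))                       # 1,536,000 bytes (1.46 MB)
    print(describe_file(str(sample)))                  # same result for a str path
    print(describe_file(Path(tmp) / "missing.docx"))   # None
```

Wrapping the argument in `Path` is what lets the same code accept either a string or a `Path` object.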

Source Code

def test_docx_file(file_path):
    """Test opening a Word document"""
    print(f"\n{'='*80}")
    print(f"Testing Word Document: {Path(file_path).name}")
    print(f"{'='*80}")
    
    try:
        from docx import Document as DocxDocument
        file_path_obj = Path(file_path)
        
        # Check if file exists
        if not file_path_obj.exists():
            print(f"❌ File does not exist!")
            return False
            
        # Check file size
        file_size = file_path_obj.stat().st_size
        print(f"✓ File exists, size: {file_size:,} bytes ({file_size/1024/1024:.2f} MB)")
        
        # Try to open with python-docx
        print(f"Attempting to open with python-docx...")
        doc = DocxDocument(str(file_path))
        print(f"✓ Successfully opened with python-docx")
        print(f"  - Number of paragraphs: {len(doc.paragraphs)}")
        
        # Try to extract some text
        text_sample = ""
        for para in doc.paragraphs[:5]:
            if para.text.strip():
                text_sample = para.text[:100]
                break
        if text_sample:
            print(f"  - Sample text: {text_sample}...")
        print(f"✓ Can read document content")
        return True
        
    except Exception as e:
        print(f"❌ Error: {type(e).__name__}: {e}")
        print(f"\nFull traceback:")
        traceback.print_exc()
        return False

Parameters

Name       Type  Default  Kind
file_path  -     -        positional_or_keyword

Parameter Details

file_path: Path to the Word document file to test. Can be a string or Path object representing an absolute or relative file path to a .docx file. The function will convert it to a Path object internally for validation and then to a string for python-docx compatibility.

Return Value

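Returns a boolean value: True if the file exists and python-docx opened it without raising an exception; False if the file does not exist or any exception occurred (missing python-docx, corrupted file, parsing errors, etc.). A text sample is printed only when one of the first five paragraphs is non-empty, but a document with no sample text still returns True. Status lines are printed to stdout regardless of success or failure; on an exception, the full traceback is written to stderr, which is the default destination of traceback.print_exc().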
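Because the result is a plain boolean, it is straightforward to run the check over several files and summarize the outcome. The following is a minimal sketch under stated assumptions: `run_checks` and `looks_like_docx` are hypothetical names introduced here, and the stand-in check only inspects the file extension so the example runs without python-docx. In practice `test_docx_file` would be passed as the `check` argument.

```python
from pathlib import Path

def run_checks(paths, check):
    """Apply `check` to each path and split the paths by result."""
    passed, failed = [], []
    for path in paths:
        (passed if check(path) else failed).append(Path(path))
    return passed, failed

# Stand-in check for demonstration only; in practice pass test_docx_file.
def looks_like_docx(path):
    return Path(path).suffix.lower() == ".docx"

passed, failed = run_checks(["a.docx", "b.txt", Path("c.DOCX")], looks_like_docx)
print(f"{len(passed)} passed, {len(failed)} failed")  # 2 passed, 1 failed
```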

Dependencies

  • python-docx
  • pathlib (standard library)
  • traceback (standard library)

Required Imports

from pathlib import Path
import traceback
from docx import Document as DocxDocument

Conditional/Optional Imports

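This import is deferred rather than optional:

from docx import Document as DocxDocument

Condition: imported inside the function's try block on every call. python-docx is always required, but because the import is deferred, a missing installation is caught by the except clause and reported as a failure (return value False) instead of raising ImportError when the module is loaded.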
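The effect of importing inside a try block can be sketched as follows. This is an illustration, not the original code: `can_import` is a hypothetical helper, it uses `importlib.import_module` in place of a `from ... import` statement so the module name can be a parameter, and `no_such_module_for_demo` is a deliberately nonexistent name.

```python
import importlib

def can_import(module_name):
    """Return True if the module imports cleanly, False otherwise."""
    try:
        importlib.import_module(module_name)
        return True
    except Exception as e:  # ImportError is a subclass of Exception
        print(f"Error: {type(e).__name__}: {e}")
        return False

print(can_import("json"))                     # True
print(can_import("no_such_module_for_demo"))  # False, after printing the error
```

A missing module is handled by the same `except Exception` path as any other failure, which is why the caller sees a False result rather than an unhandled ImportError.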

Usage Example

from pathlib import Path
import traceback
# python-docx is imported inside the function, so no top-level docx import is needed

def test_docx_file(file_path):
    """Test opening a Word document"""
    print(f"\n{'='*80}")
    print(f"Testing Word Document: {Path(file_path).name}")
    print(f"{'='*80}")
    
    try:
        from docx import Document as DocxDocument
        file_path_obj = Path(file_path)
        
        if not file_path_obj.exists():
            print(f"❌ File does not exist!")
            return False
            
        file_size = file_path_obj.stat().st_size
        print(f"✓ File exists, size: {file_size:,} bytes ({file_size/1024/1024:.2f} MB)")
        
        print(f"Attempting to open with python-docx...")
        doc = DocxDocument(str(file_path))
        print(f"✓ Successfully opened with python-docx")
        print(f"  - Number of paragraphs: {len(doc.paragraphs)}")
        
        text_sample = ""
        for para in doc.paragraphs[:5]:
            if para.text.strip():
                text_sample = para.text[:100]
                break
        if text_sample:
            print(f"  - Sample text: {text_sample}...")
        print(f"✓ Can read document content")
        return True
        
    except Exception as e:
        print(f"❌ Error: {type(e).__name__}: {e}")
        print(f"\nFull traceback:")
        traceback.print_exc()
        return False

# Example usage
result = test_docx_file('example_document.docx')
if result:
    print("Document test passed!")
else:
    print("Document test failed!")

# Test with a Path object (Path is already imported at the top)
doc_path = Path('/path/to/documents/report.docx')
success = test_docx_file(doc_path)

Best Practices

  • Ensure the python-docx library is installed before calling this function; if it is missing, the function reports the import error and returns False
  • The function prints status lines directly to stdout and the traceback to stderr (the default for traceback.print_exc), so redirect both streams if the output is needed for logging
  • Use the boolean return value for programmatic success checks; the detailed diagnostics appear only in the console output
  • The file path can be a string or a Path object; the function handles both
  • The sample is limited to the first 100 characters of the first non-empty paragraph, which keeps the output short for large documents
  • The function catches every Exception so that a bad file does not crash the caller, and prints the full traceback for debugging
  • File size is reported in both bytes and megabytes (bytes / 1024 / 1024) for convenience
  • Only the first 5 paragraphs are searched for sample text; if all of them are empty, no sample is printed but the function still returns True
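One way to capture the diagnostics for logging is the standard library's `contextlib.redirect_stdout` and `contextlib.redirect_stderr`. The sketch below is a minimal example under stated assumptions: `run_captured` is a hypothetical wrapper, and `noisy_check` is a stand-in that prints and returns a boolean so the example runs without python-docx or a real document. In practice `test_docx_file` and a file path would be passed instead.

```python
import contextlib
import io
import sys

def noisy_check(ok):
    """Stand-in for test_docx_file: prints diagnostics and returns a bool."""
    print("OK: step passed" if ok else "FAIL: step failed")
    if not ok:
        # The real function writes its traceback to stderr.
        print("traceback would go here", file=sys.stderr)
    return ok

def run_captured(func, *args):
    """Call func(*args) and return (result, stdout_text, stderr_text)."""
    out, err = io.StringIO(), io.StringIO()
    with contextlib.redirect_stdout(out), contextlib.redirect_stderr(err):
        result = func(*args)
    return result, out.getvalue(), err.getvalue()

result, out_text, err_text = run_captured(noisy_check, False)
print(result, repr(out_text), repr(err_text))
```

The captured strings can then be passed to a logger or written to a file, while the boolean result is still available for control flow.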

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function main_v64 71.6% similar

    A test harness function that validates the ability to open and process PowerPoint and Word document files, with fallback to LibreOffice conversion for problematic files.

    From: /tf/active/vicechatdev/docchat/test_problematic_files.py
  • function test_pptx_file 70.3% similar

    Tests the ability to open and read a PowerPoint (.pptx) file using the python-pptx library, validating file existence, size, and basic slide iteration.

    From: /tf/active/vicechatdev/docchat/test_problematic_files.py
  • function test_document_extractor 66.8% similar

    A test function that validates the DocumentExtractor class by testing file type support detection, text extraction from various document formats, and error handling.

    From: /tf/active/vicechatdev/leexi/test_document_extractor.py
  • function validate_document_structure 63.8% similar

    Validates the structural integrity of a DOCX document by checking if it contains all required sections specified in the document type template configuration.

    From: /tf/active/vicechatdev/CDocs/utils/document_processor.py
  • function explore_documents 62.2% similar

    Explores and tests document accessibility across multiple FileCloud directory paths, attempting to download and validate document content from various locations in a hierarchical search pattern.

    From: /tf/active/vicechatdev/contract_validity_analyzer/explore_documents.py