function test_docx_file
Tests the ability to open and read a Microsoft Word (.docx) document file, validating file existence, size, and content extraction capabilities.
/tf/active/vicechatdev/docchat/test_problematic_files.py
57 - 97
simple
Purpose
This diagnostic function verifies that a Word document can be opened and parsed with the python-docx library. It checks that the file exists, reports its size in bytes and megabytes, opens the document, reports the paragraph count, and prints up to 100 characters of the first non-empty paragraph found among the first five. Each step prints a visual indicator (✓/❌) to the console, and the function returns a boolean indicating overall success or failure. It is primarily used for testing document accessibility and debugging file-related issues.
Source Code
def test_docx_file(file_path):
    """Test opening a Word document"""
    print(f"\n{'='*80}")
    print(f"Testing Word Document: {Path(file_path).name}")
    print(f"{'='*80}")
    try:
        from docx import Document as DocxDocument
        file_path_obj = Path(file_path)
        # Check if file exists
        if not file_path_obj.exists():
            print(f"❌ File does not exist!")
            return False
        # Check file size
        file_size = file_path_obj.stat().st_size
        print(f"✓ File exists, size: {file_size:,} bytes ({file_size/1024/1024:.2f} MB)")
        # Try to open with python-docx
        print(f"Attempting to open with python-docx...")
        doc = DocxDocument(str(file_path))
        print(f"✓ Successfully opened with python-docx")
        print(f" - Number of paragraphs: {len(doc.paragraphs)}")
        # Try to extract some text
        text_sample = ""
        for para in doc.paragraphs[:5]:
            if para.text.strip():
                text_sample = para.text[:100]
                break
        if text_sample:
            print(f" - Sample text: {text_sample}...")
        print(f"✓ Can read document content")
        return True
    except Exception as e:
        print(f"❌ Error: {type(e).__name__}: {e}")
        print(f"\nFull traceback:")
        traceback.print_exc()
        return False
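The sample-extraction step can be isolated as a small pure function, which makes its limits easy to see. This is a minimal sketch, not part of the original module: the name `first_text_sample` and its parameters are hypothetical, and it operates on a plain list of strings instead of python-docx paragraph objects.

```python
def first_text_sample(paragraph_texts, max_paragraphs=5, max_chars=100):
    """Return up to max_chars of the first non-blank text among the first
    max_paragraphs entries, or an empty string if none is found."""
    for text in paragraph_texts[:max_paragraphs]:
        if text.strip():
            # Truncate so long paragraphs do not flood the console.
            return text[:max_chars]
    return ""


# Blank and whitespace-only entries are skipped.
texts = ["", "   ", "Quarterly report for the sales team", "Second paragraph"]
sample = first_text_sample(texts)
print(sample)
```

Text that first appears after the fifth entry is never sampled, which mirrors the `doc.paragraphs[:5]` slice in the function.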
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| file_path | - | - | positional_or_keyword |
Parameter Details
file_path: Path to the Word document file to test. Can be a string or Path object representing an absolute or relative file path to a .docx file. The function will convert it to a Path object internally for validation and then to a string for python-docx compatibility.
Return Value
Returns a boolean value: True if the file exists and python-docx opens and parses it without raising an exception; False if the file is missing or any error occurs during the process (corrupted file, parsing errors, python-docx not installed, etc.). Progress messages are printed to stdout regardless of success or failure; on an exception, traceback.print_exc() additionally writes the full traceback to stderr.
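Because the result is a plain boolean, several files can be checked in one pass and summarized. The sketch below is illustrative only: `summarize_checks` and `stand_in_check` are hypothetical names, and the stand-in takes the place of `test_docx_file` so the snippet runs without python-docx or real documents.

```python
def summarize_checks(paths, check):
    """Run check on each path and return (results, failed_paths)."""
    results = {str(path): bool(check(path)) for path in paths}
    failed = [path for path, ok in results.items() if not ok]
    return results, failed


# Stand-in check used only for this illustration: it "passes" .docx names.
def stand_in_check(path):
    return str(path).lower().endswith(".docx")


results, failed = summarize_checks(["a.docx", "notes.txt"], stand_in_check)

# A non-zero exit code is a common way to signal failure to a calling script.
exit_code = 1 if failed else 0
print(f"{len(failed)} of {len(results)} checks failed")
```

In real use, `test_docx_file` would be passed as the `check` argument in place of the stand-in.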
Dependencies
- python-docx
- pathlib
Required Imports
from pathlib import Path
import traceback
from docx import Document as DocxDocument
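Of these imports, only `docx` comes from a third-party package (python-docx), so its availability can be checked before any files are tested. This is a minimal sketch using the standard library; the helper name `module_available` is hypothetical.

```python
import importlib.util


def module_available(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# "docx" is the import name provided by the python-docx distribution.
if module_available("docx"):
    print("python-docx is available")
else:
    print("python-docx is missing; install it with: pip install python-docx")
```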
Conditional/Optional Imports
These imports are only needed under specific conditions:
from docx import Document as DocxDocument
Condition: imported inside the function's try block, required for all executions of this function
Required (conditional)
Usage Example
from pathlib import Path
import traceback
from docx import Document as DocxDocument

def test_docx_file(file_path):
    """Test opening a Word document"""
    print(f"\n{'='*80}")
    print(f"Testing Word Document: {Path(file_path).name}")
    print(f"{'='*80}")
    try:
        from docx import Document as DocxDocument
        file_path_obj = Path(file_path)
        if not file_path_obj.exists():
            print(f"❌ File does not exist!")
            return False
        file_size = file_path_obj.stat().st_size
        print(f"✓ File exists, size: {file_size:,} bytes ({file_size/1024/1024:.2f} MB)")
        print(f"Attempting to open with python-docx...")
        doc = DocxDocument(str(file_path))
        print(f"✓ Successfully opened with python-docx")
        print(f" - Number of paragraphs: {len(doc.paragraphs)}")
        text_sample = ""
        for para in doc.paragraphs[:5]:
            if para.text.strip():
                text_sample = para.text[:100]
                break
        if text_sample:
            print(f" - Sample text: {text_sample}...")
        print(f"✓ Can read document content")
        return True
    except Exception as e:
        print(f"❌ Error: {type(e).__name__}: {e}")
        print(f"\nFull traceback:")
        traceback.print_exc()
        return False

# Example usage
result = test_docx_file('example_document.docx')
if result:
    print("Document test passed!")
else:
    print("Document test failed!")

# Test with Path object
doc_path = Path('/path/to/documents/report.docx')
success = test_docx_file(doc_path)
Best Practices
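- Ensure the python-docx library is installed before calling this function. Because the import sits inside the try block, a missing library is reported as an error and the function returns False instead of raising.
- Progress messages are printed to stdout, while traceback.print_exc() writes the traceback to stderr; redirect both streams if the output is needed for logging.
- Use the boolean return value for programmatic success checks; the detailed diagnostics appear only in the console output.
- The file path can be a string or a Path object; the function handles both.
- The sample is limited to the first 100 characters of the first non-empty paragraph, which keeps the console output short even for large documents.
- The function catches every exception derived from Exception and prints the full traceback for debugging; KeyboardInterrupt and SystemExit are not caught.
- File size is reported in both bytes and megabytes (bytes divided by 1024 twice) for convenience.
- Only the first 5 paragraphs are searched for sample text, which bounds the search; a document whose first five paragraphs are all empty produces no sample line.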
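Capturing the diagnostics for a log means redirecting two streams, since `print` writes to stdout and `traceback.print_exc()` writes to stderr by default. The following is a minimal sketch under stated assumptions: `run_and_capture` and `failing_check` are hypothetical names, and `failing_check` only mimics the error path of `test_docx_file` so the snippet runs without python-docx.

```python
import contextlib
import io
import traceback


def run_and_capture(check, *args):
    """Run check(*args) and return (result, stdout_text, stderr_text)."""
    out, err = io.StringIO(), io.StringIO()
    with contextlib.redirect_stdout(out), contextlib.redirect_stderr(err):
        result = check(*args)
    return result, out.getvalue(), err.getvalue()


def failing_check(file_path):
    """Stand-in that mimics the error path of test_docx_file."""
    try:
        raise ValueError(f"cannot parse {file_path}")
    except Exception as e:
        print(f"Error: {type(e).__name__}: {e}")  # written to stdout
        traceback.print_exc()  # written to stderr
        return False


result, out_text, err_text = run_and_capture(failing_check, "broken.docx")
```

The captured strings can then be passed to a logger or written to a file, and `test_docx_file` can be substituted for the stand-in.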
Similar Components
AI-powered semantic similarity - components with related functionality:
- function main_v64 (71.6% similar)
- function test_pptx_file (70.3% similar)
- function test_document_extractor (66.8% similar)
- function validate_document_structure (63.8% similar)
- function explore_documents (62.2% similar)