function extract_metadata_docx
Extracts comprehensive metadata from Microsoft Word DOCX files, including document properties, statistics, and fallback title extraction from content or filename.
/tf/active/vicechatdev/CDocs/utils/document_processor.py
167 - 218
moderate
Purpose
This function reads a DOCX file and extracts metadata such as title, author, creation and modification dates, revision number, and document statistics (a section count reported as 'pageCount', and a word count). Title extraction is meant to fall back from the document properties to the first heading-styled paragraph and finally to the filename. As written, however, the initial metadata dictionary already substitutes the filename when the title property is empty, so the heading check runs only if the filename stem is itself empty. When python-docx is missing or the file cannot be read, the function logs the problem and returns minimal metadata containing only a title derived from the filename.
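The intended fallback order (document property, then heading, then filename) can be sketched as a small pure function. This is an illustrative sketch, not the code in document_processor.py: `resolve_title` and its parameters are hypothetical names, and the caller is assumed to pass in the heading text only when the first paragraph is heading-styled.

```python
import os
from typing import Optional


def resolve_title(prop_title: Optional[str],
                  first_heading: Optional[str],
                  file_path: str) -> str:
    """Hypothetical helper: pick a title using the documented fallback order."""
    # 1. Title stored in the document's core properties
    if prop_title and prop_title.strip():
        return prop_title.strip()
    # 2. Text of the first paragraph, supplied only if it is heading-styled
    if first_heading and first_heading.strip():
        return first_heading.strip()
    # 3. Filename without directory or extension
    return os.path.splitext(os.path.basename(file_path))[0]


print(resolve_title("", "Quarterly Report", "/tmp/q3_report.docx"))  # -> Quarterly Report
print(resolve_title(None, None, "/tmp/q3_report.docx"))              # -> q3_report
```

Keeping the filename as the last step, rather than as the initial default, is what lets the heading fallback take effect.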
Source Code
def extract_metadata_docx(file_path: str) -> Dict[str, Any]:
    """
    Extract metadata from a DOCX file.

    Args:
        file_path: Path to DOCX file

    Returns:
        Dictionary with extracted metadata
    """
    if not DOCX_AVAILABLE:
        logger.warning("python-docx library not available. Cannot extract DOCX metadata.")
        return {'title': os.path.splitext(os.path.basename(file_path))[0]}

    try:
        doc = DocxDocument(file_path)
        properties = doc.core_properties

        metadata = {
            'title': properties.title or os.path.splitext(os.path.basename(file_path))[0],
            'author': properties.author or '',
            'created': properties.created,
            'modified': properties.modified,
            'lastModifiedBy': properties.last_modified_by or '',
            'revision': properties.revision or '1',
            'category': properties.category or '',
            'keywords': properties.keywords or '',
            'comments': properties.comments or '',
            'contentStatus': properties.content_status or '',
            'pageCount': len(doc.sections),
            'wordCount': sum(len(p.text.split()) for p in doc.paragraphs)
        }

        # Try to extract title from document if not in properties
        if not metadata['title'] and len(doc.paragraphs) > 0:
            # Use first paragraph as title if it's styled as a heading
            first_para = doc.paragraphs[0]
            if first_para.style and 'heading' in first_para.style.name.lower():
                metadata['title'] = first_para.text

        # If still no title, use filename
        if not metadata['title']:
            metadata['title'] = os.path.splitext(os.path.basename(file_path))[0]

        return metadata

    except PackageNotFoundError:
        logger.error(f"Not a valid DOCX file: {file_path}")
        return {'title': os.path.splitext(os.path.basename(file_path))[0]}
    except Exception as e:
        logger.error(f"Error extracting DOCX metadata: {e}")
        return {'title': os.path.splitext(os.path.basename(file_path))[0]}
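The function relies on module-level names (`logger`, `DOCX_AVAILABLE`, `DocxDocument`, `PackageNotFoundError`) that the excerpt does not show. One common way to define them is a guarded import. The following is a minimal sketch under that assumption, not necessarily how document_processor.py sets them up; the fallback assignments in the `except` branch are illustrative choices.

```python
import logging

logger = logging.getLogger(__name__)

try:
    # The package is installed as "python-docx" but imported as "docx"
    from docx import Document as DocxDocument
    from docx.opc.exceptions import PackageNotFoundError
    DOCX_AVAILABLE = True
except ImportError:
    # Assumed fallback: keep the names defined so later references
    # (including the "except PackageNotFoundError" clause) do not raise NameError
    DocxDocument = None
    PackageNotFoundError = Exception
    DOCX_AVAILABLE = False
    logger.warning("python-docx not installed; DOCX metadata extraction disabled")
```

With this pattern the flag always reflects whether the import succeeded, so it never has to be set by hand.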
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| file_path | str | - | positional_or_keyword |
Parameter Details
file_path: String path to the DOCX file to be analyzed. Must be a valid file system path pointing to a Microsoft Word document in DOCX format (not DOC). The file should be readable and properly formatted as a valid DOCX package.
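Because the function reports every failure the same way (a title-only dictionary), a caller may want to pre-check the path. The sketch below uses only the standard library and relies on the fact that a DOCX file is a ZIP package with a `[Content_Types].xml` part at its root; `looks_like_docx` is a hypothetical helper name, and passing this check does not guarantee that python-docx can open the file.

```python
import os
import zipfile


def looks_like_docx(file_path: str) -> bool:
    """Hypothetical pre-check: cheap tests before handing a path to the extractor."""
    if not os.path.isfile(file_path):
        return False
    if not file_path.lower().endswith(".docx"):
        return False
    # A DOCX file is a ZIP package; legacy binary .doc files are not
    if not zipfile.is_zipfile(file_path):
        return False
    with zipfile.ZipFile(file_path) as package:
        # Every OPC package carries a content-types manifest at its root
        return "[Content_Types].xml" in package.namelist()
```

A caller could skip or flag files for which this returns False instead of calling `extract_metadata_docx` on them.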
Return Value
Type: Dict[str, Any]
Returns a dictionary of extracted metadata. Keys: 'title' (str), 'author' (str), 'created' (datetime or None), 'modified' (datetime or None), 'lastModifiedBy' (str), 'revision' (int as reported by python-docx, or the string '1' when the stored value is missing or zero), 'category' (str), 'keywords' (str), 'comments' (str), 'contentStatus' (str), 'pageCount' (int, the number of sections rather than rendered pages), 'wordCount' (int, whitespace-separated tokens summed over the body paragraphs; text in tables, headers, and footers is not counted). In error cases, or when python-docx is unavailable, returns a minimal dictionary whose only key is 'title', set to the filename without its extension.
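Two consequences of this return shape are worth handling on the caller's side: the minimal result has to be detected by its keys, and the datetime values need converting before JSON serialization. The sketch below illustrates both; `is_minimal` and `to_json` are hypothetical helpers, and the sample dictionary is made up for the demonstration.

```python
import json
from datetime import datetime
from typing import Any, Dict


def is_minimal(metadata: Dict[str, Any]) -> bool:
    """True when only the filename-derived title came back."""
    return set(metadata) == {"title"}


def to_json(metadata: Dict[str, Any]) -> str:
    """Serialize metadata, rendering datetime values as ISO 8601 strings."""
    def encode(value: Any) -> str:
        if isinstance(value, datetime):
            return value.isoformat()
        raise TypeError(f"Cannot serialize {type(value).__name__}")
    return json.dumps(metadata, default=encode, sort_keys=True)


# Made-up sample resembling a partial result
sample = {"title": "Report", "created": datetime(2024, 1, 15, 9, 30), "modified": None}
print(is_minimal(sample))  # -> False
print(to_json(sample))     # -> {"created": "2024-01-15T09:30:00", "modified": null, "title": "Report"}
```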
Dependencies
- python-docx
- logging
- os
- typing
Required Imports
import logging
import os
from typing import Dict, Any
from docx import Document as DocxDocument
from docx.opc.exceptions import PackageNotFoundError
Conditional/Optional Imports
These imports are only needed under specific conditions:
from docx import Document as DocxDocument
Condition: Required for DOCX file processing. The function checks the DOCX_AVAILABLE flag and falls back to filename-only metadata if python-docx is not available.
Required (conditional)
from docx.opc.exceptions import PackageNotFoundError
Condition: Required for handling invalid DOCX file errors.
Required (conditional)
Usage Example
import logging
import os
from typing import Dict, Any

from docx import Document as DocxDocument
from docx.opc.exceptions import PackageNotFoundError

# Setup logger
logger = logging.getLogger(__name__)
DOCX_AVAILABLE = True  # Flag indicating python-docx is installed

def extract_metadata_docx(file_path: str) -> Dict[str, Any]:
    # ... paste the function body from the Source Code section here ...
    raise NotImplementedError("replace this stub with the function body")

# Usage example
file_path = '/path/to/document.docx'
metadata = extract_metadata_docx(file_path)

# 'title' is always present; the other keys are missing when extraction failed
print(f"Title: {metadata['title']}")
print(f"Author: {metadata.get('author', '')}")
print(f"Word Count: {metadata.get('wordCount')}")
print(f"Created: {metadata.get('created')}")
print(f"Modified: {metadata.get('modified')}")
Best Practices
- Always check if the file exists and is readable before calling this function
- Handle the case where only minimal metadata (title from filename) is returned due to errors or missing dependencies
- Be aware that 'pageCount' returns the number of sections, not actual pages, which may differ from the page count shown in Word
- The function returns datetime objects for 'created' and 'modified' fields, which may be None if not set in the document
- Ensure the DOCX_AVAILABLE flag is properly set based on whether python-docx is installed to avoid import errors
- The word count calculation splits on whitespace and may not match Word's exact count for documents with special formatting
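- The function already catches exceptions raised while reading the document and returns minimal metadata instead, so in production code check for that minimal result (only 'title' present) rather than relying on an outer try-except alone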
- Title precedence is the title document property first, then the filename; the heading-styled first paragraph is consulted only if both are empty, so the returned title may be a filename rather than the intended document title
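As the list above notes, 'pageCount' is a section count. Applications such as Word usually cache a page count in the package's extended-properties part, `docProps/app.xml`, and that value can be read with the standard library alone. The following is a sketch, not part of document_processor.py: `stored_page_count` is a hypothetical helper, and the cached value is written by whichever application last saved the file, so it may be absent or stale.

```python
import zipfile
import xml.etree.ElementTree as ET
from typing import Optional

# Namespace of the extended-properties part (docProps/app.xml)
EXT_NS = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"


def stored_page_count(file_path: str) -> Optional[int]:
    """Hypothetical helper: return the page count cached in docProps/app.xml, if any."""
    try:
        with zipfile.ZipFile(file_path) as package:
            xml_bytes = package.read("docProps/app.xml")
    except (KeyError, zipfile.BadZipFile, OSError):
        return None  # part missing, not a ZIP package, or file unreadable
    try:
        pages = ET.fromstring(xml_bytes).find(f"{{{EXT_NS}}}Pages")
    except ET.ParseError:
        return None  # malformed XML
    text = (pages.text or "").strip() if pages is not None else ""
    # Accept only a plain non-negative integer
    return int(text) if text.isdigit() else None
```

A caller could prefer this value when it is available and fall back to the section count otherwise.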
Similar Components
AI-powered semantic similarity - components with related functionality:
- function extract_metadata 68.8% similar
- function extract_metadata_pdf 68.4% similar
- function process_document 67.9% similar
- function extract_excel_metadata 63.6% similar
- function extract_document_sections 62.8% similar