🔍 Code Extractor

function extract_metadata_docx

Maturity: 55

Extracts comprehensive metadata from Microsoft Word DOCX files, including document properties, statistics, and fallback title extraction from content or filename.

File: /tf/active/vicechatdev/CDocs/utils/document_processor.py
Lines: 167 - 218
Complexity: moderate

Purpose

This function reads DOCX files and extracts metadata such as title, author, creation/modification dates, revision information, and document statistics (page count, word count). It implements multiple fallback strategies for title extraction: first from document properties, then from the first heading paragraph, and finally from the filename. The function gracefully handles missing dependencies and file errors by returning minimal metadata with at least a title derived from the filename.
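The three-step title fallback described above can be sketched in isolation. This is a minimal illustration rather than the module's code: `resolve_title` and its parameters are hypothetical names, and the first paragraph's text and style name are passed as plain strings so the sketch runs without python-docx.

```python
import os
from typing import Optional


def resolve_title(property_title: Optional[str],
                  first_para_text: Optional[str],
                  first_para_style: Optional[str],
                  file_path: str) -> str:
    """Hypothetical helper mirroring the described fallback order."""
    # 1. Prefer the title stored in the document properties.
    if property_title:
        return property_title
    # 2. Otherwise use the first paragraph, but only if it is heading-styled.
    if first_para_text and first_para_style and 'heading' in first_para_style.lower():
        return first_para_text
    # 3. Finally fall back to the filename without its extension.
    return os.path.splitext(os.path.basename(file_path))[0]


print(resolve_title('', 'Quarterly Report', 'Heading 1', '/tmp/q3.docx'))  # Quarterly Report
print(resolve_title(None, 'Some body text', 'Normal', '/tmp/q3.docx'))     # q3
```

Keeping the fallback in one pure function makes the order of precedence easy to test without building real DOCX fixtures.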

Source Code

def extract_metadata_docx(file_path: str) -> Dict[str, Any]:
    """
    Extract metadata from a DOCX file.
    
    Args:
        file_path: Path to DOCX file
        
    Returns:
        Dictionary with extracted metadata
    """
    if not DOCX_AVAILABLE:
        logger.warning("python-docx library not available. Cannot extract DOCX metadata.")
        return {'title': os.path.splitext(os.path.basename(file_path))[0]}
        
    try:
        doc = DocxDocument(file_path)
        properties = doc.core_properties
        
        metadata = {
            # Leave the title empty when unset so the heading and filename fallbacks below can run
            'title': properties.title or '',
            'author': properties.author or '',
            'created': properties.created,
            'modified': properties.modified,
            'lastModifiedBy': properties.last_modified_by or '',
            'revision': properties.revision or '1',
            'category': properties.category or '',
            'keywords': properties.keywords or '',
            'comments': properties.comments or '',
            'contentStatus': properties.content_status or '',
            'pageCount': len(doc.sections),
            'wordCount': sum(len(p.text.split()) for p in doc.paragraphs)
        }
        
        # Try to extract title from document if not in properties
        if not metadata['title'] and len(doc.paragraphs) > 0:
            # Use first paragraph as title if it's styled as a heading
            first_para = doc.paragraphs[0]
            if first_para.style and 'heading' in first_para.style.name.lower():
                metadata['title'] = first_para.text
                
        # If still no title, use filename
        if not metadata['title']:
            metadata['title'] = os.path.splitext(os.path.basename(file_path))[0]
            
        return metadata
        
    except PackageNotFoundError:
        logger.error(f"Not a valid DOCX file: {file_path}")
        return {'title': os.path.splitext(os.path.basename(file_path))[0]}
    except Exception as e:
        logger.error(f"Error extracting DOCX metadata: {e}")
        return {'title': os.path.splitext(os.path.basename(file_path))[0]}

Parameters

Name        Type   Default   Kind
file_path   str    -         positional_or_keyword

Parameter Details

file_path: String path to the DOCX file to be analyzed. Must be a valid file system path pointing to a Microsoft Word document in DOCX format (not DOC). The file should be readable and properly formatted as a valid DOCX package.

Return Value

Type: Dict[str, Any]

Returns a dictionary of extracted metadata with the following keys:

  • 'title' (str)
  • 'author' (str)
  • 'created' (datetime or None)
  • 'modified' (datetime or None)
  • 'lastModifiedBy' (str)
  • 'revision' (the integer reported by python-docx, or the string '1' when that value is unset or zero)
  • 'category' (str)
  • 'keywords' (str)
  • 'comments' (str)
  • 'contentStatus' (str)
  • 'pageCount' (int): the number of sections, not rendered pages
  • 'wordCount' (int): whitespace-separated words summed over doc.paragraphs

In error cases (python-docx unavailable, an invalid DOCX package, or any other exception), the function returns a minimal dictionary containing only 'title', set to the filename without its extension.
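Because the error path returns only 'title', callers may want a way to tell the two result shapes apart. The sketch below is one possible check; `is_minimal_metadata` and `FULL_ONLY_KEYS` are hypothetical names that are not part of the module, and the key set is copied from the keys described above.

```python
from typing import Any, Dict

# Keys that appear only in a full result (assumed from the description above).
FULL_ONLY_KEYS = {'author', 'created', 'modified', 'lastModifiedBy', 'revision',
                  'category', 'keywords', 'comments', 'contentStatus',
                  'pageCount', 'wordCount'}


def is_minimal_metadata(metadata: Dict[str, Any]) -> bool:
    """Hypothetical helper: True when only the filename-derived title came back."""
    # Iterating a dict yields its keys, so this checks for any full-result key.
    return FULL_ONLY_KEYS.isdisjoint(metadata)


minimal = {'title': 'report'}
full = {'title': 'Report', 'author': 'A. Writer', 'wordCount': 120}
print(is_minimal_metadata(minimal))  # True
print(is_minimal_metadata(full))     # False
```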

Dependencies

  • python-docx
  • logging
  • os
  • typing

Required Imports

import logging
import os
from typing import Dict, Any
from docx import Document as DocxDocument
from docx.opc.exceptions import PackageNotFoundError

Conditional/Optional Imports

These imports are only needed under specific conditions:

from docx import Document as DocxDocument

Condition: required (conditional) for DOCX file processing. The function checks the DOCX_AVAILABLE flag and falls back to filename-only metadata when the library is not available.

from docx.opc.exceptions import PackageNotFoundError

Condition: required (conditional) for handling invalid DOCX file errors.
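The source excerpt does not show how DOCX_AVAILABLE is set. A common pattern, sketched here as an assumption rather than the module's actual code, is to guard the optional imports and record the outcome in the flag. The placeholder `PackageNotFoundError` class is illustrative only; it keeps an `except PackageNotFoundError` clause resolvable when python-docx is missing.

```python
import logging

logger = logging.getLogger(__name__)

try:
    from docx import Document as DocxDocument
    from docx.opc.exceptions import PackageNotFoundError
    DOCX_AVAILABLE = True
except ImportError:
    # python-docx is not installed: define placeholders so later references
    # to these names still resolve.
    DocxDocument = None

    class PackageNotFoundError(Exception):
        """Placeholder used only when python-docx is missing."""

    DOCX_AVAILABLE = False

logger.debug("python-docx available: %s", DOCX_AVAILABLE)
```

With this guard in place, the module can be imported on systems without python-docx, and the function's early return handles the missing dependency.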

Usage Example

# Import path assumed from the file location listed above; adjust it to your package layout.
from CDocs.utils.document_processor import extract_metadata_docx

# Usage example
file_path = '/path/to/document.docx'
metadata = extract_metadata_docx(file_path)

# 'title' is always present. The other keys are missing when extraction falls
# back to filename-only metadata, so read them with .get().
print(f"Title: {metadata['title']}")
print(f"Author: {metadata.get('author', '')}")
print(f"Word Count: {metadata.get('wordCount', 0)}")
print(f"Created: {metadata.get('created')}")
print(f"Modified: {metadata.get('modified')}")
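The 'created' and 'modified' values may be datetime objects, None, or absent altogether, so display code benefits from handling all three cases in one place. A minimal sketch, assuming a hypothetical helper named `describe_timestamp` that is not part of the module:

```python
from datetime import datetime
from typing import Any, Dict


def describe_timestamp(metadata: Dict[str, Any], key: str) -> str:
    """Hypothetical helper: ISO string for a datetime field, else 'unknown'."""
    value = metadata.get(key)  # None when the key is missing or unset
    return value.isoformat() if isinstance(value, datetime) else 'unknown'


sample = {'title': 'report', 'created': datetime(2024, 1, 15, 9, 30)}
print(describe_timestamp(sample, 'created'))   # 2024-01-15T09:30:00
print(describe_timestamp(sample, 'modified'))  # unknown
```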

Best Practices

  • Always check if the file exists and is readable before calling this function
  • Handle the case where only minimal metadata (title from filename) is returned due to errors or missing dependencies
  • Be aware that 'pageCount' returns the number of sections, not actual pages, which may differ from the page count shown in Word
  • The function returns datetime objects for 'created' and 'modified' fields, which may be None if not set in the document
  • Ensure the DOCX_AVAILABLE flag is properly set based on whether python-docx is installed to avoid import errors
  • The word count calculation splits on whitespace and may not match Word's exact count for documents with special formatting
  • Because the function catches exceptions internally and logs them, a surrounding try-except will rarely trigger; detect failures by checking whether keys beyond 'title' are present in the result
  • The heading fallback inspects only the first paragraph and matches any style whose name contains 'heading', so a title that appears later in the document, or that uses a differently named style such as 'Title', is not picked up
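Since 'pageCount' reflects sections rather than pages, a caller who needs the figure Word displays could read the cached statistics that many DOCX packages store in docProps/app.xml. The sketch below is an assumption-laden alternative, not something the documented function does: `read_cached_count` is a hypothetical helper, the part is optional, and its values are whatever the last saving application wrote, so they may be missing or stale.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import IO, Optional, Union

# Namespace of the extended-properties part (docProps/app.xml) in OOXML packages.
EXT_NS = '{http://schemas.openxmlformats.org/officeDocument/2006/extended-properties}'


def read_cached_count(docx_file: Union[str, IO[bytes]], tag: str = 'Pages') -> Optional[int]:
    """Hypothetical helper: return the cached <Pages> or <Words> value, if any."""
    try:
        with zipfile.ZipFile(docx_file) as package:
            root = ET.fromstring(package.read('docProps/app.xml'))
    except (KeyError, zipfile.BadZipFile, ET.ParseError, OSError):
        return None  # part missing, not a zip archive, or unreadable
    element = root.find(f'{EXT_NS}{tag}')
    if element is None or not (element.text or '').strip().isdigit():
        return None
    return int(element.text)


# Build a tiny in-memory package containing only app.xml to exercise the helper.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as package:
    package.writestr(
        'docProps/app.xml',
        '<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/'
        '2006/extended-properties"><Pages>3</Pages><Words>412</Words></Properties>')
buffer.seek(0)
print(read_cached_count(buffer, 'Pages'))  # 3
```

Returning None instead of raising keeps the helper consistent with the documented function's habit of degrading quietly when data is unavailable.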

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function extract_metadata 68.8% similar

    Extracts metadata from file content by analyzing the file type and computing file properties including hash, size, and type-specific metadata.

    From: /tf/active/vicechatdev/CDocs/utils/document_processor.py
  • function extract_metadata_pdf 68.4% similar

    Extracts metadata from PDF files including title, author, creation date, page count, and other document properties using PyPDF2 library.

    From: /tf/active/vicechatdev/CDocs/utils/document_processor.py
  • function process_document 67.9% similar

    Processes a document file (DOCX, DOC, or PDF) and extracts comprehensive metadata including file information, content metadata, and cryptographic hash.

    From: /tf/active/vicechatdev/CDocs/utils/document_processor.py
  • function extract_excel_metadata 63.6% similar

    Extracts comprehensive metadata from Excel files including cell comments, merged regions, named ranges, document properties, and sheet-level information that standard pandas operations miss.

    From: /tf/active/vicechatdev/vice_ai/smartstat_service.py
  • function extract_document_sections 62.8% similar

    Extracts structured sections from a DOCX document by parsing headings and organizing content under each heading into a dictionary.

    From: /tf/active/vicechatdev/CDocs/utils/document_processor.py