🔍 Code Extractor

function process_document

Maturity: 67

Processes a document file (DOCX, DOC, or PDF) and returns a metadata dictionary that combines file information, format-specific content metadata, and a SHA-256 hash of the file contents.

File: /tf/active/vicechatdev/CDocs/utils/document_processor.py
Lines: 41 - 101
Complexity: moderate

Purpose

This function serves as a document ingestion pipeline component. It validates that the file exists, determines the file type from the extension, extracts format-specific metadata, calculates a SHA-256 hash for integrity verification, and consolidates everything into a single structured metadata dictionary. It is designed for document management systems that need to catalog and track documents with versioning and departmental organization.
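
The integrity check that the stored hash enables can be sketched as follows. This is a minimal illustration, not part of the CDocs module: `verify_file_hash` is a hypothetical helper name, and it assumes the expected value is the `fileHash` hexdigest recorded when the document was first processed.

```python
import hashlib


def verify_file_hash(file_path: str, expected_hash: str) -> bool:
    """Return True if the file's current SHA-256 hexdigest matches expected_hash.

    Hypothetical helper: expected_hash is assumed to be the 'fileHash'
    value recorded when the document was first processed.
    """
    with open(file_path, "rb") as f:
        actual_hash = hashlib.sha256(f.read()).hexdigest()
    # hexdigest() is lowercase, so normalize the stored value before comparing.
    return actual_hash == expected_hash.strip().lower()
```

A caller could run this check before serving or re-importing a stored document and treat a `False` result as a sign that the file has changed since it was cataloged.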

Source Code

def process_document(file_path: str, doc_type: str, department: str, 
                    version: str = "1.0") -> Dict[str, Any]:
    """
    Process a document file and extract relevant metadata.
    
    Args:
        file_path: Path to document file
        doc_type: Document type code
        department: Department code
        version: Version string
        
    Returns:
        Dictionary with extracted metadata
        
    Raises:
        DocumentProcessingError: If document processing fails
    """
    try:
        # Check file existence
        if not os.path.exists(file_path):
            raise DocumentProcessingError(f"File not found: {file_path}")
            
        # Get file extension
        _, ext = os.path.splitext(file_path)
        ext = ext.lower()
        
        # Extract metadata based on file type
        metadata = {}
        if ext in ['.docx', '.doc']:
            metadata = extract_metadata_docx(file_path)
        elif ext == '.pdf':
            metadata = extract_metadata_pdf(file_path)
        else:
            raise DocumentProcessingError(f"Unsupported file type: {ext}")
            
        # Calculate file hash
        with open(file_path, 'rb') as f:
            file_content = f.read()
            file_hash = hashlib.sha256(file_content).hexdigest()
            
        # Add file metadata
        file_size = os.path.getsize(file_path)
        file_info = {
            'fileName': os.path.basename(file_path),
            'fileSize': file_size,
            'filePath': file_path,
            'fileType': ext[1:],  # Remove leading dot
            'fileHash': file_hash,
            'docType': doc_type,
            'department': department,
            'version': version,
            'processedDate': datetime.now()
        }
        
        metadata.update(file_info)
        
        return metadata
        
    except Exception as e:
        logger.error(f"Error processing document: {e}")
        raise DocumentProcessingError(f"Document processing failed: {e}")
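
One consequence of the listing above: the final `except Exception` clause also catches the `DocumentProcessingError` raised for a missing file or an unsupported extension, so callers see a doubly prefixed message such as "Document processing failed: File not found: ...". The sketch below shows one possible way to avoid that, using only the validation steps. `validate_document_path` is a hypothetical helper, and the exception class here is a stand-in for the module's own.

```python
import logging
import os

logger = logging.getLogger(__name__)


class DocumentProcessingError(Exception):
    """Stand-in for the module's exception class (assumption)."""


def validate_document_path(file_path: str) -> str:
    """Hypothetical helper: check the path and return the lowercased extension."""
    try:
        if not os.path.exists(file_path):
            raise DocumentProcessingError(f"File not found: {file_path}")
        ext = os.path.splitext(file_path)[1].lower()
        if ext not in (".docx", ".doc", ".pdf"):
            raise DocumentProcessingError(f"Unsupported file type: {ext}")
        return ext
    except DocumentProcessingError:
        # Already a domain error: re-raise unchanged so the message is not wrapped twice.
        raise
    except Exception as e:
        logger.error(f"Error processing document: {e}")
        # 'from e' keeps the original traceback attached to the wrapped error.
        raise DocumentProcessingError(f"Document processing failed: {e}") from e
```

With this ordering, only unexpected errors are logged and wrapped, while validation failures reach the caller with their original message.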

Parameters

Name        Type  Default  Kind
file_path   str   -        positional_or_keyword
doc_type    str   -        positional_or_keyword
department  str   -        positional_or_keyword
version     str   '1.0'    positional_or_keyword

Parameter Details

file_path: Absolute or relative path to the document file to be processed. Must point to an existing file with .docx, .doc, or .pdf extension. The function will validate file existence before processing.

doc_type: Document type classification code used for categorizing the document within the system. This is a custom identifier that should match the organization's document taxonomy (e.g., 'INVOICE', 'CONTRACT', 'REPORT').

department: Department code identifying which organizational unit owns or is responsible for the document. Should align with the organization's department coding system (e.g., 'HR', 'FIN', 'ENG').

version: Version string stored with the document metadata. Defaults to '1.0'. The function does not validate the format, so follow your organization's versioning convention consistently (e.g., '1.0', '2.1', 'draft-v3').

Return Value

Type: Dict[str, Any]

Returns a dictionary of document metadata. The function sets the following keys itself:

  • fileName: basename of the file
  • fileSize: size in bytes
  • filePath: the path as passed in
  • fileType: lowercased extension without the leading dot
  • fileHash: SHA-256 hexdigest of the file contents
  • docType: the provided type code
  • department: the provided department code
  • version: the provided version string
  • processedDate: datetime object recording the processing time (from datetime.now(), so naive local time)

The dictionary also contains the format-specific metadata returned by extract_metadata_docx() or extract_metadata_pdf() (which may include author, title, creation date, etc.). Because the file-level keys are applied last with metadata.update(), they overwrite any same-named keys returned by the extractors.
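
Because `processedDate` is a `datetime` object, the returned dictionary cannot be passed to `json.dumps` directly. A minimal sketch of one way to serialize it is shown below. `metadata_to_json` is a hypothetical helper, and `sample` is made-up data that only mimics the shape of the return value.

```python
import json
from datetime import datetime


def metadata_to_json(metadata: dict) -> str:
    """Serialize a metadata dictionary, converting datetime values to ISO 8601 strings."""

    def _default(value):
        # json.dumps calls this only for values it cannot encode itself.
        if isinstance(value, datetime):
            return value.isoformat()
        raise TypeError(f"Not JSON serializable: {type(value).__name__}")

    return json.dumps(metadata, default=_default)


# Illustrative data only; a real dictionary comes from process_document().
sample = {
    "fileName": "report.pdf",
    "fileSize": 1024,
    "processedDate": datetime(2024, 1, 15, 9, 30),
}
print(metadata_to_json(sample))
```

The same `default` hook also covers any `datetime` values that the format-specific extractors may return, such as a creation date.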

Dependencies

  • os
  • hashlib
  • datetime
  • typing
  • logging
  • docx
  • PyPDF2
  • CDocs

Required Imports

import os
import hashlib
from datetime import datetime
from typing import Dict, Any
import logging

Conditional/Optional Imports

These imports are only needed under specific conditions:

from docx import Document as DocxDocument

Condition: Required when processing .docx or .doc files; used by the extract_metadata_docx() helper function
Status: Required (conditional)

import PyPDF2

Condition: Required when processing .pdf files; used by the extract_metadata_pdf() helper function
Status: Required (conditional)

from CDocs.models.document import DocumentVersion

Condition: May be required by the CDocs application context for database operations
Status: Optional

from CDocs.config import settings

Condition: May be required for application-specific configuration settings
Status: Optional

Usage Example

import os
import hashlib
from datetime import datetime
from typing import Dict, Any
import logging

# Setup logger
logger = logging.getLogger(__name__)

# Define custom exception
class DocumentProcessingError(Exception):
    pass

# Define helper functions (simplified examples)
def extract_metadata_docx(file_path):
    from docx import Document
    doc = Document(file_path)
    return {'title': doc.core_properties.title, 'author': doc.core_properties.author}

def extract_metadata_pdf(file_path):
    import PyPDF2
    with open(file_path, 'rb') as f:
        reader = PyPDF2.PdfReader(f)
        # reader.metadata is None when the PDF has no document information
        info = reader.metadata or {}
        return {'pages': len(reader.pages), 'title': info.get('/Title', '')}

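# Use the function (process_document from the Source Code section is
# assumed to be defined in this module alongside the helpers above)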
try:
    metadata = process_document(
        file_path='/path/to/document.pdf',
        doc_type='REPORT',
        department='FINANCE',
        version='2.1'
    )
    
    print(f"Processed: {metadata['fileName']}")
    print(f"File Hash: {metadata['fileHash']}")
    print(f"Size: {metadata['fileSize']} bytes")
    print(f"Processed on: {metadata['processedDate']}")
    
except DocumentProcessingError as e:
    print(f"Error: {e}")

Best Practices

  • Always handle DocumentProcessingError exceptions when calling this function to gracefully manage processing failures
  • Ensure file_path points to an accessible file with appropriate read permissions before calling
  • Use consistent doc_type and department codes across your application to maintain data integrity
  • The function reads the entire file into memory for hashing - be cautious with very large files (>1GB)
  • The returned processedDate is a datetime object; serialize it (for example with isoformat()) before storing it in JSON or a database
  • Helper functions extract_metadata_docx() and extract_metadata_pdf() must be implemented and available in scope
  • Consider implementing file size limits to prevent memory issues with extremely large documents
  • The file hash (SHA-256) can be used for deduplication and integrity verification
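  • Supported file types are limited to .docx, .doc, and .pdf; validate input files before calling if you accept user uploads. Note that python-docx reads only the .docx format, so if extract_metadata_docx() relies on it, legacy .doc files are likely to fail during metadata extraction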
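
The memory and file-size points above can be combined in a single hashing helper. The sketch below is not the module's implementation: `hash_file_in_chunks`, the 1 GiB limit, and the 1 MiB chunk size are all illustrative assumptions. It produces the same SHA-256 hexdigest as hashing the whole file at once, but holds only one chunk in memory at a time.

```python
import hashlib
import os

MAX_FILE_SIZE = 1024 ** 3   # assumed limit of 1 GiB; tune for your deployment
CHUNK_SIZE = 1024 * 1024    # read 1 MiB at a time


def hash_file_in_chunks(file_path: str,
                        max_size: int = MAX_FILE_SIZE,
                        chunk_size: int = CHUNK_SIZE) -> str:
    """Return the SHA-256 hexdigest of a file without loading it fully into memory."""
    size = os.path.getsize(file_path)
    if size > max_size:
        raise ValueError(f"File too large: {size} bytes (limit {max_size})")

    digest = hashlib.sha256()
    with open(file_path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"" at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Since the result is identical to `hashlib.sha256(file_content).hexdigest()`, hashes computed this way remain comparable with `fileHash` values already stored for deduplication or integrity checks.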

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function extract_metadata_docx 67.9% similar

    Extracts comprehensive metadata from Microsoft Word DOCX files, including document properties, statistics, and fallback title extraction from content or filename.

    From: /tf/active/vicechatdev/CDocs/utils/document_processor.py
  • function extract_metadata 66.6% similar

    Extracts metadata from file content by analyzing the file type and computing file properties including hash, size, and type-specific metadata.

    From: /tf/active/vicechatdev/CDocs/utils/document_processor.py
  • function extract_metadata_pdf 60.6% similar

    Extracts metadata from PDF files including title, author, creation date, page count, and other document properties using PyPDF2 library.

    From: /tf/active/vicechatdev/CDocs/utils/document_processor.py
  • function extract_document_sections 60.0% similar

    Extracts structured sections from a DOCX document by parsing headings and organizing content under each heading into a dictionary.

    From: /tf/active/vicechatdev/CDocs/utils/document_processor.py
  • function import_document_from_filecloud 58.3% similar

    Imports a document from FileCloud into the system by extracting metadata, creating a controlled document record, downloading the file content, creating a document version, and uploading it back to FileCloud with proper folder structure.

    From: /tf/active/vicechatdev/CDocs/FC_sync.py