process_document - Code Extractor

function process_document

Maturity: 67

Processes a document file (DOCX, DOC, or PDF) and extracts its metadata: file information, format-specific content metadata, and a SHA-256 hash of the file contents.

File:
/tf/active/vicechatdev/CDocs/utils/document_processor.py

Lines:
41 - 101

Complexity:
moderate

Purpose

This function is the ingestion step of a document pipeline. It checks that the file exists, determines the file type from its extension, extracts format-specific metadata, computes a SHA-256 hash of the file contents for integrity verification, and merges everything into a single metadata dictionary. It is designed for document management systems that catalog and track documents by version and department.

Source Code

def process_document(file_path: str, doc_type: str, department: str, 
                    version: str = "1.0") -> Dict[str, Any]:
    """
    Process a document file and extract relevant metadata.
    
    Args:
        file_path: Path to document file
        doc_type: Document type code
        department: Department code
        version: Version string
        
    Returns:
        Dictionary with extracted metadata
        
    Raises:
        DocumentProcessingError: If document processing fails
    """
    try:
        # Check file existence
        if not os.path.exists(file_path):
            raise DocumentProcessingError(f"File not found: {file_path}")
            
        # Get file extension
        _, ext = os.path.splitext(file_path)
        ext = ext.lower()
        
        # Extract metadata based on file type
        metadata = {}
        if ext in ['.docx', '.doc']:
            metadata = extract_metadata_docx(file_path)
        elif ext == '.pdf':
            metadata = extract_metadata_pdf(file_path)
        else:
            raise DocumentProcessingError(f"Unsupported file type: {ext}")
            
        # Calculate file hash
        with open(file_path, 'rb') as f:
            file_content = f.read()
            file_hash = hashlib.sha256(file_content).hexdigest()
            
        # Add file metadata
        file_size = os.path.getsize(file_path)
        file_info = {
            'fileName': os.path.basename(file_path),
            'fileSize': file_size,
            'filePath': file_path,
            'fileType': ext[1:],  # Remove leading dot
            'fileHash': file_hash,
            'docType': doc_type,
            'department': department,
            'version': version,
            'processedDate': datetime.now()
        }
        
        metadata.update(file_info)
        
        return metadata
        
    except Exception as e:
        logger.error(f"Error processing document: {e}")
        raise DocumentProcessingError(f"Document processing failed: {e}")
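
One detail of the listing above is easy to miss: the outer except Exception clause also catches the DocumentProcessingError raised inside the try block, so "File not found" and "Unsupported file type" errors reach the caller wrapped in a second "Document processing failed" message. The following is a minimal standalone reproduction of that try/except shape, not the CDocs code itself; check_exists, error_message, and the exists flag are hypothetical names used only for illustration.

```python
class DocumentProcessingError(Exception):
    """Stand-in for the CDocs exception of the same name."""


def check_exists(path: str, exists: bool) -> None:
    """Mimic the listing's try/except shape; `exists` replaces os.path.exists."""
    try:
        if not exists:
            raise DocumentProcessingError(f"File not found: {path}")
    except Exception as e:
        # Same pattern as the listing: the inner error is caught and re-wrapped.
        raise DocumentProcessingError(f"Document processing failed: {e}")


def error_message(path: str) -> str:
    """Return the message a caller would see for a missing file."""
    try:
        check_exists(path, exists=False)
    except DocumentProcessingError as e:
        return str(e)
    return ""


message = error_message("missing.pdf")
print(message)  # Document processing failed: File not found: missing.pdf
```

Callers that match on error text should therefore expect the nested form of the message.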

Parameters

Name	Type	Default	Kind
`file_path`	str	-	positional_or_keyword
`doc_type`	str	-	positional_or_keyword
`department`	str	-	positional_or_keyword
`version`	str	'1.0'	positional_or_keyword

Parameter Details

file_path: Absolute or relative path to the document file to be processed. Must point to an existing file with a .docx, .doc, or .pdf extension; the extension is matched case-insensitively. The function checks that the file exists before processing.

doc_type: Document type classification code used for categorizing the document within the system. This is a custom identifier that should match the organization's document taxonomy (e.g., 'INVOICE', 'CONTRACT', 'REPORT').

department: Department code identifying which organizational unit owns or is responsible for the document. Should align with the organization's department coding system (e.g., 'HR', 'FIN', 'ENG').

version: Version string recorded in the metadata. Defaults to '1.0'. The function stores the value as given and does not validate its format, so follow your organization's versioning convention (e.g., '1.0', '2.1', 'draft-v3').
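
When paths come from user uploads, it can help to reject unsupported extensions before calling process_document. The sketch below mirrors the extension check in the source code; precheck and SUPPORTED_EXTENSIONS are hypothetical names, not part of the CDocs module.

```python
import os

# Mirrors the branches in process_document; an assumption if the module changes.
SUPPORTED_EXTENSIONS = {".docx", ".doc", ".pdf"}


def precheck(file_path: str) -> str:
    """Return the lower-cased extension, or raise ValueError if unsupported."""
    _, ext = os.path.splitext(file_path)
    ext = ext.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {ext or '(none)'}")
    return ext


print(precheck("Quarterly_Report.PDF"))  # .pdf
```

The check looks only at the file name, exactly as the function does; it does not inspect the file contents.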

Return Value

Type: Dict[str, Any]

Returns a dictionary of document metadata. The function always sets the following keys:

fileName: Basename of the file.
fileSize: Size in bytes.
filePath: The path as passed in.
fileType: Lower-cased extension without the leading dot.
fileHash: SHA-256 hex digest of the file contents.
docType: The provided type code.
department: The provided department code.
version: The provided version string.
processedDate: datetime object holding the processing time.

The dictionary also carries whatever format-specific metadata extract_metadata_docx() or extract_metadata_pdf() returns (which may include author, title, creation date, etc.). Because the keys above are applied last through metadata.update(), they overwrite any helper-supplied key with the same name.
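
Because processedDate is a datetime object, the returned dictionary cannot be passed to json.dumps directly. One possible approach, sketched here with a hypothetical to_json helper and a hand-built sample dictionary rather than real function output, is to convert datetime values to ISO 8601 strings.

```python
import json
from datetime import datetime
from typing import Any, Dict


def to_json(metadata: Dict[str, Any]) -> str:
    """Serialize a metadata dict, converting datetime values to ISO 8601."""

    def encode(value: Any) -> str:
        if isinstance(value, datetime):
            return value.isoformat()
        raise TypeError(f"Cannot serialize {type(value).__name__}")

    return json.dumps(metadata, default=encode, sort_keys=True)


# Hand-built sample shaped like the documented return value (not real output).
sample = {
    "fileName": "report.pdf",
    "fileSize": 1024,
    "processedDate": datetime(2024, 1, 15, 9, 30, 0),
}
encoded = to_json(sample)
print(encoded)
```

Format-specific metadata from the helper functions may contain other non-JSON types; the encode callback raises TypeError for those instead of guessing a representation.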

Dependencies

os
hashlib
datetime
typing
logging
docx
PyPDF2
CDocs

Required Imports

import os
import hashlib
from datetime import datetime
from typing import Dict, Any
import logging

Conditional/Optional Imports

These imports are only needed under specific conditions:

from docx import Document as DocxDocument

Condition: Required when processing .docx or .doc files - used by extract_metadata_docx() helper function

Required (conditional)

import PyPDF2

Condition: Required when processing .pdf files - used by extract_metadata_pdf() helper function

Required (conditional)

from CDocs.models.document import DocumentVersion

Condition: May be required by the CDocs application context for database operations

Optional

from CDocs.config import settings

Condition: May be required for application-specific configuration settings

Optional
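
Since the third-party imports are only needed for particular file types, an application may want to check at startup which ones are installed. The following sketch uses the standard library's importlib.util.find_spec; REQUIRED_MODULE and missing_modules are hypothetical names, and the mapping is taken from the conditions listed above.

```python
import importlib.util
from typing import Dict, Optional

# Importable module name needed per extension, per the conditions above.
REQUIRED_MODULE = {".docx": "docx", ".doc": "docx", ".pdf": "PyPDF2"}


def missing_modules(required: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Map each extension to its required module when that module is not installed."""
    required = REQUIRED_MODULE if required is None else required
    return {
        ext: module
        for ext, module in required.items()
        if importlib.util.find_spec(module) is None
    }


# An empty dict means every listed file type can be processed.
print(missing_modules())
```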

Usage Example

import os
import hashlib
from datetime import datetime
from typing import Dict, Any
import logging

# Setup logger
logger = logging.getLogger(__name__)

# Define custom exception
class DocumentProcessingError(Exception):
    pass

# Define helper functions (simplified examples)
def extract_metadata_docx(file_path):
    from docx import Document
    doc = Document(file_path)
    return {'title': doc.core_properties.title, 'author': doc.core_properties.author}

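def extract_metadata_pdf(file_path):
    import PyPDF2
    with open(file_path, 'rb') as f:
        reader = PyPDF2.PdfReader(f)
        # reader.metadata is None when the PDF carries no document info
        meta = reader.metadata or {}
        return {'pages': len(reader.pages), 'title': meta.get('/Title', '')}

# Use the function (process_document itself must also be in scope, e.g. pasted
# from the Source Code section after the helpers above)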
try:
    metadata = process_document(
        file_path='/path/to/document.pdf',
        doc_type='REPORT',
        department='FINANCE',
        version='2.1'
    )
    
    print(f"Processed: {metadata['fileName']}")
    print(f"File Hash: {metadata['fileHash']}")
    print(f"Size: {metadata['fileSize']} bytes")
    print(f"Processed on: {metadata['processedDate']}")
    
except DocumentProcessingError as e:
    print(f"Error: {e}")

Best Practices

Always handle DocumentProcessingError when calling this function so that processing failures are managed gracefully.
Ensure file_path points to a file the process can read; a permissions failure is also reported as DocumentProcessingError.
Use consistent doc_type and department codes across your application to maintain data integrity.
The function reads the entire file into memory for hashing; be cautious with very large files (>1GB).
The returned processedDate is a datetime object created with datetime.now(), so it is a naive local time; serialize it appropriately before storing it in JSON or a database.
Helper functions extract_metadata_docx() and extract_metadata_pdf() must be implemented and available in scope.
Consider implementing file size limits to prevent memory issues with extremely large documents.
The file hash (SHA-256) can be used for deduplication and integrity verification.
Supported file types are limited to .docx, .doc, and .pdf; validate input files before calling if you accept user uploads.
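
The memory and file-size points above can be addressed by hashing the file in chunks. This is a minimal sketch of an alternative to the f.read() call in the source code, not what the function currently does; sha256_of_file, chunk_size, and max_bytes are hypothetical names.

```python
import hashlib
import os
import tempfile
from typing import Optional


def sha256_of_file(file_path: str, chunk_size: int = 1024 * 1024,
                   max_bytes: Optional[int] = None) -> str:
    """Hash a file in fixed-size chunks, optionally refusing oversized files."""
    if max_bytes is not None and os.path.getsize(file_path) > max_bytes:
        raise ValueError(f"File exceeds {max_bytes} bytes: {file_path}")
    digest = hashlib.sha256()
    with open(file_path, "rb") as f:
        # Read until f.read() returns an empty bytes object (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration: the chunked digest matches hashing the whole content at once.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.pdf")
    data = b"example bytes" * 1000
    with open(path, "wb") as f:
        f.write(data)
    streamed = sha256_of_file(path, chunk_size=4096)
    whole = hashlib.sha256(data).hexdigest()

print(streamed == whole)  # True
```

Peak memory stays near chunk_size regardless of file size, and the resulting hex digest is identical to the one the function stores in fileHash.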

Similar Components

AI-powered semantic similarity - components with related functionality:

function extract_metadata_docx 67.9% similar

Extracts comprehensive metadata from Microsoft Word DOCX files, including document properties, statistics, and fallback title extraction from content or filename.
From: /tf/active/vicechatdev/CDocs/utils/document_processor.py

function extract_metadata 66.6% similar

Extracts metadata from file content by analyzing the file type and computing file properties including hash, size, and type-specific metadata.
From: /tf/active/vicechatdev/CDocs/utils/document_processor.py

function extract_metadata_pdf 60.6% similar

Extracts metadata from PDF files including title, author, creation date, page count, and other document properties using PyPDF2 library.
From: /tf/active/vicechatdev/CDocs/utils/document_processor.py

function extract_document_sections 60.0% similar

Extracts structured sections from a DOCX document by parsing headings and organizing content under each heading into a dictionary.
From: /tf/active/vicechatdev/CDocs/utils/document_processor.py

function import_document_from_filecloud 58.3% similar

Imports a document from FileCloud into the system by extracting metadata, creating a controlled document record, downloading the file content, creating a document version, and uploading it back to FileCloud with proper folder structure.
From: /tf/active/vicechatdev/CDocs/FC_sync.py
