function process_document
Processes a document file (DOCX, DOC, or PDF) and extracts comprehensive metadata including file information, content metadata, and cryptographic hash.
File: /tf/active/vicechatdev/CDocs/utils/document_processor.py
Lines: 41 - 101
Complexity: moderate
Purpose
This function serves as a document ingestion pipeline component that validates file existence, determines file type, extracts format-specific metadata, calculates SHA-256 hash for integrity verification, and consolidates all information into a structured metadata dictionary. It's designed for document management systems that need to catalog and track documents with versioning and departmental organization.
Source Code
def process_document(file_path: str, doc_type: str, department: str,
                     version: str = "1.0") -> Dict[str, Any]:
    """
    Process a document file and extract relevant metadata.

    Args:
        file_path: Path to document file
        doc_type: Document type code
        department: Department code
        version: Version string

    Returns:
        Dictionary with extracted metadata

    Raises:
        DocumentProcessingError: If document processing fails
    """
    try:
        # Check file existence
        if not os.path.exists(file_path):
            raise DocumentProcessingError(f"File not found: {file_path}")

        # Get file extension
        _, ext = os.path.splitext(file_path)
        ext = ext.lower()

        # Extract metadata based on file type
        metadata = {}
        if ext in ['.docx', '.doc']:
            metadata = extract_metadata_docx(file_path)
        elif ext == '.pdf':
            metadata = extract_metadata_pdf(file_path)
        else:
            raise DocumentProcessingError(f"Unsupported file type: {ext}")

        # Calculate file hash
        with open(file_path, 'rb') as f:
            file_content = f.read()
            file_hash = hashlib.sha256(file_content).hexdigest()

        # Add file metadata
        file_size = os.path.getsize(file_path)
        file_info = {
            'fileName': os.path.basename(file_path),
            'fileSize': file_size,
            'filePath': file_path,
            'fileType': ext[1:],  # Remove leading dot
            'fileHash': file_hash,
            'docType': doc_type,
            'department': department,
            'version': version,
            'processedDate': datetime.now()
        }
        metadata.update(file_info)
        return metadata
    except Exception as e:
        logger.error(f"Error processing document: {e}")
        raise DocumentProcessingError(f"Document processing failed: {e}")
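One consequence of the broad `except Exception` handler is worth seeing in isolation: a DocumentProcessingError raised inside the `try` block (file not found, unsupported type) is itself caught and wrapped a second time, so callers receive a nested message. The sketch below is a self-contained illustration of that control flow, not the module's code. `wrap_like_source`, `wrap_with_chaining`, and `message_of` are hypothetical names, and the second function shows one possible alternative that passes the function's own error through unchanged and chains other exceptions with `raise ... from`.

```python
class DocumentProcessingError(Exception):
    """Stand-in for the module's exception (assumed to subclass Exception)."""


def wrap_like_source(path: str) -> None:
    # Mirrors the source's control flow: raise inside try, broad except rewraps.
    try:
        raise DocumentProcessingError(f"File not found: {path}")
    except Exception as e:
        raise DocumentProcessingError(f"Document processing failed: {e}")


def wrap_with_chaining(path: str) -> None:
    # Hypothetical alternative: re-raise our own error as-is, chain anything else.
    try:
        raise DocumentProcessingError(f"File not found: {path}")
    except DocumentProcessingError:
        raise
    except Exception as e:
        raise DocumentProcessingError(f"Document processing failed: {e}") from e


def message_of(func, path: str) -> str:
    """Call func(path) and return the message of the error it raises."""
    try:
        func(path)
    except DocumentProcessingError as e:
        return str(e)
    return ""


nested = message_of(wrap_like_source, "missing.pdf")
direct = message_of(wrap_with_chaining, "missing.pdf")
print(nested)
print(direct)
```

With the source's pattern, callers that match on the error text should expect the "Document processing failed: " prefix even for the function's own validation errors.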
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| file_path | str | - | positional_or_keyword |
| doc_type | str | - | positional_or_keyword |
| department | str | - | positional_or_keyword |
| version | str | '1.0' | positional_or_keyword |
Parameter Details
file_path: Absolute or relative path to the document file to be processed. Must point to an existing file with .docx, .doc, or .pdf extension. The function will validate file existence before processing.
doc_type: Document type classification code used for categorizing the document within the system. This is a custom identifier that should match the organization's document taxonomy (e.g., 'INVOICE', 'CONTRACT', 'REPORT').
department: Department code identifying which organizational unit owns or is responsible for the document. Should align with the organization's department coding system (e.g., 'HR', 'FIN', 'ENG').
version: Version string recorded as-is in the metadata; the function does not validate its format. Defaults to '1.0'. Follow your organization's versioning convention (e.g., '1.0', '2.1', 'draft-v3').
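The constraints on file_path can be checked before the function is called, which is useful when paths come from user uploads. The following is a minimal sketch under stated assumptions: `precheck`, `SUPPORTED_EXTENSIONS`, and the 100 MiB `MAX_BYTES` cap are hypothetical and do not appear in the source. It is also slightly stricter than the source, which uses `os.path.exists` and would therefore accept a directory at this stage.

```python
import os
from typing import List

# The three extensions the function accepts (compared in lowercase).
SUPPORTED_EXTENSIONS = {".docx", ".doc", ".pdf"}
# Hypothetical size cap; the source imposes no limit.
MAX_BYTES = 100 * 1024 * 1024


def precheck(file_path: str, max_bytes: int = MAX_BYTES) -> List[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = os.path.splitext(file_path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    # isfile() rejects directories, unlike the exists() check in the source.
    if not os.path.isfile(file_path):
        problems.append("not an existing regular file")
    elif os.path.getsize(file_path) > max_bytes:
        problems.append("file exceeds size limit")
    return problems
```

Returning a list of problems rather than raising lets an upload handler report every issue at once before deciding whether to call process_document.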
Return Value
Type: Dict[str, Any]
Returns a dictionary of document metadata that combines two sources. The file-level keys set by this function are:
- 'fileName': basename of the file
- 'fileSize': size in bytes
- 'filePath': the path as passed in
- 'fileType': lowercased extension without the leading dot
- 'fileHash': SHA-256 hex digest of the file contents
- 'docType': the provided type code
- 'department': the provided department code
- 'version': the provided version string
- 'processedDate': datetime object recording the processing time
The dictionary also contains the format-specific metadata returned by extract_metadata_docx() or extract_metadata_pdf() (which may include author, title, creation date, etc.). Because the file-level keys are applied last via metadata.update(), they overwrite any format-specific key with the same name.
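Since 'processedDate' is a datetime object, the returned dictionary cannot be passed to `json.dumps` directly. A minimal sketch of one way to serialize it follows; `to_json` is a hypothetical helper, and the sample dictionary holds made-up illustrative values for a subset of the documented keys rather than real output.

```python
import json
from datetime import datetime
from typing import Any, Dict


def to_json(metadata: Dict[str, Any]) -> str:
    """Serialize a metadata dict, converting datetime values to ISO 8601 strings."""
    def encode(value: Any) -> str:
        # json.dumps calls this only for values it cannot serialize natively.
        if isinstance(value, datetime):
            return value.isoformat()
        raise TypeError(f"Cannot serialize {type(value).__name__}")

    return json.dumps(metadata, default=encode, sort_keys=True)


# Illustrative values only, not output from a real run.
sample = {
    "fileName": "report.pdf",
    "fileType": "pdf",
    "version": "2.1",
    "processedDate": datetime(2024, 1, 15, 9, 30, 0),
}
encoded = to_json(sample)
print(encoded)
```

The `default` hook also covers any datetime values that the format-specific extractors might return, such as a creation date.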
Dependencies
- os
- hashlib
- datetime
- typing
- logging
- docx
- PyPDF2
- CDocs
Required Imports
import os
import hashlib
from datetime import datetime
from typing import Dict, Any
import logging
Conditional/Optional Imports
These imports are only needed under specific conditions:
from docx import Document as DocxDocument
Condition: Required when processing .docx or .doc files; used by the extract_metadata_docx() helper function. Note that python-docx reads only the .docx format, so legacy .doc files may fail inside the helper unless it converts them first.
Required (conditional)
import PyPDF2
Condition: Required when processing .pdf files; used by the extract_metadata_pdf() helper function.
Required (conditional)
from CDocs.models.document import DocumentVersion
Condition: May be required by the CDocs application context for database operations.
Optional
from CDocs.config import settings
Condition: May be required for application-specific configuration settings.
Optional
Usage Example
import os
import hashlib
from datetime import datetime
from typing import Dict, Any
import logging

# Setup logger
logger = logging.getLogger(__name__)

# Define custom exception
class DocumentProcessingError(Exception):
    pass

# Define helper functions (simplified examples)
def extract_metadata_docx(file_path):
    from docx import Document
    doc = Document(file_path)
    return {'title': doc.core_properties.title, 'author': doc.core_properties.author}

def extract_metadata_pdf(file_path):
    import PyPDF2
    with open(file_path, 'rb') as f:
        reader = PyPDF2.PdfReader(f)
        # reader.metadata is None when the PDF has no document info
        info = reader.metadata or {}
        return {'pages': len(reader.pages), 'title': info.get('/Title', '')}

# Use the function (process_document is assumed to be defined or imported
# in the same scope as the helpers above)
try:
    metadata = process_document(
        file_path='/path/to/document.pdf',
        doc_type='REPORT',
        department='FINANCE',
        version='2.1'
    )
    print(f"Processed: {metadata['fileName']}")
    print(f"File Hash: {metadata['fileHash']}")
    print(f"Size: {metadata['fileSize']} bytes")
    print(f"Processed on: {metadata['processedDate']}")
except DocumentProcessingError as e:
    print(f"Error: {e}")
Best Practices
- Always handle DocumentProcessingError exceptions when calling this function to gracefully manage processing failures
- Ensure file_path points to an accessible file with appropriate read permissions before calling
- Use consistent doc_type and department codes across your application to maintain data integrity
- The function reads the entire file into memory for hashing - be cautious with very large files (>1GB)
- The returned processedDate is a naive, local-time datetime object (from datetime.now()); convert it to a string such as ISO 8601 before storing it in JSON or in databases without a native datetime type
- Helper functions extract_metadata_docx() and extract_metadata_pdf() must be implemented and available in scope
- Consider implementing file size limits to prevent memory issues with extremely large documents
- The file hash (SHA-256) can be used for deduplication and integrity verification
- Supported file types are limited to .docx, .doc, and .pdf - validate input files before calling if accepting user uploads
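The memory concern in the list above comes from hashing the result of a single `f.read()`. A minimal sketch of a chunked alternative that produces the same SHA-256 digest while holding only one chunk in memory at a time is shown here; `sha256_of_file` and the 1 MiB default chunk size are assumptions for illustration, not part of the source.

```python
import hashlib


def sha256_of_file(file_path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the SHA-256 hex digest by reading the file in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"" at EOF.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because the digest is updated incrementally, the result is identical to hashing the whole file at once, so values stored by the existing function remain comparable.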
Similar Components
AI-powered semantic similarity - components with related functionality:
- function extract_metadata_docx 67.9% similar
- function extract_metadata 66.6% similar
- function extract_metadata_pdf 60.6% similar
- function extract_document_sections 60.0% similar
- function import_document_from_filecloud 58.3% similar