
function main_v48

Maturity: 40

Entry point function that demonstrates document processing workflow by creating an audited, watermarked, and protected PDF/A document from a DOCX file with audit trail data.

File: /tf/active/vicechatdev/document_auditor/main.py
Lines: 23 - 114
Complexity: moderate

Purpose

This function is a demonstration and testing entry point for the document processing system. It creates a 'signatures' directory next to the script if one does not exist, then checks the input files: the DOCX document and the JSON audit data are required, while the watermark image is optional. It passes them to DocumentProcessor.process_document to produce a PDF/A-2b output with a watermark and signatures, and finally logs three checks on the result: hash verification, PDF/A validation, and protection status. The protection status is inferred from attributes on the processor (document_protector and _last_owner_password), not read back from the output PDF.
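
The input checks described above treat some files as required and others as optional. A minimal sketch of that pattern follows; the helper name check_inputs and its dictionary arguments are hypothetical and not part of main.py, which performs these checks inline.

```python
import logging
import os

logger = logging.getLogger(__name__)


def check_inputs(required, optional):
    """Check labelled file paths and return (ok, resolved_optional).

    Hypothetical helper, not part of main.py. `required` and `optional`
    map a label to a file path. A missing required file makes `ok` False;
    a missing optional file is logged and replaced by None.
    """
    ok = True
    for label, path in required.items():
        if not os.path.exists(path):
            logger.error("%s not found: %s", label, path)
            ok = False

    resolved = {}
    for label, path in optional.items():
        if os.path.exists(path):
            resolved[label] = path
        else:
            logger.warning("%s not found: %s", label, path)
            resolved[label] = None
    return ok, resolved
```

Unlike the inline checks in main(), this version reports every missing required file instead of stopping at the first one.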

Source Code

def main():
    # Create sample directory structure if it doesn't exist
    signatures_dir = os.path.join(os.path.dirname(__file__), 'signatures')
    if not os.path.exists(signatures_dir):
        os.makedirs(signatures_dir)
        logger.info(f"Created signatures directory: {signatures_dir}")
    
    # Sample document and audit data
    sample_doc = os.path.join(os.path.dirname(__file__), './examples/test_document_original.docx')
    sample_json = os.path.join(os.path.dirname(__file__), './examples/sample_audit_data.json')
    output_pdf = os.path.join(os.path.dirname(__file__), './examples/audited_document.pdf')
    watermark_path = os.path.join(os.path.dirname(__file__), './examples/ViceBio_Logo_dark blue.png')
    
    # Check if files exist
    if not os.path.exists(sample_doc):
        logger.error(f"Sample document not found: {sample_doc}")
        return
    
    if not os.path.exists(sample_json):
        logger.error(f"Audit data JSON not found: {sample_json}")
        return
    
    if not os.path.exists(watermark_path):
        logger.warning(f"Watermark image not found: {watermark_path}")
        watermark_path = None
    
    # Initialize document processor
    processor = DocumentProcessor()
    
    # Process document
    try:
        output_path = processor.process_document(
            original_doc_path=sample_doc,
            json_path=sample_json,
            output_path=output_pdf,
            watermark_image=watermark_path,
            include_signatures=True,
            convert_to_pdfa=True,
            compliance_level='2b',
            finalize=True  # Lock the document against further editing
        )
        
        logger.info(f"Successfully created audited document: {output_path}")
        
        # Verify document hash using processor's stored hash if available
        if hasattr(processor, '_last_document_hash'):
            logger.info("Using stored document hash for verification")
            stored_hash = processor._last_document_hash
            extracted_hash = None
            
            try:
                with pikepdf.open(output_path) as pdf:
                    if "/DocumentHash" in pdf.docinfo:
                        hash_json = pdf.docinfo["/DocumentHash"]
                        hash_metadata = json.loads(str(hash_json))
                        extracted_hash = hash_metadata.get("hash")
            except Exception as e:
                logger.warning(f"Could not extract hash from PDF metadata: {e}")
            
            hash_verified = stored_hash == extracted_hash
            if hash_verified:
                logger.info(f"Document hash verification: Passed ✅")
            else:
                logger.warning(f"Document hash verification: Failed ❌")
        else:
            # Fall back to standard verification
            hash_verified = processor.hash_generator.verify_hash(output_path)
            if hash_verified:
                logger.info(f"Document hash verification: Passed ✅")
            else:
                logger.warning(f"Document hash verification: Failed ❌")
        
        # Verify PDF/A compliance
        pdfa_compliant = processor.pdfa_converter.validate_pdfa(output_path)
        if pdfa_compliant:
            logger.info(f"PDF/A compliance check: Passed ✅")
        else:
            logger.warning(f"PDF/A compliance check: Failed ❌")
        
        # Check if document is protected
        is_protected = hasattr(processor, 'document_protector') and hasattr(processor, '_last_owner_password')
        if is_protected:
            logger.info("🔒 Document is protected from editing")
            logger.info(f"Owner password: {getattr(processor, '_last_owner_password', 'Not available')}")
            logger.info("Keep this password in a secure location for administrative access")
        else:
            logger.info("⚠️ Document is not protected from editing")
            
        logger.info(f"Document processing complete. Output file: {output_path}")
        
    except Exception as e:
        logger.error(f"Error processing document: {e}", exc_info=True)
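
In the stored-hash branch above, the check is a plain equality between stored_hash and extracted_hash. If the stored hash happened to be None and extraction also failed, the two values would compare equal and the check would log a pass. The sketch below shows one possible stricter comparison; the helper name hashes_match is hypothetical, and it assumes only that the metadata is a JSON object with a "hash" key, as the source code reads it.

```python
import json


def hashes_match(stored_hash, hash_json):
    """Return True only if both values are present and the hashes agree.

    Hypothetical helper, not part of main.py. `hash_json` is the raw
    metadata value (for example the /DocumentHash entry), expected to be
    a JSON object containing a "hash" key.
    """
    if not stored_hash or hash_json is None:
        return False
    try:
        metadata = json.loads(str(hash_json))
    except (TypeError, ValueError):
        # json.JSONDecodeError is a subclass of ValueError
        return False
    if not isinstance(metadata, dict):
        return False
    return metadata.get("hash") == stored_hash
```

With this guard, a missing hash on either side counts as a failed verification rather than a match.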

Return Value

This function returns None in every case. It works through side effects: creating the signatures directory, writing the output PDF, and logging results. It returns early if the DOCX document or the JSON audit data is missing, and it catches and logs any exception raised during processing, so the caller cannot tell success from failure by the return value.
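
Because main() returns None on every path, a script that needs an exit status has to obtain it some other way. The following is a sketch of one option, assuming the processing step is refactored into a callable that returns the output path, returns None when inputs are missing, or raises on failure. The name run_with_status and the exit codes are illustrative assumptions, not part of main.py.

```python
import logging

logger = logging.getLogger(__name__)


def run_with_status(step):
    """Call `step()` and translate the outcome into a process exit code.

    Hypothetical wrapper. `step` is any zero-argument callable that
    returns the output path on success, returns None when required
    inputs are missing, or raises an exception on failure.
    """
    try:
        output_path = step()
    except Exception:
        # logger.exception records the traceback, like exc_info=True
        logger.exception("Error processing document")
        return 1
    if output_path is None:
        return 2
    logger.info("Output file: %s", output_path)
    return 0
```

A script could then end with sys.exit(run_with_status(step)) so that shell callers and schedulers see a non-zero status on failure.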

Dependencies

  • os
  • logging
  • json
  • sys
  • pikepdf

Required Imports

import os
import logging
import json
import sys
import pikepdf
from src.document_processor import DocumentProcessor

Usage Example

# Ensure required files exist in examples directory:
# - examples/test_document_original.docx
# - examples/sample_audit_data.json
# - examples/ViceBio_Logo_dark blue.png (optional)

import os
import logging
import json
import sys
import pikepdf
from src.document_processor import DocumentProcessor

# Configure logger
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

# Run the main function
# (main() must be defined in, or imported into, this module; see Source Code)
if __name__ == '__main__':
    main()

# Output will be created at: ./examples/audited_document.pdf
# The function will log verification results for hash, PDF/A compliance, and protection status

Best Practices

  • Ensure all required input files exist before calling this function
  • Configure logging before calling main() to capture all log messages
  • The function creates a 'signatures' directory in the script's directory - ensure write permissions
  • The function logs the owner password in plain text at INFO level - store it in a secure location and restrict access to the log output
  • The function expects specific file paths relative to __file__ - adjust paths if running from different locations
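  • The function catches and logs all processing exceptions itself - if using it as part of a larger application, adapt it to re-raise or return a status so the caller can detect failures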
  • The watermark image is optional - the function will continue without it if not found
  • Review logged verification results (hash, PDF/A compliance, protection status) to ensure document integrity
  • The finalize=True parameter locks the document - ensure this is desired behavior
  • The function uses compliance_level='2b' for PDF/A-2b standard - adjust if different compliance is needed
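
The point above about paths relative to __file__ can be illustrated with a small sketch that builds the same example paths under a configurable base directory. The helper name example_paths is hypothetical; main() as shown hard-codes these paths relative to os.path.dirname(__file__).

```python
from pathlib import Path


def example_paths(base_dir):
    """Build the example file paths used by main() under `base_dir`.

    Hypothetical helper, not part of main.py. The file names are the
    ones that appear in the source code above.
    """
    examples = Path(base_dir) / "examples"
    return {
        "sample_doc": examples / "test_document_original.docx",
        "sample_json": examples / "sample_audit_data.json",
        "output_pdf": examples / "audited_document.pdf",
        "watermark": examples / "ViceBio_Logo_dark blue.png",
    }
```

Passing Path(__file__).parent as base_dir would reproduce the locations main() uses, while any other directory lets the same layout be used elsewhere.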

Similar Components

AI-powered semantic similarity - components with related functionality:

  • class DocumentProcessor 72.7% similar

    A comprehensive document processing class that converts documents to PDF, adds audit trails, applies security features (watermarks, signatures, hashing), and optionally converts to PDF/A format with document protection.

    From: /tf/active/vicechatdev/document_auditor/src/document_processor.py
  • function main_v18 69.3% similar

    Main entry point function that reads a markdown file, converts it to an enhanced Word document with preserved heading structure, and saves it with a timestamped filename.

    From: /tf/active/vicechatdev/improved_word_converter.py
  • function test_document_processing 68.8% similar

    A test function that validates document processing functionality by creating a test PDF file, processing it through a DocumentProcessor, and verifying the extraction results or error handling.

    From: /tf/active/vicechatdev/contract_validity_analyzer/test_implementation.py
  • function main_v1 66.4% similar

    Main orchestration function that reads an improved markdown file and converts it to an enhanced Word document with comprehensive formatting, including table of contents, warranty sections, disclosures, and bibliography.

    From: /tf/active/vicechatdev/enhanced_word_converter_fixed.py
  • function test_document_processor 65.3% similar

    A test function that validates the DocumentProcessor component's ability to extract text from PDF files with improved error handling and llmsherpa integration.

    From: /tf/active/vicechatdev/contract_validity_analyzer/test_improved_processor.py