
class DocumentMerger

Maturity: 46

A class that merges PDF documents with audit trail pages, combining an original PDF with an audit page and updating metadata to reflect the audit process.

File: /tf/active/vicechatdev/document_auditor/src/document_merger.py
Lines: 5 - 72
Complexity: moderate

Purpose

DocumentMerger combines a PDF document with its audit trail pages. It checks that both input files exist, appends the audit PDF to the original using PyMuPDF (fitz), overwrites the producer, creator, subject, and modDate metadata fields to reflect the audit, and saves the result to output_path. Merge failures are logged and re-raised; metadata failures are logged and ignored. This class is typically used in document auditing systems where audit information needs to be permanently attached to original documents.

Source Code

class DocumentMerger:
    """Merges documents with audit trails"""
    
    def __init__(self):
        self.logger = logging.getLogger(__name__)
    
    def merge_pdfs(self, original_pdf_path, audit_page_path, output_path):
        """
        Merge the original document with the audit page
        
        Args:
            original_pdf_path (str): Path to the original PDF document
            audit_page_path (str): Path to the audit page PDF
            output_path (str): Path where merged PDF will be saved
            
        Returns:
            str: Path to the merged document
        """
        if not os.path.exists(original_pdf_path):
            raise FileNotFoundError(f"Original PDF not found: {original_pdf_path}")
        
        if not os.path.exists(audit_page_path):
            raise FileNotFoundError(f"Audit page PDF not found: {audit_page_path}")
        
        try:
            # Open both PDFs
            original_pdf = fitz.open(original_pdf_path)
            audit_pdf = fitz.open(audit_page_path)
            
            # Append audit page to original document
            original_pdf.insert_pdf(audit_pdf)
            
            # Update metadata - only use standard metadata keys
            self._update_metadata(original_pdf)
            
            # Save the combined document
            original_pdf.save(output_path)
            original_pdf.close()
            audit_pdf.close()
            
            self.logger.info(f"Successfully merged PDFs to: {output_path}")
            return output_path
        
        except Exception as e:
            self.logger.error(f"Error merging PDFs: {e}")
            raise
    
    def _update_metadata(self, pdf_document):
        """Update the PDF metadata to reflect the audit"""
        try:
            # Get existing metadata
            metadata = pdf_document.metadata
            
            # Only update standard metadata fields
            # PyMuPDF only accepts these standard PDF metadata fields
            metadata["producer"] = "Document Auditor System"
            metadata["modDate"] = fitz.get_pdf_now()
            metadata["creator"] = "Document Auditor"
            # Add a note in the subject field
            metadata["subject"] = "Document with audit trail"
            
            # Update the document with only standard fields
            pdf_document.set_metadata(metadata)
            
        except Exception as e:
            self.logger.error(f"Error updating metadata: {e}")
            # Continue without metadata if it fails
            pass

Parameters

Name Type Default Kind
bases - -

Parameter Details

__init__: The constructor takes no parameters and initializes the class with a logger instance for tracking operations and errors.

Return Value

Instantiation returns a DocumentMerger object. The main method merge_pdfs returns a string containing the path to the successfully merged PDF document. It raises FileNotFoundError if either input file is missing; any exception raised during the merge itself is logged and re-raised rather than converted into an error return value.

Class Interface

Methods

__init__(self)

Purpose: Initialize the DocumentMerger instance with a logger

Returns: None - initializes the instance

merge_pdfs(self, original_pdf_path: str, audit_page_path: str, output_path: str) -> str

Purpose: Merge an original PDF document with an audit page PDF and save the result

Parameters:

  • original_pdf_path: String path to the original PDF document that needs an audit trail attached
  • audit_page_path: String path to the PDF containing the audit trail page to be appended
  • output_path: String path where the merged PDF document will be saved

Returns: String containing the path to the successfully merged PDF document (same as output_path parameter)

_update_metadata(self, pdf_document: fitz.Document) -> None

Purpose: Update the PDF metadata to reflect that an audit trail has been added

Parameters:

  • pdf_document: A fitz.Document object representing the PDF whose metadata should be updated

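Returns: None - modifies the pdf_document object in place by overwriting its producer, creator, subject, and modDate fields. Any exception raised during the update is logged and suppressed, so the caller is not notified of a metadata failure.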
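
The field updates performed by _update_metadata can be illustrated without opening a PDF. The following is a minimal sketch rather than the class's own code: build_audit_metadata and STANDARD_KEYS are hypothetical names, the key set is an assumption based on the fields PyMuPDF documents for Document.metadata, and the timestamp is formatted by hand as a UTC PDF date string instead of calling fitz.get_pdf_now().

```python
from datetime import datetime, timezone

# Assumption: these mirror the metadata keys PyMuPDF documents for
# Document.metadata / Document.set_metadata.
STANDARD_KEYS = {
    "format", "title", "author", "subject", "keywords", "creator",
    "producer", "creationDate", "modDate", "trapped", "encryption",
}


def build_audit_metadata(existing, now=None):
    """Hypothetical helper: return a copy of `existing` with the audit fields set."""
    now = now or datetime.now(timezone.utc)
    # Keep only standard keys so the result is safe to pass to set_metadata.
    metadata = {k: v for k, v in (existing or {}).items() if k in STANDARD_KEYS}
    metadata.update({
        "producer": "Document Auditor System",
        "creator": "Document Auditor",
        "subject": "Document with audit trail",
        # PDF date string in UTC, e.g. D:20240102030405Z
        "modDate": now.astimezone(timezone.utc).strftime("D:%Y%m%d%H%M%S") + "Z",
    })
    return metadata
```

Note that, as in the source listing, the original producer, creator, and subject values are replaced rather than preserved; a caller that needs them should record them before the merge.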

Attributes

Name | Type | Description | Scope
logger | logging.Logger | Logger instance used for recording informational messages, warnings, and errors during PDF merge operations | instance
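
Because the logger is created with logging.getLogger(__name__), its name follows the module's import path. The sketch below shows one way to capture that logger's output, for example in a test. It assumes the module is imported as document_merger; LOGGER_NAME and attach_capture_handler are hypothetical names introduced for illustration.

```python
import io
import logging

# Assumption: the module is imported as "document_merger", so
# logging.getLogger(__name__) inside it resolves to this name.
LOGGER_NAME = "document_merger"


def attach_capture_handler(logger_name=LOGGER_NAME, level=logging.INFO):
    """Hypothetical helper: route a named logger's records into a StringIO buffer."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
    logger = logging.getLogger(logger_name)
    logger.setLevel(level)
    logger.addHandler(handler)
    # Return the handler too so the caller can remove it afterwards.
    return stream, handler
```

After a merge, stream.getvalue() holds the formatted records, and logging.getLogger(LOGGER_NAME).removeHandler(handler) detaches the capture again.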

Dependencies

  • logging
  • fitz
  • os

Required Imports

import logging
import fitz
import os

Usage Example

import logging

# Assumption: the src/ directory containing document_merger.py is on sys.path
from document_merger import DocumentMerger

# Configure logging
logging.basicConfig(level=logging.INFO)

# Instantiate the merger
merger = DocumentMerger()

# Define file paths
original_pdf = 'path/to/original_document.pdf'
audit_page = 'path/to/audit_trail.pdf'
output_pdf = 'path/to/merged_document.pdf'

# Merge the PDFs
try:
    result_path = merger.merge_pdfs(original_pdf, audit_page, output_pdf)
    print(f'Successfully merged document saved to: {result_path}')
except FileNotFoundError as e:
    print(f'File not found: {e}')
except Exception as e:
    print(f'Error during merge: {e}')
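
Before calling merge_pdfs, a caller may want to check the inputs and the output directory up front, since merge_pdfs itself only verifies that the two input files exist. A minimal sketch of such a check follows; preflight_check is a hypothetical helper, not part of DocumentMerger.

```python
import os


def preflight_check(original_pdf_path, audit_page_path, output_path):
    """Hypothetical helper: return a list of problems; an empty list means OK."""
    problems = []
    for label, path in (("original PDF", original_pdf_path),
                        ("audit page PDF", audit_page_path)):
        if not os.path.isfile(path):
            problems.append(f"{label} not found: {path}")

    # merge_pdfs does not create the output directory, so check it here.
    out_dir = os.path.dirname(os.path.abspath(output_path))
    if not os.path.isdir(out_dir):
        problems.append(f"output directory does not exist: {out_dir}")
    elif not os.access(out_dir, os.W_OK):
        problems.append(f"output directory is not writable: {out_dir}")
    return problems
```

Returning a list instead of raising on the first failure lets the caller report every problem in a single pass.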

Best Practices

  • Always ensure input PDF files exist before calling merge_pdfs to avoid FileNotFoundError
  • Wrap merge_pdfs calls in try-except blocks to handle potential exceptions during PDF processing
  • Configure logging before instantiating DocumentMerger to capture operational logs
  • Ensure the output directory exists and has write permissions before calling merge_pdfs
  • The class is stateless between method calls, so a single instance can be reused for multiple merge operations
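  • merge_pdfs closes both documents only after a successful save; if an exception is raised during the merge, the close() calls are skipped, so long-running callers may want to guard against leaked file handles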
  • Metadata updates are non-critical; the merge will succeed even if metadata update fails
  • All pages of the audit PDF are appended to the end of the original document, preserving original page order
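
In the source listing, the close() calls sit inside the try block, so they do not run when insert_pdf or save raises. One way to guarantee cleanup is to wrap each document in contextlib.closing. The following is a minimal sketch under stated assumptions, not the class's actual implementation: merge_with_cleanup and its open_pdf parameter are hypothetical, open_pdf defaults to fitz.open, and the metadata update is omitted for brevity.

```python
import contextlib
import os


def merge_with_cleanup(original_pdf_path, audit_page_path, output_path, open_pdf=None):
    """Hypothetical variant of merge_pdfs that always closes both documents."""
    if open_pdf is None:
        # Assumption: PyMuPDF is installed. It is imported lazily so the
        # sketch can also be exercised with a stand-in opener.
        import fitz
        open_pdf = fitz.open

    for label, path in (("Original PDF", original_pdf_path),
                        ("Audit page PDF", audit_page_path)):
        if not os.path.exists(path):
            raise FileNotFoundError(f"{label} not found: {path}")

    # closing() calls .close() on exit, whether or not an exception was raised.
    with contextlib.closing(open_pdf(original_pdf_path)) as original_pdf, \
         contextlib.closing(open_pdf(audit_page_path)) as audit_pdf:
        original_pdf.insert_pdf(audit_pdf)  # appends after the last page by default
        original_pdf.save(output_path)
    return output_path
```

The injectable opener is a design choice made only so the cleanup behaviour can be tested with fake documents; production code could call fitz.open directly.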

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function merge_pdfs_v1 69.8% similar

    Merges multiple PDF files into a single output PDF file with robust error handling and fallback mechanisms.

    From: /tf/active/vicechatdev/msg_to_eml.py
  • class DocumentProcessor 62.0% similar

    A comprehensive document processing class that converts documents to PDF, adds audit trails, applies security features (watermarks, signatures, hashing), and optionally converts to PDF/A format with document protection.

    From: /tf/active/vicechatdev/document_auditor/src/document_processor.py
  • class PDFManipulator 60.3% similar

    Manipulates existing PDF documents. This class provides methods to add watermarks, merge PDFs, extract pages, and perform other manipulation operations.

    From: /tf/active/vicechatdev/CDocs/utils/pdf_utils.py
  • class ControlledDocumentConverter 60.2% similar

    A comprehensive document converter class that transforms controlled documents into archived PDFs with signature pages, audit trails, hash-based integrity verification, and PDF/A compliance for long-term archival.

    From: /tf/active/vicechatdev/CDocs/utils/document_converter.py
  • class AuditPageGenerator 59.2% similar

    A class that generates comprehensive PDF audit trail pages for documents, including document information, reviews, approvals, revision history, and event history with electronic signatures.

    From: /tf/active/vicechatdev/document_auditor/src/audit_page_generator.py