DocumentMerger - Code Extractor

class DocumentMerger

Maturity: 46

A class that merges PDF documents with audit trail pages, combining an original PDF with an audit page and updating metadata to reflect the audit process.

File:
/tf/active/vicechatdev/document_auditor/src/document_merger.py

Lines:
5 - 72

Complexity:
moderate

Purpose

DocumentMerger combines a PDF document with its audit trail page. It checks that both input files exist, opens them with PyMuPDF (fitz), appends the audit page after the last page of the original, updates the document metadata to reflect the audit, and saves the result to the output path. Success is logged at INFO level; any exception raised during the merge is logged and re-raised. This class is typically used in document auditing systems where audit information needs to be permanently attached to original documents.

Source Code

class DocumentMerger:
    """Merges documents with audit trails"""
    
    def __init__(self):
        self.logger = logging.getLogger(__name__)
    
    def merge_pdfs(self, original_pdf_path, audit_page_path, output_path):
        """
        Merge the original document with the audit page
        
        Args:
            original_pdf_path (str): Path to the original PDF document
            audit_page_path (str): Path to the audit page PDF
            output_path (str): Path where merged PDF will be saved
            
        Returns:
            str: Path to the merged document
        """
        if not os.path.exists(original_pdf_path):
            raise FileNotFoundError(f"Original PDF not found: {original_pdf_path}")
        
        if not os.path.exists(audit_page_path):
            raise FileNotFoundError(f"Audit page PDF not found: {audit_page_path}")
        
        try:
            # Open both PDFs
            original_pdf = fitz.open(original_pdf_path)
            audit_pdf = fitz.open(audit_page_path)
            
            # Append audit page to original document
            original_pdf.insert_pdf(audit_pdf)
            
            # Update metadata - only use standard metadata keys
            self._update_metadata(original_pdf)
            
            # Save the combined document
            original_pdf.save(output_path)
            original_pdf.close()
            audit_pdf.close()
            
            self.logger.info(f"Successfully merged PDFs to: {output_path}")
            return output_path
        
        except Exception as e:
            self.logger.error(f"Error merging PDFs: {e}")
            raise
    
    def _update_metadata(self, pdf_document):
        """Update the PDF metadata to reflect the audit"""
        try:
            # Get existing metadata
            metadata = pdf_document.metadata
            
            # Only update standard metadata fields
            # PyMuPDF only accepts these standard PDF metadata fields
            metadata["producer"] = "Document Auditor System"
            metadata["modDate"] = fitz.get_pdf_now()
            metadata["creator"] = "Document Auditor"
            # Add a note in the subject field
            metadata["subject"] = "Document with audit trail"
            
            # Update the document with only standard fields
            pdf_document.set_metadata(metadata)
            
        except Exception as e:
            self.logger.error(f"Error updating metadata: {e}")
            # Continue without metadata if it fails
            pass

Parameters

Name	Type	Default	Kind
`bases`	-	-	-

Parameter Details

__init__: The constructor takes no parameters and initializes the class with a logger instance for tracking operations and errors.

Return Value

Instantiation returns a DocumentMerger object. The main method merge_pdfs returns a string containing the path to the successfully merged PDF document. If errors occur during merging, exceptions are raised rather than returning error values.

Class Interface

Methods

`__init__(self)`

Purpose: Initialize the DocumentMerger instance with a logger

Returns: None - initializes the instance

`merge_pdfs(self, original_pdf_path: str, audit_page_path: str, output_path: str) -> str`

Purpose: Merge an original PDF document with an audit page PDF and save the result

Parameters:

original_pdf_path: String path to the original PDF document that needs an audit trail attached
audit_page_path: String path to the PDF containing the audit trail page to be appended
output_path: String path where the merged PDF document will be saved

Returns: String containing the path to the successfully merged PDF document (same as output_path parameter)
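
The checks that merge_pdfs performs cover only the two input paths; the output location is not checked before saving. A minimal sketch of a pre-flight check a caller could run first is shown here. The function name preflight_check is hypothetical and not part of DocumentMerger; it reports problems as a list instead of raising.

```python
import os


def preflight_check(original_pdf_path, audit_page_path, output_path):
    """Return a list of problems that would make a merge fail; empty if none.

    Hypothetical helper, not part of DocumentMerger.
    """
    problems = []

    # Same existence test that merge_pdfs applies to its two inputs
    for label, path in (("Original PDF", original_pdf_path),
                        ("Audit page PDF", audit_page_path)):
        if not os.path.exists(path):
            problems.append(f"{label} not found: {path}")

    # merge_pdfs does not check the output location, so do it here
    output_dir = os.path.dirname(os.path.abspath(output_path))
    if not os.path.isdir(output_dir):
        problems.append(f"Output directory does not exist: {output_dir}")
    elif not os.access(output_dir, os.W_OK):
        problems.append(f"Output directory is not writable: {output_dir}")

    return problems
```

An empty list means the paths look usable; merge_pdfs can still fail later, for example if a file is not a valid PDF.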

`_update_metadata(self, pdf_document: fitz.Document) -> None`

Purpose: Update the PDF metadata to reflect that an audit trail has been added

Parameters:

pdf_document: A fitz.Document object representing the PDF whose metadata should be updated

Returns: None - modifies the pdf_document object in place by updating its metadata fields
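
The metadata update can be pictured as a pure dictionary transformation, which makes it easy to test without opening a PDF. The sketch below is an illustration, not the class's code: build_audit_metadata and STANDARD_KEYS are hypothetical names, and the key list is an assumption about the standard fields PyMuPDF accepts, so check it against the installed version.

```python
# Assumed subset of the standard PDF metadata keys, as named by PyMuPDF.
STANDARD_KEYS = (
    "title", "author", "subject", "keywords",
    "creator", "producer", "creationDate", "modDate",
)


def build_audit_metadata(existing, mod_date):
    """Return a new metadata dict with the audit fields applied.

    Hypothetical helper: `existing` is a dict such as Document.metadata,
    `mod_date` is a PDF date string such as the one fitz.get_pdf_now() returns.
    """
    # Copy only the standard keys, and treat a missing dict as empty
    metadata = {k: v for k, v in (existing or {}).items() if k in STANDARD_KEYS}

    # The same four values that _update_metadata writes
    metadata["producer"] = "Document Auditor System"
    metadata["modDate"] = mod_date
    metadata["creator"] = "Document Auditor"
    metadata["subject"] = "Document with audit trail"
    return metadata
```

The result could then be passed to pdf_document.set_metadata(...). Unlike _update_metadata, this sketch returns a new dict and leaves the input untouched.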

Attributes

Name	Type	Description	Scope
`logger`	logging.Logger	Logger instance, created with logging.getLogger(__name__), used to record a success message after each merge and error messages when the merge or the metadata update fails	instance

Dependencies

logging
fitz (PyMuPDF)
os

Required Imports

import logging
import fitz
import os

Usage Example

import logging

# Assumption: the directory containing document_merger.py is on sys.path
from document_merger import DocumentMerger

# Configure logging
logging.basicConfig(level=logging.INFO)

# Instantiate the merger
merger = DocumentMerger()

# Define file paths
original_pdf = 'path/to/original_document.pdf'
audit_page = 'path/to/audit_trail.pdf'
output_pdf = 'path/to/merged_document.pdf'

# Merge the PDFs
try:
    result_path = merger.merge_pdfs(original_pdf, audit_page, output_pdf)
    print(f'Successfully merged document saved to: {result_path}')
except FileNotFoundError as e:
    print(f'File not found: {e}')
except Exception as e:
    print(f'Error during merge: {e}')

Best Practices

Always ensure input PDF files exist before calling merge_pdfs to avoid FileNotFoundError
Wrap merge_pdfs calls in try-except blocks to handle potential exceptions during PDF processing
Configure logging before calling merge_pdfs to capture operational logs; the logger is obtained by module name, so configuration can happen before or after the instance is created
Ensure the output directory exists and has write permissions before calling merge_pdfs
The class is stateless between method calls, so a single instance can be reused for multiple merge operations
PDF files are closed only after a successful save; if an exception is raised during the merge, the two document handles are not explicitly closed before the exception propagates
Metadata updates are non-critical; the merge will succeed even if metadata update fails
The audit page is appended to the end of the original document, preserving original page order
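
In the source listed above, close() is called on both documents only after a successful save. One way to release the handles on failure as well is sketched below, assuming the opened objects expose insert_pdf, save and close as fitz documents do. The function name merge_and_always_close and the open_document parameter are hypothetical; in real use fitz.open would be passed as open_document.

```python
from contextlib import ExitStack, closing


def merge_and_always_close(open_document, original_pdf_path,
                           audit_page_path, output_path):
    """Append the audit PDF to the original and save, closing both on any exit.

    `open_document` is any callable returning an object with insert_pdf,
    save and close methods; passing fitz.open is the intended use.
    """
    with ExitStack() as stack:
        # closing() calls .close() when the block exits, even on an exception
        original = stack.enter_context(closing(open_document(original_pdf_path)))
        audit = stack.enter_context(closing(open_document(audit_page_path)))

        original.insert_pdf(audit)  # audit pages go after the last original page
        original.save(output_path)
    return output_path
```

Taking the opener as a parameter keeps the sketch testable with stand-in objects; exceptions still propagate to the caller, as they do in merge_pdfs.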

Similar Components

AI-powered semantic similarity - components with related functionality:

function merge_pdfs_v1 69.8% similar

Merges multiple PDF files into a single output PDF file with robust error handling and fallback mechanisms.
From: /tf/active/vicechatdev/msg_to_eml.py

class DocumentProcessor 62.0% similar

A comprehensive document processing class that converts documents to PDF, adds audit trails, applies security features (watermarks, signatures, hashing), and optionally converts to PDF/A format with document protection.
From: /tf/active/vicechatdev/document_auditor/src/document_processor.py

class PDFManipulator 60.3% similar

Manipulates existing PDF documents. This class provides methods to add watermarks, merge PDFs, extract pages, and perform other manipulation operations.
From: /tf/active/vicechatdev/CDocs/utils/pdf_utils.py

class ControlledDocumentConverter 60.2% similar

A comprehensive document converter class that transforms controlled documents into archived PDFs with signature pages, audit trails, hash-based integrity verification, and PDF/A compliance for long-term archival.
From: /tf/active/vicechatdev/CDocs/utils/document_converter.py

class AuditPageGenerator 59.2% similar

A class that generates comprehensive PDF audit trail pages for documents, including document information, reviews, approvals, revision history, and event history with electronic signatures.
From: /tf/active/vicechatdev/document_auditor/src/audit_page_generator.py
