class DocumentMerger
A class that merges PDF documents with audit trail pages, combining an original PDF with an audit page and updating metadata to reflect the audit process.
/tf/active/vicechatdev/document_auditor/src/document_merger.py
5 - 72
moderate
Purpose
DocumentMerger combines a PDF document with its audit trail page. Using PyMuPDF (fitz), it checks that both input files exist, appends the audit page to the end of the original document, updates the standard metadata fields (producer, creator, subject, modDate) to reflect the audit, and saves the result to the output path. Errors raised during the merge are logged and re-raised; errors raised while updating metadata are logged and ignored. This class is typically used in document auditing systems where audit information needs to be permanently attached to original documents.
Source Code
class DocumentMerger:
    """Merges documents with audit trails"""

    def __init__(self):
        self.logger = logging.getLogger(__name__)

    def merge_pdfs(self, original_pdf_path, audit_page_path, output_path):
        """
        Merge the original document with the audit page

        Args:
            original_pdf_path (str): Path to the original PDF document
            audit_page_path (str): Path to the audit page PDF
            output_path (str): Path where merged PDF will be saved

        Returns:
            str: Path to the merged document
        """
        if not os.path.exists(original_pdf_path):
            raise FileNotFoundError(f"Original PDF not found: {original_pdf_path}")

        if not os.path.exists(audit_page_path):
            raise FileNotFoundError(f"Audit page PDF not found: {audit_page_path}")

        try:
            # Open both PDFs
            original_pdf = fitz.open(original_pdf_path)
            audit_pdf = fitz.open(audit_page_path)

            # Append audit page to original document
            original_pdf.insert_pdf(audit_pdf)

            # Update metadata - only use standard metadata keys
            self._update_metadata(original_pdf)

            # Save the combined document
            original_pdf.save(output_path)
            original_pdf.close()
            audit_pdf.close()

            self.logger.info(f"Successfully merged PDFs to: {output_path}")
            return output_path

        except Exception as e:
            self.logger.error(f"Error merging PDFs: {e}")
            raise

    def _update_metadata(self, pdf_document):
        """Update the PDF metadata to reflect the audit"""
        try:
            # Get existing metadata
            metadata = pdf_document.metadata

            # Only update standard metadata fields
            # PyMuPDF only accepts these standard PDF metadata fields
            metadata["producer"] = "Document Auditor System"
            metadata["modDate"] = fitz.get_pdf_now()
            metadata["creator"] = "Document Auditor"

            # Add a note in the subject field
            metadata["subject"] = "Document with audit trail"

            # Update the document with only standard fields
            pdf_document.set_metadata(metadata)
        except Exception as e:
            self.logger.error(f"Error updating metadata: {e}")
            # Continue without metadata if it fails
            pass
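In the listing above, close() is called on both documents only after a successful save, so an exception raised by insert_pdf or save leaves them open. The following is a minimal sketch, not the project's implementation, of a variant that closes both handles in every case using try/finally. The function name merge_with_cleanup and the open_pdf parameter are hypothetical; open_pdf defaults to fitz.open and exists only so the sketch can be exercised without PyMuPDF installed.

```python
import os


def merge_with_cleanup(original_pdf_path, audit_page_path, output_path, open_pdf=None):
    """Append the audit PDF to the original and always close both handles.

    open_pdf is a hypothetical injection point (not part of DocumentMerger).
    """
    # Same existence checks as merge_pdfs, done before anything is opened
    for path in (original_pdf_path, audit_page_path):
        if not os.path.exists(path):
            raise FileNotFoundError(f"PDF not found: {path}")

    if open_pdf is None:
        import fitz  # PyMuPDF, imported lazily so the sketch loads without it
        open_pdf = fitz.open

    original_pdf = open_pdf(original_pdf_path)
    try:
        audit_pdf = open_pdf(audit_page_path)
        try:
            original_pdf.insert_pdf(audit_pdf)  # append audit pages at the end
            original_pdf.save(output_path)
        finally:
            audit_pdf.close()  # runs whether or not the merge succeeded
    finally:
        original_pdf.close()  # runs even if opening the audit PDF failed
    return output_path
```

The nested try/finally keeps the cleanup order explicit: the audit document is closed first, then the original, regardless of where the failure occurred.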
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| bases | - | - | |
Parameter Details
__init__: The constructor takes no parameters and initializes the class with a logger instance for tracking operations and errors.
Return Value
Instantiation returns a DocumentMerger object. The main method merge_pdfs returns a string containing the path to the successfully merged PDF document. If errors occur during merging, exceptions are raised rather than returning error values.
Class Interface
Methods
__init__(self)
Purpose: Initialize the DocumentMerger instance with a logger
Returns: None - initializes the instance
merge_pdfs(self, original_pdf_path: str, audit_page_path: str, output_path: str) -> str
Purpose: Merge an original PDF document with an audit page PDF and save the result
Parameters:
original_pdf_path: String path to the original PDF document that needs an audit trail attached
audit_page_path: String path to the PDF containing the audit trail page to be appended
output_path: String path where the merged PDF document will be saved
Returns: String containing the path to the successfully merged PDF document (same as output_path parameter)
_update_metadata(self, pdf_document: fitz.Document) -> None
Purpose: Update the PDF metadata to reflect that an audit trail has been added
Parameters:
pdf_document: A fitz.Document object representing the PDF whose metadata should be updated
Returns: None - modifies the pdf_document object in place by updating its metadata fields
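The metadata step can be illustrated without opening a PDF. The sketch below builds the dictionary that _update_metadata would pass to set_metadata, as a pure function over a plain dict. The function name build_audit_metadata and the constant STANDARD_METADATA_KEYS are hypothetical, and the key set is an assumption based on the keys PyMuPDF uses in Document.metadata; verify it against the version you have installed.

```python
# Assumption: keys used by PyMuPDF's Document.metadata dictionary
STANDARD_METADATA_KEYS = {
    "format", "encryption", "title", "author", "subject", "keywords",
    "creator", "producer", "creationDate", "modDate", "trapped",
}


def build_audit_metadata(existing, mod_date):
    """Return a copy of `existing` with the audit fields set.

    existing: dict such as pdf_document.metadata (may be None)
    mod_date: PDF date string, for example the value of fitz.get_pdf_now()
    Keys outside STANDARD_METADATA_KEYS are dropped rather than passed on.
    """
    metadata = {
        key: value
        for key, value in (existing or {}).items()
        if key in STANDARD_METADATA_KEYS
    }
    # Same four fields that _update_metadata overwrites
    metadata.update({
        "producer": "Document Auditor System",
        "creator": "Document Auditor",
        "subject": "Document with audit trail",
        "modDate": mod_date,
    })
    return metadata
```

With PyMuPDF available, the result could be applied as pdf_document.set_metadata(build_audit_metadata(pdf_document.metadata, fitz.get_pdf_now())). Returning a new dict instead of mutating the input keeps the original metadata available for comparison.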
Attributes
| Name | Type | Description | Scope |
|---|---|---|---|
| logger | logging.Logger | Logger instance used for recording informational messages, warnings, and errors during PDF merge operations | instance |
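Because the logger is created with logging.getLogger(__name__), its name is the import name of the module that defines the class. The sketch below attaches an in-memory handler to that logger so merge messages can be inspected, for example in tests. The helper name capture_merger_logs is hypothetical, and the default logger name "document_merger" is an assumption that depends on how the module is imported in your project.

```python
import io
import logging

# Assumption: the module is imported as "document_merger"
DEFAULT_LOGGER_NAME = "document_merger"


def capture_merger_logs(logger_name=DEFAULT_LOGGER_NAME, level=logging.INFO):
    """Attach an in-memory handler to the named logger.

    Returns (handler, buffer); read captured text with buffer.getvalue()
    and detach with logging.getLogger(logger_name).removeHandler(handler).
    """
    buffer = io.StringIO()
    handler = logging.StreamHandler(buffer)
    handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
    logger = logging.getLogger(logger_name)
    logger.setLevel(level)
    logger.addHandler(handler)
    return handler, buffer
```

If the module is imported under a package prefix, pass the full dotted name as logger_name.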
Dependencies
- logging
- fitz
- os
Required Imports
import logging
import fitz
import os
Usage Example
import logging

# Assumption: adjust this import to match your project layout
from document_merger import DocumentMerger

# Configure logging
logging.basicConfig(level=logging.INFO)

# Instantiate the merger
merger = DocumentMerger()

# Define file paths
original_pdf = 'path/to/original_document.pdf'
audit_page = 'path/to/audit_trail.pdf'
output_pdf = 'path/to/merged_document.pdf'

# Merge the PDFs
try:
    result_path = merger.merge_pdfs(original_pdf, audit_page, output_pdf)
    print(f'Successfully merged document saved to: {result_path}')
except FileNotFoundError as e:
    print(f'File not found: {e}')
except Exception as e:
    print(f'Error during merge: {e}')
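Since the class keeps no state between calls, one instance can process several documents in a row. The following is a small sketch of such a loop; merge_batch is a hypothetical helper, not part of the documented class, and it accepts any object that exposes merge_pdfs(original, audit, output).

```python
def merge_batch(merger, jobs):
    """Run merger.merge_pdfs for each (original, audit, output) tuple.

    Returns (merged_paths, failures), where failures maps each output
    path to the exception that prevented the merge.
    """
    merged_paths = []
    failures = {}
    for original, audit, output in jobs:
        try:
            merged_paths.append(merger.merge_pdfs(original, audit, output))
        except Exception as exc:  # merge_pdfs logs the error and re-raises
            failures[output] = exc
    return merged_paths, failures
```

Collecting failures instead of stopping at the first one lets a batch continue when a single input file is missing or unreadable.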
Best Practices
- Always ensure input PDF files exist before calling merge_pdfs to avoid FileNotFoundError
- Wrap merge_pdfs calls in try-except blocks to handle potential exceptions during PDF processing
- Configure logging before instantiating DocumentMerger to capture operational logs
- Ensure the output directory exists and has write permissions before calling merge_pdfs
- The class is stateless between method calls, so a single instance can be reused for multiple merge operations
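- Both PDF handles are closed after a successful merge; if an exception is raised after the files are opened, the close() calls are skipped, so the handles stay open until the objects are garbage-collected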
- Metadata updates are non-critical; the merge will succeed even if metadata update fails
- The audit page is appended to the end of the original document, preserving original page order
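The checks on input files and the output directory recommended above can be collected into a single pre-flight step. This is a minimal sketch using only the standard library; the function name preflight and its create_dir flag are hypothetical and not part of DocumentMerger.

```python
import os
from pathlib import Path


def preflight(original_pdf_path, audit_page_path, output_path, create_dir=True):
    """Return a list of problems found; an empty list means it is safe to merge."""
    problems = []

    # Both inputs must exist as regular files
    for label, path in (("original", original_pdf_path), ("audit page", audit_page_path)):
        if not Path(path).is_file():
            problems.append(f"{label} PDF not found: {path}")

    # The output directory must exist (optionally created here) and be writable
    out_dir = Path(output_path).resolve().parent
    if not out_dir.exists():
        if create_dir:
            out_dir.mkdir(parents=True, exist_ok=True)
        else:
            problems.append(f"output directory does not exist: {out_dir}")
    if out_dir.exists() and not os.access(out_dir, os.W_OK):
        problems.append(f"output directory is not writable: {out_dir}")

    return problems
```

A caller could run preflight(...) first and invoke merge_pdfs only when the returned list is empty, reporting the collected messages otherwise.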
Similar Components
AI-powered semantic similarity - components with related functionality:
- function merge_pdfs_v1 69.8% similar
- class DocumentProcessor 62.0% similar
- class PDFManipulator 60.3% similar
- class ControlledDocumentConverter 60.2% similar
- class AuditPageGenerator 59.2% similar