function compare_documents_v1
Compares two sets of PDF documents by matching document codes, detecting signatures, calculating content similarity, and generating detailed comparison results with signature information.
/tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
280 - 362
complex
Purpose
This function performs comprehensive document comparison between two document sets (output_docs and wuxi2_docs). It matches documents by code, detects digital signatures (particularly DocuSign signatures from Vicebio and Wuxi signers), calculates content and filename similarity, and generates detailed comparison results including file metadata, signature status, and match quality. The function is designed for document verification workflows where signature presence and document matching are critical.
Source Code
def compare_documents(output_docs: Dict[str, Dict], wuxi2_docs: Dict[str, List[Dict]]) -> List[Dict]:
    """Compare documents and generate detailed results with signature detection"""
    print("\n" + "="*80)
    print("Comparing documents with signature detection...")
    print("="*80 + "\n")

    results = []

    for code, output_doc in sorted(output_docs.items()):
        result = {
            'document_code': code,
            'output_filename': output_doc['filename'],
            'output_size': output_doc['size'],
            'output_hash': output_doc['hash'],
            'output_signed': output_doc['signature_info']['has_signature'],
            'output_signature_confidence': output_doc['signature_info']['confidence'],
            'output_vicebio_signers': ', '.join(output_doc['signature_info']['details']['vicebio_signers']),
            'output_wuxi_signers': ', '.join(output_doc['signature_info']['details']['wuxi_signers']),
            'status': 'ABSENT',
            'match_type': 'N/A',
            'wuxi2_filename': '',
            'wuxi2_path': '',
            'wuxi2_size': 0,
            'wuxi2_hash': '',
            'wuxi2_signed': False,
            'wuxi2_signature_confidence': 'NONE',
            'wuxi2_vicebio_signers': '',
            'wuxi2_wuxi_signers': '',
            'content_similarity': 0.0,
            'filename_similarity': 0.0,
            'notes': ''
        }

        # Look for matches in wuxi2
        wuxi2_candidates = wuxi2_docs.get(code, [])

        if wuxi2_candidates:
            match_result = find_best_match(output_doc, wuxi2_candidates)

            if match_result:
                match_doc, match_type, confidence = match_result

                # Lazy load signature info for wuxi2 file
                if match_doc['signature_info'] is None:
                    print(f" Checking signatures for: {match_doc['filename'][:60]}...")
                    match_doc['signature_info'] = detect_signatures_in_pdf(match_doc['filepath'])

                # Status based on Vicebio signature presence (DocuSign = fully signed)
                result['status'] = 'PRESENT & SIGNED' if match_doc['signature_info']['has_signature'] else 'PRESENT BUT UNSIGNED'
                result['match_type'] = match_type
                result['wuxi2_filename'] = match_doc['filename']
                result['wuxi2_path'] = match_doc['relative_path']
                result['wuxi2_size'] = match_doc['size']
                result['wuxi2_hash'] = match_doc['hash']
                result['wuxi2_signed'] = match_doc['signature_info']['has_signature']
                result['wuxi2_signature_confidence'] = match_doc['signature_info']['confidence']
                result['wuxi2_vicebio_signers'] = ', '.join(match_doc['signature_info']['details']['vicebio_signers'])
                result['wuxi2_wuxi_signers'] = ', '.join(match_doc['signature_info']['details']['wuxi_signers'])
                result['filename_similarity'] = confidence

                # Calculate content similarity for non-identical matches
                if match_type != 'IDENTICAL':
                    result['content_similarity'] = compare_pdf_content(output_doc['filepath'], match_doc['filepath'])
                else:
                    result['content_similarity'] = 1.0

                # Add notes
                if result['output_size'] != result['wuxi2_size']:
                    size_diff = result['output_size'] - result['wuxi2_size']
                    pct_diff = (size_diff / result['wuxi2_size'] * 100) if result['wuxi2_size'] > 0 else 0
                    result['notes'] = f"Size diff: {size_diff:+,d} bytes ({pct_diff:+.1f}%)"

                # Status indicator
                status_symbol = '✓' if result['status'] == 'PRESENT & SIGNED' else '⚠'
                print(f"{status_symbol} {code:20s} {result['status']:20s} {result['wuxi2_filename'][:60]}")
            else:
                print(f"✗ {code:20s} ABSENT {output_doc['filename'][:60]}")
        else:
            print(f"✗ {code:20s} ABSENT {output_doc['filename'][:60]}")

        results.append(result)

    return results
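The 'notes' field is the one piece of the function that can be checked without any PDF files. The sketch below isolates that formatting in a hypothetical helper, `size_note` (not part of the original module), and applies it to the file sizes used in the usage example.

```python
def size_note(output_size: int, wuxi2_size: int) -> str:
    """Hypothetical helper mirroring the 'notes' logic in compare_documents."""
    if output_size == wuxi2_size:
        return ''  # equal sizes leave the note empty
    size_diff = output_size - wuxi2_size
    # Guard against division by zero for an empty wuxi2 file
    pct_diff = (size_diff / wuxi2_size * 100) if wuxi2_size > 0 else 0
    return f"Size diff: {size_diff:+,d} bytes ({pct_diff:+.1f}%)"


print(size_note(102400, 105000))  # Size diff: -2,600 bytes (-2.5%)
```

A negative value means the output document is smaller than the matched wuxi2 document.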
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| output_docs | Dict[str, Dict] | - | positional_or_keyword |
| wuxi2_docs | Dict[str, List[Dict]] | - | positional_or_keyword |
Parameter Details
output_docs: Dictionary mapping document codes (strings) to document metadata dictionaries. Each document dict must contain: 'filename' (str), 'size' (int), 'hash' (str), 'filepath' (str), and 'signature_info' (dict with keys: 'has_signature', 'confidence', 'details' containing 'vicebio_signers' and 'wuxi_signers' lists). These represent the reference/source documents to compare against.
wuxi2_docs: Dictionary mapping document codes (strings) to lists of document metadata dictionaries. Each document dict must contain: 'filename' (str), 'size' (int), 'hash' (str), 'filepath' (str), 'relative_path' (str), and 'signature_info' (dict or None for lazy loading). Multiple documents can share the same code. These represent the target documents to search for matches.
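The required keys described above could be written down as `TypedDict` classes. This is only a sketch: the class names and the `missing_keys` helper are illustrative assumptions, and the original module uses plain `Dict` annotations.

```python
from typing import List, Optional, TypedDict


class SignatureDetails(TypedDict):
    vicebio_signers: List[str]
    wuxi_signers: List[str]


class SignatureInfo(TypedDict):
    has_signature: bool
    confidence: str
    details: SignatureDetails


class OutputDoc(TypedDict):
    filename: str
    size: int
    hash: str
    filepath: str
    signature_info: SignatureInfo


class Wuxi2Doc(TypedDict):
    filename: str
    size: int
    hash: str
    filepath: str
    relative_path: str
    signature_info: Optional[SignatureInfo]  # None until lazily populated


def missing_keys(doc: dict, schema: type) -> List[str]:
    """Hypothetical runtime check: top-level keys the schema expects but doc lacks."""
    return sorted(key for key in schema.__annotations__ if key not in doc)


print(missing_keys({'filename': 'a.pdf'}, OutputDoc))
```

Running a check like this before the comparison would surface a malformed entry early, rather than as a `KeyError` partway through the loop.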
Return Value
Type: List[Dict]
Returns a list of dictionaries, one per document in output_docs. Each result dictionary contains: 'document_code' (str), 'output_filename' (str), 'output_size' (int), 'output_hash' (str), 'output_signed' (bool), 'output_signature_confidence' (str), 'output_vicebio_signers' (str), 'output_wuxi_signers' (str), 'status' (str: 'ABSENT', 'PRESENT & SIGNED', or 'PRESENT BUT UNSIGNED'), 'match_type' (str), 'wuxi2_filename' (str), 'wuxi2_path' (str), 'wuxi2_size' (int), 'wuxi2_hash' (str), 'wuxi2_signed' (bool), 'wuxi2_signature_confidence' (str), 'wuxi2_vicebio_signers' (str), 'wuxi2_wuxi_signers' (str), 'content_similarity' (float 0.0-1.0), 'filename_similarity' (float 0.0-1.0), and 'notes' (str).
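Because every result dictionary has the same keys, the list is easy to tally or export. The sketch below counts results by status and serializes them to CSV. The helper names and the two-key sample rows are illustrative assumptions, and real results carry all the keys listed above.

```python
import csv
import io
from collections import Counter
from typing import Dict, List


def summarize_status(results: List[Dict]) -> Counter:
    """Count results per status value."""
    return Counter(r['status'] for r in results)


def results_to_csv(results: List[Dict]) -> str:
    """Serialize results to CSV text, using the first row's keys as the header."""
    if not results:
        return ''
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(results[0].keys()))
    writer.writeheader()
    writer.writerows(results)
    return buf.getvalue()


# Abbreviated sample rows for illustration only
sample = [
    {'document_code': 'DOC001', 'status': 'PRESENT & SIGNED'},
    {'document_code': 'DOC002', 'status': 'ABSENT'},
    {'document_code': 'DOC003', 'status': 'ABSENT'},
]
print(summarize_status(sample))
print(results_to_csv(sample))
```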
Dependencies
PyPDF2
Required Imports
from typing import Dict, List
import PyPDF2
Conditional/Optional Imports
These imports are only needed under specific conditions:
from some_module import find_best_match
Condition: Required helper function that must be defined or imported - finds the best matching document from candidates
Required (conditional)
from some_module import detect_signatures_in_pdf
Condition: Required helper function that must be defined or imported - detects signatures in PDF files and returns signature info dict
Required (conditional)
from some_module import compare_pdf_content
Condition: Required helper function that must be defined or imported - compares content similarity between two PDF files
Required (conditional)
Usage Example
# Assuming helper functions are defined
output_docs = {
    'DOC001': {
        'filename': 'contract_v1.pdf',
        'size': 102400,
        'hash': 'abc123',
        'filepath': '/path/to/output/contract_v1.pdf',
        'signature_info': {
            'has_signature': True,
            'confidence': 'HIGH',
            'details': {
                'vicebio_signers': ['john@vicebio.com'],
                'wuxi_signers': []
            }
        }
    }
}

wuxi2_docs = {
    'DOC001': [
        {
            'filename': 'contract_signed.pdf',
            'size': 105000,
            'hash': 'def456',
            'filepath': '/path/to/wuxi2/contract_signed.pdf',
            'relative_path': 'contracts/contract_signed.pdf',
            'signature_info': None
        }
    ]
}

results = compare_documents(output_docs, wuxi2_docs)

for result in results:
    print(f"Code: {result['document_code']}, Status: {result['status']}, Match: {result['match_type']}")
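The example above assumes the three helpers already exist. For a dry run without real PDFs, placeholder versions along the following lines could stand in. Their signatures and return shapes are inferred from the call sites in compare_documents, not taken from the original module. The 'FILENAME' match label and the filename-ratio matching are assumptions made purely for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional, Tuple


def find_best_match(output_doc: Dict, candidates: List[Dict]) -> Optional[Tuple[Dict, str, float]]:
    """Stub: prefer an exact hash match, otherwise the closest filename."""
    for cand in candidates:
        if cand['hash'] == output_doc['hash']:
            return cand, 'IDENTICAL', 1.0
    best, best_ratio = None, 0.0
    for cand in candidates:
        ratio = SequenceMatcher(None, output_doc['filename'], cand['filename']).ratio()
        if ratio > best_ratio:
            best, best_ratio = cand, ratio
    # 'FILENAME' is a made-up label; the real helper may use other match types
    return (best, 'FILENAME', best_ratio) if best is not None else None


def detect_signatures_in_pdf(filepath: str) -> Dict:
    """Stub: report every file as unsigned, in the shape compare_documents reads."""
    return {
        'has_signature': False,
        'confidence': 'NONE',
        'details': {'vicebio_signers': [], 'wuxi_signers': []},
    }


def compare_pdf_content(path_a: str, path_b: str) -> float:
    """Stub: no real content comparison is performed."""
    return 0.0
```

With these stubs in scope, the usage example would report DOC001 as 'PRESENT BUT UNSIGNED', since the hashes differ and the stub detector never finds a signature.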
Best Practices
- Ensure all required helper functions (find_best_match, detect_signatures_in_pdf, compare_pdf_content) are properly implemented before calling this function
- The function uses lazy loading for signature detection on wuxi2 documents to improve performance - signature_info can be None initially
- Document codes should be consistent between output_docs and wuxi2_docs for proper matching
- The function prints progress to stdout - consider redirecting or capturing output in production environments
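- The 'PRESENT & SIGNED' status is set from the matched wuxi2 document's signature_info['has_signature'] flag; the code comment states this is meant to reflect Vicebio (DocuSign) signatures, so the exact meaning depends on how detect_signatures_in_pdf sets that flag
- Content similarity is set to 1.0 without calling compare_pdf_content when match_type is 'IDENTICAL', which saves processing time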
- File paths in document dictionaries must be valid and accessible for PDF processing
- The function sorts output_docs by code for consistent ordering in results
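Since the function reports progress with print, one way to keep that output out of the console is the standard library's `contextlib.redirect_stdout`. The sketch below wraps any callable this way. `run_quietly` and `noisy_stand_in` are hypothetical names, and the stand-in only imitates the printing behaviour of compare_documents.

```python
import contextlib
import io


def run_quietly(func, *args, **kwargs):
    """Call func and return (result, captured_stdout_text)."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        result = func(*args, **kwargs)
    return result, buf.getvalue()


def noisy_stand_in(code):
    """Stand-in for a function that prints progress, as compare_documents does."""
    print("Comparing documents with signature detection...")
    return [code]


result, log = run_quietly(noisy_stand_in, 'DOC001')
# 'log' now holds the progress text and could be forwarded to a logger
```

In practice the call would be `run_quietly(compare_documents, output_docs, wuxi2_docs)`.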
Similar Components
AI-powered semantic similarity - components with related functionality:
- function compare_documents 77.5% similar
- function main_v102 71.0% similar
- function compare_pdf_content 70.4% similar
- function print_summary_v1 69.3% similar
- function main_v57 67.9% similar