function compare_documents_v1

Maturity: 56

Compares two sets of PDF documents by matching document codes, detecting signatures, calculating content similarity, and generating detailed comparison results with signature information.

File: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
Lines: 280 - 362
Complexity: complex

Purpose

This function compares two document sets, output_docs and wuxi2_docs. For each entry in output_docs it looks up candidates with the same document code in wuxi2_docs, picks the best match, detects signatures (particularly DocuSign signatures from Vicebio and Wuxi signers), calculates content and filename similarity, and records file metadata, signature status, and match quality. Documents that exist only in wuxi2_docs are not reported, because the loop iterates over output_docs. The function is designed for document verification workflows where signature presence and document matching are critical.

Source Code

def compare_documents(output_docs: Dict[str, Dict], wuxi2_docs: Dict[str, List[Dict]]) -> List[Dict]:
    """Compare documents and generate detailed results with signature detection"""
    print("\n" + "="*80)
    print("Comparing documents with signature detection...")
    print("="*80 + "\n")
    
    results = []
    
    for code, output_doc in sorted(output_docs.items()):
        result = {
            'document_code': code,
            'output_filename': output_doc['filename'],
            'output_size': output_doc['size'],
            'output_hash': output_doc['hash'],
            'output_signed': output_doc['signature_info']['has_signature'],
            'output_signature_confidence': output_doc['signature_info']['confidence'],
            'output_vicebio_signers': ', '.join(output_doc['signature_info']['details']['vicebio_signers']),
            'output_wuxi_signers': ', '.join(output_doc['signature_info']['details']['wuxi_signers']),
            'status': 'ABSENT',
            'match_type': 'N/A',
            'wuxi2_filename': '',
            'wuxi2_path': '',
            'wuxi2_size': 0,
            'wuxi2_hash': '',
            'wuxi2_signed': False,
            'wuxi2_signature_confidence': 'NONE',
            'wuxi2_vicebio_signers': '',
            'wuxi2_wuxi_signers': '',
            'content_similarity': 0.0,
            'filename_similarity': 0.0,
            'notes': ''
        }
        
        # Look for matches in wuxi2
        wuxi2_candidates = wuxi2_docs.get(code, [])
        
        if wuxi2_candidates:
            match_result = find_best_match(output_doc, wuxi2_candidates)
            
            if match_result:
                match_doc, match_type, confidence = match_result
                
                # Lazy load signature info for wuxi2 file
                if match_doc['signature_info'] is None:
                    print(f"  Checking signatures for: {match_doc['filename'][:60]}...")
                    match_doc['signature_info'] = detect_signatures_in_pdf(match_doc['filepath'])
                
                # Status based on Vicebio signature presence (DocuSign = fully signed)
                result['status'] = 'PRESENT & SIGNED' if match_doc['signature_info']['has_signature'] else 'PRESENT BUT UNSIGNED'
                result['match_type'] = match_type
                result['wuxi2_filename'] = match_doc['filename']
                result['wuxi2_path'] = match_doc['relative_path']
                result['wuxi2_size'] = match_doc['size']
                result['wuxi2_hash'] = match_doc['hash']
                result['wuxi2_signed'] = match_doc['signature_info']['has_signature']
                result['wuxi2_signature_confidence'] = match_doc['signature_info']['confidence']
                result['wuxi2_vicebio_signers'] = ', '.join(match_doc['signature_info']['details']['vicebio_signers'])
                result['wuxi2_wuxi_signers'] = ', '.join(match_doc['signature_info']['details']['wuxi_signers'])
                result['filename_similarity'] = confidence
                
                # Calculate content similarity for non-identical matches
                if match_type != 'IDENTICAL':
                    result['content_similarity'] = compare_pdf_content(output_doc['filepath'], match_doc['filepath'])
                else:
                    result['content_similarity'] = 1.0
                
                # Add notes
                if result['output_size'] != result['wuxi2_size']:
                    size_diff = result['output_size'] - result['wuxi2_size']
                    pct_diff = (size_diff / result['wuxi2_size'] * 100) if result['wuxi2_size'] > 0 else 0
                    result['notes'] = f"Size diff: {size_diff:+,d} bytes ({pct_diff:+.1f}%)"
                
                # Status indicator
                status_symbol = '✓' if result['status'] == 'PRESENT & SIGNED' else '⚠'
                print(f"{status_symbol} {code:20s} {result['status']:20s} {result['wuxi2_filename'][:60]}")
            else:
                print(f"✗ {code:20s} ABSENT                  {output_doc['filename'][:60]}")
        else:
            print(f"✗ {code:20s} ABSENT                  {output_doc['filename'][:60]}")
        
        results.append(result)
    
    return results
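
The size note built near the end of the loop can be tried in isolation. The sketch below copies that arithmetic into a standalone helper; the name size_note is hypothetical and does not exist in the source file.

```python
def size_note(output_size: int, wuxi2_size: int) -> str:
    """Reproduce the 'notes' string from compare_documents (hypothetical helper)."""
    if output_size == wuxi2_size:
        # Equal sizes leave the note empty, as in the original function.
        return ''
    size_diff = output_size - wuxi2_size
    # Guard against division by zero when the matched file is empty.
    pct_diff = (size_diff / wuxi2_size * 100) if wuxi2_size > 0 else 0
    return f"Size diff: {size_diff:+,d} bytes ({pct_diff:+.1f}%)"


print(size_note(102400, 105000))  # Size diff: -2,600 bytes (-2.5%)
```

A negative value means the output document is smaller than its wuxi2 match.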

Parameters

Name         Type                   Default  Kind
output_docs  Dict[str, Dict]        -        positional_or_keyword
wuxi2_docs   Dict[str, List[Dict]]  -        positional_or_keyword

Parameter Details

output_docs: Dictionary mapping document codes (strings) to document metadata dictionaries. Each document dict must contain: 'filename' (str), 'size' (int), 'hash' (str), 'filepath' (str), and 'signature_info' (dict with keys: 'has_signature', 'confidence', 'details' containing 'vicebio_signers' and 'wuxi_signers' lists). These represent the reference/source documents to compare against.

wuxi2_docs: Dictionary mapping document codes (strings) to lists of document metadata dictionaries. Each document dict must contain: 'filename' (str), 'size' (int), 'hash' (str), 'filepath' (str), 'relative_path' (str), and 'signature_info' (dict or None). The 'signature_info' key must be present; a value of None triggers lazy signature detection on the matched document only. Multiple documents can share the same code. These represent the target documents to search for matches.
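
The two input shapes described above can be written down as TypedDict classes, which also gives a cheap way to check a document dict before passing it in. This is a sketch only: the class names and the missing_keys helper are hypothetical and are not defined in the source file, which uses plain Dict annotations.

```python
from typing import List, Optional, Set, TypedDict


class SignatureDetails(TypedDict):
    vicebio_signers: List[str]
    wuxi_signers: List[str]


class SignatureInfo(TypedDict):
    has_signature: bool
    confidence: str
    details: SignatureDetails


class OutputDoc(TypedDict):
    # Shape of each value in output_docs (hypothetical class name).
    filename: str
    size: int
    hash: str
    filepath: str
    signature_info: SignatureInfo


class Wuxi2Doc(TypedDict):
    # Shape of each list item in wuxi2_docs (hypothetical class name).
    filename: str
    size: int
    hash: str
    filepath: str
    relative_path: str
    signature_info: Optional[SignatureInfo]  # None means "detect lazily"


def missing_keys(doc: dict, schema: type) -> Set[str]:
    """Return the keys the schema requires that are absent from doc."""
    return set(schema.__required_keys__) - set(doc)


print(sorted(missing_keys({'filename': 'a.pdf', 'size': 1}, OutputDoc)))
```

The check covers top-level keys only; nested 'signature_info' contents would need a similar pass.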

Return Value

Type: List[Dict]

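Returns a list of dictionaries, one per entry in output_docs, ordered by document code. Each result dictionary contains:

  • Identification: 'document_code' (str)
  • Output document: 'output_filename' (str), 'output_size' (int), 'output_hash' (str), 'output_signed' (bool), 'output_signature_confidence' (str), 'output_vicebio_signers' (str, comma-separated), 'output_wuxi_signers' (str, comma-separated)
  • Match outcome: 'status' (str: 'ABSENT', 'PRESENT & SIGNED', or 'PRESENT BUT UNSIGNED'), 'match_type' (str, 'N/A' when no match is found)
  • Matched wuxi2 document: 'wuxi2_filename' (str), 'wuxi2_path' (str), 'wuxi2_size' (int), 'wuxi2_hash' (str), 'wuxi2_signed' (bool), 'wuxi2_signature_confidence' (str), 'wuxi2_vicebio_signers' (str, comma-separated), 'wuxi2_wuxi_signers' (str, comma-separated)
  • Similarity: 'content_similarity' (float 0.0-1.0, set to 1.0 for 'IDENTICAL' matches), 'filename_similarity' (float 0.0-1.0, the confidence value returned by find_best_match)
  • Notes: 'notes' (str, the size difference when the two file sizes differ, otherwise empty)

When no match is found, the wuxi2 fields keep their defaults: empty strings, size 0, signed False, confidence 'NONE', and both similarities 0.0.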
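
Because every result dictionary has the same flat set of keys, the returned list is easy to tally or export. The following sketch shows one way to do both with the standard library; the helper names summarize_statuses and results_to_csv are hypothetical, and the sample rows are trimmed to two keys for brevity.

```python
import csv
import io
from collections import Counter
from typing import Dict, List


def summarize_statuses(results: List[Dict]) -> Counter:
    """Count how many results fall under each 'status' value."""
    return Counter(r['status'] for r in results)


def results_to_csv(results: List[Dict]) -> str:
    """Serialize the result dicts to CSV text, using the first row's keys as header."""
    if not results:
        return ''
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(results[0].keys()))
    writer.writeheader()
    writer.writerows(results)
    return buf.getvalue()


# Trimmed sample rows; real results carry all keys listed above.
sample = [
    {'document_code': 'DOC001', 'status': 'PRESENT & SIGNED'},
    {'document_code': 'DOC002', 'status': 'ABSENT'},
    {'document_code': 'DOC003', 'status': 'ABSENT'},
]
print(summarize_statuses(sample))
print(results_to_csv(sample))
```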

Dependencies

  • PyPDF2

Required Imports

from typing import Dict, List
import PyPDF2

Conditional/Optional Imports

These helpers are not part of the standard library or PyPDF2. They must be defined in, or imported into, the module that contains compare_documents; some_module is a placeholder for wherever they live.

from some_module import find_best_match

Condition: Required whenever wuxi2_docs holds candidates for a document code. Finds the best matching document among the candidates and returns a (match_doc, match_type, confidence) tuple, or a falsy value when nothing matches.

from some_module import detect_signatures_in_pdf

Condition: Required whenever a matched wuxi2 document has 'signature_info' set to None. Detects signatures in a PDF file and returns a signature info dict.

from some_module import compare_pdf_content

Condition: Required whenever the match type is not 'IDENTICAL'. Compares content similarity between two PDF files.
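
The real find_best_match is not shown on this page, so its matching rules are unknown. The sketch below is only an illustration of a helper that satisfies the return contract compare_documents relies on: a (match_doc, match_type, confidence) tuple, or None. The function name, the 'SIMILAR' label, the min_ratio threshold, and the use of difflib for filename similarity are all assumptions; only the 'IDENTICAL' match type appears in the source.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional, Tuple


def find_best_match_sketch(
    output_doc: Dict,
    candidates: List[Dict],
    min_ratio: float = 0.5,  # hypothetical cut-off, not taken from the source
) -> Optional[Tuple[Dict, str, float]]:
    """Hypothetical stand-in for find_best_match."""
    # An equal hash is treated as an exact copy.
    for cand in candidates:
        if cand['hash'] == output_doc['hash']:
            return cand, 'IDENTICAL', 1.0

    # Otherwise fall back to the candidate with the most similar filename.
    best, best_ratio = None, 0.0
    for cand in candidates:
        ratio = SequenceMatcher(
            None, output_doc['filename'].lower(), cand['filename'].lower()
        ).ratio()
        if ratio > best_ratio:
            best, best_ratio = cand, ratio

    if best is not None and best_ratio >= min_ratio:
        return best, 'SIMILAR', best_ratio  # 'SIMILAR' is an assumed label
    return None


doc = {'filename': 'contract_v1.pdf', 'hash': 'abc123'}
candidates = [{'filename': 'contract_signed.pdf', 'hash': 'def456'}]
print(find_best_match_sketch(doc, candidates))
```

Returning None for weak matches keeps the caller's `if match_result:` branch meaningful, so such documents are reported as ABSENT.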

Usage Example

# Assuming helper functions are defined
output_docs = {
    'DOC001': {
        'filename': 'contract_v1.pdf',
        'size': 102400,
        'hash': 'abc123',
        'filepath': '/path/to/output/contract_v1.pdf',
        'signature_info': {
            'has_signature': True,
            'confidence': 'HIGH',
            'details': {
                'vicebio_signers': ['john@vicebio.com'],
                'wuxi_signers': []
            }
        }
    }
}

wuxi2_docs = {
    'DOC001': [
        {
            'filename': 'contract_signed.pdf',
            'size': 105000,
            'hash': 'def456',
            'filepath': '/path/to/wuxi2/contract_signed.pdf',
            'relative_path': 'contracts/contract_signed.pdf',
            'signature_info': None
        }
    ]
}

results = compare_documents(output_docs, wuxi2_docs)
for result in results:
    print(f"Code: {result['document_code']}, Status: {result['status']}, Match: {result['match_type']}")

Best Practices

  • Ensure all required helper functions (find_best_match, detect_signatures_in_pdf, compare_pdf_content) are properly implemented before calling this function
  • The function uses lazy loading for signature detection on wuxi2 documents to improve performance - signature_info can be None initially
  • Document codes should be consistent between output_docs and wuxi2_docs for proper matching
  • The function prints progress to stdout - consider redirecting or capturing output in production environments
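  • The 'PRESENT & SIGNED' status is driven by the 'has_signature' flag of the matched wuxi2 document; a comment in the source indicates this flag is meant to reflect Vicebio (DocuSign) signatures
  • Content similarity calculation is skipped for 'IDENTICAL' matches and set to 1.0 to save processing time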
  • File paths in document dictionaries must be valid and accessible for PDF processing
  • The function sorts output_docs by code for consistent ordering in results
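
One way to follow the stdout advice above is to wrap the call in contextlib.redirect_stdout and keep the progress text for logging. The sketch below uses a small stand-in function rather than compare_documents itself, so that it runs on its own; run_quietly and noisy_stand_in are hypothetical names.

```python
import contextlib
import io
from typing import Any, Callable, Tuple


def run_quietly(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, str]:
    """Call func, returning its result together with everything it printed."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        value = func(*args, **kwargs)
    return value, buf.getvalue()


def noisy_stand_in(docs: dict) -> list:
    # Stand-in for compare_documents: prints progress, returns sorted results.
    print("Comparing documents with signature detection...")
    return [{'document_code': code, 'status': 'ABSENT'} for code in sorted(docs)]


results, log_text = run_quietly(noisy_stand_in, {'DOC002': {}, 'DOC001': {}})
print(len(results), "results;", len(log_text.splitlines()), "captured log line(s)")
```

In real use the call would be run_quietly(compare_documents, output_docs, wuxi2_docs), with log_text passed on to whatever logging the application uses.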

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function compare_documents 77.5% similar

    Compares documents from an output folder with documents in a wuxi2 repository by matching document codes, file hashes, sizes, and filenames to identify identical, similar, or missing documents.

    From: /tf/active/vicechatdev/mailsearch/compare_documents.py
  • function main_v102 71.0% similar

    Main entry point function that orchestrates a document comparison workflow between two folders (mailsearch/output and wuxi2 repository), detecting signatures and generating comparison results.

    From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
  • function compare_pdf_content 70.4% similar

    Compares the textual content similarity between two PDF files by extracting text samples and computing a similarity ratio using sequence matching.

    From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
  • function print_summary_v1 69.3% similar

    Prints a comprehensive summary report of document comparison results, including status breakdowns, signature analysis, match quality metrics, and examples from each category.

    From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
  • function main_v57 67.9% similar

    Main execution function that orchestrates a document comparison workflow between two directories (mailsearch/output and wuxi2 repository), scanning for coded documents, comparing them, and generating results.

    From: /tf/active/vicechatdev/mailsearch/compare_documents.py