
function compare_documents

Maturity: 57

Compares documents from an output folder with documents in a wuxi2 repository by matching document codes, file hashes, sizes, and filenames to identify identical, similar, or missing documents.

File: /tf/active/vicechatdev/mailsearch/compare_documents.py
Lines: 208 - 321
Complexity: moderate

Purpose

This function compares documents from two sources (an output folder and a wuxi2 repository) to identify matches, duplicates, and discrepancies. For each output document it looks up the wuxi2 files that share the same document code and applies three strategies in priority order: an exact hash match marks the files as identical, an exact size match marks a possible duplicate, and otherwise the candidate with the highest fuzzy filename similarity is reported. Each result records the match type, the metadata of both files, and the similarity score, which makes the output suitable for document reconciliation, migration validation, or duplicate detection workflows.
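The priority order and the similarity thresholds can be condensed into a small sketch. The helper name classify_match is hypothetical and does not exist in the module; it collapses the decision to a single candidate, whereas the real function may take the hash, size, and filename winners from different files that share a code. The labels and the 0.7 / 0.4 cut-offs are copied from the source below.

```python
def classify_match(hash_equal: bool, size_equal: bool, similarity: float) -> str:
    """Return the match_type label for one candidate (hypothetical helper)."""
    if hash_equal:
        # Identical content always wins, regardless of filename.
        return 'IDENTICAL (hash match)'
    if size_equal:
        # Same byte count but different hash: worth a manual look.
        return 'SIZE MATCH (possible identical)'
    # Thresholds are strict comparisons, as in the source.
    if similarity > 0.7:
        return 'HIGH SIMILARITY'
    if similarity > 0.4:
        return 'MODERATE SIMILARITY'
    if similarity > 0.0:
        return 'LOW SIMILARITY'
    # A score of 0.0 means no candidate was selected at all.
    return 'CODE MATCH ONLY'
```

Because the comparisons are strict, a score of exactly 0.7 is labelled 'MODERATE SIMILARITY' and a score of exactly 0.4 is labelled 'LOW SIMILARITY'.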

Source Code

def compare_documents(
    output_docs: Dict[str, Dict],
    wuxi2_docs: Dict[str, List[Dict]]
) -> List[Dict]:
    """
    Compare documents from output folder with wuxi2 repository
    
    Args:
        output_docs: Documents from output folder
        wuxi2_docs: Documents from wuxi2 repository
        
    Returns:
        List of comparison results
    """
    print(f"\n{'='*80}")
    print("Comparing documents...")
    print(f"{'='*80}\n")
    
    results = []
    
    for code, output_info in output_docs.items():
        result = {
            'document_code': code,
            'output_filename': output_info['filename'],
            'output_size': output_info['size'],
            'output_hash': output_info['hash'],
            'status': 'ABSENT',
            'match_type': 'N/A',
            'wuxi2_filename': '',
            'wuxi2_path': '',
            'wuxi2_size': 0,
            'wuxi2_hash': '',
            'size_match': False,
            'hash_match': False,
            'filename_similarity': 0.0,
            'notes': ''
        }
        
        # Check if code exists in wuxi2
        if code in wuxi2_docs:
            result['status'] = 'PRESENT'
            wuxi2_matches = wuxi2_docs[code]
            
            # Find best match
            best_match = None
            best_score = 0.0
            exact_hash_match = None
            exact_size_match = None
            
            for wuxi2_file in wuxi2_matches:
                # Check for exact hash match (identical file)
                if wuxi2_file['hash'] == output_info['hash']:
                    exact_hash_match = wuxi2_file
                    break
                
                # Check for exact size match
                if wuxi2_file['size'] == output_info['size'] and not exact_size_match:
                    exact_size_match = wuxi2_file
                
                # Calculate filename similarity
                similarity = fuzzy_match_filename(
                    output_info['filename'],
                    wuxi2_file['filename'],
                    code
                )
                
                if similarity > best_score:
                    best_score = similarity
                    best_match = wuxi2_file
            
            # Determine match type and populate result
            if exact_hash_match:
                match = exact_hash_match
                result['match_type'] = 'IDENTICAL (hash match)'
                result['hash_match'] = True
                result['size_match'] = True
            elif exact_size_match:
                match = exact_size_match
                result['match_type'] = 'SIZE MATCH (possible identical)'
                result['size_match'] = True
                result['hash_match'] = False
            elif best_match:
                match = best_match
                if best_score > 0.7:
                    result['match_type'] = 'HIGH SIMILARITY'
                elif best_score > 0.4:
                    result['match_type'] = 'MODERATE SIMILARITY'
                else:
                    result['match_type'] = 'LOW SIMILARITY'
                result['hash_match'] = False
                result['size_match'] = False
            else:
                result['match_type'] = 'CODE MATCH ONLY'
                result['notes'] = f"{len(wuxi2_matches)} file(s) with same code but no good match"
            
            if best_match or exact_hash_match or exact_size_match:
                match = exact_hash_match or exact_size_match or best_match
                result['wuxi2_filename'] = match['filename']
                result['wuxi2_path'] = match['relative_path']
                result['wuxi2_size'] = match['size']
                result['wuxi2_hash'] = match['hash']
                result['filename_similarity'] = best_score
                
                # Add notes for multiple matches
                if len(wuxi2_matches) > 1:
                    result['notes'] = f"{len(wuxi2_matches)} files with code {code} in wuxi2"
        
        results.append(result)
        
        # Print progress
        status_symbol = "✓" if result['status'] == 'PRESENT' else "✗"
        print(f"{status_symbol} {code:20s} {result['match_type']:30s} {output_info['filename'][:50]}")
    
    return results

Parameters

Name         Type                   Default  Kind
output_docs  Dict[str, Dict]        -        positional_or_keyword
wuxi2_docs   Dict[str, List[Dict]]  -        positional_or_keyword

Parameter Details

output_docs: Dictionary mapping document codes (strings) to document metadata dictionaries. Each metadata dict must contain 'filename' (str), 'size' (int), and 'hash' (str) keys representing the document's filename, file size in bytes, and hash value respectively.

wuxi2_docs: Dictionary mapping document codes (strings) to lists of document metadata dictionaries. Each metadata dict must contain 'filename' (str), 'size' (int), 'hash' (str), and 'relative_path' (str) keys. Multiple documents can share the same code, hence the list structure.
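A metadata dictionary with the keys described above could be built roughly as follows. This is a minimal sketch: describe_file is a hypothetical helper, and SHA-256 is an assumption, since the source does not state which hash algorithm the module's own scanning functions use.

```python
import hashlib
from pathlib import Path
from typing import Dict


def describe_file(path: Path, root: Path) -> Dict:
    """Build one metadata dict for a file (hypothetical helper, SHA-256 assumed)."""
    digest = hashlib.sha256()
    with path.open('rb') as handle:
        # Read in chunks so large documents are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(65536), b''):
            digest.update(chunk)
    return {
        'filename': path.name,
        'size': path.stat().st_size,           # size in bytes
        'hash': digest.hexdigest(),
        'relative_path': str(path.relative_to(root)),  # only read for wuxi2_docs entries
    }
```

The same dictionary works for both inputs: compare_documents ignores 'relative_path' on output_docs entries and reads it only from the selected wuxi2 match.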

Return Value

Type: List[Dict]

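Returns a list of dictionaries, one per document in output_docs. Each result dictionary contains:

  • 'document_code' (str), 'output_filename' (str), 'output_size' (int), 'output_hash' (str): copied from the output_docs entry
  • 'status' (str): 'PRESENT' if the code exists in wuxi2_docs, otherwise 'ABSENT'
  • 'match_type' (str): one of 'IDENTICAL (hash match)', 'SIZE MATCH (possible identical)', 'HIGH SIMILARITY', 'MODERATE SIMILARITY', 'LOW SIMILARITY', 'CODE MATCH ONLY', or 'N/A' (for absent documents)
  • 'wuxi2_filename' (str), 'wuxi2_path' (str), 'wuxi2_size' (int), 'wuxi2_hash' (str): metadata of the selected wuxi2 file; these keep their defaults ('' or 0) when no file was selected
  • 'size_match' (bool), 'hash_match' (bool): whether the selected file matched on size or hash
  • 'filename_similarity' (float 0.0-1.0): best filename similarity score seen; when a hash match ends the scan early, only candidates examined before the matching file contribute to this score
  • 'notes' (str): additional information, such as the number of wuxi2 files that share the code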
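Since every result carries 'status' and 'match_type', the list is easy to tally. The sketch below uses a hypothetical summarize helper; it is not the module's print_summary function, only an illustration of how the result fields could be aggregated.

```python
from collections import Counter
from typing import Dict, List


def summarize(results: List[Dict]) -> Dict[str, Counter]:
    """Count results per status and per match_type (hypothetical helper)."""
    return {
        'status': Counter(r['status'] for r in results),
        'match_type': Counter(r['match_type'] for r in results),
    }


# Minimal sample in the shape returned by compare_documents (other keys omitted).
sample = [
    {'status': 'PRESENT', 'match_type': 'IDENTICAL (hash match)'},
    {'status': 'PRESENT', 'match_type': 'HIGH SIMILARITY'},
    {'status': 'ABSENT', 'match_type': 'N/A'},
]
summary = summarize(sample)
print(summary['status']['PRESENT'], summary['status']['ABSENT'])
```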

Required Imports

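from typing import Dict, List
# fuzzy_match_filename is not part of the standard library; it must be defined or imported separately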
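The function body also calls fuzzy_match_filename(output_filename, wuxi2_filename, code), whose implementation is not shown on this page. For experimenting with compare_documents in isolation, a stand-in with the same call shape could look like the sketch below. This is an assumption built on difflib from the standard library, not the module's actual implementation, and its scores will likely differ from the real function's.

```python
from difflib import SequenceMatcher
from pathlib import Path


def fuzzy_match_filename(name_a: str, name_b: str, code: str) -> float:
    """Stand-in similarity score in [0.0, 1.0] (NOT the module's real implementation)."""

    def normalize(name: str) -> str:
        # Drop the extension, lowercase, and remove the shared document code
        # so the score reflects the descriptive part of the filename only.
        stem = Path(name).stem.lower()
        return stem.replace(code.lower(), '').strip(' _-')

    a, b = normalize(name_a), normalize(name_b)
    if not a and not b:
        # Both names consist of the code alone: treat them as equal.
        return 1.0
    return SequenceMatcher(None, a, b).ratio()
```

Any replacement should return values on the same 0.0 to 1.0 scale, because compare_documents compares the score against the fixed thresholds 0.7 and 0.4.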

Usage Example

# Prepare input data
output_docs = {
    'DOC001': {
        'filename': 'report_2023.pdf',
        'size': 1024000,
        'hash': 'abc123def456'
    },
    'DOC002': {
        'filename': 'summary.docx',
        'size': 512000,
        'hash': 'xyz789ghi012'
    }
}

wuxi2_docs = {
    'DOC001': [
        {
            'filename': 'report_2023_final.pdf',
            'size': 1024000,
            'hash': 'abc123def456',
            'relative_path': 'documents/2023/report_2023_final.pdf'
        }
    ],
    'DOC003': [
        {
            'filename': 'other_doc.pdf',
            'size': 2048000,
            'hash': 'mno345pqr678',
            'relative_path': 'documents/other/other_doc.pdf'
        }
    ]
}

# Compare documents
results = compare_documents(output_docs, wuxi2_docs)

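# Process results (only codes present in output_docs are reported, so DOC003 never appears)
for result in results:
    print(f"Code: {result['document_code']}, Status: {result['status']}, Match: {result['match_type']}")
    if result['hash_match']:
        print(f"  Identical file found at: {result['wuxi2_path']}")

# Output of this loop:
# Code: DOC001, Status: PRESENT, Match: IDENTICAL (hash match)
#   Identical file found at: documents/2023/report_2023_final.pdf
# Code: DOC002, Status: ABSENT, Match: N/A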
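Because every result dictionary has the same keys, the list maps directly onto a CSV file for review in a spreadsheet. The following is a minimal sketch with a hypothetical write_results_csv helper; the source does not show how the module itself persists results.

```python
import csv
import io
from typing import Dict, List, TextIO


def write_results_csv(results: List[Dict], stream: TextIO) -> None:
    """Write comparison results as CSV, one row per document (hypothetical helper)."""
    if not results:
        return  # nothing to write, and no keys to derive a header from
    # All result dicts share the same keys, so the first one defines the columns.
    writer = csv.DictWriter(stream, fieldnames=list(results[0].keys()))
    writer.writeheader()
    writer.writerows(results)


# Demonstration with an in-memory buffer and a reduced set of columns.
buffer = io.StringIO()
write_results_csv(
    [{'document_code': 'DOC002', 'status': 'ABSENT', 'match_type': 'N/A'}],
    buffer,
)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open(path, 'w', newline='', encoding='utf-8'), as the csv module documentation recommends.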

Best Practices

  • Ensure both input dictionaries have consistent structure with required keys ('filename', 'size', 'hash' for output_docs; additional 'relative_path' for wuxi2_docs)
  • The fuzzy_match_filename function must be available in scope before calling this function
  • Hash values should be computed using a consistent algorithm (e.g., MD5, SHA256) across both document sources
  • File sizes should be in bytes for accurate comparison
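  • Consider the performance impact on large document sets: for each output document the function loops over every wuxi2 file that shares its code and calls fuzzy_match_filename on each one until a hash match is found, so runtime grows with the total number of same-code candidates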
  • The function prints progress to stdout; redirect or suppress output if running in a non-interactive environment
  • Review the 'notes' field in results for documents with multiple potential matches in wuxi2_docs
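The progress output mentioned above can be captured rather than printed by using contextlib.redirect_stdout from the standard library. The wrapper below is a hypothetical convenience, shown here with a small stand-in function so that the example runs on its own.

```python
import contextlib
import io
from typing import Any, Callable, Tuple


def run_quietly(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, str]:
    """Call func with stdout captured; return (result, captured_text)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        result = func(*args, **kwargs)
    return result, buffer.getvalue()


def noisy_double(value: int) -> int:
    # Stand-in for a function that prints progress, like compare_documents.
    print('progress')
    return value * 2


result, log = run_quietly(noisy_double, 21)
```

With the real function in scope, the call would be results, log = run_quietly(compare_documents, output_docs, wuxi2_docs), and log can then be written to a file or discarded.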

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function compare_documents_v1 77.5% similar

    Compares two sets of PDF documents by matching document codes, detecting signatures, calculating content similarity, and generating detailed comparison results with signature information.

    From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
  • function main_v57 67.3% similar

    Main execution function that orchestrates a document comparison workflow between two directories (mailsearch/output and wuxi2 repository), scanning for coded documents, comparing them, and generating results.

    From: /tf/active/vicechatdev/mailsearch/compare_documents.py
  • function print_summary 64.3% similar

    Prints a formatted summary report of document comparison results, including presence status, match quality statistics, and examples of absent and modified documents.

    From: /tf/active/vicechatdev/mailsearch/compare_documents.py
  • function main_v102 64.2% similar

    Main entry point function that orchestrates a document comparison workflow between two folders (mailsearch/output and wuxi2 repository), detecting signatures and generating comparison results.

    From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
  • function scan_wuxi2_folder 62.8% similar

    Recursively scans a wuxi2 folder for PDF documents, extracts document codes from filenames, and organizes them into a dictionary mapping codes to file information.

    From: /tf/active/vicechatdev/mailsearch/compare_documents.py