🔍 Code Extractor

function find_best_match

Maturity: 54

Finds the best matching document from a list of candidates by comparing hash, size, filename, and content similarity against fixed (hardcoded) confidence thresholds.

File: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
Lines: 225 - 277
Complexity: moderate

Purpose

This function implements a three-stage matching algorithm that identifies the best match for a given output document from a list of candidate documents. Stage one returns the first candidate whose hash equals the output document's hash (identical files). Stage two looks at candidates of the same size and accepts one only if its content similarity exceeds 0.95. Stage three picks the candidate with the highest fuzzy filename score, rejects it if that score is 0.6 or lower, and runs a content comparison only when the score exceeds 0.85. The function returns the matched document with a match type classification and a score, which supports document deduplication, version tracking, and file organization workflows.

Source Code

def find_best_match(output_doc: Dict, wuxi2_candidates: List[Dict]) -> Optional[Tuple[Dict, str, float]]:
    """
    Find best matching document from candidates
    Returns (match_doc, match_type, confidence_score)
    """
    if not wuxi2_candidates:
        return None
    
    output_code = output_doc['code']
    output_filename = output_doc['filename']
    output_hash = output_doc['hash']
    
    # First: Check for exact hash match
    for candidate in wuxi2_candidates:
        if candidate['hash'] == output_hash:
            return (candidate, 'IDENTICAL', 1.0)
    
    # Second: Check for same size (likely identical)
    for candidate in wuxi2_candidates:
        if candidate['size'] == output_doc['size']:
            # Verify with content comparison
            content_sim = compare_pdf_content(output_doc['filepath'], candidate['filepath'])
            if content_sim > 0.95:
                return (candidate, 'SIZE_MATCH', content_sim)
    
    # Third: Find best filename match
    best_match = None
    best_score = 0.0
    
    for candidate in wuxi2_candidates:
        # Filename similarity
        filename_sim = fuzzy_match_score(output_filename, candidate['filename'])
        
        # Boost score if in same folder as other matches
        folder_boost = 0.0
        
        # Combine scores
        total_score = filename_sim + folder_boost
        
        if total_score > best_score:
            best_score = total_score
            best_match = candidate
    
    if best_match and best_score > 0.6:
        # Do content comparison for high-confidence filename matches
        if best_score > 0.85:
            content_sim = compare_pdf_content(output_doc['filepath'], best_match['filepath'])
            if content_sim > 0.7:
                return (best_match, 'HIGH_SIMILARITY', best_score)
        
        return (best_match, 'FUZZY_MATCH', best_score)
    
    return None
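
The source calls fuzzy_match_score but this page does not show its implementation. To experiment with the third stage in isolation, a stand-in along the following lines may be enough. This is only a sketch: the name fuzzy_match_score_sketch is hypothetical, and the use of difflib.SequenceMatcher is an assumption, not the module's actual scoring method.

```python
from difflib import SequenceMatcher
from pathlib import PurePath


def fuzzy_match_score_sketch(name_a: str, name_b: str) -> float:
    """Hypothetical stand-in for fuzzy_match_score.

    Returns a similarity ratio in [0.0, 1.0] for two filenames.
    Assumption: the real helper may normalise or weight names differently.
    """
    # Compare case-insensitively and ignore the file extension.
    stem_a = PurePath(name_a).stem.lower()
    stem_b = PurePath(name_b).stem.lower()
    return SequenceMatcher(None, stem_a, stem_b).ratio()


if __name__ == "__main__":
    score = fuzzy_match_score_sketch("report_2023.pdf", "report_2023_final.pdf")
    print(f"{score:.2f}")
```

Under this stand-in, 'report_2023.pdf' and 'report_2023_final.pdf' score between 0.6 and 0.85, so the pair would be returned as a FUZZY_MATCH without any content comparison.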

Parameters

Name              Type        Default  Kind
output_doc        Dict        -        positional_or_keyword
wuxi2_candidates  List[Dict]  -        positional_or_keyword

Parameter Details

output_doc: A dictionary representing the reference document for which a match is sought. Must contain the keys 'code' (document identifier), 'filename' (name of the file), 'hash' (file hash for exact matching), 'size' (file size in bytes), and 'filepath' (full path to the document, used for content comparison). Note that 'code' is read into a local variable but plays no part in the matching logic shown.

wuxi2_candidates: A list of dictionaries representing the pool of potential matches. Each dictionary should have the same structure as output_doc; the function itself reads only 'hash', 'size', 'filename', and 'filepath' from each candidate. The list may be empty, in which case None is returned.
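
Because the function indexes these dictionaries directly, a missing key surfaces as a KeyError partway through matching. One way to fail earlier and with a clearer message is to check the inputs first. The sketch below is an illustration only: validate_inputs, missing_keys, and the two key tuples are hypothetical names, and the key lists are taken from what the source above reads.

```python
from typing import Dict, List, Sequence

# Keys the function reads from output_doc and from each candidate.
OUTPUT_KEYS = ("code", "filename", "hash", "size", "filepath")
CANDIDATE_KEYS = ("filename", "hash", "size", "filepath")


def missing_keys(doc: Dict, required: Sequence[str]) -> List[str]:
    """Return the required keys that are absent from doc."""
    return [key for key in required if key not in doc]


def validate_inputs(output_doc: Dict, candidates: List[Dict]) -> None:
    """Raise ValueError listing every missing key (hypothetical helper)."""
    problems = []
    missing = missing_keys(output_doc, OUTPUT_KEYS)
    if missing:
        problems.append(f"output_doc is missing {missing}")
    for index, candidate in enumerate(candidates):
        missing = missing_keys(candidate, CANDIDATE_KEYS)
        if missing:
            problems.append(f"candidate {index} is missing {missing}")
    if problems:
        raise ValueError("; ".join(problems))
```

Calling validate_inputs(output_doc, wuxi2_candidates) before find_best_match reports all problems at once instead of stopping at the first bad dictionary.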

Return Value

Type: Optional[Tuple[Dict, str, float]]

If a match is found, returns a tuple of (1) the matched candidate dictionary, (2) a match-type string, and (3) a float score. The match types are 'IDENTICAL' (exact hash match; the score is 1.0), 'SIZE_MATCH' (same size and content similarity above 0.95; the score is the content similarity), 'HIGH_SIMILARITY' (filename score above 0.85 and content similarity above 0.7; the score is the filename score), and 'FUZZY_MATCH' (filename score above 0.6, including best matches above 0.85 that fail content verification; the score is the filename score). The score lies between 0.0 and 1.0 provided the helper functions return values in that range. Returns None if no candidates are provided or if the best filename score does not exceed 0.6.
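
Since the meaning of the score depends on the match type, callers may want to branch on the type rather than compare raw scores across types. The following is a minimal sketch of such a dispatcher; describe_match and the MatchResult alias are hypothetical names introduced for illustration.

```python
from typing import Dict, Optional, Tuple

# Alias for the return type of find_best_match (name is an assumption).
MatchResult = Optional[Tuple[Dict, str, float]]


def describe_match(result: MatchResult) -> str:
    """Turn a find_best_match result into a one-line summary."""
    if result is None:
        return "no match"
    doc, match_type, score = result
    name = doc["filename"]
    if match_type == "IDENTICAL":
        return f"{name}: identical file (hash match)"
    if match_type == "SIZE_MATCH":
        # Here the score is a content similarity.
        return f"{name}: same size, content similarity {score:.2f}"
    if match_type == "HIGH_SIMILARITY":
        # Here the score is a filename similarity; content was verified.
        return f"{name}: filename score {score:.2f}, content verified"
    if match_type == "FUZZY_MATCH":
        # Filename similarity only; content may not have been checked.
        return f"{name}: filename score {score:.2f}, review manually"
    raise ValueError(f"unexpected match type: {match_type!r}")
```

Treating an unknown match type as an error makes it visible if the function later gains a new classification.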

Dependencies

  • PyPDF2

Required Imports

from typing import Dict, List, Optional, Tuple

Conditional/Optional Imports

These imports are only needed under specific conditions:

import PyPDF2

Condition: Required when compare_pdf_content function is called for content similarity verification (SIZE_MATCH and HIGH_SIMILARITY paths)

Usage Example

# Assuming helper functions compare_pdf_content and fuzzy_match_score are defined

output_document = {
    'code': 'DOC001',
    'filename': 'report_2023.pdf',
    'hash': 'abc123def456',
    'size': 1024000,
    'filepath': '/path/to/output/report_2023.pdf'
}

candidates = [
    {
        'code': 'DOC002',
        'filename': 'report_2023_final.pdf',
        'hash': 'xyz789ghi012',
        'size': 1024500,
        'filepath': '/path/to/candidates/report_2023_final.pdf'
    },
    {
        'code': 'DOC003',
        'filename': 'report_2023.pdf',
        'hash': 'abc123def456',
        'size': 1024000,
        'filepath': '/path/to/candidates/report_2023.pdf'
    }
]

result = find_best_match(output_document, candidates)

if result:
    matched_doc, match_type, confidence = result
    print(f"Found match: {matched_doc['filename']}")
    print(f"Match type: {match_type}")
    print(f"Confidence: {confidence:.2f}")
else:
    print("No match found")

Best Practices

  • Ensure output_doc and every dictionary in wuxi2_candidates contain the keys the function reads ('filename', 'hash', 'size', 'filepath', plus 'code' on output_doc) to avoid KeyError exceptions
  • The function relies on external helper functions 'compare_pdf_content' and 'fuzzy_match_score' which must be implemented and available in the same module
  • File paths in the document dictionaries must be valid and accessible for content comparison operations
  • The function uses hardcoded thresholds (0.95 for content similarity, 0.6 for minimum fuzzy match, 0.85 for high-confidence filename match, 0.7 for content verification). Consider making these configurable parameters for different use cases
  • The 'folder_boost' variable is always 0.0, so although it is added to the total score it never changes the result; the folder-based boost described in the code comment appears to be unimplemented
  • For large candidate lists, consider pre-filtering by hash or size before calling this function to improve performance
  • The function performs file I/O operations (content comparison) which can be slow; consider caching results or using async operations for batch processing
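
Two of the suggestions above, configurable thresholds and cached content comparisons, can be sketched as follows. This is one possible shape under stated assumptions, not the module's design: MatchThresholds, classify_filename_match, and make_cached_compare are hypothetical names, the default values are copied from the hardcoded thresholds in the source, and the classifier mirrors only the third (filename) stage.

```python
from dataclasses import dataclass
from functools import lru_cache
from typing import Callable, Optional


@dataclass(frozen=True)
class MatchThresholds:
    """Defaults copied from the hardcoded values in find_best_match."""
    size_content: float = 0.95    # content similarity for SIZE_MATCH
    fuzzy_min: float = 0.6        # minimum filename score to accept
    filename_high: float = 0.85   # filename score that triggers a content check
    content_verify: float = 0.7   # content similarity for HIGH_SIMILARITY


def classify_filename_match(
    filename_score: float,
    content_sim: Callable[[], float],
    thresholds: MatchThresholds = MatchThresholds(),
) -> Optional[str]:
    """Mirror stage three; content_sim is only called when it is needed."""
    if filename_score <= thresholds.fuzzy_min:
        return None
    if (filename_score > thresholds.filename_high
            and content_sim() > thresholds.content_verify):
        return "HIGH_SIMILARITY"
    return "FUZZY_MATCH"


def make_cached_compare(
    compare: Callable[[str, str], float],
) -> Callable[[str, str], float]:
    """Wrap a path-based comparison so repeated pairs are computed once."""
    @lru_cache(maxsize=None)
    def cached(path_a: str, path_b: str) -> float:
        return compare(path_a, path_b)
    return cached
```

Passing content_sim as a callable keeps the expensive comparison lazy, matching the source, which compares content only for filename scores above 0.85. Note that the cache is keyed on the path strings, so it would return stale values if a file changes on disk between calls.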

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function compare_documents_v1 62.6% similar

    Compares two sets of PDF documents by matching document codes, detecting signatures, calculating content similarity, and generating detailed comparison results with signature information.

    From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
  • function find_best_folder 60.9% similar

    Finds the best matching folder in a directory tree by comparing hierarchical document codes with folder names containing numeric codes.

    From: /tf/active/vicechatdev/mailsearch/copy_signed_documents.py
  • function compare_documents 60.1% similar

    Compares documents from an output folder with documents in a wuxi2 repository by matching document codes, file hashes, sizes, and filenames to identify identical, similar, or missing documents.

    From: /tf/active/vicechatdev/mailsearch/compare_documents.py
  • function find_similar_documents 58.6% similar

    Identifies pairs of similar documents by comparing their embeddings and returns those exceeding a specified similarity threshold, sorted by similarity score.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/utils/similarity_utils.py
  • function closest_match 57.9% similar

    Recursively finds the closest matching specification from a list of specs by comparing hierarchical keys (type, group, label, overlay) with a target match pattern.

    From: /tf/active/vicechatdev/patches/util.py