function find_best_match
Finds the best matching document from a list of candidates by comparing hash, size, filename, and content similarity against fixed confidence thresholds.
/tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
225 - 277
moderate
Purpose
This function implements a multi-stage matching algorithm that identifies the best match for a given output document among a list of candidates. The stages run in order, and the first one that succeeds returns immediately: (1) an exact hash match (identical files); (2) an equal file size confirmed by a content similarity above 0.95; (3) the highest fuzzy filename score, accepted only above 0.6 and labeled 'HIGH_SIMILARITY' when the score exceeds 0.85 and the content similarity exceeds 0.7, otherwise 'FUZZY_MATCH'. The function returns the matched document together with a match type and a confidence score, which supports document deduplication, version tracking, and file organization workflows.
Source Code
def find_best_match(output_doc: Dict, wuxi2_candidates: List[Dict]) -> Optional[Tuple[Dict, str, float]]:
    """
    Find best matching document from candidates
    Returns (match_doc, match_type, confidence_score)
    """
    if not wuxi2_candidates:
        return None

    output_code = output_doc['code']
    output_filename = output_doc['filename']
    output_hash = output_doc['hash']

    # First: Check for exact hash match
    for candidate in wuxi2_candidates:
        if candidate['hash'] == output_hash:
            return (candidate, 'IDENTICAL', 1.0)

    # Second: Check for same size (likely identical)
    for candidate in wuxi2_candidates:
        if candidate['size'] == output_doc['size']:
            # Verify with content comparison
            content_sim = compare_pdf_content(output_doc['filepath'], candidate['filepath'])
            if content_sim > 0.95:
                return (candidate, 'SIZE_MATCH', content_sim)

    # Third: Find best filename match
    best_match = None
    best_score = 0.0
    for candidate in wuxi2_candidates:
        # Filename similarity
        filename_sim = fuzzy_match_score(output_filename, candidate['filename'])
        # Boost score if in same folder as other matches
        folder_boost = 0.0
        # Combine scores
        total_score = filename_sim + folder_boost
        if total_score > best_score:
            best_score = total_score
            best_match = candidate

    if best_match and best_score > 0.6:
        # Do content comparison for high-confidence filename matches
        if best_score > 0.85:
            content_sim = compare_pdf_content(output_doc['filepath'], best_match['filepath'])
            if content_sim > 0.7:
                return (best_match, 'HIGH_SIMILARITY', best_score)
        return (best_match, 'FUZZY_MATCH', best_score)

    return None
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| output_doc | Dict | - | positional_or_keyword |
| wuxi2_candidates | List[Dict] | - | positional_or_keyword |
Parameter Details
output_doc: A dictionary representing the document to match. Must contain keys: 'code' (document identifier), 'filename' (name of the file), 'hash' (file hash for exact matching), 'size' (file size in bytes), and 'filepath' (full path to the document for content comparison). This is the reference document for which a match is being sought.
wuxi2_candidates: A list of dictionaries representing potential matching documents. Each dictionary is expected to share the structure of output_doc ('code', 'filename', 'hash', 'size', and 'filepath'); the function itself reads only 'hash', 'size', 'filename', and 'filepath' from each candidate. This is the pool of documents searched for matches. It can be an empty list, in which case None is returned.
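The dictionary shape described above could be built from a file on disk roughly as follows. This is a minimal sketch under stated assumptions: the helper names `build_doc_record` and `missing_keys` are hypothetical, and SHA-256 is an assumed algorithm, since this page does not say how 'hash' is computed.

```python
import hashlib
from pathlib import Path
from typing import Dict, List

# Keys named in the parameter description above.
REQUIRED_KEYS = ("code", "filename", "hash", "size", "filepath")


def build_doc_record(code: str, filepath: str) -> Dict:
    """Hypothetical helper: build a document dict in the documented shape."""
    path = Path(filepath)
    digest = hashlib.sha256()  # assumption: the real pipeline may use another algorithm
    with path.open("rb") as handle:
        # Read in chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return {
        "code": code,
        "filename": path.name,
        "hash": digest.hexdigest(),
        "size": path.stat().st_size,
        "filepath": str(path),
    }


def missing_keys(doc: Dict) -> List[str]:
    """Return the required keys absent from doc (empty list if complete)."""
    return [key for key in REQUIRED_KEYS if key not in doc]
```

Checking `missing_keys` on every record before matching turns a late KeyError into an explicit, reportable condition.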
Return Value
Type: Optional[Tuple[Dict, str, float]]
If a match is found, returns a tuple of: (1) the matched candidate dictionary; (2) a match type string: 'IDENTICAL' (exact hash match), 'SIZE_MATCH' (equal size with content similarity above 0.95), 'HIGH_SIMILARITY' (filename score above 0.85 with content similarity above 0.7), or 'FUZZY_MATCH' (filename score above 0.6); and (3) a float confidence score: 1.0 for 'IDENTICAL', the content similarity for 'SIZE_MATCH', and the filename similarity score for 'HIGH_SIMILARITY' and 'FUZZY_MATCH'. The score lies between 0.0 and 1.0 provided the helper functions return values in that range. Returns None if no candidates are provided or if the best filename score does not exceed 0.6.
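Because the confidence score means different things for different match types, callers may want to branch on the type rather than on the score alone. The sketch below illustrates one way to do that; `describe_match` and `VERIFIED_TYPES` are hypothetical names introduced for illustration, not part of the module.

```python
from typing import Dict, Optional, Tuple

# Match types backed by a hash or content check, per the description above.
VERIFIED_TYPES = {"IDENTICAL", "SIZE_MATCH", "HIGH_SIMILARITY"}


def describe_match(result: Optional[Tuple[Dict, str, float]]) -> str:
    """Hypothetical helper: summarize a find_best_match result in one line."""
    if result is None:
        return "no match"
    matched_doc, match_type, confidence = result
    # 'FUZZY_MATCH' rests on filename similarity only.
    status = "verified" if match_type in VERIFIED_TYPES else "filename only"
    return f"{matched_doc['filename']}: {match_type} ({status}, score {confidence:.2f})"
```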
Dependencies
PyPDF2
Required Imports
from typing import Dict
from typing import List
from typing import Tuple
from typing import Optional
Conditional/Optional Imports
These imports are only needed under specific conditions:
import PyPDF2
Condition: Required when compare_pdf_content function is called for content similarity verification (SIZE_MATCH and HIGH_SIMILARITY paths)
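One way to treat PyPDF2 as a conditional dependency is to defer the import until content comparison is actually needed. The following is a sketch of that pattern only; `HAVE_PYPDF2` and `require_pypdf2` are hypothetical names, and the module itself may import PyPDF2 differently.

```python
import importlib.util

# True when the PyPDF2 package can be located, without importing it yet.
HAVE_PYPDF2 = importlib.util.find_spec("PyPDF2") is not None


def require_pypdf2():
    """Import PyPDF2 on demand, failing with a clear message if it is absent."""
    if not HAVE_PYPDF2:
        raise ImportError("PyPDF2 is required for PDF content comparison")
    import PyPDF2  # deferred until a content comparison is requested
    return PyPDF2
```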
Usage Example
# Assuming helper functions compare_pdf_content and fuzzy_match_score are defined
output_document = {
    'code': 'DOC001',
    'filename': 'report_2023.pdf',
    'hash': 'abc123def456',
    'size': 1024000,
    'filepath': '/path/to/output/report_2023.pdf'
}

candidates = [
    {
        'code': 'DOC002',
        'filename': 'report_2023_final.pdf',
        'hash': 'xyz789ghi012',
        'size': 1024500,
        'filepath': '/path/to/candidates/report_2023_final.pdf'
    },
    {
        'code': 'DOC003',
        'filename': 'report_2023.pdf',
        'hash': 'abc123def456',
        'size': 1024000,
        'filepath': '/path/to/candidates/report_2023.pdf'
    }
]

result = find_best_match(output_document, candidates)
if result:
    matched_doc, match_type, confidence = result
    print(f"Found match: {matched_doc['filename']}")
    print(f"Match type: {match_type}")
    print(f"Confidence: {confidence:.2f}")
else:
    print("No match found")
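The example above assumes that fuzzy_match_score exists; its implementation is not shown on this page. For experimentation, a rough stand-in could be built on difflib from the standard library. This is an assumption-laden sketch, not the module's actual helper, and the real function may tokenize or normalize filenames differently.

```python
from difflib import SequenceMatcher
from pathlib import PurePath


def fuzzy_match_score(name_a: str, name_b: str) -> float:
    """Stand-in (assumption): similarity of two filenames in [0.0, 1.0]."""
    # Compare lowercased stems so case and a shared extension do not skew the ratio.
    stem_a = PurePath(name_a).stem.lower()
    stem_b = PurePath(name_b).stem.lower()
    return SequenceMatcher(None, stem_a, stem_b).ratio()
```

With this stand-in, 'report_2023.pdf' against 'report_2023_final.pdf' scores about 0.79, which falls between the 0.6 and 0.85 thresholds and would therefore be reported as 'FUZZY_MATCH' if the hash and size stages had not already matched.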
Best Practices
- Ensure all documents in output_doc and wuxi2_candidates have all required keys ('code', 'filename', 'hash', 'size', 'filepath') to avoid KeyError exceptions
- The function relies on external helper functions 'compare_pdf_content' and 'fuzzy_match_score' which must be implemented and available in the same module
- File paths in the document dictionaries must be valid and accessible for content comparison operations
- The function uses hardcoded thresholds (0.95 for content similarity, 0.6 for minimum fuzzy match, 0.85 for high-confidence filename match, 0.7 for content verification). Consider making these configurable parameters for different use cases
- The 'folder_boost' variable is always set to 0.0, so adding it to the filename score has no effect; the folder-based boost described in the comment appears to be unimplemented
- For large candidate lists, consider pre-filtering by hash or size before calling this function to improve performance
- The function performs file I/O operations (content comparison) which can be slow; consider caching results or using async operations for batch processing
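The suggestion to make the thresholds configurable could look roughly like the sketch below. `MatchThresholds` and `classify_filename_stage` are hypothetical names; the defaults mirror the hardcoded values in the source, and the helper covers only the labeling logic of the third stage, given scores that were computed elsewhere.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MatchThresholds:
    """Defaults mirror the hardcoded values in find_best_match."""
    size_content: float = 0.95   # content similarity needed for 'SIZE_MATCH'
    fuzzy_min: float = 0.6       # minimum filename score for any fuzzy match
    high_filename: float = 0.85  # filename score that triggers a content check
    high_content: float = 0.7    # content similarity needed for 'HIGH_SIMILARITY'


def classify_filename_stage(
    best_score: float,
    content_sim: Optional[float],
    thresholds: MatchThresholds = MatchThresholds(),
) -> Optional[str]:
    """Hypothetical helper: label the third-stage outcome from precomputed scores.

    content_sim may be None when no content comparison was run.
    """
    # Comparisons are strict, matching the '>' checks in the source.
    if best_score <= thresholds.fuzzy_min:
        return None
    if (
        best_score > thresholds.high_filename
        and content_sim is not None
        and content_sim > thresholds.high_content
    ):
        return "HIGH_SIMILARITY"
    return "FUZZY_MATCH"
```

A frozen dataclass keeps the defaults immutable, so a single instance can safely serve as a default argument while callers pass their own instance for stricter or looser matching.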
Similar Components
AI-powered semantic similarity - components with related functionality:
- function compare_documents_v1 62.6% similar
- function find_best_folder 60.9% similar
- function compare_documents 60.1% similar
- function find_similar_documents 58.6% similar
- function closest_match 57.9% similar