function fuzzy_match_filename
Calculates a fuzzy match similarity score between two filenames by comparing them after normalization, using exact matching, substring containment, and word overlap techniques.
/tf/active/vicechatdev/mailsearch/compare_documents.py
165 - 205
simple
Purpose
This function compares two filenames and scores their similarity on a scale from 0.0 to 1.0. It is useful for identifying duplicate or related files, matching documents with similar names, and finding files that may represent the same content under slightly different names. Before comparing, it removes the specified document code and the '.pdf' extension from both names, which suits document management systems with systematic naming conventions.
Source Code
def fuzzy_match_filename(filename1: str, filename2: str, code: str) -> float:
    """
    Calculate fuzzy match score between two filenames

    Args:
        filename1: First filename
        filename2: Second filename
        code: Document code to remove from comparison

    Returns:
        Match score (0.0 to 1.0)
    """
    # Remove code prefix and extension
    name1 = filename1.replace(code, '').replace('.pdf', '').strip(' -_')
    name2 = filename2.replace(code, '').replace('.pdf', '').strip(' -_')

    # Convert to lowercase for comparison
    name1_lower = name1.lower()
    name2_lower = name2.lower()

    # Exact match
    if name1_lower == name2_lower:
        return 1.0

    # Check if one contains the other
    if name1_lower in name2_lower or name2_lower in name1_lower:
        shorter = min(len(name1_lower), len(name2_lower))
        longer = max(len(name1_lower), len(name2_lower))
        return shorter / longer if longer > 0 else 0.0

    # Calculate simple word overlap
    words1 = set(name1_lower.split())
    words2 = set(name2_lower.split())

    if not words1 or not words2:
        return 0.0

    intersection = len(words1 & words2)
    union = len(words1 | words2)

    return intersection / union if union > 0 else 0.0
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| filename1 | str | - | positional_or_keyword |
| filename2 | str | - | positional_or_keyword |
| code | str | - | positional_or_keyword |
Parameter Details
filename1: The first filename to compare. Should be a string representing a filename, potentially including a document code and a '.pdf' extension. Before comparison, the function removes the value of code and the '.pdf' extension from it.
filename2: The second filename to compare. Should be a string representing a filename with the same format expectations as filename1. Will be normalized in the same way for comparison.
code: A document code or prefix string that is removed from both filenames before comparison. This allows for comparing the meaningful parts of filenames that follow a systematic naming convention (e.g., 'DOC-2024-' might be removed to compare the actual document names). The removal uses a case-sensitive str.replace, so every occurrence of the code is removed, not only a leading one.
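The normalization applied to both filenames can be looked at in isolation. The sketch below copies the expression from the source into a hypothetical helper named `normalize` (it is not part of the original module) to show what code removal, extension removal, and stripping do to a few sample names.

```python
def normalize(filename: str, code: str) -> str:
    """Hypothetical helper: the normalization the function applies inline."""
    # str.replace is case-sensitive and removes every occurrence it finds
    name = filename.replace(code, '').replace('.pdf', '').strip(' -_')
    return name.lower()


print(normalize('DOC-2024-Annual_Report.pdf', 'DOC-2024-'))  # annual_report
print(normalize('DOC-2024 - Summary_.pdf', 'DOC-2024'))      # summary
print(normalize('DOC-2024-Report.PDF', 'DOC-2024-'))         # report.pdf
print(normalize('doc-2024-Report.pdf', 'DOC-2024-'))         # doc-2024-report
```

The last two calls show the case sensitivity: an uppercase '.PDF' extension and a lowercase code both survive normalization and will lower the resulting score.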
Return Value
Type: float
Returns a float value between 0.0 and 1.0 representing the similarity score. 1.0 indicates an exact match (after normalization), values between 0.0 and 1.0 indicate partial matches (either through substring containment or word overlap), and 0.0 indicates no similarity. For substring matches, the score is calculated as the ratio of shorter to longer string length. For word overlap, it uses the Jaccard similarity coefficient (intersection over union of word sets).
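The two partial-match formulas described above can be worked through on already-normalized names. This is a minimal sketch; `containment_score` and `jaccard_score` are hypothetical names introduced here for illustration and do not exist in the original module.

```python
def containment_score(a: str, b: str) -> float:
    """Length ratio used when one normalized name contains the other."""
    shorter, longer = sorted((len(a), len(b)))
    return shorter / longer if longer > 0 else 0.0


def jaccard_score(a: str, b: str) -> float:
    """Jaccard coefficient over whitespace-separated words."""
    words_a, words_b = set(a.split()), set(b.split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


# 'report' (6 characters) is contained in 'annual_report' (13 characters)
print(round(containment_score('report', 'annual_report'), 2))  # 0.46

# shared words: financial, report; all unique words: 4
print(jaccard_score('financial summary report', 'annual financial report'))  # 0.5
```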
Usage Example
# Example 1: Exact match after normalization
score1 = fuzzy_match_filename(
    'DOC-2024-Annual_Report.pdf',
    'DOC-2024-annual_report.pdf',
    'DOC-2024-'
)
print(f'Exact match score: {score1}')  # Output: 1.0

# Example 2: Substring containment
score2 = fuzzy_match_filename(
    'DOC-2024-Report.pdf',
    'DOC-2024-Annual_Report.pdf',
    'DOC-2024-'
)
print(f'Substring match score: {score2}')  # Output: ~0.46 (6/13)

# Example 3: Word overlap
score3 = fuzzy_match_filename(
    'DOC-2024-Financial Summary Report.pdf',
    'DOC-2024-Annual Financial Report.pdf',
    'DOC-2024-'
)
print(f'Word overlap score: {score3}')  # Output: 0.5 (2 common words out of 4 total unique)

# Example 4: No match
score4 = fuzzy_match_filename(
    'DOC-2024-Invoice.pdf',
    'DOC-2024-Contract.pdf',
    'DOC-2024-'
)
print(f'No match score: {score4}')  # Output: 0.0
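For the duplicate-detection use described under Purpose, one possible pattern is to score a target filename against a list of candidates and keep the best one. The sketch below is an illustration only: `best_candidate` and its `threshold` default of 0.4 are assumptions made here, not part of the original module (and not the `find_best_match` function listed under Similar Components). A condensed copy of the documented function is included so the block runs on its own.

```python
from typing import List, Optional, Tuple


def fuzzy_match_filename(filename1: str, filename2: str, code: str) -> float:
    # Condensed copy of the documented function, same branch order
    name1 = filename1.replace(code, '').replace('.pdf', '').strip(' -_').lower()
    name2 = filename2.replace(code, '').replace('.pdf', '').strip(' -_').lower()
    if name1 == name2:
        return 1.0
    if name1 in name2 or name2 in name1:
        shorter, longer = sorted((len(name1), len(name2)))
        return shorter / longer if longer > 0 else 0.0
    words1, words2 = set(name1.split()), set(name2.split())
    if not words1 or not words2:
        return 0.0
    return len(words1 & words2) / len(words1 | words2)


def best_candidate(target: str, candidates: List[str], code: str,
                   threshold: float = 0.4) -> Optional[Tuple[str, float]]:
    """Hypothetical helper: best-scoring candidate at or above `threshold`."""
    if not candidates:
        return None
    scored = [(name, fuzzy_match_filename(target, name, code)) for name in candidates]
    best = max(scored, key=lambda pair: pair[1])
    return best if best[1] >= threshold else None


candidates = [
    'DOC-2024-Invoice.pdf',
    'DOC-2024-Annual Financial Report.pdf',
    'DOC-2024-Financial Summary Report Final.pdf',
]
match = best_candidate('DOC-2024-Financial Summary Report.pdf', candidates, 'DOC-2024-')
print(match)
```

Here the third candidate wins through the substring branch (24 of 30 characters), ahead of the second candidate's word overlap score of 0.5. The threshold is arbitrary and would need tuning against real filenames.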
Best Practices
- Ensure the 'code' parameter accurately represents the prefix to be removed from both filenames for meaningful comparison
- The function is case-insensitive, so 'Report.pdf' and 'report.pdf' will be treated as identical
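- The function currently only removes the lowercase '.pdf' extension, and the removal is case-sensitive, so 'Report.PDF' keeps its extension; modify the code if you need to handle other file types or extension casings
- Word overlap splits names on whitespace only, so underscores and hyphens do not separate words: 'Annual_Report' is a single token. For meaningful word overlap scores, use space-separated filenames or convert other separators to spaces before calling the function
- If both filenames become empty after normalization, the function returns 1.0 (they compare as an exact match); if only one becomes empty, it returns 0.0 through the substring check. Ensure your code parameter doesn't remove the entire filename
- The matching strategies are tried in order and the first one that applies decides the score: exact match returns 1.0, substring containment returns the length ratio, and word overlap returns the Jaccard coefficient. A pair that passes the containment check never reaches word overlap, so scores from different branches measure different things
- After removing the code and extension, the function strips spaces, hyphens, and underscores from the start and end of the name only; separators inside the name are left unchanged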
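If filenames mix separators or extensions, one option is to tokenize the names differently before computing word overlap. The following is a sketch of such a variant, not the original function's behavior: `tokenize_name` and `token_overlap` are hypothetical names, and the choice to drop any extension with `os.path.splitext` and to split on spaces, hyphens, and underscores is an assumption made for illustration.

```python
import os
import re


def tokenize_name(filename: str, code: str) -> set:
    """Hypothetical variant: drop any extension, split on spaces, '-' and '_'."""
    # splitext treats everything after the last dot as the extension
    stem, _ext = os.path.splitext(filename.replace(code, ''))
    return {token for token in re.split(r'[\s_-]+', stem.lower()) if token}


def token_overlap(filename1: str, filename2: str, code: str) -> float:
    """Jaccard coefficient over separator-insensitive tokens."""
    tokens1 = tokenize_name(filename1, code)
    tokens2 = tokenize_name(filename2, code)
    if not tokens1 or not tokens2:
        return 0.0
    return len(tokens1 & tokens2) / len(tokens1 | tokens2)


# Different separators, casing, and extensions still yield the same tokens
print(token_overlap('DOC-2024-Annual_Report.docx',
                    'DOC-2024-annual report.PDF',
                    'DOC-2024-'))  # 1.0

# One shared token ('summary') out of three unique tokens
print(token_overlap('DOC-2024-Financial-Summary.pdf',
                    'DOC-2024-Summary_Notes.pdf',
                    'DOC-2024-'))
```

Because `os.path.splitext` cuts at the last dot, a name without an extension that contains a dot (for example a version number) would lose its tail, so this variant only fits filenames that reliably carry an extension.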
Similar Components
AI-powered semantic similarity - components with related functionality:
- function fuzzy_match_score 64.2% similar
- function find_best_match 57.4% similar
- function compare_pdf_content 55.1% similar
- function compare_documents_v1 53.7% similar
- function extract_document_code 53.1% similar