
function fuzzy_match_filename

Maturity: 59

Calculates a fuzzy match similarity score between two filenames by comparing them after normalization, using exact matching, substring containment, and word overlap techniques.

File: /tf/active/vicechatdev/mailsearch/compare_documents.py
Lines: 165 - 205
Complexity: simple

Purpose

This function compares two filenames and returns a similarity score from 0.0 to 1.0. It is useful for identifying duplicate or related files and for matching documents whose names differ only slightly. Before comparing, it removes every occurrence of the given document code and of the literal string '.pdf', then trims leading and trailing spaces, hyphens, and underscores. This suits document management systems where files follow systematic naming conventions.

Source Code

def fuzzy_match_filename(filename1: str, filename2: str, code: str) -> float:
    """
    Calculate fuzzy match score between two filenames
    
    Args:
        filename1: First filename
        filename2: Second filename
        code: Document code to remove from comparison
        
    Returns:
        Match score (0.0 to 1.0)
    """
    # Remove code prefix and extension
    name1 = filename1.replace(code, '').replace('.pdf', '').strip(' -_')
    name2 = filename2.replace(code, '').replace('.pdf', '').strip(' -_')
    
    # Convert to lowercase for comparison
    name1_lower = name1.lower()
    name2_lower = name2.lower()
    
    # Exact match
    if name1_lower == name2_lower:
        return 1.0
    
    # Check if one contains the other
    if name1_lower in name2_lower or name2_lower in name1_lower:
        shorter = min(len(name1_lower), len(name2_lower))
        longer = max(len(name1_lower), len(name2_lower))
        return shorter / longer if longer > 0 else 0.0
    
    # Calculate simple word overlap
    words1 = set(name1_lower.split())
    words2 = set(name2_lower.split())
    
    if not words1 or not words2:
        return 0.0
    
    intersection = len(words1 & words2)
    union = len(words1 | words2)
    
    return intersection / union if union > 0 else 0.0

Parameters

Name       Type  Default  Kind
filename1  str   -        positional_or_keyword
filename2  str   -        positional_or_keyword
code       str   -        positional_or_keyword

Parameter Details

filename1: The first filename to compare. Should be a string representing a filename, potentially including a document code and a file extension (e.g., '.pdf'). The function normalizes it by removing every occurrence of the code string and of the literal string '.pdf', then trimming leading and trailing spaces, hyphens, and underscores.

filename2: The second filename to compare. Should be a string representing a filename with the same format expectations as filename1. Will be normalized in the same way for comparison.

code: A document code string that is removed from both filenames before comparison. It is removed wherever it occurs, not only at the start of the name. This allows comparing the meaningful parts of filenames that follow a systematic naming convention (e.g., 'DOC-2024-' might be removed to compare the actual document names).
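
The normalization applied to both filenames can be reproduced in isolation. The sketch below uses a hypothetical helper, normalize, which is not part of compare_documents.py; it copies the replace/strip chain from the source and adds the lowercasing step. The document code '4.5.38.2' is borrowed from the extract_document_code description and the filenames are made up for illustration.

```python
def normalize(filename: str, code: str) -> str:
    """Hypothetical helper mirroring the normalization in fuzzy_match_filename."""
    # Same chain as the source: drop the code, drop '.pdf',
    # then trim ' ', '-', and '_' from both ends only.
    name = filename.replace(code, '').replace('.pdf', '').strip(' -_')
    return name.lower()

# Code at the start: removed, and the leftover ' - ' is trimmed.
print(normalize('4.5.38.2 - Annual_Report.pdf', '4.5.38.2'))  # annual_report

# Code in the middle: str.replace removes it there too, leaving two spaces.
print(normalize('Report 4.5.38.2 final.pdf', '4.5.38.2'))     # report  final

# Uppercase extension: the '.pdf' replacement is case-sensitive, so it survives.
print(normalize('4.5.38.2_Report.PDF', '4.5.38.2'))           # report.pdf
```

The last case shows why mixed-case extensions lower the score: the surviving '.pdf' text becomes part of the compared name.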

Return Value

Type: float

Returns a float between 0.0 and 1.0 representing the similarity score. 1.0 indicates an exact match after normalization and lowercasing, 0.0 indicates no similarity, and values in between indicate a partial match. The checks run in order: if one normalized name contains the other, the score is the ratio of the shorter length to the longer length; otherwise the score is the Jaccard similarity coefficient (intersection over union) of the two sets of whitespace-separated words.
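
The two partial-match formulas can be sketched as standalone helpers. The names containment_score and jaccard_score are hypothetical and do not exist in compare_documents.py; both assume their inputs are already normalized and lowercased.

```python
def containment_score(a: str, b: str) -> float:
    """Length ratio used when one normalized name contains the other."""
    shorter, longer = sorted((len(a), len(b)))
    return shorter / longer if longer > 0 else 0.0


def jaccard_score(a: str, b: str) -> float:
    """Jaccard coefficient over whitespace-separated words."""
    words_a, words_b = set(a.split()), set(b.split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


# 'report' (6 chars) is contained in 'annual_report' (13 chars): 6 / 13
print(round(containment_score('report', 'annual_report'), 2))  # 0.46

# Shared words {'financial', 'report'} out of 4 distinct words: 2 / 4
print(jaccard_score('financial summary report', 'annual financial report'))  # 0.5
```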

Usage Example

# Example 1: Exact match after normalization
score1 = fuzzy_match_filename(
    'DOC-2024-Annual_Report.pdf',
    'DOC-2024-annual_report.pdf',
    'DOC-2024-'
)
print(f'Exact match score: {score1}')  # Output: 1.0

# Example 2: Substring containment
score2 = fuzzy_match_filename(
    'DOC-2024-Report.pdf',
    'DOC-2024-Annual_Report.pdf',
    'DOC-2024-'
)
print(f'Substring match score: {score2}')  # Output: ~0.46 (6/13)

# Example 3: Word overlap
score3 = fuzzy_match_filename(
    'DOC-2024-Financial Summary Report.pdf',
    'DOC-2024-Annual Financial Report.pdf',
    'DOC-2024-'
)
print(f'Word overlap score: {score3}')  # Output: 0.5 (2 common words out of 4 total unique)

# Example 4: No match
score4 = fuzzy_match_filename(
    'DOC-2024-Invoice.pdf',
    'DOC-2024-Contract.pdf',
    'DOC-2024-'
)
print(f'No match score: {score4}')  # Output: 0.0

Best Practices

  • Ensure the 'code' parameter exactly matches the code as it appears in both filenames; it is removed with str.replace, so every occurrence is deleted, not only a leading prefix
  • Name comparison is case-insensitive, so 'Report.pdf' and 'report.pdf' are treated as identical; the '.pdf' removal itself is case-sensitive, so 'Report.PDF' keeps its extension and scores lower
  • The function only removes '.pdf' extensions; modify the code if you need to handle other file types
  • Word overlap splits on whitespace only, so 'Annual_Report' and 'Annual-Report' each count as a single word; underscores and hyphens are not treated as word separators
  • If exactly one filename becomes empty after normalization, the function returns 0.0; if both become empty, they compare equal and the function returns 1.0, so ensure your code parameter doesn't remove the entire filename
  • The strategies are tried in order and the first one that applies determines the score: exact match returns 1.0, substring containment returns the length ratio, and word overlap returns the Jaccard coefficient; a pair that satisfies containment is never scored by word overlap
  • The function strips spaces, hyphens, and underscores only from the start and end of each name after removing the code and extension; separators inside the name are left unchanged
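
Because the function removes only the literal '.pdf' and str.split() splits on whitespace only, one possible adaptation is sketched below. Both helpers, normalize_any_extension and tokenize, are hypothetical and not part of compare_documents.py. The sketch assumes every filename carries an extension; os.path.splitext would otherwise treat the tail of a dotted code such as '.2' as the extension.

```python
import os
import re


def normalize_any_extension(filename: str, code: str) -> str:
    """Hypothetical variant: strip any extension and a leading code only."""
    stem, _ext = os.path.splitext(filename)  # handles '.PDF', '.docx', ...
    if stem.startswith(code):
        stem = stem[len(code):]              # remove the code only as a prefix
    return stem.strip(' -_').lower()


def tokenize(name: str) -> set:
    """Split on whitespace, underscores, and hyphens; drop empty tokens."""
    return {token for token in re.split(r'[\s_-]+', name) if token}


a = tokenize(normalize_any_extension('4.5.38.2_Financial_Summary-Report.PDF', '4.5.38.2'))
b = tokenize(normalize_any_extension('4.5.38.2 Annual Financial Report.docx', '4.5.38.2'))

print(sorted(a & b))            # ['financial', 'report']
print(len(a & b) / len(a | b))  # 0.5
```

With this tokenization, names that differ only in their separator style produce the same word sets, at the cost of changing the scores the original function would return.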

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function fuzzy_match_score 64.2% similar

    Calculates a fuzzy string similarity score between two input strings using the SequenceMatcher algorithm, returning a ratio between 0.0 and 1.0.

    From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
  • function find_best_match 57.4% similar

    Finds the best matching document from a list of candidates by comparing hash, size, filename, and content similarity with configurable confidence thresholds.

    From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
  • function compare_pdf_content 55.1% similar

    Compares the textual content similarity between two PDF files by extracting text samples and computing a similarity ratio using sequence matching.

    From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
  • function compare_documents_v1 53.7% similar

    Compares two sets of PDF documents by matching document codes, detecting signatures, calculating content similarity, and generating detailed comparison results with signature information.

    From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
  • function extract_document_code 53.1% similar

    Extracts a structured document code (e.g., '4.5.38.2') from a filename using regex pattern matching.

    From: /tf/active/vicechatdev/mailsearch/compare_documents.py