function fuzzy_match_score
Calculates a fuzzy string similarity score between two input strings using the SequenceMatcher algorithm, returning a ratio between 0.0 and 1.0.
/tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
130 - 132
simple
Purpose
This function provides case-insensitive fuzzy matching to measure how similar two strings are. It is useful where exact matches aren't required, such as finding duplicate records with slight variations, matching user input against a database, spell-checking, or identifying similar names and addresses. The function uses Python's difflib.SequenceMatcher, whose matching algorithm is closely related to Ratcliff/Obershelp "gestalt pattern matching", to compute the similarity ratio.
Source Code
def fuzzy_match_score(str1: str, str2: str) -> float:
    """Calculate similarity score between two strings"""
    return SequenceMatcher(None, str1.lower(), str2.lower()).ratio()
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| str1 | str | - | positional_or_keyword |
| str2 | str | - | positional_or_keyword |
Parameter Details
str1: The first string to compare. Can be any string value including empty strings. The function converts this to lowercase internally for case-insensitive comparison.
str2: The second string to compare against the first. Can be any string value including empty strings. The function converts this to lowercase internally for case-insensitive comparison.
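The parameter behavior described above can be checked with a few calls. This is a minimal sketch that repeats the function so it runs on its own. The sample strings are illustrative only. The last two calls show a property of SequenceMatcher.ratio() worth knowing when choosing which string to pass first: the score can depend on argument order.

```python
from difflib import SequenceMatcher


def fuzzy_match_score(str1: str, str2: str) -> float:
    """Calculate similarity score between two strings"""
    return SequenceMatcher(None, str1.lower(), str2.lower()).ratio()


# Case is ignored because both inputs are lowercased before comparison.
case_score = fuzzy_match_score("REPORT.pdf", "report.PDF")  # 1.0

# Empty strings are accepted as input.
both_empty = fuzzy_match_score("", "")    # 1.0
one_empty = fuzzy_match_score("", "abc")  # 0.0

# Swapping the arguments can change the score.
forward = fuzzy_match_score("tide", "diet")   # 0.25
backward = fuzzy_match_score("diet", "tide")  # 0.5
```

If an order-independent score is needed, one option (an assumption, not part of the original function) is to take the maximum of both orderings.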
Return Value
Type: float
Returns a float between 0.0 and 1.0 representing the similarity ratio. A value of 1.0 means the strings are identical after lowercasing, 0.0 means no characters were matched between them, and values in between indicate partial similarity. The ratio is calculated as 2.0 * M / T, where M is the number of matched characters and T is the total number of characters in both strings.
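The formula can be reproduced by hand from the matching blocks that SequenceMatcher reports. The sketch below is illustrative: the helper name matched_and_total is hypothetical and not part of the original module.

```python
from difflib import SequenceMatcher


def matched_and_total(str1: str, str2: str) -> tuple[int, int]:
    """Return (M, T): matched characters and combined length (hypothetical helper)."""
    a, b = str1.lower(), str2.lower()
    matcher = SequenceMatcher(None, a, b)
    # get_matching_blocks() ends with a zero-size sentinel, which adds nothing.
    m = sum(block.size for block in matcher.get_matching_blocks())
    t = len(a) + len(b)
    return m, t


m, t = matched_and_total("John Smith", "Jon Smyth")
# Matched blocks are "jo", "n sm" and "th": M = 8, T = 10 + 9 = 19.
manual_ratio = 2.0 * m / t  # 16 / 19, roughly 0.84
library_ratio = SequenceMatcher(None, "john smith", "jon smyth").ratio()
```

Here manual_ratio and library_ratio agree, which confirms that the returned score is twice the matched character count divided by the combined length.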
Dependencies
difflib
Required Imports
from difflib import SequenceMatcher
Usage Example
from difflib import SequenceMatcher
def fuzzy_match_score(str1: str, str2: str) -> float:
    """Calculate similarity score between two strings"""
    return SequenceMatcher(None, str1.lower(), str2.lower()).ratio()
# Example usage
score1 = fuzzy_match_score("hello world", "hello world")
print(f"Exact match: {score1}") # Output: 1.0
score2 = fuzzy_match_score("hello world", "Hello World!")
print(f"Case difference with punctuation: {score2}") # Output: ~0.96
score3 = fuzzy_match_score("John Smith", "Jon Smyth")
print(f"Similar names: {score3}") # Output: ~0.84
score4 = fuzzy_match_score("apple", "orange")
print(f"Different words: {score4}") # Output: ~0.36
# Practical use case: finding best match
user_input = "Microsft"
companies = ["Microsoft", "Apple", "Google", "Amazon"]
best_match = max(companies, key=lambda x: fuzzy_match_score(user_input, x))
print(f"Best match for '{user_input}': {best_match}") # Output: Microsoft
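One caveat with the max() pattern is that it always returns some candidate, even when nothing is a plausible match. The following sketch adds a minimum score. The helper name best_match_above and the 0.8 default are assumptions chosen for illustration, not part of the original module.

```python
from difflib import SequenceMatcher
from typing import Optional


def fuzzy_match_score(str1: str, str2: str) -> float:
    """Calculate similarity score between two strings"""
    return SequenceMatcher(None, str1.lower(), str2.lower()).ratio()


def best_match_above(query: str, candidates: list[str],
                     threshold: float = 0.8) -> Optional[tuple[str, float]]:
    """Return (candidate, score) for the best match, or None if it scores too low."""
    if not candidates:
        return None
    # Score every candidate once, then keep the highest-scoring pair.
    scored = [(fuzzy_match_score(query, c), c) for c in candidates]
    score, best = max(scored)
    return (best, score) if score >= threshold else None


companies = ["Microsoft", "Apple", "Google", "Amazon"]
typo_result = best_match_above("Microsft", companies)    # ("Microsoft", ~0.94)
unrelated_result = best_match_above("Netflix", companies)  # None
```

Returning None for weak matches lets the caller distinguish "no good candidate" from a genuine hit instead of silently accepting the least bad option.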
Best Practices
- The function performs case-insensitive comparison by converting both strings to lowercase. If case-sensitive comparison is needed, modify the function accordingly.
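- For large-scale string matching, keep in mind that SequenceMatcher.ratio() takes quadratic time in the worst case. Consider caching results or switching to a dedicated, optimized string-distance implementation (for example, Levenshtein distance) where it fits the use case.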
- The SequenceMatcher algorithm works well for general text similarity but may not be optimal for all scenarios. Consider alternatives like Jaro-Winkler or Levenshtein distance for specific domain requirements.
- Empty strings will return a score of 1.0 when compared to each other, and 0.0 when compared to non-empty strings.
- The function does not handle None values. Ensure both parameters are valid strings or add None checks if needed.
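- For performance-critical applications with many comparisons, note that SequenceMatcher's quick_ratio() and real_quick_ratio() methods return upper bounds on ratio() that are cheaper to compute. They are not approximations of the score, but they can be used to discard pairs that cannot reach a threshold before calling the more expensive ratio().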
- When using this for threshold-based matching (e.g., score > 0.8 means match), test with representative data to determine appropriate threshold values for your use case.
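Two of the points above, None handling and cheap pre-filtering for threshold checks, can be sketched as follows. The names safe_fuzzy_match_score and is_match, the choice to score None as 0.0, and the 0.8 threshold are all assumptions made for illustration. The original function does none of this.

```python
from difflib import SequenceMatcher
from typing import Optional


def safe_fuzzy_match_score(str1: Optional[str], str2: Optional[str]) -> float:
    """Like fuzzy_match_score, but treats a None input as 'no similarity' (assumption)."""
    if str1 is None or str2 is None:
        return 0.0
    return SequenceMatcher(None, str1.lower(), str2.lower()).ratio()


def is_match(str1: str, str2: str, threshold: float = 0.8) -> bool:
    """Threshold check that tries the cheap upper bounds before the full ratio."""
    matcher = SequenceMatcher(None, str1.lower(), str2.lower())
    # real_quick_ratio() and quick_ratio() never underestimate ratio(), so a
    # pair that fails either bound cannot pass the full check.
    return (matcher.real_quick_ratio() >= threshold
            and matcher.quick_ratio() >= threshold
            and matcher.ratio() >= threshold)


none_score = safe_fuzzy_match_score(None, "hello")  # 0.0
typo_match = is_match("Microsft", "Microsoft")      # True
word_match = is_match("apple", "orange")            # False
```

The short-circuiting `and` chain means ratio() only runs for pairs that survive both bounds, which is the same ordering the standard library's difflib.get_close_matches uses internally.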
Similar Components
AI-powered semantic similarity - components with related functionality:
- function fuzzy_match_filename 64.2% similar
- function compare_pdf_content 59.4% similar
- function find_best_match 53.8% similar
- function test_similarity_threshold_effect 51.9% similar
- function calculate_similarity 47.0% similar