function find_similar_documents
Identifies pairs of similar documents by comparing their embeddings and returns those exceeding a specified similarity threshold, sorted by similarity score.
/tf/active/vicechatdev/chromadb-cleanup/src/utils/similarity_utils.py
38 - 71
moderate
Purpose
This function supports document similarity analysis and duplicate detection. Given a collection of documents with pre-computed embeddings, it builds a pairwise similarity matrix and returns every pair whose score meets or exceeds the threshold. Common use cases include deduplicating document collections, finding related content, clustering similar documents, and content recommendation. Each unordered pair is reported once because only the upper triangle of the similarity matrix is scanned, and the results are sorted by similarity score, highest first, for easy prioritization.
Source Code
def find_similar_documents(
    documents: List[Dict[str, Any]],
    threshold: float = 0.85
) -> List[Tuple[str, str, float]]:
    """
    Find pairs of similar documents above the similarity threshold.

    Args:
        documents: List of document dictionaries with 'id', 'text', and 'embedding' keys
        threshold: Similarity threshold (0 to 1)

    Returns:
        List of tuples with (doc1_id, doc2_id, similarity_score)
    """
    similar_pairs = []
    n = len(documents)

    # Extract embeddings
    embeddings = [doc['embedding'] for doc in documents]

    # Calculate similarity matrix
    sim_matrix = build_similarity_matrix(embeddings)

    # Find similar pairs (only upper triangle to avoid duplicates)
    for i in range(n):
        for j in range(i+1, n):
            sim = sim_matrix[i][j]
            if sim >= threshold:
                similar_pairs.append((documents[i]['id'], documents[j]['id'], sim))

    # Sort by similarity score (highest first)
    similar_pairs.sort(key=lambda x: x[2], reverse=True)

    return similar_pairs
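The nested loop over the upper triangle can also be written with NumPy index arrays. The following is a sketch of an equivalent scan, not the project's implementation. It assumes `build_similarity_matrix` returns a square matrix of cosine similarities (its source is not shown on this page), so a stand-in is defined here; `find_similar_documents_vectorized` is a hypothetical name used only for illustration.

```python
from typing import Any, Dict, List, Tuple

import numpy as np


def build_similarity_matrix(embeddings) -> np.ndarray:
    """Stand-in (assumption): pairwise cosine similarity computed with NumPy.

    Zero-length vectors would produce NaN here; they are not handled.
    """
    x = np.asarray(embeddings, dtype=float)
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return unit @ unit.T


def find_similar_documents_vectorized(
    documents: List[Dict[str, Any]],
    threshold: float = 0.85,
) -> List[Tuple[str, str, float]]:
    """Hypothetical vectorized variant of the upper-triangle scan."""
    if len(documents) < 2:
        return []
    sim = build_similarity_matrix([doc["embedding"] for doc in documents])
    # Index pairs (i, j) with i < j, i.e. the strict upper triangle
    rows, cols = np.triu_indices(len(documents), k=1)
    scores = sim[rows, cols]
    keep = scores >= threshold
    pairs = [
        (documents[i]["id"], documents[j]["id"], float(s))
        for i, j, s in zip(rows[keep], cols[keep], scores[keep])
    ]
    # Highest similarity first, matching the loop-based version
    pairs.sort(key=lambda p: p[2], reverse=True)
    return pairs


docs = [
    {"id": "a", "embedding": [1.0, 0.0]},
    {"id": "b", "embedding": [1.0, 0.1]},
    {"id": "c", "embedding": [0.0, 1.0]},
]
result = find_similar_documents_vectorized(docs, threshold=0.9)
print(result)  # one pair: ('a', 'b', ~0.995)
```

This moves the comparison out of Python-level loops, but the full n x n matrix is still built, so memory use is unchanged.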
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| documents | List[Dict[str, Any]] | - | positional_or_keyword |
| threshold | float | 0.85 | positional_or_keyword |
Parameter Details
documents: A list of dictionaries, one per document. The docstring lists three keys: 'id' (unique identifier for the document, typically a string), 'text' (the document content as a string), and 'embedding' (a numerical vector representation of the document, typically a numpy array or list of floats). The function body reads only 'id' and 'embedding'; 'text' is never accessed, so a missing 'text' key does not cause an error here. All embeddings should be pre-computed with the same embedding model, and therefore have the same dimensionality, before calling this function.
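The input requirements described above can be checked before the comparison runs. This is a minimal sketch of such a pre-check; `validate_documents` is a hypothetical helper that is not part of the documented module, and the rules it enforces (required keys, unique ids, equal embedding length) are taken from the description above.

```python
from typing import Any, Dict, List


def validate_documents(documents: List[Dict[str, Any]]) -> None:
    """Hypothetical pre-check: raise ValueError if an input looks unusable."""
    expected_dim = None
    seen_ids = set()
    for pos, doc in enumerate(documents):
        # Only the keys the function actually reads are required
        for key in ("id", "embedding"):
            if key not in doc:
                raise ValueError(f"document at index {pos} is missing key {key!r}")
        if doc["id"] in seen_ids:
            raise ValueError(f"duplicate id {doc['id']!r} at index {pos}")
        seen_ids.add(doc["id"])
        # len() works for both lists and 1-D numpy arrays
        dim = len(doc["embedding"])
        if expected_dim is None:
            expected_dim = dim
        elif dim != expected_dim:
            raise ValueError(
                f"document {doc['id']!r} has embedding length {dim}, "
                f"expected {expected_dim}"
            )


good = [
    {"id": "doc1", "embedding": [0.1, 0.5, 0.8]},
    {"id": "doc2", "embedding": [0.15, 0.52, 0.79]},
]
validate_documents(good)  # passes silently
```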
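threshold: A float, nominally between 0 and 1, that sets the minimum similarity score for two documents to be considered similar. The comparison is inclusive (sim >= threshold). Default is 0.85. Higher values (closer to 1.0) return fewer, more similar pairs, while lower values return more pairs with looser similarity requirements. Assuming the score is cosine similarity, a threshold of 1.0 matches only embeddings that point in exactly the same direction (and floating-point rounding can leave even identical vectors slightly below 1.0), while 0.0 returns every pair with non-negative similarity, which is not necessarily all pairs because cosine similarity can be negative.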
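To make the effect of the threshold concrete, the sketch below counts how many pairs survive at several cut-offs. It uses the three sample vectors from this page's usage example and a plain-Python cosine similarity; `cosine` and `count_pairs` are illustrative names, and cosine similarity is an assumption about what `build_similarity_matrix` computes.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumes non-zero norms)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


vectors = {
    "doc1": [0.1, 0.5, 0.8],
    "doc2": [0.15, 0.52, 0.79],
    "doc3": [0.9, 0.1, 0.2],
}

# One score per unordered pair, mirroring the upper-triangle scan
scores = {
    (a, b): cosine(vectors[a], vectors[b])
    for a, b in combinations(vectors, 2)
}


def count_pairs(threshold):
    """Number of pairs with score >= threshold (inclusive, as in the function)."""
    return sum(1 for s in scores.values() if s >= threshold)


for t in (0.3, 0.35, 0.85):
    print(t, count_pairs(t))
# 0.3  -> 3 pairs
# 0.35 -> 2 pairs
# 0.85 -> 1 pair (doc1, doc2)
```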
Return Value
Type: List[Tuple[str, str, float]]
Returns a list of tuples, each of the form (doc1_id, doc2_id, similarity_score). doc1_id and doc2_id are the 'id' values of the two input documents, with doc1 being the one that appears earlier in the input list. similarity_score is the value read from the similarity matrix for that pair; because only pairs at or above the threshold are kept, every returned score is at least the threshold, and for cosine similarity it is at most 1 (up to floating-point rounding), with 1 meaning the embeddings point in the same direction. The list is sorted in descending order by similarity_score, so the most similar pairs appear first. Returns an empty list if no pairs meet the threshold.
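One way to consume this return value for deduplication is a greedy pass over the sorted pairs. The sketch below is an illustration only: `ids_to_drop` is a hypothetical helper, the policy of keeping the first id of each pair is an arbitrary choice, and the scores in the sample list are made-up values.

```python
from typing import List, Set, Tuple


def ids_to_drop(pairs: List[Tuple[str, str, float]]) -> Set[str]:
    """Hypothetical helper: greedily pick one id to drop per similar pair.

    Pairs are processed in the order given (highest score first if the list
    comes straight from find_similar_documents). A pair is skipped when either
    member has already been dropped, so documents are not removed needlessly.
    """
    dropped: Set[str] = set()
    for first_id, second_id, _score in pairs:
        if first_id in dropped or second_id in dropped:
            continue
        dropped.add(second_id)
    return dropped


# Illustrative input in the documented return format (scores are invented)
sample_pairs = [
    ("doc1", "doc2", 0.99),
    ("doc2", "doc3", 0.95),
    ("doc1", "doc3", 0.91),
]
print(ids_to_drop(sample_pairs))  # doc2 and doc3 are dropped, doc1 is kept
```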
Dependencies
numpy, typing, sklearn
Required Imports
import numpy as np
from typing import List, Dict, Any, Tuple
from sklearn.metrics.pairwise import cosine_similarity
Usage Example
import numpy as np
from typing import List, Dict, Any, Tuple
from sklearn.metrics.pairwise import cosine_similarity

# find_similar_documents must be defined or imported before this point
# (see Source Code above); it looks up build_similarity_matrix at call time.
def build_similarity_matrix(embeddings):
    return cosine_similarity(embeddings)

# Sample documents with embeddings
documents = [
    {'id': 'doc1', 'text': 'Machine learning is fascinating', 'embedding': np.array([0.1, 0.5, 0.8])},
    {'id': 'doc2', 'text': 'Deep learning is a subset of ML', 'embedding': np.array([0.15, 0.52, 0.79])},
    {'id': 'doc3', 'text': 'The weather is nice today', 'embedding': np.array([0.9, 0.1, 0.2])}
]

# Find similar documents with default threshold
similar_pairs = find_similar_documents(documents)
print(similar_pairs)
# Output: one pair, ('doc1', 'doc2', ~0.998); the other two pairs score below 0.85

# Find similar documents with custom threshold
similar_pairs = find_similar_documents(documents, threshold=0.95)
for doc1, doc2, score in similar_pairs:
    print(f'{doc1} and {doc2} are {score:.2%} similar')
Best Practices
- Ensure all documents have embeddings of the same dimensionality before calling this function to avoid errors in similarity computation.
- Pre-compute embeddings using a consistent model (e.g., sentence-transformers, OpenAI embeddings) before passing documents to this function.
- Be aware that this function has O(n²) time complexity due to pairwise comparisons. For large document collections (>10,000 documents), consider using approximate nearest neighbor algorithms like FAISS or Annoy instead.
- The 'build_similarity_matrix' function must be available in scope. Ensure it's properly defined or imported.
- Choose an appropriate threshold based on your use case: 0.9-1.0 for near-duplicates, 0.7-0.9 for similar content, 0.5-0.7 for loosely related content.
- The function only returns pairs where doc1_index < doc2_index to avoid duplicate pairs (e.g., (doc1, doc2) and (doc2, doc1)).
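- Cosine similarity does not depend on vector length, so normalizing embeddings is unnecessary when 'build_similarity_matrix' computes true cosine similarity (as sklearn's cosine_similarity does). Normalize embeddings to unit length beforehand only if the similarity is computed as a raw dot product.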
- For very large datasets, consider batching the documents or using more efficient similarity search methods like vector databases.
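The batching advice in the list above can be sketched as a blocked scan that never materializes the full n x n matrix. This is an illustration under stated assumptions, not part of the documented module: `iter_similar_pairs_blocked` and its `block_size` parameter are hypothetical, cosine similarity is assumed, and rows are normalized once so that a dot product equals cosine similarity.

```python
from typing import Iterator, Sequence, Tuple

import numpy as np


def iter_similar_pairs_blocked(
    ids: Sequence[str],
    embeddings,
    threshold: float = 0.85,
    block_size: int = 1024,
) -> Iterator[Tuple[str, str, float]]:
    """Hypothetical blocked scan yielding (id_i, id_j, cosine) for i < j."""
    x = np.asarray(embeddings, dtype=float)
    # Normalize rows once; zero-length vectors are not handled in this sketch
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    n = unit.shape[0]
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        # Similarities between this block of rows and every row from `start` on
        block = unit[start:stop] @ unit[start:].T
        rows, cols = np.nonzero(block >= threshold)
        for r, c in zip(rows, cols):
            i, j = start + int(r), start + int(c)
            if i < j:  # skip self-matches and mirrored pairs
                yield ids[i], ids[j], float(block[r, c])


ids = ["a", "b", "c", "d"]
emb = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]])
pairs = sorted(
    iter_similar_pairs_blocked(ids, emb, threshold=0.9, block_size=2),
    key=lambda p: p[2],
    reverse=True,
)
print(pairs)  # two pairs: ('a', 'b') and ('c', 'd'), each ~0.995
```

Run time is still quadratic in the number of documents; only peak memory drops, to roughly block_size x n values per step. For collections where quadratic time is itself the problem, an approximate nearest neighbor index remains the better fit.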
Similar Components
AI-powered semantic similarity - components with related functionality:
- function identify_duplicates (67.5% similar)
- function get_unique_documents (65.4% similar)
- class SimilarityCleaner (63.7% similar)
- function build_similarity_matrix (58.4% similar)
- function test_similarity_threshold_effect (54.3% similar)