function find_similar_documents

Maturity: 60

Identifies pairs of similar documents by comparing their embeddings and returns those exceeding a specified similarity threshold, sorted by similarity score.

File: /tf/active/vicechatdev/chromadb-cleanup/src/utils/similarity_utils.py
Lines: 38 - 71
Complexity: moderate

Purpose

This function supports document similarity analysis and duplicate detection. Given a collection of documents with pre-computed embeddings, it finds every pair whose similarity meets or exceeds a threshold. Common use cases include deduplicating document collections, finding related content, clustering similar documents, and content recommendation. The function visits only the upper triangle of the similarity matrix, so each unordered pair is checked exactly once, and it returns the results sorted by similarity score (highest first) for easy prioritization.
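
The upper-triangle scan described above can also be written without explicit nested loops. The following is a minimal sketch, not the project's implementation: the helper name upper_triangle_pairs is hypothetical, and it assumes the similarity matrix is already available as a square array.

```python
import numpy as np

def upper_triangle_pairs(sim_matrix, ids, threshold=0.85):
    """Hypothetical helper: return (id_i, id_j, score) for every i < j
    whose score is >= threshold, sorted with the highest score first."""
    sim = np.asarray(sim_matrix, dtype=float)
    # k=1 starts one step above the diagonal, so self-comparisons are
    # skipped and each unordered pair appears exactly once.
    rows, cols = np.triu_indices(len(ids), k=1)
    scores = sim[rows, cols]
    keep = scores >= threshold
    pairs = [
        (ids[i], ids[j], float(s))
        for i, j, s in zip(rows[keep], cols[keep], scores[keep])
    ]
    pairs.sort(key=lambda p: p[2], reverse=True)
    return pairs

# Small hand-written similarity matrix for three documents
sim = np.array([
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.4],
    [0.2, 0.4, 1.0],
])
pairs = upper_triangle_pairs(sim, ["a", "b", "c"], threshold=0.3)
print(pairs)
```

For n documents the scan covers n(n-1)/2 pairs, which is 3 pairs in this three-document example.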

Source Code

def find_similar_documents(
    documents: List[Dict[str, Any]], 
    threshold: float = 0.85
) -> List[Tuple[str, str, float]]:
    """
    Find pairs of similar documents above the similarity threshold.
    
    Args:
        documents: List of document dictionaries with 'id', 'text', and 'embedding' keys
        threshold: Similarity threshold (0 to 1)
        
    Returns:
        List of tuples with (doc1_id, doc2_id, similarity_score)
    """
    similar_pairs = []
    n = len(documents)
    
    # Extract embeddings
    embeddings = [doc['embedding'] for doc in documents]
    
    # Calculate similarity matrix
    sim_matrix = build_similarity_matrix(embeddings)
    
    # Find similar pairs (only upper triangle to avoid duplicates)
    for i in range(n):
        for j in range(i+1, n):
            sim = sim_matrix[i][j]
            if sim >= threshold:
                similar_pairs.append((documents[i]['id'], documents[j]['id'], sim))
    
    # Sort by similarity score (highest first)
    similar_pairs.sort(key=lambda x: x[2], reverse=True)
    
    return similar_pairs

Parameters

Name       Type                  Default  Kind
documents  List[Dict[str, Any]]  -        positional_or_keyword
threshold  float                 0.85     positional_or_keyword

Parameter Details

documents: A list of dictionaries, one per document. The docstring specifies three keys: 'id' (a unique identifier, typically a string), 'text' (the document content as a string), and 'embedding' (a numerical vector representation, typically a numpy array or a list of floats). The function body reads only 'id' and 'embedding'; 'text' is not used in the comparison. Embeddings should be pre-computed with a single, consistent embedding model before calling this function.
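
Because the function indexes doc['id'] and doc['embedding'] directly, a malformed entry surfaces as a KeyError or as an error inside the similarity computation. A small pre-check can make such problems easier to diagnose. This is a sketch only: validate_documents is a hypothetical helper that is not part of the source module.

```python
import numpy as np

def validate_documents(documents):
    """Hypothetical pre-check: raise ValueError if a document lacks the keys
    the function reads, repeats an id, or has a mismatched embedding size.
    Returns the shared embedding dimensionality (None for an empty list)."""
    dim = None
    seen_ids = set()
    for pos, doc in enumerate(documents):
        for key in ("id", "embedding"):
            if key not in doc:
                raise ValueError(f"document at position {pos} is missing '{key}'")
        if doc["id"] in seen_ids:
            raise ValueError(f"duplicate id {doc['id']!r} at position {pos}")
        seen_ids.add(doc["id"])
        vec = np.asarray(doc["embedding"], dtype=float)
        if vec.ndim != 1 or vec.size == 0:
            raise ValueError(f"embedding of {doc['id']!r} is not a non-empty 1-D vector")
        if dim is None:
            dim = vec.size
        elif vec.size != dim:
            raise ValueError(
                f"embedding of {doc['id']!r} has size {vec.size}, expected {dim}"
            )
    return dim

docs = [
    {"id": "a", "text": "first", "embedding": [0.1, 0.2, 0.3]},
    {"id": "b", "text": "second", "embedding": [0.3, 0.2, 0.1]},
]
dim = validate_documents(docs)
print(dim)
```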

threshold: A float, documented as ranging from 0 to 1, that sets the minimum similarity score for two documents to be reported as similar. The default is 0.85. The comparison is inclusive (sim >= threshold). Higher values (closer to 1.0) return fewer, more similar pairs; lower values return more pairs with looser similarity. A threshold of 1.0 returns only pairs whose score reaches 1.0, which floating-point rounding can prevent even for identical embeddings. A threshold of 0.0 returns every pair with a non-negative score; cosine similarity can be negative, so this is not necessarily all pairs.
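
One caveat worth checking when picking a threshold: cosine similarity ranges from -1 to 1, not 0 to 1, so a threshold of 0.0 does not necessarily admit every pair. The sketch below assumes the matrix holds cosine similarities, as the description of build_similarity_matrix states; the cosine helper is a stand-in written for illustration.

```python
import numpy as np

def cosine(a, b):
    """Plain cosine similarity between two 1-D vectors (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Parallel vectors of different length score (approximately) 1
same_direction = cosine([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors score 0
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])
# Vectors pointing in opposite directions score (approximately) -1
opposite = cosine([1.0, 2.0], [-1.0, -2.0])

for name, value in [
    ("same_direction", same_direction),
    ("orthogonal", orthogonal),
    ("opposite", opposite),
]:
    print(f"{name}: {value:.3f}, passes threshold 0.0: {value >= 0.0}")
```

Scores close to 1 may land slightly below 1.0 after rounding, which is why a threshold of exactly 1.0 is fragile.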

Return Value

Type: List[Tuple[str, str, float]]

Returns a list of tuples, each with three elements: (doc1_id, doc2_id, similarity_score). doc1_id and doc2_id are the identifiers taken from the input documents, and doc1_id always belongs to the document that appears earlier in the input list. similarity_score is the value read from the similarity matrix; for cosine similarity it lies between -1 and 1, with 1 meaning the embeddings point in the same direction. The list is sorted in descending order by similarity_score, so the most similar pairs appear first. Returns an empty list if no pair meets the threshold.
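
For deduplication or clustering, the returned pairs often need to be merged into groups. One possible approach, shown here as a sketch rather than anything the source module provides, is a small union-find over the document IDs. The name group_pairs and the sample pairs are made up for illustration.

```python
def group_pairs(pairs):
    """Hypothetical helper: merge (id_a, id_b, score) pairs into groups of
    IDs that are connected directly or through other pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for id_a, id_b, _score in pairs:
        root_a, root_b = find(id_a), find(id_b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for doc_id in list(parent):
        groups.setdefault(find(doc_id), []).append(doc_id)
    return sorted(sorted(group) for group in groups.values())

# Pairs in the same shape as the function's return value (made-up scores)
pairs = [
    ("doc1", "doc2", 0.99),
    ("doc2", "doc3", 0.91),
    ("doc4", "doc5", 0.88),
]
groups = group_pairs(pairs)
print(groups)
```

Grouping this way is transitive: doc1 and doc3 land in the same group through doc2 even though they were never reported as a pair, which may or may not be desirable depending on the use case.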

Dependencies

  • numpy
  • typing
  • sklearn

Required Imports

import numpy as np
from typing import List, Dict, Any, Tuple
from sklearn.metrics.pairwise import cosine_similarity

Usage Example

import numpy as np
from typing import List, Dict, Any, Tuple
from sklearn.metrics.pairwise import cosine_similarity

def build_similarity_matrix(embeddings):
    return cosine_similarity(embeddings)

# Sample documents with embeddings
documents = [
    {'id': 'doc1', 'text': 'Machine learning is fascinating', 'embedding': np.array([0.1, 0.5, 0.8])},
    {'id': 'doc2', 'text': 'Deep learning is a subset of ML', 'embedding': np.array([0.15, 0.52, 0.79])},
    {'id': 'doc3', 'text': 'The weather is nice today', 'embedding': np.array([0.9, 0.1, 0.2])}
]

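# Find similar documents with default threshold
# (find_similar_documents must be defined in this same scope so that it
# picks up the build_similarity_matrix defined above)
similar_pairs = find_similar_documents(documents)
print(similar_pairs)
# Output: a single pair, ('doc1', 'doc2', ~0.998); doc3 scores below 0.4 against both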

# Find similar documents with custom threshold
similar_pairs = find_similar_documents(documents, threshold=0.95)
for doc1, doc2, score in similar_pairs:
    print(f'{doc1} and {doc2} are {score:.2%} similar')

Best Practices

  • Ensure all documents have embeddings of the same dimensionality before calling this function to avoid errors in similarity computation.
  • Pre-compute embeddings using a consistent model (e.g., sentence-transformers, OpenAI embeddings) before passing documents to this function.
  • Be aware that this function has O(n²) time complexity due to pairwise comparisons, and that the full n × n similarity matrix is held in memory at once. For large document collections (>10,000 documents), consider using approximate nearest neighbor libraries like FAISS or Annoy instead.
  • The 'build_similarity_matrix' function must be available in scope. Ensure it's properly defined or imported.
  • Choose an appropriate threshold based on your use case: 0.9-1.0 for near-duplicates, 0.7-0.9 for similar content, 0.5-0.7 for loosely related content.
  • The function only returns pairs where doc1_index < doc2_index to avoid duplicate pairs (e.g., (doc1, doc2) and (doc2, doc1)).
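  • Cosine similarity is unaffected by vector length, so normalizing embeddings beforehand does not change the scores. Normalization matters if 'build_similarity_matrix' is replaced with a raw dot product, where unnormalized vectors would skew the results.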
  • For very large datasets, consider batching the documents or using more efficient similarity search methods like vector databases.
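
The batching suggestion above can be sketched as follows. This is an illustrative variant, not the module's code: find_similar_pairs_chunked and chunk_size are hypothetical names, and the sketch computes cosine similarity directly with NumPy by normalizing the rows and taking dot products. It still performs O(n²) comparisons but keeps only a chunk_size × n slice of the matrix in memory at a time.

```python
import numpy as np

def find_similar_pairs_chunked(ids, embeddings, threshold=0.85, chunk_size=1024):
    """Hypothetical chunked variant: same pairs as a full-matrix scan,
    computed one block of rows at a time."""
    X = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    # Unit-length rows make the dot product equal to cosine similarity;
    # all-zero rows are left as zeros instead of dividing by zero.
    X = X / np.where(norms == 0.0, 1.0, norms)
    n = X.shape[0]
    pairs = []
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        block = X[start:stop] @ X.T  # shape: (stop - start, n)
        rows, cols = np.nonzero(block >= threshold)
        for r, c in zip(rows, cols):
            i, j = start + int(r), int(c)
            if i < j:  # keep each unordered pair once, skip self-matches
                pairs.append((ids[i], ids[j], float(block[r, c])))
    pairs.sort(key=lambda p: p[2], reverse=True)
    return pairs

ids = ["doc1", "doc2", "doc3"]
embeddings = [
    [0.1, 0.5, 0.8],
    [0.15, 0.52, 0.79],
    [0.9, 0.1, 0.2],
]
pairs = find_similar_pairs_chunked(ids, embeddings, threshold=0.85, chunk_size=2)
print(pairs)
```

Beyond a few tens of thousands of documents the quadratic running time dominates regardless of chunking, which is where an approximate nearest neighbor index or a vector database becomes the more practical choice.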

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function identify_duplicates 67.5% similar

    Identifies duplicate documents by computing hash values of their text content and grouping documents with identical hashes.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/utils/hash_utils.py
  • function get_unique_documents 65.4% similar

    Identifies and separates unique documents from duplicates in a list by comparing hash values of document text content.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/utils/hash_utils.py
  • class SimilarityCleaner 63.7% similar

    A document cleaning class that identifies and removes duplicate or highly similar documents based on embedding vector similarity, keeping only representative documents from each similarity group.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/similarity_cleaner.py
  • function build_similarity_matrix 58.4% similar

    Computes a pairwise cosine similarity matrix for a collection of embedding vectors, where each cell (i,j) represents the similarity between embedding i and embedding j.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/utils/similarity_utils.py
  • function test_similarity_threshold_effect 54.3% similar

    A pytest test function that validates the behavior of SimilarityCleaner with different similarity threshold values, ensuring that higher thresholds retain more texts while lower thresholds are more aggressive in removing similar content.

    From: /tf/active/vicechatdev/chromadb-cleanup/tests/test_similarity_cleaner.py