class SimilarityCleaner

Maturity: 49

A document cleaning class that identifies and removes duplicate or highly similar documents based on embedding vector similarity, keeping only one representative document from each similarity group.

File: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/similarity_cleaner.py
Lines: 8-79
Complexity: moderate

Purpose

SimilarityCleaner is designed to deduplicate document collections by analyzing embedding vectors and removing near-duplicate content. It uses a similarity threshold to identify documents that are semantically similar, groups them using a union-find algorithm, and retains only one representative document per group (the one with the longest text). This is useful for cleaning large document datasets, removing redundant information, and reducing storage/processing costs while maintaining content diversity.

Source Code

class SimilarityCleaner(BaseCleaner):
    """Cleaner that removes or merges similar documents based on embedding similarity."""
    
    def __init__(self, config: Config):
        """
        Initialize the SimilarityCleaner with configuration.
        
        Args:
            config: Configuration object
        """
        super().__init__(config)
        self.similarity_threshold = config.similarity_threshold
    
    def clean(self, documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """
        Clean the documents by removing or combining similar documents.
        
        Args:
            documents: List of document dictionaries with 'id', 'text' and 'embedding' keys
            
        Returns:
            List of documents with similar documents removed
        """
        # Ensure all documents have embeddings
        for doc in documents:
            if 'embedding' not in doc:
                raise ValueError(f"Document {doc['id']} does not have an embedding")
        
        # Find pairs of similar documents
        similar_pairs = find_similar_documents(documents, self.similarity_threshold)
        
        # Group similar documents
        doc_map = {doc['id']: doc for doc in documents}
        id_to_group = {}
        
        # Build connected components (groups of similar documents)
        for id1, id2, _ in similar_pairs:
            # Union-find algorithm to group similar documents
            group1 = id_to_group.get(id1, {id1})
            group2 = id_to_group.get(id2, {id2})
            
            if group1 is not group2:  # Different groups, merge them
                merged_group = group1.union(group2)
                for doc_id in merged_group:
                    id_to_group[doc_id] = merged_group
        
        # Collect all disjoint similarity groups
        unique_groups = set(frozenset(group) for group in id_to_group.values())
        
        # Create a set of document IDs to keep
        ids_to_keep = set()
        
        # For each group, select one representative document to keep
        for group in unique_groups:
            # Select the document with the longest text as representative
            rep_id = max(group, key=lambda doc_id: len(doc_map[doc_id]['text']))
            ids_to_keep.add(rep_id)
        
        # Include documents that aren't part of any similarity group
        for doc in documents:
            if doc['id'] not in id_to_group:
                ids_to_keep.add(doc['id'])
        
        # Create the cleaned document list
        cleaned_documents = [doc for doc in documents if doc['id'] in ids_to_keep]
        
        # Log statistics
        stats = self.get_stats(documents, cleaned_documents)
        print(f"SimilarityCleaner: Removed {stats['original_count'] - stats['cleaned_count']} similar documents "
              f"({stats['reduction_percentage']:.2f}% reduction)")
        
        return cleaned_documents

Parameters

Name    Type          Default   Kind
bases   BaseCleaner   -         -

Parameter Details

config: A Config object containing configuration settings for the cleaner. Must include a 'similarity_threshold' attribute (float between 0 and 1) that determines how similar two documents must be to be considered duplicates. Higher values (closer to 1) require more similarity for documents to be merged.
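
To make the threshold concrete, here is a minimal sketch of how a pair of embeddings might be scored against it, assuming find_similar_documents compares embeddings with cosine similarity (as stated under Attributes below); the helper is illustrative, not the project's implementation:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Illustrative helper: cosine similarity of two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.1, 0.2, 0.3])
b = np.array([0.11, 0.21, 0.31])

threshold = 0.85
if cosine_similarity(a, b) >= threshold:  # ~0.9998 here, so the pair is flagged
    print('pair would be grouped as similar')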

Return Value

Instantiating the class returns a SimilarityCleaner object. The clean() method returns a List[Dict[str, Any]] containing the deduplicated documents. Each document dictionary retains its original structure with 'id', 'text', and 'embedding' keys. For each similarity group, only the representative document (the one with the longest text) is kept; the other members of the group are removed.

Class Interface

Methods

__init__(self, config: Config) -> None

Purpose: Initialize the SimilarityCleaner with configuration settings, extracting the similarity threshold from the config object

Parameters:

  • config: Configuration object that must contain a similarity_threshold attribute (float between 0 and 1)

Returns: None - initializes the instance

clean(self, documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]

Purpose: Process a list of documents and remove near-duplicates based on embedding similarity, grouping similar documents with a union-find algorithm and keeping one representative per group

Parameters:

  • documents: List of document dictionaries, each must contain 'id' (unique identifier), 'text' (document content), and 'embedding' (numpy array or list of floats representing the document's vector embedding)

Returns: List of deduplicated documents with the same structure as the input. From each similarity group, only the document with the longest text is kept; documents not similar to any others are retained unchanged.
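
To make the grouping step concrete, here is a small standalone run of the same union-find logic on made-up similarity pairs (the IDs and scores below are hypothetical):

# Hypothetical output of find_similar_documents: (id1, id2, score) tuples
similar_pairs = [('a', 'b', 0.92), ('b', 'c', 0.88), ('x', 'y', 0.90)]

id_to_group = {}
for id1, id2, _ in similar_pairs:
    group1 = id_to_group.get(id1, {id1})
    group2 = id_to_group.get(id2, {id2})
    if group1 is not group2:  # different groups, merge them
        merged = group1.union(group2)
        for doc_id in merged:
            id_to_group[doc_id] = merged

unique_groups = set(frozenset(g) for g in id_to_group.values())
print(unique_groups)  # {frozenset({'a', 'b', 'c'}), frozenset({'x', 'y'})} (order may vary)

Note the transitivity: 'a' and 'c' end up in one group even though only ('a', 'b') and ('b', 'c') were flagged as similar.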

Attributes

similarity_threshold (float, instance): The cosine similarity threshold (0 to 1) used to decide whether two documents are similar enough to deduplicate. Read from the config parameter during initialization; typical values range from 0.7 to 0.95.

config (Config, instance): Configuration object passed to the parent BaseCleaner class, containing the settings for the cleaning process.

Dependencies

  • typing
  • numpy
  • src.cleaners.base_cleaner
  • src.utils.similarity_utils
  • src.config

Required Imports

from typing import List, Dict, Any
import numpy as np
from src.cleaners.base_cleaner import BaseCleaner
from src.utils.similarity_utils import find_similar_documents
from src.config import Config

Usage Example

from src.cleaners.similarity_cleaner import SimilarityCleaner
from src.config import Config
import numpy as np

# Create configuration with similarity threshold
config = Config()
config.similarity_threshold = 0.85

# Initialize the cleaner
cleaner = SimilarityCleaner(config)

# Prepare documents with embeddings
documents = [
    {'id': 'doc1', 'text': 'This is a sample document.', 'embedding': np.array([0.1, 0.2, 0.3])},
    {'id': 'doc2', 'text': 'This is a similar sample document.', 'embedding': np.array([0.11, 0.21, 0.31])},
    {'id': 'doc3', 'text': 'Completely different content here.', 'embedding': np.array([0.9, -0.8, 0.1])}  # points in a different direction, so its cosine similarity to doc1/doc2 stays low
]

# Clean the documents
cleaned_docs = cleaner.clean(documents)

# Result: doc1 and doc2 exceed the 0.85 threshold, so only doc2 (longest text) is kept;
# doc3 is not similar to either and survives on its own
print(f"Original: {len(documents)} documents, Cleaned: {len(cleaned_docs)} documents")

Best Practices

  • Always ensure documents have embeddings before calling clean() - the method will raise ValueError if embeddings are missing
  • Choose similarity_threshold carefully: values that are too low (e.g., 0.5) may merge unrelated documents, while values that are too high (e.g., 0.99) may miss near-duplicates
  • The cleaner keeps the document with the longest text as the representative - ensure this selection criterion fits your use case
  • For large document collections, be aware that similarity computation can be O(n²) - consider batching or approximate nearest neighbor methods (a batched sketch follows this list)
  • The union-find algorithm ensures transitive similarity: if A is similar to B and B is similar to C, all three are grouped together
  • Review the printed statistics after cleaning to understand the reduction rate and validate the threshold setting
  • Documents not part of any similarity group are always retained in the output
  • The method returns a new filtered list; it does not modify the input list or the individual document dictionaries
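
On the scaling point above, here is a minimal sketch of a batched pairwise pass, assuming cosine similarity over numpy arrays; the function name and signature are hypothetical, and the project's find_similar_documents may work differently. Batching caps memory per block, but the compute is still O(n²); approximate nearest neighbor libraries such as FAISS are the usual next step when that is too slow.

import numpy as np

def similar_pairs_batched(ids, embeddings, threshold, batch_size=1024):
    # Hypothetical sketch: find (id1, id2, score) pairs at or above threshold
    # without materializing one giant n x n similarity matrix.
    X = np.asarray(embeddings, dtype=np.float64)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # normalize once so dot == cosine
    pairs = []
    n = len(ids)
    for start in range(0, n, batch_size):
        block = X[start:start + batch_size]   # (b, d) slice of the embeddings
        sims = block @ X.T                    # (b, n) cosine similarities
        rows, cols = np.nonzero(sims >= threshold)
        for r, c in zip(rows, cols):
            i = start + r
            if i < c:  # keep each pair once and skip self-similarity
                pairs.append((ids[i], ids[c], float(sims[r, c])))
    return pairs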

Similar Components

Components with related functionality, ranked by semantic similarity:

  • class CombinedCleaner 78.5% similar

    A document cleaner that combines hash-based and similarity-based cleaning approaches to remove both exact and near-duplicate documents in a two-stage process.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/combined_cleaner.py
  • class HashCleaner 70.2% similar

    A document deduplication cleaner that removes documents with identical content by comparing hash values of document text.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/hash_cleaner.py
  • class TextClusterer 63.7% similar

    A class that clusters similar documents based on their embeddings using various clustering algorithms (K-means, Agglomerative, DBSCAN) and optionally generates summaries for each cluster.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/clustering/text_clusterer.py
  • function find_similar_documents 63.7% similar

    Identifies pairs of similar documents by comparing their embeddings and returns those exceeding a specified similarity threshold, sorted by similarity score.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/utils/similarity_utils.py
  • function test_identical_text_removal 61.6% similar

    A pytest test function that verifies the SimilarityCleaner's ability to remove identical duplicate text entries from a list while preserving unique documents.

    From: /tf/active/vicechatdev/chromadb-cleanup/tests/test_similarity_cleaner.py