class SimilarityCleaner

Maturity: 49

A document cleaning class that identifies and removes duplicate or highly similar documents based on embedding vector similarity, keeping only one representative document from each similarity group.

File: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/similarity_cleaner.py
Lines: 8-79
Complexity: moderate

Purpose

SimilarityCleaner is designed to deduplicate document collections by analyzing embedding vectors and removing near-duplicate content. It uses a similarity threshold to identify documents that are semantically similar, groups them using a union-find algorithm, and retains only one representative document per group (the one with the longest text). This is useful for cleaning large document datasets, removing redundant information, and reducing storage/processing costs while maintaining content diversity.

Source Code

class SimilarityCleaner(BaseCleaner):
    """Cleaner that removes or merges similar documents based on embedding similarity."""
    
    def __init__(self, config: Config):
        """
        Initialize the SimilarityCleaner with configuration.
        
        Args:
            config: Configuration object
        """
        super().__init__(config)
        self.similarity_threshold = config.similarity_threshold
    
    def clean(self, documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """
        Clean the documents by removing or combining similar documents.
        
        Args:
            documents: List of document dictionaries with 'id', 'text' and 'embedding' keys
            
        Returns:
            List of documents with similar documents removed
        """
        # Ensure all documents have embeddings
        for doc in documents:
            if 'embedding' not in doc:
                raise ValueError(f"Document {doc['id']} does not have an embedding")
        
        # Find pairs of similar documents
        similar_pairs = find_similar_documents(documents, self.similarity_threshold)
        
        # Group similar documents
        doc_map = {doc['id']: doc for doc in documents}
        id_to_group = {}
        
        # Build connected components (groups of similar documents)
        for id1, id2, _ in similar_pairs:
            # Union-find algorithm to group similar documents
            group1 = id_to_group.get(id1, {id1})
            group2 = id_to_group.get(id2, {id2})
            
            if group1 is not group2:  # Different groups, merge them
                merged_group = group1.union(group2)
                for doc_id in merged_group:
                    id_to_group[doc_id] = merged_group
        
        # Collect all disjoint similarity groups
        unique_groups = set(frozenset(group) for group in id_to_group.values())
        
        # Create a set of document IDs to keep
        ids_to_keep = set()
        
        # For each group, select one representative document to keep
        for group in unique_groups:
            # Select the document with the longest text as representative
            rep_id = max(group, key=lambda doc_id: len(doc_map[doc_id]['text']))
            ids_to_keep.add(rep_id)
        
        # Include documents that aren't part of any similarity group
        for doc in documents:
            if doc['id'] not in id_to_group:
                ids_to_keep.add(doc['id'])
        
        # Create the cleaned document list
        cleaned_documents = [doc for doc in documents if doc['id'] in ids_to_keep]
        
        # Log statistics
        stats = self.get_stats(documents, cleaned_documents)
        print(f"SimilarityCleaner: Removed {stats['original_count'] - stats['cleaned_count']} similar documents "
              f"({stats['reduction_percentage']:.2f}% reduction)")
        
        return cleaned_documents

Parameters

Name    Type          Default   Kind
bases   BaseCleaner   -         -

Parameter Details

config: A Config object containing configuration settings for the cleaner. Must include a 'similarity_threshold' attribute (float between 0 and 1) that determines how similar two documents must be to be considered duplicates. Higher values (closer to 1) require more similarity for documents to be merged.
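
To make the threshold concrete, here is a minimal sketch of how a pair of embeddings might be scored against it, assuming find_similar_documents compares embeddings with cosine similarity (as stated under Attributes below); the helper is illustrative, not the project's implementation:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Illustrative helper: cosine similarity of two embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.1, 0.2, 0.3])
b = np.array([0.11, 0.21, 0.31])

threshold = 0.85
if cosine_similarity(a, b) >= threshold:  # ~0.9998 here, so the pair is flagged
    print('pair would be grouped as similar')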

Return Value

Instantiating the class returns a SimilarityCleaner object. The clean() method returns a List[Dict[str, Any]] containing the deduplicated documents. Each document dictionary retains its original structure with 'id', 'text', and 'embedding' keys. For each similarity group, only the representative document (the one with the longest text) is kept; the other members of the group are removed.

Class Interface

Methods

__init__(self, config: Config) -> None

Purpose: Initialize the SimilarityCleaner with configuration settings, extracting the similarity threshold from the config object

Parameters:

  • config: Configuration object that must contain a similarity_threshold attribute (float between 0 and 1)

Returns: None - initializes the instance

clean(self, documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]

Purpose: Process a list of documents and remove near-duplicates based on embedding similarity, grouping similar documents with a union-find algorithm and keeping one representative per group

Parameters:

  • documents: List of document dictionaries, each must contain 'id' (unique identifier), 'text' (document content), and 'embedding' (numpy array or list of floats representing the document's vector embedding)

Returns: List of deduplicated documents with the same structure as the input. From each similarity group, only the document with the longest text is kept; documents not similar to any others are retained unchanged.
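
To make the grouping step concrete, here is a small standalone run of the same union-find logic on made-up similarity pairs (the IDs and scores below are hypothetical):

# Hypothetical output of find_similar_documents: (id1, id2, score) tuples
similar_pairs = [('a', 'b', 0.92), ('b', 'c', 0.88), ('x', 'y', 0.90)]

id_to_group = {}
for id1, id2, _ in similar_pairs:
    group1 = id_to_group.get(id1, {id1})
    group2 = id_to_group.get(id2, {id2})
    if group1 is not group2:  # different groups, merge them
        merged = group1.union(group2)
        for doc_id in merged:
            id_to_group[doc_id] = merged

unique_groups = set(frozenset(g) for g in id_to_group.values())
print(unique_groups)  # {frozenset({'a', 'b', 'c'}), frozenset({'x', 'y'})} (order may vary)

Note the transitivity: 'a' and 'c' end up in one group even though only ('a', 'b') and ('b', 'c') were flagged as similar.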

Attributes

similarity_threshold (float, instance): The cosine similarity threshold (0 to 1) used to decide whether two documents are similar enough to deduplicate. Read from the config parameter during initialization; typical values range from 0.7 to 0.95.

config (Config, instance): Configuration object passed to the parent BaseCleaner class, containing the settings for the cleaning process.

Dependencies

  • typing
  • numpy
  • src.cleaners.base_cleaner
  • src.utils.similarity_utils
  • src.config

Required Imports

from typing import List, Dict, Any
import numpy as np
from src.cleaners.base_cleaner import BaseCleaner
from src.utils.similarity_utils import find_similar_documents
from src.config import Config

Usage Example

from src.cleaners.similarity_cleaner import SimilarityCleaner
from src.config import Config
import numpy as np

# Create configuration with similarity threshold
config = Config()
config.similarity_threshold = 0.85

# Initialize the cleaner
cleaner = SimilarityCleaner(config)

# Prepare documents with embeddings
documents = [
    {'id': 'doc1', 'text': 'This is a sample document.', 'embedding': np.array([0.1, 0.2, 0.3])},
    {'id': 'doc2', 'text': 'This is a similar sample document.', 'embedding': np.array([0.11, 0.21, 0.31])},
    {'id': 'doc3', 'text': 'Completely different content here.', 'embedding': np.array([0.9, -0.8, 0.1])}  # points in a different direction, so its cosine similarity to doc1/doc2 stays low
]

# Clean the documents
cleaned_docs = cleaner.clean(documents)

# Result: doc1 and doc2 exceed the 0.85 threshold, so only doc2 (longest text) is kept;
# doc3 is not similar to either and survives on its own
print(f"Original: {len(documents)} documents, Cleaned: {len(cleaned_docs)} documents")

Best Practices

  • Always ensure documents have embeddings before calling clean() - the method will raise ValueError if embeddings are missing
  • Choose similarity_threshold carefully: values that are too low (e.g., 0.5) may merge unrelated documents, while values that are too high (e.g., 0.99) may miss near-duplicates
  • The cleaner keeps the document with the longest text as the representative - ensure this selection criterion fits your use case
  • For large document collections, be aware that similarity computation can be O(n²) - consider batching or approximate nearest neighbor methods (a batched sketch follows this list)
  • The union-find algorithm ensures transitive similarity: if A is similar to B and B is similar to C, all three are grouped together
  • Review the printed statistics after cleaning to understand the reduction rate and validate the threshold setting
  • Documents not part of any similarity group are always retained in the output
  • The method returns a new filtered list; it does not modify the input list or the individual document dictionaries
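
On the scaling point above, here is a minimal sketch of a batched pairwise pass, assuming cosine similarity over numpy arrays; the function name and signature are hypothetical, and the project's find_similar_documents may work differently. Batching caps memory per block, but the compute is still O(n²); approximate nearest neighbor libraries such as FAISS are the usual next step when that is too slow.

import numpy as np

def similar_pairs_batched(ids, embeddings, threshold, batch_size=1024):
    # Hypothetical sketch: find (id1, id2, score) pairs at or above threshold
    # without materializing one giant n x n similarity matrix.
    X = np.asarray(embeddings, dtype=np.float64)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # normalize once so dot == cosine
    pairs = []
    n = len(ids)
    for start in range(0, n, batch_size):
        block = X[start:start + batch_size]   # (b, d) slice of the embeddings
        sims = block @ X.T                    # (b, n) cosine similarities
        rows, cols = np.nonzero(sims >= threshold)
        for r, c in zip(rows, cols):
            i = start + r
            if i < c:  # keep each pair once and skip self-similarity
                pairs.append((ids[i], ids[c], float(sims[r, c])))
    return pairs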

Similar Components

Components with related functionality, ranked by semantic similarity:

  • class CombinedCleaner 78.5% similar

    A document cleaner that combines hash-based and similarity-based cleaning approaches to remove both exact and near-duplicate documents in a two-stage process.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/combined_cleaner.py
  • class HashCleaner 70.2% similar

    A document deduplication cleaner that removes documents with identical content by comparing hash values of document text.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/hash_cleaner.py
  • class TextClusterer 63.7% similar

    A class that clusters similar documents based on their embeddings using various clustering algorithms (K-means, Agglomerative, DBSCAN) and optionally generates summaries for each cluster.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/clustering/text_clusterer.py
  • function find_similar_documents 63.7% similar

    Identifies pairs of similar documents by comparing their embeddings and returns those exceeding a specified similarity threshold, sorted by similarity score.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/utils/similarity_utils.py
  • function test_identical_text_removal 61.6% similar

    A pytest test function that verifies the SimilarityCleaner's ability to remove identical duplicate text entries from a list while preserving unique documents.

    From: /tf/active/vicechatdev/chromadb-cleanup/tests/test_similarity_cleaner.py