class SimilarityCleaner
A document cleaning class that identifies and removes duplicate or highly similar documents based on embedding vector similarity, keeping only representative documents from each similarity group.
File: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/similarity_cleaner.py
Lines: 8 - 79
Complexity: moderate
Purpose
SimilarityCleaner is designed to deduplicate document collections by analyzing embedding vectors and removing near-duplicate content. It uses a similarity threshold to identify documents that are semantically similar, groups them using a union-find algorithm, and retains only one representative document per group (the one with the longest text). This is useful for cleaning large document datasets, removing redundant information, and reducing storage/processing costs while maintaining content diversity.
Source Code
```python
class SimilarityCleaner(BaseCleaner):
    """Cleaner that removes or merges similar documents based on embedding similarity."""

    def __init__(self, config: Config):
        """
        Initialize the SimilarityCleaner with configuration.

        Args:
            config: Configuration object
        """
        super().__init__(config)
        self.similarity_threshold = config.similarity_threshold

    def clean(self, documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """
        Clean the documents by removing or combining similar documents.

        Args:
            documents: List of document dictionaries with 'id', 'text' and 'embedding' keys

        Returns:
            List of documents with similar documents removed
        """
        # Ensure all documents have embeddings
        for doc in documents:
            if 'embedding' not in doc:
                raise ValueError(f"Document {doc['id']} does not have an embedding")

        # Find pairs of similar documents
        similar_pairs = find_similar_documents(documents, self.similarity_threshold)

        # Group similar documents
        doc_map = {doc['id']: doc for doc in documents}
        id_to_group = {}

        # Build connected components (groups of similar documents)
        for id1, id2, _ in similar_pairs:
            # Union-find algorithm to group similar documents
            group1 = id_to_group.get(id1, {id1})
            group2 = id_to_group.get(id2, {id2})
            if group1 is not group2:  # Different groups, merge them
                merged_group = group1.union(group2)
                for doc_id in merged_group:
                    id_to_group[doc_id] = merged_group

        # Collect all disjoint similarity groups
        unique_groups = set(frozenset(group) for group in id_to_group.values())

        # Create a set of document IDs to keep
        ids_to_keep = set()

        # For each group, select one representative document to keep
        for group in unique_groups:
            # Select the document with the longest text as representative
            rep_id = max(group, key=lambda doc_id: len(doc_map[doc_id]['text']))
            ids_to_keep.add(rep_id)

        # Include documents that aren't part of any similarity group
        for doc in documents:
            if doc['id'] not in id_to_group:
                ids_to_keep.add(doc['id'])

        # Create the cleaned document list
        cleaned_documents = [doc for doc in documents if doc['id'] in ids_to_keep]

        # Log statistics
        stats = self.get_stats(documents, cleaned_documents)
        print(f"SimilarityCleaner: Removed {stats['original_count'] - stats['cleaned_count']} similar documents "
              f"({stats['reduction_percentage']:.2f}% reduction)")

        return cleaned_documents
```
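The `find_similar_documents` helper is imported from `src.utils.similarity_utils` and its implementation is not shown here. A minimal sketch of what it likely does, assuming it compares every document pair by cosine similarity and returns `(id1, id2, score)` tuples for pairs at or above the threshold (the function body below is an illustration, not the library's actual code):

```python
import numpy as np
from typing import Any, Dict, List, Tuple

def find_similar_documents(
    documents: List[Dict[str, Any]], threshold: float
) -> List[Tuple[str, str, float]]:
    """Sketch: return (id1, id2, similarity) for pairs at or above threshold."""
    pairs = []
    for i in range(len(documents)):
        for j in range(i + 1, len(documents)):
            a = np.asarray(documents[i]['embedding'], dtype=float)
            b = np.asarray(documents[j]['embedding'], dtype=float)
            # Cosine similarity: dot product divided by the product of L2 norms
            sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
            if sim >= threshold:
                pairs.append((documents[i]['id'], documents[j]['id'], sim))
    return pairs
```

This double loop makes the pairwise comparison cost explicit: the number of comparisons grows quadratically with the number of documents, which is why the Best Practices section below recommends batching or approximate nearest neighbor methods for large collections.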
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| bases | BaseCleaner | - | - |
Parameter Details
config: A Config object containing configuration settings for the cleaner. Must include a 'similarity_threshold' attribute (float between 0 and 1) that determines how similar two documents must be to be considered duplicates. Higher values (closer to 1) require more similarity for documents to be merged.
Return Value
The class instantiation returns a SimilarityCleaner object. The clean() method returns a List[Dict[str, Any]] containing the deduplicated documents. Each document dictionary retains its original structure with 'id', 'text', and 'embedding' keys. Documents deemed similar are not actually merged: all but one are removed, keeping only the representative (longest-text) document from each similarity group.
Class Interface
Methods
__init__(self, config: Config) -> None
Purpose: Initialize the SimilarityCleaner with configuration settings, extracting the similarity threshold from the config object
Parameters:
config: Configuration object that must contain a similarity_threshold attribute (float between 0 and 1)
Returns: None - initializes the instance
clean(self, documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]
Purpose: Process a list of documents to remove or merge similar ones based on embedding similarity, using union-find algorithm to group similar documents and keeping one representative per group
Parameters:
documents: List of document dictionaries, each must contain 'id' (unique identifier), 'text' (document content), and 'embedding' (numpy array or list of floats representing the document's vector embedding)
Returns: List of deduplicated documents with the same structure as input. Similar documents are merged by keeping only the document with the longest text from each similarity group. Documents not similar to any others are retained unchanged.
Attributes
| Name | Type | Description | Scope |
|---|---|---|---|
| similarity_threshold | float | The cosine similarity threshold (0 to 1) used to determine if two documents are similar enough to be merged. Taken from the config parameter during initialization. Typical values range from 0.7 to 0.95. | instance |
| config | Config | Configuration object passed to the parent BaseCleaner class, containing various settings for the cleaning process | instance |
Dependencies
typing, numpy, src.cleaners.base_cleaner, src.utils.similarity_utils, src.config
Required Imports
```python
from typing import List, Dict, Any
import numpy as np
from src.cleaners.base_cleaner import BaseCleaner
from src.utils.similarity_utils import find_similar_documents
from src.config import Config
```
Usage Example
```python
from src.cleaners.similarity_cleaner import SimilarityCleaner
from src.config import Config
import numpy as np

# Create configuration with similarity threshold
config = Config()
config.similarity_threshold = 0.85

# Initialize the cleaner
cleaner = SimilarityCleaner(config)

# Prepare documents with embeddings
# (doc3's embedding is chosen to be near-orthogonal to the others, so its
# cosine similarity stays well below the 0.85 threshold)
documents = [
    {'id': 'doc1', 'text': 'This is a sample document.', 'embedding': np.array([0.1, 0.2, 0.3])},
    {'id': 'doc2', 'text': 'This is a similar sample document.', 'embedding': np.array([0.11, 0.21, 0.31])},
    {'id': 'doc3', 'text': 'Completely different content here.', 'embedding': np.array([-0.5, 0.8, -0.3])}
]

# Clean the documents
cleaned_docs = cleaner.clean(documents)

# Result: doc1 and doc2 are grouped; only doc2 (longest text) is kept
print(f"Original: {len(documents)} documents, Cleaned: {len(cleaned_docs)} documents")
```
Best Practices
- Always ensure documents have embeddings before calling clean() - the method will raise ValueError if embeddings are missing
- Choose similarity_threshold carefully: values too low (e.g., 0.5) may merge unrelated documents, values too high (e.g., 0.99) may not catch duplicates
- The cleaner keeps the document with the longest text as the representative - ensure this selection criterion fits your use case
- For large document collections, be aware that similarity computation can be O(n²) - consider batching or using approximate nearest neighbor methods
- The union-find algorithm ensures transitive similarity: if A is similar to B and B is similar to C, all three are grouped together
- Review the printed statistics after cleaning to understand the reduction rate and validate the threshold setting
- Documents not part of any similarity group are always retained in the output
- The method modifies the document list by filtering, but does not modify individual document dictionaries
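The O(n²) cost noted above can be reduced in practice by vectorizing the pairwise similarity computation with NumPy rather than looping in Python. A sketch under that assumption (the `cosine_similarity_matrix` name is illustrative, not part of this library):

```python
import numpy as np

def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Compute all pairwise cosine similarities in one matrix product."""
    # L2-normalize each row; the dot product of unit vectors is their cosine
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized = embeddings / norms
    return normalized @ normalized.T

embeddings = np.array([[0.1, 0.2, 0.3],
                       [0.11, 0.21, 0.31],
                       [0.9, 0.8, 0.7]])
sims = cosine_similarity_matrix(embeddings)

# Pairs above a threshold: upper triangle only, excluding the diagonal
i, j = np.where(np.triu(sims >= 0.95, k=1))
```

This still materializes the full n×n matrix, so for very large collections it is a stepping stone toward chunked (batched) matrix products or approximate nearest neighbor indexes, not a final answer.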
Similar Components
AI-powered semantic similarity - components with related functionality:
- class CombinedCleaner (78.5% similar)
- class HashCleaner (70.2% similar)
- class TextClusterer (63.7% similar)
- function find_similar_documents (63.7% similar)
- function test_identical_text_removal (61.6% similar)