class CombinedCleaner
A document cleaner that combines hash-based and similarity-based cleaning approaches to remove both exact and near-duplicate documents in a two-stage process.
File: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/combined_cleaner.py
Lines: 8-45
Complexity: moderate
Purpose
CombinedCleaner provides a comprehensive document deduplication solution by first removing exact duplicates using hash-based comparison, then removing similar documents using similarity metrics. This two-stage approach ensures both computational efficiency (hash-based first pass) and thorough deduplication (similarity-based second pass). It inherits from BaseCleaner and orchestrates HashCleaner and SimilarityCleaner to provide a complete cleaning pipeline with statistics reporting.
Source Code
class CombinedCleaner(BaseCleaner):
    """
    A cleaner that combines multiple cleaning approaches (hash-based and similarity-based).
    """

    def __init__(self, config: Config):
        """
        Initialize the CombinedCleaner with configuration.

        Args:
            config: Configuration object
        """
        super().__init__(config)
        self.hash_cleaner = HashCleaner(config)
        self.similarity_cleaner = SimilarityCleaner(config)

    def clean(self, documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """
        Clean the documents using a combination of methods.

        Args:
            documents: List of document dictionaries

        Returns:
            List of cleaned documents
        """
        # First step: remove exact duplicates using the hash-based approach
        unique_documents = self.hash_cleaner.clean(documents)

        # Second step: remove similar documents using the similarity-based approach
        cleaned_documents = self.similarity_cleaner.clean(unique_documents)

        # Log statistics
        stats = self.get_stats(documents, cleaned_documents)
        print(f"CombinedCleaner: Total reduction: {stats['reduction_percentage']:.2f}% "
              f"(from {stats['original_count']} to {stats['cleaned_count']} documents)")

        return cleaned_documents
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| bases | BaseCleaner | - | - |
Parameter Details
config: A Config object containing configuration settings for both hash-based and similarity-based cleaning operations. This config is passed to both internal cleaners (HashCleaner and SimilarityCleaner) and may include settings like similarity thresholds, hash algorithms, and other cleaning parameters.
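The exact fields of Config live in src.config and are not shown in this file. The sketch below uses hypothetical field names (hash_algorithm, similarity_threshold) purely to illustrate how a single configuration object can feed both internal cleaners; the real Config may expose different settings.

```python
from dataclasses import dataclass

@dataclass
class Config:
    # Hypothetical settings; the real src.config.Config may differ.
    hash_algorithm: str = "sha256"       # assumed to be read by HashCleaner
    similarity_threshold: float = 0.85   # assumed to be read by SimilarityCleaner

config = Config(similarity_threshold=0.9)
# The same object is handed to both cleaners in CombinedCleaner.__init__,
# so a single Config governs both stages of the pipeline.
print(config.hash_algorithm, config.similarity_threshold)
```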
Return Value
The clean() method returns a List[Dict[str, Any]] containing the cleaned documents after both hash-based and similarity-based deduplication. The returned list will be smaller than or equal to the input list, with exact duplicates and similar documents removed. Instantiation returns a CombinedCleaner instance ready to process documents.
Class Interface
Methods
__init__(self, config: Config) -> None
Purpose: Initialize the CombinedCleaner with configuration and create internal hash and similarity cleaner instances
Parameters:
config: Configuration object containing settings for both cleaning approaches
Returns: None (constructor)
clean(self, documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]
Purpose: Clean documents using a two-stage process: first removing exact duplicates via hash-based cleaning, then removing similar documents via similarity-based cleaning
Parameters:
documents: List of document dictionaries to be cleaned and deduplicated
Returns: List of cleaned documents with exact and near-duplicate entries removed. As a side effect, prints statistics showing the reduction percentage and the before/after document counts
get_stats(self, documents: List[Dict[str, Any]], cleaned_documents: List[Dict[str, Any]]) -> Dict[str, Any]
Purpose: Inherited from BaseCleaner - calculates statistics comparing original and cleaned document sets
Parameters:
documents: Original list of documents before cleaning
cleaned_documents: List of documents after cleaning
Returns: Dictionary containing statistics including 'original_count', 'cleaned_count', and 'reduction_percentage'
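BaseCleaner's implementation of get_stats is not shown in this file. A minimal sketch consistent with the three keys listed above, assuming the reduction is expressed as a percentage of the original count, might look like:

```python
from typing import Any, Dict, List

def get_stats(documents: List[Dict[str, Any]],
              cleaned_documents: List[Dict[str, Any]]) -> Dict[str, Any]:
    # Sketch of the inherited statistics helper (not the actual BaseCleaner code).
    original = len(documents)
    cleaned = len(cleaned_documents)
    # Guard against division by zero for an empty input list.
    reduction = 100.0 * (original - cleaned) / original if original else 0.0
    return {
        "original_count": original,
        "cleaned_count": cleaned,
        "reduction_percentage": reduction,
    }

stats = get_stats([{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}],
                  [{"id": 1}, {"id": 4}])
print(stats)
```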
Attributes
| Name | Type | Description | Scope |
|---|---|---|---|
| hash_cleaner | HashCleaner | Instance of HashCleaner used for the first stage of cleaning to remove exact duplicates using hash-based comparison | instance |
| similarity_cleaner | SimilarityCleaner | Instance of SimilarityCleaner used for the second stage of cleaning to remove similar documents using similarity metrics | instance |
| config | Config | Configuration object inherited from BaseCleaner containing settings for cleaning operations | instance |
Dependencies
typing, src.cleaners.base_cleaner, src.cleaners.hash_cleaner, src.cleaners.similarity_cleaner, src.config
Required Imports
from typing import List, Dict, Any
from src.cleaners.base_cleaner import BaseCleaner
from src.cleaners.hash_cleaner import HashCleaner
from src.cleaners.similarity_cleaner import SimilarityCleaner
from src.config import Config
Usage Example
from src.cleaners.combined_cleaner import CombinedCleaner
from src.config import Config
# Initialize configuration
config = Config()
# Create CombinedCleaner instance
cleaner = CombinedCleaner(config)
# Prepare documents for cleaning
documents = [
{'id': 1, 'text': 'This is a document'},
{'id': 2, 'text': 'This is a document'},
{'id': 3, 'text': 'This is a similar document'},
{'id': 4, 'text': 'Completely different content'}
]
# Clean documents (removes exact and similar duplicates)
cleaned_docs = cleaner.clean(documents)
# Output will show statistics:
# CombinedCleaner: Total reduction: X.XX% (from 4 to N documents)
print(f'Cleaned documents: {len(cleaned_docs)}')
Best Practices
- Always instantiate with a properly configured Config object that contains settings for both hash-based and similarity-based cleaning
- The clean() method should be called with a list of document dictionaries; ensure documents have consistent structure
- The cleaning process is sequential: hash-based cleaning runs first, then similarity-based cleaning on the results
- Monitor the printed statistics to understand the effectiveness of the cleaning process
- For large document sets, be aware that similarity-based cleaning (second stage) may be computationally expensive
- The class maintains state through hash_cleaner and similarity_cleaner instances, which are reusable across multiple clean() calls
- Inherits get_stats() method from BaseCleaner for statistics calculation
- The two-stage approach is optimal: hash-based cleaning is fast and reduces the dataset before expensive similarity comparisons
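To illustrate why the ordering matters, here is a standalone approximation of the two stages using only the standard library: a content hash for exact duplicates and difflib.SequenceMatcher for near-duplicates. The actual HashCleaner and SimilarityCleaner may use different algorithms, document fields, and thresholds; this sketch only demonstrates the cheap-first, expensive-second structure.

```python
import hashlib
from difflib import SequenceMatcher
from typing import Any, Dict, List

def two_stage_clean(documents: List[Dict[str, Any]],
                    threshold: float = 0.8) -> List[Dict[str, Any]]:
    # Stage 1: drop exact duplicates via a content hash (cheap, one pass).
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.md5(doc["text"].encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)

    # Stage 2: drop near-duplicates via pairwise similarity. This is
    # quadratic in the number of documents, which is exactly why it runs
    # on the already-reduced set from stage 1.
    kept: List[Dict[str, Any]] = []
    for doc in unique:
        if all(SequenceMatcher(None, doc["text"], k["text"]).ratio() < threshold
               for k in kept):
            kept.append(doc)
    return kept

docs = [
    {"id": 1, "text": "This is a document"},
    {"id": 2, "text": "This is a document"},           # exact duplicate of 1
    {"id": 3, "text": "This is a similar document"},   # near-duplicate of 1
    {"id": 4, "text": "Completely different content"},
]
print([d["id"] for d in two_stage_clean(docs)])
```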
Similar Components
Components with related functionality, ranked by AI-powered semantic similarity:
- class HashCleaner (83.6% similar)
- class SimilarityCleaner (78.5% similar)
- class TestCombinedCleaner (70.3% similar)
- class BaseCleaner (69.7% similar)
- function main_v51 (62.9% similar)