class CombinedCleaner
A document cleaner that combines hash-based and similarity-based cleaning approaches to remove both exact and near-duplicate documents in a two-stage process.
File: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/combined_cleaner.py
Lines: 8-45
Complexity: moderate
Purpose
CombinedCleaner provides a comprehensive document deduplication solution by first removing exact duplicates using hash-based comparison, then removing similar documents using similarity metrics. This two-stage approach ensures both computational efficiency (hash-based first pass) and thorough deduplication (similarity-based second pass). It inherits from BaseCleaner and orchestrates HashCleaner and SimilarityCleaner to provide a complete cleaning pipeline with statistics reporting.
Source Code
class CombinedCleaner(BaseCleaner):
    """
    A cleaner that combines multiple cleaning approaches (hash-based and similarity-based).
    """

    def __init__(self, config: Config):
        """
        Initialize the CombinedCleaner with configuration.

        Args:
            config: Configuration object
        """
        super().__init__(config)
        self.hash_cleaner = HashCleaner(config)
        self.similarity_cleaner = SimilarityCleaner(config)

    def clean(self, documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """
        Clean the documents using a combination of methods.

        Args:
            documents: List of document dictionaries

        Returns:
            List of cleaned documents
        """
        # First step: remove exact duplicates using the hash-based approach
        unique_documents = self.hash_cleaner.clean(documents)

        # Second step: remove similar documents using the similarity-based approach
        cleaned_documents = self.similarity_cleaner.clean(unique_documents)

        # Log statistics
        stats = self.get_stats(documents, cleaned_documents)
        print(f"CombinedCleaner: Total reduction: {stats['reduction_percentage']:.2f}% "
              f"(from {stats['original_count']} to {stats['cleaned_count']} documents)")

        return cleaned_documents
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| bases | BaseCleaner | - | - |
Parameter Details
config: A Config object containing configuration settings for both hash-based and similarity-based cleaning operations. This config is passed to both internal cleaners (HashCleaner and SimilarityCleaner) and may include settings like similarity thresholds, hash algorithms, and other cleaning parameters.
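The exact fields of Config live in src.config and are not shown in this file. The sketch below uses hypothetical field names (hash_algorithm, similarity_threshold) purely to illustrate how a single configuration object can feed both internal cleaners; the real Config may expose different settings.

```python
from dataclasses import dataclass

@dataclass
class Config:
    # Hypothetical settings; the real src.config.Config may differ.
    hash_algorithm: str = "sha256"       # assumed to be read by HashCleaner
    similarity_threshold: float = 0.85   # assumed to be read by SimilarityCleaner

config = Config(similarity_threshold=0.9)
# The same object is handed to both cleaners in CombinedCleaner.__init__,
# so a single Config governs both stages of the pipeline.
print(config.hash_algorithm, config.similarity_threshold)
```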
Return Value
The clean() method returns a List[Dict[str, Any]] containing the cleaned documents after both hash-based and similarity-based deduplication. The returned list will be smaller than or equal to the input list, with exact duplicates and similar documents removed. Instantiation returns a CombinedCleaner instance ready to process documents.
Class Interface
Methods
__init__(self, config: Config) -> None
Purpose: Initialize the CombinedCleaner with configuration and create internal hash and similarity cleaner instances
Parameters:
config: Configuration object containing settings for both cleaning approaches
Returns: None (constructor)
clean(self, documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]
Purpose: Clean documents using a two-stage process: first removing exact duplicates via hash-based cleaning, then removing similar documents via similarity-based cleaning
Parameters:
documents: List of document dictionaries to be cleaned and deduplicated
Returns: List of cleaned documents with exact and near-duplicate entries removed. As a side effect, prints statistics showing the reduction percentage and the before/after document counts
get_stats(self, documents: List[Dict[str, Any]], cleaned_documents: List[Dict[str, Any]]) -> Dict[str, Any]
Purpose: Inherited from BaseCleaner - calculates statistics comparing original and cleaned document sets
Parameters:
documents: Original list of documents before cleaning
cleaned_documents: List of documents after cleaning
Returns: Dictionary containing statistics including 'original_count', 'cleaned_count', and 'reduction_percentage'
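BaseCleaner's implementation of get_stats is not shown in this file. A minimal sketch consistent with the three keys listed above, assuming the reduction is expressed as a percentage of the original count, might look like:

```python
from typing import Any, Dict, List

def get_stats(documents: List[Dict[str, Any]],
              cleaned_documents: List[Dict[str, Any]]) -> Dict[str, Any]:
    # Sketch of the inherited statistics helper (not the actual BaseCleaner code).
    original = len(documents)
    cleaned = len(cleaned_documents)
    # Guard against division by zero for an empty input list.
    reduction = 100.0 * (original - cleaned) / original if original else 0.0
    return {
        "original_count": original,
        "cleaned_count": cleaned,
        "reduction_percentage": reduction,
    }

stats = get_stats([{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}],
                  [{"id": 1}, {"id": 4}])
print(stats)
```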
Attributes
| Name | Type | Description | Scope |
|---|---|---|---|
| hash_cleaner | HashCleaner | Instance of HashCleaner used for the first stage of cleaning to remove exact duplicates using hash-based comparison | instance |
| similarity_cleaner | SimilarityCleaner | Instance of SimilarityCleaner used for the second stage of cleaning to remove similar documents using similarity metrics | instance |
| config | Config | Configuration object inherited from BaseCleaner containing settings for cleaning operations | instance |
Dependencies
typing, src.cleaners.base_cleaner, src.cleaners.hash_cleaner, src.cleaners.similarity_cleaner, src.config
Required Imports
from typing import List, Dict, Any
from src.cleaners.base_cleaner import BaseCleaner
from src.cleaners.hash_cleaner import HashCleaner
from src.cleaners.similarity_cleaner import SimilarityCleaner
from src.config import Config
Usage Example
from src.cleaners.combined_cleaner import CombinedCleaner
from src.config import Config
# Initialize configuration
config = Config()
# Create CombinedCleaner instance
cleaner = CombinedCleaner(config)
# Prepare documents for cleaning
documents = [
{'id': 1, 'text': 'This is a document'},
{'id': 2, 'text': 'This is a document'},
{'id': 3, 'text': 'This is a similar document'},
{'id': 4, 'text': 'Completely different content'}
]
# Clean documents (removes exact and similar duplicates)
cleaned_docs = cleaner.clean(documents)
# Output will show statistics:
# CombinedCleaner: Total reduction: X.XX% (from 4 to N documents)
print(f'Cleaned documents: {len(cleaned_docs)}')
Best Practices
- Always instantiate with a properly configured Config object that contains settings for both hash-based and similarity-based cleaning
- The clean() method should be called with a list of document dictionaries; ensure documents have consistent structure
- The cleaning process is sequential: hash-based cleaning runs first, then similarity-based cleaning on the results
- Monitor the printed statistics to understand the effectiveness of the cleaning process
- For large document sets, be aware that similarity-based cleaning (second stage) may be computationally expensive
- The class maintains state through hash_cleaner and similarity_cleaner instances, which are reusable across multiple clean() calls
- Inherits get_stats() method from BaseCleaner for statistics calculation
- The two-stage approach is optimal: hash-based cleaning is fast and reduces the dataset before expensive similarity comparisons
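To illustrate why the ordering matters, here is a standalone approximation of the two stages using only the standard library: a content hash for exact duplicates and difflib.SequenceMatcher for near-duplicates. The actual HashCleaner and SimilarityCleaner may use different algorithms, document fields, and thresholds; this sketch only demonstrates the cheap-first, expensive-second structure.

```python
import hashlib
from difflib import SequenceMatcher
from typing import Any, Dict, List

def two_stage_clean(documents: List[Dict[str, Any]],
                    threshold: float = 0.8) -> List[Dict[str, Any]]:
    # Stage 1: drop exact duplicates via a content hash (cheap, one pass).
    seen, unique = set(), []
    for doc in documents:
        digest = hashlib.md5(doc["text"].encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)

    # Stage 2: drop near-duplicates via pairwise similarity. This is
    # quadratic in the number of documents, which is exactly why it runs
    # on the already-reduced set from stage 1.
    kept: List[Dict[str, Any]] = []
    for doc in unique:
        if all(SequenceMatcher(None, doc["text"], k["text"]).ratio() < threshold
               for k in kept):
            kept.append(doc)
    return kept

docs = [
    {"id": 1, "text": "This is a document"},
    {"id": 2, "text": "This is a document"},           # exact duplicate of 1
    {"id": 3, "text": "This is a similar document"},   # near-duplicate of 1
    {"id": 4, "text": "Completely different content"},
]
print([d["id"] for d in two_stage_clean(docs)])
```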
Similar Components
Components with related functionality, ranked by AI-powered semantic similarity:
- class HashCleaner (83.6% similar)
- class SimilarityCleaner (78.5% similar)
- class TestCombinedCleaner (70.3% similar)
- class BaseCleaner (69.7% similar)
- function main_v51 (62.9% similar)