
class CombinedCleaner

Maturity: 49

A document cleaner that combines hash-based and similarity-based cleaning approaches to remove both exact and near-duplicate documents in a two-stage process.

File: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/combined_cleaner.py
Lines: 8 - 45
Complexity: moderate

Purpose

CombinedCleaner deduplicates documents in two stages: it first removes exact duplicates using hash-based comparison, then removes similar documents using similarity metrics. Running the cheap hash-based pass first shrinks the input before the more expensive similarity-based pass, which then catches the near-duplicates that hashing cannot detect. The class inherits from BaseCleaner and orchestrates a HashCleaner and a SimilarityCleaner, printing reduction statistics after each run.
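
The two-stage idea can be sketched without the project's own classes. The block below is a minimal illustration only, not the HashCleaner or SimilarityCleaner implementation: the function names are hypothetical, SHA-256 is an assumed hash, and difflib's string ratio stands in for the embedding-based similarity the project's SimilarityCleaner is described as using. The 0.75 threshold is likewise an arbitrary choice for the example.

```python
import hashlib
from difflib import SequenceMatcher
from typing import Any, Dict, List


def remove_exact_duplicates(docs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Stage 1 (hypothetical): keep the first document for each distinct text hash."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def remove_near_duplicates(docs: List[Dict[str, Any]],
                           threshold: float = 0.75) -> List[Dict[str, Any]]:
    """Stage 2 (hypothetical): drop a document that is too similar to one already kept.

    String similarity is a stand-in here; the real cleaner is described as
    comparing embedding vectors.
    """
    kept: List[Dict[str, Any]] = []
    for doc in docs:
        if all(SequenceMatcher(None, doc["text"], k["text"]).ratio() < threshold
               for k in kept):
            kept.append(doc)
    return kept


def two_stage_clean(docs: List[Dict[str, Any]],
                    threshold: float = 0.75) -> List[Dict[str, Any]]:
    """Run the cheap exact pass first, then the pairwise similarity pass."""
    return remove_near_duplicates(remove_exact_duplicates(docs), threshold)


documents = [
    {"id": 1, "text": "This is a document"},
    {"id": 2, "text": "This is a document"},
    {"id": 3, "text": "This is a similar document"},
    {"id": 4, "text": "Completely different content"},
]
cleaned = two_stage_clean(documents)
print([d["id"] for d in cleaned])  # [1, 4]
```

In this sketch document 2 is removed by the hash pass and document 3 by the similarity pass, so the second stage only has to compare three documents instead of four.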

Source Code

class CombinedCleaner(BaseCleaner):
    """
    A cleaner that combines multiple cleaning approaches (hash-based and similarity-based).
    """
    
    def __init__(self, config: Config):
        """
        Initialize the CombinedCleaner with configuration.
        
        Args:
            config: Configuration object
        """
        super().__init__(config)
        self.hash_cleaner = HashCleaner(config)
        self.similarity_cleaner = SimilarityCleaner(config)
    
    def clean(self, documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """
        Clean the documents using a combination of methods.
        
        Args:
            documents: List of document dictionaries
            
        Returns:
            List of cleaned documents
        """
        # First step: Remove exact duplicates using hash-based approach
        unique_documents = self.hash_cleaner.clean(documents)
        
        # Second step: Remove similar documents using similarity-based approach
        cleaned_documents = self.similarity_cleaner.clean(unique_documents)
        
        # Log statistics
        stats = self.get_stats(documents, cleaned_documents)
        print(f"CombinedCleaner: Total reduction: {stats['reduction_percentage']:.2f}% "
              f"(from {stats['original_count']} to {stats['cleaned_count']} documents)")
        
        return cleaned_documents

Parameters

Name | Type | Default | Kind
bases | BaseCleaner | - |

Parameter Details

config: A Config object containing configuration settings for both hash-based and similarity-based cleaning operations. This config is passed to both internal cleaners (HashCleaner and SimilarityCleaner) and may include settings like similarity thresholds, hash algorithms, and other cleaning parameters.

Return Value

The clean() method returns a List[Dict[str, Any]] containing the documents that remain after both hash-based and similarity-based deduplication. The returned list is never longer than the input list: exact duplicates and similar documents have been removed. Instantiation returns a CombinedCleaner instance ready to process documents.

Class Interface

Methods

__init__(self, config: Config) -> None

Purpose: Initialize the CombinedCleaner with configuration and create internal hash and similarity cleaner instances

Parameters:

  • config: Configuration object containing settings for both cleaning approaches

Returns: None (constructor)

clean(self, documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]

Purpose: Clean documents using a two-stage process: first removing exact duplicates via hash-based cleaning, then removing similar documents via similarity-based cleaning

Parameters:

  • documents: List of document dictionaries to be cleaned and deduplicated

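Returns: List of cleaned documents with exact and similar duplicates removed. As a side effect, the method prints one line of statistics showing the reduction percentage and the original and cleaned counts; the statistics are not part of the return value.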
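
Because the statistics go through print() rather than the return value, a caller who wants to keep or silence that line can redirect standard output around the call. The sketch below uses a hypothetical stand-in function, clean_stub, that only mimics the printed format shown in the source code above; with the real class, the same pattern would wrap cleaner.clean(documents).

```python
import contextlib
import io
from typing import Any, Dict, List


def clean_stub(documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical stand-in for CombinedCleaner.clean().

    It removes exact text duplicates and prints a summary in the same
    format as the real method, which is all this example needs.
    """
    seen, cleaned = set(), []
    for doc in documents:
        if doc["text"] not in seen:
            seen.add(doc["text"])
            cleaned.append(doc)
    reduction = (1 - len(cleaned) / len(documents)) * 100 if documents else 0.0
    print(f"CombinedCleaner: Total reduction: {reduction:.2f}% "
          f"(from {len(documents)} to {len(cleaned)} documents)")
    return cleaned


# Capture whatever the call prints instead of letting it reach the console.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    cleaned = clean_stub([{"text": "a"}, {"text": "a"}, {"text": "b"}])

summary = buffer.getvalue().strip()
# summary == "CombinedCleaner: Total reduction: 33.33% (from 3 to 2 documents)"
```

The captured string can then be forwarded to a logger or discarded, without changing the cleaner itself.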

get_stats(self, documents: List[Dict[str, Any]], cleaned_documents: List[Dict[str, Any]]) -> Dict[str, Any]

Purpose: Inherited from BaseCleaner - calculates statistics comparing original and cleaned document sets

Parameters:

  • documents: Original list of documents before cleaning
  • cleaned_documents: List of documents after cleaning

Returns: Dictionary containing statistics including 'original_count', 'cleaned_count', and 'reduction_percentage'
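
The BaseCleaner source is not shown on this page, so the following is only a guess at what a get_stats() with these three keys might compute. The zero-division guard for an empty input is an assumption, and the real method may return additional keys.

```python
from typing import Any, Dict, List


def get_stats(documents: List[Dict[str, Any]],
              cleaned_documents: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical re-implementation of the inherited statistics helper."""
    original_count = len(documents)
    cleaned_count = len(cleaned_documents)
    removed = original_count - cleaned_count
    # Assumption: report 0% for an empty input rather than dividing by zero.
    reduction = (removed / original_count * 100) if original_count else 0.0
    return {
        "original_count": original_count,
        "cleaned_count": cleaned_count,
        "reduction_percentage": reduction,
    }


stats = get_stats([{"id": i} for i in range(4)], [{"id": 0}, {"id": 3}])
print(f"Total reduction: {stats['reduction_percentage']:.2f}% "
      f"(from {stats['original_count']} to {stats['cleaned_count']} documents)")
# prints: Total reduction: 50.00% (from 4 to 2 documents)
```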

Attributes

Name | Type | Description | Scope
hash_cleaner | HashCleaner | Instance of HashCleaner used for the first stage of cleaning to remove exact duplicates using hash-based comparison | instance
similarity_cleaner | SimilarityCleaner | Instance of SimilarityCleaner used for the second stage of cleaning to remove similar documents using similarity metrics | instance
config | Config | Configuration object inherited from BaseCleaner containing settings for cleaning operations | instance

Dependencies

  • typing
  • src.cleaners.base_cleaner
  • src.cleaners.hash_cleaner
  • src.cleaners.similarity_cleaner
  • src.config

Required Imports

from typing import List, Dict, Any
from src.cleaners.base_cleaner import BaseCleaner
from src.cleaners.hash_cleaner import HashCleaner
from src.cleaners.similarity_cleaner import SimilarityCleaner
from src.config import Config

Usage Example

from src.cleaners.combined_cleaner import CombinedCleaner
from src.config import Config

# Initialize configuration
config = Config()

# Create CombinedCleaner instance
cleaner = CombinedCleaner(config)

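# Prepare documents for cleaning.
# Note: SimilarityCleaner is described as embedding-based, so real documents
# may also need an embedding field; the exact keys depend on the cleaners.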
documents = [
    {'id': 1, 'text': 'This is a document'},
    {'id': 2, 'text': 'This is a document'},
    {'id': 3, 'text': 'This is a similar document'},
    {'id': 4, 'text': 'Completely different content'}
]

# Clean documents (removes exact and similar duplicates)
cleaned_docs = cleaner.clean(documents)

# Output will show statistics:
# CombinedCleaner: Total reduction: X.XX% (from 4 to N documents)
print(f'Cleaned documents: {len(cleaned_docs)}')

Best Practices

  • Always instantiate with a properly configured Config object that contains settings for both hash-based and similarity-based cleaning
  • The clean() method should be called with a list of document dictionaries; ensure documents have consistent structure
  • The cleaning process is sequential: hash-based cleaning runs first, then similarity-based cleaning on the results
  • Monitor the printed statistics to understand the effectiveness of the cleaning process
  • For large document sets, be aware that similarity-based cleaning (second stage) may be computationally expensive
  • The hash_cleaner and similarity_cleaner instances are created once in __init__ and reused, so a single CombinedCleaner can serve multiple clean() calls
  • Inherits get_stats() method from BaseCleaner for statistics calculation
  • The stage order is deliberate: hash-based cleaning is fast and shrinks the dataset before the more expensive similarity comparisons run
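
To get a feel for why the hash pass is worth running first, consider a similarity stage that compares every pair of documents. This is an assumption made for illustration (the SimilarityCleaner algorithm is not shown here and may be smarter than all-pairs), and the document counts below are made up.

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n documents: n * (n - 1) / 2."""
    return n * (n - 1) // 2


total_documents = 10_000   # hypothetical collection size
exact_duplicates = 3_000   # hypothetical number removed by the hash pass

before = pairwise_comparisons(total_documents)
after = pairwise_comparisons(total_documents - exact_duplicates)
saved = before - after

print(f"All-pairs comparisons without hash pass: {before:,}")  # 49,995,000
print(f"All-pairs comparisons after hash pass:   {after:,}")   # 24,496,500
print(f"Comparisons avoided: {saved / before:.0%}")            # 51%
```

Under these assumed numbers, removing 30% of the documents in the linear-time hash pass roughly halves the quadratic work left for the similarity pass.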

Similar Components

AI-powered semantic similarity - components with related functionality:

  • class HashCleaner 83.6% similar

    A document deduplication cleaner that removes documents with identical content by comparing hash values of document text.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/hash_cleaner.py
  • class SimilarityCleaner 78.5% similar

    A document cleaning class that identifies and removes duplicate or highly similar documents based on embedding vector similarity, keeping only representative documents from each similarity group.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/similarity_cleaner.py
  • class TestCombinedCleaner 70.3% similar

    A unittest test class that validates the functionality of the CombinedCleaner class, testing its ability to remove duplicate and similar texts from collections.

    From: /tf/active/vicechatdev/chromadb-cleanup/tests/test_combined_cleaner.py
  • class BaseCleaner 69.7% similar

    Abstract base class that defines the interface for document cleaning implementations, providing methods to remove redundancy from document collections and track cleaning statistics.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/base_cleaner.py
  • function main_v51 62.9% similar

    Command-line interface function that orchestrates a ChromaDB collection cleaning pipeline by removing duplicate and similar documents through hashing and similarity screening.

    From: /tf/active/vicechatdev/chromadb-cleanup/main copy.py