🔍 Code Extractor

class HashCleaner

Maturity: 47

A document deduplication cleaner that removes documents with identical content by comparing hash values of document text.

File:
/tf/active/vicechatdev/chromadb-cleanup/src/cleaners/hash_cleaner.py
Lines:
7 - 36
Complexity:
simple

Purpose

HashCleaner is responsible for identifying and removing exact duplicate documents from a collection. It inherits from BaseCleaner and uses hash-based comparison to efficiently detect documents with identical text content. The cleaner processes a list of documents, identifies duplicates, removes them, and provides statistics about the deduplication process. This is useful in data preprocessing pipelines where duplicate content needs to be eliminated before further processing or analysis.

Source Code

class HashCleaner(BaseCleaner):
    """Cleaner that removes documents with identical content using hash values."""
    
    def __init__(self, config: Config):
        """
        Initialize the HashCleaner with configuration.
        
        Args:
            config: Configuration object
        """
        super().__init__(config)
    
    def clean(self, documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """
        Clean the documents by removing exact duplicates.
        
        Args:
            documents: List of document dictionaries with 'id' and 'text' keys
            
        Returns:
            List of documents with duplicates removed
        """
        unique_docs, duplicate_docs = get_unique_documents(documents)
        
        # Log statistics
        stats = self.get_stats(documents, unique_docs)
        print(f"HashCleaner: Removed {stats['original_count'] - stats['cleaned_count']} duplicates "
              f"({stats['reduction_percentage']:.2f}% reduction)")
        
        return unique_docs

Parameters

Name: bases
Type: BaseCleaner
Default: -

Parameter Details

config: A Config object containing configuration settings for the cleaner. This is passed to the parent BaseCleaner class and may contain settings related to logging, processing options, or other cleaner-specific configurations.

Return Value

The constructor returns a HashCleaner instance. The clean() method returns a List[Dict[str, Any]] containing only unique documents with duplicates removed. Each document dictionary must have at least 'id' and 'text' keys. The returned list maintains the original document structure but excludes any documents that have identical text content to previously seen documents.
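The deduplication itself is delegated to get_unique_documents, whose source is not shown on this page. The following is a minimal sketch consistent with the documented behavior (first occurrence kept, later identical texts collected as duplicates); the use of SHA-256 is an assumption, since the actual digest function lives in src.utils.hash_utils.

```python
import hashlib
from typing import Any, Dict, List, Tuple

def get_unique_documents(
    documents: List[Dict[str, Any]]
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Split documents into (unique, duplicates) by hashing their 'text'.

    The first occurrence of each distinct text is kept; subsequent
    documents with identical text are collected as duplicates.
    """
    seen: set = set()
    unique_docs: List[Dict[str, Any]] = []
    duplicate_docs: List[Dict[str, Any]] = []
    for doc in documents:
        # Hash the raw text; any stable digest would serve here.
        digest = hashlib.sha256(doc['text'].encode('utf-8')).hexdigest()
        if digest in seen:
            duplicate_docs.append(doc)
        else:
            seen.add(digest)
            unique_docs.append(doc)
    return unique_docs, duplicate_docs
```

Because membership checks against the set of seen hashes are O(1), the whole pass is linear in the number of documents.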

Class Interface

Methods

__init__(self, config: Config) -> None

Purpose: Initialize the HashCleaner with configuration settings

Parameters:

  • config: Configuration object containing settings for the cleaner, passed to parent BaseCleaner

Returns: None - constructor initializes the instance

clean(self, documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]

Purpose: Remove exact duplicate documents from the input list based on content hash comparison

Parameters:

  • documents: List of document dictionaries, each must contain 'id' and 'text' keys. The 'text' key is used for hash-based duplicate detection

Returns: List of unique documents with duplicates removed. Returns only the first occurrence of each unique document based on text content hash
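clean() also calls get_stats(), inherited from BaseCleaner and not shown on this page. As a hypothetical reconstruction limited to the three keys clean() actually reads (original_count, cleaned_count, reduction_percentage), it might look like:

```python
from typing import Any, Dict, List

def get_stats(original: List[Dict[str, Any]],
              cleaned: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Summarize a cleaning pass; keys match those used in clean()."""
    original_count = len(original)
    cleaned_count = len(cleaned)
    # Guard against division by zero on an empty input list.
    reduction = (
        (original_count - cleaned_count) / original_count * 100
        if original_count else 0.0
    )
    return {
        'original_count': original_count,
        'cleaned_count': cleaned_count,
        'reduction_percentage': reduction,
    }
```

The actual BaseCleaner implementation may track additional fields; only the three keys above are required by the print statement in clean().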

Attributes

Name: config
Type: Config
Scope: instance
Description: Configuration object inherited from BaseCleaner; stores settings for the cleaner's operation

Dependencies

  • typing
  • src.cleaners.base_cleaner
  • src.utils.hash_utils
  • src.config

Required Imports

from typing import List, Dict, Any
from src.cleaners.base_cleaner import BaseCleaner
from src.utils.hash_utils import get_unique_documents
from src.config import Config

Usage Example

from src.cleaners.hash_cleaner import HashCleaner
from src.config import Config

# Initialize configuration
config = Config()

# Create HashCleaner instance
cleaner = HashCleaner(config)

# Prepare documents for cleaning
documents = [
    {'id': '1', 'text': 'This is document one'},
    {'id': '2', 'text': 'This is document two'},
    {'id': '3', 'text': 'This is document one'},  # duplicate
    {'id': '4', 'text': 'This is document three'}
]

# Clean documents to remove duplicates
unique_documents = cleaner.clean(documents)

# Result will contain 3 unique documents
print(f"Original: {len(documents)}, Unique: {len(unique_documents)}")

Best Practices

  • Always ensure documents have both 'id' and 'text' keys before passing to clean() method
  • clean() returns a new filtered list rather than mutating its input (at least in the code shown); note that the returned list references the same document dictionaries, so mutating a returned document also affects the original
  • The order of documents matters: the first occurrence of duplicate content is kept, subsequent duplicates are removed
  • Monitor the console output for deduplication statistics to understand the impact on your dataset
  • Instantiate the cleaner once and reuse it for multiple clean() calls if processing multiple batches
  • The Config object should be properly initialized before passing to the constructor
  • This cleaner uses exact hash matching, so even minor differences in text will result in different hashes
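The last point above can be demonstrated directly: exact hashing is case-sensitive, so texts differing only in case are kept as distinct documents. The snippet below uses SHA-256 as a stand-in digest (an assumption; the real digest lives in src.utils.hash_utils) and shows one way to normalize text first if case-insensitive deduplication is wanted.

```python
import hashlib

def text_hash(text: str) -> str:
    # SHA-256 is a stand-in; any stable digest behaves the same way.
    return hashlib.sha256(text.encode('utf-8')).hexdigest()

# Case difference -> different hash -> both documents are kept.
assert text_hash('Hello World') != text_hash('hello world')

# For case-insensitive matching, normalize 'text' before cleaning:
documents = [
    {'id': '1', 'text': 'Hello World'},
    {'id': '2', 'text': 'hello world'},
]
normalized = [{**d, 'text': d['text'].casefold()} for d in documents]
assert text_hash(normalized[0]['text']) == text_hash(normalized[1]['text'])
```

casefold() is preferred over lower() for aggressive, locale-independent case normalization; whitespace stripping or Unicode normalization could be added in the same pre-processing step.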

Similar Components

AI-powered semantic similarity - components with related functionality:

  • class CombinedCleaner 83.6% similar

    A document cleaner that combines hash-based and similarity-based cleaning approaches to remove both exact and near-duplicate documents in a two-stage process.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/combined_cleaner.py
  • class SimilarityCleaner 70.2% similar

    A document cleaning class that identifies and removes duplicate or highly similar documents based on embedding vector similarity, keeping only representative documents from each similarity group.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/similarity_cleaner.py
  • function test_identical_chunks_with_different_cases 65.9% similar

    A unit test function that verifies the HashCleaner's ability to remove duplicate text chunks while being case-sensitive, ensuring that strings differing only in case are treated as distinct entries.

    From: /tf/active/vicechatdev/chromadb-cleanup/tests/test_hash_cleaner.py
  • function identify_duplicates 64.8% similar

    Identifies duplicate documents by computing hash values of their text content and grouping documents with identical hashes.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/utils/hash_utils.py
  • function test_remove_identical_chunks 64.8% similar

    A pytest test function that verifies the HashCleaner's ability to remove duplicate text chunks from a list while preserving order and unique entries.

    From: /tf/active/vicechatdev/chromadb-cleanup/tests/test_hash_cleaner.py