class HashCleaner
A document deduplication cleaner that removes documents with identical content by comparing hash values of document text.
File: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/hash_cleaner.py
Lines: 7-36
Complexity: simple
Purpose
HashCleaner is responsible for identifying and removing exact duplicate documents from a collection. It inherits from BaseCleaner and uses hash-based comparison to efficiently detect documents with identical text content. The cleaner processes a list of documents, identifies duplicates, removes them, and provides statistics about the deduplication process. This is useful in data preprocessing pipelines where duplicate content needs to be eliminated before further processing or analysis.
Source Code
class HashCleaner(BaseCleaner):
    """Cleaner that removes documents with identical content using hash values."""

    def __init__(self, config: Config):
        """
        Initialize the HashCleaner with configuration.

        Args:
            config: Configuration object
        """
        super().__init__(config)

    def clean(self, documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """
        Clean the documents by removing exact duplicates.

        Args:
            documents: List of document dictionaries with 'id' and 'text' keys

        Returns:
            List of documents with duplicates removed
        """
        unique_docs, duplicate_docs = get_unique_documents(documents)

        # Log statistics
        stats = self.get_stats(documents, unique_docs)
        print(f"HashCleaner: Removed {stats['original_count'] - stats['cleaned_count']} duplicates "
              f"({stats['reduction_percentage']:.2f}% reduction)")

        return unique_docs
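The helper get_unique_documents comes from src.utils.hash_utils and its body is not shown on this page. Only the (unique_docs, duplicate_docs) return shape is certain from the call site above; the following is a minimal sketch of a plausible implementation, where the SHA-256 choice and all internals are assumptions:

import hashlib
from typing import Any, Dict, List, Tuple

def get_unique_documents(
    documents: List[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Hypothetical sketch: split documents into (unique, duplicates) by text hash."""
    seen = set()
    unique: List[Dict[str, Any]] = []
    duplicates: List[Dict[str, Any]] = []
    for doc in documents:
        # Identical text yields an identical digest, so repeats are cheap to spot.
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates.append(doc)
        else:
            seen.add(digest)
            unique.append(doc)  # first occurrence is kept
    return unique, duplicates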
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| bases | BaseCleaner | - | - |
Parameter Details
config: A Config object containing configuration settings for the cleaner. This is passed to the parent BaseCleaner class and may contain settings related to logging, processing options, or other cleaner-specific configurations.
Return Value
The constructor returns a HashCleaner instance. The clean() method returns a List[Dict[str, Any]] containing only unique documents with duplicates removed. Each document dictionary must have at least 'id' and 'text' keys. The returned list maintains the original document structure but excludes any documents that have identical text content to previously seen documents.
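A quick illustration of the first-occurrence guarantee described above (a sketch; it assumes Config() can be constructed with no arguments, as in the usage example further down):

from src.cleaners.hash_cleaner import HashCleaner
from src.config import Config

docs = [
    {'id': 'a', 'text': 'same text'},
    {'id': 'b', 'text': 'same text'},  # exact duplicate of 'a'
]
unique = HashCleaner(Config()).clean(docs)
# Only the first occurrence survives; the surviving dict is unchanged.
assert [d['id'] for d in unique] == ['a']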
Class Interface
Methods
__init__(self, config: Config) -> None
Purpose: Initialize the HashCleaner with configuration settings
Parameters:
config: Configuration object containing settings for the cleaner, passed to parent BaseCleaner
Returns: None - constructor initializes the instance
clean(self, documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]
Purpose: Remove exact duplicate documents from the input list based on content hash comparison
Parameters:
documents: List of document dictionaries, each must contain 'id' and 'text' keys. The 'text' key is used for hash-based duplicate detection
Returns: List of unique documents with duplicates removed. Returns only the first occurrence of each unique document based on text content hash
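clean() also relies on get_stats(), which is inherited from BaseCleaner and not documented on this page. Its real implementation is not shown; the keys referenced in the print statement imply at least the shape below, reconstructed here as a hedged sketch rather than the actual BaseCleaner code:

def get_stats(self, original_docs, cleaned_docs):
    """Hypothetical reconstruction of BaseCleaner.get_stats()."""
    original_count = len(original_docs)
    cleaned_count = len(cleaned_docs)
    reduction = (
        100.0 * (original_count - cleaned_count) / original_count
        if original_count
        else 0.0
    )
    return {
        'original_count': original_count,
        'cleaned_count': cleaned_count,
        'reduction_percentage': reduction,
    }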
Attributes
| Name | Type | Description | Scope |
|---|---|---|---|
| config | Config | Configuration object inherited from BaseCleaner; stores settings for the cleaner's operation | instance |
Dependencies
typing, src.cleaners.base_cleaner, src.utils.hash_utils, src.config
Required Imports
from typing import List, Dict, Any
from src.cleaners.base_cleaner import BaseCleaner
from src.utils.hash_utils import get_unique_documents
from src.config import Config
Usage Example
from src.cleaners.hash_cleaner import HashCleaner
from src.config import Config
# Initialize configuration
config = Config()
# Create HashCleaner instance
cleaner = HashCleaner(config)
# Prepare documents for cleaning
documents = [
    {'id': '1', 'text': 'This is document one'},
    {'id': '2', 'text': 'This is document two'},
    {'id': '3', 'text': 'This is document one'},  # duplicate
    {'id': '4', 'text': 'This is document three'}
]
# Clean documents to remove duplicates
unique_documents = cleaner.clean(documents)
# Result will contain 3 unique documents
print(f"Original: {len(documents)}, Unique: {len(unique_documents)}")
Best Practices
- Always ensure documents have both 'id' and 'text' keys before passing to clean() method
- The clean() method returns the filtered list produced by get_unique_documents(); the source shown above does not mutate the input in place, but if downstream code relies on the original list, keep your own reference or copy it before cleaning
- The order of documents matters: the first occurrence of duplicate content is kept, subsequent duplicates are removed
- Monitor the console output for deduplication statistics to understand the impact on your dataset
- Instantiate the cleaner once and reuse it for multiple clean() calls if processing multiple batches
- The Config object should be properly initialized before passing to the constructor
- This cleaner uses exact hash matching, so even minor differences in text (extra whitespace, different casing) produce different hashes; if such variants should count as duplicates, normalize the text first, as shown in the sketch after this list
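A minimal pre-processing sketch for that last point, continuing the usage example above; the normalization rules are illustrative and not part of HashCleaner:

def normalize_text(doc):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    doc = dict(doc)  # shallow copy: leave the caller's dict untouched
    doc['text'] = ' '.join(doc['text'].lower().split())
    return doc

normalized = [normalize_text(d) for d in documents]
unique_documents = cleaner.clean(normalized)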
Similar Components
AI-powered semantic similarity - components with related functionality:
- class CombinedCleaner (83.6% similar)
- class SimilarityCleaner (70.2% similar)
- function test_identical_chunks_with_different_cases (65.9% similar)
- function identify_duplicates (64.8% similar)
- function test_remove_identical_chunks (64.8% similar)