class BaseCleaner
Abstract base class that defines the interface for document cleaning implementations, providing methods to remove redundancy from document collections and track cleaning statistics.
/tf/active/vicechatdev/chromadb-cleanup/src/cleaners/base_cleaner.py
6 - 46
simple
Purpose
BaseCleaner serves as an abstract foundation for implementing various document cleaning strategies. It enforces a consistent interface across all cleaner implementations through the abstract 'clean' method, while providing utility functionality for tracking cleaning statistics. Subclasses must implement the clean method to define specific redundancy removal logic. This class is designed to work with document dictionaries and integrates with a configuration system for customizable behavior.
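As a minimal sketch of how the abstract `clean` method enforces the interface, the snippet below mirrors the class shape with a stand-in. `StandInCleaner`, `PassThroughCleaner`, and `can_instantiate` are hypothetical names used only for illustration, and a plain dict stands in for the project's `Config`. It shows that Python refuses to instantiate a class whose abstract method is unimplemented, while a subclass that defines `clean` can be created normally.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class StandInCleaner(ABC):
    """Stand-in mirroring BaseCleaner's shape; not the project's class."""

    def __init__(self, config: Any):
        self.config = config

    @abstractmethod
    def clean(self, documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        ...


class PassThroughCleaner(StandInCleaner):
    """Hypothetical subclass that satisfies the interface without removing anything."""

    def clean(self, documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        return list(documents)


def can_instantiate(cls) -> bool:
    """Return True if cls can be constructed, False if Python raises TypeError."""
    try:
        cls(config={})
        return True
    except TypeError:
        return False
```

Calling `can_instantiate(StandInCleaner)` returns `False` because the abstract method is still missing, whereas `can_instantiate(PassThroughCleaner)` returns `True`.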
Source Code
class BaseCleaner(ABC):
    """Base class for all document cleaners."""

    def __init__(self, config: Config):
        """
        Initialize the cleaner with configuration.

        Args:
            config: Configuration object
        """
        self.config = config

    @abstractmethod
    def clean(self, documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """
        Clean the documents by removing redundancy.

        Args:
            documents: List of document dictionaries

        Returns:
            List of cleaned documents
        """
        pass

    def get_stats(self, original_docs: List[Dict[str, Any]], cleaned_docs: List[Dict[str, Any]]) -> Dict[str, Any]:
        """
        Get statistics about the cleaning process.

        Args:
            original_docs: Original documents
            cleaned_docs: Cleaned documents

        Returns:
            Dictionary of statistics
        """
        return {
            "original_count": len(original_docs),
            "cleaned_count": len(cleaned_docs),
            "reduction_percentage": (1 - len(cleaned_docs) / len(original_docs)) * 100 if original_docs else 0,
        }
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| bases | ABC | - | |
Parameter Details
config: A Config object containing configuration settings for the cleaner. This parameter is required during instantiation and is stored as an instance attribute for use by subclass implementations. The config object should contain any settings needed to control the cleaning behavior.
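To illustrate how a subclass might read settings through `self.config`, here is a hedged sketch. The fields of the real `Config` class are not shown on this page, so `StubConfig` and its `id_field` attribute are hypothetical stand-ins, and `KeyDedupCleaner` is written as a plain class where the real project would subclass `BaseCleaner`.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class StubConfig:
    """Hypothetical stand-in for src.config.Config; the real fields are unknown here."""
    id_field: str = "id"


class KeyDedupCleaner:
    """Illustrative cleaner; in the real project this would subclass BaseCleaner."""

    def __init__(self, config: StubConfig):
        self.config = config

    def clean(self, documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        # The deduplication key comes from configuration rather than being hard-coded.
        key = self.config.id_field
        seen = set()
        kept = []
        for doc in documents:
            value = doc.get(key)
            if value not in seen:
                seen.add(value)
                kept.append(doc)
        return kept
```

Reading the key from `self.config` means the same cleaner class can deduplicate on a different field simply by being constructed with a different configuration object.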
Return Value
Because BaseCleaner is abstract, instantiation only succeeds for a concrete subclass and returns an instance of that subclass. The clean method (implemented by subclasses) returns a List[Dict[str, Any]] containing cleaned documents with redundancy removed. The get_stats method returns a Dict[str, Any] with keys 'original_count' (int), 'cleaned_count' (int), and 'reduction_percentage' (float, or the integer 0 when original_docs is empty), which together describe how much the cleaning step reduced the collection.
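The arithmetic behind `reduction_percentage` can be checked with a standalone copy of the formula from the source code above. `cleaning_stats` is a hypothetical helper name used only so the example runs without the project's `Config`; the expression itself is copied from `get_stats`.

```python
from typing import Any, Dict, List


def cleaning_stats(original_docs: List[Dict[str, Any]],
                   cleaned_docs: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Standalone copy of the get_stats formula, for illustration only."""
    return {
        "original_count": len(original_docs),
        "cleaned_count": len(cleaned_docs),
        "reduction_percentage": (
            (1 - len(cleaned_docs) / len(original_docs)) * 100
            if original_docs else 0
        ),
    }


# Four documents reduced to one: (1 - 1/4) * 100
four_docs = [{"id": i} for i in range(4)]
stats = cleaning_stats(four_docs, four_docs[:1])
print(stats["reduction_percentage"])  # 75.0

# Empty input takes the fallback branch instead of dividing by zero
empty_stats = cleaning_stats([], [])
print(empty_stats["reduction_percentage"])  # 0
```

Note that the fallback branch yields the integer `0` rather than `0.0`, which matters only if a caller checks the value's type strictly.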
Class Interface
Methods
__init__(self, config: Config) -> None
Purpose: Initialize the cleaner with configuration settings
Parameters:
config: Configuration object containing settings for the cleaner
Returns: None - initializes the instance
clean(self, documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]
Purpose: Abstract method that must be implemented by subclasses to clean documents by removing redundancy
Parameters:
documents: List of document dictionaries to be cleaned, where each dictionary represents a document with arbitrary key-value pairs
Returns: List of cleaned document dictionaries with redundancy removed according to the subclass implementation
get_stats(self, original_docs: List[Dict[str, Any]], cleaned_docs: List[Dict[str, Any]]) -> Dict[str, Any]
Purpose: Calculate and return statistics about the cleaning process to measure effectiveness
Parameters:
original_docs: List of original document dictionaries before cleaning
cleaned_docs: List of document dictionaries after cleaning
Returns: Dictionary containing 'original_count' (int), 'cleaned_count' (int), and 'reduction_percentage' (float, or the integer 0 when original_docs is empty) keys representing cleaning statistics
Attributes
| Name | Type | Description | Scope |
|---|---|---|---|
| config | Config | Configuration object passed during initialization, used to control cleaner behavior and settings | instance |
Dependencies
- abc
- typing
- src.config
Required Imports
from abc import ABC, abstractmethod
from typing import List, Dict, Any
from src.config import Config
Usage Example
from typing import List, Dict, Any
from src.config import Config
# Assumed module path, inferred from the file location listed at the top of this page
from src.cleaners.base_cleaner import BaseCleaner

# Define a concrete implementation
class SimpleCleaner(BaseCleaner):
    def clean(self, documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        # Remove duplicate documents based on 'id' field, keeping the first occurrence
        seen_ids = set()
        cleaned = []
        for doc in documents:
            doc_id = doc.get('id')
            if doc_id not in seen_ids:
                seen_ids.add(doc_id)
                cleaned.append(doc)
        return cleaned

# Usage
config = Config()  # Assume Config is properly defined
cleaner = SimpleCleaner(config)

original_docs = [
    {'id': 1, 'text': 'Document 1'},
    {'id': 2, 'text': 'Document 2'},
    {'id': 1, 'text': 'Document 1 duplicate'}
]

cleaned_docs = cleaner.clean(original_docs)
stats = cleaner.get_stats(original_docs, cleaned_docs)

print(f"Reduced from {stats['original_count']} to {stats['cleaned_count']} documents")
print(f"Reduction: {stats['reduction_percentage']:.2f}%")
Best Practices
- This is an abstract base class and cannot be instantiated directly. Always create a concrete subclass that implements the clean method.
- Subclasses must implement the clean method with the exact signature specified to maintain interface consistency.
- The clean method should be deterministic and idempotent: the same input should always produce the same result, and cleaning an already-cleaned list should return it unchanged.
- Use get_stats after calling clean to track the effectiveness of the cleaning process.
- The config object should be accessed via self.config in subclass implementations to maintain consistency.
- Document dictionaries should maintain a consistent structure across the cleaning pipeline.
- Handle edge cases in clean implementations: empty lists, None values, and malformed documents.
- The get_stats method handles division by zero when original_docs is empty, but subclasses should also handle empty input gracefully.
- Consider thread-safety if the cleaner will be used in concurrent contexts - the base class stores state in self.config.
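The edge-case and idempotence advice above can be sketched as follows. `clean_defensively` and `is_idempotent` are hypothetical helpers, not part of the project. Silently skipping malformed entries is one possible policy among several (raising or logging are equally valid), and the sketch assumes that 'id' values are hashable.

```python
from typing import Any, Callable, Dict, List, Optional


def clean_defensively(documents: Optional[List[Any]]) -> List[Dict[str, Any]]:
    """Hypothetical clean body: drop non-dict entries and repeated 'id' values."""
    if not documents:
        return []  # covers both None and an empty list
    seen = set()
    kept = []
    for doc in documents:
        if not isinstance(doc, dict) or "id" not in doc:
            continue  # skip None values and malformed entries
        if doc["id"] in seen:
            continue  # skip later duplicates, keeping the first occurrence
        seen.add(doc["id"])
        kept.append(doc)
    return kept


def is_idempotent(clean: Callable[[Any], List[Dict[str, Any]]], documents: Any) -> bool:
    """Check that cleaning already-cleaned output changes nothing."""
    once = clean(documents)
    return clean(once) == once
```

A check like `is_idempotent` fits naturally into a unit test for each concrete cleaner, run against a small fixture that contains duplicates and malformed entries.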
Similar Components
AI-powered semantic similarity - components with related functionality:
- class CombinedCleaner 69.7% similar
- class HashCleaner 64.0% similar
- class SimilarityCleaner 60.7% similar
- class TestCombinedCleaner 57.8% similar
- class BaseModel 57.1% similar