class BaseCleaner

Maturity: 50

Abstract base class that defines the interface for document cleaning implementations, providing methods to remove redundancy from document collections and track cleaning statistics.

File: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/base_cleaner.py
Lines: 6 - 46
Complexity: simple

Purpose

BaseCleaner is the abstract foundation for the project's document cleaning strategies. Its abstract 'clean' method enforces one interface across all cleaner implementations, and its concrete get_stats method reports how many documents a cleaning pass removed. Subclasses must implement clean to define their own redundancy removal logic. Documents are passed as lists of dictionaries, and the Config object received at construction is stored on the instance so that subclasses can read their settings from it.
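The enforcement described above comes from Python's abc module: a class with an unimplemented abstract method raises TypeError on instantiation. The following is a minimal sketch of that behaviour, not the project's code. Config here is an empty stand-in for src.config.Config, BaseCleaner is a trimmed copy of the class documented on this page, and PassthroughCleaner and can_instantiate are hypothetical names introduced for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class Config:
    """Hypothetical stand-in for src.config.Config (illustration only)."""


class BaseCleaner(ABC):
    """Trimmed copy of the documented base class."""

    def __init__(self, config: Config):
        self.config = config

    @abstractmethod
    def clean(self, documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        ...


class PassthroughCleaner(BaseCleaner):
    """Hypothetical subclass: it implements clean, so it can be instantiated."""

    def clean(self, documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        return list(documents)


def can_instantiate(cls) -> bool:
    """Return True if cls(Config()) succeeds, False if the ABC check rejects it."""
    try:
        cls(Config())
    except TypeError:
        return False
    return True


print(can_instantiate(BaseCleaner))         # abstract: clean is not implemented
print(can_instantiate(PassthroughCleaner))  # concrete: clean is implemented
```

The check happens at instantiation time, not at class definition time, so a subclass that forgets to implement clean only fails when something tries to create it.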

Source Code

class BaseCleaner(ABC):
    """Base class for all document cleaners."""
    
    def __init__(self, config: Config):
        """
        Initialize the cleaner with configuration.
        
        Args:
            config: Configuration object
        """
        self.config = config
    
    @abstractmethod
    def clean(self, documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """
        Clean the documents by removing redundancy.
        
        Args:
            documents: List of document dictionaries
            
        Returns:
            List of cleaned documents
        """
        pass
    
    def get_stats(self, original_docs: List[Dict[str, Any]], cleaned_docs: List[Dict[str, Any]]) -> Dict[str, Any]:
        """
        Get statistics about the cleaning process.
        
        Args:
            original_docs: Original documents
            cleaned_docs: Cleaned documents
            
        Returns:
            Dictionary of statistics
        """
        return {
            "original_count": len(original_docs),
            "cleaned_count": len(cleaned_docs),
            "reduction_percentage": (1 - len(cleaned_docs) / len(original_docs)) * 100 if original_docs else 0,
        }

Parameters

Name | Type | Default | Kind
config | Config | (required) | positional or keyword

Base class: ABC

Parameter Details

config: A Config object containing configuration settings for the cleaner. This parameter is required during instantiation and is stored as an instance attribute for use by subclass implementations. The config object should contain any settings needed to control the cleaning behavior.

Return Value

Instantiating a concrete subclass returns an instance of that subclass; BaseCleaner itself cannot be instantiated because clean is abstract. The clean method (implemented by subclasses) returns a List[Dict[str, Any]] of cleaned documents with redundancy removed. The get_stats method returns a Dict[str, Any] with keys 'original_count' (int), 'cleaned_count' (int), and 'reduction_percentage' (float, or the int 0 when original_docs is empty), where the percentage is computed as (1 - cleaned_count / original_count) * 100.
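To make the get_stats arithmetic concrete, here is a small worked example. It assumes a standalone copy of the method body from the Source Code section (the free function get_stats below is that copy, not a separate API), applied to made-up sample documents.

```python
from typing import Any, Dict, List


def get_stats(original_docs: List[Dict[str, Any]],
              cleaned_docs: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Standalone copy of BaseCleaner.get_stats, for illustration."""
    return {
        "original_count": len(original_docs),
        "cleaned_count": len(cleaned_docs),
        "reduction_percentage": (
            (1 - len(cleaned_docs) / len(original_docs)) * 100
            if original_docs else 0
        ),
    }


# Made-up sample data: four documents, one removed by a cleaning pass.
original = [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}]
cleaned = original[:3]

stats = get_stats(original, cleaned)
# (1 - 3 / 4) * 100 = 25.0
print(stats["reduction_percentage"])

# Empty input takes the guard branch, so no division by zero occurs.
empty_stats = get_stats([], [])
print(empty_stats["reduction_percentage"])
```

Note that the guard only covers an empty list. Passing None for original_docs would still fail, because len(None) is evaluated for 'original_count' first.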

Class Interface

Methods

__init__(self, config: Config) -> None

Purpose: Initialize the cleaner with configuration settings

Parameters:

  • config: Configuration object containing settings for the cleaner

Returns: None - initializes the instance

clean(self, documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]

Purpose: Abstract method that must be implemented by subclasses to clean documents by removing redundancy

Parameters:

  • documents: List of document dictionaries to be cleaned, where each dictionary represents a document with arbitrary key-value pairs

Returns: List of cleaned document dictionaries with redundancy removed according to the subclass implementation

get_stats(self, original_docs: List[Dict[str, Any]], cleaned_docs: List[Dict[str, Any]]) -> Dict[str, Any]

Purpose: Calculate and return statistics about the cleaning process to measure effectiveness

Parameters:

  • original_docs: List of original document dictionaries before cleaning
  • cleaned_docs: List of document dictionaries after cleaning

Returns: Dictionary containing 'original_count' (int), 'cleaned_count' (int), and 'reduction_percentage' (float, or the int 0 when original_docs is empty) keys representing cleaning statistics

Attributes

Name | Type | Description | Scope
config | Config | Configuration object passed during initialization, used to control cleaner behavior and settings | instance
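As a sketch of how a subclass might use the config attribute, the example below reads a setting from self.config inside clean. Everything specific here is an assumption: the real src.config.Config is not shown on this page, so Config is a stand-in dataclass, dedup_key is an invented setting, and KeyDedupCleaner is a hypothetical subclass.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Config:
    """Hypothetical stand-in for src.config.Config; dedup_key is an invented setting."""
    dedup_key: str = "id"


class BaseCleaner(ABC):
    """Trimmed copy of the documented base class."""

    def __init__(self, config: Config):
        self.config = config

    @abstractmethod
    def clean(self, documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        ...


class KeyDedupCleaner(BaseCleaner):
    """Hypothetical cleaner: keeps the first document for each value of the configured key."""

    def clean(self, documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        key = self.config.dedup_key  # setting read from the instance attribute
        seen = set()
        kept = []
        for doc in documents:
            value = doc.get(key)  # assumes the values are hashable
            if value not in seen:
                seen.add(value)
                kept.append(doc)
        return kept


docs = [
    {"id": 1, "text": "alpha"},
    {"id": 2, "text": "alpha"},
    {"id": 1, "text": "beta"},
]

by_id = KeyDedupCleaner(Config(dedup_key="id")).clean(docs)
by_text = KeyDedupCleaner(Config(dedup_key="text")).clean(docs)
```

The same cleaner class produces different results from the same input purely through configuration, which is the role the config attribute plays in this design.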

Dependencies

  • abc
  • typing
  • src.config

Required Imports

from abc import ABC, abstractmethod
from typing import List, Dict, Any
from src.config import Config

Usage Example

from typing import List, Dict, Any
from src.config import Config
from src.cleaners.base_cleaner import BaseCleaner  # assumed module path, inferred from the file location

# Define a concrete implementation
class SimpleCleaner(BaseCleaner):
    def clean(self, documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        # Remove duplicate documents based on 'id' field
        seen_ids = set()
        cleaned = []
        for doc in documents:
            if doc.get('id') not in seen_ids:
                seen_ids.add(doc.get('id'))
                cleaned.append(doc)
        return cleaned

# Usage
config = Config()  # Assume Config is properly defined
cleaner = SimpleCleaner(config)

original_docs = [
    {'id': 1, 'text': 'Document 1'},
    {'id': 2, 'text': 'Document 2'},
    {'id': 1, 'text': 'Document 1 duplicate'}
]

cleaned_docs = cleaner.clean(original_docs)
stats = cleaner.get_stats(original_docs, cleaned_docs)
print(f"Reduced from {stats['original_count']} to {stats['cleaned_count']} documents")
print(f"Reduction: {stats['reduction_percentage']:.2f}%")

Best Practices

  • This is an abstract base class and cannot be instantiated directly. Always create a concrete subclass that implements the clean method.
  • Subclasses must implement the clean method with the exact signature specified to maintain interface consistency.
  • The clean method should be deterministic and idempotent: the same input should always produce the same output, and cleaning an already cleaned list should return it unchanged.
  • Use get_stats after calling clean to track the effectiveness of the cleaning process.
  • The config object should be accessed via self.config in subclass implementations to maintain consistency.
  • Document dictionaries should maintain a consistent structure across the cleaning pipeline.
  • Handle edge cases in clean implementations: empty lists, None values, and malformed documents.
  • The get_stats method handles division by zero when original_docs is empty, but subclasses should also handle empty input gracefully.
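  • Consider thread-safety if the cleaner will be used in concurrent contexts. The base class holds only a reference to the config object in self.config, so safety depends on whether that object, or any state a subclass adds, is mutated while cleaners run.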
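Two of the practices above, handling edge cases and keeping clean idempotent, can be checked with a few lines of code. The sketch below is one possible approach under stated assumptions: dedupe_by_id is a hypothetical clean implementation written as a free function, treating None as an empty list and skipping non-dictionary entries are choices made for this example, and is_idempotent is an illustrative helper rather than part of the project.

```python
from typing import Any, Callable, Dict, List, Optional

Document = Dict[str, Any]


def dedupe_by_id(documents: Optional[List[Any]]) -> List[Document]:
    """Hypothetical clean logic: keep the first document seen for each 'id'."""
    seen = set()
    kept: List[Document] = []
    for doc in documents or []:  # None is treated like an empty list
        if not isinstance(doc, dict):
            continue  # skip malformed entries instead of raising
        doc_id = doc.get("id")
        if doc_id not in seen:
            seen.add(doc_id)
            kept.append(doc)
    return kept


def is_idempotent(clean: Callable[[Any], List[Document]], documents: Any) -> bool:
    """Return True if cleaning the cleaned output changes nothing."""
    once = clean(documents)
    twice = clean(once)
    return once == twice


# Made-up sample containing a duplicate id and a malformed entry.
sample = [
    {"id": 1, "text": "Document 1"},
    "not a dict",
    {"id": 1, "text": "Document 1 duplicate"},
    {"id": 2, "text": "Document 2"},
]

print(dedupe_by_id(sample))
print(is_idempotent(dedupe_by_id, sample))
```

A check like is_idempotent fits naturally into a unit test for each concrete cleaner, alongside tests for empty and malformed input.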

Similar Components

AI-powered semantic similarity - components with related functionality:

  • class CombinedCleaner 69.7% similar

    A document cleaner that combines hash-based and similarity-based cleaning approaches to remove both exact and near-duplicate documents in a two-stage process.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/combined_cleaner.py
  • class HashCleaner 64.0% similar

    A document deduplication cleaner that removes documents with identical content by comparing hash values of document text.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/hash_cleaner.py
  • class SimilarityCleaner 60.7% similar

    A document cleaning class that identifies and removes duplicate or highly similar documents based on embedding vector similarity, keeping only representative documents from each similarity group.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/similarity_cleaner.py
  • class TestCombinedCleaner 57.8% similar

    A unittest test class that validates the functionality of the CombinedCleaner class, testing its ability to remove duplicate and similar texts from collections.

    From: /tf/active/vicechatdev/chromadb-cleanup/tests/test_combined_cleaner.py
  • class BaseModel 57.1% similar

    Base class providing common data model functionality for all models in the system, including property management, serialization, and deserialization.

    From: /tf/active/vicechatdev/CDocs/models/__init__.py