class Config_v6

Maturity: 45

A dataclass that stores configuration settings for a ChromaDB cleanup process, including connection parameters, cleaning/clustering options, and summarization settings.

File: /tf/active/vicechatdev/chromadb-cleanup/src/config.py
Lines: 6 - 33
Complexity: simple

Purpose

This configuration class centralizes the settings for a ChromaDB cleanup operation: connection details (host, port, collection), cleaning parameters (exact-hash matching, similarity threshold), clustering options (method, number of clusters), summarization settings (model, API key), and the processing batch size. The three connection settings and the OpenAI API key take their defaults from environment variables, which are read once when the class is defined; the remaining defaults are hard-coded. A to_dict() method converts the configuration to a dictionary.
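
Because the environment-backed defaults are ordinary class-level default values, they are computed when the class body runs rather than on each instantiation. The sketch below illustrates this with a trimmed, hypothetical EagerConfig, and shows one possible alternative (LazyConfig, using field(default_factory=...)) that re-reads the variable for every new instance. Both class names and the host value are illustrative assumptions and are not part of config.py.

```python
import os
from dataclasses import dataclass, field

# Start from a known state so the demonstration is deterministic.
os.environ.pop("CHROMA_HOST", None)


@dataclass
class EagerConfig:
    # Same pattern as Config: the default is computed once, when the class body runs.
    chroma_host: str = os.environ.get("CHROMA_HOST", "localhost")


@dataclass
class LazyConfig:
    # Hypothetical alternative: default_factory is called on every instantiation.
    chroma_host: str = field(
        default_factory=lambda: os.environ.get("CHROMA_HOST", "localhost")
    )


# Change the variable after both classes have been defined.
os.environ["CHROMA_HOST"] = "chroma.internal"

eager = EagerConfig()
lazy = LazyConfig()
print(eager.chroma_host)  # still the value captured at class definition
print(lazy.chroma_host)   # picks up the new environment value
```

With the eager pattern, a late change to the environment is only visible if the value is passed explicitly to the constructor.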

Source Code

class Config:
    """Configuration settings for the ChromaDB cleanup process."""
    
    # ChromaDB connection settings
    chroma_host: str = os.environ.get("CHROMA_HOST", "localhost")
    chroma_port: int = int(os.environ.get("CHROMA_PORT", "8000"))
    chroma_collection: str = os.environ.get("CHROMA_COLLECTION", "default")
    
    # Cleaning parameters
    hash_identical_only: bool = True  # Only exact matches
    similarity_threshold: float = 0.85  # Minimum cosine similarity to consider as "similar"
    
    # Clustering parameters
    num_clusters: int = 10
    clustering_method: str = "kmeans"  # Options: "kmeans", "agglomerative", "dbscan"
    
    # Summarization parameters
    skip_summarization: bool = False
    max_summary_length: int = 500
    summary_model: str = "gpt-4o-mini"  # Model to use for summarization
    openai_api_key: Optional[str] = os.environ.get("OPENAI_API_KEY", None)
    
    # Processing parameters
    batch_size: int = 100  # Number of documents to process at once
    
    def to_dict(self) -> Dict[str, Any]:
        """Convert config to dictionary."""
        return {k: v for k, v in self.__dict__.items()}

Parameters

Name Type Default Kind
bases - -

Parameter Details

bases: No explicit base classes or __init__ parameters are defined. The dataclass decorator generates __init__ from the annotated class attributes, so each attribute can be passed as a keyword argument at instantiation.

Return Value

Instantiation returns a Config object whose attributes hold the values passed to the constructor, falling back to the defaults (hard-coded or read from environment variables) for anything not supplied. The to_dict() method returns a Dict[str, Any] containing all instance attributes as key-value pairs.

Class Interface

Methods

to_dict(self) -> Dict[str, Any]

Purpose: Converts the configuration object to a dictionary representation

Returns: A dictionary containing all instance attributes as key-value pairs
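
As a rough illustration of what to_dict() returns, the sketch below uses a Config trimmed to three fields (an assumption made for brevity) and compares the result with the standard-library dataclasses.asdict. The redacted_dict helper is hypothetical and not part of config.py; it only shows one way to avoid printing the API key when the dictionary is logged.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class Config:
    # Trimmed to three fields for brevity; the real class has more.
    chroma_host: str = "localhost"
    chroma_port: int = 8000
    openai_api_key: Optional[str] = None

    def to_dict(self) -> Dict[str, Any]:
        """Convert config to dictionary."""
        return {k: v for k, v in self.__dict__.items()}


def redacted_dict(config: Config) -> Dict[str, Any]:
    """Hypothetical helper: a copy of to_dict() with the API key masked."""
    data = config.to_dict()
    if data.get("openai_api_key"):
        data["openai_api_key"] = "***"
    return data


config = Config(openai_api_key="sk-example")  # placeholder key, not a real one
print(config.to_dict())
print(asdict(config) == config.to_dict())  # equal here because all fields are flat values
print(redacted_dict(config))
```

For a flat dataclass like this one the two conversions agree; asdict additionally recurses into nested dataclasses, which to_dict() does not.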

Attributes

Name | Type | Description | Scope
chroma_host | str | Hostname or IP address of the ChromaDB server; defaults to the CHROMA_HOST environment variable, or 'localhost' if unset | class
chroma_port | int | Port number for the ChromaDB connection; defaults to the CHROMA_PORT environment variable, or 8000 if unset | class
chroma_collection | str | Name of the ChromaDB collection to operate on; defaults to the CHROMA_COLLECTION environment variable, or 'default' if unset | class
hash_identical_only | bool | Whether to match only documents with identical hashes (exact matches); defaults to True | class
similarity_threshold | float | Minimum cosine similarity score (0.0-1.0) for two documents to count as similar; defaults to 0.85 | class
num_clusters | int | Number of clusters to create during clustering; defaults to 10 | class
clustering_method | str | Clustering algorithm to use: 'kmeans', 'agglomerative', or 'dbscan'; defaults to 'kmeans' | class
skip_summarization | bool | Whether to skip the summarization step of the cleanup process; defaults to False | class
max_summary_length | int | Maximum length of generated summaries (the source does not state the unit); defaults to 500 | class
summary_model | str | OpenAI model name to use for summarization; defaults to 'gpt-4o-mini' | class
openai_api_key | Optional[str] | OpenAI API key for summarization; defaults to the OPENAI_API_KEY environment variable, or None if unset | class
batch_size | int | Number of documents to process in a single batch; defaults to 100 | class

Dependencies

  • os
  • dataclasses
  • typing

Required Imports

import os
from dataclasses import dataclass
from typing import Dict, Any, Optional

Usage Example

from dataclasses import dataclass
import os
from typing import Dict, Any, Optional

@dataclass
class Config:
    chroma_host: str = os.environ.get("CHROMA_HOST", "localhost")
    chroma_port: int = int(os.environ.get("CHROMA_PORT", "8000"))
    chroma_collection: str = os.environ.get("CHROMA_COLLECTION", "default")
    hash_identical_only: bool = True
    similarity_threshold: float = 0.85
    num_clusters: int = 10
    clustering_method: str = "kmeans"
    skip_summarization: bool = False
    max_summary_length: int = 500
    summary_model: str = "gpt-4o-mini"
    openai_api_key: Optional[str] = os.environ.get("OPENAI_API_KEY", None)
    batch_size: int = 100
    
    def to_dict(self) -> Dict[str, Any]:
        return {k: v for k, v in self.__dict__.items()}

# Create config with defaults
config = Config()

# Create config with custom values
config = Config(
    chroma_host="192.168.1.100",
    chroma_port=9000,
    similarity_threshold=0.90,
    num_clusters=15
)

# Access attributes
print(config.chroma_host)
print(config.similarity_threshold)

# Convert to dictionary
config_dict = config.to_dict()
print(config_dict)
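
When several runs need slightly different settings, one option is to derive variants from a base configuration with the standard-library dataclasses.replace instead of mutating a shared instance. This is a minimal sketch on a Config trimmed to three fields; the variable names base and strict are illustrative.

```python
from dataclasses import dataclass, replace


@dataclass
class Config:
    # Trimmed to three fields for brevity; the real class has more.
    chroma_collection: str = "default"
    similarity_threshold: float = 0.85
    batch_size: int = 100


base = Config(batch_size=50)

# replace() builds a new instance, copying every field not named in the call.
strict = replace(base, similarity_threshold=0.95)

print(base.similarity_threshold)    # the base instance is left unchanged
print(strict.similarity_threshold)  # overridden in the copy
print(strict.batch_size)            # carried over from base
```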

Best Practices

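  • Set environment variables (CHROMA_HOST, CHROMA_PORT, CHROMA_COLLECTION, OPENAI_API_KEY) before the module that defines Config is imported, because the defaults are evaluated once at class definition time; to override a value later, pass it to the constructor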
  • Use the dataclass decorator when defining this class to automatically generate __init__, __repr__, and other methods
  • Validate clustering_method is one of 'kmeans', 'agglomerative', or 'dbscan' after instantiation if needed
  • Ensure openai_api_key is set if skip_summarization is False
  • Config instances are mutable by default; pass frozen=True to the dataclass decorator if the configuration should be immutable
  • Use to_dict() method when you need to serialize the configuration or pass it to functions expecting dictionaries
  • Consider validating similarity_threshold is between 0.0 and 1.0 after instantiation
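
The validation suggestions in this list could be collected into one check that runs after instantiation. The sketch below is one possible shape, not something config.py provides: validate_config and VALID_CLUSTERING_METHODS are hypothetical names, and Config is trimmed to the four fields being checked.

```python
from dataclasses import dataclass
from typing import List, Optional

# The three options named in the clustering_method comment in the source.
VALID_CLUSTERING_METHODS = ("kmeans", "agglomerative", "dbscan")


@dataclass
class Config:
    # Trimmed to the fields that are validated below.
    similarity_threshold: float = 0.85
    clustering_method: str = "kmeans"
    skip_summarization: bool = False
    openai_api_key: Optional[str] = None


def validate_config(config: Config) -> List[str]:
    """Hypothetical helper: return a list of problems (empty if none)."""
    problems = []
    if config.clustering_method not in VALID_CLUSTERING_METHODS:
        problems.append(f"unknown clustering_method: {config.clustering_method!r}")
    if not 0.0 <= config.similarity_threshold <= 1.0:
        problems.append("similarity_threshold must be between 0.0 and 1.0")
    if not config.skip_summarization and not config.openai_api_key:
        problems.append("openai_api_key is required unless skip_summarization is True")
    return problems


ok = Config(skip_summarization=True)
bad = Config(clustering_method="spectral", similarity_threshold=1.5)

print(validate_config(ok))   # no problems reported
for problem in validate_config(bad):
    print(problem)           # one line per failed check
```

Returning a list rather than raising on the first failure lets a caller report every problem at once; the same checks could instead live in a __post_init__ method if failing fast at construction is preferred.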

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function main_v60 58.0% similar

    Command-line interface function that orchestrates the cleaning of ChromaDB collections by removing duplicates and similar documents, with options to skip collections and customize the cleaning process.

    From: /tf/active/vicechatdev/chromadb-cleanup/main.py
  • class Config 57.6% similar

    Configuration class that manages application-wide settings, directory structures, API keys, and operational parameters for a statistical analysis application.

    From: /tf/active/vicechatdev/vice_ai/smartstat_config.py
  • class ChatConfiguration 57.5% similar

    A dataclass that stores configuration settings for a chat interface integrated with a RAG (Retrieval-Augmented Generation) engine, managing search parameters, data sources, and model settings.

    From: /tf/active/vicechatdev/vice_ai/models.py
  • class AnalysisConfiguration 57.1% similar

    A dataclass that encapsulates configuration parameters for statistical analysis operations, including analysis type, variables, and statistical thresholds.

    From: /tf/active/vicechatdev/vice_ai/models.py
  • function main_v51 56.0% similar

    Command-line interface function that orchestrates a ChromaDB collection cleaning pipeline by removing duplicate and similar documents through hashing and similarity screening.

    From: /tf/active/vicechatdev/chromadb-cleanup/main copy.py