class Config_v6
A dataclass that stores configuration settings for a ChromaDB cleanup process, including connection parameters, cleaning/clustering options, and summarization settings.
/tf/active/vicechatdev/chromadb-cleanup/src/config.py
6 - 33
simple
Purpose
This configuration class centralizes all settings needed for a ChromaDB cleanup operation. It manages connection details (host, port, collection), cleaning parameters (hash matching, similarity thresholds), clustering options (method, number of clusters), summarization settings (model, API keys), and processing parameters (batch size). The class provides default values from environment variables where applicable and includes a method to convert the configuration to a dictionary format.
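One detail of the environment-backed defaults is worth illustrating: the os.environ.get(...) calls sit directly in the class body, so they run once, when the class is defined, and later changes to the environment are not picked up by new instances. The sketch below shows one possible alternative, assuming per-instance lookup is wanted. It uses dataclasses.field(default_factory=...) to defer the lookup to instantiation. The class name LazyEnvConfig and the values db.internal and 9000 are illustrative only and do not appear in the original module.

```python
import os
from dataclasses import dataclass, field


@dataclass
class LazyEnvConfig:
    """Hypothetical variant of Config: env vars are read per instance."""

    # default_factory is called every time an instance is created, so
    # changes made to os.environ after import are still honoured.
    chroma_host: str = field(
        default_factory=lambda: os.environ.get("CHROMA_HOST", "localhost")
    )
    chroma_port: int = field(
        default_factory=lambda: int(os.environ.get("CHROMA_PORT", "8000"))
    )


# Example values, set after the class has already been defined.
os.environ["CHROMA_HOST"] = "db.internal"
os.environ["CHROMA_PORT"] = "9000"

late = LazyEnvConfig()                            # picks up the new values
explicit = LazyEnvConfig(chroma_host="10.0.0.5")  # explicit arguments still win
```

Whether this behaviour is preferable depends on how the cleanup process is launched; for a short-lived script whose environment is fixed before start-up, the original class-level defaults behave the same way.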
Source Code
@dataclass
class Config:
    """Configuration settings for the ChromaDB cleanup process."""

    # ChromaDB connection settings
    chroma_host: str = os.environ.get("CHROMA_HOST", "localhost")
    chroma_port: int = int(os.environ.get("CHROMA_PORT", "8000"))
    chroma_collection: str = os.environ.get("CHROMA_COLLECTION", "default")

    # Cleaning parameters
    hash_identical_only: bool = True  # Only exact matches
    similarity_threshold: float = 0.85  # Minimum cosine similarity to consider as "similar"

    # Clustering parameters
    num_clusters: int = 10
    clustering_method: str = "kmeans"  # Options: "kmeans", "agglomerative", "dbscan"

    # Summarization parameters
    skip_summarization: bool = False
    max_summary_length: int = 500
    summary_model: str = "gpt-4o-mini"  # Model to use for summarization
    openai_api_key: Optional[str] = os.environ.get("OPENAI_API_KEY", None)

    # Processing parameters
    batch_size: int = 100  # Number of documents to process at once

    def to_dict(self) -> Dict[str, Any]:
        """Convert config to dictionary."""
        return {k: v for k, v in self.__dict__.items()}
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| bases | - | - | |
Parameter Details
bases: The dataclass decorator automatically generates __init__ parameters from class attributes. No explicit __init__ parameters are defined beyond the class attributes themselves.
Return Value
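Instantiating Config returns an object whose attributes hold either the values passed to the constructor or the class defaults. The environment-backed defaults (CHROMA_HOST, CHROMA_PORT, CHROMA_COLLECTION, OPENAI_API_KEY) are read once, when the class definition is executed at import time. The to_dict() method returns a Dict[str, Any] containing all instance attributes as key-value pairs.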
Class Interface
Methods
to_dict(self) -> Dict[str, Any]
Purpose: Converts the configuration object to a dictionary representation
Returns: A dictionary containing all instance attributes as key-value pairs
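As a rough illustration of how the dictionary form might be used, the sketch below dumps a config to JSON while masking the API key. It assumes a trimmed three-field copy of Config for brevity; redacted_json and the key value sk-example are hypothetical and not part of the original module.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class Config:
    # Trimmed to three fields for brevity; the documented class has more.
    chroma_host: str = "localhost"
    chroma_port: int = 8000
    openai_api_key: Optional[str] = None

    def to_dict(self) -> Dict[str, Any]:
        """Convert config to dictionary (shallow copy of instance attributes)."""
        return {k: v for k, v in self.__dict__.items()}


def redacted_json(config: Config) -> str:
    """Hypothetical helper: serialize the config with the API key masked."""
    data = config.to_dict()  # a new dict, so the config itself is not modified
    if data.get("openai_api_key"):
        data["openai_api_key"] = "***"
    return json.dumps(data, sort_keys=True)


config = Config(openai_api_key="sk-example")  # placeholder key for illustration
snapshot = redacted_json(config)

# With only flat fields, to_dict() and dataclasses.asdict() agree;
# asdict() would additionally recurse into nested dataclasses.
same_as_asdict = config.to_dict() == asdict(config)
```

Masking the key before logging or persisting the dictionary is a design suggestion, not something the original to_dict() does: it returns openai_api_key in clear text.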
Attributes
| Name | Type | Description | Scope |
|---|---|---|---|
| chroma_host | str | Hostname or IP address of the ChromaDB server. Defaults to the CHROMA_HOST environment variable, or 'localhost' if it is unset | class |
| chroma_port | int | Port number for the ChromaDB connection. Defaults to the CHROMA_PORT environment variable, or 8000 if it is unset | class |
| chroma_collection | str | Name of the ChromaDB collection to operate on. Defaults to the CHROMA_COLLECTION environment variable, or 'default' if it is unset | class |
| hash_identical_only | bool | Whether to match only documents with identical hashes (exact matches). Defaults to True | class |
| similarity_threshold | float | Minimum cosine similarity score (0.0-1.0) for two documents to be considered similar. Defaults to 0.85 | class |
| num_clusters | int | Number of clusters to create during clustering operations. Defaults to 10 | class |
| clustering_method | str | Clustering algorithm to use: 'kmeans', 'agglomerative', or 'dbscan'. Defaults to 'kmeans' | class |
| skip_summarization | bool | Whether to skip the summarization step in the cleanup process. Defaults to False | class |
| max_summary_length | int | Maximum length in characters for generated summaries. Defaults to 500 | class |
| summary_model | str | OpenAI model name to use for summarization. Defaults to 'gpt-4o-mini' | class |
| openai_api_key | Optional[str] | OpenAI API key for summarization. Defaults to the OPENAI_API_KEY environment variable, or None if it is unset | class |
| batch_size | int | Number of documents to process in a single batch operation. Defaults to 100 | class |
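To make the similarity_threshold attribute concrete, here is a minimal pure-Python sketch of a cosine-similarity check against a threshold. How the cleanup code actually computes or applies the score is not shown in this document, so cosine_similarity and is_similar are hypothetical helpers written only to illustrate what "minimum cosine similarity" means.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def is_similar(a: Sequence[float], b: Sequence[float], threshold: float = 0.85) -> bool:
    """Hypothetical check: True when the score reaches the threshold."""
    return cosine_similarity(a, b) >= threshold


# Vectors at 45 degrees score about 0.707: below the default 0.85,
# but above a looser threshold of 0.7.
strict = is_similar([1.0, 0.0], [1.0, 1.0])
loose = is_similar([1.0, 0.0], [1.0, 1.0], threshold=0.7)
```

Raising the threshold toward 1.0 makes the check stricter, so fewer document pairs would count as similar.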
Dependencies
- os
- dataclasses
- typing
Required Imports
import os
from dataclasses import dataclass
from typing import Dict, Any, Optional
Usage Example
from dataclasses import dataclass
import os
from typing import Dict, Any, Optional

@dataclass
class Config:
    chroma_host: str = os.environ.get("CHROMA_HOST", "localhost")
    chroma_port: int = int(os.environ.get("CHROMA_PORT", "8000"))
    chroma_collection: str = os.environ.get("CHROMA_COLLECTION", "default")
    hash_identical_only: bool = True
    similarity_threshold: float = 0.85
    num_clusters: int = 10
    clustering_method: str = "kmeans"
    skip_summarization: bool = False
    max_summary_length: int = 500
    summary_model: str = "gpt-4o-mini"
    openai_api_key: Optional[str] = os.environ.get("OPENAI_API_KEY", None)
    batch_size: int = 100

    def to_dict(self) -> Dict[str, Any]:
        return {k: v for k, v in self.__dict__.items()}

# Create config with defaults
config = Config()

# Create config with custom values
config = Config(
    chroma_host="192.168.1.100",
    chroma_port=9000,
    similarity_threshold=0.90,
    num_clusters=15
)

# Access attributes
print(config.chroma_host)
print(config.similarity_threshold)

# Convert to dictionary
config_dict = config.to_dict()
print(config_dict)
Best Practices
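- Set environment variables (CHROMA_HOST, CHROMA_PORT, CHROMA_COLLECTION, OPENAI_API_KEY) before the module that defines Config is imported if you want to override defaults; the os.environ.get() calls run once when the class body is executed, not on each instantiation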
- Use the dataclass decorator when defining this class to automatically generate __init__, __repr__, and other methods
- Validate clustering_method is one of 'kmeans', 'agglomerative', or 'dbscan' after instantiation if needed
- Ensure openai_api_key is set if skip_summarization is False
- Config instances are mutable by default; pass frozen=True to the dataclass decorator if the configuration should be immutable after creation
- Use to_dict() method when you need to serialize the configuration or pass it to functions expecting dictionaries
- Consider validating similarity_threshold is between 0.0 and 1.0 after instantiation
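The validation suggestions above can be sketched with a __post_init__ hook, which a dataclass calls automatically at the end of the generated __init__. This is one possible approach, not the original implementation: ValidatedConfig, ALLOWED_CLUSTERING_METHODS, and the error messages are assumptions, and only the fields being validated are included.

```python
from dataclasses import dataclass
from typing import Optional

# Taken from the options listed for clustering_method.
ALLOWED_CLUSTERING_METHODS = ("kmeans", "agglomerative", "dbscan")


@dataclass
class ValidatedConfig:
    """Hypothetical subset of Config with validation added."""

    similarity_threshold: float = 0.85
    clustering_method: str = "kmeans"
    skip_summarization: bool = False
    openai_api_key: Optional[str] = None

    def __post_init__(self) -> None:
        # Runs after the generated __init__ has assigned every field.
        if self.clustering_method not in ALLOWED_CLUSTERING_METHODS:
            raise ValueError(
                f"clustering_method must be one of {ALLOWED_CLUSTERING_METHODS}, "
                f"got {self.clustering_method!r}"
            )
        if not 0.0 <= self.similarity_threshold <= 1.0:
            raise ValueError("similarity_threshold must be between 0.0 and 1.0")
        if not self.skip_summarization and not self.openai_api_key:
            raise ValueError(
                "openai_api_key is required unless skip_summarization is True"
            )


# Valid: summarization is skipped, so no API key is needed.
config = ValidatedConfig(skip_summarization=True, clustering_method="dbscan")
```

Failing at construction time surfaces a bad setting before any documents are processed; the trade-off is that a config without an API key can no longer be created unless skip_summarization is set.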
Similar Components
AI-powered semantic similarity - components with related functionality:
- function main_v60 58.0% similar
- class Config 57.6% similar
- class ChatConfiguration 57.5% similar
- class AnalysisConfiguration 57.1% similar
- function main_v51 56.0% similar