function main_v59
Command-line interface function that orchestrates the cleaning of ChromaDB collections by removing duplicates and similar documents, with options to skip collections and customize the cleaning process.
/tf/active/vicechatdev/chromadb-cleanup/main.py
20 - 68
moderate
Purpose
This is the main entry point for a ChromaDB collection cleaning utility. It connects to a ChromaDB instance, retrieves all collections, filters out collections to skip (including already cleaned ones), and processes each collection through a cleaning pipeline that removes duplicates and optionally summarizes similar documents. The cleaned data is stored in new collections with a configurable suffix.
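The selection step described above can be sketched as a small pure function. The name `select_collections` is hypothetical and does not appear in main.py; the logic mirrors the list comprehension used in main().

```python
def select_collections(names, suffix="_clean", skip=()):
    """Return the names that would be cleaned, in their original order.

    Hypothetical helper: mirrors the filter in main(), not part of main.py.
    """
    skip = set(skip)
    # Drop names that already carry the cleaned suffix or were skipped by name.
    return [n for n in names if not n.endswith(suffix) and n not in skip]


names = ["docs", "docs_clean", "notes", "scratch"]
to_process = select_collections(names, suffix="_clean", skip=["scratch"])
print(to_process)
```

Note that an empty suffix matches every name (`str.endswith("")` is always true), so passing `--suffix ""` would leave nothing to process.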
Source Code
def main():
    # Parse command line arguments
    parser = argparse.ArgumentParser(description='Clean up all ChromaDB collections')
    parser.add_argument('--host', type=str, default='vice_chroma', help='ChromaDB host')
    parser.add_argument('--port', type=int, default=8000, help='ChromaDB port')
    parser.add_argument('--similarity-threshold', type=float, default=0.95,
                        help='Similarity threshold for detecting similar documents')
    parser.add_argument('--skip-collections', type=str, nargs='+', default=[],
                        help='Collections to skip (e.g., already cleaned ones)')
    parser.add_argument('--suffix', type=str, default='_clean',
                        help='Suffix to add to cleaned collection names')
    parser.add_argument('--skip-summarization', action='store_true',
                        help='Skip the summarization step')
    args = parser.parse_args()

    # Connect to ChromaDB
    client = chromadb.HttpClient(
        host=args.host,
        port=args.port,
        settings=Settings(anonymized_telemetry=False)
    )

    # Get all available collections
    collection_names = client.list_collections()

    # Filter out collections to skip (e.g., already cleaned ones)
    skip_suffix = args.suffix
    to_process = [name for name in collection_names
                  if not name.endswith(skip_suffix) and name not in args.skip_collections]

    print(f"Found {len(collection_names)} total collections")
    print(f"Will clean {len(to_process)} collections (skipping {len(collection_names) - len(to_process)})")

    # Process each collection
    for collection_name in tqdm(to_process, desc="Cleaning collections"):
        try:
            clean_collection(
                collection_name=collection_name,
                output_collection=f"{collection_name}{args.suffix}",
                host=args.host,
                port=args.port,
                similarity_threshold=args.similarity_threshold,
                skip_summarization=args.skip_summarization
            )
            # Sleep briefly to avoid overwhelming the server
            time.sleep(1)
        except Exception as e:
            print(f"Error cleaning collection {collection_name}: {e}")
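The filter in main() calls `name.endswith(...)`, so it assumes `client.list_collections()` yields name strings. Depending on the installed chromadb version, that call may instead return collection objects with a `.name` attribute. A minimal sketch of a normalizing step follows; `collection_names_from` and `FakeCollection` are hypothetical names used only for illustration.

```python
def collection_names_from(listed):
    """Normalize a list_collections() result to a list of name strings.

    Hypothetical helper: items that are already strings pass through,
    anything else is assumed to expose a .name attribute.
    """
    return [item if isinstance(item, str) else item.name for item in listed]


class FakeCollection:
    """Stand-in for a chromadb collection object; only .name is used here."""

    def __init__(self, name):
        self.name = name


normalized = collection_names_from(["docs", FakeCollection("notes")])
print(normalized)
```

The `isinstance` check is used rather than `getattr(item, "name", item)` so that string-like items are never asked for a `.name` attribute.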
Return Value
Returns None. This function performs side effects by creating new cleaned collections in ChromaDB and printing progress information to stdout. Errors during collection cleaning are caught and printed but do not stop the overall process.
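Because main() returns None and only prints errors, a caller cannot tell afterwards whether any collection failed. The sketch below shows the same per-collection error isolation with the failures recorded; `run_all` and `fake_clean` are hypothetical and are not part of main.py.

```python
def run_all(names, worker):
    """Call worker(name) for each name; record failures instead of stopping."""
    failures = []
    for name in names:
        try:
            worker(name)
        except Exception as exc:  # same broad catch as in main()
            print(f"Error cleaning collection {name}: {exc}")
            failures.append((name, str(exc)))
    return failures


def fake_clean(name):
    """Hypothetical stand-in for clean_collection; fails for one name."""
    if name == "broken":
        raise ValueError("simulated failure")


failures = run_all(["docs", "broken", "notes"], fake_clean)
print(failures)
```

Returning the list would let a wrapper script exit with a non-zero status when it is not empty.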
Dependencies
- argparse
- chromadb
- time
- tqdm
- src.cleaners.hash_cleaner
- src.cleaners.similarity_cleaner
- src.cleaners.combined_cleaner
- src.utils.hash_utils
- src.utils.similarity_utils
- src.clustering.text_clusterer
- src.config
Required Imports
import argparse
import chromadb
from chromadb.config import Settings
import time
from tqdm import tqdm
Conditional/Optional Imports
These imports are only needed under specific conditions:
from src.cleaners.hash_cleaner import HashCleaner
Condition: Required by the clean_collection function that this main function calls
Required (conditional)
from src.cleaners.similarity_cleaner import SimilarityCleaner
Condition: Required by the clean_collection function that this main function calls
Required (conditional)
from src.cleaners.combined_cleaner import CombinedCleaner
Condition: Required by the clean_collection function that this main function calls
Required (conditional)
from src.utils.hash_utils import hash_text
Condition: Required by the cleaning utilities
Required (conditional)
from src.utils.similarity_utils import calculate_similarity
Condition: Required by the cleaning utilities
Required (conditional)
from src.clustering.text_clusterer import TextClusterer
Condition: Required by the cleaning utilities
Required (conditional)
from src.config import Config
Condition: Required for configuration settings
Required (conditional)
Usage Example
# Run from the command line (the function lives in main.py):
# python main.py --host localhost --port 8000 --similarity-threshold 0.95 --skip-collections collection1 collection2 --suffix _cleaned --skip-summarization
# Script entry point; main() takes no parameters and reads its options from sys.argv:
if __name__ == '__main__':
    main()
# Example with custom arguments:
# python main.py --host vice_chroma --port 8000 --similarity-threshold 0.90 --skip-collections already_clean_collection --suffix _v2
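To see how the flags in the example above are interpreted, the parser can be rebuilt on its own and fed an explicit argument list. This is a sketch: `build_parser` is a hypothetical helper (main() builds its parser inline), and the help strings are omitted for brevity.

```python
import argparse


def build_parser():
    """Rebuild the argument parser from main() (hypothetical helper)."""
    parser = argparse.ArgumentParser(description='Clean up all ChromaDB collections')
    parser.add_argument('--host', type=str, default='vice_chroma')
    parser.add_argument('--port', type=int, default=8000)
    parser.add_argument('--similarity-threshold', type=float, default=0.95)
    parser.add_argument('--skip-collections', type=str, nargs='+', default=[])
    parser.add_argument('--suffix', type=str, default='_clean')
    parser.add_argument('--skip-summarization', action='store_true')
    return parser


# Passing a list to parse_args() avoids touching sys.argv.
args = build_parser().parse_args([
    '--host', 'localhost',
    '--similarity-threshold', '0.90',
    '--skip-collections', 'collection1', 'collection2',
    '--suffix', '_v2',
])
print(args.host, args.port, args.similarity_threshold)
print(args.skip_collections, args.suffix, args.skip_summarization)
```

Dashes in option names become underscores on the parsed namespace, which is why main() reads `args.similarity_threshold` and `args.skip_collections`.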
Best Practices
- Ensure ChromaDB server is running before executing this function
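- Collections whose names end with the --suffix value are skipped automatically; use --skip-collections to exclude any others by name (for example, collections cleaned earlier under a different suffix)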
- Adjust --similarity-threshold based on your data characteristics (higher values are more strict)
- The function includes a 1-second sleep between collections to avoid overwhelming the server
- Errors in individual collections are caught and printed to stdout but don't stop the entire process
- Monitor disk space as cleaned collections are created as new collections rather than modifying existing ones
- Consider using --skip-summarization for faster processing if summarization is not needed
- The function expects a clean_collection function to be defined elsewhere in the module
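The effect of --similarity-threshold can be illustrated with a toy comparison. The source does not show which similarity measure clean_collection uses or whether the comparison is inclusive, so cosine similarity and `>=` are assumptions here, and `cosine_similarity` and `is_similar` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_similar(a, b, threshold):
    """Assumed rule: a pair counts as similar when the score reaches the threshold."""
    return cosine_similarity(a, b) >= threshold


a = [1.0, 0.0]
b = [1.0, 0.4]
score = cosine_similarity(a, b)  # 1 / sqrt(1.16), roughly 0.93

print(is_similar(a, b, 0.90))  # pair is grouped at the looser threshold
print(is_similar(a, b, 0.95))  # pair is kept apart at the stricter threshold
```

Under these assumptions, raising the threshold from 0.90 to 0.95 means fewer pairs are treated as similar, so fewer documents are merged or summarized.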
Similar Components
AI-powered semantic similarity - components with related functionality:
- function main_v50 90.7% similar
- function clean_collection 77.9% similar
- function reset_collection 67.8% similar
- function main_v32 67.4% similar
- function test_collection_creation 65.9% similar