
function main_v59

Maturity: 36

Command-line interface function that orchestrates the cleaning of ChromaDB collections by removing duplicates and similar documents, with options to skip collections and customize the cleaning process.

File: /tf/active/vicechatdev/chromadb-cleanup/main.py
Lines: 20 - 68
Complexity: moderate

Purpose

This is the main entry point for a ChromaDB collection cleaning utility. It connects to a ChromaDB instance, retrieves all collections, filters out collections to skip (including already cleaned ones), and processes each collection through a cleaning pipeline that removes duplicates and optionally summarizes similar documents. The cleaned data is stored in new collections with a configurable suffix.
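The filtering step described above can be sketched as a small standalone function. This is a minimal illustration, not code from main.py: the name select_collections is hypothetical, and the sample collection names are made up.

```python
from typing import Iterable, List


def select_collections(names: Iterable[str],
                       suffix: str = "_clean",
                       skip: Iterable[str] = ()) -> List[str]:
    """Return the names that would be handed to the cleaning step.

    A name is dropped if it already ends with the output suffix
    (it is assumed to be a cleaned collection) or if it is listed in skip.
    """
    skip_set = set(skip)
    return [n for n in names if not n.endswith(suffix) and n not in skip_set]


# Made-up collection names for illustration only.
names = ["docs", "docs_clean", "notes", "archive"]
selected = select_collections(names, suffix="_clean", skip=["archive"])
print(selected)
```

Keeping the filter separate from the ChromaDB connection makes it possible to check the selection rules without a running server.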

Source Code

def main():
    # Parse command line arguments
    parser = argparse.ArgumentParser(description='Clean up all ChromaDB collections')
    parser.add_argument('--host', type=str, default='vice_chroma', help='ChromaDB host')
    parser.add_argument('--port', type=int, default=8000, help='ChromaDB port')
    parser.add_argument('--similarity-threshold', type=float, default=0.95, 
                        help='Similarity threshold for detecting similar documents')
    parser.add_argument('--skip-collections', type=str, nargs='+', default=[], 
                        help='Collections to skip (e.g., already cleaned ones)')
    parser.add_argument('--suffix', type=str, default='_clean', 
                        help='Suffix to add to cleaned collection names')
    parser.add_argument('--skip-summarization', action='store_true', 
                        help='Skip the summarization step')
    
    args = parser.parse_args()
    
    # Connect to ChromaDB
    client = chromadb.HttpClient(
        host=args.host,
        port=args.port,
        settings=Settings(anonymized_telemetry=False)
    )
    
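    # Get all available collections. Depending on the installed chromadb
    # version, list_collections() yields names or Collection objects, so
    # normalize everything to plain name strings before filtering.
    collection_names = [getattr(c, 'name', c) for c in client.list_collections()]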
    
    # Filter out collections to skip (e.g., already cleaned ones)
    skip_suffix = args.suffix
    to_process = [name for name in collection_names 
                 if not name.endswith(skip_suffix) and name not in args.skip_collections]
    
    print(f"Found {len(collection_names)} total collections")
    print(f"Will clean {len(to_process)} collections (skipping {len(collection_names) - len(to_process)})")
    
    # Process each collection
    for collection_name in tqdm(to_process, desc="Cleaning collections"):
        try:
            clean_collection(
                collection_name=collection_name,
                output_collection=f"{collection_name}{args.suffix}",
                host=args.host,
                port=args.port,
                similarity_threshold=args.similarity_threshold,
                skip_summarization=args.skip_summarization
            )
            # Sleep briefly to avoid overwhelming the server
            time.sleep(1)
        except Exception as e:
            print(f"Error cleaning collection {collection_name}: {e}")

Return Value

Returns None. This function performs side effects by creating new cleaned collections in ChromaDB and printing progress information to stdout. Errors during collection cleaning are caught and printed but do not stop the overall process.
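The error-isolation behaviour described above can be sketched as follows. This is a hedged variant, not the author's code: clean_all and fake_clean are hypothetical names, fake_clean stands in for clean_collection, and unlike main() this version also returns the failures so a caller could inspect them.

```python
from typing import Callable, Iterable, List, Tuple


def clean_all(names: Iterable[str],
              clean_one: Callable[[str], None]) -> List[Tuple[str, str]]:
    """Run clean_one on every name and keep going when one of them fails.

    Returns a list of (collection name, error message) pairs.
    """
    failures = []
    for name in names:
        try:
            clean_one(name)
        except Exception as exc:  # same broad catch as in main()
            print(f"Error cleaning collection {name}: {exc}")
            failures.append((name, str(exc)))
    return failures


def fake_clean(name: str) -> None:
    # Hypothetical stand-in for clean_collection: fails for one name only.
    if name == "broken":
        raise ValueError("simulated failure")


failures = clean_all(["docs", "broken", "notes"], fake_clean)
```

Returning the failures is a design choice of this sketch; main() as documented only prints them.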

Dependencies

  • argparse
  • chromadb
  • time
  • tqdm
  • src.cleaners.hash_cleaner
  • src.cleaners.similarity_cleaner
  • src.cleaners.combined_cleaner
  • src.utils.hash_utils
  • src.utils.similarity_utils
  • src.clustering.text_clusterer
  • src.config

Required Imports

import argparse
import chromadb
from chromadb.config import Settings
import time
from tqdm import tqdm

Conditional/Optional Imports

These imports are only needed under specific conditions:

from src.cleaners.hash_cleaner import HashCleaner

Condition: Required by the clean_collection function that this main function calls

from src.cleaners.similarity_cleaner import SimilarityCleaner

Condition: Required by the clean_collection function that this main function calls

from src.cleaners.combined_cleaner import CombinedCleaner

Condition: Required by the clean_collection function that this main function calls

from src.utils.hash_utils import hash_text

Condition: Required by the cleaning utilities

from src.utils.similarity_utils import calculate_similarity

Condition: Required by the cleaning utilities

from src.clustering.text_clusterer import TextClusterer

Condition: Required by the cleaning utilities

from src.config import Config

Condition: Required for configuration settings

Usage Example

# Run from the command line (the function lives in main.py):
# python main.py --host localhost --port 8000 --similarity-threshold 0.95 --skip-collections collection1 collection2 --suffix _cleaned --skip-summarization

# Entry-point guard at the bottom of the script:
if __name__ == '__main__':
    main()

# Example with custom arguments:
# python main.py --host vice_chroma --port 8000 --similarity-threshold 0.90 --skip-collections already_clean_collection --suffix _v2
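The command-line flags can be tried out without a ChromaDB server by passing an explicit argument list to argparse. The sketch below mirrors the options shown in the source code; wrapping them in a build_parser helper is an assumption made for illustration, since main() builds its parser inline.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical helper that mirrors the options defined inside main()."""
    parser = argparse.ArgumentParser(description="Clean up all ChromaDB collections")
    parser.add_argument("--host", type=str, default="vice_chroma", help="ChromaDB host")
    parser.add_argument("--port", type=int, default=8000, help="ChromaDB port")
    parser.add_argument("--similarity-threshold", type=float, default=0.95,
                        help="Similarity threshold for detecting similar documents")
    parser.add_argument("--skip-collections", type=str, nargs="+", default=[],
                        help="Collections to skip (e.g., already cleaned ones)")
    parser.add_argument("--suffix", type=str, default="_clean",
                        help="Suffix to add to cleaned collection names")
    parser.add_argument("--skip-summarization", action="store_true",
                        help="Skip the summarization step")
    return parser


# Parse a sample argument list instead of sys.argv.
args = build_parser().parse_args([
    "--host", "localhost",
    "--similarity-threshold", "0.9",
    "--skip-collections", "collection1", "collection2",
    "--skip-summarization",
])
print(args)
```

Note that argparse converts dashes in option names to underscores, so --similarity-threshold is read back as args.similarity_threshold.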

Best Practices

  • Ensure the ChromaDB server is running before executing this function
  • Collections whose names end with the --suffix value are skipped automatically; use --skip-collections for any other collections that should not be processed
  • Adjust --similarity-threshold based on your data characteristics (higher values are stricter, so fewer documents are treated as similar)
  • The function sleeps for 1 second after each successfully cleaned collection to avoid overwhelming the server
  • Errors in individual collections are caught and printed to stdout but don't stop the entire process
  • Monitor disk space as cleaned collections are created as new collections rather than modifying existing ones
  • Consider using --skip-summarization for faster processing if summarization is not needed
  • The function expects a clean_collection function to be defined elsewhere in the module
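To make the effect of --similarity-threshold concrete, here is a minimal sketch of a threshold check. The source does not say which similarity measure clean_collection uses, so cosine similarity, the >= comparison, and the names cosine_similarity and is_similar are all assumptions made for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_similar(a: Sequence[float], b: Sequence[float], threshold: float) -> bool:
    # Assumed rule: two documents count as similar when their score
    # reaches the threshold.
    return cosine_similarity(a, b) >= threshold


# Two made-up embedding vectors whose cosine similarity is roughly 0.949.
a = [1.0, 0.0]
b = [0.9, 0.3]

loose = is_similar(a, b, threshold=0.90)   # pair is flagged as similar
strict = is_similar(a, b, threshold=0.95)  # pair is kept as distinct
```

Under these assumptions, raising the threshold from 0.90 to 0.95 changes the outcome for this pair, which is the sense in which higher values are stricter.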

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function main_v50 90.7% similar

    Command-line interface function that orchestrates a ChromaDB collection cleaning pipeline by removing duplicate and similar documents through hashing and similarity screening.

    From: /tf/active/vicechatdev/chromadb-cleanup/main copy.py
  • function clean_collection 77.9% similar

    Cleans a ChromaDB collection by removing duplicate and similar documents using hash-based and similarity-based deduplication techniques, then saves the cleaned data to a new collection.

    From: /tf/active/vicechatdev/chromadb-cleanup/main.py
  • function reset_collection 67.8% similar

    Deletes an existing ChromaDB collection and logs the operation, requiring an application restart to recreate the collection.

    From: /tf/active/vicechatdev/docchat/reset_collection.py
  • function main_v32 67.4% similar

    Entry point function that executes a comprehensive test suite for Chroma DB collections, including collection listing and creation tests, followed by troubleshooting suggestions.

    From: /tf/active/vicechatdev/test_chroma_collections.py
  • function test_collection_creation 65.9% similar

    A diagnostic test function that verifies Chroma DB functionality by creating a test collection, adding a document, querying it, and cleaning up.

    From: /tf/active/vicechatdev/test_chroma_collections.py