
function main_v49

Maturity: 40

Command-line interface function that orchestrates a ChromaDB collection cleaning pipeline by removing duplicate and similar documents through hashing and similarity screening.

File: /tf/active/vicechatdev/chromadb-cleanup/main copy.py
Lines: 18 - 67
Complexity: moderate

Purpose

This is the main entry point for a ChromaDB data cleaning utility. It parses command-line arguments, initializes the cleaning components (HashCleaner and SimilarityCleaner), loads documents from a ChromaDB collection, removes exact duplicates by hashing and near-duplicates by similarity screening against a configurable threshold, and saves the cleaned data to an output collection. If --output-collection is not given, the output collection is named <collection>_cleaned, so the input collection is not overwritten. The clustering/summarization step is currently commented out, so --num-clusters and --skip-summarization are parsed and stored on the config but are not used by any active step in main().
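
The exact-duplicate stage can be illustrated with a minimal sketch. HashCleaner's actual implementation is not shown on this page, so everything here is an assumption for illustration: the helper name dedupe_by_hash is hypothetical, documents are assumed to be dicts with a "text" key, and SHA-256 is assumed as the hash.

```python
import hashlib


def dedupe_by_hash(docs):
    """Keep the first document seen for each distinct text hash.

    Hypothetical helper: assumes each document is a dict with a "text" key.
    """
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


docs = [
    {"id": "a", "text": "hello world"},
    {"id": "b", "text": "hello world"},  # exact duplicate of "a"
    {"id": "c", "text": "goodbye"},
]
print([d["id"] for d in dedupe_by_hash(docs)])  # ['a', 'c']
```

Hashing catches only byte-identical text, which is why a separate similarity stage follows it in the pipeline.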

Source Code

def main():
    # Parse command line arguments
    parser = argparse.ArgumentParser(description='Clean up ChromaDB collection')
    parser.add_argument('--collection', type=str, required=True, help='Name of the ChromaDB collection')
    parser.add_argument('--host', type=str, default='vice_chroma', help='ChromaDB host')
    parser.add_argument('--port', type=int, default=8000, help='ChromaDB port')
    parser.add_argument('--similarity-threshold', type=float, default=0.90, 
                        help='Similarity threshold for detecting similar documents')
    parser.add_argument('--num-clusters', type=int, default=10, 
                        help='Number of clusters for clustering')
    parser.add_argument('--skip-summarization', action='store_true', 
                        help='Skip the summarization step')
    parser.add_argument('--output-collection', type=str, default=None,
                        help="Output collection name (if not specified, defaults to '<collection>_cleaned')")
    
    args = parser.parse_args()
    
    # Create config object with command line arguments
    config = Config()
    config.chroma_collection = args.collection
    config.chroma_host = args.host
    config.chroma_port = args.port
    config.similarity_threshold = args.similarity_threshold
    config.num_clusters = args.num_clusters
    config.skip_summarization = args.skip_summarization
    
    output_collection = args.output_collection or f"{config.chroma_collection}_cleaned"
    
    # Initialize cleaners
    hash_cleaner = HashCleaner(config)
    similarity_cleaner = SimilarityCleaner(config)

    # Load data from ChromaDB
    data = load_data_from_chromadb(config)
    print(f"Loaded {len(data)} documents from ChromaDB collection '{config.chroma_collection}'")

    # Step 1: Remove identical text chunks using hashing
    cleaned_data_hash = hash_cleaner.clean(data)

    # Step 2: Remove nearly similar text chunks using similarity screening
    cleaned_data_similarity = similarity_cleaner.clean(cleaned_data_hash)

    # Step 3: Cluster and summarize similar text chunks (currently disabled; data passes through unchanged)
    #text_clusterer = TextClusterer(config)
    #clustered_data = text_clusterer.cluster(cleaned_data_similarity)
    clustered_data = cleaned_data_similarity

    # Save cleaned and enriched data back to ChromaDB
    save_data_to_chromadb(clustered_data, config, output_collection)
    print(f"Saved {len(clustered_data)} documents to ChromaDB collection '{output_collection}'")

Return Value

This function returns None (implicit). It performs side effects by reading from and writing to ChromaDB collections, and prints status messages to stdout indicating the number of documents loaded and saved.

Dependencies

  • argparse
  • chromadb
  • typing

Required Imports

import argparse
import chromadb
from chromadb.config import Settings
from typing import List, Dict, Any
from src.cleaners.hash_cleaner import HashCleaner
from src.cleaners.similarity_cleaner import SimilarityCleaner
from src.config import Config

Conditional/Optional Imports

These imports are only needed under specific conditions:

from src.clustering.text_clusterer import TextClusterer

Condition: only if clustering/summarization functionality is uncommented and enabled

Optional
from src.cleaners.combined_cleaner import CombinedCleaner

Condition: imported in source file but not used in current implementation

Optional
from src.utils.hash_utils import hash_text

Condition: imported in source file but not directly used in main function

Optional
from src.utils.similarity_utils import calculate_similarity

Condition: imported in source file but not directly used in main function

Optional
import os

Condition: imported in source file but not used in current implementation

Optional
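
If the clustering step is re-enabled, one way to keep TextClusterer optional is to guard the import and fall back to a pass-through. This is only a sketch: the helper name maybe_cluster is hypothetical, and the idea that --skip-summarization should also gate clustering is an assumption. The TextClusterer(config).cluster(...) call mirrors the commented-out lines in the source.

```python
# Guarded import: the pipeline still runs when the clustering module is absent.
try:
    from src.clustering.text_clusterer import TextClusterer
except ImportError:
    TextClusterer = None


def maybe_cluster(docs, config, skip_summarization=False):
    """Return docs unchanged when clustering is skipped or unavailable.

    Hypothetical helper; gating on skip_summarization is an assumption.
    """
    if skip_summarization or TextClusterer is None:
        return docs
    return TextClusterer(config).cluster(docs)


print(maybe_cluster(["doc one", "doc two"], config=None, skip_summarization=True))
# ['doc one', 'doc two']
```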

Usage Example

# Run from command line:
# python script.py --collection my_documents --host localhost --port 8000 --similarity-threshold 0.85 --output-collection my_documents_clean

# The script's entry-point guard; main() reads its arguments from sys.argv, so it is meant to be run from the CLI:
if __name__ == '__main__':
    main()

# Example with minimal arguments:
# python script.py --collection my_collection

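# Example with all options (--num-clusters and --skip-summarization are accepted, but the clustering step is currently disabled):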
# python script.py --collection my_docs --host vice_chroma --port 8000 --similarity-threshold 0.90 --num-clusters 10 --skip-summarization --output-collection cleaned_docs
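
To check how the arguments resolve without touching ChromaDB, the parser can be exercised with an explicit argument list, which argparse's parse_args accepts in place of sys.argv. The sketch below reproduces a subset of the arguments from the source. The helper names build_parser and resolve_output_collection are hypothetical and do not exist in the original file.

```python
import argparse


def build_parser():
    """Hypothetical helper reproducing a subset of main()'s arguments."""
    parser = argparse.ArgumentParser(description="Clean up ChromaDB collection")
    parser.add_argument("--collection", type=str, required=True)
    parser.add_argument("--similarity-threshold", type=float, default=0.90)
    parser.add_argument("--output-collection", type=str, default=None)
    return parser


def resolve_output_collection(args):
    """Same fallback as main(): default to '<collection>_cleaned'."""
    return args.output_collection or f"{args.collection}_cleaned"


args = build_parser().parse_args(["--collection", "my_collection"])
print(resolve_output_collection(args))  # my_collection_cleaned

args = build_parser().parse_args(
    ["--collection", "my_docs", "--output-collection", "cleaned_docs"]
)
print(resolve_output_collection(args))  # cleaned_docs
```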

Best Practices

  • Always specify the --collection argument as it is required
  • Ensure ChromaDB server is running before executing this function
  • Start with default similarity threshold (0.90) and adjust based on results
  • The input collection is preserved by default because output goes to <collection>_cleaned; use --output-collection to choose a different output name
  • Monitor memory usage when processing large collections as all data is loaded into memory
  • The clustering/summarization step is currently commented out; to enable it, uncomment the TextClusterer lines and make sure TextClusterer is imported
  • Consider backing up your ChromaDB collection before running cleanup operations
  • Review the number of documents before and after cleaning to ensure expected behavior
  • Lower similarity thresholds (e.g., 0.80) remove more documents and risk discarding distinct content
  • Higher similarity thresholds (e.g., 0.95) will be more conservative in removing documents

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function main_v58 90.7% similar

    Command-line interface function that orchestrates the cleaning of ChromaDB collections by removing duplicates and similar documents, with options to skip collections and customize the cleaning process.

    From: /tf/active/vicechatdev/chromadb-cleanup/main.py
  • function clean_collection 77.7% similar

    Cleans a ChromaDB collection by removing duplicate and similar documents using hash-based and similarity-based deduplication techniques, then saves the cleaned data to a new collection.

    From: /tf/active/vicechatdev/chromadb-cleanup/main.py
  • function main_v31 65.0% similar

    Entry point function that executes a comprehensive test suite for Chroma DB collections, including collection listing and creation tests, followed by troubleshooting suggestions.

    From: /tf/active/vicechatdev/test_chroma_collections.py
  • class CombinedCleaner 62.9% similar

    A document cleaner that combines hash-based and similarity-based cleaning approaches to remove both exact and near-duplicate documents in a two-stage process.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/combined_cleaner.py
  • class HashCleaner 61.4% similar

    A document deduplication cleaner that removes documents with identical content by comparing hash values of document text.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/hash_cleaner.py