function main_v49
Command-line interface function that orchestrates a ChromaDB collection cleaning pipeline by removing duplicate and similar documents through hashing and similarity screening.
File: /tf/active/vicechatdev/chromadb-cleanup/main copy.py
Lines: 18 - 67
Complexity: moderate
Purpose
This is the main entry point for a ChromaDB data cleaning utility. It parses command-line arguments, initializes cleaning components (HashCleaner and SimilarityCleaner), loads documents from a ChromaDB collection, removes duplicates and near-duplicates based on configurable thresholds, and saves the cleaned data to an output collection. The function supports optional clustering/summarization (currently commented out) and allows users to specify similarity thresholds and output collection names.
Source Code
def main():
    # Parse command line arguments
    parser = argparse.ArgumentParser(description='Clean up ChromaDB collection')
    parser.add_argument('--collection', type=str, required=True, help='Name of the ChromaDB collection')
    parser.add_argument('--host', type=str, default='vice_chroma', help='ChromaDB host')
    parser.add_argument('--port', type=int, default=8000, help='ChromaDB port')
    parser.add_argument('--similarity-threshold', type=float, default=0.90,
                        help='Similarity threshold for detecting similar documents')
    parser.add_argument('--num-clusters', type=int, default=10,
                        help='Number of clusters for clustering')
    parser.add_argument('--skip-summarization', action='store_true',
                        help='Skip the summarization step')
    parser.add_argument('--output-collection', type=str, default=None,
                        help='Output collection name (if not specified, will overwrite input collection)')
    args = parser.parse_args()

    # Create config object with command line arguments
    config = Config()
    config.chroma_collection = args.collection
    config.chroma_host = args.host
    config.chroma_port = args.port
    config.similarity_threshold = args.similarity_threshold
    config.num_clusters = args.num_clusters
    config.skip_summarization = args.skip_summarization
    output_collection = args.output_collection or f"{config.chroma_collection}_cleaned"

    # Initialize cleaners
    hash_cleaner = HashCleaner(config)
    similarity_cleaner = SimilarityCleaner(config)

    # Load data from ChromaDB
    data = load_data_from_chromadb(config)
    print(f"Loaded {len(data)} documents from ChromaDB collection '{config.chroma_collection}'")

    # Step 1: Remove identical text chunks using hashing
    cleaned_data_hash = hash_cleaner.clean(data)

    # Step 2: Remove nearly similar text chunks using similarity screening
    cleaned_data_similarity = similarity_cleaner.clean(cleaned_data_hash)

    # Step 3: Cluster and summarize similar text chunks
    # text_clusterer = TextClusterer(config)
    # clustered_data = text_clusterer.cluster(cleaned_data_similarity)
    clustered_data = cleaned_data_similarity

    # Save cleaned and enriched data back to ChromaDB
    save_data_to_chromadb(clustered_data, config, output_collection)
    print(f"Saved {len(clustered_data)} documents to ChromaDB collection '{output_collection}'")
Return Value
This function returns None (implicit). It performs side effects by reading from and writing to ChromaDB collections, and prints status messages to stdout indicating the number of documents loaded and saved.
Dependencies
argparse, chromadb, typing
Required Imports
import argparse
import chromadb
from chromadb.config import Settings
from typing import List, Dict, Any
from src.cleaners.hash_cleaner import HashCleaner
from src.cleaners.similarity_cleaner import SimilarityCleaner
from src.config import Config
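Config is imported from src.config, but its definition is not shown here. Judging only from the attributes main() assigns, it exposes at least the following fields; this is an illustrative sketch, not the actual class:

class Config:
    # Fields assigned by main(); the defaults shown are illustrative only.
    chroma_collection: str = ''
    chroma_host: str = 'vice_chroma'
    chroma_port: int = 8000
    similarity_threshold: float = 0.90
    num_clusters: int = 10
    skip_summarization: bool = False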
Conditional/Optional Imports
These imports are only needed under specific conditions:
- from src.clustering.text_clusterer import TextClusterer (optional; only needed if the clustering/summarization step is uncommented and enabled)
- from src.cleaners.combined_cleaner import CombinedCleaner (optional; imported in the source file but not used in the current implementation)
- from src.utils.hash_utils import hash_text (optional; imported in the source file but not directly used in main)
- from src.utils.similarity_utils import calculate_similarity (optional; imported in the source file but not directly used in main)
- import os (optional; imported in the source file but not used in the current implementation)
Usage Example
# Run from command line:
# python script.py --collection my_documents --host localhost --port 8000 --similarity-threshold 0.85 --output-collection my_documents_clean
# Entry point guard in the source file (main() is designed to be invoked from the command line):
if __name__ == '__main__':
    main()
# Example with minimal arguments:
# python script.py --collection my_collection
# Example with all options:
# python script.py --collection my_docs --host vice_chroma --port 8000 --similarity-threshold 0.90 --num-clusters 10 --skip-summarization --output-collection cleaned_docs
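If you need to drive main() from Python, for example in a test, one approach is to emulate the command line before calling it. This is a sketch that relies only on argparse reading sys.argv; it is not part of the source, and the collection names are placeholders:

import sys

# Emulate the CLI arguments, then call main().
sys.argv = [
    'main.py',
    '--collection', 'my_documents',
    '--similarity-threshold', '0.85',
    '--output-collection', 'my_documents_clean',
]
main()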
Best Practices
- Always specify the --collection argument as it is required
- Ensure ChromaDB server is running before executing this function
- Start with default similarity threshold (0.90) and adjust based on results
- Use --output-collection to preserve original data during testing
- Monitor memory usage when processing large collections as all data is loaded into memory
- The clustering/summarization step is currently commented out; uncomment if needed
- Consider backing up your ChromaDB collection before running cleanup operations (see the sketch after this list)
- Review the number of documents before and after cleaning to ensure expected behavior
- Lower similarity thresholds (e.g., 0.80) will remove more documents but may lose unique content
- Higher similarity thresholds (e.g., 0.95) will be more conservative in removing documents
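One way to take such a backup is to copy the source collection into a separate backup collection before cleaning. This sketch assumes the standard chromadb HttpClient API and uses placeholder collection names:

import chromadb

# Copy every record from the source collection into a backup collection.
client = chromadb.HttpClient(host='vice_chroma', port=8000)
source = client.get_collection(name='my_documents')
records = source.get(include=['documents', 'metadatas', 'embeddings'])
backup = client.get_or_create_collection(name='my_documents_backup')
backup.add(
    ids=records['ids'],
    documents=records['documents'],
    metadatas=records['metadatas'],
    embeddings=records['embeddings'],
)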
Similar Components
Components with related functionality, identified via AI-powered semantic similarity:
- function main_v58 (90.7% similar)
- function clean_collection (77.7% similar)
- function main_v31 (65.0% similar)
- class CombinedCleaner (62.9% similar)
- class HashCleaner (61.4% similar)