function clean_collection

Maturity: 52

Cleans a ChromaDB collection by removing duplicate and similar documents using hash-based and similarity-based deduplication techniques, then saves the cleaned data to a new collection.

File: /tf/active/vicechatdev/chromadb-cleanup/main.py
Lines: 71-120
Complexity: moderate

Purpose

This function performs data cleaning on a ChromaDB vector database collection to reduce redundancy and improve data quality. It applies two-stage deduplication: first removing exact duplicates using hash comparison, then removing near-duplicates using similarity scoring. The cleaned dataset is saved to a specified output collection. This is useful for maintaining clean vector databases, reducing storage costs, and improving retrieval quality by eliminating redundant information.
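
The two-stage pipeline can be pictured with a minimal, self-contained sketch. This illustrates the general technique only, not the project's actual HashCleaner/SimilarityCleaner implementations, and the document shape ({"text": ..., "embedding": ...}) is an assumption:

import hashlib

def dedup_exact(docs):
    """Stage 1 sketch: drop documents whose text hashes to an already-seen value."""
    seen, kept = set(), []
    for doc in docs:  # assumed shape: {"text": str, "embedding": list[float]}
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sum(x * x for x in a) ** 0.5 * sum(y * y for y in b) ** 0.5
    return dot / norm if norm else 0.0

def dedup_similar(docs, threshold=0.95):
    """Stage 2 sketch: keep a document only if no already-kept document is too similar.

    This pairwise scan is O(n^2); production cleaners typically batch the
    comparisons or use approximate nearest-neighbour search instead.
    """
    kept = []
    for doc in docs:
        if all(_cosine(doc["embedding"], k["embedding"]) < threshold for k in kept):
            kept.append(doc)
    return kept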

Source Code

def clean_collection(collection_name, output_collection, host, port, 
                     similarity_threshold=0.95, skip_summarization=True):
    """
    Clean a single collection.
    
    Args:
        collection_name: Name of the collection to clean
        output_collection: Name of the output collection
        host: ChromaDB host
        port: ChromaDB port
        similarity_threshold: Similarity threshold for detecting similar documents
        skip_summarization: Whether to skip the summarization step
    """
    # Create config object
    config = Config()
    config.chroma_collection = collection_name
    config.chroma_host = host
    config.chroma_port = port
    config.similarity_threshold = similarity_threshold
    config.skip_summarization = skip_summarization
    
    # Initialize cleaners
    hash_cleaner = HashCleaner(config)
    similarity_cleaner = SimilarityCleaner(config)

    # Load data from ChromaDB
    data = load_data_from_chromadb(config)
    
    if not data:
        print(f"No documents found in collection '{collection_name}', skipping")
        return
    
    print(f"Cleaning collection '{collection_name}' with {len(data)} documents")

    # Step 1: Remove identical text chunks using hashing
    cleaned_data_hash = hash_cleaner.clean(data)

    # Step 2: Remove nearly similar text chunks using similarity screening
    cleaned_data_similarity = similarity_cleaner.clean(cleaned_data_hash)

    # Skip clustering and summarization as requested
    clustered_data = cleaned_data_similarity

    # Save cleaned data back to ChromaDB
    save_data_to_chromadb(clustered_data, config, output_collection)
    print(f"Saved {len(clustered_data)} documents to collection '{output_collection}'")
    
    # Print reduction stats
    reduction = (1 - len(clustered_data) / len(data)) * 100 if data else 0
    print(f"Reduced collection size by {reduction:.2f}% (from {len(data)} to {len(clustered_data)} documents)")

Parameters

Name                  Type  Default  Kind
collection_name       -     -        positional_or_keyword
output_collection     -     -        positional_or_keyword
host                  -     -        positional_or_keyword
port                  -     -        positional_or_keyword
similarity_threshold  -     0.95     positional_or_keyword
skip_summarization    -     True     positional_or_keyword

Parameter Details

collection_name: String name of the source ChromaDB collection to clean. Must be an existing collection in the ChromaDB instance.

output_collection: String name of the destination ChromaDB collection where cleaned documents will be saved. Will be created if it doesn't exist.

host: String hostname or IP address of the ChromaDB server (e.g., 'localhost', '127.0.0.1').

port: Integer or string port number where ChromaDB is running (e.g., 8000).

similarity_threshold: Float value between 0.0 and 1.0 representing the cosine similarity threshold for detecting near-duplicate documents. Documents whose similarity exceeds the threshold are treated as duplicates. Default is 0.95 (95% similar). A higher threshold requires documents to be nearly identical before removal, so it removes fewer documents; a lower threshold deduplicates more aggressively.
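
A quick worked example of the threshold's effect, assuming cosine similarity over embedding vectors (the values are illustrative):

import numpy as np

a = np.array([1.0, 0.0, 0.0])
b = np.array([0.98, 0.20, 0.0])

# cos = a·b / (|a| |b|) = 0.98 / (1.0 * 1.0002) ≈ 0.980
cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(round(cos, 3))   # 0.98
print(cos >= 0.95)     # True: removed as a near-duplicate at the default threshold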

skip_summarization: Boolean flag to skip the summarization/clustering step. When True (default), only deduplication is performed. When False, additional clustering and summarization would be applied.

Return Value

This function returns None. It performs side effects by printing progress information to stdout and saving cleaned data to the specified ChromaDB output collection. Console output includes the number of documents processed, saved, and the percentage reduction in collection size.
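
For a hypothetical run that reduces 1200 source documents to 950, the console output would look like this (the counts are illustrative; the format strings come directly from the source above):

Cleaning collection 'my_documents' with 1200 documents
Saved 950 documents to collection 'my_documents_cleaned'
Reduced collection size by 20.83% (from 1200 to 950 documents)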

Dependencies

  • chromadb
  • tqdm

Required Imports

from src.cleaners.hash_cleaner import HashCleaner
from src.cleaners.similarity_cleaner import SimilarityCleaner
from src.config import Config

Note: the function also calls load_data_from_chromadb and save_data_to_chromadb, which are not imported here; they are presumably defined alongside it in main.py.

Usage Example

# Ensure ChromaDB is running on localhost:8000
# and has a collection named 'my_documents'

from your_module import clean_collection

# Basic usage with defaults
clean_collection(
    collection_name='my_documents',
    output_collection='my_documents_cleaned',
    host='localhost',
    port=8000
)

# Custom similarity threshold for stricter deduplication
clean_collection(
    collection_name='research_papers',
    output_collection='research_papers_deduplicated',
    host='192.168.1.100',
    port=8000,
    similarity_threshold=0.90,  # More aggressive deduplication
    skip_summarization=True
)

# With summarization enabled
clean_collection(
    collection_name='articles',
    output_collection='articles_cleaned_summarized',
    host='localhost',
    port=8000,
    similarity_threshold=0.95,
    skip_summarization=False  # Enable clustering/summarization
)

Best Practices

  • Always ensure the ChromaDB server is running and accessible before calling this function
  • Verify the source collection exists and contains documents before cleaning (see the count-check sketch after this list)
  • Choose an appropriate similarity_threshold based on your use case: higher values (0.95-0.99) for strict deduplication, lower values (0.80-0.90) for more aggressive cleaning
  • Use a different output_collection name than the source to preserve original data until cleaning is verified
  • Monitor memory usage when processing large collections as all documents are loaded into memory
  • The function prints progress to stdout, so consider redirecting output if running in automated pipelines
  • Test with a small subset first to validate the similarity_threshold produces desired results
  • Be aware that the function modifies the output collection, potentially overwriting existing data
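
A minimal sketch of that count check, using the chromadb Python client (collection names and host/port are illustrative, and clean_collection is imported as in the Usage Example):

import chromadb

client = chromadb.HttpClient(host="localhost", port=8000)

# get_collection raises if the source collection does not exist
source = client.get_collection("my_documents")
print(f"Source documents: {source.count()}")

clean_collection(
    collection_name="my_documents",
    output_collection="my_documents_cleaned",
    host="localhost",
    port=8000,
)

cleaned = client.get_collection("my_documents_cleaned")
print(f"Cleaned documents: {cleaned.count()}")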

Similar Components

Components with related functionality, ranked by semantic similarity:

  • function main_v58 77.9% similar

    Command-line interface function that orchestrates the cleaning of ChromaDB collections by removing duplicates and similar documents, with options to skip collections and customize the cleaning process.

    From: /tf/active/vicechatdev/chromadb-cleanup/main.py
  • function main_v49 77.7% similar

    Command-line interface function that orchestrates a ChromaDB collection cleaning pipeline by removing duplicate and similar documents through hashing and similarity screening.

    From: /tf/active/vicechatdev/chromadb-cleanup/main copy.py
  • function reset_collection 63.3% similar

    Deletes an existing ChromaDB collection and logs the operation, requiring an application restart to recreate the collection.

    From: /tf/active/vicechatdev/docchat/reset_collection.py
  • function test_collection_creation 58.2% similar

    A diagnostic test function that verifies ChromaDB functionality by creating a test collection, adding a document, querying it, and cleaning up.

    From: /tf/active/vicechatdev/test_chroma_collections.py
  • class SimilarityCleaner 57.8% similar

    A document cleaning class that identifies and removes duplicate or highly similar documents based on embedding vector similarity, keeping only representative documents from each similarity group.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/similarity_cleaner.py