function clean_collection
Cleans a ChromaDB collection by removing duplicate and similar documents using hash-based and similarity-based deduplication techniques, then saves the cleaned data to a new collection.
/tf/active/vicechatdev/chromadb-cleanup/main.py
71 - 120
moderate
Purpose
This function performs data cleaning on a ChromaDB vector database collection to reduce redundancy and improve data quality. It applies two-stage deduplication: first removing exact duplicates using hash comparison, then removing near-duplicates using similarity scoring. The cleaned dataset is saved to a specified output collection. This is useful for maintaining clean vector databases, reducing storage costs, and improving retrieval quality by eliminating redundant information.
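The first stage, exact-duplicate removal by hashing, can be sketched as follows. This is a minimal illustration and not the actual `HashCleaner` implementation, which is not shown on this page. The document shape (dicts with `id` and `text` keys) and the helper name `dedupe_exact` are assumptions made for the example.

```python
import hashlib


def dedupe_exact(docs):
    """Keep the first occurrence of each distinct text (hypothetical helper)."""
    seen = set()
    unique = []
    for doc in docs:
        # Identical strings always produce identical SHA-256 digests,
        # so a set of digests is enough to spot exact repeats.
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


# Assumed document shape: one dict per chunk with "id" and "text".
docs = [
    {"id": "a", "text": "ChromaDB stores embeddings."},
    {"id": "b", "text": "ChromaDB stores embeddings."},
    {"id": "c", "text": "Vector search finds neighbours."},
]
unique_docs = dedupe_exact(docs)
print([d["id"] for d in unique_docs])  # ['a', 'c']
```

Hashing only catches byte-for-byte identical text; chunks that differ by whitespace or punctuation pass through to the similarity stage.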
Source Code
def clean_collection(collection_name, output_collection, host, port,
                     similarity_threshold=0.95, skip_summarization=True):
    """
    Clean a single collection.

    Args:
        collection_name: Name of the collection to clean
        output_collection: Name of the output collection
        host: ChromaDB host
        port: ChromaDB port
        similarity_threshold: Similarity threshold for detecting similar documents
        skip_summarization: Whether to skip the summarization step
    """
    # Create config object
    config = Config()
    config.chroma_collection = collection_name
    config.chroma_host = host
    config.chroma_port = port
    config.similarity_threshold = similarity_threshold
    config.skip_summarization = skip_summarization

    # Initialize cleaners
    hash_cleaner = HashCleaner(config)
    similarity_cleaner = SimilarityCleaner(config)

    # Load data from ChromaDB
    data = load_data_from_chromadb(config)
    if not data:
        print(f"No documents found in collection '{collection_name}', skipping")
        return

    print(f"Cleaning collection '{collection_name}' with {len(data)} documents")

    # Step 1: Remove identical text chunks using hashing
    cleaned_data_hash = hash_cleaner.clean(data)

    # Step 2: Remove nearly similar text chunks using similarity screening
    cleaned_data_similarity = similarity_cleaner.clean(cleaned_data_hash)

    # Skip clustering and summarization as requested
    clustered_data = cleaned_data_similarity

    # Save cleaned data back to ChromaDB
    save_data_to_chromadb(clustered_data, config, output_collection)
    print(f"Saved {len(clustered_data)} documents to collection '{output_collection}'")

    # Print reduction stats
    reduction = (1 - len(clustered_data) / len(data)) * 100 if data else 0
    print(f"Reduced collection size by {reduction:.2f}% (from {len(data)} to {len(clustered_data)} documents)")
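The reduction figure printed on the last line is plain arithmetic on the two document counts. A small worked sketch, using a hypothetical helper name `reduction_percent` and made-up counts:

```python
def reduction_percent(original_count, cleaned_count):
    """Percentage of documents removed, mirroring the formula in the source."""
    if original_count == 0:
        # Guard against division by zero for an empty collection.
        return 0.0
    return (1 - cleaned_count / original_count) * 100


# Example counts are illustrative only: 200 documents in, 150 kept.
print(f"{reduction_percent(200, 150):.2f}%")  # 25.00%
```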
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| collection_name | - | - | positional_or_keyword |
| output_collection | - | - | positional_or_keyword |
| host | - | - | positional_or_keyword |
| port | - | - | positional_or_keyword |
| similarity_threshold | - | 0.95 | positional_or_keyword |
| skip_summarization | - | True | positional_or_keyword |
Parameter Details
collection_name: String name of the source ChromaDB collection to clean. Must be an existing collection in the ChromaDB instance.
output_collection: String name of the destination ChromaDB collection where cleaned documents will be saved. Will be created if it doesn't exist.
host: String hostname or IP address of the ChromaDB server (e.g., 'localhost', '127.0.0.1').
port: Integer or string port number where ChromaDB is running (e.g., 8000).
similarity_threshold: Float value between 0.0 and 1.0 representing the cosine similarity threshold for detecting near-duplicate documents. Documents whose similarity exceeds this threshold are treated as duplicates. Default is 0.95. Higher values remove fewer documents because only closer matches qualify; lower values remove more.
skip_summarization: Boolean flag intended to control the summarization/clustering step. The value is stored on the config object, but the function body shown under Source Code never branches on it: clustering and summarization are bypassed whether the flag is True (default) or False.
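To see how `similarity_threshold` changes the outcome, here is a minimal sketch that assumes cosine similarity over embedding vectors and a greedy keep-first strategy. The internals of `SimilarityCleaner` are not shown on this page, so the function names and the toy 2-D embeddings below are illustrative assumptions only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dedupe_similar(embeddings, threshold):
    """Return indices to keep: an item is dropped if its similarity to
    any already-kept item is above the threshold (hypothetical helper)."""
    kept = []
    for idx, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[k]) <= threshold for k in kept):
            kept.append(idx)
    return kept


# Toy embeddings; items 0 and 1 have cosine similarity of about 0.949.
embeddings = [[1.0, 0.0], [0.9, 0.3], [0.0, 1.0]]

print(dedupe_similar(embeddings, 0.95))  # [0, 1, 2]
print(dedupe_similar(embeddings, 0.90))  # [0, 2]
```

At 0.95 the pair scoring roughly 0.949 falls just under the threshold and both items survive; at 0.90 the second item is dropped. This is why lowering the threshold removes more documents.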
Return Value
This function returns None. Its results are side effects: it prints progress information to stdout and saves the cleaned data to the specified ChromaDB output collection. Console output includes the number of documents loaded, the number saved, and the percentage reduction in collection size. If the source collection is empty, the function prints a message and returns without writing anything.
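Because the only feedback is printed text, a caller that wants the messages as data has to capture stdout. A minimal sketch using the standard library's `contextlib.redirect_stdout`; the wrapper `run_quietly` and the stand-in `fake_clean` are hypothetical and exist only so the example runs without a ChromaDB server.

```python
import contextlib
import io


def run_quietly(func, *args, **kwargs):
    """Call func and return (result, text it printed to stdout)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        result = func(*args, **kwargs)
    return result, buffer.getvalue()


# Stand-in for clean_collection: prints a message and returns None.
def fake_clean(collection_name):
    print(f"Cleaning collection '{collection_name}' with 3 documents")


result, log_text = run_quietly(fake_clean, "my_documents")
```

In real use, `clean_collection` and its arguments would be passed in place of `fake_clean`, and `log_text` could be forwarded to a logger.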
Dependencies
- chromadb
- tqdm
Required Imports
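from src.cleaners.hash_cleaner import HashCleaner
from src.cleaners.similarity_cleaner import SimilarityCleaner
from src.config import Config
# load_data_from_chromadb and save_data_to_chromadb must also be in scope;
# this page does not show where they are defined.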
Usage Example
# Ensure ChromaDB is running on localhost:8000
# and has a collection named 'my_documents'
from your_module import clean_collection

# Basic usage with defaults
clean_collection(
    collection_name='my_documents',
    output_collection='my_documents_cleaned',
    host='localhost',
    port=8000
)

# Lower similarity threshold for more aggressive deduplication
clean_collection(
    collection_name='research_papers',
    output_collection='research_papers_deduplicated',
    host='192.168.1.100',
    port=8000,
    similarity_threshold=0.90,  # Flags more pairs as near-duplicates than 0.95
    skip_summarization=True
)

# Passing skip_summarization=False
# Note: the flag is stored on the config, but clean_collection as shown
# still bypasses clustering and summarization.
clean_collection(
    collection_name='articles',
    output_collection='articles_cleaned_summarized',
    host='localhost',
    port=8000,
    similarity_threshold=0.95,
    skip_summarization=False
)
Best Practices
- Always ensure the ChromaDB server is running and accessible before calling this function
- Verify the source collection exists and contains documents before cleaning
- Choose an appropriate similarity_threshold based on your use case: higher values (0.95-0.99) remove only very close matches, while lower values (0.80-0.90) remove more documents
- Use a different output_collection name than the source to preserve original data until cleaning is verified
- Monitor memory usage when processing large collections as all documents are loaded into memory
- The function prints progress to stdout, so consider redirecting output if running in automated pipelines
- Test with a small subset first to validate the similarity_threshold produces desired results
- Be aware that the function modifies the output collection, potentially overwriting existing data
Similar Components
AI-powered semantic similarity - components with related functionality:
- function main_v58: 77.9% similar
- function main_v49: 77.7% similar
- function reset_collection: 63.3% similar
- function test_collection_creation: 58.2% similar
- class SimilarityCleaner: 57.8% similar