🔍 Code Extractor

function save_data_to_chromadb_v1

Maturity: 50

Saves a list of document dictionaries to a ChromaDB collection, with support for batch processing, embeddings, and metadata storage.

File: /tf/active/vicechatdev/chromadb-cleanup/main.py
Lines: 168 - 239
Complexity: moderate

Purpose

This function provides a complete workflow for persisting document data to a ChromaDB vector database. It connects to a ChromaDB server over HTTP, deletes any existing collection with the target name, creates a fresh collection, and inserts the documents with their metadata in batches of 100. It is designed for document clustering pipelines: a document's cluster assignment, if present, is stored as a string in its metadata. Pre-computed embeddings are passed through when supplied; otherwise ChromaDB generates embeddings with the collection's default embedding function.

Source Code

def save_data_to_chromadb(data, config, collection_name=None):
    """
    Save documents to ChromaDB.
    
    Args:
        data: List of document dictionaries with 'id', 'text', and 'embedding' keys
        config: Configuration object
        collection_name: Name of the collection to save to (defaults to config value)
    """
    if not collection_name:
        collection_name = config.chroma_collection
    
    if not data:
        print(f"No data to save to collection '{collection_name}'")
        return
    
    # Connect to ChromaDB
    client = chromadb.HttpClient(
        host=config.chroma_host,
        port=config.chroma_port,
        settings=Settings(anonymized_telemetry=False)
    )
    
    # Delete collection if it exists already (to avoid conflicts)
    try:
        client.delete_collection(name=collection_name)
    except Exception:
        pass  # Most likely the collection doesn't exist yet; other errors are silenced here too
    
    # Create a new collection
    collection = client.create_collection(name=collection_name)
    
    # Prepare data for adding to collection
    ids = [doc['id'] for doc in data]
    documents = [doc['text'] for doc in data]
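    # Use pre-computed embeddings only when every document has one, so vectors stay aligned with ids
    has_embeddings = all('embedding' in doc for doc in data)
    embeddings = [doc['embedding'] for doc in data] if has_embeddings else []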
    # Copy each metadata dict so the caller's data is not modified
    metadatas = [dict(doc['metadata']) if isinstance(doc.get('metadata'), dict) else {} for doc in data]
    
    # Add cluster information to metadata if available
    for i, doc in enumerate(data):
        if 'cluster' in doc:
            metadatas[i]['cluster'] = str(doc['cluster'])
    
    # Add documents to collection in batches to avoid overwhelming the server
    batch_size = 100
    
    for i in range(0, len(ids), batch_size):
        end_idx = min(i + batch_size, len(ids))
        batch_ids = ids[i:end_idx]
        batch_documents = documents[i:end_idx]
        batch_metadatas = metadatas[i:end_idx]
        
        if embeddings and len(embeddings) >= end_idx:
            batch_embeddings = embeddings[i:end_idx]
            collection.add(
                ids=batch_ids,
                documents=batch_documents,
                embeddings=batch_embeddings,
                metadatas=batch_metadatas
            )
        else:
            # If embeddings weren't provided, ChromaDB will generate them
            collection.add(
                ids=batch_ids,
                documents=batch_documents,
                metadatas=batch_metadatas
            )
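
The batching loop above can be factored into a small helper. The following is a minimal sketch, not the author's code: `iter_batches` and `add_in_batches` are hypothetical names, and `collection` is assumed to be any object with an `add(ids=..., documents=..., metadatas=..., embeddings=...)` method, as the ChromaDB collection in the listing has.

```python
from typing import Any, Iterator, Optional, Sequence, Tuple


def iter_batches(total: int, batch_size: int = 100) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) index pairs that cover range(total) in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)


def add_in_batches(
    collection: Any,
    ids: Sequence[str],
    documents: Sequence[str],
    metadatas: Sequence[dict],
    embeddings: Optional[Sequence[Sequence[float]]] = None,
    batch_size: int = 100,
) -> int:
    """Call collection.add once per batch and return the number of batches sent."""
    batches = 0
    for start, end in iter_batches(len(ids), batch_size):
        kwargs = {
            "ids": list(ids[start:end]),
            "documents": list(documents[start:end]),
            "metadatas": list(metadatas[start:end]),
        }
        # Pass embeddings only when the caller supplied them for all documents
        if embeddings is not None:
            kwargs["embeddings"] = list(embeddings[start:end])
        collection.add(**kwargs)
        batches += 1
    return batches
```

Because the helper only relies on an `add` method, it can be exercised with a recording stand-in object before pointing it at a real server. Making `batch_size` a parameter is a design choice of this sketch; the original function hard-codes 100.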

Parameters

Name             Type  Default  Kind
data             -     -        positional_or_keyword
config           -     -        positional_or_keyword
collection_name  -     None     positional_or_keyword

Parameter Details

data: A list of dictionaries where each dictionary represents a document. Required keys: 'id' (unique identifier), 'text' (document content). Optional keys: 'embedding' (pre-computed vector embedding), 'metadata' (dictionary of additional metadata), 'cluster' (cluster assignment, stored as a string). Supply 'embedding' on every document or on none. If data is empty or None, the function prints a message and returns before connecting to ChromaDB.
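
The shape described above can be checked before any data is sent. This is a minimal sketch under stated assumptions: `validate_documents` is a hypothetical helper, not part of the original module, and the all-or-none rule for embeddings is a conservative choice made here.

```python
from typing import Any, Dict, List


def validate_documents(data: List[Dict[str, Any]]) -> bool:
    """Check the documented shape; return True when every document has an embedding.

    Raises ValueError on a missing required key, a duplicate id, or a mix of
    documents with and without embeddings.
    """
    seen = set()
    with_embedding = 0
    for index, doc in enumerate(data):
        for key in ("id", "text"):
            if key not in doc:
                raise ValueError(f"document {index} is missing required key '{key}'")
        if doc["id"] in seen:
            raise ValueError(f"duplicate id: {doc['id']!r}")
        seen.add(doc["id"])
        if "embedding" in doc:
            with_embedding += 1
    if 0 < with_embedding < len(data):
        raise ValueError("embeddings must be present on every document or on none")
    return len(data) > 0 and with_embedding == len(data)
```

Calling it just before `save_data_to_chromadb(data, config)` surfaces malformed input as a clear `ValueError` instead of a `KeyError` from inside the function or an error returned by the server.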

config: A configuration object that must expose three attributes: 'chroma_host' (ChromaDB server hostname), 'chroma_port' (ChromaDB server port number), and 'chroma_collection' (default collection name). This is typically an instance of the src.config.Config class, but any object with these attributes works.
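
Since only three attributes are read, a small dataclass is enough for scripts and tests. The sketch below is illustrative: `ChromaConfig` is a hypothetical stand-in, not the real `src.config.Config` class, and the default values are taken from the usage example in this page rather than from any project configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChromaConfig:
    """Hypothetical stand-in exposing the three attributes the function reads."""

    chroma_host: str = "localhost"
    chroma_port: int = 8000
    chroma_collection: str = "my_documents"


# Override only what differs from the defaults; the collection name here is made up
config = ChromaConfig(chroma_collection="clustered_docs")
```

Freezing the dataclass keeps the connection settings from being changed accidentally after the object is created.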

collection_name: Optional string naming the ChromaDB collection to save documents to. If omitted, None, or an empty string, it defaults to config.chroma_collection. The function deletes any existing collection with this name before creating a new one.
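
Because an existing collection with the same name is deleted, callers may want an explicit opt-in before that happens. The following is a sketch of one possible guard, not the function's actual behavior: `recreate_collection` and its `overwrite` flag are hypothetical, and it assumes `client.list_collections()` yields either collection objects with a `name` attribute or plain name strings, which appears to depend on the ChromaDB version.

```python
from typing import Any


def recreate_collection(client: Any, name: str, overwrite: bool = False) -> Any:
    """Create `name`; delete an existing collection first only when overwrite=True."""
    # Accept both collection objects (with .name) and plain strings
    existing = {getattr(item, "name", item) for item in client.list_collections()}
    if name in existing:
        if not overwrite:
            raise RuntimeError(
                f"collection '{name}' already exists; pass overwrite=True to replace it"
            )
        client.delete_collection(name=name)
    return client.create_collection(name=name)
```

With this guard, replacing data becomes a deliberate choice at the call site, for example `recreate_collection(client, 'custom_collection', overwrite=True)`.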

Return Value

This function returns None. Its side effects are replacing the named ChromaDB collection and adding the documents to it. The only message it prints is a notice to stdout when no data is provided; successful saves are silent.

Dependencies

  • chromadb

Required Imports

import chromadb
from chromadb.config import Settings

Usage Example

import chromadb
from chromadb.config import Settings

# Define a simple config object
class Config:
    def __init__(self):
        self.chroma_host = 'localhost'
        self.chroma_port = 8000
        self.chroma_collection = 'my_documents'

config = Config()

# Prepare sample data
data = [
    {
        'id': 'doc1',
        'text': 'This is the first document',
        'embedding': [0.1, 0.2, 0.3],
        'metadata': {'source': 'file1.txt'},
        'cluster': 0
    },
    {
        'id': 'doc2',
        'text': 'This is the second document',
        'embedding': [0.4, 0.5, 0.6],
        'metadata': {'source': 'file2.txt'},
        'cluster': 1
    }
]

# Save to ChromaDB
save_data_to_chromadb(data, config)

# Or specify a custom collection name
save_data_to_chromadb(data, config, collection_name='custom_collection')

Best Practices

  • Ensure the ChromaDB server is running and reachable at config.chroma_host and config.chroma_port before calling this function
  • This function DELETES any existing collection with the same name before writing; use with caution in production environments
  • The batch size is hard-coded to 100 documents per request to avoid overwhelming the server; changing it requires editing the function
  • If embeddings are not provided in the data, ChromaDB will automatically generate them using its default embedding function; provide an embedding for every document or for none, since the function does not match embeddings to documents individually
  • Keep metadata values to simple scalars (strings, integers, floats, booleans); cluster values are automatically converted to strings
  • Document IDs must be unique within the collection
  • Wrap the call in error handling: connection failures and data validation errors from ChromaDB propagate to the caller
  • For large datasets, monitor memory usage, as the full lists of ids, texts, embeddings, and metadata are built in memory before batching
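
The metadata advice above can be applied per document before saving. This is a minimal sketch, assuming that stringifying non-scalar values and dropping None values is acceptable for your data; `build_metadata` is a hypothetical helper, and `json.dumps` would be a reasonable alternative to `str` for nested values.

```python
from typing import Any, Dict

# Value types kept as-is; anything else is converted to a string (an assumption of this sketch)
_SCALARS = (str, int, float, bool)


def build_metadata(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Return a fresh, flat metadata dict for one document without modifying the input."""
    source = doc.get("metadata")
    meta: Dict[str, Any] = {}
    if isinstance(source, dict):
        for key, value in source.items():
            if value is None:
                continue  # drop missing values rather than store them
            meta[str(key)] = value if isinstance(value, _SCALARS) else str(value)
    # Mirror the original function: the cluster assignment is stored as a string
    if "cluster" in doc:
        meta["cluster"] = str(doc["cluster"])
    return meta
```

Building a new dictionary per document also avoids writing the 'cluster' key back into the caller's own metadata objects.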

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function save_data_to_chromadb 98.3% similar

    Saves a list of document dictionaries to a ChromaDB vector database collection, optionally including embeddings and metadata.

    From: /tf/active/vicechatdev/chromadb-cleanup/main copy.py
  • function load_data_from_chromadb 81.5% similar

    Connects to a ChromaDB instance and retrieves all documents from a specified collection, returning them as a list of dictionaries with document IDs, text content, embeddings, and metadata.

    From: /tf/active/vicechatdev/chromadb-cleanup/main.py
  • function load_data_from_chromadb_v1 79.0% similar

    Retrieves all documents from a specified ChromaDB collection, including their IDs, text content, embeddings, and metadata.

    From: /tf/active/vicechatdev/chromadb-cleanup/main copy.py
  • function test_collection_creation 60.9% similar

    A diagnostic test function that verifies Chroma DB functionality by creating a test collection, adding a document, querying it, and cleaning up.

    From: /tf/active/vicechatdev/test_chroma_collections.py
  • function reset_collection 60.4% similar

    Deletes an existing ChromaDB collection and logs the operation, requiring an application restart to recreate the collection.

    From: /tf/active/vicechatdev/docchat/reset_collection.py