function save_data_to_chromadb

Maturity: 54

Saves a list of document dictionaries to a ChromaDB vector database collection, optionally including embeddings and metadata.

File: /tf/active/vicechatdev/chromadb-cleanup/main copy.py
Lines: 109 - 167
Complexity: moderate

Purpose

This function persists a batch of documents to ChromaDB in one pass. It connects to a ChromaDB server over HTTP, deletes any existing collection with the same name, creates a fresh collection, and adds each document's text and metadata. Pre-computed embeddings are used only when every document supplies one; otherwise they are omitted and ChromaDB generates embeddings for the whole batch. The function suits batch document storage for semantic search, clustering, and document retrieval systems.

Source Code

def save_data_to_chromadb(data, config, collection_name=None):
    """
    Save documents to ChromaDB.
    
    Args:
        data: List of document dictionaries with 'id', 'text', and 'embedding' keys
        config: Configuration object
        collection_name: Name of the collection to save to (defaults to config value)
    """
    if not collection_name:
        collection_name = config.chroma_collection
    
    # Connect to ChromaDB
    client = chromadb.HttpClient(
        host=config.chroma_host,
        port=config.chroma_port,
        settings=Settings(anonymized_telemetry=False)
    )
    
    # Delete collection if it exists already (to avoid conflicts)
    try:
        client.delete_collection(name=collection_name)
    except Exception:
        pass  # Most likely the collection doesn't exist, that's fine
    
    # Create a new collection
    collection = client.create_collection(name=collection_name)
    
    # Prepare data for adding to collection
    ids = [doc['id'] for doc in data]
    documents = [doc['text'] for doc in data]
    embeddings = [doc['embedding'] for doc in data if 'embedding' in doc]
    metadatas = [doc.get('metadata', {}) for doc in data]
    
    # Add cluster information to metadata if available
    for i, doc in enumerate(data):
        if 'cluster' in doc:
            metadatas[i]['cluster'] = str(doc['cluster'])
        
        # Add original metadata if present
        if 'metadata' in doc and isinstance(doc['metadata'], dict):
            for k, v in doc['metadata'].items():
                metadatas[i][k] = v
    
    # Add documents to collection
    if embeddings and len(embeddings) == len(ids):
        collection.add(
            ids=ids,
            documents=documents,
            embeddings=embeddings,
            metadatas=metadatas
        )
    else:
        # If embeddings weren't provided, ChromaDB will generate them
        collection.add(
            ids=ids,
            documents=documents,
            metadatas=metadatas
        )
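
The metadata-preparation step in the source above can be sketched as a standalone helper. This is a minimal illustration, not part of the documented file: the name build_metadata is hypothetical, and unlike the original it copies the metadata dict so the caller's data is left untouched.

```python
from typing import Any, Dict


def build_metadata(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Return a fresh metadata dict for one document (hypothetical helper)."""
    source = doc.get("metadata")
    # Copy the dict so the caller's document is never modified in place.
    metadata = dict(source) if isinstance(source, dict) else {}
    # Mirror the original behavior: the cluster label is stored as a string.
    if "cluster" in doc:
        metadata["cluster"] = str(doc["cluster"])
    return metadata


doc = {"id": "doc1", "text": "example", "metadata": {"source": "blog"}, "cluster": 3}
print(build_metadata(doc))
```

A list comprehension such as [build_metadata(doc) for doc in data] would then stand in for the metadatas list and the loop that follows it.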

Parameters

Name             Type  Default  Kind
data             -     -        positional_or_keyword
config           -     -        positional_or_keyword
collection_name  -     None     positional_or_keyword

Parameter Details

data: A list of dictionaries where each dictionary represents a document. Required keys: 'id' (unique identifier string), 'text' (document content string). Optional keys: 'embedding' (list/array of floats representing the document vector; used only when every document in the list supplies one), 'cluster' (cluster assignment, stored in metadata as a string), 'metadata' (dictionary of additional metadata fields). All documents must have unique IDs.
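
The requirements on data can be checked before any network call is made. The following is a rough sketch under stated assumptions: validate_documents is a hypothetical helper, not part of the documented module, and the specific checks (required keys, unique IDs, all-or-nothing embeddings, equal vector lengths) are inferred from the description above.

```python
from typing import Any, Dict, List


def validate_documents(data: List[Dict[str, Any]]) -> None:
    """Raise ValueError if the documents look unsafe to save (hypothetical helper)."""
    seen_ids = set()
    dimensions = set()
    for index, doc in enumerate(data):
        # 'id' and 'text' are the two required keys.
        for key in ("id", "text"):
            if key not in doc:
                raise ValueError(f"document {index} is missing required key {key!r}")
        if doc["id"] in seen_ids:
            raise ValueError(f"duplicate id {doc['id']!r}")
        seen_ids.add(doc["id"])
        if "embedding" in doc:
            dimensions.add(len(doc["embedding"]))
    # Embeddings are only used when every document has one.
    with_embedding = sum(1 for doc in data if "embedding" in doc)
    if 0 < with_embedding < len(data):
        raise ValueError("only some documents carry an 'embedding' key")
    # All vectors should share one dimensionality.
    if len(dimensions) > 1:
        raise ValueError(f"embeddings have mixed dimensions: {sorted(dimensions)}")
```

Calling validate_documents(data) just before save_data_to_chromadb(data, config) would surface these problems before the existing collection is deleted.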

config: A configuration object that must have attributes: 'chroma_host' (ChromaDB server hostname/IP), 'chroma_port' (ChromaDB server port number), and 'chroma_collection' (default collection name string). This object provides connection details and default settings for ChromaDB operations.

collection_name: Optional string specifying the name of the ChromaDB collection to create and populate. If None, empty, or not provided, config.chroma_collection is used instead. If a collection with that name already exists, it is deleted before the new one is created.
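
Where deleting the existing collection is not acceptable, a non-destructive variant is possible. The sketch below is an alternative technique, not what the documented function does: upsert_documents is a hypothetical helper, and it assumes the installed chromadb version provides get_or_create_collection on the client and upsert on the collection.

```python
def upsert_documents(client, collection_name, ids, documents,
                     metadatas=None, embeddings=None):
    """Write documents without dropping the collection first (hypothetical helper)."""
    # Reuse the collection if it exists, otherwise create it.
    collection = client.get_or_create_collection(name=collection_name)
    kwargs = {"ids": ids, "documents": documents}
    # Only pass optional fields that were actually supplied.
    if metadatas is not None:
        kwargs["metadatas"] = metadatas
    if embeddings is not None:
        kwargs["embeddings"] = embeddings
    # upsert updates records whose IDs already exist and inserts the rest.
    collection.upsert(**kwargs)
    return collection
```

With this approach, documents already in the collection under other IDs are kept, which differs from the replace-everything behavior described above.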

Return Value

This function does not return any value (implicitly returns None). The side effect is that documents are persisted to the specified ChromaDB collection. Success is indicated by no exceptions being raised.

Dependencies

  • chromadb

Required Imports

import chromadb
from chromadb.config import Settings

Usage Example

import chromadb
from chromadb.config import Settings

# Create a simple config object
class Config:
    def __init__(self):
        self.chroma_host = 'localhost'
        self.chroma_port = 8000
        self.chroma_collection = 'my_documents'

config = Config()

# Prepare document data
data = [
    {
        'id': 'doc1',
        'text': 'This is the first document about machine learning.',
        'embedding': [0.1, 0.2, 0.3, 0.4],
        'metadata': {'source': 'article', 'date': '2024-01-01'},
        'cluster': 0
    },
    {
        'id': 'doc2',
        'text': 'This is the second document about data science.',
        'embedding': [0.2, 0.3, 0.4, 0.5],
        'metadata': {'source': 'blog', 'date': '2024-01-02'},
        'cluster': 1
    }
]

# Save to ChromaDB
save_data_to_chromadb(data, config, collection_name='custom_collection')

# Or use default collection name from config
save_data_to_chromadb(data, config)
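
Because the function returns None, a read-back check is one way to confirm the save. This is a sketch only: find_missing_ids is a hypothetical helper, and it assumes the client exposes get_collection and that collection.get(ids=...) returns a mapping with an 'ids' entry.

```python
def find_missing_ids(client, collection_name, expected_ids):
    """Return the expected IDs that are not stored in the collection (hypothetical helper)."""
    collection = client.get_collection(name=collection_name)
    # Ask only for the IDs we expect; absent ones are simply not returned.
    result = collection.get(ids=list(expected_ids))
    stored = set(result["ids"])
    return [doc_id for doc_id in expected_ids if doc_id not in stored]
```

Given a connected client, find_missing_ids(client, 'custom_collection', [doc['id'] for doc in data]) returning an empty list would indicate that every document was stored.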

Best Practices

  • Ensure all document IDs are unique to avoid conflicts during insertion
  • Be aware that this function deletes any existing collection with the same name before creating a new one - this is destructive and will result in data loss
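  • Supplied embeddings are used only if every document has one; if any document lacks an 'embedding' key, all supplied embeddings are discarded and ChromaDB generates them for the whole batch, which takes additional time and resources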
  • All embedding vectors should have the same dimensionality for consistency
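  • Metadata values should be scalars (strings, integers, floats, booleans); serialize nested lists or dicts, for example to JSON strings, before saving
  • The cluster field is converted to a string and stored under the 'cluster' metadata key; when a document already has a 'metadata' dict, that dict is modified in place, so pass copies if the input must stay unchanged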
  • Ensure ChromaDB server is running and accessible before calling this function
  • Consider implementing error handling around this function call to catch connection failures or data validation issues
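  • For large datasets, split the insertion into smaller batches to avoid memory issues; calling this function once per batch does not work, because each call deletes and recreates the collection
  • The try/except around delete_collection suppresses any error, not only the case where the collection does not exist, so a genuine failure at that step is not reported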
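
The batching advice above can be sketched as follows. The helper names iter_batches and add_in_batches and the default batch size of 500 are illustrative assumptions; the collection argument is expected to be an already created ChromaDB collection, since the batches must all go into the same collection.

```python
from typing import Any, Dict, Iterator, List


def iter_batches(data: List[Dict[str, Any]], batch_size: int = 500) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive slices of data with at most batch_size items each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


def add_in_batches(collection, data: List[Dict[str, Any]], batch_size: int = 500) -> int:
    """Add documents to an existing collection batch by batch; return the count added."""
    total = 0
    for batch in iter_batches(data, batch_size):
        # Metadata and embeddings would be sliced per batch in the same way.
        collection.add(
            ids=[doc["id"] for doc in batch],
            documents=[doc["text"] for doc in batch],
        )
        total += len(batch)
    return total
```

The collection is created once outside the loop, so earlier batches are not lost when later ones are added.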

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function save_data_to_chromadb_v1 98.3% similar

    Saves a list of document dictionaries to a ChromaDB collection, with support for batch processing, embeddings, and metadata storage.

    From: /tf/active/vicechatdev/chromadb-cleanup/main.py
  • function load_data_from_chromadb 81.7% similar

    Connects to a ChromaDB instance and retrieves all documents from a specified collection, returning them as a list of dictionaries with document IDs, text content, embeddings, and metadata.

    From: /tf/active/vicechatdev/chromadb-cleanup/main.py
  • function load_data_from_chromadb_v1 78.9% similar

    Retrieves all documents from a specified ChromaDB collection, including their IDs, text content, embeddings, and metadata.

    From: /tf/active/vicechatdev/chromadb-cleanup/main copy.py
  • function reset_collection 62.2% similar

    Deletes an existing ChromaDB collection and logs the operation, requiring an application restart to recreate the collection.

    From: /tf/active/vicechatdev/docchat/reset_collection.py
  • function test_collection_creation 61.5% similar

    A diagnostic test function that verifies Chroma DB functionality by creating a test collection, adding a document, querying it, and cleaning up.

    From: /tf/active/vicechatdev/test_chroma_collections.py