save_data_to_chromadb_v1 - Code Extractor

function save_data_to_chromadb_v1

Maturity: 50

Saves a list of document dictionaries to a ChromaDB collection, with support for batch processing, embeddings, and metadata storage.

File:
/tf/active/vicechatdev/chromadb-cleanup/main.py

Lines:
168 - 239

Complexity:
moderate

Purpose

This function provides a complete workflow for persisting document data to a ChromaDB vector database. It connects to a ChromaDB server over HTTP, deletes any existing collection with the target name, creates a fresh collection, and inserts the documents with their metadata in batches of 100. It is designed for document clustering pipelines: a document's cluster assignment, if present, is stored as a string in its metadata. Pre-computed embeddings are sent when supplied; otherwise ChromaDB generates the embeddings itself.

Source Code

def save_data_to_chromadb(data, config, collection_name=None):
    """
    Save documents to ChromaDB.
    
    Args:
        data: List of document dictionaries with 'id' and 'text' keys and
            optional 'embedding', 'metadata', and 'cluster' keys
        config: Configuration object
        collection_name: Name of the collection to save to (defaults to config value)
    """
    if not collection_name:
        collection_name = config.chroma_collection
    
    if not data:
        print(f"No data to save to collection '{collection_name}'")
        return
    
    # Connect to ChromaDB
    client = chromadb.HttpClient(
        host=config.chroma_host,
        port=config.chroma_port,
        settings=Settings(anonymized_telemetry=False)
    )
    
    # Delete collection if it exists already (to avoid conflicts)
    try:
        client.delete_collection(name=collection_name)
    except Exception:
        pass  # Most likely the collection doesn't exist, that's fine
    
    # Create a new collection
    collection = client.create_collection(name=collection_name)
    
    # Prepare data for adding to collection
    ids = [doc['id'] for doc in data]
    documents = [doc['text'] for doc in data]
    # Use embeddings only if every document has one, so they stay aligned with ids
    has_embeddings = all('embedding' in doc for doc in data)
    embeddings = [doc['embedding'] for doc in data] if has_embeddings else []
    # Copy each metadata dict so the caller's documents are not mutated
    metadatas = [
        dict(doc['metadata']) if isinstance(doc.get('metadata'), dict) else {}
        for doc in data
    ]

    # Add cluster information to metadata if available
    for i, doc in enumerate(data):
        if 'cluster' in doc:
            metadatas[i]['cluster'] = str(doc['cluster'])
    
    # Add documents to collection in batches to avoid overwhelming the server
    batch_size = 100
    
    for i in range(0, len(ids), batch_size):
        end_idx = min(i + batch_size, len(ids))
        batch_ids = ids[i:end_idx]
        batch_documents = documents[i:end_idx]
        batch_metadatas = metadatas[i:end_idx]
        
        if embeddings and len(embeddings) >= end_idx:
            batch_embeddings = embeddings[i:end_idx]
            collection.add(
                ids=batch_ids,
                documents=batch_documents,
                embeddings=batch_embeddings,
                metadatas=batch_metadatas
            )
        else:
            # If embeddings weren't provided, ChromaDB will generate them
            collection.add(
                ids=batch_ids,
                documents=batch_documents,
                metadatas=batch_metadatas
            )
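
The batching step in the listing above can be factored into a small generator. What follows is a minimal sketch, not the code in main.py: `iter_batches` and its `batch_size` parameter are hypothetical names, and the sketch assumes embeddings should be sent only when every document carries one, so that IDs and embeddings cannot drift out of alignment.

```python
from typing import Any, Dict, Iterator, List


def iter_batches(data: List[Dict[str, Any]], batch_size: int = 100) -> Iterator[Dict[str, list]]:
    """Yield keyword-argument dicts intended for collection.add(**batch).

    Hypothetical helper: embeddings are included only if every document has one.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    has_all_embeddings = bool(data) and all("embedding" in doc for doc in data)
    for start in range(0, len(data), batch_size):
        chunk = data[start:start + batch_size]
        batch = {
            "ids": [doc["id"] for doc in chunk],
            "documents": [doc["text"] for doc in chunk],
            # Copy each metadata dict so the caller's data is not mutated.
            "metadatas": [dict(doc.get("metadata") or {}) for doc in chunk],
        }
        if has_all_embeddings:
            batch["embeddings"] = [doc["embedding"] for doc in chunk]
        yield batch
```

A caller would loop over `iter_batches(data, batch_size=50)` and pass each yielded dictionary to `collection.add(**batch)`. Making the batch size a parameter avoids editing the function body to tune it.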

Parameters

Name	Type	Default	Kind
`data`	-	-	positional_or_keyword
`config`	-	-	positional_or_keyword
`collection_name`	-	None	positional_or_keyword

Parameter Details

data: A list of dictionaries where each dictionary represents a document. Required keys: 'id' (unique identifier), 'text' (document content). Optional keys: 'embedding' (pre-computed vector embedding), 'metadata' (dictionary of additional metadata), 'cluster' (cluster assignment, stored in the metadata as a string). Supply embeddings either for every document or for none; a partially populated list cannot be matched reliably to document IDs. If data is empty or None, the function prints a notice and returns without contacting ChromaDB.
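
The requirements on `data` can be checked before any network call. The following is a sketch under stated assumptions: `validate_documents` is a hypothetical helper that is not part of main.py, and the rule that embeddings must be present for all documents or none is an assumption made here to keep IDs and embeddings aligned.

```python
from typing import Any, Dict, List


def validate_documents(data: List[Dict[str, Any]]) -> None:
    """Raise ValueError for missing keys, duplicate IDs, or partial embeddings."""
    seen_ids = set()
    with_embedding = 0
    for index, doc in enumerate(data):
        for key in ("id", "text"):
            if key not in doc:
                raise ValueError(f"document at index {index} is missing required key '{key}'")
        if doc["id"] in seen_ids:
            raise ValueError(f"duplicate document id: {doc['id']!r}")
        seen_ids.add(doc["id"])
        if "embedding" in doc:
            with_embedding += 1
    # Either every document has an embedding or none does.
    if 0 < with_embedding < len(data):
        raise ValueError(
            f"only {with_embedding} of {len(data)} documents have an 'embedding'; "
            "provide embeddings for all documents or for none"
        )
```

Calling `validate_documents(data)` immediately before `save_data_to_chromadb(data, config)` surfaces malformed input before the existing collection is deleted.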

config: A configuration object that must have the following attributes: 'chroma_host' (ChromaDB server hostname), 'chroma_port' (ChromaDB server port number), 'chroma_collection' (default collection name to use). This is typically an instance of the src.config.Config class.

collection_name: Optional string specifying the name of the ChromaDB collection to save documents to. If None, empty, or omitted, it defaults to config.chroma_collection. The function deletes any existing collection with this name before creating a new one, so previously stored documents in that collection are lost.
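
Where deleting an existing collection is too destructive, the reset can be made opt-in. This is a minimal sketch rather than the behaviour of the documented function: `open_collection` and its `reset` flag are hypothetical, and `client` is assumed to be a ChromaDB client object exposing `delete_collection` and `get_or_create_collection`.

```python
def open_collection(client, name: str, reset: bool = False):
    """Return a collection, deleting any existing one first only when reset=True.

    Hypothetical helper; `client` is assumed to be a ChromaDB client
    (for example the object returned by chromadb.HttpClient).
    """
    if reset:
        try:
            client.delete_collection(name=name)
        except Exception:
            # Assumption: a failed delete means the collection did not exist.
            # The exception type raised for a missing collection varies by version.
            pass
    return client.get_or_create_collection(name=name)
```

With `reset=False` the existing documents are kept, so document IDs that are already stored would need to be handled by the caller before adding new data.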

Return Value

This function returns None. Its only effect is the side effect of writing data to ChromaDB. The single message it prints to stdout is the notice that there is no data to save, emitted when data is empty or None; a successful save prints nothing.

Dependencies

chromadb

Required Imports

import chromadb
from chromadb.config import Settings

Usage Example

import chromadb
from chromadb.config import Settings

# Define a simple config object
class Config:
    def __init__(self):
        self.chroma_host = 'localhost'
        self.chroma_port = 8000
        self.chroma_collection = 'my_documents'

config = Config()

# Prepare sample data
data = [
    {
        'id': 'doc1',
        'text': 'This is the first document',
        'embedding': [0.1, 0.2, 0.3],
        'metadata': {'source': 'file1.txt'},
        'cluster': 0
    },
    {
        'id': 'doc2',
        'text': 'This is the second document',
        'embedding': [0.4, 0.5, 0.6],
        'metadata': {'source': 'file2.txt'},
        'cluster': 1
    }
]

# Save to ChromaDB
save_data_to_chromadb(data, config)

# Or specify a custom collection name
save_data_to_chromadb(data, config, collection_name='custom_collection')
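
Because the function does not catch connection or validation errors itself, a caller may want to wrap the call. The following is a sketch only: `save_with_retry`, `attempts`, and `delay_seconds` are hypothetical names, and `save_fn` stands in for a zero-argument callable such as `lambda: save_data_to_chromadb(data, config)`.

```python
import time
from typing import Callable


def save_with_retry(save_fn: Callable[[], None], attempts: int = 3, delay_seconds: float = 1.0) -> bool:
    """Call save_fn until it succeeds or attempts run out; return True on success."""
    for attempt in range(1, attempts + 1):
        try:
            save_fn()
            return True
        except Exception as exc:  # broad on purpose: exception types vary by ChromaDB version
            print(f"Save attempt {attempt} of {attempts} failed: {exc}")
            if attempt < attempts:
                time.sleep(delay_seconds)
    return False
```

Retrying is safe here in the sense that each call to `save_data_to_chromadb` deletes and recreates the collection, so a retry starts from a clean state rather than adding duplicate IDs.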

Best Practices

Ensure ChromaDB server is running before calling this function
Be aware that this function DELETES existing collections with the same name - use with caution in production environments
The function uses a batch size of 100 documents to avoid overwhelming the server - adjust if needed for your use case
If embeddings are not provided in the data, ChromaDB will automatically generate them using its default embedding function
Metadata values should be simple scalars (strings, integers, floats, or booleans), since nested lists or dictionaries may be rejected by ChromaDB; cluster values are automatically converted to strings
Document IDs must be unique within the collection
Consider implementing error handling around this function call to catch connection failures or data validation issues
For large datasets, monitor memory usage as all data is processed in memory before batching
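
The metadata guideline above can be enforced with a small pre-processing step. This is a minimal sketch under stated assumptions: `sanitize_metadata` is a hypothetical helper, and the choices to drop None values and to store non-scalar values as JSON strings are assumptions of this example, not behaviour of the documented function.

```python
import json
from typing import Any, Dict

_SCALAR_TYPES = (str, int, float, bool)


def sanitize_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of metadata whose values are all str, int, float, or bool.

    None values are dropped; any other value (list, dict, ...) is stored as a JSON string.
    """
    clean = {}
    for key, value in metadata.items():
        if value is None:
            continue
        if isinstance(value, _SCALAR_TYPES):
            clean[str(key)] = value
        else:
            # default=str lets json.dumps cope with values such as dates.
            clean[str(key)] = json.dumps(value, default=str)
    return clean
```

Applying it to each document's 'metadata' dictionary before calling `save_data_to_chromadb` keeps the original values recoverable with `json.loads` while avoiding nested structures.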

Similar Components

AI-powered semantic similarity - components with related functionality:

function save_data_to_chromadb 98.3% similar

Saves a list of document dictionaries to a ChromaDB vector database collection, optionally including embeddings and metadata.
From: /tf/active/vicechatdev/chromadb-cleanup/main copy.py

function load_data_from_chromadb 81.5% similar

Connects to a ChromaDB instance and retrieves all documents from a specified collection, returning them as a list of dictionaries with document IDs, text content, embeddings, and metadata.
From: /tf/active/vicechatdev/chromadb-cleanup/main.py

function load_data_from_chromadb_v1 79.0% similar

Retrieves all documents from a specified ChromaDB collection, including their IDs, text content, embeddings, and metadata.
From: /tf/active/vicechatdev/chromadb-cleanup/main copy.py

function test_collection_creation 60.9% similar

A diagnostic test function that verifies ChromaDB functionality by creating a test collection, adding a document, querying it, and cleaning up.
From: /tf/active/vicechatdev/test_chroma_collections.py

function reset_collection 60.4% similar

Deletes an existing ChromaDB collection and logs the operation, requiring an application restart to recreate the collection.
From: /tf/active/vicechatdev/docchat/reset_collection.py
