save_data_to_chromadb - Code Extractor

function save_data_to_chromadb

Maturity: 54

Saves a list of document dictionaries to a ChromaDB vector database collection, optionally including embeddings and metadata.

File:
/tf/active/vicechatdev/chromadb-cleanup/main copy.py

Lines:
109 - 167

Complexity:
moderate

Purpose

This function persists a batch of documents to ChromaDB. It connects to a ChromaDB server over HTTP, deletes any existing collection with the target name, creates a fresh collection, and adds each document's text, metadata, and (when available) embedding. Pre-computed embeddings are used only if every document supplies one; otherwise the documents are added without embeddings and ChromaDB generates them with the collection's embedding function. The function suits batch loading for semantic search, clustering, and document retrieval systems.
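The data-preparation step of this workflow can be sketched separately from the database calls. The helper below is a hypothetical illustration (the name `prepare_records` does not appear in the source file): it builds the parallel lists that `collection.add()` takes, copies each metadata dictionary so the caller's input is not modified, and returns `None` for the embeddings unless every document has one.

```python
from copy import deepcopy


def prepare_records(data):
    """Build parallel id/text/embedding/metadata lists without mutating `data`."""
    ids = [doc["id"] for doc in data]
    documents = [doc["text"] for doc in data]

    metadatas = []
    for doc in data:
        raw = doc.get("metadata")
        # Copy the metadata so the caller's dictionaries are left untouched.
        meta = deepcopy(raw) if isinstance(raw, dict) else {}
        if "cluster" in doc:
            # Store the cluster label as a string, as the documented function does.
            meta["cluster"] = str(doc["cluster"])
        metadatas.append(meta)

    # Use pre-computed embeddings only when every document carries one.
    has_all = bool(data) and all("embedding" in doc for doc in data)
    embeddings = [doc["embedding"] for doc in data] if has_all else None
    return ids, documents, embeddings, metadatas
```

Keeping this step free of ChromaDB calls makes it possible to test the record layout without a running server.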

Source Code

def save_data_to_chromadb(data, config, collection_name=None):
    """
    Save documents to ChromaDB.
    
    Args:
        data: List of document dictionaries with 'id', 'text', and 'embedding' keys
        config: Configuration object
        collection_name: Name of the collection to save to (defaults to config value)
    """
    if not collection_name:
        collection_name = config.chroma_collection
    
    # Connect to ChromaDB
    client = chromadb.HttpClient(
        host=config.chroma_host,
        port=config.chroma_port,
        settings=Settings(anonymized_telemetry=False)
    )
    
    # Delete collection if it exists already (to avoid conflicts)
    try:
        client.delete_collection(name=collection_name)
    except:
        pass  # Collection doesn't exist, that's fine
    
    # Create a new collection
    collection = client.create_collection(name=collection_name)
    
    # Prepare data for adding to collection
    ids = [doc['id'] for doc in data]
    documents = [doc['text'] for doc in data]
    embeddings = [doc['embedding'] for doc in data if 'embedding' in doc]
    metadatas = [doc.get('metadata', {}) for doc in data]
    
    # Add cluster information to metadata if available
    for i, doc in enumerate(data):
        if 'cluster' in doc:
            metadatas[i]['cluster'] = str(doc['cluster'])
        
        # Add original metadata if present
        if 'metadata' in doc and isinstance(doc['metadata'], dict):
            for k, v in doc['metadata'].items():
                metadatas[i][k] = v
    
    # Add documents to collection
    if embeddings and len(embeddings) == len(ids):
        collection.add(
            ids=ids,
            documents=documents,
            embeddings=embeddings,
            metadatas=metadatas
        )
    else:
        # If embeddings weren't provided, ChromaDB will generate them
        collection.add(
            ids=ids,
            documents=documents,
            metadatas=metadatas
        )

Parameters

Name	Type	Default	Kind
`data`	-	-	positional_or_keyword
`config`	-	-	positional_or_keyword
`collection_name`	-	None	positional_or_keyword

Parameter Details

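data: A list of dictionaries, one per document. Required keys: 'id' (unique identifier string) and 'text' (document content string). Optional keys: 'embedding' (list/array of floats representing the document vector), 'cluster' (cluster assignment, stored in the metadata as a string), and 'metadata' (dictionary of additional metadata fields). All documents must have unique IDs. Embeddings are passed to ChromaDB only when every document has one; if any document lacks an 'embedding' key, all supplied embeddings are ignored. Note that the function writes the 'cluster' value into each document's existing 'metadata' dictionary, so the input data is modified in place.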
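The requirements on `data` can be checked before any network call is made. The following is a minimal sketch, not part of the original module; `validate_documents` is a hypothetical helper that checks the required keys, ID uniqueness, and that all supplied embeddings share one dimensionality.

```python
def validate_documents(data):
    """Raise ValueError describing the first problem found in `data`."""
    seen_ids = set()
    dims = set()
    for index, doc in enumerate(data):
        # 'id' and 'text' are the two keys the function always reads.
        for key in ("id", "text"):
            if key not in doc:
                raise ValueError(f"document at index {index} is missing '{key}'")
        if doc["id"] in seen_ids:
            raise ValueError(f"duplicate id: {doc['id']!r}")
        seen_ids.add(doc["id"])
        if "embedding" in doc:
            dims.add(len(doc["embedding"]))
    # Mixed vector lengths cannot live in one collection.
    if len(dims) > 1:
        raise ValueError(f"embeddings have mixed dimensions: {sorted(dims)}")
```

Calling `validate_documents(data)` first means a malformed batch fails before the existing collection has been deleted.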

config: A configuration object that must have attributes: 'chroma_host' (ChromaDB server hostname/IP), 'chroma_port' (ChromaDB server port number), and 'chroma_collection' (default collection name string). This object provides connection details and default settings for ChromaDB operations.
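Any object exposing these three attributes will work. One possible shape is a small frozen dataclass, sketched below. The class name `ChromaConfig`, the `from_env` helper, and the environment variable names are assumptions made for illustration; the default values are the ones used in the usage example on this page.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ChromaConfig:
    """Connection settings read by save_data_to_chromadb."""
    chroma_host: str = "localhost"
    chroma_port: int = 8000
    chroma_collection: str = "my_documents"

    @classmethod
    def from_env(cls, env=None):
        # The variable names below are hypothetical; adapt them to your deployment.
        env = os.environ if env is None else env
        return cls(
            chroma_host=env.get("CHROMA_HOST", cls.chroma_host),
            chroma_port=int(env.get("CHROMA_PORT", cls.chroma_port)),
            chroma_collection=env.get("CHROMA_COLLECTION", cls.chroma_collection),
        )
```

A frozen dataclass keeps the settings immutable after construction, which avoids accidental changes between calls.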

collection_name: Optional string naming the ChromaDB collection to create and populate. If it is None, empty, or omitted (any falsy value), config.chroma_collection is used instead. If a collection with that name already exists, it is deleted before the new one is created.
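The delete-then-create step can be written so that only a "collection not found" error is ignored. This is a sketch under assumptions: `recreate_collection` is a hypothetical helper, and the exception type ChromaDB raises for a missing collection differs between versions, so it is passed in as a parameter (the `ValueError` default is an assumption to verify against the installed version).

```python
def recreate_collection(client, name, not_found_errors=(ValueError,)):
    """Drop `name` if it exists, then create it empty and return the new collection."""
    try:
        client.delete_collection(name=name)
    except not_found_errors:
        # Assumed to mean "collection does not exist"; any other error propagates.
        pass
    return client.create_collection(name=name)
```

With this shape, a connection failure during the delete is raised to the caller instead of being silently discarded.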

Return Value

This function does not return any value (implicitly returns None). The side effect is that documents are persisted to the specified ChromaDB collection. Success is indicated by no exceptions being raised.

Dependencies

chromadb

Required Imports

import chromadb
from chromadb.config import Settings

Usage Example

import chromadb
from chromadb.config import Settings

# Create a simple config object
class Config:
    def __init__(self):
        self.chroma_host = 'localhost'
        self.chroma_port = 8000
        self.chroma_collection = 'my_documents'

config = Config()

# Prepare document data
data = [
    {
        'id': 'doc1',
        'text': 'This is the first document about machine learning.',
        'embedding': [0.1, 0.2, 0.3, 0.4],
        'metadata': {'source': 'article', 'date': '2024-01-01'},
        'cluster': 0
    },
    {
        'id': 'doc2',
        'text': 'This is the second document about data science.',
        'embedding': [0.2, 0.3, 0.4, 0.5],
        'metadata': {'source': 'blog', 'date': '2024-01-02'},
        'cluster': 1
    }
]

# Save to ChromaDB
save_data_to_chromadb(data, config, collection_name='custom_collection')

# Or use default collection name from config
save_data_to_chromadb(data, config)

Best Practices

Ensure all document IDs are unique to avoid conflicts during insertion
Be aware that this function deletes any existing collection with the same name before creating a new one - this is destructive and will result in data loss
If any document lacks an embedding, all supplied embeddings are ignored and ChromaDB generates embeddings for every document, which may take additional time and resources
All embedding vectors should have the same dimensionality for consistency
Metadata values should be simple scalar types (strings, integers, floats, booleans); ChromaDB generally rejects nested lists and dicts, so flatten or serialize such values first
The cluster field is automatically converted to a string in metadata to ensure compatibility
Ensure ChromaDB server is running and accessible before calling this function
Consider implementing error handling around this function call to catch connection failures or data validation issues
For large datasets, consider batching the data into smaller chunks to avoid memory issues
The bare except around delete_collection swallows every exception, not only the "collection does not exist" case; a connection failure at that point is hidden and surfaces only when create_collection is called
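The batching advice above can be sketched as follows. The helper names and the default batch size of 500 are assumptions chosen for illustration; the only ChromaDB call used is `collection.add()`, with the same keyword arguments as in the source code.

```python
def iter_batches(total, batch_size):
    """Yield (start, stop) index pairs that cover range(total)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)


def add_in_batches(collection, ids, documents, metadatas,
                   embeddings=None, batch_size=500):
    """Call collection.add() once per slice instead of once for everything."""
    for start, stop in iter_batches(len(ids), batch_size):
        kwargs = {
            "ids": ids[start:stop],
            "documents": documents[start:stop],
            "metadatas": metadatas[start:stop],
        }
        # Pass embeddings only when they were supplied for the whole data set.
        if embeddings is not None:
            kwargs["embeddings"] = embeddings[start:stop]
        collection.add(**kwargs)
```

Smaller batches bound the size of each request; a suitable batch size depends on document length and server limits.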

Similar Components

AI-powered semantic similarity - components with related functionality:

function save_data_to_chromadb_v1 98.3% similar

Saves a list of document dictionaries to a ChromaDB collection, with support for batch processing, embeddings, and metadata storage.
From: /tf/active/vicechatdev/chromadb-cleanup/main.py

function load_data_from_chromadb 81.7% similar

Connects to a ChromaDB instance and retrieves all documents from a specified collection, returning them as a list of dictionaries with document IDs, text content, embeddings, and metadata.
From: /tf/active/vicechatdev/chromadb-cleanup/main.py

function load_data_from_chromadb_v1 78.9% similar

Retrieves all documents from a specified ChromaDB collection, including their IDs, text content, embeddings, and metadata.
From: /tf/active/vicechatdev/chromadb-cleanup/main copy.py

function reset_collection 62.2% similar

Deletes an existing ChromaDB collection and logs the operation, requiring an application restart to recreate the collection.
From: /tf/active/vicechatdev/docchat/reset_collection.py

function test_collection_creation 61.5% similar

A diagnostic test function that verifies ChromaDB functionality by creating a test collection, adding a document, querying it, and cleaning up.
From: /tf/active/vicechatdev/test_chroma_collections.py
