function save_data_to_chromadb
Saves a list of document dictionaries to a ChromaDB vector database collection, optionally including embeddings and metadata.
/tf/active/vicechatdev/chromadb-cleanup/main copy.py
109 - 167
moderate
Purpose
This function provides a complete workflow for persisting document data to ChromaDB. It connects to a ChromaDB server, creates a new collection (deleting any existing collection with the same name), and adds documents with their text content, embeddings, and metadata. If every document supplies an embedding, the pre-computed vectors are stored; if any document lacks one, no embeddings are passed and ChromaDB generates them for all documents. The function is designed for batch document storage in vector database applications such as semantic search, clustering, and document retrieval systems.
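A minimal sketch of how the two embedding cases are told apart, mirroring the check in the function's source: pre-computed embeddings are used only when every document carries one. The helper name `select_embeddings` is hypothetical and not part of the original module.

```python
from typing import Any, Dict, List, Optional


def select_embeddings(data: List[Dict[str, Any]]) -> Optional[List[Any]]:
    """Return the embeddings if every document has one, otherwise None.

    Hypothetical helper: it isolates the all-or-nothing check that the
    documented function performs before calling collection.add().
    """
    embeddings = [doc["embedding"] for doc in data if "embedding" in doc]
    if embeddings and len(embeddings) == len(data):
        return embeddings
    # Empty input, or at least one document without an embedding:
    # the caller should omit the embeddings argument entirely.
    return None
```

A return value of `None` corresponds to the branch in which ChromaDB is left to generate the embeddings itself.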
Source Code
def save_data_to_chromadb(data, config, collection_name=None):
    """
    Save documents to ChromaDB.

    Args:
        data: List of document dictionaries with 'id', 'text', and 'embedding' keys
        config: Configuration object
        collection_name: Name of the collection to save to (defaults to config value)
    """
    if not collection_name:
        collection_name = config.chroma_collection

    # Connect to ChromaDB
    client = chromadb.HttpClient(
        host=config.chroma_host,
        port=config.chroma_port,
        settings=Settings(anonymized_telemetry=False)
    )

    # Delete collection if it exists already (to avoid conflicts)
    try:
        client.delete_collection(name=collection_name)
    except:
        pass  # Collection doesn't exist, that's fine

    # Create a new collection
    collection = client.create_collection(name=collection_name)

    # Prepare data for adding to collection
    ids = [doc['id'] for doc in data]
    documents = [doc['text'] for doc in data]
    embeddings = [doc['embedding'] for doc in data if 'embedding' in doc]
    metadatas = [doc.get('metadata', {}) for doc in data]

    # Add cluster information to metadata if available
    for i, doc in enumerate(data):
        if 'cluster' in doc:
            metadatas[i]['cluster'] = str(doc['cluster'])

        # Add original metadata if present
        if 'metadata' in doc and isinstance(doc['metadata'], dict):
            for k, v in doc['metadata'].items():
                metadatas[i][k] = v

    # Add documents to collection
    if embeddings and len(embeddings) == len(ids):
        collection.add(
            ids=ids,
            documents=documents,
            embeddings=embeddings,
            metadatas=metadatas
        )
    else:
        # If embeddings weren't provided, ChromaDB will generate them
        collection.add(
            ids=ids,
            documents=documents,
            metadatas=metadatas
        )
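One detail of the source is worth illustrating: `doc.get('metadata', {})` returns the document's own dictionary when one is present, so writing the `cluster` key into it also modifies the caller's input. The sketch below shows one way to build the metadata list from copies instead. The helper name `build_metadatas` is hypothetical; it is an alternative under that assumption, not the original implementation.

```python
from typing import Any, Dict, List


def build_metadatas(data: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Build one metadata dict per document without mutating the input.

    Hypothetical helper: copies each document's metadata (if it is a dict)
    and adds the cluster label as a string, as the documented function does.
    """
    metadatas = []
    for doc in data:
        original = doc.get("metadata")
        # Copy so that adding 'cluster' does not leak into the caller's dict.
        meta = dict(original) if isinstance(original, dict) else {}
        if "cluster" in doc:
            meta["cluster"] = str(doc["cluster"])
        metadatas.append(meta)
    return metadatas
```

As in the source, a `cluster` value on the document takes precedence over a `cluster` key already present in its metadata.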
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| data | - | - | positional_or_keyword |
| config | - | - | positional_or_keyword |
| collection_name | - | None | positional_or_keyword |
Parameter Details
data: A list of dictionaries where each dictionary represents a document. Required keys: 'id' (unique identifier string), 'text' (document content string). Optional keys: 'embedding' (list/array of floats representing the document vector; stored only if every document in the list provides one), 'cluster' (cluster assignment, stored in the metadata as a string), 'metadata' (dictionary of additional metadata fields). All documents must have unique IDs.
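The function itself does not check these requirements before sending data to ChromaDB. A minimal pre-flight check might look like the sketch below; the name `validate_documents` and the choice of `ValueError` are assumptions made for illustration.

```python
from typing import Any, Dict, List


def validate_documents(data: List[Dict[str, Any]]) -> None:
    """Raise ValueError if the documents break the documented requirements.

    Hypothetical helper: checks required keys, unique IDs, and that all
    supplied embeddings share the same dimensionality.
    """
    seen_ids = set()
    dimensions = set()
    for index, doc in enumerate(data):
        for key in ("id", "text"):
            if key not in doc:
                raise ValueError(f"document at index {index} is missing '{key}'")
        if doc["id"] in seen_ids:
            raise ValueError(f"duplicate document id: {doc['id']!r}")
        seen_ids.add(doc["id"])
        if "embedding" in doc:
            dimensions.add(len(doc["embedding"]))
    if len(dimensions) > 1:
        raise ValueError(f"inconsistent embedding dimensions: {sorted(dimensions)}")
```

Calling it immediately before `save_data_to_chromadb(data, config)` surfaces malformed input before the existing collection is deleted.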
config: A configuration object that must have attributes: 'chroma_host' (ChromaDB server hostname/IP), 'chroma_port' (ChromaDB server port number), and 'chroma_collection' (default collection name string). This object provides connection details and default settings for ChromaDB operations.
collection_name: Optional string specifying the name of the ChromaDB collection to create and populate. If None, empty, or omitted, config.chroma_collection is used. Any existing collection with that name is deleted before the new one is created.
Return Value
This function does not return any value (implicitly returns None). The side effect is that documents are persisted to the specified ChromaDB collection. Success is indicated by no exceptions being raised.
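Because success is signalled only by the absence of an exception, callers that need an explicit result can wrap the call. The sketch below takes the save function as an argument so that it stays independent of a running server; the name `try_save` and the logging setup are assumptions, not part of the original module.

```python
import logging

logger = logging.getLogger(__name__)


def try_save(save_fn, data, config, collection_name=None) -> bool:
    """Call save_fn and report success as a boolean instead of raising.

    Hypothetical wrapper: save_fn is expected to have the same signature
    as save_data_to_chromadb(data, config, collection_name=None).
    """
    try:
        save_fn(data, config, collection_name=collection_name)
    except Exception:
        # Log the full traceback so connection or validation errors stay visible.
        logger.exception("Saving %d documents failed", len(data))
        return False
    return True
```

A call such as `try_save(save_data_to_chromadb, data, config)` then returns `True` or `False` rather than `None`.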
Dependencies
chromadb
Required Imports
import chromadb
from chromadb.config import Settings
Usage Example
import chromadb
from chromadb.config import Settings

# Create a simple config object
class Config:
    def __init__(self):
        self.chroma_host = 'localhost'
        self.chroma_port = 8000
        self.chroma_collection = 'my_documents'

config = Config()

# Prepare document data
data = [
    {
        'id': 'doc1',
        'text': 'This is the first document about machine learning.',
        'embedding': [0.1, 0.2, 0.3, 0.4],
        'metadata': {'source': 'article', 'date': '2024-01-01'},
        'cluster': 0
    },
    {
        'id': 'doc2',
        'text': 'This is the second document about data science.',
        'embedding': [0.2, 0.3, 0.4, 0.5],
        'metadata': {'source': 'blog', 'date': '2024-01-02'},
        'cluster': 1
    }
]

# Save to ChromaDB
save_data_to_chromadb(data, config, collection_name='custom_collection')

# Or use default collection name from config
save_data_to_chromadb(data, config)
Best Practices
- Ensure all document IDs are unique to avoid conflicts during insertion
- This function deletes any existing collection with the same name before creating a new one; the operation is destructive and discards all previously stored data
- If any document lacks an embedding, the embeddings supplied by the other documents are ignored and ChromaDB generates embeddings for every document, which may take additional time and resources
- All embedding vectors should have the same dimensionality for consistency
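- Metadata values should be simple scalars (strings, integers, floats, booleans); depending on the ChromaDB version, nested lists and dicts may be rejected, so serialize them (for example to a JSON string) before saving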
- The cluster field is automatically converted to a string in metadata to ensure compatibility
- Ensure ChromaDB server is running and accessible before calling this function
- Consider implementing error handling around this function call to catch connection failures or data validation issues
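- For large datasets, consider splitting the `collection.add` call into smaller batches to avoid memory issues; calling this function once per chunk does not work, because each call deletes and recreates the collection
- Ignoring a missing collection during deletion is intentional, but the bare `except:` also swallows any other error raised at that step (such as a connection failure), which then surfaces only when the collection is created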
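The batching advice above can be sketched as a helper that slices the prepared lists and calls `collection.add` once per slice. The name `add_in_batches` and the default `batch_size` of 1000 are assumptions chosen for illustration; the `collection` argument is any object with an `add` method that accepts the same keyword arguments used in the source.

```python
from typing import Any, Dict, List, Optional


def add_in_batches(
    collection: Any,
    ids: List[str],
    documents: List[str],
    metadatas: List[Dict[str, Any]],
    embeddings: Optional[List[Any]] = None,
    batch_size: int = 1000,  # hypothetical default, tune for your data
) -> int:
    """Add records to the collection in slices; return the number of calls."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    calls = 0
    for start in range(0, len(ids), batch_size):
        end = start + batch_size
        kwargs = {
            "ids": ids[start:end],
            "documents": documents[start:end],
            "metadatas": metadatas[start:end],
        }
        # Pass embeddings only when they were supplied for every document.
        if embeddings is not None:
            kwargs["embeddings"] = embeddings[start:end]
        collection.add(**kwargs)
        calls += 1
    return calls
```

Such a helper would replace the two `collection.add(...)` calls inside the function, after the collection has been created once.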
Similar Components
AI-powered semantic similarity - components with related functionality:
- function save_data_to_chromadb_v1 98.3% similar
- function load_data_from_chromadb 81.7% similar
- function load_data_from_chromadb_v1 78.9% similar
- function reset_collection 62.2% similar
- function test_collection_creation 61.5% similar