function save_data_to_chromadb_v1
Saves a list of document dictionaries to a ChromaDB collection, with support for batch processing, embeddings, and metadata storage.
/tf/active/vicechatdev/chromadb-cleanup/main.py
168 - 239
moderate
Purpose
This function provides a complete workflow for persisting document data to a ChromaDB vector database. It connects to the server, deletes any existing collection with the target name (to avoid conflicts), creates a fresh collection, and inserts the documents with their embeddings and metadata in batches. It is designed for document clustering pipelines and supports both pre-computed embeddings and automatic embedding generation by ChromaDB.
Source Code
def save_data_to_chromadb(data, config, collection_name=None):
    """
    Save documents to ChromaDB.

    Args:
        data: List of document dictionaries with 'id', 'text', and 'embedding' keys
        config: Configuration object
        collection_name: Name of the collection to save to (defaults to config value)
    """
    if not collection_name:
        collection_name = config.chroma_collection

    if not data:
        print(f"No data to save to collection '{collection_name}'")
        return

    # Connect to ChromaDB
    client = chromadb.HttpClient(
        host=config.chroma_host,
        port=config.chroma_port,
        settings=Settings(anonymized_telemetry=False)
    )

    # Delete collection if it exists already (to avoid conflicts)
    try:
        client.delete_collection(name=collection_name)
    except:
        pass  # Collection doesn't exist, that's fine

    # Create a new collection
    collection = client.create_collection(name=collection_name)

    # Prepare data for adding to collection
    ids = [doc['id'] for doc in data]
    documents = [doc['text'] for doc in data]
    embeddings = [doc['embedding'] for doc in data if 'embedding' in doc]
    metadatas = [doc.get('metadata', {}) for doc in data]

    # Add cluster information to metadata if available
    for i, doc in enumerate(data):
        if 'cluster' in doc:
            metadatas[i]['cluster'] = str(doc['cluster'])

        # Add original metadata if present
        if 'metadata' in doc and isinstance(doc['metadata'], dict):
            for k, v in doc['metadata'].items():
                metadatas[i][k] = v

    # Add documents to collection in batches to avoid overwhelming the server
    batch_size = 100
    for i in range(0, len(ids), batch_size):
        end_idx = min(i + batch_size, len(ids))

        batch_ids = ids[i:end_idx]
        batch_documents = documents[i:end_idx]
        batch_metadatas = metadatas[i:end_idx]

        if embeddings and len(embeddings) >= end_idx:
            batch_embeddings = embeddings[i:end_idx]
            collection.add(
                ids=batch_ids,
                documents=batch_documents,
                embeddings=batch_embeddings,
                metadatas=batch_metadatas
            )
        else:
            # If embeddings weren't provided, ChromaDB will generate them
            collection.add(
                ids=batch_ids,
                documents=batch_documents,
                metadatas=batch_metadatas
            )
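The batching loop above builds the embeddings list by skipping documents that have no 'embedding' key, so with mixed input the list can become shorter than the ID list and fall out of alignment with it. The following is a minimal sketch of one way to prepare batches that keeps everything index-aligned; `build_add_batches` is a hypothetical helper, not part of the original module, and it assumes embeddings should be sent only when every document has one. It also copies each metadata dict instead of writing into the caller's data.

```python
from typing import Any, Dict, Iterator, List


def build_add_batches(data: List[Dict[str, Any]],
                      batch_size: int = 100) -> Iterator[Dict[str, list]]:
    """Yield keyword-argument dicts suitable for collection.add(**batch)."""
    # Only send embeddings when every document carries one, so that
    # ids, documents, and embeddings always stay index-aligned.
    all_have_embeddings = all('embedding' in doc for doc in data)

    for start in range(0, len(data), batch_size):
        chunk = data[start:start + batch_size]

        metadatas = []
        for doc in chunk:
            meta = dict(doc.get('metadata') or {})  # copy; caller's dict is untouched
            if 'cluster' in doc:
                meta['cluster'] = str(doc['cluster'])
            metadatas.append(meta)

        batch = {
            'ids': [doc['id'] for doc in chunk],
            'documents': [doc['text'] for doc in chunk],
            'metadatas': metadatas,
        }
        if all_have_embeddings:
            batch['embeddings'] = [doc['embedding'] for doc in chunk]
        yield batch
```

Each yielded dict could then be passed as `collection.add(**batch)`; when the 'embeddings' key is absent, ChromaDB generates the vectors itself, as in the original else branch.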
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| data | - | - | positional_or_keyword |
| config | - | - | positional_or_keyword |
| collection_name | - | None | positional_or_keyword |
Parameter Details
data: A list of dictionaries where each dictionary represents a document. Required keys: 'id' (unique identifier), 'text' (document content). Optional keys: 'embedding' (pre-computed vector embedding), 'metadata' (dictionary of additional metadata), 'cluster' (cluster assignment number). The docstring lists 'embedding' alongside the required keys, but the code treats it as optional. If data is empty or None, the function prints a message and returns early without connecting to ChromaDB.
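Because the function does not check its input, a malformed list only surfaces as a KeyError or a server-side error partway through. A small pre-flight check along the following lines may help; `validate_documents` is a hypothetical helper written for this illustration, and the rules it enforces (required keys, unique IDs, dict metadata, embeddings on all documents or none) are assumptions drawn from the key descriptions above.

```python
from typing import Any, Dict, List


def validate_documents(data: List[Dict[str, Any]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the data looks fine."""
    problems = []
    seen_ids = set()
    with_embedding = 0

    for index, doc in enumerate(data):
        # 'id' and 'text' are read unconditionally by save_data_to_chromadb.
        for key in ('id', 'text'):
            if key not in doc:
                problems.append(f"document {index}: missing required key '{key}'")

        doc_id = doc.get('id')
        if doc_id is not None:
            if doc_id in seen_ids:
                problems.append(f"document {index}: duplicate id {doc_id!r}")
            seen_ids.add(doc_id)

        if 'metadata' in doc and not isinstance(doc['metadata'], dict):
            problems.append(f"document {index}: 'metadata' must be a dict")

        if 'embedding' in doc:
            with_embedding += 1

    # Mixed input is the case where vectors and IDs can drift apart.
    if 0 < with_embedding < len(data):
        problems.append(
            f"only {with_embedding} of {len(data)} documents have an 'embedding'"
        )
    return problems
```

A caller could run this first and skip the save when the returned list is non-empty.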
config: A configuration object that must have the following attributes: 'chroma_host' (ChromaDB server hostname), 'chroma_port' (ChromaDB server port number), 'chroma_collection' (default collection name to use). This is typically an instance of src.config.Config class.
collection_name: Optional string specifying the name of the ChromaDB collection to save documents to. If omitted, None, or an empty string, it defaults to config.chroma_collection. The function deletes any existing collection with this name before creating a new one.
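Where wiping the collection on every call is not wanted, the delete-then-create step could be made opt-in. The sketch below assumes `client` is a chromadb client such as the `chromadb.HttpClient` used by the function; `open_collection` and its `reset` flag are hypothetical names introduced here, while `delete_collection` and `get_or_create_collection` are existing client methods.

```python
def open_collection(client, name, reset=False):
    """Return a collection, wiping it first only when reset=True.

    `client` is assumed to be a chromadb client (e.g. chromadb.HttpClient).
    """
    if reset:
        try:
            client.delete_collection(name=name)
        except Exception:
            # Most likely the collection did not exist; the exact exception
            # type varies between chromadb versions. Note that this also
            # hides other failures, such as connection errors.
            pass
    # get_or_create_collection reuses an existing collection instead of failing.
    return client.get_or_create_collection(name=name)
```

When an existing collection is reused this way, `collection.upsert(...)` takes the same arguments as `collection.add(...)` and overwrites entries whose IDs are already present, which may suit repeated runs better.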
Return Value
This function returns None. It works through side effects: it writes the documents to ChromaDB. If no data is provided, it prints a message to stdout saying so and returns without saving anything. It prints nothing on success.
Dependencies
chromadb
Required Imports
import chromadb
from chromadb.config import Settings
Usage Example
import chromadb
from chromadb.config import Settings

# Define a simple config object
class Config:
    def __init__(self):
        self.chroma_host = 'localhost'
        self.chroma_port = 8000
        self.chroma_collection = 'my_documents'

config = Config()

# Prepare sample data
data = [
    {
        'id': 'doc1',
        'text': 'This is the first document',
        'embedding': [0.1, 0.2, 0.3],
        'metadata': {'source': 'file1.txt'},
        'cluster': 0
    },
    {
        'id': 'doc2',
        'text': 'This is the second document',
        'embedding': [0.4, 0.5, 0.6],
        'metadata': {'source': 'file2.txt'},
        'cluster': 1
    }
]

# Save to ChromaDB
save_data_to_chromadb(data, config)

# Or specify a custom collection name
save_data_to_chromadb(data, config, collection_name='custom_collection')
Best Practices
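- Ensure the ChromaDB server is running and reachable at config.chroma_host and config.chroma_port before calling this function
- Be aware that this function DELETES existing collections with the same name - use with caution in production environments
- The batch size is hard-coded to 100 documents to avoid overwhelming the server; changing it requires editing the function
- If every document includes an 'embedding', the function sends those vectors; if none do, ChromaDB generates embeddings with its default embedding function. Avoid mixing the two: documents without an embedding are skipped when the embeddings list is built, so vectors can be paired with the wrong IDs
- Keep metadata values to simple JSON-serializable scalars (strings, numbers, booleans); cluster values are automatically converted to strings. When a document has a 'metadata' dict, the 'cluster' entry is written into that same dict, so the caller's data is modified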
- Document IDs must be unique within the collection
- Consider implementing error handling around this function call to catch connection failures or data validation issues
- For large datasets, monitor memory usage as all data is processed in memory before batching
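The metadata advice above can be applied before calling the function. This is a minimal sketch that assumes the server accepts only str, int, float, and bool metadata values (an assumption that may not hold for every ChromaDB version); `sanitize_metadata` is a hypothetical helper, not part of the original module.

```python
import json
from typing import Any, Dict

SCALAR_TYPES = (str, int, float, bool)


def sanitize_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy whose values are all str, int, float, or bool."""
    clean = {}
    for key, value in metadata.items():
        if value is None:
            continue  # drop missing values rather than storing a placeholder
        if isinstance(value, SCALAR_TYPES):
            clean[str(key)] = value
        else:
            # Lists, dicts, and other objects are stored as a JSON string;
            # default=str covers values json cannot encode natively.
            clean[str(key)] = json.dumps(value, default=str)
    return clean
```

Applying it to each document's 'metadata' before the save keeps nested values from being rejected, at the cost of having to decode the JSON strings again when reading the data back.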
Similar Components
AI-powered semantic similarity - components with related functionality:
- function save_data_to_chromadb 98.3% similar
- function load_data_from_chromadb 81.5% similar
- function load_data_from_chromadb_v1 79.0% similar
- function test_collection_creation 60.9% similar
- function reset_collection 60.4% similar