function load_data_from_chromadb
Connects to a ChromaDB instance and retrieves all documents from a specified collection, returning them as a list of dictionaries with document IDs, text content, embeddings, and metadata.
File: /tf/active/vicechatdev/chromadb-cleanup/main.py
Lines: 123 - 165
Complexity: moderate
Purpose
This function serves as a data loader for ChromaDB vector databases. It establishes an HTTP connection to a ChromaDB server, retrieves a specified collection, and fetches all documents with their embeddings and metadata. The function is designed to integrate with document cleaning pipelines, converting ChromaDB's native format into a standardized dictionary format suitable for downstream processing tasks like deduplication, similarity analysis, or clustering.
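To illustrate how the standardized dictionary format feeds a downstream task, here is a minimal sketch of exact-text deduplication over the returned list. The helper name `drop_exact_duplicates` is hypothetical and is not part of the original module; the project's actual cleaners may work quite differently (for example, on embeddings rather than raw text).

```python
import hashlib
from typing import Any, Dict, List


def drop_exact_duplicates(documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first document for each distinct text; drop later exact copies."""
    seen = set()
    unique = []
    for doc in documents:
        # Hash the text so the set stays small even for long documents.
        text = doc.get("text") or ""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        unique.append(doc)
    return unique
```

Because the helper only relies on the 'text' key, it works on the list returned by `load_data_from_chromadb` without further conversion.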
Source Code
```python
def load_data_from_chromadb(config):
    """
    Load documents from ChromaDB.

    Args:
        config: Configuration object

    Returns:
        List of document dictionaries with 'id', 'text', and 'embedding' keys
    """
    # Connect to ChromaDB
    client = chromadb.HttpClient(
        host=config.chroma_host,
        port=config.chroma_port,
        settings=Settings(anonymized_telemetry=False)
    )

    # Get the collection
    try:
        collection = client.get_collection(name=config.chroma_collection)
    except ValueError:
        print(f"Collection '{config.chroma_collection}' not found")
        return []

    # Get all documents from the collection
    try:
        result = collection.get(include=['embeddings', 'documents', 'metadatas'])

        # Convert to the format required by our cleaners
        documents = []
        for i in range(len(result['ids'])):
            doc = {
                'id': result['ids'][i],
                'text': result['documents'][i],
                'embedding': result['embeddings'][i] if 'embeddings' in result else None,
                'metadata': result['metadatas'][i] if 'metadatas' in result else {}
            }
            documents.append(doc)

        return documents
    except Exception as e:
        print(f"Error loading collection '{config.chroma_collection}': {e}")
        return []
```
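The conversion loop above tests `'embeddings' in result`, which only checks that the key exists. As a hedged sketch, assuming the result behaves like a plain dict whose optional fields may be present but set to `None` (and whose individual metadata entries may also be `None`), a more defensive conversion could look like this. The names `result_to_documents` and `_field_at` are hypothetical and do not appear in the original module.

```python
from typing import Any, Dict, List, Optional


def _field_at(result: Dict[str, Any], key: str, index: int) -> Optional[Any]:
    """Return result[key][index], or None when the field is missing or None."""
    values = result.get(key)
    if values is None:
        return None
    return values[index]


def result_to_documents(result: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Convert a get()-style result dict into a list of document dicts."""
    documents = []
    for i, doc_id in enumerate(result["ids"]):
        metadata = _field_at(result, "metadatas", i)
        documents.append({
            "id": doc_id,
            "text": _field_at(result, "documents", i),
            "embedding": _field_at(result, "embeddings", i),
            # Assumption: a per-document metadata entry may be None; use {} instead.
            "metadata": metadata if metadata is not None else {},
        })
    return documents
```

The output has the same four keys as the original loop, so downstream code would not need to change.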
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| config | - | - | positional_or_keyword |
Parameter Details
config: A configuration object that must expose three attributes: 'chroma_host' (string, hostname or IP address of the ChromaDB server), 'chroma_port' (integer, port number of the ChromaDB HTTP API), and 'chroma_collection' (string, name of the collection to retrieve). This is typically an instance of the src.config.Config class.
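As a minimal sketch of a compatible configuration object, the dataclass below carries the three required attributes, together with a small check that reports any that are missing. Both `ChromaConfig` and `missing_config_attrs` are hypothetical names used for illustration; the real `src.config.Config` class may define more fields and different defaults.

```python
from dataclasses import dataclass
from typing import Any, List

# Attribute names read by load_data_from_chromadb.
REQUIRED_ATTRS = ("chroma_host", "chroma_port", "chroma_collection")


@dataclass
class ChromaConfig:
    """Illustrative stand-in for the project's Config class."""
    chroma_host: str = "localhost"
    chroma_port: int = 8000
    chroma_collection: str = "my_documents"


def missing_config_attrs(config: Any) -> List[str]:
    """Return the required attribute names that the config object lacks."""
    return [name for name in REQUIRED_ATTRS if not hasattr(config, name)]
```

Calling `missing_config_attrs(config)` before loading gives a clearer failure than the `AttributeError` that would otherwise surface inside the function.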
Return Value
Returns a list of dictionaries, where each dictionary represents a document with the following keys: 'id' (string, unique document identifier), 'text' (string, document content), 'embedding' (list of floats or None, vector embedding of the document), and 'metadata' (dictionary, additional metadata associated with the document). Returns an empty list if the collection is not found, if an error occurs during retrieval, or if the collection contains no documents.
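Since 'embedding' may be None, callers that need vectors will usually want to separate usable documents from the rest. The following is a small sketch of one way to do that; `split_by_embedding` is a hypothetical helper, not something the original module provides.

```python
from typing import Any, Dict, List, Tuple

Document = Dict[str, Any]


def split_by_embedding(documents: List[Document]) -> Tuple[List[Document], List[Document]]:
    """Partition documents into (with_embedding, without_embedding)."""
    with_embedding: List[Document] = []
    without_embedding: List[Document] = []
    for doc in documents:
        # Compare against None explicitly: an embedding may be an array-like
        # object whose truth value is ambiguous.
        if doc.get("embedding") is not None:
            with_embedding.append(doc)
        else:
            without_embedding.append(doc)
    return with_embedding, without_embedding
```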
Dependencies
chromadb
Required Imports
import chromadb
from chromadb.config import Settings
Usage Example
```python
import chromadb
from chromadb.config import Settings

# Create a simple config object
class Config:
    def __init__(self):
        self.chroma_host = 'localhost'
        self.chroma_port = 8000
        self.chroma_collection = 'my_documents'

config = Config()

# Load documents from ChromaDB
documents = load_data_from_chromadb(config)

# Process the loaded documents
if documents:
    print(f"Loaded {len(documents)} documents")
    for doc in documents[:3]:  # Print first 3 documents
        print(f"ID: {doc['id']}")
        print(f"Text: {doc['text'][:100]}...")  # First 100 chars
        print(f"Has embedding: {doc['embedding'] is not None}")
        print(f"Metadata: {doc['metadata']}")
        print("---")
else:
    print("No documents loaded or collection not found")
```
Best Practices
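- Ensure the ChromaDB server is running before calling this function; the client is created outside the function's try blocks, so any connection failure raised at that point propagates to the caller
- Handle the empty list return value appropriately in your code; it can mean a missing collection, a retrieval error, or simply an empty collection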
- The function prints error messages to stdout; consider capturing or logging these in production environments
- Verify that the config object has all required attributes (chroma_host, chroma_port, chroma_collection) before passing it to the function
- Be aware that this function loads ALL documents from the collection into memory, which may be problematic for very large collections
- The function disables anonymized telemetry in ChromaDB settings; adjust this if telemetry is desired
- Consider implementing pagination or batch processing for large collections to avoid memory issues
- The returned embeddings may be None if they were not stored in the collection; always check before using them
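The pagination advice above can be sketched as a generator that fetches the collection in fixed-size batches through the `limit` and `offset` arguments of `collection.get`. This is a minimal sketch rather than the project's implementation: the name `iter_documents_in_batches` and the default batch size are assumptions, and the `collection` argument is taken to be the object returned by `client.get_collection`.

```python
from typing import Any, Dict, Iterator, List


def iter_documents_in_batches(
    collection: Any, batch_size: int = 1000
) -> Iterator[List[Dict[str, Any]]]:
    """Yield lists of document dicts, at most batch_size documents at a time."""
    offset = 0
    while True:
        result = collection.get(
            limit=batch_size,
            offset=offset,
            include=["embeddings", "documents", "metadatas"],
        )
        ids = result["ids"]
        if not ids:
            break  # No more documents to fetch.

        texts = result.get("documents")
        embeddings = result.get("embeddings")
        metadatas = result.get("metadatas")

        batch = []
        for i, doc_id in enumerate(ids):
            metadata = metadatas[i] if metadatas is not None else None
            batch.append({
                "id": doc_id,
                "text": texts[i] if texts is not None else None,
                "embedding": embeddings[i] if embeddings is not None else None,
                "metadata": metadata if metadata is not None else {},
            })
        yield batch
        offset += len(ids)
```

Each batch can be processed and discarded before the next one is requested, so peak memory depends on `batch_size` rather than on the size of the collection.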
Similar Components
AI-powered semantic similarity - components with related functionality:
- function load_data_from_chromadb_v1 96.1% similar
- function save_data_to_chromadb 81.7% similar
- function save_data_to_chromadb_v1 81.5% similar
- function test_chroma_collections 62.6% similar
- function test_collection_creation 56.8% similar