function load_data_from_chromadb_v1
Retrieves all documents from a specified ChromaDB collection, including their IDs, text content, embeddings, and metadata.
File: /tf/active/vicechatdev/chromadb-cleanup/main copy.py
Lines: 69 - 107
Complexity: simple
Purpose
This function connects to a ChromaDB server through chromadb.HttpClient and fetches every document in the specified collection. It converts the ChromaDB response, which holds parallel lists of IDs, texts, embeddings, and metadata, into a list of per-document dictionaries suitable for the downstream cleaning and clustering components. If the collection lookup raises ValueError, the function prints a "not found" message and returns an empty list.
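The conversion step can be tried on a hand-written response without a running server. The sketch below assumes the column-oriented shape used in the Source Code section (parallel lists under 'ids', 'documents', 'embeddings', and 'metadatas'). The helper name rows_from_result and the sample values are invented for illustration, and unlike the original loop the sketch also treats a column that is present but None as missing.

```python
def rows_from_result(result):
    """Turn parallel lists from a ChromaDB-style response into one dict per document."""
    ids = result["ids"]
    # A column may be absent or None; fall back to per-document defaults.
    embeddings = result.get("embeddings")
    metadatas = result.get("metadatas")
    documents = []
    for i, doc_id in enumerate(ids):
        documents.append({
            "id": doc_id,
            "text": result["documents"][i],
            "embedding": embeddings[i] if embeddings is not None else None,
            # Individual metadata entries can be None as well; normalize to {}.
            "metadata": (metadatas[i] if metadatas is not None else None) or {},
        })
    return documents


# Hypothetical response with two documents (values invented for illustration).
sample_result = {
    "ids": ["doc-1", "doc-2"],
    "documents": ["first text", "second text"],
    "embeddings": [[0.1, 0.2], [0.3, 0.4]],
    "metadatas": [{"source": "a.txt"}, None],
}

rows = rows_from_result(sample_result)
print(rows[0])
```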
Source Code
```python
def load_data_from_chromadb(config):
    """
    Load documents from ChromaDB.

    Args:
        config: Configuration object

    Returns:
        List of document dictionaries with 'id', 'text', and 'embedding' keys
    """
    # Connect to ChromaDB
    client = chromadb.HttpClient(
        host=config.chroma_host,
        port=config.chroma_port,
        settings=Settings(anonymized_telemetry=False)
    )

    # Get the collection
    try:
        collection = client.get_collection(name=config.chroma_collection)
    except ValueError:
        print(f"Collection '{config.chroma_collection}' not found")
        return []

    # Get all documents from the collection
    result = collection.get(include=['embeddings', 'documents', 'metadatas'])

    # Convert to the format required by our cleaners
    documents = []
    for i in range(len(result['ids'])):
        doc = {
            'id': result['ids'][i],
            'text': result['documents'][i],
            'embedding': result['embeddings'][i] if 'embeddings' in result else None,
            'metadata': result['metadatas'][i] if 'metadatas' in result else {}
        }
        documents.append(doc)

    return documents
```
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| config | - | - | positional_or_keyword |
Parameter Details
config: Configuration object that must provide three attributes: 'chroma_host' (string, ChromaDB server hostname), 'chroma_port' (integer, ChromaDB server port number), and 'chroma_collection' (string, name of the ChromaDB collection to query). This is typically an instance of the src.config.Config class.
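Because the function only reads three attributes, any object that carries them will do. The following is a minimal sketch of such an object plus a pre-flight check; ChromaConfig and missing_config_attrs are hypothetical names, not part of src.config, and the default values are borrowed from the usage example on this page.

```python
from dataclasses import dataclass

# The three attributes the loader reads from its config argument.
REQUIRED_ATTRS = ("chroma_host", "chroma_port", "chroma_collection")


@dataclass
class ChromaConfig:
    """Hypothetical stand-in for src.config.Config with only the fields the loader needs."""
    chroma_host: str = "localhost"
    chroma_port: int = 8000
    chroma_collection: str = "my_documents"


def missing_config_attrs(config):
    """Return the names of required attributes that the config object lacks."""
    return [name for name in REQUIRED_ATTRS if not hasattr(config, name)]


config = ChromaConfig()
print(missing_config_attrs(config))    # no attributes missing
print(missing_config_attrs(object()))  # a bare object lacks all three
```

Running the check before the call turns a late AttributeError inside the loader into an explicit list of what is missing.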
Return Value
Returns a list of dictionaries, where each dictionary represents a document with the following keys: 'id' (string, unique document identifier), 'text' (string, document content), 'embedding' (list of floats or None, vector embedding of the document), and 'metadata' (dictionary, additional document metadata or empty dict). Returns an empty list if the collection is not found or contains no documents.
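As an illustration of consuming this return shape, the sketch below describes one list entry as a typed dictionary and collects the distinct embedding lengths. LoadedDocument and embedding_dimensions are hypothetical names introduced here, and the two sample documents are invented.

```python
from typing import Optional, TypedDict


class LoadedDocument(TypedDict):
    """Hypothetical type name for one entry of the returned list."""
    id: str
    text: str
    embedding: Optional[list]
    metadata: dict


def embedding_dimensions(documents):
    """Collect the distinct embedding lengths, skipping documents without an embedding."""
    return {len(doc["embedding"]) for doc in documents if doc["embedding"] is not None}


# Invented sample in the documented shape: one document with an embedding, one without.
docs = [
    LoadedDocument(id="a", text="alpha", embedding=[0.0, 1.0, 0.5], metadata={}),
    LoadedDocument(id="b", text="beta", embedding=None, metadata={"lang": "en"}),
]
print(embedding_dimensions(docs))
```

More than one value in the resulting set would suggest the collection mixes embeddings of different sizes, which is worth checking before clustering.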
Dependencies
chromadb
Required Imports
import chromadb
from chromadb.config import Settings
Usage Example
```python
import chromadb
from chromadb.config import Settings

# Assumes load_data_from_chromadb (see Source Code) is defined or imported in this module

# Create a simple config object
class Config:
    def __init__(self):
        self.chroma_host = 'localhost'
        self.chroma_port = 8000
        self.chroma_collection = 'my_documents'

config = Config()

# Load documents from ChromaDB
documents = load_data_from_chromadb(config)

# Process the results
if documents:
    print(f"Loaded {len(documents)} documents")
    for doc in documents:
        print(f"ID: {doc['id']}, Text length: {len(doc['text'])}")
        # Compare against None: truth-testing a NumPy array embedding can raise
        if doc['embedding'] is not None:
            print(f"Embedding dimension: {len(doc['embedding'])}")
else:
    print("No documents found or collection does not exist")
```
Best Practices
- Ensure ChromaDB server is running before calling this function to avoid connection errors
- Handle the empty list return value when the collection doesn't exist
- This function loads ALL documents from the collection into memory, which may be problematic for very large collections; consider pagination or filtering in that case
- The function disables anonymized telemetry in ChromaDB settings; adjust if telemetry is desired
- Verify that the config object has all required attributes (chroma_host, chroma_port, chroma_collection) before calling
- The function prints to stdout when collection is not found; consider using logging for production environments
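The pagination and logging suggestions above could look roughly like the sketch below. It assumes collection is the object returned by client.get_collection and relies on the limit and offset arguments of its get method. The function name iter_collection_batches and the batch size are illustrative choices, not part of the original module.

```python
import logging

logger = logging.getLogger(__name__)


def iter_collection_batches(collection, batch_size=1000):
    """Yield lists of document dicts, one batch at a time, instead of loading everything."""
    offset = 0
    while True:
        result = collection.get(
            include=["embeddings", "documents", "metadatas"],
            limit=batch_size,
            offset=offset,
        )
        ids = result["ids"]
        if not ids:
            break  # no more documents
        embeddings = result.get("embeddings")
        metadatas = result.get("metadatas")
        # Log through the logging module rather than printing to stdout.
        logger.info("Fetched %d documents at offset %d", len(ids), offset)
        yield [
            {
                "id": ids[i],
                "text": result["documents"][i],
                "embedding": embeddings[i] if embeddings is not None else None,
                "metadata": (metadatas[i] if metadatas is not None else None) or {},
            }
            for i in range(len(ids))
        ]
        offset += len(ids)
```

A caller would loop over the generator, for example `for batch in iter_collection_batches(collection, batch_size=500): ...`, so that only one batch is held in memory at a time.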
Similar Components
AI-powered semantic similarity - components with related functionality:
- function load_data_from_chromadb 96.1% similar
- function save_data_to_chromadb_v1 79.0% similar
- function save_data_to_chromadb 78.9% similar
- function test_chroma_collections 65.2% similar
- function test_collection_creation 59.4% similar