function load_data_from_chromadb

Maturity: 49

Connects to a ChromaDB instance and retrieves all documents from a specified collection, returning them as a list of dictionaries with document IDs, text content, embeddings, and metadata.

File: /tf/active/vicechatdev/chromadb-cleanup/main.py
Lines: 123 - 165
Complexity: moderate

Purpose

This function serves as a data loader for ChromaDB vector databases. It establishes an HTTP connection to a ChromaDB server, retrieves a specified collection, and fetches all documents with their embeddings and metadata. The function is designed to integrate with document cleaning pipelines, converting ChromaDB's native format into a standardized dictionary format suitable for downstream processing tasks like deduplication, similarity analysis, or clustering.
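
The conversion from ChromaDB's column-oriented result into per-document dictionaries can be tried without a running server. The following is a minimal sketch, not the project's code: rows_from_get_result is a hypothetical helper, and sample is hand-written stand-in data rather than real ChromaDB output.

```python
from typing import Any, Dict, List


def rows_from_get_result(result: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn a column-oriented get() result into one dict per document."""
    ids = result["ids"]
    texts = result.get("documents")
    embeddings = result.get("embeddings")
    metadatas = result.get("metadatas")

    rows = []
    for i, doc_id in enumerate(ids):
        # A whole column, or a single metadata entry, may be None
        metadata = metadatas[i] if metadatas is not None else None
        rows.append({
            "id": doc_id,
            "text": texts[i] if texts is not None else None,
            "embedding": embeddings[i] if embeddings is not None else None,
            "metadata": metadata if metadata is not None else {},
        })
    return rows


# Hand-written stand-in for a get() result (assumption, not real output)
sample = {
    "ids": ["a", "b"],
    "documents": ["first text", "second text"],
    "embeddings": [[0.1, 0.2], [0.3, 0.4]],
    "metadatas": [{"source": "x"}, None],
}
rows = rows_from_get_result(sample)
```

Each entry in rows carries the same four keys, so downstream steps such as deduplication or clustering can iterate over documents without indexing into parallel lists.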

Source Code

def load_data_from_chromadb(config):
    """
    Load documents from ChromaDB.
    
    Args:
        config: Configuration object
        
    Returns:
        List of document dictionaries with 'id', 'text', 'embedding', and 'metadata' keys
    """
    # Connect to ChromaDB
    client = chromadb.HttpClient(
        host=config.chroma_host,
        port=config.chroma_port,
        settings=Settings(anonymized_telemetry=False)
    )
    
    # Get the collection
    try:
        collection = client.get_collection(name=config.chroma_collection)
    except ValueError:
        print(f"Collection '{config.chroma_collection}' not found")
        return []
    
    # Get all documents from the collection
    try:
        result = collection.get(include=['embeddings', 'documents', 'metadatas'])
        
        # Convert to the format required by our cleaners
        documents = []
        for i in range(len(result['ids'])):
            doc = {
                'id': result['ids'][i],
                'text': result['documents'][i],
                'embedding': result['embeddings'][i] if result.get('embeddings') is not None else None,
                'metadata': (result['metadatas'][i] or {}) if result.get('metadatas') is not None else {}
            }
            documents.append(doc)
        
        return documents
    except Exception as e:
        print(f"Error loading collection '{config.chroma_collection}': {e}")
        return []

Parameters

Name    Type  Default  Kind
config  -     -        positional_or_keyword

Parameter Details

config: A configuration object that must provide the following attributes: 'chroma_host' (string, hostname or IP address of the ChromaDB server), 'chroma_port' (integer, port number of the ChromaDB HTTP API), and 'chroma_collection' (string, name of the collection to retrieve). This is typically an instance of the src.config.Config class.
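
Since the function only reads three attributes, any object that exposes them will do. The sketch below shows one way to build such an object and check it before the call. ChromaConfig and missing_config_attrs are hypothetical names introduced for illustration, and the default values are placeholders; the project's own src.config.Config may look different.

```python
from dataclasses import dataclass
from typing import List

# The three attributes that load_data_from_chromadb reads from its config
REQUIRED_ATTRS = ("chroma_host", "chroma_port", "chroma_collection")


@dataclass
class ChromaConfig:
    """Hypothetical stand-in for src.config.Config (values are placeholders)."""
    chroma_host: str = "localhost"
    chroma_port: int = 8000
    chroma_collection: str = "my_documents"


def missing_config_attrs(config: object) -> List[str]:
    """Return the names of required attributes that are absent or None."""
    return [name for name in REQUIRED_ATTRS if getattr(config, name, None) is None]


config = ChromaConfig()
missing = missing_config_attrs(config)
if missing:
    raise ValueError(f"Config is missing: {', '.join(missing)}")
```

Checking up front turns a late AttributeError inside the loader into an early, explicit message.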

Return Value

Returns a list of dictionaries, where each dictionary represents a document with the following keys: 'id' (string, unique document identifier), 'text' (string, document content), 'embedding' (list of floats or None, vector embedding of the document), and 'metadata' (dictionary, additional metadata associated with the document). Returns an empty list if the collection is not found, if it contains no documents, or if an error occurs during retrieval, so callers cannot tell these cases apart from the return value alone.
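
Because embeddings may be None, a caller will often want a small guard before any vector-based processing. A possible sketch, where split_by_embedding is a hypothetical helper name and the sample documents are made up for illustration:

```python
from typing import Any, Dict, List, Tuple

Document = Dict[str, Any]


def split_by_embedding(documents: List[Document]) -> Tuple[List[Document], List[Document]]:
    """Separate documents that carry an embedding from those that do not."""
    with_vec = [d for d in documents if d.get("embedding") is not None]
    without_vec = [d for d in documents if d.get("embedding") is None]
    return with_vec, without_vec


# Made-up documents in the shape returned by load_data_from_chromadb
docs = [
    {"id": "a", "text": "first", "embedding": [0.1, 0.2], "metadata": {}},
    {"id": "b", "text": "second", "embedding": None, "metadata": {"source": "x"}},
]
usable, skipped = split_by_embedding(docs)
```

Using "is not None" rather than a truthiness test keeps the check valid if embeddings arrive as array-like objects instead of plain lists.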

Dependencies

  • chromadb

Required Imports

import chromadb
from chromadb.config import Settings

Usage Example

import chromadb
from chromadb.config import Settings

# Create a simple config object
class Config:
    def __init__(self):
        self.chroma_host = 'localhost'
        self.chroma_port = 8000
        self.chroma_collection = 'my_documents'

config = Config()

# Load documents from ChromaDB
# (assumes load_data_from_chromadb is defined in, or imported into, this module)
documents = load_data_from_chromadb(config)

# Process the loaded documents
if documents:
    print(f"Loaded {len(documents)} documents")
    for doc in documents[:3]:  # Print first 3 documents
        print(f"ID: {doc['id']}")
        print(f"Text: {doc['text'][:100]}...")  # First 100 chars
        print(f"Has embedding: {doc['embedding'] is not None}")
        print(f"Metadata: {doc['metadata']}")
        print("---")
else:
    print("No documents loaded or collection not found")

Best Practices

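  • Ensure the ChromaDB server is running and reachable before calling this function: the client is created outside any try block and only ValueError is caught around get_collection, so connection errors raised at those points propagate to the caller
  • Handle the empty list return value explicitly; it can mean a missing collection, an empty collection, or a retrieval error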
  • The function prints error messages to stdout; consider capturing or logging these in production environments
  • Verify that the config object has all required attributes (chroma_host, chroma_port, chroma_collection) before passing it to the function
  • The function loads ALL documents from the collection into memory in a single call; for very large collections, consider pagination or batch processing to avoid memory issues
  • The function disables anonymized telemetry in ChromaDB settings; adjust this if telemetry is desired
  • The returned embeddings may be None if they were not stored in the collection; always check before using them

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function load_data_from_chromadb_v1 96.1% similar

    Retrieves all documents from a specified ChromaDB collection, including their IDs, text content, embeddings, and metadata.

    From: /tf/active/vicechatdev/chromadb-cleanup/main copy.py
  • function save_data_to_chromadb 81.7% similar

    Saves a list of document dictionaries to a ChromaDB vector database collection, optionally including embeddings and metadata.

    From: /tf/active/vicechatdev/chromadb-cleanup/main copy.py
  • function save_data_to_chromadb_v1 81.5% similar

    Saves a list of document dictionaries to a ChromaDB collection, with support for batch processing, embeddings, and metadata storage.

    From: /tf/active/vicechatdev/chromadb-cleanup/main.py
  • function test_chroma_collections 62.6% similar

    A diagnostic function that tests connectivity to ChromaDB instances across multiple connection methods and lists all available collections with their metadata.

    From: /tf/active/vicechatdev/test_chroma_collections.py
  • function test_collection_creation 56.8% similar

    A diagnostic test function that verifies Chroma DB functionality by creating a test collection, adding a document, querying it, and cleaning up.

    From: /tf/active/vicechatdev/test_chroma_collections.py