function load_data_from_chromadb_v1

Maturity: 47

Retrieves all documents from a specified ChromaDB collection, including their IDs, text content, embeddings, and metadata.

File: /tf/active/vicechatdev/chromadb-cleanup/main copy.py
Lines: 69 - 107
Complexity: simple

Purpose

This function connects to a ChromaDB instance through an HTTP client and fetches every document in the specified collection. It converts ChromaDB's column-oriented response (parallel lists of IDs, documents, embeddings, and metadata) into a list of per-document dictionaries for the downstream cleaning and clustering components. If get_collection raises ValueError, the function prints a "not found" message and returns an empty list; other exceptions, such as connection failures, propagate to the caller.

Source Code

def load_data_from_chromadb(config):
    """
    Load documents from ChromaDB.
    
    Args:
        config: Configuration object
        
    Returns:
        List of document dictionaries with 'id', 'text', 'embedding', and 'metadata' keys
    """
    # Connect to ChromaDB
    client = chromadb.HttpClient(
        host=config.chroma_host,
        port=config.chroma_port,
        settings=Settings(anonymized_telemetry=False)
    )
    
    # Get the collection
    try:
        collection = client.get_collection(name=config.chroma_collection)
    except ValueError:
        print(f"Collection '{config.chroma_collection}' not found")
        return []
    
    # Get all documents from the collection
    result = collection.get(include=['embeddings', 'documents', 'metadatas'])
    
    # Convert to the format required by our cleaners
    documents = []
    for i in range(len(result['ids'])):
        doc = {
            'id': result['ids'][i],
            'text': result['documents'][i],
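            # A field can be present but None, so test the value rather than key membership
            'embedding': result['embeddings'][i] if result.get('embeddings') is not None else None,
            'metadata': (result['metadatas'][i] or {}) if result.get('metadatas') is not None else {}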
        }
        documents.append(doc)
    
    return documents

Parameters

Name    Type  Default  Kind
config  -     -        positional_or_keyword

Parameter Details

config: Configuration object that must provide three attributes: 'chroma_host' (string, ChromaDB server hostname), 'chroma_port' (integer, ChromaDB server port number), and 'chroma_collection' (string, name of the ChromaDB collection to query). This is typically an instance of the src.config.Config class.
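
The attribute requirements above can be sketched as a small dataclass plus a check that runs before the function is called. This is only an illustration: the definition of src.config.Config is not shown here, and the names ChromaConfig, REQUIRED_ATTRS, and missing_chroma_attrs are hypothetical.

```python
from dataclasses import dataclass

# Attributes that load_data_from_chromadb reads from its config argument.
REQUIRED_ATTRS = ("chroma_host", "chroma_port", "chroma_collection")


@dataclass
class ChromaConfig:
    """Hypothetical stand-in for src.config.Config."""
    chroma_host: str = "localhost"
    chroma_port: int = 8000
    chroma_collection: str = "my_documents"


def missing_chroma_attrs(config):
    """Return the names of required attributes that are absent or None."""
    return [name for name in REQUIRED_ATTRS if getattr(config, name, None) is None]


config = ChromaConfig(chroma_collection="reports")
missing = missing_chroma_attrs(config)
if missing:
    raise AttributeError(f"config is missing: {', '.join(missing)}")
```

Checking up front turns a late AttributeError inside the loader into an explicit message that names the missing settings.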

Return Value

Returns a list of dictionaries, where each dictionary represents a document with the following keys: 'id' (string, unique document identifier), 'text' (string, document content), 'embedding' (list of floats or None, vector embedding of the document), and 'metadata' (dictionary, additional document metadata or empty dict). Returns an empty list if the collection is not found or contains no documents.
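
To make the return shape concrete, the sketch below applies the same column-to-row conversion to a hand-written sample instead of a live server. It assumes that fields not requested from ChromaDB come back as None rather than being absent; the helper name rows_from_get_result and the sample data are illustrative, not part of the original module.

```python
def rows_from_get_result(result):
    """Convert a column-oriented get() result into per-document dicts."""
    texts = result.get("documents")
    embeddings = result.get("embeddings")
    metadatas = result.get("metadatas")

    documents = []
    for i, doc_id in enumerate(result["ids"]):
        documents.append({
            "id": doc_id,
            "text": texts[i] if texts is not None else None,
            "embedding": embeddings[i] if embeddings is not None else None,
            # Individual metadata entries may be None; normalize them to {}.
            "metadata": (metadatas[i] or {}) if metadatas is not None else {},
        })
    return documents


# Hand-written sample shaped like a collection.get() response (illustrative data).
sample = {
    "ids": ["a", "b"],
    "documents": ["first text", "second text"],
    "embeddings": [[0.1, 0.2], [0.3, 0.4]],
    "metadatas": [{"source": "x"}, None],
}
rows = rows_from_get_result(sample)
print(rows[0]["metadata"], rows[1]["metadata"])
```

Each returned dictionary always carries all four keys, so downstream code can index them without checking for their presence.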

Dependencies

  • chromadb

Required Imports

import chromadb
from chromadb.config import Settings

Usage Example

python
import chromadb
from chromadb.config import Settings

# Create a simple config object
class Config:
    def __init__(self):
        self.chroma_host = 'localhost'
        self.chroma_port = 8000
        self.chroma_collection = 'my_documents'

config = Config()

# Load documents from ChromaDB
documents = load_data_from_chromadb(config)

# Process the results
if documents:
    print(f"Loaded {len(documents)} documents")
    for doc in documents:
        print(f"ID: {doc['id']}, Text length: {len(doc['text'])}")
        if doc['embedding'] is not None:
            print(f"Embedding dimension: {len(doc['embedding'])}")
else:
    print("No documents found or collection does not exist")

Best Practices

  • Ensure ChromaDB server is running before calling this function to avoid connection errors
  • Handle the empty list return value when the collection doesn't exist
  • This function loads ALL documents from the collection into memory at once; for very large collections, consider pagination or filtering to avoid memory issues
  • The function disables anonymized telemetry in ChromaDB settings; adjust if telemetry is desired
  • Verify that the config object has all required attributes (chroma_host, chroma_port, chroma_collection) before calling
  • The function prints to stdout when collection is not found; consider using logging for production environments
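
The pagination and logging suggestions above could look roughly like the following. It is a sketch under two assumptions: that the installed chromadb version's Collection.get accepts limit and offset, and that callers can consume one page at a time. The function name iter_collection_batches and the default batch size are hypothetical.

```python
import logging

logger = logging.getLogger(__name__)


def iter_collection_batches(collection, batch_size=1000):
    """Yield get() results page by page instead of loading everything at once."""
    offset = 0
    while True:
        page = collection.get(
            include=["embeddings", "documents", "metadatas"],
            limit=batch_size,
            offset=offset,
        )
        if not page["ids"]:
            break  # an empty page means the collection is exhausted
        logger.info("Fetched %d documents at offset %d", len(page["ids"]), offset)
        yield page
        offset += len(page["ids"])
```

Because this is a generator, only one page is held in memory at a time, and progress goes through the logging module rather than print.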

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function load_data_from_chromadb 96.1% similar

    Connects to a ChromaDB instance and retrieves all documents from a specified collection, returning them as a list of dictionaries with document IDs, text content, embeddings, and metadata.

    From: /tf/active/vicechatdev/chromadb-cleanup/main.py
  • function save_data_to_chromadb_v1 79.0% similar

    Saves a list of document dictionaries to a ChromaDB collection, with support for batch processing, embeddings, and metadata storage.

    From: /tf/active/vicechatdev/chromadb-cleanup/main.py
  • function save_data_to_chromadb 78.9% similar

    Saves a list of document dictionaries to a ChromaDB vector database collection, optionally including embeddings and metadata.

    From: /tf/active/vicechatdev/chromadb-cleanup/main copy.py
  • function test_chroma_collections 65.2% similar

    A diagnostic function that tests connectivity to ChromaDB instances across multiple connection methods and lists all available collections with their metadata.

    From: /tf/active/vicechatdev/test_chroma_collections.py
  • function test_collection_creation 59.4% similar

    A diagnostic test function that verifies Chroma DB functionality by creating a test collection, adding a document, querying it, and cleaning up.

    From: /tf/active/vicechatdev/test_chroma_collections.py