
class ChromaManager

Maturity: 51

ChromaManager is a class that manages interactions with a Chroma vector database, providing methods to create collections, add documents with embeddings, and query for similar documents.

File:
/tf/active/vicechatdev/QA_updater/knowledge_store/chroma_manager.py
Lines:
80 - 168
Complexity:
moderate

Purpose

This class serves as an abstraction layer for working with ChromaDB, a vector database used for semantic search and retrieval. It handles database initialization, collection management, document insertion with embeddings, and similarity-based querying. The class is designed to work with a remote ChromaDB instance via HTTP and uses custom embedding functions for document vectorization. It's particularly useful for building RAG (Retrieval-Augmented Generation) systems or any application requiring semantic document search.

Source Code

class ChromaManager:
    """Manages interactions with the Chroma vector database."""

    def __init__(self, config: ConfigParser):
        """
        Initializes the ChromaManager with the database path specified in the config.

        Args:
            config (ConfigParser): Configuration object containing database settings.
        """
        self.logger = logging.getLogger(__name__)
        self.db_path = config.get('database', 'chroma_db_path', fallback='./chroma_db')

        api_key = "<YOUR_OPENAI_API_KEY>"  # Placeholder: supply the real key from the environment or a secrets store; never commit it
        chroma_embedder = MyEmbeddingFunction("gpt-4o-mini", "text-embedding-3-small", api_key)
        # Keep the HTTP client so that other collections can be reached later.
        self._chroma_client = chromadb.HttpClient(host='vice_chroma', port=8000)
        # Despite its name, self.client is the Collection named by db_path.
        self.client = self._chroma_client.get_collection(
            name=self.db_path,
            embedding_function=chroma_embedder,
        )

        self.logger.info(f"ChromaManager initialized with database path: {self.db_path}")

    def get_or_create_collection(self, collection_name: str):
        """
        Gets or creates a Chroma collection with the given name.

        Args:
            collection_name (str): The name of the collection.

        Returns:
            chromadb.api.models.Collection.Collection: The Chroma collection object.
        """
        # Collections are managed by the HTTP client. self.client is a Collection
        # and has no get_collection/create_collection methods.
        collection = self._chroma_client.get_or_create_collection(name=collection_name)
        self.logger.info(f"Collection '{collection_name}' is ready.")
        return collection

    def add_documents(self, collection_name: str, documents: List[str], ids: List[str], embeddings: List[List[float]], metadatas: List[Dict[str, Any]] = None):
        """
        Adds documents to the specified Chroma collection.

        Args:
            collection_name (str): The name of the collection.
            documents (List[str]): A list of document texts to add.
            ids (List[str]): A list of unique IDs for the documents.
            embeddings (List[List[float]]): A list of embeddings for the documents.
            metadatas (List[Dict[str, Any]], optional): A list of metadata dictionaries for the documents. Defaults to None.
        """
        try:
            collection = self.get_or_create_collection(collection_name)
            collection.add(
                documents=documents,
                ids=ids,
                embeddings=embeddings,
                metadatas=metadatas if metadatas else None
            )
            self.logger.info(f"Added {len(documents)} documents to collection '{collection_name}'.")
        except Exception as e:
            self.logger.exception(f"Error adding documents to Chroma collection: {e}")

    def query_collection(self, query: str, n_results: int = 4) -> List[str]:
        """
        Queries the collection configured at initialization with the given text.

        Args:
            query (str): The query text; it is embedded by the collection's embedding function.
            n_results (int, optional): The number of results to return. Defaults to 4.

        Returns:
            List[str]: The document texts most similar to the query, or an empty list on error.
        """
        try:
            results = self.client.query(
                query_texts=[query],
                n_results=n_results
            )
            # 'documents' holds one list per query text; only one query is sent here.
            documents = (results.get('documents') or [[]])[0]
            self.logger.info(f"Queried collection '{self.db_path}' and retrieved {len(documents)} results.")
            self.logger.info(f"Results : '{documents}'.")
            return documents
        except Exception as e:
            self.logger.exception(f"Error querying Chroma collection: {e}")
            return []

Parameters

Name Type Default Kind
bases - -

Parameter Details

config: A ConfigParser object containing database configuration settings. The constructor reads the 'chroma_db_path' key from the 'database' section and falls back to './chroma_db' when the key is missing. Despite its name, this value is not used as a filesystem path: it is passed as the collection name when connecting to the ChromaDB instance.
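The lookup described above can be tried in isolation. The following is a minimal sketch using only the standard library; the INI text and the collection name `my_collection` are illustrative assumptions, while the section name, key name, and fallback value come from the constructor.

```python
from configparser import ConfigParser

# Hypothetical INI content; only the 'database' section and the
# 'chroma_db_path' key are taken from the class itself.
INI_TEXT = """
[database]
chroma_db_path = my_collection
"""

config = ConfigParser()
config.read_string(INI_TEXT)

# Same lookup the constructor performs, including its fallback.
collection_name = config.get('database', 'chroma_db_path', fallback='./chroma_db')
print(collection_name)

# When the key is absent, the fallback value is returned instead.
empty = ConfigParser()
empty.add_section('database')
fallback_name = empty.get('database', 'chroma_db_path', fallback='./chroma_db')
print(fallback_name)
```

Because the value ends up as a collection name, the path-like fallback may not be accepted by the server as a valid name, so setting the key explicitly is probably the safer choice.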

Return Value

The constructor returns a ChromaManager instance. The get_or_create_collection method returns a chromadb.api.models.Collection.Collection object. The add_documents method returns None (void). The query_collection method returns a List[str] containing document texts that are most similar to the query, or an empty list if an error occurs.

Class Interface

Methods

__init__(self, config: ConfigParser)

Purpose: Initializes the ChromaManager with database configuration, establishes connection to ChromaDB server, and retrieves the specified collection

Parameters:

  • config: ConfigParser object containing database settings, specifically 'chroma_db_path' under 'database' section

Returns: None (constructor)
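The constructor needs an OpenAI API key for the embedding function. One way to avoid embedding the key in source is to read it from the environment. This is a minimal sketch, not the project's actual mechanism: the helper `load_api_key` and the variable name `OPENAI_API_KEY` are assumptions made for illustration.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in env_var (hypothetical helper).

    Raises RuntimeError when the variable is unset or empty, so a missing
    key is reported at startup rather than on the first embedding call.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return key
```

Inside `__init__`, the returned value could then be passed to `MyEmbeddingFunction` in place of a string literal.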

get_or_create_collection(self, collection_name: str) -> chromadb.api.models.Collection.Collection

Purpose: Retrieves an existing Chroma collection by name, or creates it if it doesn't exist

Parameters:

  • collection_name: String name of the collection to get or create

Returns: A chromadb.api.models.Collection.Collection object representing the collection

add_documents(self, collection_name: str, documents: List[str], ids: List[str], embeddings: List[List[float]], metadatas: List[Dict[str, Any]] = None)

Purpose: Adds documents with their embeddings and metadata to a specified Chroma collection

Parameters:

  • collection_name: Name of the collection to add documents to
  • documents: List of document text strings to add
  • ids: List of unique string identifiers for each document (must match length of documents)
  • embeddings: List of embedding vectors (list of floats) for each document
  • metadatas: Optional list of metadata dictionaries for each document, defaults to None

Returns: None (void method, logs success or errors)
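Since add_documents logs failures instead of raising them, it can help to check the inputs before calling it. The sketch below is a hypothetical pre-flight check (the name `validate_batch` is not part of ChromaManager); it only enforces the constraints stated above: matching lengths, unique IDs, and embeddings of one consistent dimensionality.

```python
from typing import Any, Dict, List, Optional


def validate_batch(
    documents: List[str],
    ids: List[str],
    embeddings: List[List[float]],
    metadatas: Optional[List[Dict[str, Any]]] = None,
) -> None:
    """Raise ValueError if the batch is inconsistent (hypothetical helper)."""
    n = len(documents)
    if len(ids) != n or len(embeddings) != n:
        raise ValueError("documents, ids and embeddings must have the same length")
    if metadatas is not None and len(metadatas) != n:
        raise ValueError("metadatas must have one entry per document")
    if len(set(ids)) != n:
        raise ValueError("ids must be unique")
    # All vectors in one batch must share the same dimensionality.
    if len({len(vector) for vector in embeddings}) > 1:
        raise ValueError("embeddings must all have the same dimensionality")
```

Calling `validate_batch(documents, ids, embeddings, metadatas)` just before `add_documents(...)` turns a silently logged failure into an immediate, explicit error.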

query_collection(self, query: str, n_results: int = 4) -> List[str]

Purpose: Queries the collection specified in config with a text query and returns the most similar documents

Parameters:

  • query: Text query string to search for similar documents
  • n_results: Number of similar documents to return, defaults to 4

Returns: List of document text strings that are most similar to the query, or empty list on error
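The reason the method indexes the result with `[0]` is that a Chroma query result groups its values per query text: `documents` is a list containing one inner list for each query. The sketch below illustrates that unpacking on a hand-written dictionary; the sample data and the helper name `first_query_documents` are made up for illustration.

```python
from typing import Any, Dict, List


def first_query_documents(results: Dict[str, Any]) -> List[str]:
    """Return the documents for the first (and here only) query text."""
    documents = results.get("documents") or []
    return list(documents[0]) if documents else []


# Hand-written sample in the shape of a single-query result.
sample = {
    "ids": [["doc1", "doc2"]],
    "documents": [["This is document 1", "This is document 2"]],
    "distances": [[0.12, 0.34]],
}

print(first_query_documents(sample))
print(first_query_documents({}))
```

Counting `len(results["documents"])` gives the number of queries (1 here), not the number of hits, so the hit count should be taken from the inner list.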

Attributes

Name | Type | Description | Scope
logger | logging.Logger | Logger instance for logging ChromaManager operations and errors | instance
db_path | str | Path/name of the database collection, retrieved from config or defaults to './chroma_db' | instance
client | chromadb.api.models.Collection.Collection | ChromaDB collection object representing the active collection for queries and operations | instance

Dependencies

  • chromadb
  • logging
  • typing
  • configparser
  • langchain_openai
  • tiktoken
  • openai

Required Imports

import chromadb
from chromadb import Documents
from chromadb import EmbeddingFunction
from chromadb import Embeddings
import logging
from typing import List, Dict, Any
from configparser import ConfigParser
from langchain_openai import ChatOpenAI
from langchain_openai import AzureChatOpenAI
import tiktoken
import openai

Usage Example

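from configparser import ConfigParser
import logging

# ChromaManager (and the MyEmbeddingFunction it depends on) must be importable
# here; the exact import path depends on your project layout.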

# Setup logging
logging.basicConfig(level=logging.INFO)

# Create configuration
config = ConfigParser()
config.add_section('database')
config.set('database', 'chroma_db_path', 'my_collection')

# Initialize ChromaManager
chroma_manager = ChromaManager(config)

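# Add documents to the collection that query_collection searches
# (the one named by 'chroma_db_path' in the config)
documents = ['This is document 1', 'This is document 2']
ids = ['doc1', 'doc2']
# Placeholder vectors; real embeddings should come from the same model the
# collection uses (text-embedding-3-small yields 1536 dimensions by default)
embeddings = [[0.1] * 1536, [0.2] * 1536]
metadatas = [{'source': 'file1.txt'}, {'source': 'file2.txt'}]

chroma_manager.add_documents(
    collection_name='my_collection',
    documents=documents,
    ids=ids,
    embeddings=embeddings,
    metadatas=metadatas
)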

# Query the collection
query_text = 'search query'
results = chroma_manager.query_collection(query=query_text, n_results=4)
print(f'Found {len(results)} similar documents')
for doc in results:
    print(doc)

Best Practices

  • Replace the hardcoded API key with environment variables or secure configuration management
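  • Ensure the ChromaDB server is running and accessible at 'vice_chroma:8000' before instantiation, and that the collection named by 'chroma_db_path' already exists: the constructor calls get_collection, which does not create it
  • The MyEmbeddingFunction class must be defined before using ChromaManager
  • Always provide unique IDs when adding documents to avoid conflicts
  • The constructor can raise on network or database errors, so wrap instantiation in exception handling; add_documents and query_collection catch and log their own exceptions, so check the logs (and treat an empty result list as a possible failure) rather than relying on try/except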
  • The class maintains a persistent connection to a single collection specified in the config
  • Use get_or_create_collection for collection management to avoid errors
  • Embeddings must match the dimensionality expected by the embedding function
  • The query_collection method queries the collection specified in the config, not a parameter
  • Consider implementing connection pooling or retry logic for production use
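The retry suggestion in the last item could look like the following. This is a minimal sketch of retrying with exponential backoff using only the standard library; `with_retries` and its parameters are hypothetical and not part of ChromaManager.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def with_retries(operation: Callable[[], T], attempts: int = 3, base_delay: float = 0.5) -> T:
    """Call operation, retrying on any exception with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception as exc:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error to the caller.
            delay = base_delay * 2 ** (attempt - 1)
            logger.warning("Attempt %d failed (%s); retrying in %.1fs", attempt, exc, delay)
            time.sleep(delay)
    raise AssertionError("unreachable")
```

A wrapper like this is useful around calls that raise, such as `with_retries(lambda: ChromaManager(config))`. It would have no effect around add_documents or query_collection as written, because those methods catch their own exceptions.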

Similar Components

AI-powered semantic similarity - components with related functionality:

  • class MyEmbeddingFunction 68.4% similar

    Custom embedding function class that integrates OpenAI's embedding API with Chroma DB for generating vector embeddings from text documents.

    From: /tf/active/vicechatdev/project_victoria_disclosure_generator.py
  • class DocumentIndexer 67.2% similar

    A class for indexing documents into ChromaDB with support for multiple file formats (PDF, Word, PowerPoint, Excel, text files), smart incremental indexing, and document chunk management.

    From: /tf/active/vicechatdev/docchat/document_indexer.py
  • class MyEmbeddingFunction_v1 63.8% similar

    A custom embedding function class that generates embeddings for documents using OpenAI's API, with built-in text summarization for long documents and token management.

    From: /tf/active/vicechatdev/OneCo_hybrid_RAG copy.py
  • class DocChatEmbeddingFunction 63.0% similar

    A custom ChromaDB embedding function that generates OpenAI embeddings with automatic text summarization for documents exceeding token limits.

    From: /tf/active/vicechatdev/docchat/document_indexer.py
  • function save_data_to_chromadb 61.8% similar

    Saves a list of document dictionaries to a ChromaDB vector database collection, optionally including embeddings and metadata.

    From: /tf/active/vicechatdev/chromadb-cleanup/main copy.py