ChromaManager - Code Extractor

class ChromaManager

Maturity: 51

ChromaManager is a class that manages interactions with a Chroma vector database, providing methods to create collections, add documents with embeddings, and query for similar documents.

File:
/tf/active/vicechatdev/QA_updater/knowledge_store/chroma_manager.py

Lines:
80 - 168

Complexity:
moderate

Purpose

This class serves as an abstraction layer for working with ChromaDB, a vector database used for semantic search and retrieval. It handles database initialization, collection management, document insertion with embeddings, and similarity-based querying. The class is designed to work with a remote ChromaDB instance via HTTP and uses custom embedding functions for document vectorization. It's particularly useful for building RAG (Retrieval-Augmented Generation) systems or any application requiring semantic document search.
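To make "similarity-based querying" concrete, the following is a minimal pure-Python sketch of ranking stored vectors against a query vector by cosine similarity. It only illustrates the idea: ChromaDB uses its own index and distance settings, and `cosine_similarity` and `top_k` are hypothetical helper names, not part of ChromaManager or chromadb.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # A zero vector has no direction; treat it as dissimilar.
    return dot / (norm_a * norm_b)


def top_k(query: List[float], store: Dict[str, List[float]], k: int = 4) -> List[str]:
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(
        store,
        key=lambda doc_id: cosine_similarity(query, store[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy two-dimensional "embeddings", invented for illustration.
store = {
    "doc1": [1.0, 0.0],
    "doc2": [0.0, 1.0],
    "doc3": [0.9, 0.1],
}
print(top_k([1.0, 0.0], store, k=2))  # ['doc1', 'doc3']
```

A vector database performs the same kind of ranking, but over an index built for large collections rather than a linear scan.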

Source Code

class ChromaManager:
    """Manages interactions with the Chroma vector database."""

    def __init__(self, config: ConfigParser):
        """
        Initializes the ChromaManager with the database path specified in the config.

        Args:
            config (ConfigParser): Configuration object containing database settings.
        """
        self.logger = logging.getLogger(__name__)
        self.db_path = config.get('database', 'chroma_db_path', fallback='./chroma_db')

        # Placeholder only: load the real key from an environment variable or a secrets manager.
        api_key = "YOUR_OPENAI_API_KEY"
        chroma_embedder = MyEmbeddingFunction("gpt-4o-mini", "text-embedding-3-small", api_key)
        # Keep the HTTP client so collections other than the default one can be reached later.
        self.chroma_client = chromadb.HttpClient(host='vice_chroma', port=8000)
        # self.client is the default collection, named by chroma_db_path.
        self.client = self.chroma_client.get_collection(
            name=self.db_path,
            embedding_function=chroma_embedder,
        )

        self.logger.info(f"ChromaManager initialized with database path: {self.db_path}")

    def get_or_create_collection(self, collection_name: str):
        """
        Gets or creates a Chroma collection with the given name.

        Args:
            collection_name (str): The name of the collection.

        Returns:
            chromadb.api.models.Collection.Collection: The Chroma collection object.
        """
        # self.client is a Collection and has no get_collection method, so the
        # lookup goes through the HTTP client instead.
        collection = self.chroma_client.get_or_create_collection(name=collection_name)
        self.logger.info(f"Collection '{collection_name}' ready.")
        return collection

    def add_documents(self, collection_name: str, documents: List[str], ids: List[str], embeddings: List[List[float]], metadatas: List[Dict[str, Any]] = None):
        """
        Adds documents to the specified Chroma collection.

        Args:
            collection_name (str): The name of the collection.
            documents (List[str]): A list of document texts to add.
            ids (List[str]): A list of unique IDs for the documents.
            embeddings (List[List[float]]): A list of embeddings for the documents.
            metadatas (List[Dict[str, Any]], optional): A list of metadata dictionaries for the documents. Defaults to None.
        """
        try:
            collection = self.get_or_create_collection(collection_name)
            collection.add(
                documents=documents,
                ids=ids,
                embeddings=embeddings,
                metadatas=metadatas if metadatas else None
            )
            self.logger.info(f"Added {len(documents)} documents to collection '{collection_name}'.")
        except Exception as e:
            self.logger.exception(f"Error adding documents to Chroma collection: {e}")

    def query_collection(self, query: str, n_results: int = 4) -> List[str]:
        """
        Queries the collection named in the config with a text query.

        Args:
            query (str): The query text; it is embedded by the collection's embedding function.
            n_results (int, optional): The number of results to return. Defaults to 4.

        Returns:
            List[str]: The document texts most similar to the query, or an empty list on error.
        """
        try:
            results = self.client.query(
                query_texts=[query],
                n_results=n_results
            )
            # 'documents' holds one list of texts per query text; only one query is sent.
            documents = (results.get('documents') or [[]])[0]
            self.logger.info(f"Queried collection '{self.db_path}' and retrieved {len(documents)} results.")
            self.logger.info(f"Results : '{documents}'.")
            return documents
        except Exception as e:
            self.logger.exception(f"Error querying Chroma collection: {e}")
            return []

Parameters

Name	Type	Default	Kind
`bases`	-	-	-
`config`	ConfigParser	-	positional or keyword

Parameter Details

config: A ConfigParser object containing database configuration settings. Must include a 'database' section with 'chroma_db_path' key (defaults to './chroma_db' if not provided). This path is used as the collection name when connecting to the ChromaDB instance.
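The fallback behaviour described above can be checked with the standard library alone. This is a small sketch, not part of the original file; `read_collection_name` is a hypothetical helper that repeats the same `config.get` call the constructor makes.

```python
from configparser import ConfigParser


def read_collection_name(config: ConfigParser) -> str:
    """Repeat the lookup used in ChromaManager.__init__."""
    return config.get('database', 'chroma_db_path', fallback='./chroma_db')


# Case 1: the key is set explicitly.
configured = ConfigParser()
configured.read_string("[database]\nchroma_db_path = my_collection\n")
print(read_collection_name(configured))  # my_collection

# Case 2: there is no 'database' section at all, so the fallback is returned.
empty = ConfigParser()
print(read_collection_name(empty))  # ./chroma_db
```

Because the value is passed on as a collection name, the path-like fallback './chroma_db' may be rejected by the server's collection-name validation. Setting 'chroma_db_path' explicitly is the safer choice.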

Return Value

The constructor returns a ChromaManager instance. The get_or_create_collection method returns a chromadb.api.models.Collection.Collection object. The add_documents method returns None (void). The query_collection method returns a List[str] containing document texts that are most similar to the query, or an empty list if an error occurs.

Class Interface

Methods

`__init__(self, config: ConfigParser)`

Purpose: Initializes the ChromaManager with database configuration, establishes connection to ChromaDB server, and retrieves the specified collection

Parameters:

config: ConfigParser object containing database settings, specifically 'chroma_db_path' under 'database' section

Returns: None (constructor)
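The constructor needs an OpenAI API key for the embedding function. One common way to keep the key out of source code is to read it from an environment variable. The sketch below assumes a variable named `OPENAI_API_KEY`; both that name and the `load_api_key` helper are illustrative and do not appear in the original file.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in the given environment variable."""
    api_key = os.environ.get(env_var, "").strip()
    if not api_key:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return api_key
```

Inside `__init__`, the key assignment could then become `api_key = load_api_key()`, so a missing key stops initialization before any request is made.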

`get_or_create_collection(self, collection_name: str) -> chromadb.api.models.Collection.Collection`

Purpose: Retrieves an existing Chroma collection by name, or creates it if it doesn't exist

Parameters:

collection_name: String name of the collection to get or create

Returns: A chromadb.api.models.Collection.Collection object representing the collection

`add_documents(self, collection_name: str, documents: List[str], ids: List[str], embeddings: List[List[float]], metadatas: List[Dict[str, Any]] = None)`

Purpose: Adds documents with their embeddings and metadata to a specified Chroma collection

Parameters:

collection_name: Name of the collection to add documents to
documents: List of document text strings to add
ids: List of unique string identifiers for each document (must match length of documents)
embeddings: List of embedding vectors (list of floats) for each document
metadatas: Optional list of metadata dictionaries for each document, defaults to None

Returns: None (void method, logs success or errors)
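Because add_documents catches every exception and only logs it, a caller cannot tell from the return value whether the insert succeeded. Checking the batch before the call is one way to catch the most common mistakes early. The following is a sketch under that assumption; `validate_batch` is a hypothetical helper, not part of ChromaManager.

```python
from typing import Any, Dict, List, Optional


def validate_batch(
    documents: List[str],
    ids: List[str],
    embeddings: List[List[float]],
    metadatas: Optional[List[Dict[str, Any]]] = None,
) -> None:
    """Raise ValueError if the batch is inconsistent; return None otherwise."""
    if not (len(documents) == len(ids) == len(embeddings)):
        raise ValueError("documents, ids and embeddings must have the same length")
    if metadatas is not None and len(metadatas) != len(documents):
        raise ValueError("metadatas must have one entry per document")
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique within the batch")
    # Every vector in one batch should have the same number of components.
    dimensions = {len(vector) for vector in embeddings}
    if len(dimensions) > 1:
        raise ValueError("all embeddings must have the same dimensionality")
```

Calling `validate_batch(documents, ids, embeddings, metadatas)` just before `add_documents(...)` turns these mistakes into exceptions the caller can see. It does not check ids already stored in the collection.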

`query_collection(self, query: str, n_results: int = 4) -> List[str]`

Purpose: Queries the collection specified in config with a text query and returns the most similar documents

Parameters:

query: Text query string to search for similar documents
n_results: Number of similar documents to return, defaults to 4

Returns: List of document text strings that are most similar to the query, or empty list on error
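The method sends a single query text and returns the first element of the 'documents' field, which is why the result is a flat list of strings. The sketch below shows that unwrapping on a hand-written dictionary shaped like a query response with one query text; the sample values are invented and `first_query_documents` is a hypothetical helper.

```python
from typing import Any, Dict, List


def first_query_documents(results: Dict[str, Any]) -> List[str]:
    """Return the documents for the first query text, or [] if there are none."""
    per_query = results.get('documents') or []
    return per_query[0] if per_query else []


# Stand-in for a response to one query text with two hits.
sample_results = {
    'ids': [['doc1', 'doc2']],
    'documents': [['This is document 1', 'This is document 2']],
    'distances': [[0.12, 0.34]],
}
print(first_query_documents(sample_results))
print(first_query_documents({'documents': []}))  # []
```

The outer list has one entry per query text, so indexing with `[0]` selects the hits for the only query that was sent.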

Attributes

Name	Type	Description	Scope
`logger`	logging.Logger	Logger instance for logging ChromaManager operations and errors	instance
`db_path`	str	Path/name of the database collection, retrieved from config or defaults to './chroma_db'	instance
`client`	chromadb.api.models.Collection.Collection	ChromaDB collection object representing the active collection for queries and operations	instance

Dependencies

chromadb
logging
typing
configparser
langchain_openai
tiktoken
openai

Required Imports

import chromadb
from chromadb import Documents
from chromadb import EmbeddingFunction
from chromadb import Embeddings
import logging
from typing import List, Dict, Any
from configparser import ConfigParser
from langchain_openai import ChatOpenAI
from langchain_openai import AzureChatOpenAI
import tiktoken
import openai

Usage Example

from configparser import ConfigParser
import logging

# Setup logging
logging.basicConfig(level=logging.INFO)

# Create configuration
config = ConfigParser()
config.add_section('database')
config.set('database', 'chroma_db_path', 'my_collection')

# Initialize ChromaManager
chroma_manager = ChromaManager(config)

# Add documents to a collection.
# The 3-dimensional vectors below are placeholders; real embeddings must have
# the dimensionality used by the collection's embedding function.
documents = ['This is document 1', 'This is document 2']
ids = ['doc1', 'doc2']
embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
metadatas = [{'source': 'file1.txt'}, {'source': 'file2.txt'}]

chroma_manager.add_documents(
    collection_name='my_docs',
    documents=documents,
    ids=ids,
    embeddings=embeddings,
    metadatas=metadatas
)

# Query the collection named in the config ('my_collection' here).
# query_collection does not search 'my_docs'; it always uses the config collection.
query_text = 'search query'
results = chroma_manager.query_collection(query=query_text, n_results=4)
print(f'Found {len(results)} similar documents')
for doc in results:
    print(doc)

Best Practices

Replace the hardcoded API key with environment variables or secure configuration management
Ensure ChromaDB server is running and accessible at 'vice_chroma:8000' before instantiation
The MyEmbeddingFunction class must be defined before using ChromaManager
Always provide unique IDs when adding documents to avoid conflicts
Handle exceptions when calling methods as they may fail due to network or database issues
The class maintains a persistent connection to a single collection specified in the config
Use get_or_create_collection for collection management to avoid errors
Embeddings must match the dimensionality expected by the embedding function
The query_collection method queries the collection specified in the config, not a parameter
Consider implementing connection pooling or retry logic for production use
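The retry suggestion above can be sketched with a small standard-library helper that retries a callable with exponential backoff. This is an illustration only: `call_with_retry` is a hypothetical name, and the default exception types are an assumption, since which exceptions the Chroma client raises on a network failure depends on the installed version.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation(), retrying on the given exceptions with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error.
            # Wait base_delay, then 2x, then 4x, and so on.
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

A call such as `call_with_retry(lambda: ChromaManager(config))` would retry the initial connection. Note that add_documents and query_collection catch exceptions internally, so wrapping them in this helper would not trigger any retries.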

Similar Components

AI-powered semantic similarity - components with related functionality:

class MyEmbeddingFunction 68.4% similar

Custom embedding function class that integrates OpenAI's embedding API with Chroma DB for generating vector embeddings from text documents.
From: /tf/active/vicechatdev/project_victoria_disclosure_generator.py

class DocumentIndexer 67.2% similar

A class for indexing documents into ChromaDB with support for multiple file formats (PDF, Word, PowerPoint, Excel, text files), smart incremental indexing, and document chunk management.
From: /tf/active/vicechatdev/docchat/document_indexer.py

class MyEmbeddingFunction_v1 63.8% similar

A custom embedding function class that generates embeddings for documents using OpenAI's API, with built-in text summarization for long documents and token management.
From: /tf/active/vicechatdev/OneCo_hybrid_RAG copy.py

class DocChatEmbeddingFunction 63.0% similar

A custom ChromaDB embedding function that generates OpenAI embeddings with automatic text summarization for documents exceeding token limits.
From: /tf/active/vicechatdev/docchat/document_indexer.py

function save_data_to_chromadb 61.8% similar

Saves a list of document dictionaries to a ChromaDB vector database collection, optionally including embeddings and metadata.
From: /tf/active/vicechatdev/chromadb-cleanup/main copy.py
