class ChromaManager
ChromaManager is a class that manages interactions with a Chroma vector database, providing methods to create collections, add documents with embeddings, and query for similar documents.
/tf/active/vicechatdev/QA_updater/knowledge_store/chroma_manager.py
80 - 168
moderate
Purpose
This class serves as an abstraction layer for working with ChromaDB, a vector database used for semantic search and retrieval. It handles database initialization, collection management, document insertion with embeddings, and similarity-based querying. The class is designed to work with a remote ChromaDB instance via HTTP and uses custom embedding functions for document vectorization. It's particularly useful for building RAG (Retrieval-Augmented Generation) systems or any application requiring semantic document search.
Source Code
class ChromaManager:
    """Manages interactions with the Chroma vector database."""

    def __init__(self, config: ConfigParser):
        """
        Initializes the ChromaManager with the database path specified in the config.

        Args:
            config (ConfigParser): Configuration object containing database settings.
        """
        self.logger = logging.getLogger(__name__)
        self.db_path = config.get('database', 'chroma_db_path', fallback='./chroma_db')
        api_key = "<YOUR_API_KEY>"  # Replace with your actual API key
        chroma_embedder = MyEmbeddingFunction("gpt-4o-mini", "text-embedding-3-small", api_key)
        chroma_client = chromadb.HttpClient(host='vice_chroma', port=8000)
        # Note: self.client holds the collection named by db_path, not the HTTP client.
        self.client = chroma_client.get_collection(
            name=self.db_path,
            embedding_function=chroma_embedder,
        )
        self.logger.info(f"ChromaManager initialized with database path: {self.db_path}")

    def get_or_create_collection(self, collection_name: str):
        """
        Gets or creates a Chroma collection with the given name.

        Args:
            collection_name (str): The name of the collection.

        Returns:
            chromadb.api.models.Collection.Collection: The Chroma collection object.
        """
        try:
            collection = self.client.get_collection(name=collection_name)
            self.logger.info(f"Collection '{collection_name}' found.")
            return collection
        except ValueError:
            self.logger.info(f"Collection '{collection_name}' not found. Creating...")
            collection = self.client.create_collection(name=collection_name)
            return collection

    def add_documents(self, collection_name: str, documents: List[str], ids: List[str], embeddings: List[List[float]], metadatas: List[Dict[str, Any]] = None):
        """
        Adds documents to the specified Chroma collection.

        Args:
            collection_name (str): The name of the collection.
            documents (List[str]): A list of document texts to add.
            ids (List[str]): A list of unique IDs for the documents.
            embeddings (List[List[float]]): A list of embeddings for the documents.
            metadatas (List[Dict[str, Any]], optional): A list of metadata dictionaries for the documents. Defaults to None.
        """
        try:
            collection = self.get_or_create_collection(collection_name)
            collection.add(
                documents=documents,
                ids=ids,
                embeddings=embeddings,
                metadatas=metadatas if metadatas else None
            )
            self.logger.info(f"Added {len(documents)} documents to collection '{collection_name}'.")
        except Exception as e:
            self.logger.exception(f"Error adding documents to Chroma collection: {e}")

    def query_collection(self, query, n_results: int = 4) -> List[str]:
        """
        Queries the collection opened in __init__ with the given query text.

        Args:
            query (str): The query text to search for.
            n_results (int, optional): The number of results to return. Defaults to 4.

        Returns:
            List[str]: A list of document texts that are the most similar to the query text.
        """
        try:
            #collection = self.get_or_create_collection(collection_name)
            results = self.client.query(
                query_texts=[query],
                n_results=n_results
            )
            self.logger.info(f"Queried collection sp_archives and retrieved {len(results.get('documents', []))} results.")
            self.logger.info(f"Results : '{results.get('documents', [])[0]}'.")
            return results.get('documents', [])[0]
        except Exception as e:
            self.logger.exception(f"Error querying Chroma collection: {e}")
            return []
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| bases | - | - | |
Parameter Details
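config: A ConfigParser object containing database configuration settings. The constructor reads the 'chroma_db_path' key from the 'database' section and falls back to './chroma_db' if the key or the whole section is missing. The resulting value is passed as the collection name when connecting to the ChromaDB instance, so despite the key's name it should hold a collection name rather than a filesystem path.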
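The lookup described above can be tried with the standard library alone. This is a minimal sketch; the helper name `resolve_collection_name` is hypothetical and is not part of ChromaManager, it only repeats the `config.get` call from the constructor.

```python
from configparser import ConfigParser


def resolve_collection_name(config: ConfigParser) -> str:
    """Repeat the lookup used in ChromaManager.__init__ (hypothetical helper)."""
    return config.get('database', 'chroma_db_path', fallback='./chroma_db')


# No [database] section at all: the fallback value is returned.
empty = ConfigParser()
name_missing = resolve_collection_name(empty)

# Explicit value: the configured name is returned.
configured = ConfigParser()
configured.read_dict({'database': {'chroma_db_path': 'my_collection'}})
name_configured = resolve_collection_name(configured)

print(name_missing)
print(name_configured)
```

Setting the key explicitly, as in the second case, avoids depending on the path-like default.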
Return Value
The constructor returns a ChromaManager instance. The get_or_create_collection method returns a chromadb.api.models.Collection.Collection object. The add_documents method returns None (void). The query_collection method returns a List[str] containing document texts that are most similar to the query, or an empty list if an error occurs.
Class Interface
Methods
__init__(self, config: ConfigParser)
Purpose: Initializes the ChromaManager with database configuration, establishes connection to ChromaDB server, and retrieves the specified collection
Parameters:
config: ConfigParser object containing database settings, specifically 'chroma_db_path' under 'database' section
Returns: None (constructor)
get_or_create_collection(self, collection_name: str) -> chromadb.api.models.Collection.Collection
Purpose: Retrieves an existing Chroma collection by name, or creates it if it doesn't exist
Parameters:
collection_name: String name of the collection to get or create
Returns: A chromadb.api.models.Collection.Collection object representing the collection
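The get-or-create flow can be sketched without a running server. `InMemoryClient` below is a hypothetical stand-in written for this example; it is not the chromadb API, and it raises `ValueError` for a missing collection only because that is the exception the method's source catches.

```python
import logging
from typing import Dict, List

logger = logging.getLogger(__name__)


class InMemoryClient:
    """Hypothetical stand-in for a database client (not the chromadb API)."""

    def __init__(self) -> None:
        self._collections: Dict[str, List[str]] = {}

    def get_collection(self, name: str) -> List[str]:
        # Mimic a lookup that fails loudly when the collection is absent.
        if name not in self._collections:
            raise ValueError(f"Collection {name} does not exist.")
        return self._collections[name]

    def create_collection(self, name: str) -> List[str]:
        self._collections[name] = []
        return self._collections[name]


def get_or_create(client: InMemoryClient, name: str) -> List[str]:
    """Return the named collection, creating it on first use."""
    try:
        return client.get_collection(name=name)
    except ValueError:
        logger.info("Collection '%s' not found. Creating...", name)
        return client.create_collection(name=name)


client = InMemoryClient()
first = get_or_create(client, "docs")   # not found, so it is created
second = get_or_create(client, "docs")  # found on the second call
```

The exception raised for a missing collection may differ between chromadb versions, so the `except` clause is worth checking against the installed release. The chromadb client object also provides its own `get_or_create_collection` method, which may be simpler where it is available.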
add_documents(self, collection_name: str, documents: List[str], ids: List[str], embeddings: List[List[float]], metadatas: List[Dict[str, Any]] = None)
Purpose: Adds documents with their embeddings and metadata to a specified Chroma collection
Parameters:
collection_name: Name of the collection to add documents to
documents: List of document text strings to add
ids: List of unique string identifiers for each document (must match length of documents)
embeddings: List of embedding vectors (list of floats) for each document
metadatas: Optional list of metadata dictionaries for each document, defaults to None
Returns: None (void method, logs success or errors)
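Because add_documents logs failures instead of raising them, it can help to check the parallel lists before calling it. The following is a sketch only; `validate_batch` is a hypothetical helper, not part of ChromaManager, and the specific checks are assumptions drawn from the parameter descriptions above.

```python
from typing import Any, Dict, List, Optional


def validate_batch(
    documents: List[str],
    ids: List[str],
    embeddings: List[List[float]],
    metadatas: Optional[List[Dict[str, Any]]] = None,
) -> None:
    """Raise ValueError if the parallel lists are inconsistent (hypothetical helper)."""
    n = len(documents)
    if len(ids) != n or len(embeddings) != n:
        raise ValueError("documents, ids and embeddings must have the same length")
    if metadatas is not None and len(metadatas) != n:
        raise ValueError("metadatas must have one entry per document")
    if len(set(ids)) != n:
        raise ValueError("ids must be unique within the batch")
    # All vectors in one batch should share a single dimensionality.
    dims = {len(vec) for vec in embeddings}
    if len(dims) > 1:
        raise ValueError(f"embeddings have mixed dimensions: {sorted(dims)}")


def is_rejected(*args: Any) -> bool:
    """Return True if validate_batch raises ValueError for the given arguments."""
    try:
        validate_batch(*args)
    except ValueError:
        return True
    return False


vectors = [[0.1, 0.2], [0.3, 0.4]]
valid_rejected = is_rejected(["a", "b"], ["id1", "id2"], vectors)
duplicate_rejected = is_rejected(["a", "b"], ["id1", "id1"], vectors)
length_mismatch_rejected = is_rejected(["a"], ["id1", "id2"], [[0.1, 0.2]])
```

Calling `validate_batch` first turns a silently logged failure into an exception the caller can handle.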
query_collection(self, query: str, n_results: int = 4) -> List[str]
Purpose: Queries the collection specified in config with a text query and returns the most similar documents
Parameters:
query: Text query string to search for similar documents
n_results: Number of similar documents to return, defaults to 4
Returns: List of document text strings that are most similar to the query, or empty list on error
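The source code indexes the query result with `[0]` because results are grouped per query text and the method sends exactly one. The sketch below uses a hand-built dict that imitates that nested shape; the sample values and the helper name `first_query_documents` are assumptions made for illustration.

```python
from typing import Any, Dict, List


def first_query_documents(results: Dict[str, Any]) -> List[str]:
    """Return the documents for the first query text, or [] if none are present."""
    per_query = results.get('documents') or []
    return list(per_query[0]) if per_query else []


# Hand-built dict imitating a result for one query text with two hits.
sample = {
    'ids': [['doc1', 'doc2']],
    'documents': [['This is document 1', 'This is document 2']],
}

docs = first_query_documents(sample)
query_count = len(sample['documents'])  # outer list: one entry per query text
hit_count = len(docs)                   # inner list: hits for the first query
```

Under this shape, the count logged by query_collection is the length of the outer list (the number of query texts), not the number of documents retrieved.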
Attributes
| Name | Type | Description | Scope |
|---|---|---|---|
| logger | logging.Logger | Logger instance for logging ChromaManager operations and errors | instance |
| db_path | str | Path/name of the database collection, retrieved from config or defaults to './chroma_db' | instance |
| client | chromadb.api.models.Collection.Collection | ChromaDB collection object representing the active collection for queries and operations | instance |
Dependencies
- chromadb
- logging
- typing
- configparser
- langchain_openai
- tiktoken
- openai
Required Imports
import chromadb
from chromadb import Documents
from chromadb import EmbeddingFunction
from chromadb import Embeddings
import logging
from typing import List, Dict, Any
from configparser import ConfigParser
from langchain_openai import ChatOpenAI
from langchain_openai import AzureChatOpenAI
import tiktoken
import openai
Usage Example
from configparser import ConfigParser
import logging
# Setup logging
logging.basicConfig(level=logging.INFO)
# Create configuration
config = ConfigParser()
config.add_section('database')
config.set('database', 'chroma_db_path', 'my_collection')
# Initialize ChromaManager
chroma_manager = ChromaManager(config)
# Add documents to a collection
documents = ['This is document 1', 'This is document 2']
ids = ['doc1', 'doc2']
embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
metadatas = [{'source': 'file1.txt'}, {'source': 'file2.txt'}]
chroma_manager.add_documents(
    collection_name='my_docs',
    documents=documents,
    ids=ids,
    embeddings=embeddings,
    metadatas=metadatas
)
# Query the collection named in the config ('my_collection'), not 'my_docs'
query_text = 'search query'
results = chroma_manager.query_collection(query=query_text, n_results=4)
print(f'Found {len(results)} similar documents')
for doc in results:
    print(doc)
Best Practices
- Replace the hardcoded API key with environment variables or secure configuration management
- Ensure ChromaDB server is running and accessible at 'vice_chroma:8000' before instantiation
- The MyEmbeddingFunction class must be defined before using ChromaManager
- Always provide unique IDs when adding documents to avoid conflicts
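- Handle exceptions around the constructor and get_or_create_collection, which can raise on network or database errors; add_documents only logs its errors and query_collection returns an empty list on failure, so check logs and return values for those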
- The class maintains a persistent connection to a single collection specified in the config
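- get_or_create_collection calls get_collection on self.client, which the constructor sets to a collection rather than the HTTP client; verify this method against your chromadb version before relying on it for collection management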
- Embeddings must match the dimensionality expected by the embedding function
- The query_collection method queries the collection specified in the config, not a parameter
- Consider implementing connection pooling or retry logic for production use
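Two of the practices above, reading the API key from the environment and retrying on transient failures, can be sketched with the standard library. The environment variable name `OPENAI_API_KEY`, the helper names, and the backoff parameters are assumptions chosen for illustration, not part of the documented class.

```python
import os
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def load_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Read the key from the environment; the variable name is an assumption."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Environment variable {var_name} is not set")
    return key


def with_retries(func: Callable[[], T], attempts: int = 3, base_delay: float = 0.5) -> T:
    """Call func, retrying on any exception with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("unreachable")  # the loop always returns or raises


# Demonstration with a function that fails twice before succeeding.
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("server not ready")
    return "ok"


result = with_retries(flaky, attempts=3, base_delay=0.01)
```

A wrapper like this only helps around calls that raise, for example `with_retries(lambda: ChromaManager(config))`; query_collection swallows its exceptions and returns an empty list, so retrying it would need a check on the return value instead.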
Similar Components
AI-powered semantic similarity - components with related functionality:
- class MyEmbeddingFunction 68.4% similar
- class DocumentIndexer 67.2% similar
- class MyEmbeddingFunction_v1 63.8% similar
- class DocChatEmbeddingFunction 63.0% similar
- function save_data_to_chromadb 61.8% similar