🔍 Code Extractor

class MyEmbeddingFunction

Maturity: 51

Custom embedding function class that integrates OpenAI's embedding API with Chroma DB for generating vector embeddings from text documents.

File: /tf/active/vicechatdev/project_victoria_disclosure_generator.py
Lines: 819-856
Complexity: moderate

Purpose

This class serves as an adapter between Chroma DB's EmbeddingFunction interface and OpenAI's embedding API. It enables Chroma DB to use OpenAI's embedding models (like text-embedding-3-small) for converting text documents into vector representations. The class handles API authentication, embedding generation, and error fallback scenarios. It's designed to be used as a custom embedding function when initializing Chroma collections.

Source Code

class MyEmbeddingFunction(EmbeddingFunction):
    """
    Custom embedding function for Chroma DB using OpenAI embeddings.
    """
    
    def __init__(self, model_name: str, embedding_model: str, api_key: str):
        self.model_name = model_name
        self.embedding_model = embedding_model
        self.api_key = api_key
        
        # Set up OpenAI client
        os.environ["OPENAI_API_KEY"] = api_key
        from openai import OpenAI
        self.client = OpenAI(api_key=api_key)
    
    def __call__(self, input: Documents) -> Embeddings:
        """
        Generate embeddings for input documents.
        
        Args:
            input: List of document texts
            
        Returns:
            List of embedding vectors
        """
        try:
            response = self.client.embeddings.create(
                input=input,
                model=self.embedding_model
            )
            
            embeddings = [data.embedding for data in response.data]
            return embeddings
            
        except Exception as e:
            print(f"Error generating embeddings: {e}")
            # Return zero embeddings as fallback
            return [[0.0] * 1536 for _ in input]  # 1536 is dimension for text-embedding-3-small

Parameters

Name   Type               Default  Kind
bases  EmbeddingFunction  -        -

Parameter Details

model_name: Name identifier for the model being used. This parameter is stored but not actively used in the current implementation; it appears to be intended for tracking or logging purposes.

embedding_model: The specific OpenAI embedding model to use (e.g., 'text-embedding-3-small', 'text-embedding-ada-002'). This determines the embedding dimensions and quality.

api_key: OpenAI API key for authentication. This key is used to initialize the OpenAI client and is also set as an environment variable.
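Because the embedding dimension depends on the chosen model, a small lookup helper can make the error fallback model-aware instead of hardcoding 1536. This helper is not part of the extracted class; it is a sketch using the published output dimensions of OpenAI's current embedding models.

```python
# Published output dimensions for common OpenAI embedding models.
EMBEDDING_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
}

def fallback_dimension(embedding_model: str, default: int = 1536) -> int:
    """Return the expected vector size for a model, defaulting to 1536."""
    return EMBEDDING_DIMENSIONS.get(embedding_model, default)
```

Inside __call__, the fallback line could then read `[[0.0] * fallback_dimension(self.embedding_model) for _ in input]`, so switching to text-embedding-3-large would not silently produce vectors of the wrong size.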

Return Value

Instantiation returns a MyEmbeddingFunction object that can be called as a function. When called (via __call__), it returns a list of embedding vectors (Embeddings type), where each embedding is a list of floats representing the vector for the corresponding input document. On error, returns zero-filled vectors with dimension 1536.
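Since errors produce zero-filled vectors rather than raising, callers may want to detect that fallback before storing results. A minimal check (a hypothetical helper, not part of the source) could look like this:

```python
def is_zero_embedding(vector) -> bool:
    """True if every component is exactly 0.0 (the class's error fallback)."""
    return all(v == 0.0 for v in vector)
```

Filtering out such vectors before inserting into a collection prevents degenerate zero vectors from polluting similarity search results.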

Class Interface

Methods

__init__(self, model_name: str, embedding_model: str, api_key: str)

Purpose: Initializes the embedding function with OpenAI credentials and model configuration

Parameters:

  • model_name: Name identifier for the model (stored but not actively used)
  • embedding_model: OpenAI embedding model name (e.g., 'text-embedding-3-small')
  • api_key: OpenAI API key for authentication

Returns: None (constructor)

__call__(self, input: Documents) -> Embeddings

Purpose: Generates embedding vectors for the provided input documents using OpenAI's API

Parameters:

  • input: List of document texts (strings) to generate embeddings for

Returns: List of embedding vectors, where each vector is a list of floats. Returns zero-filled vectors (dimension 1536) if an error occurs.

Attributes

Name             Type    Description                                                     Scope
model_name       str     Stores the model name identifier passed during initialization  instance
embedding_model  str     The OpenAI embedding model name used to generate embeddings    instance
api_key          str     The OpenAI API key used for authentication                     instance
client           OpenAI  OpenAI client instance used to make embedding API calls        instance

Dependencies

  • os
  • openai
  • chromadb

Required Imports

import os
from chromadb import Documents, EmbeddingFunction, Embeddings

Conditional/Optional Imports

These imports are only needed under specific conditions:

from openai import OpenAI

Condition: imported inside the __init__ method, so the openai package is only needed once the class is instantiated

Status: required (conditional)

Usage Example

# Initialize the embedding function
api_key = 'your-openai-api-key'
embedding_fn = MyEmbeddingFunction(
    model_name='my-model',
    embedding_model='text-embedding-3-small',
    api_key=api_key
)

# Use with Chroma DB
import chromadb
client = chromadb.Client()
collection = client.create_collection(
    name='my_collection',
    embedding_function=embedding_fn
)

# Or call directly to generate embeddings
documents = ['Hello world', 'Another document']
embeddings = embedding_fn(documents)
print(f'Generated {len(embeddings)} embeddings')
print(f'Embedding dimension: {len(embeddings[0])}')

Best Practices

  • Always provide a valid OpenAI API key to avoid authentication errors
  • The class modifies the global environment variable OPENAI_API_KEY, which may affect other parts of your application
  • Error handling returns zero-filled embeddings (1536 dimensions) as fallback - ensure your application can handle these gracefully
  • The hardcoded dimension of 1536 in the error handler is specific to text-embedding-3-small; if using different models, this may need adjustment
  • The model_name parameter is stored but unused - consider removing it or implementing logging/tracking functionality
  • This class is designed to be instantiated once and reused for multiple embedding operations
  • The __call__ method makes instances callable, allowing them to be used directly as functions
  • Consider implementing retry logic for transient API failures instead of immediately falling back to zero embeddings
  • The class creates a new OpenAI client on each instantiation; avoid creating multiple instances unnecessarily

Similar Components

AI-powered semantic similarity - components with related functionality:

  • class MyEmbeddingFunction_v1 87.0% similar

    A custom embedding function class that generates embeddings for documents using OpenAI's API, with built-in text summarization for long documents and token management.

    From: /tf/active/vicechatdev/OneCo_hybrid_RAG copy.py
  • class DocChatEmbeddingFunction 82.8% similar

    A custom ChromaDB embedding function that generates OpenAI embeddings with automatic text summarization for documents exceeding token limits.

    From: /tf/active/vicechatdev/docchat/document_indexer.py
  • class MyEmbeddingFunction_v2 80.5% similar

    A custom embedding function class that generates embeddings for text documents using OpenAI's embedding models, with automatic text summarization and token management for large documents.

    From: /tf/active/vicechatdev/offline_docstore_multi_vice.py
  • class MyEmbeddingFunction_v3 78.5% similar

    A custom embedding function class that generates embeddings for text documents using OpenAI's embedding models, with automatic text summarization and token limit handling for large documents.

    From: /tf/active/vicechatdev/offline_docstore_multi.py
  • class DocumentIndexer 58.3% similar

    A class for indexing documents into ChromaDB with support for multiple file formats (PDF, Word, PowerPoint, Excel, text files), smart incremental indexing, and document chunk management.

    From: /tf/active/vicechatdev/docchat/document_indexer.py