
class MyEmbeddingFunction_v1

Maturity: 44

A custom embedding function class that generates embeddings for documents using OpenAI's API, with built-in text summarization for long documents and token management.

File: /tf/active/vicechatdev/OneCo_hybrid_RAG copy.py
Lines: 589 - 683
Complexity: moderate

Purpose

This class extends EmbeddingFunction to provide a complete embedding solution for ChromaDB or similar vector databases. It sanitizes text to prevent encoding errors, automatically summarizes documents that exceed the embedding model's token limit, and generates embeddings using OpenAI's embedding models. Built against ChromaDB's embedding function interface, it manages the entire pipeline from raw text to embeddings: preprocessing, summarization, and token counting.
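
ChromaDB treats any object implementing this interface as an embedding function: it calls the instance with a list of strings and expects one vector back per string. A minimal sketch of that contract, assuming an already-constructed embedding_fn:

# The EmbeddingFunction protocol, as a vector store exercises it
texts = ['first document', 'second document']
vectors = embedding_fn(texts)      # dispatches to __call__
assert len(vectors) == len(texts)  # one embedding per input document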

Source Code

class MyEmbeddingFunction(EmbeddingFunction):

    def __init__(self, model_name: str, embed_model_name: str, api_key: str):
        """
        Initialize the embedding function with specific models and API key.
        
        Args:
            model_name: Model name for the LLM summarizer
            embed_model_name: Model name for embeddings
            api_key: OpenAI API key
        """
        self.model_name = model_name
        self.api_key = api_key
        
        # Set up the OpenAI client directly 
        import openai
        self.client = openai.OpenAI(api_key=api_key)
        
        # Set up the LLM for summarization
        self.llm = ChatOpenAI(
            model_name=model_name, 
            temperature=0,
            api_key=api_key
        )
        
        self.embed_model_name = embed_model_name
        
        # Ensure we're using standard OpenAI (not Azure)
        # This is critical for avoiding the ambiguous API error
        import os
        os.environ["OPENAI_API_TYPE"] = "openai"
        os.environ["OPENAI_API_KEY"] = api_key

    def summarize_text(self, text, max_tokens_summary=8192):
        """
        Summarize the input text using the configured LLM summarizer.
        The summary will be limited to under max_tokens_summary tokens.
        """
        # Sanitize first so the prompt contains no problematic characters
        text = self.sanitize_text(text)

        # Prepare the summarization prompt
        prompt = (
            f"Please summarize the following text such that the summary is under {max_tokens_summary} tokens:\n\n{text}"
        )
        
        # Invoke the configured chat model via LangChain
        response = self.llm.invoke(prompt)
        
        summary = response.content.strip()
        return summary
    
    def sanitize_text(self, text):
        """
        Sanitize text by encoding to UTF-8 with error replacement and decoding back.
        This replaces any characters that might cause ASCII encoding errors.
        """
        return text.encode("utf-8", errors="replace").decode("utf-8")
    
    def count_tokens(self, text):
        # cl100k_base is the tokenizer used by GPT-4-era chat and embedding models
        encoding = tiktoken.get_encoding("cl100k_base")
        return len(encoding.encode(text))

    def __call__(self, input: Documents) -> Embeddings:
        """
        Generate embeddings for the input documents.
        
        Args:
            input: List of document strings to embed
            
        Returns:
            List of embeddings for each document
        """
        # Generate an embedding for each document
        embeddings = []
        
        for content in input:
            # Handle very long content
            if len(content) > 1000000:
                content = content[:1000000]
                while self.count_tokens(content) > 110000:
                    content = content[:-1000]
    
            # Summarize content that exceeds the embedding model's token limit
            if self.count_tokens(content) > 8192:
                content = self.summarize_text(content)
    
            # Use the direct client instead of the module-level API
            response = self.client.embeddings.create(
                model=self.embed_model_name,
                input=content,
            )
            embedding = response.data[0].embedding
            embeddings.append(embedding)
                    
        return embeddings
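
The loop above issues one embeddings.create request per document. The OpenAI embeddings endpoint also accepts a list of inputs per request, so a batched variant can cut round trips. A minimal sketch (the embed_batched name and batch size of 100 are illustrative assumptions, and the per-document truncation and summarization steps are omitted for brevity):

    def embed_batched(self, input, batch_size=100):
        # Hypothetical helper, not part of the original class: sends up to
        # batch_size documents per API call instead of one at a time
        embeddings = []
        for start in range(0, len(input), batch_size):
            batch = [self.sanitize_text(doc) for doc in input[start:start + batch_size]]
            response = self.client.embeddings.create(
                model=self.embed_model_name,
                input=batch,
            )
            # response.data preserves the order of the inputs
            embeddings.extend(item.embedding for item in response.data)
        return embeddings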

Parameters

Name    Type                Default   Kind
bases   EmbeddingFunction   -         -

Parameter Details

model_name: The name of the OpenAI language model to use for text summarization (e.g., 'gpt-4o-mini', 'gpt-4'). This model is used when documents exceed the embedding model's token limit.

embed_model_name: The name of the OpenAI embedding model to use for generating embeddings (e.g., 'text-embedding-ada-002', 'text-embedding-3-small'). This model converts text into vector representations.

api_key: Your OpenAI API key for authentication. This key is used for both the summarization LLM and the embedding model.

Return Value

Instantiating the class yields a MyEmbeddingFunction instance (the __init__ method itself returns None). The __call__ method returns a list of embeddings (Embeddings type), where each embedding is a list of floats representing the vector for the corresponding input document. The summarize_text method returns a string containing the summarized text. The count_tokens method returns an integer representing the number of tokens in the input text.

Class Interface

Methods

__init__(self, model_name: str, embed_model_name: str, api_key: str)

Purpose: Initialize the embedding function with OpenAI models and API credentials, setting up the LLM for summarization and the embedding client

Parameters:

  • model_name: Name of the OpenAI model for text summarization
  • embed_model_name: Name of the OpenAI embedding model
  • api_key: OpenAI API key for authentication

Returns: None (constructor)

summarize_text(self, text, max_tokens_summary=8192)

Purpose: Summarize long text using the configured LLM to reduce it to under the specified token limit

Parameters:

  • text: The input text to summarize
  • max_tokens_summary: Maximum number of tokens for the summary (default: 8192)

Returns: A string containing the summarized text

sanitize_text(self, text)

Purpose: Clean text by encoding to UTF-8 with error replacement to handle problematic characters

Parameters:

  • text: The input text to sanitize

Returns: A sanitized string with problematic characters replaced
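
A quick illustration of the round trip: a lone UTF-16 surrogate cannot be encoded as UTF-8, so with errors='replace' it becomes '?' instead of raising UnicodeEncodeError:

text = 'valid text \ud800 with a lone surrogate'
clean = text.encode('utf-8', errors='replace').decode('utf-8')
print(clean)  # valid text ? with a lone surrogate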

count_tokens(self, text)

Purpose: Count the number of tokens in the input text using the cl100k_base encoding

Parameters:

  • text: The input text to count tokens for

Returns: An integer representing the number of tokens
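
For reference, the same count can be reproduced directly with tiktoken:

import tiktoken

encoding = tiktoken.get_encoding('cl100k_base')
tokens = encoding.encode('Some text to count')
print(len(tokens))  # number of cl100k_base tokens in the string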

__call__(self, input: Documents) -> Embeddings

Purpose: Generate embeddings for a list of documents, automatically handling long documents through summarization and truncation

Parameters:

  • input: A list of document strings (Documents type) to generate embeddings for

Returns: A list of embeddings (Embeddings type), where each embedding is a list of floats
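
For example, a 2,000,000-character document is first cut to 1,000,000 characters, trimmed in 1,000-character steps until it falls under 110,000 tokens, and then, if it still exceeds 8,192 tokens, summarized before the embedding request is made.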

Attributes

Name               Type            Description                                                     Scope
model_name         str             Name of the OpenAI model used for text summarization           instance
api_key            str             OpenAI API key used for authentication                         instance
client             openai.OpenAI   OpenAI client instance used to make embedding API calls        instance
llm                ChatOpenAI      LangChain ChatOpenAI instance used for text summarization      instance
embed_model_name   str             Name of the OpenAI embedding model used to generate vectors   instance

Dependencies

  • openai
  • langchain_openai
  • tiktoken
  • chromadb

Required Imports

import openai
from langchain_openai import ChatOpenAI
import tiktoken
from chromadb import Documents, EmbeddingFunction, Embeddings

Conditional/Optional Imports

These imports are only needed under specific conditions:

import os

Condition: imported inside __init__ to set the OPENAI_API_TYPE and OPENAI_API_KEY environment variables

Status: required (conditional)

Usage Example

# Instantiate the embedding function
api_key = 'your-openai-api-key'
embedding_fn = MyEmbeddingFunction(
    model_name='gpt-4o-mini',
    embed_model_name='text-embedding-ada-002',
    api_key=api_key
)

# Generate embeddings for documents
documents = ['This is a short document.', 'This is another document with more content.']
embeddings = embedding_fn(documents)

# Use with ChromaDB
import chromadb
client = chromadb.Client()
collection = client.create_collection(
    name='my_collection',
    embedding_function=embedding_fn
)
collection.add(
    documents=documents,
    ids=['doc1', 'doc2']
)

# Manually summarize long text
long_text = 'Very long document content...'
summary = embedding_fn.summarize_text(long_text, max_tokens_summary=4096)

# Count tokens in text
token_count = embedding_fn.count_tokens('Some text to count')
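
The chromadb.Client() above is in-memory. If the collection should survive restarts, chromadb (0.4+) also offers a persistent client; a minimal sketch, with the './chroma_db' path as an illustrative assumption:

import chromadb

client = chromadb.PersistentClient(path='./chroma_db')
collection = client.get_or_create_collection(
    name='my_collection',
    embedding_function=embedding_fn
)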

Best Practices

  • Always provide a valid OpenAI API key during instantiation to avoid authentication errors
  • The class automatically handles long documents by summarizing them, but be aware this adds latency and API costs
  • Documents longer than 1,000,000 characters are truncated to that length, then trimmed in 1,000-character steps until they fall under 110,000 tokens
  • Documents exceeding 8,192 tokens are automatically summarized before embedding
  • The class sets process-wide environment variables (OPENAI_API_TYPE, OPENAI_API_KEY), which may affect other parts of your application; see the sketch after this list
  • Use the same instance for multiple embedding operations to avoid recreating the OpenAI client
  • The summarize_text method can be called independently for preprocessing long texts
  • Token counting uses the 'cl100k_base' encoding, which is appropriate for GPT-4 and newer models
  • Text sanitization is automatically applied during summarization to prevent encoding errors
  • The class is designed to be used as a ChromaDB embedding function via the __call__ method
  • Be mindful of API rate limits and costs when processing large batches of documents
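
Because __init__ overwrites OPENAI_API_TYPE and OPENAI_API_KEY for the whole process, callers that also use Azure OpenAI elsewhere may want to snapshot and restore those variables around instantiation. A minimal sketch of that pattern (a suggestion, not part of the original class; since both the client and the LLM receive the key explicitly, restoring the variables afterwards should not break the instance, but verify against your LangChain version):

import os

# Snapshot the variables the constructor overwrites
saved = {k: os.environ.get(k) for k in ('OPENAI_API_TYPE', 'OPENAI_API_KEY')}

embedding_fn = MyEmbeddingFunction(
    model_name='gpt-4o-mini',
    embed_model_name='text-embedding-ada-002',
    api_key=api_key,
)

# Restore the previous values, dropping keys that were unset before
for key, value in saved.items():
    if value is None:
        os.environ.pop(key, None)
    else:
        os.environ[key] = value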

Similar Components

AI-powered semantic similarity - components with related functionality:

  • class MyEmbeddingFunction_v2 93.9% similar

    A custom embedding function class that generates embeddings for text documents using OpenAI's embedding models, with automatic text summarization and token management for large documents.

    From: /tf/active/vicechatdev/offline_docstore_multi_vice.py
  • class DocChatEmbeddingFunction 93.1% similar

    A custom ChromaDB embedding function that generates OpenAI embeddings with automatic text summarization for documents exceeding token limits.

    From: /tf/active/vicechatdev/docchat/document_indexer.py
  • class MyEmbeddingFunction_v3 92.5% similar

    A custom embedding function class that generates embeddings for text documents using OpenAI's embedding models, with automatic text summarization and token limit handling for large documents.

    From: /tf/active/vicechatdev/offline_docstore_multi.py
  • class MyEmbeddingFunction 87.0% similar

    Custom embedding function class that integrates OpenAI's embedding API with Chroma DB for generating vector embeddings from text documents.

    From: /tf/active/vicechatdev/project_victoria_disclosure_generator.py
  • class DocumentIndexer 60.9% similar

    A class for indexing documents into ChromaDB with support for multiple file formats (PDF, Word, PowerPoint, Excel, text files), smart incremental indexing, and document chunk management.

    From: /tf/active/vicechatdev/docchat/document_indexer.py