MyEmbeddingFunction_v1 - Code Extractor

class MyEmbeddingFunction_v1

Maturity: 44

A custom embedding function class that generates embeddings for documents using OpenAI's API, with built-in text summarization for long documents and token management.

File:
/tf/active/vicechatdev/OneCo_hybrid_RAG copy.py

Lines:
589 - 683

Complexity:
moderate

Purpose

This class extends ChromaDB's EmbeddingFunction interface so that an instance can be passed directly to a collection. For each input document it counts tokens with tiktoken, truncates extremely long inputs, replaces any document over 8,192 tokens with an LLM-generated summary, and then requests an embedding from the configured OpenAI embedding model. Text is sanitized before it is sent to the summarizer to avoid encoding errors.

Source Code

class MyEmbeddingFunction(EmbeddingFunction):

    def __init__(self, model_name: str, embed_model_name: str, api_key: str):
        """
        Initialize the embedding function with specific models and API key.
        
        Args:
            model_name: Model name for the LLM summarizer
            embed_model_name: Model name for embeddings
            api_key: OpenAI API key
        """
        self.model_name = model_name
        self.api_key = api_key
        
        # Set up the OpenAI client directly 
        import openai
        self.client = openai.OpenAI(api_key=api_key)
        
        # Set up the LLM for summarization
        self.llm = ChatOpenAI(
            model_name=model_name, 
            temperature=0,
            api_key=api_key
        )
        
        self.embed_model_name = embed_model_name
        
        # Ensure we're using standard OpenAI (not Azure)
        # This is critical for avoiding the ambiguous API error
        import os
        os.environ["OPENAI_API_TYPE"] = "openai"
        os.environ["OPENAI_API_KEY"] = api_key

    def summarize_text(self,text, max_tokens_summary=8192):
        """
        Summarize the input text using the GPT-4o-mini summarizer.
        The summary will be limited to under max_tokens_summary tokens.
        """
        # Prepare the summarization prompt
        text=self.sanitize_text(text)

        prompt = (
            f"Please summarize the following text such that the summary is under {max_tokens_summary} tokens:\n\n{text}"
        )
        
        # Call the ChatCompletion API with the GPT-4o-mini model
        response = self.llm.invoke(prompt)
        
        summary = response.content.strip()
        return summary
    
    def sanitize_text(self,text):
        """
        Sanitize text by encoding to UTF-8 with error replacement and decoding back.
        This replaces any characters that might cause ASCII encoding errors.
        """
        return text.encode("utf-8", errors="replace").decode("utf-8")
    
    def count_tokens(self,text):
        encoding = tiktoken.get_encoding("cl100k_base")
        return len(encoding.encode(text))

    def __call__(self, input: Documents) -> Embeddings:
        """
        Generate embeddings for the input documents.
        
        Args:
            input: List of document strings to embed
            
        Returns:
            List of embeddings for each document
        """
        # Embed the documents somehow
        embeddings = []
        
        for content in input:
            # Handle very long content
            if len(content) > 1000000:
                content = content[:1000000]
                while self.count_tokens(content) > 110000:
                    content = content[:-1000]
    
            # Create embedding
            if self.count_tokens(content) > 8192:
                content = self.summarize_text(content)
    
            # Use the direct client instead of the module-level API
            response = self.client.embeddings.create(
                model=self.embed_model_name,
                input=content,
            )
            embedding = response.data[0].embedding
            embeddings.append(embedding)
                    
        return embeddings

Parameters

Name	Type	Default	Kind
`bases`	EmbeddingFunction	-

Parameter Details

model_name: The name of the OpenAI language model to use for text summarization (e.g., 'gpt-4o-mini'). This model is called only when a document exceeds 8,192 tokens, so its context window must be large enough to hold such documents.

embed_model_name: The name of the OpenAI embedding model to use for generating embeddings (e.g., 'text-embedding-ada-002', 'text-embedding-3-small'). This model converts text into vector representations.

api_key: Your OpenAI API key for authentication. This key is used for both the summarization LLM and the embedding model.

Return Value

Instantiating the class returns a MyEmbeddingFunction instance (__init__ itself returns None). The __call__ method returns a list of embeddings (Embeddings type), one per input document and in input order, where each embedding is a list of floats. The summarize_text method returns the summary as a string. The sanitize_text method returns the sanitized string. The count_tokens method returns the number of cl100k_base tokens in the input text as an integer.

Class Interface

Methods

`__init__(self, model_name: str, embed_model_name: str, api_key: str)`

Purpose: Initialize the embedding function with OpenAI models and API credentials, setting up the LLM for summarization and the embedding client

Parameters:

model_name: Name of the OpenAI model for text summarization
embed_model_name: Name of the OpenAI embedding model
api_key: OpenAI API key for authentication

Returns: None (constructor)

`summarize_text(self, text, max_tokens_summary=8192)`

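Purpose: Sanitize the text and ask the configured LLM for a summary under the specified token limit. The limit is stated in the prompt only; the method does not check the length of the returned summary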

Parameters:

text: The input text to summarize
max_tokens_summary: Maximum number of tokens for the summary (default: 8192)

Returns: A string containing the summarized text
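Because the token limit passed to summarize_text appears only in the prompt, the returned summary can still be longer than requested. One possible safeguard, sketched below, is to cut the summary at the token level afterwards. The helper name enforce_token_limit and its encode/decode parameters are hypothetical and not part of the class; with tiktoken they would typically be the encode and decode methods of the object returned by tiktoken.get_encoding("cl100k_base").

```python
from typing import Callable, List, Sequence


def enforce_token_limit(
    text: str,
    encode: Callable[[str], List[int]],
    decode: Callable[[Sequence[int]], str],
    max_tokens: int = 8192,
) -> str:
    """Return text unchanged if it fits, otherwise keep only the first max_tokens tokens.

    Hypothetical helper: encode and decode are supplied by the caller so the
    function works with any tokenizer that maps text to token ids and back.
    """
    tokens = encode(text)
    if len(tokens) <= max_tokens:
        return text
    # Decode only the leading slice of tokens back into text.
    return decode(tokens[:max_tokens])
```

A hard cut like this can end mid-sentence, and with a byte-level tokenizer it may leave a partial character at the boundary, so it is best treated as a last resort after the summarizer has already done most of the shortening.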

`sanitize_text(self, text)`

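Purpose: Round-trip text through UTF-8 with errors="replace". For a Python str this changes only characters that UTF-8 cannot encode (such as lone surrogates), which become "?"; all other text, including non-ASCII characters, is returned unchanged

Parameters:

text: The input text to sanitize

Returns: The input string with unencodable characters replaced by "?"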
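The effect of the encode/decode round trip is easy to misread, so here is a small standalone illustration. It copies the one-line body of sanitize_text into a free function (an adaptation for demonstration only) and applies it to ordinary non-ASCII text and to a string containing a lone surrogate, which UTF-8 cannot encode.

```python
def sanitize_text(text: str) -> str:
    # Same round trip as the method, written as a free function for the demo.
    return text.encode("utf-8", errors="replace").decode("utf-8")


plain = "café, naïve, 日本語"   # valid Unicode, including non-ASCII characters
broken = "abc\ud800def"          # contains a lone surrogate

# Valid text passes through untouched.
print(sanitize_text(plain) == plain)

# The unencodable character is replaced by a question mark.
print(sanitize_text(broken))
```

Without the replacement, encoding the second string as UTF-8 raises UnicodeEncodeError, which is the failure the method guards against.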

`count_tokens(self, text)`

Purpose: Count the number of tokens in the input text using the cl100k_base encoding

Parameters:

text: The input text to count tokens for

Returns: An integer representing the number of tokens

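`__call__(self, input: Documents) -> Embeddings`

Purpose: Generate embeddings for a list of documents, one embedding request per document. Documents over 1,000,000 characters are truncated to that length and then trimmed in 1,000-character steps until they are at most 110,000 tokens; any document still over 8,192 tokens is summarized before embedding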

Parameters:

input: A list of document strings (Documents type) to generate embeddings for

Returns: A list of embeddings (Embeddings type), where each embedding is a list of floats
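The length handling in __call__ can be read as a small policy that is independent of any API. The sketch below restates it as a pure function so the thresholds can be exercised without network access. The function name, the constant names, and the count_tokens/summarize callables are hypothetical stand-ins; the numeric values are the ones in the source.

```python
from typing import Callable

MAX_CHARS = 1_000_000       # character cap applied before any token counting
MAX_TOKENS_LONG = 110_000   # token ceiling enforced only on texts that hit MAX_CHARS
MAX_TOKENS_EMBED = 8192     # texts above this are summarized before embedding
TRIM_STEP = 1000            # characters dropped per trimming pass


def prepare_for_embedding(
    content: str,
    count_tokens: Callable[[str], int],
    summarize: Callable[[str], str],
) -> str:
    """Mirror the truncate-then-summarize steps of __call__ (illustrative only)."""
    if len(content) > MAX_CHARS:
        content = content[:MAX_CHARS]
        # Re-count after every trim; with a real tokenizer each pass is costly.
        while count_tokens(content) > MAX_TOKENS_LONG:
            content = content[:-TRIM_STEP]
    if count_tokens(content) > MAX_TOKENS_EMBED:
        content = summarize(content)
    return content
```

Written this way, one property of the original logic stands out: the 110,000-token trim runs only for inputs longer than 1,000,000 characters, so a shorter document that still tokenizes to more than 110,000 tokens goes to the summarizer untrimmed.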

Attributes

Name	Type	Description	Scope
`model_name`	str	The name of the OpenAI model used for text summarization	instance
`api_key`	str	The OpenAI API key used for authentication	instance
`client`	openai.OpenAI	The OpenAI client instance used for making API calls to generate embeddings	instance
`llm`	ChatOpenAI	The LangChain ChatOpenAI instance used for text summarization	instance
`embed_model_name`	str	The name of the OpenAI embedding model used to generate vector embeddings	instance

Dependencies

openai
langchain_openai
tiktoken
chromadb

Required Imports

import openai
from langchain_openai import ChatOpenAI
import tiktoken
from chromadb import Documents, EmbeddingFunction, Embeddings

Conditional/Optional Imports

These imports are performed inside __init__ rather than at module level:

import os

Condition: Executed every time an instance is created; used to set the OPENAI_API_TYPE and OPENAI_API_KEY environment variables

Required (conditional)
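Since the constructor writes OPENAI_API_TYPE and OPENAI_API_KEY into os.environ, other code in the same process sees those values afterwards. If that side effect is unwanted, one option is to restore the previous values once the instance has been created. The context manager below is a minimal sketch of that idea; preserve_openai_env is a hypothetical name and not part of the class.

```python
import os
from contextlib import contextmanager
from typing import Iterator

# The two variables that the constructor overwrites.
_KEYS = ("OPENAI_API_TYPE", "OPENAI_API_KEY")


@contextmanager
def preserve_openai_env() -> Iterator[None]:
    """Restore the OpenAI-related environment variables on exit (hypothetical helper)."""
    saved = {key: os.environ.get(key) for key in _KEYS}
    try:
        yield
    finally:
        for key, value in saved.items():
            if value is None:
                # The variable did not exist before, so remove it again.
                os.environ.pop(key, None)
            else:
                os.environ[key] = value
```

It would wrap the instantiation, for example `with preserve_openai_env(): embedding_fn = MyEmbeddingFunction(...)`. The source comments say the environment variables are set to avoid an ambiguous API error, so restoring them may bring that error back if a later call reads the environment; whether this trade-off is acceptable depends on the rest of the application.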

Usage Example

# Instantiate the embedding function
api_key = 'your-openai-api-key'
embedding_fn = MyEmbeddingFunction(
    model_name='gpt-4o-mini',
    embed_model_name='text-embedding-ada-002',
    api_key=api_key
)

# Generate embeddings for documents
documents = ['This is a short document.', 'This is another document with more content.']
embeddings = embedding_fn(documents)

# Use with ChromaDB
import chromadb
client = chromadb.Client()
collection = client.create_collection(
    name='my_collection',
    embedding_function=embedding_fn
)
collection.add(
    documents=documents,
    ids=['doc1', 'doc2']
)

# Manually summarize long text
long_text = 'Very long document content...'
summary = embedding_fn.summarize_text(long_text, max_tokens_summary=4096)

# Count tokens in text
token_count = embedding_fn.count_tokens('Some text to count')

Best Practices

Always provide a valid OpenAI API key during instantiation to avoid authentication errors
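The class automatically handles long documents by summarizing them, but this adds latency and API costs, and the resulting embedding represents the summary rather than the original text
Documents longer than 1,000,000 characters are truncated to that length, then trimmed in 1,000-character steps until they are at most 110,000 tokens
Documents exceeding 8,192 tokens are summarized before embedding; the summary length is requested in the prompt but not verified afterwards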
The class sets environment variables (OPENAI_API_TYPE, OPENAI_API_KEY) which may affect other parts of your application
Use the same instance for multiple embedding operations to avoid recreating the OpenAI client
The summarize_text method can be called independently for preprocessing long texts
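Token counting uses the 'cl100k_base' encoding, which matches GPT-4, GPT-3.5 Turbo, and the text-embedding-ada-002 and text-embedding-3 models; GPT-4o models use 'o200k_base', so counts are only approximate for a gpt-4o-mini summarizer
Text sanitization is applied only inside summarize_text; documents that are embedded without summarization are sent to the API unsanitized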
The class is designed to be used as a ChromaDB embedding function via the __call__ method
Be mindful of API rate limits and costs when processing large batches of documents
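The class issues one embedding request per document and has no retry logic of its own, so a rate-limit error in the middle of a batch aborts the whole call. A common mitigation is to retry with exponential backoff. The helper below is a generic sketch using only the standard library; call_with_backoff and its parameters are hypothetical names, and the exception types to retry on are left to the caller.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on the given exceptions with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles on each attempt; jitter spreads out concurrent retries.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    raise AssertionError("unreachable")
```

With the openai v1 client used by this class, the wrapped call might look like `call_with_backoff(lambda: client.embeddings.create(model=name, input=text), retry_on=(openai.RateLimitError,))`. The v1 client also accepts a max_retries argument at construction time, which may be sufficient on its own for light workloads.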

Similar Components

AI-powered semantic similarity - components with related functionality:

class MyEmbeddingFunction_v2 93.9% similar

A custom embedding function class that generates embeddings for text documents using OpenAI's embedding models, with automatic text summarization and token management for large documents.
From: /tf/active/vicechatdev/offline_docstore_multi_vice.py

class DocChatEmbeddingFunction 93.1% similar

A custom ChromaDB embedding function that generates OpenAI embeddings with automatic text summarization for documents exceeding token limits.
From: /tf/active/vicechatdev/docchat/document_indexer.py

class MyEmbeddingFunction_v3 92.5% similar

A custom embedding function class that generates embeddings for text documents using OpenAI's embedding models, with automatic text summarization and token limit handling for large documents.
From: /tf/active/vicechatdev/offline_docstore_multi.py

class MyEmbeddingFunction 87.0% similar

Custom embedding function class that integrates OpenAI's embedding API with Chroma DB for generating vector embeddings from text documents.
From: /tf/active/vicechatdev/project_victoria_disclosure_generator.py

class DocumentIndexer 60.9% similar

A class for indexing documents into ChromaDB with support for multiple file formats (PDF, Word, PowerPoint, Excel, text files), smart incremental indexing, and document chunk management.
From: /tf/active/vicechatdev/docchat/document_indexer.py
