MyEmbeddingFunction_v3 - Code Extractor

class MyEmbeddingFunction_v3

Maturity: 40

A custom embedding function class that generates embeddings for text documents using OpenAI's embedding models, with automatic text summarization and token limit handling for large documents.

File: /tf/active/vicechatdev/offline_docstore_multi.py

Lines: 127 - 187

Complexity: moderate

Purpose

This class implements ChromaDB's EmbeddingFunction interface and handles large documents in two stages. Inputs longer than 1,000,000 characters are cut to that length and then trimmed in 1,000-character steps until they are at most 110,000 tokens. Any input still above 8,192 tokens is then summarized by an OpenAI chat model before it is sent to the embedding model. The class also provides helpers for token counting (tiktoken's cl100k_base encoding) and text sanitization, which makes it suitable for use with vector databases such as ChromaDB.

Source Code

class MyEmbeddingFunction(EmbeddingFunction):

    def __init__(self, model_name: str, embed_model_name: str, api_key: str):
        self.model_name = model_name
        self.api_key = api_key
        self.llm = ChatOpenAI(model_name=model_name, temperature=0,api_key=api_key)
        self.embed_model_name = embed_model_name

    def summarize_text(self,text, max_tokens_summary=8192):
        """
        Summarize the input text using the configured chat model (self.llm).
        The prompt asks for a summary under max_tokens_summary tokens;
        the limit is requested in the prompt, not enforced.
        """
        # Prepare the summarization prompt
        text=self.sanitize_text(text)

        prompt = (
            f"Please summarize the following text such that the summary is under {max_tokens_summary} tokens:\n\n{text}"
        )
        
        # Invoke the chat model configured in __init__
        response = self.llm.invoke(prompt)
        
        summary = response.content.strip()
        return summary
    
    def sanitize_text(self,text):
        """
        Sanitize text by encoding to UTF-8 with error replacement and decoding back.
        Code points that cannot be encoded as UTF-8 (e.g. lone surrogates) become '?';
        all other characters are returned unchanged.
        """
        return text.encode("utf-8", errors="replace").decode("utf-8")
    
    def count_tokens(self,text):
        encoding = tiktoken.get_encoding("cl100k_base")
        return len(encoding.encode(text))

    def __call__(self, input: Documents) -> Embeddings:
        # Expects a list of str and returns one embedding per input document
        embeddings=[]
        for content in input:
            # Hard cap for very large documents: first by characters, then by tokens
            if len(content) > 1000000:
                content = content[:1000000]
                logger.warning("Shrinking content due to token limit")
                while self.count_tokens(content) > 110000:
                    content = content[:-1000]

            # Summarize anything still above the 8192-token embedding limit
            if self.count_tokens(content) > 8192:
                logger.warning("Summarizing text due to token limit")
                # Fixed: the original call passed an undefined name `api_key`
                # as the second argument (max_tokens_summary)
                content=self.summarize_text(content)

            # Uses the module-level openai client, not self.api_key
            response = openai.embeddings.create(
                model=self.embed_model_name,
                input=content,
            )
            embedding = response.data[0].embedding
            embeddings.append(embedding)

        return embeddings

Parameters

Name	Type	Default	Kind
`bases`	EmbeddingFunction	-	-

Parameter Details

model_name: The name of the OpenAI chat model to use for text summarization (e.g., 'gpt-4o-mini', 'gpt-4'). This model is used when documents exceed the embedding model's token limit.

embed_model_name: The name of the OpenAI embedding model to use for generating vector embeddings (e.g., 'text-embedding-ada-002', 'text-embedding-3-small'). This model converts text into numerical vectors.

api_key: The OpenAI API key required for authentication with OpenAI services. Must be a valid API key with access to both chat and embedding endpoints.

Return Value

The class instantiation returns a MyEmbeddingFunction object. The __call__ method returns a list of embeddings (Embeddings type), where each embedding is a list of floating-point numbers representing the vector representation of the input document. The summarize_text method returns a string containing the summarized text. The count_tokens method returns an integer representing the token count. The sanitize_text method returns a sanitized string.

Class Interface

Methods

`__init__(self, model_name: str, embed_model_name: str, api_key: str)`

Purpose: Initializes the embedding function with OpenAI models and API credentials

Parameters:

model_name: Name of the OpenAI chat model for summarization
embed_model_name: Name of the OpenAI embedding model for vector generation
api_key: OpenAI API key for authentication

Returns: None (constructor)

`summarize_text(self, text, max_tokens_summary=8192) -> str`

Purpose: Summarizes input text using the configured chat model to reduce token count below the specified limit

Parameters:

text: The text content to summarize
max_tokens_summary: Maximum number of tokens for the summary output (default: 8192)

Returns: A string containing the summarized text. The token limit is only requested in the prompt, so the result is not guaranteed to fit within it
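Because the limit is only requested in the prompt, a caller may want to verify the result. The following is a minimal sketch of one way to do that, not part of the original class: `summarize_until_fits` is a hypothetical helper, and the `summarize` and `count_tokens` callables stand in for `summarize_text` and `count_tokens`. It re-summarizes a bounded number of times and falls back to hard truncation.

```python
from typing import Callable


def summarize_until_fits(
    text: str,
    summarize: Callable[[str], str],
    count_tokens: Callable[[str], int],
    limit: int = 8192,
    max_rounds: int = 3,
) -> str:
    """Return text that fits within `limit` tokens (hypothetical helper)."""
    # Re-summarize a bounded number of times while the text is too long.
    for _ in range(max_rounds):
        if count_tokens(text) <= limit:
            return text
        text = summarize(text)
    # Last resort: drop roughly 10% of the characters per step until it fits.
    while text and count_tokens(text) > limit:
        text = text[: int(len(text) * 0.9)]
    return text


if __name__ == "__main__":
    # Stand-ins for illustration only: words as "tokens", halving as "summary".
    count_words = lambda s: len(s.split())
    halve = lambda s: " ".join(s.split()[: len(s.split()) // 2])
    result = summarize_until_fits("w " * 100, halve, count_words, limit=30)
    print(count_words(result))
```

The bounded loop keeps API cost predictable; the truncation fallback trades completeness for a guarantee that the embedding request stays within the limit.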

`sanitize_text(self, text) -> str`

Purpose: Sanitizes text by handling encoding issues and replacing problematic characters

Parameters:

text: The text string to sanitize

Returns: The input string with any code points that cannot be encoded as UTF-8 (such as lone surrogates) replaced by '?'; all other characters are returned unchanged
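To make the effect of the round trip concrete, here is a small standalone sketch. The function is a module-level copy of the method, defined here only for illustration. Ordinary non-ASCII text passes through untouched, while a lone surrogate, which has no UTF-8 encoding, is replaced with `?` by Python's encoder.

```python
def sanitize_text(text: str) -> str:
    # Same round trip as the method: unencodable code points become '?'.
    return text.encode("utf-8", errors="replace").decode("utf-8")


# Valid non-ASCII characters are preserved.
assert sanitize_text("caf\u00e9") == "caf\u00e9"

# A lone surrogate cannot be encoded as UTF-8 and is replaced.
assert sanitize_text("bad\udc80byte") == "bad?byte"
```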

`count_tokens(self, text) -> int`

Purpose: Counts the number of tokens in the given text using the cl100k_base encoding

Parameters:

text: The text string to count tokens for

Returns: An integer representing the number of tokens in the text

`__call__(self, input: Documents) -> Embeddings`

Purpose: Generates embeddings for a list of documents, automatically handling token limits through truncation and summarization

Parameters:

input: A list of document strings (Documents type from ChromaDB) to generate embeddings for

Returns: A list of embeddings (Embeddings type), where each embedding is a list of floats representing the vector for each document
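The truncation step inside `__call__` can be isolated as a pure function, which makes its behaviour easier to see and test. This is a sketch under stated assumptions: `shrink_to_token_budget` is a hypothetical name, and the `count_tokens` callable stands in for the class's tiktoken-based method. As in the source, the token cap is applied only to documents that exceed the character cap.

```python
from typing import Callable


def shrink_to_token_budget(
    content: str,
    count_tokens: Callable[[str], int],
    max_chars: int = 1_000_000,
    max_tokens: int = 110_000,
    step: int = 1_000,
) -> str:
    """Mirror of the truncation branch in __call__ (hypothetical helper)."""
    # Documents at or under the character cap are returned unchanged.
    if len(content) <= max_chars:
        return content
    content = content[:max_chars]
    # Drop `step` characters from the end until the token budget is met.
    while content and count_tokens(content) > max_tokens:
        content = content[:-step]
    return content


if __name__ == "__main__":
    # Rough stand-in tokenizer for illustration: four characters per token.
    approx_tokens = lambda s: len(s) // 4
    shrunk = shrink_to_token_budget(
        "x" * 50, approx_tokens, max_chars=40, max_tokens=5, step=10
    )
    print(len(shrunk))
```

Each loop iteration re-tokenizes the whole string, so for inputs far above the budget a larger `step` reduces the number of tokenizer passes at the cost of a coarser cut.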

Attributes

Name	Type	Description	Scope
`model_name`	str	Stores the name of the OpenAI chat model used for text summarization	instance
`api_key`	str	Stores the OpenAI API key for authentication	instance
`llm`	ChatOpenAI	An instance of ChatOpenAI configured with the specified model and API key, used for text summarization	instance
`embed_model_name`	str	Stores the name of the OpenAI embedding model used for generating vector embeddings	instance

Dependencies

langchain_openai
tiktoken
openai
chromadb
logging

Required Imports

from langchain_openai import ChatOpenAI
import tiktoken
import openai
from chromadb import Documents, EmbeddingFunction, Embeddings
import logging

Usage Example

import openai
from langchain_openai import ChatOpenAI
import tiktoken
from chromadb import Documents, EmbeddingFunction, Embeddings
import logging

logger = logging.getLogger(__name__)

# MyEmbeddingFunction is assumed to be defined or imported in this module.

# __call__ sends embedding requests through the module-level openai client,
# which does not receive the constructor's api_key; configure it here or
# set the OPENAI_API_KEY environment variable.
openai.api_key = 'your-openai-api-key'

# Instantiate the embedding function
embedding_func = MyEmbeddingFunction(
    model_name='gpt-4o-mini',
    embed_model_name='text-embedding-ada-002',
    api_key='your-openai-api-key'
)

# Use with a list of documents
documents = [
    'This is the first document to embed.',
    'This is the second document with more content.'
]

# Generate embeddings (calls __call__ method)
embeddings = embedding_func(documents)

# Use individual methods
token_count = embedding_func.count_tokens('Sample text')
sanitized = embedding_func.sanitize_text('Text with special chars')
summary = embedding_func.summarize_text('Long text to summarize', max_tokens_summary=4096)

Best Practices

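Always provide a valid OpenAI API key during instantiation to avoid authentication errors. The api_key argument is passed only to the chat model; the embedding request goes through the module-level openai client, which must be configured separately (for example through the OPENAI_API_KEY environment variable)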
Be aware that the __call__ method may invoke the summarization API for documents exceeding 8192 tokens, which incurs additional API costs
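Documents longer than 1,000,000 characters are cut to 1,000,000 characters and then shortened by 1,000 characters at a time until they are at most 110,000 tokens, so the tail of very large documents is lost. The token cap is not applied to documents under the character cap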
Use appropriate embedding models that match your use case (e.g., text-embedding-3-small for cost efficiency, text-embedding-3-large for better quality)
Monitor API usage as the class makes multiple API calls for large documents (summarization + embedding)
The class uses temperature=0 for the chat model to make summarization results as reproducible as possible; identical output across calls is not guaranteed
Handle potential API errors (rate limits, network issues) when calling the __call__ method in production
The sanitize_text method should be called before processing user-generated content to avoid encoding issues
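Note that the summarize_text call within __call__ in the source file is buggy: it passes content correctly but adds an undefined name, api_key, as the second argument (max_tokens_summary). This raises a NameError the first time a document exceeds 8192 tokens; the call should be self.summarize_text(content)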
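The advice above about handling API errors can be sketched as a small retry wrapper with exponential backoff. This is an illustrative helper, not part of the original class: `call_with_retries` and its parameters are hypothetical names, and the `sleep` argument is injectable so the behaviour can be tested without waiting.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying on the given exceptions with doubling delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            # Re-raise once the attempt budget is exhausted.
            if attempt == max_attempts:
                raise
            # Delays grow as base_delay * 1, 2, 4, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError("max_attempts must be at least 1")


if __name__ == "__main__":
    attempts = {"n": 0}

    def flaky() -> str:
        # Fails twice, then succeeds; stands in for an API call.
        attempts["n"] += 1
        if attempts["n"] < 3:
            raise ConnectionError("transient failure")
        return "ok"

    print(call_with_retries(flaky, sleep=lambda seconds: None))
```

With the 1.x openai package, a call such as `call_with_retries(lambda: embedding_func(documents), retry_on=(openai.RateLimitError, openai.APIConnectionError))` would limit retries to transient failures; which exceptions to retry is a design choice that depends on the deployment.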

Similar Components

AI-powered semantic similarity - components with related functionality:

class MyEmbeddingFunction_v2 97.2% similar

A custom embedding function class that generates embeddings for text documents using OpenAI's embedding models, with automatic text summarization and token management for large documents.
From: /tf/active/vicechatdev/offline_docstore_multi_vice.py

class MyEmbeddingFunction_v1 92.5% similar

A custom embedding function class that generates embeddings for documents using OpenAI's API, with built-in text summarization for long documents and token management.
From: /tf/active/vicechatdev/OneCo_hybrid_RAG copy.py

class DocChatEmbeddingFunction 86.6% similar

A custom ChromaDB embedding function that generates OpenAI embeddings with automatic text summarization for documents exceeding token limits.
From: /tf/active/vicechatdev/docchat/document_indexer.py

class MyEmbeddingFunction 78.5% similar

Custom embedding function class that integrates OpenAI's embedding API with Chroma DB for generating vector embeddings from text documents.
From: /tf/active/vicechatdev/project_victoria_disclosure_generator.py

class TextClusterer 53.4% similar

A class that clusters similar documents based on their embeddings using various clustering algorithms (K-means, Agglomerative, DBSCAN) and optionally generates summaries for each cluster.
From: /tf/active/vicechatdev/chromadb-cleanup/src/clustering/text_clusterer.py
