class MyEmbeddingFunction_v1
A custom embedding function class that generates embeddings for documents using OpenAI's API, with built-in text summarization for long documents and token management.
/tf/active/vicechatdev/OneCo_hybrid_RAG copy.py
589 - 683
moderate
Purpose
This class extends EmbeddingFunction to provide a complete embedding solution for ChromaDB or similar vector databases. It handles long documents by automatically summarizing them when they exceed token limits, sanitizes text to prevent encoding errors, and generates embeddings using OpenAI's embedding models. The class is designed to work with ChromaDB's embedding function interface and manages the entire pipeline from raw text to embeddings, including preprocessing, summarization, and token counting.
Source Code
class MyEmbeddingFunction(EmbeddingFunction):
    def __init__(self, model_name: str, embed_model_name: str, api_key: str):
        """
        Initialize the embedding function with specific models and API key.

        Args:
            model_name: Model name for the LLM summarizer
            embed_model_name: Model name for embeddings
            api_key: OpenAI API key
        """
        self.model_name = model_name
        self.api_key = api_key
        # Set up the OpenAI client directly
        import openai
        self.client = openai.OpenAI(api_key=api_key)
        # Set up the LLM for summarization
        self.llm = ChatOpenAI(
            model_name=model_name,
            temperature=0,
            api_key=api_key
        )
        self.embed_model_name = embed_model_name
        # Ensure we're using standard OpenAI (not Azure)
        # This is critical for avoiding the ambiguous API error
        import os
        os.environ["OPENAI_API_TYPE"] = "openai"
        os.environ["OPENAI_API_KEY"] = api_key

    def summarize_text(self, text, max_tokens_summary=8192):
        """
        Summarize the input text using the configured summarization LLM.
        The prompt asks for a summary under max_tokens_summary tokens;
        the length of the result is not checked.
        """
        # Prepare the summarization prompt
        text = self.sanitize_text(text)
        prompt = (
            f"Please summarize the following text such that the summary is under {max_tokens_summary} tokens:\n\n{text}"
        )
        # Invoke the configured chat model
        response = self.llm.invoke(prompt)
        summary = response.content.strip()
        return summary

    def sanitize_text(self, text):
        """
        Sanitize text by encoding to UTF-8 with error replacement and decoding back.
        This replaces any characters that cannot be encoded (such as lone surrogates).
        """
        return text.encode("utf-8", errors="replace").decode("utf-8")

    def count_tokens(self, text):
        """Count tokens in text using the cl100k_base encoding."""
        encoding = tiktoken.get_encoding("cl100k_base")
        return len(encoding.encode(text))

    def __call__(self, input: Documents) -> Embeddings:
        """
        Generate embeddings for the input documents.

        Args:
            input: List of document strings to embed
        Returns:
            List of embeddings for each document
        """
        # Embed each document in turn
        embeddings = []
        for content in input:
            # Handle very long content: cap by characters, then trim by tokens
            if len(content) > 1000000:
                content = content[:1000000]
            while self.count_tokens(content) > 110000:
                content = content[:-1000]
            # Summarize content that exceeds the embedding token threshold
            if self.count_tokens(content) > 8192:
                content = self.summarize_text(content)
            # Use the direct client instead of the module-level API
            response = self.client.embeddings.create(
                model=self.embed_model_name,
                input=content,
            )
            embedding = response.data[0].embedding
            embeddings.append(embedding)
        return embeddings
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| bases | EmbeddingFunction | - | |
Parameter Details
model_name: The name of the OpenAI language model to use for text summarization (e.g., 'gpt-4o-mini', 'gpt-4'). This model is used when documents exceed the embedding model's token limit.
embed_model_name: The name of the OpenAI embedding model to use for generating embeddings (e.g., 'text-embedding-ada-002', 'text-embedding-3-small'). This model converts text into vector representations.
api_key: Your OpenAI API key for authentication. This key is used for both the summarization LLM and the embedding model.
Return Value
Instantiating the class returns a MyEmbeddingFunction instance (the __init__ method itself returns None). The __call__ method returns a list of embeddings (Embeddings type), one per input document, where each embedding is a list of floats. The summarize_text method returns a string containing the summarized text. The sanitize_text method returns the sanitized string. The count_tokens method returns an integer representing the number of tokens in the input text.
Class Interface
Methods
__init__(self, model_name: str, embed_model_name: str, api_key: str)
Purpose: Initialize the embedding function with OpenAI models and API credentials, setting up the LLM for summarization and the embedding client
Parameters:
model_name: Name of the OpenAI model for text summarization
embed_model_name: Name of the OpenAI embedding model
api_key: OpenAI API key for authentication
Returns: None (constructor)
summarize_text(self, text, max_tokens_summary=8192)
Purpose: Summarize long text using the configured LLM. The prompt asks for a summary under the specified token limit, but the method does not check the length of the returned summary
Parameters:
text: The input text to summarize
max_tokens_summary: Maximum number of tokens for the summary (default: 8192)
Returns: A string containing the summarized text
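Because summarize_text only asks the model to stay under max_tokens_summary, a caller that needs a hard guarantee could re-check the result. The following is a minimal sketch of such a guard, not part of the class: enforce_token_limit, its step parameter, and the word_count stand-in are hypothetical names, and a real caller would pass the class's count_tokens method instead of word_count.

```python
from typing import Callable


def enforce_token_limit(
    summary: str,
    max_tokens: int,
    count_tokens: Callable[[str], int],
    step: int = 200,
) -> str:
    """Trim characters from the end until the text fits within max_tokens."""
    # `summary and ...` stops the loop once the string is empty.
    while summary and count_tokens(summary) > max_tokens:
        summary = summary[:-step]
    return summary


def word_count(text: str) -> int:
    """Hypothetical offline stand-in for a real tokenizer."""
    return len(text.split())


# A 50-word "summary" that exceeds a 10-token budget.
over_budget = " ".join(["token"] * 50)
trimmed = enforce_token_limit(over_budget, max_tokens=10, count_tokens=word_count, step=20)
```

Trimming from the end mirrors the character-based trimming that __call__ already applies to very long documents; it can cut a sentence mid-way, which may or may not matter for embedding quality.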
sanitize_text(self, text)
Purpose: Clean text by encoding to UTF-8 with error replacement to handle problematic characters
Parameters:
text: The input text to sanitize
Returns: A sanitized string with problematic characters replaced
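To make the effect of the round-trip concrete, here is a small standalone sketch that reproduces the method body as a free function. The sample strings are illustrative only.

```python
def sanitize_text(text: str) -> str:
    """Same encode/decode round-trip as the method, written as a free function."""
    return text.encode("utf-8", errors="replace").decode("utf-8")


# Ordinary non-ASCII text is valid UTF-8 and passes through unchanged.
clean = sanitize_text("café")

# A lone surrogate cannot be encoded as UTF-8. With errors="replace" the
# encoder substitutes "?" instead of raising UnicodeEncodeError.
repaired = sanitize_text("bad\ud800char")
```

In other words, the method only alters strings that contain code points UTF-8 cannot represent; accented letters, CJK text, and emoji are left as they are.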
count_tokens(self, text)
Purpose: Count the number of tokens in the input text using the cl100k_base encoding
Parameters:
text: The input text to count tokens for
Returns: An integer representing the number of tokens
__call__(self, input: Documents) -> Embeddings
Purpose: Generate embeddings for a list of documents, automatically handling long documents through summarization and truncation
Parameters:
input: A list of document strings (Documents type) to generate embeddings for
Returns: A list of embeddings (Embeddings type), where each embedding is a list of floats
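The per-document preprocessing inside __call__ can be read as three steps: cap by characters, trim by tokens, then summarize if the text is still over the embedding threshold. The sketch below restates that flow as a standalone function so it can be run offline. The function name, the constant names, and the fake_count and fake_summarize helpers are hypothetical; the numeric thresholds are the ones in the source code.

```python
from typing import Callable

MAX_CHARS = 1_000_000      # hard character cap applied first
MAX_TOKENS_RAW = 110_000   # token ceiling applied before any summarization
MAX_TOKENS_EMBED = 8192    # above this, the text is summarized
TRIM_STEP = 1000           # characters dropped per trimming iteration


def prepare_for_embedding(
    content: str,
    count_tokens: Callable[[str], int],
    summarize: Callable[[str], str],
) -> str:
    """Return the text that would be sent to the embedding endpoint."""
    if len(content) > MAX_CHARS:
        content = content[:MAX_CHARS]
    while count_tokens(content) > MAX_TOKENS_RAW:
        content = content[:-TRIM_STEP]
    if count_tokens(content) > MAX_TOKENS_EMBED:
        content = summarize(content)
    return content


def fake_count(text: str) -> int:
    """Hypothetical tokenizer stand-in: one token per whitespace-separated word."""
    return len(text.split())


def fake_summarize(text: str) -> str:
    """Hypothetical summarizer stand-in: keep the first 100 words."""
    return " ".join(text.split()[:100])


short = prepare_for_embedding("hello world", fake_count, fake_summarize)
long_doc = " ".join(["word"] * 9000)
summarized = prepare_for_embedding(long_doc, fake_count, fake_summarize)
```

Passing the tokenizer and summarizer as arguments is a choice made for this sketch so the logic can be exercised without API calls; the class itself calls its own count_tokens and summarize_text methods.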
Attributes
| Name | Type | Description | Scope |
|---|---|---|---|
| model_name | str | The name of the OpenAI model used for text summarization | instance |
| api_key | str | The OpenAI API key used for authentication | instance |
| client | openai.OpenAI | The OpenAI client instance used for making API calls to generate embeddings | instance |
| llm | ChatOpenAI | The LangChain ChatOpenAI instance used for text summarization | instance |
| embed_model_name | str | The name of the OpenAI embedding model used to generate vector embeddings | instance |
Dependencies
- openai
- langchain_openai
- tiktoken
- chromadb
Required Imports
import openai
from langchain_openai import ChatOpenAI
import tiktoken
from chromadb import Documents, EmbeddingFunction, Embeddings
Conditional/Optional Imports
These imports are only needed under specific conditions:
import os
Condition: Used internally to set environment variables for OpenAI API configuration
Required (conditional)
Usage Example
# Instantiate the embedding function
api_key = 'your-openai-api-key'
embedding_fn = MyEmbeddingFunction(
    model_name='gpt-4o-mini',
    embed_model_name='text-embedding-ada-002',
    api_key=api_key
)
# Generate embeddings for documents
documents = ['This is a short document.', 'This is another document with more content.']
embeddings = embedding_fn(documents)
# Use with ChromaDB
import chromadb
client = chromadb.Client()
collection = client.create_collection(
    name='my_collection',
    embedding_function=embedding_fn
)
collection.add(
    documents=documents,
    ids=['doc1', 'doc2']
)
# Manually summarize long text
long_text = 'Very long document content...'
summary = embedding_fn.summarize_text(long_text, max_tokens_summary=4096)
# Count tokens in text
token_count = embedding_fn.count_tokens('Some text to count')
Best Practices
- Always provide a valid OpenAI API key during instantiation to avoid authentication errors
- The class automatically handles long documents by summarizing them, but be aware this adds latency and API costs
- Documents longer than 1,000,000 characters are truncated to that length, then trimmed 1,000 characters at a time until they are at or below 110,000 tokens
- Documents exceeding 8,192 tokens are automatically summarized before embedding
- The class sets environment variables (OPENAI_API_TYPE, OPENAI_API_KEY) which may affect other parts of your application
- Use the same instance for multiple embedding operations to avoid recreating the OpenAI client
- The summarize_text method can be called independently for preprocessing long texts
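- Token counting uses the 'cl100k_base' encoding, which matches OpenAI's embedding models (such as text-embedding-ada-002 and text-embedding-3-small) and GPT-4; GPT-4o-family models use 'o200k_base', so counts may differ slightly from what the summarization model sees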
- Text sanitization is automatically applied during summarization to prevent encoding errors
- The class is designed to be used as a ChromaDB embedding function via the __call__ method
- Be mindful of API rate limits and costs when processing large batches of documents
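One way to contain the environment-variable side effect noted above is to snapshot the affected variables and restore them afterwards. The sketch below is an illustration only: preserved_env is a hypothetical helper, and DEMO_API_TYPE and DEMO_API_KEY are placeholder names used so the example does not touch real settings. Whether the class still behaves as intended once OPENAI_API_TYPE and OPENAI_API_KEY are restored is an assumption that would need checking against the installed library versions.

```python
import os
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def preserved_env(*names: str) -> Iterator[None]:
    """Restore the named environment variables to their prior state on exit."""
    saved = {name: os.environ.get(name) for name in names}
    try:
        yield
    finally:
        for name, value in saved.items():
            if value is None:
                os.environ.pop(name, None)   # variable did not exist before
            else:
                os.environ[name] = value     # put the old value back


# Placeholder variables standing in for OPENAI_API_TYPE / OPENAI_API_KEY.
os.environ["DEMO_API_TYPE"] = "azure"        # pretend another component set this
os.environ.pop("DEMO_API_KEY", None)         # pretend the key was never set

with preserved_env("DEMO_API_TYPE", "DEMO_API_KEY"):
    # Stand-in for a constructor that overwrites both variables.
    os.environ["DEMO_API_TYPE"] = "openai"
    os.environ["DEMO_API_KEY"] = "placeholder-key"

restored_type = os.environ["DEMO_API_TYPE"]
key_still_set = "DEMO_API_KEY" in os.environ
```

In real use the with block would wrap the MyEmbeddingFunction(...) call and name the two OPENAI_* variables instead of the placeholders.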
Similar Components
AI-powered semantic similarity - components with related functionality:
- class MyEmbeddingFunction_v2 93.9% similar
- class DocChatEmbeddingFunction 93.1% similar
- class MyEmbeddingFunction_v3 92.5% similar
- class MyEmbeddingFunction 87.0% similar
- class DocumentIndexer 60.9% similar