class SimpleDataHandle
A data handler class that manages multiple data sources with different types (dataframes, vector stores, databases) and their associated processing configurations.
Source file: /tf/active/vicechatdev/OneCo_hybrid_RAG copy.py (lines 718-787)
Complexity: moderate
Purpose
SimpleDataHandle provides a centralized registry for managing heterogeneous data sources in a data processing or RAG (Retrieval-Augmented Generation) pipeline. It stores data along with metadata including type, filters, processing steps, inclusion limits, and instructions for how to use each data source. The class automatically configures default settings based on data type and can convert documents to vector stores using FAISS and OpenAI embeddings.
Source Code
class SimpleDataHandle:
    def __init__(self):
        self.handlers = {}

    def add_data(self, name: str, type: str, data: Any, filters: str = "",
                 processing_steps: List[str] = [], inclusions: int = 10,
                 instructions: str = ""):
        ## Default values for type, filters, processing_steps, instructions
        if type == "":
            type = "text"
        if type == "dataframe":
            filters = ""
            if processing_steps == []:
                processing_steps = ["markdown"]
            if instructions == "":
                instructions = """Start with a summary of the internal data, using summary tables when possible. If the internal data is presented as chemical formulas in SMILES format, try to find the corresponding chemical names and properties and report those in your answer.
Use them to compare it to other chemical data in the external sources."""
        # Note: the original comparison `type=="vectorstore" or "to_vectorstore"`
        # was always truthy; `in` expresses the intended membership test.
        if type in ("vectorstore", "to_vectorstore"):
            if processing_steps == []:
                processing_steps = ["similarity"]
            if instructions == "":
                instructions = """Provide a summary of the given context data extracted from lab data and reports and from scientific literature, using summary tables when possible.
"""
            if type == "to_vectorstore":
                # Convert a list of Documents into a FAISS vector store
                embeddings = OpenAIEmbeddings()
                index = faiss.IndexFlatL2(len(embeddings.embed_query("hello world")))
                vector_store = FAISS(
                    embedding_function=embeddings,
                    docstore=InMemoryDocstore(),
                    index_to_docstore_id={},
                    index=index,
                )
                uuids = [str(uuid4()) for _ in range(len(data))]
                vector_store.add_documents(
                    documents=data,
                    ids=uuids,
                )
                data = vector_store
                type = "vectorstore"
        if type == "db_search":
            if processing_steps == []:
                processing_steps = ["similarity"]
            if instructions == "":
                instructions = """Provide a summary of the given context data extracted from lab data and reports and from scientific literature, using summary tables when possible.
"""
        if type == "chromaDB":
            if processing_steps == []:
                processing_steps = ["similarity"]
            if instructions == "":
                instructions = """Provide a summary of the given context data extracted from lab data and reports and from scientific literature, using summary tables when possible.
"""
        self.handlers[name] = {
            "type": type,
            "data": data,
            "filters": filters,
            "processing_steps": processing_steps,
            "inclusions": inclusions,
            "instructions": instructions,
        }

    def remove_data(self, name: str):
        if name in self.handlers:
            del self.handlers[name]

    def clear_data(self):
        self.handlers = {}
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| bases | - | - | - |
Parameter Details
__init__: No parameters required. Initializes an empty handlers dictionary to store data sources.
Return Value
The class constructor returns None. The add_data, remove_data, and clear_data methods all return None (implicit). The class maintains state through the handlers dictionary which stores data source configurations as nested dictionaries with keys: type, data, filters, processing_steps, inclusions, and instructions.
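For concreteness, here is the shape of one `handlers` entry after a typical `add_data` call. This is a sketch with placeholder values: the real `data` field holds the actual object, and the `instructions` string is truncated here.

```python
# Illustrative shape of self.handlers after
# add_data(name='my_dataframe', type='dataframe', data=df, inclusions=5).
handlers = {
    "my_dataframe": {
        "type": "dataframe",
        "data": "<pandas DataFrame object>",   # placeholder for the real object
        "filters": "",                         # forced empty for dataframes
        "processing_steps": ["markdown"],      # default for type='dataframe'
        "inclusions": 5,
        "instructions": "Start with a summary of the internal data...",
    },
}
```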
Class Interface
Methods
__init__(self) -> None
Purpose: Initialize a new SimpleDataHandle instance with an empty handlers dictionary
Returns: None
add_data(self, name: str, type: str, data: Any, filters: str = '', processing_steps: List[str] = [], inclusions: int = 10, instructions: str = '') -> None
Purpose: Add a data source to the handler with associated configuration. Automatically sets defaults based on type and converts 'to_vectorstore' type to FAISS vector stores.
Parameters:
- name: Unique identifier for this data source, used as dictionary key
- type: Data type: 'text', 'dataframe', 'vectorstore', 'to_vectorstore', 'db_search', or 'chromaDB'
- data: The actual data object (DataFrame, list of Documents, vector store, etc.)
- filters: Filter criteria for the data (empty string by default, forced empty for dataframes)
- processing_steps: List of processing steps to apply (e.g., ['markdown'], ['similarity']). Defaults set by type.
- inclusions: Number of items to include in processing (default 10)
- instructions: Instructions for how to use this data source in downstream processing. Defaults set by type.
Returns: None (modifies self.handlers dictionary in place)
remove_data(self, name: str) -> None
Purpose: Remove a data source from the handler by name
Parameters:
name: The name/key of the data source to remove
Returns: None (modifies self.handlers dictionary in place, silently does nothing if name not found)
clear_data(self) -> None
Purpose: Remove all data sources from the handler, resetting to empty state
Returns: None (resets self.handlers to empty dictionary)
Attributes
| Name | Type | Description | Scope |
|---|---|---|---|
| handlers | Dict[str, Dict[str, Any]] | Dictionary mapping data source names to their configuration dictionaries. Each configuration contains keys: 'type', 'data', 'filters', 'processing_steps', 'inclusions', 'instructions' | instance |
Dependencies
typing, panel, langchain_community, langchain_openai, uuid, pandas, sentence_transformers, faiss, numpy, neo4j, openai, chromadb, tiktoken, pybtex
Required Imports
from typing import List, Any, Dict
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.docstore.in_memory import InMemoryDocstore
from uuid import uuid4
import faiss
Conditional/Optional Imports
These imports are only needed under specific conditions:
- from langchain_community.embeddings import OpenAIEmbeddings
  Condition: only when adding data with type='to_vectorstore' (required, conditional)
- from langchain_community.vectorstores import FAISS
  Condition: only when adding data with type='to_vectorstore' (required, conditional)
- import faiss
  Condition: only when adding data with type='to_vectorstore' (required, conditional)
Usage Example
# Initialize the data handler
handler = SimpleDataHandle()

# Add a dataframe
import pandas as pd
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
handler.add_data(
    name='my_dataframe',
    type='dataframe',
    data=df,
    inclusions=5
)

# Add documents to be converted to a vector store
from langchain_core.documents import Document
docs = [Document(page_content='text1'), Document(page_content='text2')]
handler.add_data(
    name='my_vectors',
    type='to_vectorstore',
    data=docs
)

# Access stored data
df_config = handler.handlers['my_dataframe']
print(df_config['type'])              # 'dataframe'
print(df_config['processing_steps'])  # ['markdown']

# Remove a data source
handler.remove_data('my_dataframe')

# Clear all data
handler.clear_data()
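The stored entries are meant to be consumed downstream. As a hedged sketch of how a pipeline might branch on the stored 'type' field, the function below is illustrative only; `render_source` and its return values are assumptions, not part of SimpleDataHandle or its codebase:

```python
# Hypothetical downstream dispatch on a handler entry. The real pipeline would
# use the entry's data object (comments show plausible calls).
def render_source(entry: dict) -> str:
    if entry["type"] == "dataframe":
        # e.g. entry["data"].to_markdown(), matching processing_steps=["markdown"]
        return "markdown table"
    if entry["type"] == "vectorstore":
        # e.g. entry["data"].similarity_search(query, k=entry["inclusions"])
        return "similarity hits"
    return "raw text"

print(render_source({"type": "dataframe"}))    # markdown table
print(render_source({"type": "vectorstore"}))  # similarity hits
```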
Best Practices
- Always initialize the class before adding data sources
- Use descriptive unique names for each data source as they serve as dictionary keys
- The 'to_vectorstore' type requires documents in LangChain Document format and will automatically convert them to FAISS vector stores
- Default processing_steps and instructions are automatically set based on data type, but can be overridden
- The handlers dictionary is the primary state - access it directly to retrieve stored configurations
- When using 'to_vectorstore', ensure OpenAI API credentials are configured before calling add_data
- The inclusions parameter (default 10) likely controls how many items to include in processing
- Remove unused data sources with remove_data() to free memory, especially for large vector stores
- Use clear_data() to reset the entire handler state
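One caveat worth noting: entries are stored by plain dictionary assignment, so calling add_data with an existing name silently overwrites the earlier entry. A minimal sketch of a guard against accidental overwrites (the `add_unique` helper is hypothetical, not part of the class):

```python
handlers = {}  # stands in for SimpleDataHandle().handlers

def add_unique(name: str, config: dict) -> None:
    # Hypothetical guard: refuse to overwrite an existing data source.
    if name in handlers:
        raise KeyError(f"data source {name!r} is already registered")
    handlers[name] = config

add_unique("my_dataframe", {"type": "dataframe"})
try:
    add_unique("my_dataframe", {"type": "text"})
except KeyError as exc:
    print(exc)  # reports the duplicate registration
```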
Similar Components
AI-powered semantic similarity - components with related functionality:
- class DataSource (60.9% similar)
- class DataSource_v2 (58.4% similar)
- class DataSource_v1 (57.9% similar)
- class DataProcessor (54.7% similar)
- class DataProcessor_v1 (54.5% similar)