
class SimpleDataHandle

Maturity: 38

A data handler class that manages multiple data sources with different types (dataframes, vector stores, databases) and their associated processing configurations.

File: /tf/active/vicechatdev/OneCo_hybrid_RAG copy.py
Lines: 718 - 787
Complexity: moderate

Purpose

SimpleDataHandle provides a centralized registry for managing heterogeneous data sources in a data processing or RAG (Retrieval-Augmented Generation) pipeline. It stores data along with metadata including type, filters, processing steps, inclusion limits, and instructions for how to use each data source. The class automatically configures default settings based on data type and can convert documents to vector stores using FAISS and OpenAI embeddings.
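As a rough illustration of what the registry holds, the sketch below builds a single entry by hand as a plain dictionary. The source name `lab_results` and all sample values are hypothetical; in the class itself, entries are created by add_data rather than written out like this.

```python
from typing import Any, Dict

# Hypothetical registry with one entry, mirroring the keys that add_data stores.
handlers: Dict[str, Dict[str, Any]] = {
    "lab_results": {
        "type": "dataframe",
        "data": [{"compound": "CCO", "yield": 0.82}],  # stand-in for a DataFrame
        "filters": "",
        "processing_steps": ["markdown"],
        "inclusions": 10,
        "instructions": "Start with a summary of the internal data.",
    }
}

entry = handlers["lab_results"]
print(sorted(entry))  # the six configuration keys, alphabetically
```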

Source Code

class SimpleDataHandle:
     
    def __init__(self):
        self.handlers = {}
        return
     
    def add_data(self, name:str, type:str, data:Any, filters:str="", processing_steps:List[str]=[], inclusions:int=10,instructions:str=""):
        ## Default values for type, filters, processing_steps, instructions
        if type == "":
            type = "text"
        if type=="dataframe":
            filters=""
            if processing_steps==[]:
                processing_steps=["markdown"]
            if instructions=="":
                instructions="""Start with a summary of the internal data, using summary tables when possible. If the internal data is presented as chemical formulas in SMILES format, try to find the corresponding chemical names and properties and report those in your answer.
                            Use them to compare it to other chemical data in the external sources."""
        if type in ("vectorstore", "to_vectorstore"):
            if processing_steps==[]:
                processing_steps=["similarity"]
            if instructions=="":
                instructions="""Provide a summary of the given context data extracted from lab data and reports and from scientific literature, using summary tables when possible.
                            """
        if type =="to_vectorstore":
            embeddings = OpenAIEmbeddings()
            index = faiss.IndexFlatL2(len(embeddings.embed_query("hello world")))
            vector_store = FAISS(
                embedding_function=embeddings,
                docstore=InMemoryDocstore(),
                index_to_docstore_id={},
                index=index
            )
            uuids = [str(uuid4()) for _ in range(len(data))]
            vector_store.add_documents(
            documents=data,  
            ids=uuids,
            )
            data=vector_store
            type="vectorstore"
        if type == "db_search":
            if processing_steps==[]:
                processing_steps=["similarity"]
            if instructions=="":
                instructions="""Provide a summary of the given context data extracted from lab data and reports and from scientific literature, using summary tables when possible.
                            """
        if type=="chromaDB":
            if processing_steps==[]:
                processing_steps=["similarity"]
            if instructions=="":    
                instructions="""Provide a summary of the given context data extracted from lab data and reports and from scientific literature, using summary tables when possible.
                            """
        
        self.handlers[name] = {
            "type" : type,
            "data" : data,
            "filters" : filters,
            # Copy the list so the shared default argument is never aliased
            "processing_steps" : list(processing_steps),
            "inclusions" : inclusions,
            "instructions" : instructions
        }
        return
     
    def remove_data(self, name:str):
        if name in self.handlers:
            del self.handlers[name]
        return
    
    def clear_data(self):
        self.handlers = {}
        return      

Parameters

Name Type Default Kind
bases - -

Parameter Details

__init__: No parameters required. Initializes an empty handlers dictionary to store data sources.

Return Value

The constructor and the add_data, remove_data, and clear_data methods all return None (each ends with a bare return statement). State lives in the handlers dictionary, which maps each data source name to a configuration dictionary with the keys type, data, filters, processing_steps, inclusions, and instructions.

Class Interface

Methods

__init__(self) -> None

Purpose: Initialize a new SimpleDataHandle instance with an empty handlers dictionary

Returns: None

add_data(self, name: str, type: str, data: Any, filters: str = '', processing_steps: List[str] = [], inclusions: int = 10, instructions: str = '') -> None

Purpose: Add a data source to the handler with its associated configuration. Sets defaults based on type. Data passed with type 'to_vectorstore' is converted into a FAISS vector store and then stored under type 'vectorstore'.

Parameters:

  • name: Unique identifier for this data source, used as dictionary key
  • type: Data type: 'text', 'dataframe', 'vectorstore', 'to_vectorstore', 'db_search', or 'chromaDB'
  • data: The actual data object (DataFrame, list of Documents, vector store, etc.)
  • filters: Filter criteria for the data (empty string by default, forced empty for dataframes)
  • processing_steps: List of processing steps to apply (e.g., ['markdown'], ['similarity']). Defaults set by type.
  • inclusions: Number of items to include in processing (default 10)
  • instructions: Instructions for how to use this data source in downstream processing. Defaults set by type.

Returns: None (modifies self.handlers dictionary in place)
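The type-based defaulting can also be pictured as a lookup table. The sketch below is a hypothetical standalone helper, not the class's own code: `resolve_defaults`, `DEFAULT_STEPS`, and the step name `custom_step` are invented for illustration, and it assumes that types without a listed default (such as 'text') keep whatever the caller passed. It leaves out the default instructions and the FAISS conversion.

```python
from typing import Dict, List, Optional, Tuple

# Assumed per-type defaults, read off the parameter descriptions above.
DEFAULT_STEPS: Dict[str, List[str]] = {
    "dataframe": ["markdown"],
    "vectorstore": ["similarity"],
    "to_vectorstore": ["similarity"],
    "db_search": ["similarity"],
    "chromaDB": ["similarity"],
}


def resolve_defaults(
    type_: str,
    filters: str = "",
    processing_steps: Optional[List[str]] = None,
) -> Tuple[str, str, List[str]]:
    """Hypothetical helper: apply type-based defaults and return the result."""
    if type_ == "":
        type_ = "text"  # an empty type falls back to plain text
    if type_ == "dataframe":
        filters = ""  # filters are always cleared for dataframes
    if not processing_steps:
        # Copy so callers never share the table's list objects.
        processing_steps = list(DEFAULT_STEPS.get(type_, []))
    return type_, filters, list(processing_steps)


print(resolve_defaults("dataframe", filters="x > 1"))
print(resolve_defaults("chromaDB", processing_steps=["custom_step"]))
```

A table like this keeps the defaults in one place, which may make it easier to add a new type without another if block.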

remove_data(self, name: str) -> None

Purpose: Remove a data source from the handler by name

Parameters:

  • name: The name/key of the data source to remove

Returns: None (modifies self.handlers dictionary in place, silently does nothing if name not found)
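The silent no-op on an unknown name can be sketched with a plain dictionary. The free function and the registry contents below are hypothetical stand-ins; `dict.pop` with a default is used here as an equivalent idiom to the membership check in the class.

```python
from typing import Any, Dict

# Hypothetical registry standing in for SimpleDataHandle.handlers.
handlers: Dict[str, Dict[str, Any]] = {"my_dataframe": {"type": "dataframe"}}


def remove_data(registry: Dict[str, Dict[str, Any]], name: str) -> None:
    """Remove an entry if present; do nothing (and raise nothing) otherwise."""
    registry.pop(name, None)


remove_data(handlers, "my_dataframe")  # entry exists and is removed
remove_data(handlers, "not_registered")  # unknown name: no KeyError
```

Because nothing is raised or returned, a caller that needs to know whether an entry was actually removed would have to check `name in handler.handlers` beforehand.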

clear_data(self) -> None

Purpose: Remove all data sources from the handler, resetting to empty state

Returns: None (resets self.handlers to empty dictionary)
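One consequence of resetting by assignment is that references to the old dictionary taken earlier are not emptied. The sketch below uses a hypothetical `Registry` class that resets its state the same way (by rebinding the attribute); it is a stand-in, not SimpleDataHandle itself.

```python
from typing import Any, Dict


class Registry:
    """Hypothetical stand-in that resets its state by rebinding the attribute."""

    def __init__(self) -> None:
        self.handlers: Dict[str, Any] = {}

    def clear_data(self) -> None:
        self.handlers = {}  # rebinds the attribute to a new dictionary


registry = Registry()
registry.handlers["my_dataframe"] = {"type": "dataframe"}
alias = registry.handlers  # external reference taken before the reset

registry.clear_data()
# registry.handlers is now empty, but alias still points at the old dictionary
print(registry.handlers, alias)
```

If external references should observe the reset as well, calling `self.handlers.clear()` would empty the existing dictionary in place instead.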

Attributes

handlers (Dict[str, Dict[str, Any]], instance scope): Dictionary mapping data source names to their configuration dictionaries. Each configuration contains the keys 'type', 'data', 'filters', 'processing_steps', 'inclusions', and 'instructions'.
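Since the handlers dictionary is read directly, downstream code will typically iterate over it. The sketch below shows one way to select sources by type; the helper `names_of_type` and the sample registry are hypothetical and only assume the key layout described above.

```python
from typing import Any, Dict, List

# Hypothetical registry contents using the documented key layout.
handlers: Dict[str, Dict[str, Any]] = {
    "my_dataframe": {"type": "dataframe", "inclusions": 5},
    "my_vectors": {"type": "vectorstore", "inclusions": 10},
    "literature": {"type": "vectorstore", "inclusions": 10},
}


def names_of_type(registry: Dict[str, Dict[str, Any]], wanted: str) -> List[str]:
    """Return the names of all registered sources whose 'type' equals wanted."""
    return [name for name, config in registry.items() if config["type"] == wanted]


print(names_of_type(handlers, "vectorstore"))
```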

Dependencies

  • typing
  • panel
  • langchain_community
  • langchain_openai
  • uuid
  • pandas
  • sentence_transformers
  • faiss
  • numpy
  • neo4j
  • openai
  • chromadb
  • tiktoken
  • pybtex

Required Imports

from typing import List, Any, Dict
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_community.docstore.in_memory import InMemoryDocstore
from uuid import uuid4
import faiss

Conditional/Optional Imports

These imports are only needed under specific conditions:

from langchain_community.embeddings import OpenAIEmbeddings

Condition: only when adding data with type='to_vectorstore'
Required (conditional)

from langchain_community.vectorstores import FAISS

Condition: only when adding data with type='to_vectorstore'
Required (conditional)

import faiss

Condition: only when adding data with type='to_vectorstore'
Required (conditional)
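Because these packages matter only for the 'to_vectorstore' path, a caller may want to check for them before attempting a conversion. The sketch below is one possible pre-flight check using the standard library; the function name `missing_vectorstore_deps` is hypothetical, and the module list is an assumption based on the imports listed above.

```python
from importlib.util import find_spec
from typing import List, Sequence


def missing_vectorstore_deps(
    modules: Sequence[str] = ("faiss", "langchain_community"),
) -> List[str]:
    """Return the top-level modules from the list that cannot be found.

    find_spec looks a module up without importing it, so the check itself
    stays cheap even when the packages are heavy.
    """
    return [name for name in modules if find_spec(name) is None]


missing = missing_vectorstore_deps()
if missing:
    print("Install before using type='to_vectorstore':", ", ".join(missing))
```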

Usage Example

# Initialize the data handler
handler = SimpleDataHandle()

# Add a dataframe
import pandas as pd
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
handler.add_data(
    name='my_dataframe',
    type='dataframe',
    data=df,
    inclusions=5
)

# Add documents to be converted to vector store
from langchain_core.documents import Document
docs = [Document(page_content='text1'), Document(page_content='text2')]
handler.add_data(
    name='my_vectors',
    type='to_vectorstore',
    data=docs
)

# Access stored data
df_config = handler.handlers['my_dataframe']
print(df_config['type'])  # 'dataframe'
print(df_config['processing_steps'])  # ['markdown']

# Remove a data source
handler.remove_data('my_dataframe')

# Clear all data
handler.clear_data()

Best Practices

  • Always initialize the class before adding data sources
  • Use descriptive unique names for each data source as they serve as dictionary keys
  • The 'to_vectorstore' type requires documents in LangChain Document format and will automatically convert them to FAISS vector stores
  • Default processing_steps and instructions are automatically set based on data type, but can be overridden
  • The handlers dictionary is the primary state - access it directly to retrieve stored configurations
  • When using 'to_vectorstore', ensure OpenAI API credentials are configured before calling add_data
  • The inclusions parameter (default 10) is stored with the entry but never read by this class; it presumably caps how many items downstream processing includes
  • Remove unused data sources with remove_data() to free memory, especially for large vector stores
  • Use clear_data() to reset the entire handler state
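One further point follows from the add_data signature, which declares `processing_steps: List[str] = []`. A list default is created once and shared between calls. In the source shown the list is only compared and reassigned, so the risk is latent, but the general pitfall and the usual None-based idiom can be sketched as follows. Both functions are hypothetical and are not part of the class.

```python
from typing import List, Optional


def risky(steps: List[str] = []) -> List[str]:
    """Mutates its default argument, so state leaks between calls."""
    steps.append("similarity")
    return steps


def safer(steps: Optional[List[str]] = None) -> List[str]:
    """Builds a fresh list on every call, copying any list the caller passes."""
    steps = [] if steps is None else list(steps)
    steps.append("similarity")
    return steps


first = risky()
second = risky()
# first and second are the same list object, which now holds two items

print(len(second), safer(), safer())
```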

Similar Components

AI-powered semantic similarity - components with related functionality:

  • class DataSource (60.9% similar)
    A dataclass that represents configuration for various data sources, supporting file-based, SQL database, and query-based data access patterns.
    From: /tf/active/vicechatdev/vice_ai/models.py

  • class DataSource_v2 (58.4% similar)
    A dataclass that encapsulates configuration for various data sources including files, SQL databases, and SQL workflow metadata.
    From: /tf/active/vicechatdev/vice_ai/smartstat_models.py

  • class DataSource_v1 (57.9% similar)
    A dataclass that encapsulates configuration for various data sources used in analysis, supporting file-based, SQL database, and query-based data sources.
    From: /tf/active/vicechatdev/vice_ai/models.py

  • class DataProcessor (54.7% similar)
    Handles data loading, validation, and preprocessing.
    From: /tf/active/vicechatdev/full_smartstat/data_processor.py

  • class DataProcessor_v1 (54.5% similar)
    Handles data loading, validation, and preprocessing.
    From: /tf/active/vicechatdev/smartstat/data_processor.py