🔍 Code Extractor

class HashableJSON

Maturity: 68

A JSON encoder extension that generates hashable string representations for a wide variety of Python objects, including those not normally JSON-serializable like sets, numpy arrays, and pandas DataFrames.

File: /tf/active/vicechatdev/patches/util.py
Lines: 151 - 217
Complexity: complex

Purpose

HashableJSON extends json.JSONEncoder to create unique, hashable string representations of complex Python objects for use in memoization, caching, and deep equality testing. It handles standard JSON types plus additional types like sets, datetime objects, numpy arrays, and pandas DataFrames by converting them to hashable representations. For large arrays/DataFrames, it uses sampling to maintain performance. Unrecognized types fall back to using their hash() or id().
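The memoization workflow described here is the one used by the deephash function listed under Similar Components: serialize with the encoder, then hash the resulting string. A minimal sketch of that pattern, using the stdlib encoder in place of HashableJSON so the snippet runs standalone:

```python
import hashlib
import json

def deephash(obj, encoder_cls=json.JSONEncoder):
    # Serialize the object to a canonical JSON string, then hash it.
    # In the real module the encoder would be HashableJSON; the stdlib
    # encoder is substituted here so this sketch has no dependencies.
    payload = json.dumps(obj, cls=encoder_cls, sort_keys=True)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

# Equal-by-value inputs produce the same digest, regardless of key order
a = deephash({"x": 1, "y": [2, 3]})
b = deephash({"y": [2, 3], "x": 1})
print(a == b)  # -> True
```

With sort_keys=True the digest depends only on the values, which is what makes the string usable as a cache key.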

Source Code

class HashableJSON(json.JSONEncoder):
    """
    Extends JSONEncoder to generate a hashable string for as many types
    of object as possible including nested objects and objects that are
    not normally hashable. The purpose of this class is to generate
    unique strings that once hashed are suitable for use in memoization
    and other cases where deep equality must be tested without storing
    the entire object.

    By default JSONEncoder supports booleans, numbers, strings, lists,
    tuples and dictionaries. In order to support other types such as
    sets, datetime objects and mutable objects such as pandas Dataframes
    or numpy arrays, HashableJSON has to convert these types to
    datastructures that can normally be represented as JSON.

    Support for other object types may need to be introduced in
    future. By default, unrecognized object types are represented by
    their id.

    One limitation of this approach is that dictionaries with composite
    keys (e.g. tuples) are not supported due to the JSON spec.
    """
    string_hashable = (dt.datetime,)
    repr_hashable = ()

    def default(self, obj):
        if isinstance(obj, set):
            return hash(frozenset(obj))
        elif isinstance(obj, np.ndarray):
            h = hashlib.new("md5")
            for s in obj.shape:
                h.update(_int_to_bytes(s))
            if obj.size >= _NP_SIZE_LARGE:
                state = np.random.RandomState(0)
                obj = state.choice(obj.flat, size=_NP_SAMPLE_SIZE)
            h.update(obj.tobytes())
            return h.hexdigest()
        if pd and isinstance(obj, (pd.Series, pd.DataFrame)):
            if len(obj) > _PANDAS_ROWS_LARGE:
                obj = obj.sample(n=_PANDAS_SAMPLE_SIZE, random_state=0)
            try:
                pd_values = list(pd.util.hash_pandas_object(obj, index=True).values)
            except TypeError:
                # Use pickle if pandas cannot hash the object for example if
                # it contains unhashable objects.
                pd_values = [pickle.dumps(obj, pickle.HIGHEST_PROTOCOL)]
            if isinstance(obj, pd.Series):
                columns = [obj.name]
            elif isinstance(obj.columns, pd.MultiIndex):
                columns = [name for cols in obj.columns for name in cols]
            else:
                columns = list(obj.columns)
            all_vals = pd_values + columns + list(obj.index.names)
            h = hashlib.md5()
            for val in all_vals:
                if not isinstance(val, bytes):
                    val = str(val).encode("utf-8")
                h.update(val)
            return h.hexdigest()
        elif isinstance(obj, self.string_hashable):
            return str(obj)
        elif isinstance(obj, self.repr_hashable):
            return repr(obj)
        try:
            return hash(obj)
        except:
            return id(obj)
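The composite-key limitation called out in the docstring can be demonstrated with the standard library alone: default() is only consulted for values, never for dictionary keys, so no custom encoder can rescue a dict keyed by tuples.

```python
import json

def try_tuple_key():
    # Per the JSON spec, object keys must be strings (or types json
    # coerces to strings); tuple keys raise TypeError before any
    # encoder's default() is ever called.
    try:
        json.dumps({(1, 2): "value"})
        return "encoded"
    except TypeError:
        return "rejected"

print(try_tuple_key())  # -> rejected
```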

Parameters

Name    Type               Default   Kind
bases   json.JSONEncoder   -

Parameter Details

obj: The object to be converted to a hashable representation. This parameter is used in the default() method and can be any Python object including sets, numpy arrays, pandas Series/DataFrames, datetime objects, or any other type.

Return Value

Instantiating the class returns a HashableJSON encoder instance. The default() method returns a hashable representation of the input object: the hash of an equivalent frozenset for sets, an MD5 hexdigest string for numpy arrays and pandas objects, a string representation for datetime objects, and either hash() or id() for unrecognized types.
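The set branch, for instance, relies on frozenset hashing being order-insensitive, so equal sets always map to the same integer within a process:

```python
# Order-insensitive set hashing, as used by the set branch of
# default(): equal sets produce identical frozenset hashes.
h1 = hash(frozenset({1, 2, 3}))
h2 = hash(frozenset({3, 2, 1}))
print(h1 == h2)  # -> True
```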

Class Interface

Methods

default(self, obj) -> Union[int, str]

Purpose: Converts non-standard JSON types to hashable representations. This method is called by JSONEncoder for objects that cannot be serialized by the default encoder.

Parameters:

  • obj: The object to convert to a hashable representation. Can be a set, numpy array, pandas Series/DataFrame, datetime object, or any other Python object.

Returns: Returns a hashable representation: integer hash for sets and hashable objects, MD5 hexdigest string for numpy arrays and pandas objects, string representation for datetime objects, or id() for unrecognized types.

Attributes

Name              Type    Description                                                                Scope
string_hashable   tuple   Types converted to strings using str(); default is (datetime.datetime,)   class
repr_hashable     tuple   Types converted to strings using repr(); default is the empty tuple       class

Dependencies

  • json
  • hashlib
  • numpy
  • pandas
  • pickle
  • datetime

Required Imports

import json
import hashlib
import numpy as np
import pandas as pd
import pickle
import datetime as dt

Conditional/Optional Imports

These imports are only needed under specific conditions:

import pandas as pd

Condition: Required for handling pandas Series and DataFrame objects. The code checks 'if pd' before using pandas functionality, so pandas is optional: when it is not installed, pd can be bound to None and the pandas branch is skipped.
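The 'if pd' guard implies the module binds pd to None when pandas is unavailable. A sketch of that guarded-import pattern (the describe helper here is illustrative, not part of the module):

```python
# Optional-dependency pattern implied by the `if pd and isinstance(...)`
# check in default(): pandas is imported if present, else pd is None.
try:
    import pandas as pd
except ImportError:
    pd = None

def describe(obj):
    # The pandas isinstance check is only reached when pd is truthy,
    # so this function works whether or not pandas is installed.
    if pd is not None and isinstance(obj, (pd.Series, pd.DataFrame)):
        return "pandas object"
    return type(obj).__name__

print(describe([1, 2, 3]))  # -> list
```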

Usage Example

import json
import hashlib
import numpy as np
import pandas as pd
import datetime as dt

# HashableJSON itself must be imported or defined before this example
# runs, e.g. from the File path listed above.

# Define required constants
_NP_SIZE_LARGE = 1000000
_NP_SAMPLE_SIZE = 100000
_PANDAS_ROWS_LARGE = 400000
_PANDAS_SAMPLE_SIZE = 100000

def _int_to_bytes(x):
    return x.to_bytes((x.bit_length() + 7) // 8, 'big')

# Instantiate the encoder
encoder = HashableJSON()

# Create various objects to hash
data = {
    'numbers': [1, 2, 3],
    'set': {1, 2, 3},
    'array': np.array([1, 2, 3]),
    'dataframe': pd.DataFrame({'a': [1, 2], 'b': [3, 4]}),
    'datetime': dt.datetime.now()
}

# Encode to JSON string
json_str = json.dumps(data, cls=HashableJSON)

# Generate hash for memoization
hash_value = hashlib.md5(json_str.encode()).hexdigest()
print(f"Hash: {hash_value}")

# Use directly with default method
array_hash = encoder.default(np.array([1, 2, 3]))
print(f"Array hash: {array_hash}")

Best Practices

  • Use HashableJSON as the cls parameter when calling json.dumps() to automatically handle complex objects
  • Be aware that large numpy arrays (>= _NP_SIZE_LARGE elements) and pandas DataFrames (> _PANDAS_ROWS_LARGE rows) are sampled rather than fully hashed for performance
  • The sampling uses fixed random seeds (0) to ensure deterministic hashing across runs
  • Dictionaries with composite keys (e.g., tuples) are not supported due to JSON specification limitations
  • For unrecognized object types, the encoder first tries hash() and only falls back to id(); the id() fallback makes the result instance-specific, not value-specific
  • When using for memoization, ensure the constants _NP_SIZE_LARGE, _NP_SAMPLE_SIZE, _PANDAS_ROWS_LARGE, and _PANDAS_SAMPLE_SIZE are appropriately configured for your use case
  • The class uses MD5 hashing for numpy arrays and pandas objects - while not cryptographically secure, it's sufficient for memoization purposes
  • Extend string_hashable or repr_hashable class attributes to add custom types that should be converted via str() or repr()
  • The encoder attempts pandas hashing first and falls back to pickle for unhashable pandas objects
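The extension point described in the string_hashable/repr_hashable bullet can be sketched as a subclass. MyHashableJSON is a hypothetical name, and the sketch subclasses json.JSONEncoder directly (reimplementing only the two relevant branches of default()) so it runs without the original module:

```python
import datetime as dt
import decimal
import json

class MyHashableJSON(json.JSONEncoder):
    # Extended tuples: dates go through str(), Decimals through repr().
    string_hashable = (dt.datetime, dt.date)
    repr_hashable = (decimal.Decimal,)

    def default(self, obj):
        if isinstance(obj, self.string_hashable):
            return str(obj)
        if isinstance(obj, self.repr_hashable):
            return repr(obj)
        return super().default(obj)

s = json.dumps({"when": dt.date(2024, 1, 1),
                "amount": decimal.Decimal("1.50")},
               cls=MyHashableJSON)
print(s)  # -> {"when": "2024-01-01", "amount": "Decimal('1.50')"}
```

Because the dispatch reads the tuples via self, subclasses only need to override the class attributes, not default() itself, when HashableJSON is available to inherit from.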

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function deephash 69.0% similar

    Computes a hash value for any Python object by serializing it to JSON using a custom HashableJSON encoder and returning the hash of the resulting string.

    From: /tf/active/vicechatdev/patches/util.py
  • class Neo4jEncoder 59.4% similar

    A custom JSON encoder that extends json.JSONEncoder to handle Neo4j-specific data types and Python objects that are not natively JSON serializable.

    From: /tf/active/vicechatdev/neo4j_schema_report.py
  • function clean_for_json_v10 54.3% similar

    Recursively converts Python objects containing NumPy and Pandas data types into JSON-serializable native Python types.

    From: /tf/active/vicechatdev/vice_ai/smartstat_scripts/c385e1f5-fbf6-4832-8fd4-78ef8b72fc53/project_1/analysis.py
  • function clean_for_json_v12 53.6% similar

    Recursively sanitizes Python objects to make them JSON-serializable by converting non-serializable types (NumPy types, pandas objects, tuples, NaN/Inf values) into JSON-compatible formats.

    From: /tf/active/vicechatdev/vice_ai/smartstat_scripts/290a39ea-3ae0-4301-8e2f-9d5c3bf80e6e/project_3/analysis.py
  • function safe_json_dumps 53.5% similar

    Safely serializes Python objects to JSON format, handling NaN values and datetime objects that would otherwise cause serialization errors.

    From: /tf/active/vicechatdev/full_smartstat/services.py