class HashableJSON
A JSON encoder extension that generates hashable string representations for a wide variety of Python objects, including those not normally JSON-serializable like sets, numpy arrays, and pandas DataFrames.
Source: /tf/active/vicechatdev/patches/util.py, lines 151-217
Complexity: complex
Purpose
HashableJSON extends json.JSONEncoder to create unique, hashable string representations of complex Python objects for use in memoization, caching, and deep equality testing. It handles standard JSON types plus additional types like sets, datetime objects, numpy arrays, and pandas DataFrames by converting them to hashable representations. For large arrays/DataFrames, it uses sampling to maintain performance. Unrecognized types fall back to using their hash() or id().
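The memoization pattern described above can be sketched with a simplified stand-in encoder. `MiniHashableJSON` and `memoize` below are hypothetical names handling only sets and datetimes; the real class covers many more types, including numpy arrays and pandas objects:

```python
import json
import hashlib
import datetime as dt
import functools

class MiniHashableJSON(json.JSONEncoder):
    """Simplified stand-in for HashableJSON: handles sets and datetimes only."""
    def default(self, obj):
        if isinstance(obj, set):
            return hash(frozenset(obj))   # order-independent hash for sets
        if isinstance(obj, dt.datetime):
            return str(obj)               # string form is stable and hashable
        try:
            return hash(obj)
        except TypeError:
            return id(obj)                # last resort: instance identity

def memoize(func):
    """Cache results keyed by an MD5 of the JSON-encoded arguments."""
    cache = {}
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        payload = json.dumps([args, kwargs], cls=MiniHashableJSON, sort_keys=True)
        key = hashlib.md5(payload.encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = func(*args, **kwargs)
        return cache[key]
    return wrapper

calls = []

@memoize
def total(values):
    calls.append(1)          # track how often the body actually runs
    return sum(values)

print(total({1, 2, 3}))      # -> 6, computed
print(total({3, 2, 1}))      # -> 6, served from cache: same set, same key
print(len(calls))            # -> 1
```

Because `{1, 2, 3}` and `{3, 2, 1}` are the same set, both calls produce the same frozenset hash and therefore the same cache key.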
Source Code
```python
class HashableJSON(json.JSONEncoder):
    """
    Extends JSONEncoder to generate a hashable string for as many types
    of object as possible including nested objects and objects that are
    not normally hashable. The purpose of this class is to generate
    unique strings that once hashed are suitable for use in memoization
    and other cases where deep equality must be tested without storing
    the entire object.

    By default JSONEncoder supports booleans, numbers, strings, lists,
    tuples and dictionaries. In order to support other types such as
    sets, datetime objects and mutable objects such as pandas Dataframes
    or numpy arrays, HashableJSON has to convert these types to
    datastructures that can normally be represented as JSON.

    Support for other object types may need to be introduced in
    future. By default, unrecognized object types are represented by
    their id.

    One limitation of this approach is that dictionaries with composite
    keys (e.g. tuples) are not supported due to the JSON spec.
    """

    string_hashable = (dt.datetime,)
    repr_hashable = ()

    def default(self, obj):
        if isinstance(obj, set):
            return hash(frozenset(obj))
        elif isinstance(obj, np.ndarray):
            h = hashlib.new("md5")
            for s in obj.shape:
                h.update(_int_to_bytes(s))
            if obj.size >= _NP_SIZE_LARGE:
                state = np.random.RandomState(0)
                obj = state.choice(obj.flat, size=_NP_SAMPLE_SIZE)
            h.update(obj.tobytes())
            return h.hexdigest()
        if pd and isinstance(obj, (pd.Series, pd.DataFrame)):
            if len(obj) > _PANDAS_ROWS_LARGE:
                obj = obj.sample(n=_PANDAS_SAMPLE_SIZE, random_state=0)
            try:
                pd_values = list(pd.util.hash_pandas_object(obj, index=True).values)
            except TypeError:
                # Use pickle if pandas cannot hash the object, for example if
                # it contains unhashable objects.
                pd_values = [pickle.dumps(obj, pickle.HIGHEST_PROTOCOL)]
            if isinstance(obj, pd.Series):
                columns = [obj.name]
            elif isinstance(obj.columns, pd.MultiIndex):
                columns = [name for cols in obj.columns for name in cols]
            else:
                columns = list(obj.columns)
            all_vals = pd_values + columns + list(obj.index.names)
            h = hashlib.md5()
            for val in all_vals:
                if not isinstance(val, bytes):
                    val = str(val).encode("utf-8")
                h.update(val)
            return h.hexdigest()
        elif isinstance(obj, self.string_hashable):
            return str(obj)
        elif isinstance(obj, self.repr_hashable):
            return repr(obj)
        try:
            return hash(obj)
        except:
            return id(obj)
```
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| bases | json.JSONEncoder | - | - |
Parameter Details
obj: The object to be converted to a hashable representation. This parameter is used in the default() method and can be any Python object including sets, numpy arrays, pandas Series/DataFrames, datetime objects, or any other type.
Return Value
Instantiating the class yields a HashableJSON encoder instance for use with json.dumps(cls=...). The default() method returns a hashable representation of the input object: a hash of the equivalent frozenset for sets, an MD5 hexdigest string for numpy arrays and pandas objects, a string representation for datetime objects, and for unrecognized types either their hash() or, failing that, their id().
Class Interface
Methods
default(self, obj) -> Union[int, str]
Purpose: Converts non-standard JSON types to hashable representations. This method is called by JSONEncoder for objects that cannot be serialized by the default encoder.
Parameters:
obj: The object to convert to a hashable representation. Can be a set, numpy array, pandas Series/DataFrame, datetime object, or any other Python object.
Returns: Returns a hashable representation: integer hash for sets and hashable objects, MD5 hexdigest string for numpy arrays and pandas objects, string representation for datetime objects, or id() for unrecognized types.
Attributes
| Name | Type | Description | Scope |
|---|---|---|---|
| string_hashable | tuple | Tuple of types that should be converted to strings using str(). Default contains datetime.datetime. | class |
| repr_hashable | tuple | Tuple of types that should be converted to strings using repr(). Default is an empty tuple. | class |
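Per the table above, a subclass can extend these tuples to support custom types. A minimal sketch follows; `Point` is a hypothetical type, and `MiniEncoder` is a stand-in reproducing only the string_hashable/repr_hashable dispatch (a real subclass would extend HashableJSON itself):

```python
import json
import datetime as dt

class Point:
    """Hypothetical custom type with a stable repr()."""
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __repr__(self):
        return f"Point({self.x}, {self.y})"

class MiniEncoder(json.JSONEncoder):
    # Types converted via str() and repr() respectively, mirroring
    # HashableJSON's class attributes.
    string_hashable = (dt.datetime,)
    repr_hashable = (Point,)

    def default(self, obj):
        if isinstance(obj, self.string_hashable):
            return str(obj)
        if isinstance(obj, self.repr_hashable):
            return repr(obj)
        return super().default(obj)

print(json.dumps({"p": Point(1, 2)}, cls=MiniEncoder))
# -> {"p": "Point(1, 2)"}
```

For this to produce stable hashes, the custom type's repr() (or str()) must itself be deterministic and value-based.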
Dependencies
json, hashlib, numpy, pandas, pickle, datetime
Required Imports
```python
import json
import hashlib
import numpy as np
import pandas as pd
import pickle
import datetime as dt
```
Conditional/Optional Imports
These imports are only needed under specific conditions:

```python
import pandas as pd
```

Condition: required for handling pandas Series and DataFrame objects; the code checks `if pd` before using pandas functionality. Optional.

Usage Example
```python
import json
import hashlib
import numpy as np
import pandas as pd
import datetime as dt

# HashableJSON must be importable from the module where it is defined,
# e.g. from util import HashableJSON

# Define the required module-level constants
_NP_SIZE_LARGE = 1000000
_NP_SAMPLE_SIZE = 100000
_PANDAS_ROWS_LARGE = 400000
_PANDAS_SAMPLE_SIZE = 100000

def _int_to_bytes(x):
    return x.to_bytes((x.bit_length() + 7) // 8, 'big')

# Instantiate the encoder
encoder = HashableJSON()

# Create various objects to hash
data = {
    'numbers': [1, 2, 3],
    'set': {1, 2, 3},
    'array': np.array([1, 2, 3]),
    'dataframe': pd.DataFrame({'a': [1, 2], 'b': [3, 4]}),
    'datetime': dt.datetime.now()
}

# Encode to a JSON string
json_str = json.dumps(data, cls=HashableJSON)

# Generate a hash for memoization
hash_value = hashlib.md5(json_str.encode()).hexdigest()
print(f"Hash: {hash_value}")

# Use the default method directly
array_hash = encoder.default(np.array([1, 2, 3]))
print(f"Array hash: {array_hash}")
```
Best Practices
- Use HashableJSON as the cls parameter when calling json.dumps() to automatically handle complex objects
- Be aware that large numpy arrays (>= _NP_SIZE_LARGE elements) and pandas DataFrames (> _PANDAS_ROWS_LARGE rows) are sampled rather than fully hashed for performance
- The sampling uses fixed random seeds (0) to ensure deterministic hashing across runs
- Dictionaries with composite keys (e.g., tuples) are not supported due to JSON specification limitations
- For unrecognized object types, the encoder falls back to id() which means the hash will be instance-specific, not value-specific
- When using for memoization, ensure the constants _NP_SIZE_LARGE, _NP_SAMPLE_SIZE, _PANDAS_ROWS_LARGE, and _PANDAS_SAMPLE_SIZE are appropriately configured for your use case
- The class uses MD5 hashing for numpy arrays and pandas objects; while not cryptographically secure, MD5 is sufficient for memoization purposes
- Extend string_hashable or repr_hashable class attributes to add custom types that should be converted via str() or repr()
- The encoder attempts pandas hashing first and falls back to pickle for unhashable pandas objects
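The id() fallback caveat above can be demonstrated with a small stand-alone sketch. `Unhashable` and `IdFallbackEncoder` are hypothetical names reproducing only the final fallback branch of default():

```python
import json

class Unhashable:
    """A value type that defeats hash(): defining __eq__ without __hash__
    makes instances unhashable."""
    def __init__(self, v):
        self.v = v
    def __eq__(self, other):
        return self.v == other.v

class IdFallbackEncoder(json.JSONEncoder):
    """Stand-in showing only the hash()/id() fallback branch."""
    def default(self, obj):
        try:
            return hash(obj)
        except TypeError:
            return id(obj)   # instance identity, not value identity

a, b = Unhashable(1), Unhashable(1)
enc = IdFallbackEncoder()
print(a == b)                            # -> True: equal by value
print(enc.default(a) == enc.default(b))  # -> False: keyed by id()
```

This is why memoization keyed on such objects will miss the cache for equal-but-distinct instances; add the type to string_hashable or repr_hashable to restore value-based keys.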
Similar Components
AI-powered semantic similarity - components with related functionality:
- function deephash (69.0% similar)
- class Neo4jEncoder (59.4% similar)
- function clean_for_json_v10 (54.3% similar)
- function clean_for_json_v12 (53.6% similar)
- function safe_json_dumps (53.5% similar)