class HashableJSON
A JSON encoder extension that generates hashable string representations for a wide variety of Python objects, including those not normally JSON-serializable like sets, numpy arrays, and pandas DataFrames.
Source: /tf/active/vicechatdev/patches/util.py, lines 151-217
Complexity: complex
Purpose
HashableJSON extends json.JSONEncoder to create unique, hashable string representations of complex Python objects for use in memoization, caching, and deep equality testing. It handles standard JSON types plus additional types like sets, datetime objects, numpy arrays, and pandas DataFrames by converting them to hashable representations. For large arrays/DataFrames, it uses sampling to maintain performance. Unrecognized types fall back to using their hash() or id().
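The memoization pattern described above can be sketched with a simplified stand-in encoder. `MiniHashableJSON` and `memoize` below are hypothetical names handling only sets and datetimes; the real class covers many more types, including numpy arrays and pandas objects:

```python
import json
import hashlib
import datetime as dt
import functools

class MiniHashableJSON(json.JSONEncoder):
    """Simplified stand-in for HashableJSON: handles sets and datetimes only."""
    def default(self, obj):
        if isinstance(obj, set):
            return hash(frozenset(obj))   # order-independent hash for sets
        if isinstance(obj, dt.datetime):
            return str(obj)               # string form is stable and hashable
        try:
            return hash(obj)
        except TypeError:
            return id(obj)                # last resort: instance identity

def memoize(func):
    """Cache results keyed by an MD5 of the JSON-encoded arguments."""
    cache = {}
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        payload = json.dumps([args, kwargs], cls=MiniHashableJSON, sort_keys=True)
        key = hashlib.md5(payload.encode("utf-8")).hexdigest()
        if key not in cache:
            cache[key] = func(*args, **kwargs)
        return cache[key]
    return wrapper

calls = []

@memoize
def total(values):
    calls.append(1)          # track how often the body actually runs
    return sum(values)

print(total({1, 2, 3}))      # -> 6, computed
print(total({3, 2, 1}))      # -> 6, served from cache: same set, same key
print(len(calls))            # -> 1
```

Because `{1, 2, 3}` and `{3, 2, 1}` are the same set, both calls produce the same frozenset hash and therefore the same cache key.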
Source Code
```python
class HashableJSON(json.JSONEncoder):
    """
    Extends JSONEncoder to generate a hashable string for as many types
    of object as possible including nested objects and objects that are
    not normally hashable. The purpose of this class is to generate
    unique strings that once hashed are suitable for use in memoization
    and other cases where deep equality must be tested without storing
    the entire object.

    By default JSONEncoder supports booleans, numbers, strings, lists,
    tuples and dictionaries. In order to support other types such as
    sets, datetime objects and mutable objects such as pandas Dataframes
    or numpy arrays, HashableJSON has to convert these types to
    datastructures that can normally be represented as JSON.

    Support for other object types may need to be introduced in
    future. By default, unrecognized object types are represented by
    their id.

    One limitation of this approach is that dictionaries with composite
    keys (e.g. tuples) are not supported due to the JSON spec.
    """

    string_hashable = (dt.datetime,)
    repr_hashable = ()

    def default(self, obj):
        if isinstance(obj, set):
            return hash(frozenset(obj))
        elif isinstance(obj, np.ndarray):
            h = hashlib.new("md5")
            for s in obj.shape:
                h.update(_int_to_bytes(s))
            if obj.size >= _NP_SIZE_LARGE:
                state = np.random.RandomState(0)
                obj = state.choice(obj.flat, size=_NP_SAMPLE_SIZE)
            h.update(obj.tobytes())
            return h.hexdigest()
        if pd and isinstance(obj, (pd.Series, pd.DataFrame)):
            if len(obj) > _PANDAS_ROWS_LARGE:
                obj = obj.sample(n=_PANDAS_SAMPLE_SIZE, random_state=0)
            try:
                pd_values = list(pd.util.hash_pandas_object(obj, index=True).values)
            except TypeError:
                # Use pickle if pandas cannot hash the object, for example if
                # it contains unhashable objects.
                pd_values = [pickle.dumps(obj, pickle.HIGHEST_PROTOCOL)]
            if isinstance(obj, pd.Series):
                columns = [obj.name]
            elif isinstance(obj.columns, pd.MultiIndex):
                columns = [name for cols in obj.columns for name in cols]
            else:
                columns = list(obj.columns)
            all_vals = pd_values + columns + list(obj.index.names)
            h = hashlib.md5()
            for val in all_vals:
                if not isinstance(val, bytes):
                    val = str(val).encode("utf-8")
                h.update(val)
            return h.hexdigest()
        elif isinstance(obj, self.string_hashable):
            return str(obj)
        elif isinstance(obj, self.repr_hashable):
            return repr(obj)
        try:
            return hash(obj)
        except:
            return id(obj)
```
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| bases | json.JSONEncoder | - | - |
Parameter Details
obj: The object to be converted to a hashable representation. This parameter is used in the default() method and can be any Python object including sets, numpy arrays, pandas Series/DataFrames, datetime objects, or any other type.
Return Value
Instantiating the class yields a HashableJSON encoder instance for use with json.dumps(cls=...). The default() method returns a hashable representation of the input object: a hash of the equivalent frozenset for sets, an MD5 hexdigest string for numpy arrays and pandas objects, a string representation for datetime objects, and for unrecognized types either their hash() or, failing that, their id().
Class Interface
Methods
default(self, obj) -> Union[int, str]
Purpose: Converts non-standard JSON types to hashable representations. This method is called by JSONEncoder for objects that cannot be serialized by the default encoder.
Parameters:
obj: The object to convert to a hashable representation. Can be a set, numpy array, pandas Series/DataFrame, datetime object, or any other Python object.
Returns: Returns a hashable representation: integer hash for sets and hashable objects, MD5 hexdigest string for numpy arrays and pandas objects, string representation for datetime objects, or id() for unrecognized types.
Attributes
| Name | Type | Description | Scope |
|---|---|---|---|
| string_hashable | tuple | Tuple of types that should be converted to strings using str(). Default contains datetime.datetime. | class |
| repr_hashable | tuple | Tuple of types that should be converted to strings using repr(). Default is an empty tuple. | class |
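Per the table above, a subclass can extend these tuples to support custom types. A minimal sketch follows; `Point` is a hypothetical type, and `MiniEncoder` is a stand-in reproducing only the string_hashable/repr_hashable dispatch (a real subclass would extend HashableJSON itself):

```python
import json
import datetime as dt

class Point:
    """Hypothetical custom type with a stable repr()."""
    def __init__(self, x, y):
        self.x, self.y = x, y
    def __repr__(self):
        return f"Point({self.x}, {self.y})"

class MiniEncoder(json.JSONEncoder):
    # Types converted via str() and repr() respectively, mirroring
    # HashableJSON's class attributes.
    string_hashable = (dt.datetime,)
    repr_hashable = (Point,)

    def default(self, obj):
        if isinstance(obj, self.string_hashable):
            return str(obj)
        if isinstance(obj, self.repr_hashable):
            return repr(obj)
        return super().default(obj)

print(json.dumps({"p": Point(1, 2)}, cls=MiniEncoder))
# -> {"p": "Point(1, 2)"}
```

For this to produce stable hashes, the custom type's repr() (or str()) must itself be deterministic and value-based.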
Dependencies
json, hashlib, numpy, pandas, pickle, datetime
Required Imports
```python
import json
import hashlib
import numpy as np
import pandas as pd
import pickle
import datetime as dt
```
Conditional/Optional Imports
These imports are only needed under specific conditions:

```python
import pandas as pd
```

Condition: required for handling pandas Series and DataFrame objects; the code checks `if pd` before using pandas functionality. Optional.

Usage Example
```python
import json
import hashlib
import numpy as np
import pandas as pd
import datetime as dt

# HashableJSON must be importable from the module where it is defined,
# e.g. from util import HashableJSON

# Define the required module-level constants
_NP_SIZE_LARGE = 1000000
_NP_SAMPLE_SIZE = 100000
_PANDAS_ROWS_LARGE = 400000
_PANDAS_SAMPLE_SIZE = 100000

def _int_to_bytes(x):
    return x.to_bytes((x.bit_length() + 7) // 8, 'big')

# Instantiate the encoder
encoder = HashableJSON()

# Create various objects to hash
data = {
    'numbers': [1, 2, 3],
    'set': {1, 2, 3},
    'array': np.array([1, 2, 3]),
    'dataframe': pd.DataFrame({'a': [1, 2], 'b': [3, 4]}),
    'datetime': dt.datetime.now()
}

# Encode to a JSON string
json_str = json.dumps(data, cls=HashableJSON)

# Generate a hash for memoization
hash_value = hashlib.md5(json_str.encode()).hexdigest()
print(f"Hash: {hash_value}")

# Use the default method directly
array_hash = encoder.default(np.array([1, 2, 3]))
print(f"Array hash: {array_hash}")
```
Best Practices
- Use HashableJSON as the cls parameter when calling json.dumps() to automatically handle complex objects
- Be aware that large numpy arrays (>= _NP_SIZE_LARGE elements) and pandas DataFrames (> _PANDAS_ROWS_LARGE rows) are sampled rather than fully hashed for performance
- The sampling uses fixed random seeds (0) to ensure deterministic hashing across runs
- Dictionaries with composite keys (e.g., tuples) are not supported due to JSON specification limitations
- For unrecognized object types, the encoder falls back to id() which means the hash will be instance-specific, not value-specific
- When using for memoization, ensure the constants _NP_SIZE_LARGE, _NP_SAMPLE_SIZE, _PANDAS_ROWS_LARGE, and _PANDAS_SAMPLE_SIZE are appropriately configured for your use case
- The class uses MD5 hashing for numpy arrays and pandas objects; while not cryptographically secure, MD5 is sufficient for memoization purposes
- Extend string_hashable or repr_hashable class attributes to add custom types that should be converted via str() or repr()
- The encoder attempts pandas hashing first and falls back to pickle for unhashable pandas objects
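The id() fallback caveat above can be demonstrated with a small stand-alone sketch. `Unhashable` and `IdFallbackEncoder` are hypothetical names reproducing only the final fallback branch of default():

```python
import json

class Unhashable:
    """A value type that defeats hash(): defining __eq__ without __hash__
    makes instances unhashable."""
    def __init__(self, v):
        self.v = v
    def __eq__(self, other):
        return self.v == other.v

class IdFallbackEncoder(json.JSONEncoder):
    """Stand-in showing only the hash()/id() fallback branch."""
    def default(self, obj):
        try:
            return hash(obj)
        except TypeError:
            return id(obj)   # instance identity, not value identity

a, b = Unhashable(1), Unhashable(1)
enc = IdFallbackEncoder()
print(a == b)                            # -> True: equal by value
print(enc.default(a) == enc.default(b))  # -> False: keyed by id()
```

This is why memoization keyed on such objects will miss the cache for equal-but-distinct instances; add the type to string_hashable or repr_hashable to restore value-based keys.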
Similar Components
AI-powered semantic similarity - components with related functionality:
- function deephash (69.0% similar)
- class Neo4jEncoder (59.4% similar)
- function clean_for_json_v10 (54.3% similar)
- function clean_for_json_v12 (53.6% similar)
- function safe_json_dumps (53.5% similar)