docs for zanj v0.4.0
The ZANJ format is meant to be a way of saving arbitrary objects to disk, in a way that is flexible, allows keeping configuration and data together, and is human readable. It is very loosely inspired by HDF5 and the derived exdir format, and the implementation is inspired by npz files.
- You can take any SerializableDataclass from the muutils library and save it to disk: any large arrays or lists will be stored efficiently as external files in the zip archive, while the basic structure and metadata will be stored in readable JSON files.
- You can also use ConfiguredModel, which inherits from torch.nn.Module and lets you save not just your model weights, but all required configuration information, plus any other metadata (like training logs), in a single file.
This library was originally a module in muutils.
Available on PyPI as zanj
pip install zanj
You can find a runnable example of this in demo.ipynb
Any SerializableDataclass of basic types can be saved as zanj:
import numpy as np
import pandas as pd
from muutils.json_serialize import SerializableDataclass, serializable_dataclass, serializable_field
from zanj import ZANJ

@serializable_dataclass
class BasicZanj(SerializableDataclass):
    a: str
    q: int = 42
    c: list[int] = serializable_field(default_factory=list)

# initialize a zanj reader/writer
zj = ZANJ()

# create an instance
instance: BasicZanj = BasicZanj("hello", 42, [1, 2, 3])
path: str = "tests/junk_data/path_to_save_instance.zanj"
zj.save(instance, path)
recovered: BasicZanj = zj.read(path)
ZANJ will intelligently handle nested serializable dataclasses, numpy arrays, pytorch tensors, and pandas dataframes:
import torch
import pandas as pd

@serializable_dataclass
class Complicated(SerializableDataclass):
    name: str
    arr1: np.ndarray
    arr2: np.ndarray
    iris_data: pd.DataFrame
    brain_data: pd.DataFrame
    container: list[BasicZanj]
    torch_tensor: torch.Tensor
For custom classes, you can specify a serialization_fn and loading_fn to handle the logic of converting to and from a json-serializable format:
@serializable_dataclass
class Complicated(SerializableDataclass):
    name: str
    device: torch.device = serializable_field(
        # serialization_fn receives the value of this field
        serialization_fn=lambda device: str(device),
        # loading_fn receives the dictionary of the whole class
        loading_fn=lambda data: torch.device(data["device"]),
    )
Note that loading_fn takes the dictionary of the whole class: this is in case you’ve stored data in multiple fields of the dict which are needed to reconstruct the object.
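To make that asymmetry concrete, here is a minimal, library-free sketch. It does not use muutils; FieldSpec, serialize_obj, and load_obj are hypothetical names introduced only for illustration. The serialization function receives a single field's value, while the loading function receives the whole serialized dictionary and can therefore combine several entries:

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class FieldSpec:
    """Hypothetical stand-in for a serializable field (not the muutils API)."""

    name: str
    serialization_fn: Callable[[Any], Any]  # receives the field's value
    loading_fn: Callable[[dict], Any]  # receives the whole serialized dict


def serialize_obj(values: dict, specs: list) -> dict:
    # each field is converted independently, from its own value only
    return {s.name: s.serialization_fn(values[s.name]) for s in specs}


def load_obj(data: dict, specs: list) -> dict:
    # each loader sees every serialized field, so it can combine several of them
    return {s.name: s.loading_fn(data) for s in specs}


specs = [
    FieldSpec("n_cols", lambda v: v, lambda d: d["n_cols"]),
    FieldSpec(
        "matrix",
        # flatten the rows; the row length lives in the separate "n_cols" field
        lambda rows: [x for row in rows for x in row],
        # rebuilding the rows needs both "matrix" and "n_cols" from the dict
        lambda d: [
            d["matrix"][i : i + d["n_cols"]]
            for i in range(0, len(d["matrix"]), d["n_cols"])
        ],
    ),
]
```

Here the "matrix" loader could not rebuild the rows from its own entry alone, which is the situation the whole-dictionary argument is meant to cover.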
First, define a configuration class for your model. This class will hold the parameters for your model and any associated objects (like losses and optimizers). The configuration class should be a subclass of SerializableDataclass and use the serializable_field function to define fields that need special serialization.
Here’s an example that defines the configuration for a small feed-forward network:
from zanj.torchutil import ConfiguredModel, set_config_class

@serializable_dataclass
class MyNNConfig(SerializableDataclass):
    input_dim: int
    hidden_dim: int
    output_dim: int

    # store the activation function by name, reconstruct it by looking it up in torch.nn
    act_fn: torch.nn.Module = serializable_field(
        serialization_fn=lambda x: x.__name__,
        loading_fn=lambda x: getattr(torch.nn, x["act_fn"]),
    )

    # same for the loss function
    loss_kwargs: dict = serializable_field(default_factory=dict)
    loss_factory: torch.nn.modules.loss._Loss = serializable_field(
        default_factory=lambda: torch.nn.CrossEntropyLoss,
        serialization_fn=lambda x: x.__name__,
        loading_fn=lambda x: getattr(torch.nn, x["loss_factory"]),
    )
    loss = property(lambda self: self.loss_factory(**self.loss_kwargs))
Then, define your model class. It should be a subclass of ConfiguredModel, and use the set_config_class decorator to associate it with your configuration class. The __init__ method should take a single argument, which is an instance of your configuration class. You must also call the superclass __init__ method with the configuration instance.
@set_config_class(MyNNConfig)
class MyNN(ConfiguredModel[MyNNConfig]):
    def __init__(self, config: MyNNConfig):
        # call the superclass init!
        # this will store the config in the zanj_model_config field
        super().__init__(config)

        # whatever you want here
        self.net = torch.nn.Sequential(
            torch.nn.Linear(config.input_dim, config.hidden_dim),
            config.act_fn(),
            torch.nn.Linear(config.hidden_dim, config.output_dim),
        )

    def forward(self, x):
        return self.net(x)
You can now create instances of your model, save them to disk, and load them back into memory:
config = MyNNConfig(
    input_dim=10,
    hidden_dim=20,
    output_dim=2,
    act_fn=torch.nn.ReLU,
    loss_kwargs=dict(reduction="mean"),
)

# create your model from the config, and save
model = MyNN(config)
fname = "tests/junk_data/path_to_save_model.zanj"
ZANJ().save(model, fname)

# load by calling the class method `read()`
loaded_model = MyNN.read(fname)

# zanj will actually infer the type of the object in the file
# -- and will warn you if you don't have the correct package installed
loaded_another_way = ZANJ().read(fname)
When initializing a ZANJ object, you can specify some configuration info about saving, such as:
# how big an array or list (including pandas DataFrame) can be before moving it from the core JSON file
external_array_threshold: int = ZANJ_GLOBAL_DEFAULTS.external_array_threshold
external_list_threshold: int = ZANJ_GLOBAL_DEFAULTS.external_list_threshold
# compression settings passed to `zipfile` package
compress: bool | int = ZANJ_GLOBAL_DEFAULTS.compress
# for doing very cursed things in your own custom loading or serialization functions
custom_settings: dict[str, Any] | None = ZANJ_GLOBAL_DEFAULTS.custom_settings
# specify additional serialization handlers
handlers_pre: MonoTuple[SerializerHandler] = tuple(),
handlers_default: MonoTuple[SerializerHandler] = DEFAULT_SERIALIZER_HANDLERS_ZANJ,
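As a rough illustration of how settings like these could be interpreted, the sketch below maps a bool-or-int compress value to a zipfile constant and applies a length threshold to lists. It is not the library's code: SaveSettings, resolve_compression, and is_external_list are hypothetical names, the bool-to-constant mapping is an assumption, and so is the choice of >= for the threshold comparison.

```python
import zipfile
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class SaveSettings:
    """Hypothetical settings holder; the defaults mirror the documented global defaults."""

    external_array_threshold: int = 256
    external_list_threshold: int = 256
    compress: Union[bool, int] = True


def resolve_compression(compress: Union[bool, int]) -> int:
    """Map a bool-or-int setting to a zipfile compression constant (assumed mapping)."""
    if isinstance(compress, bool):
        return zipfile.ZIP_DEFLATED if compress else zipfile.ZIP_STORED
    # an int is taken to already be a zipfile constant, e.g. zipfile.ZIP_LZMA
    return compress


def is_external_list(obj: object, settings: SaveSettings) -> bool:
    """Assumed rule: a list or tuple is stored externally once it reaches the threshold."""
    return isinstance(obj, (list, tuple)) and len(obj) >= settings.external_list_threshold
```

The bool check comes first because bool is a subclass of int in Python, so True would otherwise be read as the integer 1.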
The on-disk format is a file <filename>.zanj, which is a zip file containing:
- __zanj_meta__.json: a file containing zanj-specific metadata
- __zanj__.json: a file containing user-specified data
  - an element that is too large for the core JSON file is moved to an external file:
    - .npy for numpy arrays or torch tensors
    - .jsonl for pandas dataframes or large sequences
  - the external files are listed in __zanj_meta__.json
  - a reference key, specified by _REF_KEY in muutils, will have a value pointing to the external file
  - the _FORMAT_KEY key will detail the external format type
Comparison to other formats:
Format | Safe | Zero-copy | Lazy loading | No file size limit | Layout control | Flexibility | Bfloat16 |
---|---|---|---|---|---|---|---|
pickle (PyTorch) | ❌ | ❌ | ❌ | ✅ | ❌ | ✅ | ✅ |
H5 (Tensorflow) | ✅ | ❌ | ✅ | ✅ | ~ | ~ | ❌ |
HDF5 | ✅ | ? | ✅ | ✅ | ~ | ✅ | ❌ |
SavedModel (Tensorflow) | ✅ | ❌ | ❌ | ✅ | ✅ | ❌ | ✅ |
MsgPack (flax) | ✅ | ✅ | ❌ | ✅ | ❌ | ❌ | ✅ |
Protobuf (ONNX) | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | ✅ |
Cap’n’Proto | ✅ | ✅ | ~ | ✅ | ✅ | ~ | ❌ |
Numpy (npy,npz) | ✅ | ? | ? | ❌ | ✅ | ❌ | ❌ |
SafeTensors | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ |
exdir | ✅ | ? | ? | ? | ? | ✅ | ❌ |
ZANJ | ✅ | ❌ | ❌* | ✅ | ✅ | ✅ | ❌* |
* denotes this feature may be coming at a future date :)
(This table was stolen from safetensors)
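Because a .zanj file is an ordinary zip archive, its layout can be explored with the standard library alone. The sketch below builds a toy archive with the two JSON members named in the format description plus one external .jsonl file, then reads it back. It does not use zanj itself; the member name rows.jsonl, the "$ref" key, and the metadata contents are illustrative assumptions rather than the library's actual output.

```python
import io
import json
import zipfile

# toy payload: the list lives in an external file, referenced from the main JSON
rows = [{"i": i, "sq": i * i} for i in range(5)]
main = {"name": "demo", "rows": {"$ref": "rows.jsonl"}}  # "$ref" is an assumed key
meta = {"externals": ["rows.jsonl"]}  # illustrative metadata only

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("__zanj_meta__.json", json.dumps(meta))
    zf.writestr("__zanj__.json", json.dumps(main, indent=2))
    zf.writestr("rows.jsonl", "\n".join(json.dumps(r) for r in rows))

# reading back: list the members, then resolve the reference by hand
with zipfile.ZipFile(buf) as zf:
    members = sorted(zf.namelist())
    loaded = json.loads(zf.read("__zanj__.json"))
    ref = loaded["rows"]["$ref"]
    loaded["rows"] = [json.loads(line) for line in zf.read(ref).decode().splitlines()]
```

Keeping the structure in JSON and the bulk data in separate members is what lets the main file stay human readable even when the stored object is large.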
def register_loader_handler(handler: zanj.loading.LoaderHandler)
register a custom loader handler
class ZANJ(muutils.json_serialize.json_serialize.JsonSerializer):
Zip up: Arrays in Numpy, JSON for everything else
given an arbitrary object, throw into a zip file, with arrays stored in .npy files, and everything else stored in a json file
(basically npz file with json)
- everything else about the object is stored in a __zanj__.json file in the root of the archive, via muutils.json_serialize.JsonSerializer
- metadata is stored in a __zanj_meta__.json file in the root of the archive
create a ZANJ-class via z_cls = ZANJ().create(obj), and save/read instances of the object via z_cls.save(obj, path), z_cls.load(path). be sure to pass an instance of the object, to make sure that the attributes of the class can be correctly recognized
ZANJ(
    error_mode: muutils.errormode.ErrorMode = ErrorMode.Except,
    internal_array_mode: Literal['list', 'array_list_meta', 'array_hex_meta', 'array_b64_meta', 'external', 'zero_dim'] = 'array_list_meta',
    external_array_threshold: int = 256,
    external_list_threshold: int = 256,
    compress: bool | int = True,
    custom_settings: dict[str, typing.Any] | None = None,
    handlers_pre: MonoTuple[SerializerHandler] = (),
    handlers_default: MonoTuple[SerializerHandler] = DEFAULT_SERIALIZER_HANDLERS_ZANJ
)
external_array_threshold: int
external_list_threshold: int
custom_settings: dict
compress
def externals_info(self) -> dict[str, dict[str, str | int | list[int]]]
return information about the current externals
def meta(self) -> Union[bool, int, float, str, NoneType, List[Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]], Dict[str, Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]]]
return the metadata of the ZANJ archive
def save(self, obj: Any, file_path: str | pathlib.Path) -> str
save the object to a ZANJ archive. returns the path to the archive
def read(self, file_path: Union[str, pathlib.Path]) -> Any
load the object from a ZANJ archive
TODO: load only some part of the zanj file by passing an ObjectPath
zanj.externals
for storing/retrieving an item externally in a ZANJ archive
ZANJ_MAIN: str = '__zanj__.json'
ZANJ_META: str = '__zanj_meta__.json'
ExternalItemType = typing.Literal['jsonl', 'npy']
ExternalItemType_vals = ('jsonl', 'npy')
class ExternalItem(typing.NamedTuple):
ExternalItem(item_type, data, path)
ExternalItem(
    item_type: Literal['jsonl', 'npy'],
    data: Any,
    path: tuple[typing.Union[str, int], ...]
)
Create new instance of ExternalItem(item_type, data, path)
item_type: Literal['jsonl', 'npy']
Alias for field number 0
data: Any
Alias for field number 1
path: tuple[typing.Union[str, int], ...]
Alias for field number 2
def load_jsonl(
    zanj: 'LoadedZANJ',
    fp: IO[bytes]
) -> list[typing.Union[bool, int, float, str, NoneType, typing.List[typing.Union[bool, int, float, str, NoneType, typing.List[typing.Any], typing.Dict[str, typing.Any]]], typing.Dict[str, typing.Union[bool, int, float, str, NoneType, typing.List[typing.Any], typing.Dict[str, typing.Any]]]]]
def load_npy(zanj: 'LoadedZANJ', fp: IO[bytes]) -> numpy.ndarray
EXTERNAL_LOAD_FUNCS: dict[typing.Literal['jsonl', 'npy'], typing.Callable[[zanj.zanj.ZANJ, typing.IO[bytes]], typing.Any]] = {'jsonl': <function load_jsonl>, 'npy': <function load_npy>}
def GET_EXTERNAL_LOAD_FUNC(item_type: str) -> Callable[[zanj.zanj.ZANJ, IO[bytes]], Any]
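The lookup of a load function by item type can be pictured as a plain dispatch table. The following is a simplified sketch, not the module's implementation: the names ending in _sketch are hypothetical, the loaders here take only the file object (the documented ones also receive a zanj argument), and only the "jsonl" entry is filled in because an "npy" loader would need numpy.

```python
import io
import json
from typing import IO, Any, Callable


def load_jsonl_sketch(fp: IO[bytes]) -> list:
    """Parse one JSON document per non-empty line."""
    text = fp.read().decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# dispatch table keyed by item type
LOAD_FUNCS_SKETCH: dict = {"jsonl": load_jsonl_sketch}


def get_load_func_sketch(item_type: str) -> Callable[[IO[bytes]], Any]:
    """Look up a loader, failing loudly on unknown item types."""
    if item_type not in LOAD_FUNCS_SKETCH:
        raise ValueError(f"unknown external item type: {item_type!r}")
    return LOAD_FUNCS_SKETCH[item_type]


# usage: pick the loader from the item type, then hand it an open binary file
items = get_load_func_sketch("jsonl")(io.BytesIO(b'{"a": 1}\n{"a": 2}\n'))
```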
zanj.loading
class LoaderHandler:
handler for loading an object from a json file or a ZANJ archive
LoaderHandler(
    check: Callable[[Union[bool, int, float, str, NoneType, List[Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]], Dict[str, Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]]], tuple[Union[str, int], ...], Any], bool],
    load: Callable[[Union[bool, int, float, str, NoneType, List[Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]], Dict[str, Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]]], tuple[Union[str, int], ...], Any], Any],
    uid: str,
    source_pckg: str,
    priority: int = 0,
    desc: str = '(no description)'
)
check: Callable[[Union[bool, int, float, str, NoneType, List[Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]], Dict[str, Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]]], tuple[Union[str, int], ...], Any], bool]
load: Callable[[Union[bool, int, float, str, NoneType, List[Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]], Dict[str, Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]]], tuple[Union[str, int], ...], Any], Any]
uid: str
source_pckg: str
priority: int = 0
desc: str = '(no description)'
def serialize(self) -> Dict[str, Union[bool, int, float, str, NoneType, List[Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]], Dict[str, Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]]]]
serialize the handler info
def from_formattedclass(cls, fc: type, priority: int = 0)
create a loader from a class with serialize, load methods and __muutils_format__ attribute
LOADER_MAP_LOCK = <unlocked _thread.lock object>
LOADER_MAP: dict[str, zanj.loading.LoaderHandler] = {'numpy.ndarray': LoaderHandler(check=<function <lambda>>, load=<function <lambda>>, uid='numpy.ndarray', source_pckg='zanj', priority=0, desc='numpy.ndarray loader'), 'torch.Tensor': LoaderHandler(check=<function <lambda>>, load=<function <lambda>>, uid='torch.Tensor', source_pckg='zanj', priority=0, desc='torch.Tensor loader'), 'pandas.DataFrame': LoaderHandler(check=<function <lambda>>, load=<function <lambda>>, uid='pandas.DataFrame', source_pckg='zanj', priority=0, desc='pandas.DataFrame loader'), 'list': LoaderHandler(check=<function <lambda>>, load=<function <lambda>>, uid='list', source_pckg='zanj', priority=0, desc='list loader, for externals'), 'tuple': LoaderHandler(check=<function <lambda>>, load=<function <lambda>>, uid='tuple', source_pckg='zanj', priority=0, desc='tuple loader, for externals')}
def register_loader_handler(handler: zanj.loading.LoaderHandler)
register a custom loader handler
def get_item_loader(
    json_item: Union[bool, int, float, str, NoneType, List[Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]], Dict[str, Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]]],
    path: tuple[typing.Union[str, int], ...],
    zanj: typing.Any | None = None,
    error_mode: muutils.errormode.ErrorMode = ErrorMode.Warn
) -> zanj.loading.LoaderHandler | None
get the loader for a json item
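A handler registry guarded by a lock, together with a lookup that tries each handler's check, might look roughly like the sketch below. Everything in it is hypothetical: HandlerSketch is a reduced handler (the documented check and load callables also take a path and a zanj object), and trying higher-priority handlers first is an assumed policy, not a description of what the library does.

```python
import threading
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class HandlerSketch:
    """Hypothetical, reduced loader handler."""

    check: Callable[[Any], bool]
    load: Callable[[Any], Any]
    uid: str
    priority: int = 0


_REGISTRY: dict = {}
_REGISTRY_LOCK = threading.Lock()


def register_sketch(handler: HandlerSketch) -> None:
    # the lock keeps concurrent registrations from interleaving
    with _REGISTRY_LOCK:
        _REGISTRY[handler.uid] = handler


def find_loader_sketch(json_item: Any) -> Optional[HandlerSketch]:
    # assumed policy: try higher-priority handlers first, return the first match
    with _REGISTRY_LOCK:
        candidates = sorted(_REGISTRY.values(), key=lambda h: -h.priority)
    for handler in candidates:
        if handler.check(json_item):
            return handler
    return None


# usage: register a handler that rebuilds complex numbers from a tagged dict
register_sketch(
    HandlerSketch(
        check=lambda item: isinstance(item, dict) and item.get("kind") == "complex",
        load=lambda item: complex(item["re"], item["im"]),
        uid="complex",
    )
)
```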
def load_item_recursive(
    json_item: Union[bool, int, float, str, NoneType, List[Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]], Dict[str, Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]]],
    path: tuple[typing.Union[str, int], ...],
    zanj: typing.Any | None = None,
    error_mode: muutils.errormode.ErrorMode = ErrorMode.Warn,
    allow_not_loading: bool = True
) -> Any
class LoadedZANJ:
for loading a zanj file
LoadedZANJ(path: str | pathlib.Path, zanj: Any)
def populate_externals(self) -> None
put all external items into the main json data
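Putting external items back into the main JSON data amounts to assigning into a nested structure at a given path of dict keys and list indices. The sketch below shows one way to do that with plain Python containers; set_at_path and populate_externals_sketch are hypothetical helpers, and the real method may work differently.

```python
from typing import Any, Union

ObjectPathSketch = tuple[Union[str, int], ...]


def set_at_path(root: Any, path: ObjectPathSketch, value: Any) -> None:
    """Replace the node at `path` (dict keys and list indices) inside `root`."""
    if not path:
        raise ValueError("path must not be empty: the root cannot be replaced in place")
    node = root
    for key in path[:-1]:
        node = node[key]
    node[path[-1]] = value


def populate_externals_sketch(root: Any, externals: dict) -> None:
    """Write every already-loaded external item back into the main JSON tree."""
    for path, data in externals.items():
        set_at_path(root, path, data)


# usage: two placeholder stubs are replaced by the data they stand for
tree = {"model": {"weights": {"$ref": "w.npy"}}, "logs": [{"$ref": "l0.jsonl"}]}
populate_externals_sketch(
    tree,
    {("model", "weights"): [0.1, 0.2], ("logs", 0): [{"step": 1}]},
)
```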
zanj.serializing
KW_ONLY_KWARGS: dict = {'kw_only': True}
def jsonl_metadata(
    data: list[typing.Dict[str, typing.Union[bool, int, float, str, NoneType, typing.List[typing.Union[bool, int, float, str, NoneType, typing.List[typing.Any], typing.Dict[str, typing.Any]]], typing.Dict[str, typing.Union[bool, int, float, str, NoneType, typing.List[typing.Any], typing.Dict[str, typing.Any]]]]]]
) -> dict
metadata about a jsonl object
def store_npy(
    self: Any,
    fp: IO[bytes],
    data: muutils.tensor_utils.jaxtype_factory.<locals>._BaseArray
) -> None
store numpy array to given file as .npy
def store_jsonl(
    self: Any,
    fp: IO[bytes],
    data: Sequence[Union[bool, int, float, str, NoneType, List[Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]], Dict[str, Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]]]]
) -> None
store sequence to given file as .jsonl
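Writing a sequence as .jsonl means emitting one JSON document per line. A minimal sketch of that idea follows; store_jsonl_sketch is a hypothetical helper that drops the leading self argument of the documented function and is not taken from the library.

```python
import io
import json
from typing import IO, Any, Sequence


def store_jsonl_sketch(fp: IO[bytes], data: Sequence[Any]) -> None:
    """Write each item as one line of UTF-8 encoded JSON."""
    for item in data:
        fp.write(json.dumps(item).encode("utf-8"))
        fp.write(b"\n")


# usage: any binary file-like object works, including an in-memory buffer
buf = io.BytesIO()
store_jsonl_sketch(buf, [{"a": 1}, [1, 2], "x"])
```

One line per item means a reader can process the file incrementally instead of parsing a single large JSON array.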
EXTERNAL_STORE_FUNCS: dict[typing.Literal['jsonl', 'npy'], typing.Callable[[typing.Any, typing.IO[bytes], typing.Any], NoneType]] = {'npy': <function store_npy>, 'jsonl': <function store_jsonl>}
class ZANJSerializerHandler(muutils.json_serialize.json_serialize.SerializerHandler):
a handler for ZANJ serialization
ZANJSerializerHandler(
    uid: str,
    desc: str,
    *,
    check: Callable[[Any, Any, tuple[Union[str, int], ...]], bool],
    serialize_func: Callable[[Any, Any, tuple[Union[str, int], ...]], Union[bool, int, float, str, NoneType, List[Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]], Dict[str, Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]]]],
    source_pckg: str
)
source_pckg: str
check: Callable[[Any, Any, tuple[Union[str, int], ...]], bool]
serialize_func: Callable[[Any, Any, tuple[Union[str, int], ...]], Union[bool, int, float, str, NoneType, List[Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]], Dict[str, Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]]]]
def zanj_external_serialize(
    jser: Any,
    data: Any,
    path: tuple[typing.Union[str, int], ...],
    item_type: Literal['jsonl', 'npy'],
    _format: str
) -> Union[bool, int, float, str, NoneType, List[Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]], Dict[str, Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]]]
stores a numpy array or jsonl externally in a ZANJ object
Parameters:
- jser: ZANJ
- data: Any
- path: ObjectPath
- item_type: ExternalItemType
Returns:
- JSONitem: json data with reference
Modifies:
- modifies jser._externals
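The pattern described here, recording the data on the serializer as a side effect and returning a small JSON stub that refers to it, can be sketched as follows. This is an illustration only: SerializerSketch and external_serialize_sketch are hypothetical, the key names "$ref" and "__format__" are assumed placeholder values, and deriving the archive member name by joining the path with "/" is likewise an assumption.

```python
from typing import Any, Union

ObjectPathSketch = tuple[Union[str, int], ...]

REF_KEY = "$ref"  # assumed value, for illustration only
FORMAT_KEY = "__format__"  # assumed value, for illustration only


class SerializerSketch:
    """Hypothetical serializer that only tracks externally stored items."""

    def __init__(self) -> None:
        self.externals: dict = {}


def external_serialize_sketch(
    jser: SerializerSketch,
    data: Any,
    path: ObjectPathSketch,
    item_type: str,
    fmt: str,
) -> dict:
    """Record `data` as an external item and return a JSON stub pointing to it."""
    # derive a member name from the object's path, e.g. ("a", 0) -> "a/0.npy"
    archive_name = "/".join(str(p) for p in path) + "." + item_type
    jser.externals[archive_name] = (item_type, data, path)  # the side effect
    return {FORMAT_KEY: fmt, REF_KEY: archive_name}


# usage: the stub goes into the main JSON, the data stays on the serializer
jser = SerializerSketch()
stub = external_serialize_sketch(jser, [1, 2, 3], ("model", "rows"), "jsonl", "list")
```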
DEFAULT_SERIALIZER_HANDLERS_ZANJ: MonoTuple[SerializerHandler] = (
    ZANJSerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='numpy.ndarray:external', desc='external numpy array', source_pckg='zanj'),
    ZANJSerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='torch.Tensor:external', desc='external torch tensor', source_pckg='zanj'),
    ZANJSerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='list:external', desc='external list', source_pckg='zanj'),
    ZANJSerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='tuple:external', desc='external tuple', source_pckg='zanj'),
    ZANJSerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='pandas.DataFrame:external', desc='external pandas DataFrame', source_pckg='zanj'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='base types', desc='base types (bool, int, float, str, None)'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='dictionaries', desc='dictionaries'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='(list, tuple) -> list', desc='lists and tuples as lists'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function _serialize_override_serialize_func>, uid='.serialize override', desc='objects with .serialize method'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='namedtuple -> dict', desc='namedtuples as dicts'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='dataclass -> dict', desc='dataclasses as dicts'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='path -> str', desc='Path objects as posix strings'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='obj -> str(obj)', desc='directly serialize objects in `SERIALIZE_DIRECT_AS_STR` to strings'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='numpy.ndarray', desc='numpy arrays'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='torch.Tensor', desc='pytorch tensors'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='pandas.DataFrame', desc='pandas DataFrames'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='(set, list, tuple, Iterable) -> list', desc='sets, lists, tuples, and Iterables as lists'),
    SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='fallback', desc='fallback handler -- serialize object attributes and special functions as strings'),
)
zanj.torchutil
KWArgs = typing.Any
def num_params(m: torch.nn.modules.module.Module, only_trainable: bool = True)
return total number of parameters in a model
if only_trainable is False, will include parameters with requires_grad = False
https://stackoverflow.com/questions/49201236/check-the-total-number-of-parameters-in-a-pytorch-model
def get_module_device(
    m: torch.nn.modules.module.Module
) -> tuple[bool, torch.device | dict[str, torch.device]]
get the current devices
class ConfiguredModel(torch.nn.modules.module.Module, typing.Generic[~T_config]):
a model that has a configuration, for saving with ZANJ
@set_config_class(YourConfig)
class YourModule(ConfiguredModel[YourConfig]):
    def __init__(self, cfg: YourConfig):
        super().__init__(cfg)
__init__() must initialize the model from a config object only, and call super().__init__(zanj_model_config)
If you are inheriting from another class + ConfiguredModel, ConfiguredModel must be the first class in the inheritance list
zanj_config_class
zanj_model_config: ~T_config
training_records: dict | None
def serialize(
    self,
    path: tuple[typing.Union[str, int], ...] = (),
    zanj: zanj.zanj.ZANJ | None = None
) -> dict[str, typing.Any]
def save(self, file_path: str, zanj: zanj.zanj.ZANJ | None = None)
def load(
    cls,
    obj: dict[str, typing.Any],
    path: tuple[typing.Union[str, int], ...],
    zanj: zanj.zanj.ZANJ | None = None
) -> zanj.torchutil.ConfiguredModel
load a model from a serialized object
def read(
    cls,
    file_path: str,
    zanj: zanj.zanj.ZANJ | None = None
) -> zanj.torchutil.ConfiguredModel
read a model from a file
def load_file(
    cls,
    file_path: str,
    zanj: zanj.zanj.ZANJ | None = None
) -> zanj.torchutil.ConfiguredModel
read a model from a file
def get_handler(cls) -> zanj.loading.LoaderHandler
def num_params(self) -> int
Inherited members from torch.nn.modules.module.Module:
dump_patches
training
call_super_init
forward
register_buffer
register_parameter
add_module
register_module
get_submodule
set_submodule
get_parameter
get_buffer
get_extra_state
set_extra_state
apply
cuda
ipu
xpu
mtia
cpu
type
float
double
half
bfloat16
to_empty
to
register_full_backward_pre_hook
register_backward_hook
register_full_backward_hook
register_forward_pre_hook
register_forward_hook
register_state_dict_post_hook
register_state_dict_pre_hook
state_dict
register_load_state_dict_pre_hook
register_load_state_dict_post_hook
load_state_dict
parameters
named_parameters
buffers
named_buffers
children
named_children
modules
named_modules
train
eval
requires_grad_
zero_grad
share_memory
extra_repr
compile
def set_config_class(
    config_class: Type[muutils.json_serialize.serializable_dataclass.SerializableDataclass]
) -> Callable[[Type[zanj.torchutil.ConfiguredModel]], Type[zanj.torchutil.ConfiguredModel]]
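The return type above shows that set_config_class is a decorator factory: it takes a config class and returns a function that maps a model class to a model class. The following is a minimal sketch of that pattern using only the standard library. It is not zanj's implementation; SketchConfiguredModel, sketch_set_config_class, the config_class attribute, TinyConfig, and TinyModel are hypothetical names invented for illustration, and the type check in __init__ is an assumption about what such a base class might do.

```python
from dataclasses import dataclass
from typing import Callable, Generic, Optional, Type, TypeVar

T_config = TypeVar("T_config")


class SketchConfiguredModel(Generic[T_config]):
    """Hypothetical stand-in for ConfiguredModel, without the torch dependency."""

    # set by the class decorator below; None means "not decorated yet"
    config_class: Optional[type] = None

    def __init__(self, cfg: T_config) -> None:
        if self.config_class is None:
            raise TypeError("no config class set; decorate the subclass first")
        if not isinstance(cfg, self.config_class):
            raise TypeError(
                f"expected {self.config_class.__name__}, got {type(cfg).__name__}"
            )
        self.cfg = cfg


def sketch_set_config_class(
    config_class: type,
) -> Callable[[Type[SketchConfiguredModel]], Type[SketchConfiguredModel]]:
    """Return a class decorator that records config_class on the model class."""

    def wrapper(cls: Type[SketchConfiguredModel]) -> Type[SketchConfiguredModel]:
        cls.config_class = config_class
        return cls

    return wrapper


@dataclass
class TinyConfig:
    n_layers: int = 2


@sketch_set_config_class(TinyConfig)
class TinyModel(SketchConfiguredModel[TinyConfig]):
    def __init__(self, cfg: TinyConfig) -> None:
        # the model is built from the config object only
        super().__init__(cfg)


model = TinyModel(TinyConfig(n_layers=4))
```

Binding the config class to the model class is what would let a loader rebuild the config first and then construct the model from it alone, which matches the requirement that __init__() take only a config object.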
class ConfigMismatchException(builtins.ValueError):
Inappropriate argument value (of correct type).
ConfigMismatchException(msg: str, diff)
diff
def assert_model_cfg_equality(
    model_a: zanj.torchutil.ConfiguredModel,
    model_b: zanj.torchutil.ConfiguredModel
)
check both models are correct instances and have the same config
Raises: ConfigMismatchException: if the configs don’t match, e.diff will contain the diff
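The idea of an exception that carries a diff can be sketched as follows, using plain dictionaries in place of serialized configs. This is an illustration under stated assumptions, not zanj's code: SketchConfigMismatch, config_diff, sketch_assert_cfg_equal, and the (left, right) tuple format of the diff are all hypothetical, and the real structure of e.diff may differ.

```python
from typing import Any, Dict, Tuple

_MISSING = "<missing>"  # hypothetical marker for a key absent on one side


class SketchConfigMismatch(ValueError):
    """Hypothetical analogue of ConfigMismatchException: a ValueError with a diff."""

    def __init__(self, msg: str, diff: Dict[str, Tuple[Any, Any]]) -> None:
        super().__init__(msg)
        self.diff = diff


def config_diff(a: Dict[str, Any], b: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Map each differing key to its (value in a, value in b) pair."""
    keys = sorted(set(a) | set(b))
    return {
        k: (a.get(k, _MISSING), b.get(k, _MISSING))
        for k in keys
        if a.get(k, _MISSING) != b.get(k, _MISSING)
    }


def sketch_assert_cfg_equal(cfg_a: Dict[str, Any], cfg_b: Dict[str, Any]) -> None:
    """Raise SketchConfigMismatch if the two config dicts differ."""
    diff = config_diff(cfg_a, cfg_b)
    if diff:
        raise SketchConfigMismatch(f"configs differ in {len(diff)} field(s)", diff)


try:
    sketch_assert_cfg_equal(
        {"d_model": 64, "n_layers": 2},
        {"d_model": 64, "n_layers": 4},
    )
except SketchConfigMismatch as e:
    mismatch = e.diff  # only the differing field is reported
```

Attaching the diff to the exception lets a caller report exactly which fields disagree instead of only learning that the configs are unequal.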
def assert_model_exact_equality(
    model_a: zanj.torchutil.ConfiguredModel,
    model_b: zanj.torchutil.ConfiguredModel
)
check the models are exactly equal, including state dict contents
zanj.zanj
an HDF5/exdir file alternative that uses JSON for attributes and allows serialization of arbitrary data
for large arrays, the output is a zip archive with most data in a JSON file, but with sufficiently large arrays stored in binary .npy files
“ZANJ” is an acronym that the AI tool Elicit came up with for me. not to be confused with:
ZANJitem = typing.Union[bool, int, float, str, NoneType, typing.List[typing.Union[bool, int, float, str, NoneType, typing.List[typing.Any], typing.Dict[str, typing.Any]]], typing.Dict[str, typing.Union[bool, int, float, str, NoneType, typing.List[typing.Any], typing.Dict[str, typing.Any]]], numpy.ndarray, ForwardRef('pd.DataFrame')]
ZANJ_GLOBAL_DEFAULTS: zanj.zanj._ZANJ_GLOBAL_DEFAULTS_CLASS = _ZANJ_GLOBAL_DEFAULTS_CLASS(error_mode=ErrorMode.Except, internal_array_mode='array_list_meta', external_array_threshold=256, external_list_threshold=256, compress=True, custom_settings=None)
class ZANJ(muutils.json_serialize.json_serialize.JsonSerializer):
Zip up: Arrays in Numpy, JSON for everything else
given an arbitrary object, throw into a zip file, with arrays stored in .npy files, and everything else stored in a json file
(basically npz file with json)
everything else about the object is stored in a JSON file zanj.json in the root of the archive, via muutils.json_serialize.JsonSerializer
metadata is stored in a __zanj_meta__.json file in the root of the archive
create a ZANJ-class via z_cls = ZANJ().create(obj), and save/read instances of the object via z_cls.save(obj, path), z_cls.load(path). be sure to pass an instance of the object, to make sure that the attributes of the class can be correctly recognized
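The archive layout described above can be sketched with the standard library alone. This is a toy illustration, not the ZANJ implementation: sketch_save, sketch_read, the "__sketch_ref__" key, and the choice to store external lists as separate .json members are all hypothetical. Only the file names zanj.json and __zanj_meta__.json and the default threshold of 256 come from this page, and whether the real comparison is >= or > is an assumption.

```python
import io
import json
import zipfile
from typing import Any, Dict

EXTERNAL_LIST_THRESHOLD = 256  # documented default; ">=" below is an assumption


def sketch_save(obj: Dict[str, Any], meta: Dict[str, Any]) -> bytes:
    """Write obj into an in-memory zip, moving long lists into separate members."""
    externals: Dict[str, Any] = {}

    def externalize(value: Any, path: str) -> Any:
        if isinstance(value, dict):
            return {
                k: externalize(v, f"{path}/{k}" if path else k)
                for k, v in value.items()
            }
        if isinstance(value, list) and len(value) >= EXTERNAL_LIST_THRESHOLD:
            fname = f"{path}.json"  # hypothetical naming scheme
            externals[fname] = value
            return {"__sketch_ref__": fname}  # hypothetical reference marker
        return value

    skeleton = externalize(obj, "")
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("zanj.json", json.dumps(skeleton, indent=2))
        zf.writestr("__zanj_meta__.json", json.dumps(meta, indent=2))
        for fname, value in externals.items():
            zf.writestr(fname, json.dumps(value))
    return buf.getvalue()


def sketch_read(blob: bytes) -> Dict[str, Any]:
    """Load zanj.json and replace each reference with the external member's data."""
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:

        def resolve(value: Any) -> Any:
            if isinstance(value, dict):
                if set(value) == {"__sketch_ref__"}:
                    return json.loads(zf.read(value["__sketch_ref__"]))
                return {k: resolve(v) for k, v in value.items()}
            return value

        return resolve(json.loads(zf.read("zanj.json")))


data = {"name": "run-1", "losses": [0.5] * 300, "cfg": {"lr": 0.01, "tags": ["a", "b"]}}
blob = sketch_save(data, meta={"format": "sketch"})
with zipfile.ZipFile(io.BytesIO(blob)) as zf:
    names = sorted(zf.namelist())
recovered = sketch_read(blob)
```

In this sketch the 300-element list lands in its own archive member while the small nested dict stays inline in zanj.json, so the top-level JSON file remains short and human readable.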
ZANJ(
    error_mode: muutils.errormode.ErrorMode = ErrorMode.Except,
    internal_array_mode: Literal['list', 'array_list_meta', 'array_hex_meta', 'array_b64_meta', 'external', 'zero_dim'] = 'array_list_meta',
    external_array_threshold: int = 256,
    external_list_threshold: int = 256,
    compress: bool | int = True,
    custom_settings: dict[str, typing.Any] | None = None,
    handlers_pre: None = (),
    handlers_default: None = (
        ZANJSerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='numpy.ndarray:external', desc='external numpy array', source_pckg='zanj'),
        ZANJSerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='torch.Tensor:external', desc='external torch tensor', source_pckg='zanj'),
        ZANJSerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='list:external', desc='external list', source_pckg='zanj'),
        ZANJSerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='tuple:external', desc='external tuple', source_pckg='zanj'),
        ZANJSerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='pandas.DataFrame:external', desc='external pandas DataFrame', source_pckg='zanj'),
        SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='base types', desc='base types (bool, int, float, str, None)'),
        SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='dictionaries', desc='dictionaries'),
        SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='(list, tuple) -> list', desc='lists and tuples as lists'),
        SerializerHandler(check=<function <lambda>>, serialize_func=<function _serialize_override_serialize_func>, uid='.serialize override', desc='objects with .serialize method'),
        SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='namedtuple -> dict', desc='namedtuples as dicts'),
        SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='dataclass -> dict', desc='dataclasses as dicts'),
        SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='path -> str', desc='Path objects as posix strings'),
        SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='obj -> str(obj)', desc='directly serialize objects in `SERIALIZE_DIRECT_AS_STR` to strings'),
        SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='numpy.ndarray', desc='numpy arrays'),
        SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='torch.Tensor', desc='pytorch tensors'),
        SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='pandas.DataFrame', desc='pandas DataFrames'),
        SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='(set, list, tuple, Iterable) -> list', desc='sets, lists, tuples, and Iterables as lists'),
        SerializerHandler(check=<function <lambda>>, serialize_func=<function <lambda>>, uid='fallback', desc='fallback handler -- serialize object attributes and special functions as strings')
    )
)
external_array_threshold: int
external_list_threshold: int
custom_settings: dict
compress
def externals_info(self) -> dict[str, dict[str, str | int | list[int]]]
return information about the current externals
def meta(
    self
) -> Union[bool, int, float, str, NoneType, List[Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]], Dict[str, Union[bool, int, float, str, NoneType, List[Any], Dict[str, Any]]]]
return the metadata of the ZANJ archive
def save(self, obj: Any, file_path: str | pathlib.Path) -> str
save the object to a ZANJ archive. returns the path to the archive
def read(self, file_path: Union[str, pathlib.Path]) -> Any
load the object from a ZANJ archive
TODO: load only some part of the zanj file by passing an ObjectPath