Supported Data Types¶
ParquetDB is designed to handle a variety of data types, making it flexible for diverse applications. This includes:
Empty dictionaries
Nested dictionaries
Lists
NumPy arrays
Python functions
Class instances (objects)
In this notebook, we’ll demonstrate how to create (or update) records in ParquetDB with these data types.
[27]:
import pprint
import shutil
import os

import pandas as pd
import pyarrow as pa

from parquetdb import ParquetDB, NormalizeConfig

db_path = "ParquetDB"
if os.path.exists(db_path):
    shutil.rmtree(db_path)

db = ParquetDB(db_path)
Empty Dictionaries¶
Empty dictionaries can be part of your records without issues. This can be useful for placeholders or optional fields.
Parquet files cannot natively handle empty dictionaries, so ParquetDB handles them by creating a dummy field with a null value.
[28]:
data_with_empty_dict = [
    {
        "empty_dict": {},
    }
]
db.create(data_with_empty_dict)
df = db.read().to_pandas()
print(df)
empty_dict.dummy_field id
0 NaN 0
Nested Dictionaries¶
You can store nested dictionaries directly. ParquetDB will flatten nested dictionaries across multiple columns, naming each child field with the following syntax:
parent_field_name.child_field_name
[29]:
data_with_nested_dict = [
    {
        "nested_dict": {
            "level1_key": "value1",
            "level1_subdict": {
                "level2_key": 123,
                "level2_subdict": {"level3_key": True},
            },
        },
    }
]
db.create(data_with_nested_dict)
df = db.read().to_pandas()
print(df)
empty_dict.dummy_field id nested_dict.level1_key \
0 NaN 0 None
1 NaN 1 value1
nested_dict.level1_subdict.level2_key \
0 NaN
1 123.0
nested_dict.level1_subdict.level2_subdict.level3_key
0 None
1 True
Notice that ParquetDB handles schema evolution for you: if you update a nested dictionary, the schema is updated to reflect the new structure. If there is a mismatch, ParquetDB adds null values for the missing fields in either the incoming data or the existing data.
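As a minimal sketch of this behavior (run on a scratch database so the ids shown in the outputs below are unaffected; the path and field names are illustrative):

# Scratch database so the main "db" and its ids stay untouched.
scratch = ParquetDB("ParquetDB_schema_demo")
scratch.create([{"info": {"a": 1}}])
scratch.create([{"info": {"a": 2, "b": "new"}}])  # "b" is a new nested field

# Row 0 now shows a null under the new "info.b" column.
print(scratch.read().to_pandas())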
Lists¶
Lists (including lists of lists) are supported. Under the hood, these will be stored as Arrow list arrays if no further flattening or ragged array logic is specified.
[30]:
data_with_lists = [{"simple_list": ["a", "b", "c"]}]
db.create(data_with_lists)
df = db.read().to_pandas()
print(df)
empty_dict.dummy_field id nested_dict.level1_key \
0 NaN 0 None
1 NaN 1 value1
2 NaN 2 None
nested_dict.level1_subdict.level2_key \
0 NaN
1 123.0
2 NaN
nested_dict.level1_subdict.level2_subdict.level3_key simple_list
0 None None
1 True None
2 None [a, b, c]
For list data that cannot form fixed-shape arrays, ParquetDB uses the ListArray type, which can be ragged.
By default, ParquetDB tries to detect whether a list can be stored as a fixed-shape array; if it cannot, it falls back to the ListArray type. You can turn this behavior off by setting convert_to_fixed_shape=False, or mark specific fields to always be treated as ragged (skipping fixed-shape conversion) by listing them in treat_fields_as_ragged=[...], as sketched after the next example.
[31]:
data_with_lists = [{"simple_list_not_fixed": [0, 1, 2, 3]}]
db.create(data_with_lists, convert_to_fixed_shape=False)
df = db.read().to_pandas()
print(df)
empty_dict.dummy_field id nested_dict.level1_key \
0 NaN 0 None
1 NaN 1 value1
2 NaN 2 None
3 NaN 3 None
nested_dict.level1_subdict.level2_key \
0 NaN
1 123.0
2 NaN
3 NaN
nested_dict.level1_subdict.level2_subdict.level3_key simple_list \
0 None None
1 True None
2 None [a, b, c]
3 None None
simple_list_not_fixed
0 None
1 None
2 None
3 [0, 1, 2, 3]
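Alternatively, treat_fields_as_ragged marks specific fields as ragged while leaving automatic detection on for the rest. A minimal sketch (on a scratch database; the path and field name are illustrative):

# Keep "ragged_field" as a ragged ListArray while automatic fixed-shape
# detection stays enabled for every other field.
ragged_db = ParquetDB("ParquetDB_ragged_demo")
ragged_db.create(
    [{"ragged_field": [0, 1, 2, 3]}],
    treat_fields_as_ragged=["ragged_field"],
)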
Fixed Shape Arrays¶
As briefly mentioned above, ParquetDB will try to detect whether a list can be stored as a fixed-shape array; if it cannot, the ListArray type is used.
[32]:
data_with_fixed_shape_array = [{"fixed_shape_array": [0, 1, 2, 3]}]
db.create(data_with_fixed_shape_array)
df = db.read().to_pandas()
print(df)
empty_dict.dummy_field fixed_shape_array id nested_dict.level1_key \
0 NaN [nan, nan, nan, nan] 0 None
1 NaN [nan, nan, nan, nan] 1 value1
2 NaN [nan, nan, nan, nan] 2 None
3 NaN [nan, nan, nan, nan] 3 None
4 NaN [0.0, 1.0, 2.0, 3.0] 4 None
nested_dict.level1_subdict.level2_key \
0 NaN
1 123.0
2 NaN
3 NaN
4 NaN
nested_dict.level1_subdict.level2_subdict.level3_key simple_list \
0 None None
1 True None
2 None [a, b, c]
3 None None
4 None None
simple_list_not_fixed
0 None
1 None
2 None
3 [0, 1, 2, 3]
4 None
As you can see, ParquetDB converted the list of numbers to a fixed-shape array. It also back-fills null values for existing records that have no value for the new field.
Read Fixed Shape Arrays¶
Supporting fixed-shape arrays is useful because ParquetDB can read them directly into their NumPy ndarray equivalent. Fixed-shape arrays have a special method, to_numpy_ndarray(), that returns the column as a NumPy ndarray.
For more information, check out the extension type documentation and the Fixed Shape Arrays section.
[33]:
table = db.read(columns=["fixed_shape_array"])
fixed_shape_array = table["fixed_shape_array"].combine_chunks().to_numpy_ndarray()
print(fixed_shape_array.shape)
print(fixed_shape_array)
(5, 4)
[[0 0 0 0]
[0 0 0 0]
[0 0 0 0]
[0 0 0 0]
[0 1 2 3]]
These fixed shape arrays can handle multi-dimensional arrays as well.
[34]:
data_with_fixed_shape_array = [
    {
        "nd_fixed_shape_array": [
            [[0, 1, 2, 3], [4, 5, 6, 7]],
            [[8, 9, 10, 11], [12, 13, 14, 15]],
        ]
    }
]
db.create(data_with_fixed_shape_array)
table = db.read(columns=["nd_fixed_shape_array"])
nd_fixed_shape_array = table["nd_fixed_shape_array"].combine_chunks().to_numpy_ndarray()
print(nd_fixed_shape_array.shape)
print(nd_fixed_shape_array)
(6, 2, 2, 4)
[[[[ 0 0 0 0]
[ 0 0 0 0]]
[[ 0 0 0 0]
[ 0 0 0 0]]]
[[[ 0 0 0 0]
[ 0 0 0 0]]
[[ 0 0 0 0]
[ 0 0 0 0]]]
[[[ 0 0 0 0]
[ 0 0 0 0]]
[[ 0 0 0 0]
[ 0 0 0 0]]]
[[[ 0 0 0 0]
[ 0 0 0 0]]
[[ 0 0 0 0]
[ 0 0 0 0]]]
[[[ 0 0 0 0]
[ 0 0 0 0]]
[[ 0 0 0 0]
[ 0 0 0 0]]]
[[[ 0 1 2 3]
[ 4 5 6 7]]
[[ 8 9 10 11]
[12 13 14 15]]]]
Python Objects (functions, classes, etc.)¶
Storing raw Python functions requires ParquetDB (or your setup) to allow object serialization. Typically, this is done via pickling. If you only need the code or a reference to the function, consider storing a string or module path instead.
You can turn off this behavior by setting serialize_python_objects=False in the ParquetDB constructor.
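For instance, a minimal sketch (the database path is illustrative):

# A database that will not pickle Python objects; records should then
# contain only plain, Arrow-compatible values.
plain_db = ParquetDB("ParquetDB_no_pickle", serialize_python_objects=False)

With serialization left enabled (the default), a raw function can be stored directly: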
[35]:
def square(x: int) -> int:
    return x**2


data_with_function = [
    {
        "function_obj": square,
    }
]
db.create(data_with_function)
table = db.read(columns=["function_obj"])
print(table)
df = db.read(columns=["function_obj"]).to_pandas()
print(df)
pyarrow.Table
function_obj: extension<parquetdb.PythonObjectArrow<PythonObjectArrowType>>
----
function_obj: [[null],[null],...,[null],[8004951B010000000000008C0A64696C6C2E5F64696C6C948C105F6372656174655F66756E6374696F6E9493942868008C0C5F6372656174655F636F6465949394284B014B004B004B014B024B4343087C00640113005300944E4B028694298C01789485948C40433A5C55736572735C6C6C6C616E675C417070446174615C4C6F63616C5C54656D705C6970796B65726E656C5F34313734305C323334373333383435302E7079948C06737175617265944B0143020001942929749452947D948C085F5F6E616D655F5F948C085F5F6D61696E5F5F9473680A4E4E749452947D947D948C0F5F5F616E6E6F746174696F6E735F5F947D9428680768008C0A5F6C6F61645F747970659493948C03696E7494859452948C0672657475726E94681B75738694622E]]
function_obj
0 None
1 None
2 None
3 None
4 None
5 None
6 <function square at 0x000002035C293E50>
Here you can see that, while the data is in a PyArrow table, the function is stored as bytes. To restore the function to its original form, simply convert the table to a pandas DataFrame.
ParquetDB also packages the function's dependencies, so if the proper environment is set up you can call the function directly from the DataFrame.
[36]:
result = df.iloc[6]["function_obj"](11)
print(result)
121
This can also work with class instances.
[37]:
class CustomClass:
    def __init__(self, name, value):
        self.name = name
        self.value = value

    def __repr__(self):
        return f"CustomClass(name={self.name}, value={self.value})"


custom_instance = CustomClass("example", 42)

data_with_instance = [
    {
        "custom_obj": custom_instance,
    }
]
db.create(data_with_instance)
table = db.read(columns=["custom_obj"])
print(table)
df = db.read(columns=["custom_obj"]).to_pandas()
print(df)
class_ex = df.iloc[7]["custom_obj"]
print(class_ex.name)
print(class_ex.value)
pyarrow.Table
custom_obj: extension<parquetdb.PythonObjectArrow<PythonObjectArrowType>>
----
custom_obj: [[null],[null],...,[null],[800495A0020000000000008C0A64696C6C2E5F64696C6C948C0C5F6372656174655F747970659493942868008C0A5F6C6F61645F747970659493948C047479706594859452948C0B437573746F6D436C6173739468048C066F626A656374948594529485947D94288C0A5F5F6D6F64756C655F5F948C085F5F6D61696E5F5F948C085F5F696E69745F5F9468008C105F6372656174655F66756E6374696F6E9493942868008C0C5F6372656174655F636F6465949394284B034B004B004B034B024B4343107C017C005F007C027C005F0164005300944E85948C046E616D65948C0576616C75659486948C0473656C66946817681887948C40433A5C55736572735C6C6C6C616E675C417070446174615C4C6F63616C5C54656D705C6970796B65726E656C5F34313734305C333635333833343239312E70799468104B02430400010601942929749452947D948C085F5F6E616D655F5F94680F7368104E4E749452947D947D94288C0F5F5F616E6E6F746174696F6E735F5F947D948C0C5F5F7175616C6E616D655F5F948C14437573746F6D436C6173732E5F5F696E69745F5F94758694628C085F5F726570725F5F946812286814284B014B004B004B014B054B43431664017C006A009B0064027C006A019B0064039D05530094284E8C11437573746F6D436C617373286E616D653D948C082C2076616C75653D948C01299474946819681A8594681C682B4B0643020001942929749452947D946821680F73682B4E4E749452947D947D942868267D9468288C14437573746F6D436C6173732E5F5F726570725F5F94758694628C075F5F646F635F5F944E8C0D5F5F736C6F746E616D65735F5F945D9475749452948C086275696C74696E73948C0773657461747472949394684168286808879452302981947D942868178C076578616D706C659468184B2A75622E]]
custom_obj
0 None
1 None
2 None
3 None
4 None
5 None
6 None
7 CustomClass(name=example, value=42)
example
42
Potential Serialization Concerns¶
When storing Python-specific objects (like functions or class instances), you are relying on serialization (e.g., pickling). This approach:
Works best in Python-only environments where you control the Python version and the runtime environment.
Can be risky if you need cross-language support or a long-term archive format that outlasts specific Python versions.
For cross-language compatibility and long-term storage, consider storing object data in a more standardized format (e.g., JSON for nested structures, integers/floats for numeric data, or well-defined schemas for complex types).
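For example, instead of pickling a function you could store its dotted import path and resolve it at read time. A minimal sketch, using math.sqrt as a stand-in for your own function and a hypothetical function_path field:

import importlib

# Store a portable reference (a dotted path string) instead of pickled bytes.
db.create([{"function_path": "math.sqrt"}])

# Later: resolve the stored path back to the callable.
df = db.read(columns=["function_path"]).to_pandas()
path = df["function_path"].dropna().iloc[0]
module_name, func_name = path.rsplit(".", 1)
func = getattr(importlib.import_module(module_name), func_name)
print(func(16.0))  # 4.0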