Manipulating Metadata¶
One of ParquetDB’s strengths is the ability to store and manage metadata alongside your dataset. You can attach metadata at:
Dataset level (e.g., version, source, etc.), which applies to the entire table or dataset.
Field/column level (e.g., units, description, etc.), which applies to specific columns.
In this notebook, we’ll walk through:
Updating the Schema – how to add or change fields in the dataset schema, including updating metadata.
Setting Dataset Metadata – how to set or update top-level metadata for the entire dataset.
Setting Field Metadata – how to set or update metadata for individual fields (columns).
The update_schema method allows you to modify the structure and metadata of your dataset. You can:
Change the data type of an existing field.
Add new fields (if your workflow demands it).
Update the top-level metadata (if update_metadata=True).
Optionally normalize the dataset after making schema changes by providing a normalize_config.
def update_schema(
self,
field_dict: dict = None,
schema: pa.Schema = None,
update_metadata: bool = True,
normalize_config: NormalizeConfig = NormalizeConfig()
):
...
field_dict: A dictionary of field updates, where keys are field names and values are the new field definitions (e.g., pa.int32(), pa.float64(), or pa.field("field_name", pa.int32())).
schema: A fully defined PyArrow Schema object to replace or merge with the existing one.
update_metadata: If True, merges the new schema's metadata with the existing metadata. If False, replaces the metadata entirely.
normalize_config: A NormalizeConfig object for controlling file distribution after the schema update.
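If a schema change should also trigger a file reorganization, pass a normalize_config. Below is a minimal sketch; the import path and the parameter names (max_rows_per_file, max_rows_per_group) are assumptions about NormalizeConfig and may differ in your installed version, so check its signature first.
[ ]:
from parquetdb import ParquetDB, NormalizeConfig
import pyarrow as pa

db = ParquetDB("my_dataset")  # assumes this dataset already exists (it is created below)
# Hypothetical parameter names -- verify against NormalizeConfig in your version.
config = NormalizeConfig(max_rows_per_file=100_000, max_rows_per_group=50_000)
db.update_schema(
    field_dict={"age": pa.field("age", pa.float64())},
    normalize_config=config,
)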
[ ]:
from parquetdb import ParquetDB
import pyarrow as pa
db = ParquetDB("my_dataset")
data = [
{"name": "Alice", "age": 30},
{"name": "Bob", "age": 25},
{"name": "Charlie", "age": 35},
]
db.create(data)
print(db)
============================================================
PARQUETDB SUMMARY
============================================================
Database path: my_dataset
• Number of columns: 3
• Number of rows: 3
• Number of files: 1
• Number of rows per file: [3]
• Number of row groups per file: [1]
• Serialized metadata size per file: [717] Bytes
############################################################
METADATA
############################################################
############################################################
COLUMN DETAILS
############################################################
• Columns:
- age
- id
- name
Update Schema¶
[3]:
table = db.read()
print(table)
# Suppose we want to change the 'age' field to float64
field_updates = {
    # pa.field objects and plain types (e.g., pa.float64()) are both accepted
    "age": pa.field("age", pa.float64())
}
db.update_schema(field_dict=field_updates, update_metadata=True)
table = db.read()
print(table)
pyarrow.Table
age: int64
id: int64
name: string
----
age: [[30,25,35]]
id: [[0,1,2]]
name: [["Alice","Bob","Charlie"]]
pyarrow.Table
age: double
id: int64
name: string
----
age: [[30,25,35]]
id: [[0,1,2]]
name: [["Alice","Bob","Charlie"]]
Setting Dataset Metadata¶
[4]:
# Set dataset-level metadata, merging with existing entries
db.set_metadata({"source": "API", "version": "1.0"})
print(db)
============================================================
PARQUETDB SUMMARY
============================================================
Database path: my_dataset
• Number of columns: 3
• Number of rows: 3
• Number of files: 1
• Number of rows per file: [3]
• Number of row groups per file: [1]
• Serialized metadata size per file: [854] Bytes
############################################################
METADATA
############################################################
• source: API
• version: 1.0
############################################################
COLUMN DETAILS
############################################################
• Columns:
- age
- id
- name
If we call set_metadata again with additional keys:
[5]:
# Add more metadata, merging with the existing ones
db.set_metadata({"author": "Data Engineer", "department": "Analytics"})
print(db.get_metadata())
{'source': 'API', 'version': '1.0', 'author': 'Data Engineer', 'department': 'Analytics'}
If you want to replace the existing metadata:
[6]:
# Replace existing metadata
db.set_metadata({"source": "API_2", "version": "2.0"}, update=False)
print(db.get_metadata())
{'source': 'API_2', 'version': '2.0'}
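Parquet key/value metadata is string-based, so structured values are easiest to round-trip as JSON. A minimal sketch; serializing with json is a convention of this example, not a ParquetDB requirement:
[ ]:
import json

# Store a structured value as a JSON string, then decode it on read-back.
db.set_metadata({"provenance": json.dumps({"source": "API_2", "pipeline": "nightly"})})
provenance = json.loads(db.get_metadata()["provenance"])
print(provenance["source"])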
Setting Field-Level Metadata¶
If you want to attach descriptive information to specific fields (columns), use set_field_metadata. This is useful for storing units of measurement, data lineage, or other column-specific properties.
[7]:
field_meta = {"age": {"units": "Years", "description": "Age of the person"}}
db.set_field_metadata(field_meta)
print(db.get_field_metadata())
{'age': {'units': 'Years', 'description': 'Age of the person'}, 'id': {}, 'name': {}}
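Downstream code can read this metadata back, for example to label results with the stored units. A minimal sketch using the dataset above:
[ ]:
# Use the stored units when reporting a computed statistic.
age_meta = db.get_field_metadata()["age"]
mean_age = db.read()["age"].to_pandas().mean()
print(f"Mean age: {mean_age:.1f} {age_meta['units']}")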
Note: When written to disk, metadata is stored in the Parquet file footer and read back by PyArrow when the data is loaded. If your analysis relies on particular metadata keys, make sure every step of your workflow updates and preserves them.
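To confirm that metadata actually persists, you can inspect a written file's footer directly with plain PyArrow. A minimal sketch; the glob pattern assumes ParquetDB writes its .parquet files directly under the dataset directory, which may differ in your setup:
[ ]:
import glob
import pyarrow.parquet as pq

# Read only the schema (footer), not the data; keys and values come back as bytes.
parquet_file = glob.glob("my_dataset/*.parquet")[0]  # assumed file layout
footer_schema = pq.read_schema(parquet_file)
print(footer_schema.metadata)  # dataset-level key/value metadata
print(footer_schema.field("age").metadata)  # field-level metadata for 'age'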