Advanced Update Operations¶
ParquetDB’s `update` method allows you to modify existing records in your dataset by matching on one or more “update keys.” Typically, the `id` field is used to identify which records to update, but you can also specify additional or different keys based on your data schema.
```python
def update(
    self,
    data: Union[list, dict, pd.DataFrame],
    schema: pa.Schema = None,
    metadata: dict = None,
    fields_metadata: dict = None,
    update_keys: Union[list, str] = ["id"],
    treat_fields_as_ragged=None,
    convert_to_fixed_shape: bool = True,
    normalize_config: NormalizeConfig = NormalizeConfig(),
):
    """
    Updates existing records in the database.

    Parameters
    ----------
    data : dict, list of dicts, or pandas.DataFrame
        The data to update. Each record must contain an 'id' (or specified update key)
        corresponding to the record(s) to update.
    schema : pyarrow.Schema, optional
        The schema for the data being updated. If not provided, it is inferred.
    metadata : dict, optional
        Additional metadata for the entire dataset.
    fields_metadata : dict, optional
        Additional metadata for each field/column.
    update_keys : list of str or str, optional
        Which columns to use for matching existing records. Default is 'id'.
    treat_fields_as_ragged : list of str, optional
        A list of fields to treat as ragged arrays.
    convert_to_fixed_shape : bool, optional
        If True, convert ragged arrays to a fixed shape if possible.
    normalize_config : NormalizeConfig, optional
        Configuration for the normalization process, optimizing performance
        by managing row distribution and file structure.

    Example
    -------
    >>> db.update(
    ...     data=[{"id": 1, "name": "John", "age": 30}, {"id": 2, "name": "Jane", "age": 25}]
    ... )
    """
    ...
```
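Beyond the data itself, the signature shows that `update` can attach dataset-level and per-field metadata in the same call. A minimal sketch using only parameters from the signature above; the metadata values are illustrative, and the exact shape expected for `fields_metadata` (assumed here to map column names to metadata dicts) may differ from what the API reference specifies:

```python
# Illustrative sketch based on the signature above; metadata values are
# made up, and fields_metadata is assumed to map column name -> dict.
db.update(
    data=[{"id": 1, "age": 31}],
    metadata={"last_modified_by": "example-job"},  # dataset-level metadata
    fields_metadata={"age": {"units": "years"}},   # per-field metadata (assumed shape)
    update_keys=["id"],                            # the default matching key
)
```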
[1]:
import pprint
import shutil
import os
import pandas as pd
import pyarrow as pa
from parquetdb import ParquetDB, NormalizeConfig
db_path = "ParquetDB"
if os.path.exists(db_path):
shutil.rmtree(db_path)
db = ParquetDB(db_path)
data = [
{"name": "John", "age": 30, "nested": {"a": 1, "b": 2}},
{"name": "Jane", "age": 25, "nested": {"a": 3, "b": 4}},
{"name": "Jimmy", "age": 30, "nested": {"a": 1, "b": 2}},
{"name": "Jill", "age": 35, "nested": {"a": 3, "b": 4}},
]
db.create(data)
print(db)
df = db.read().to_pandas()
print(df)
============================================================
PARQUETDB SUMMARY
============================================================
Database path: ParquetDB
• Number of columns: 5
• Number of rows: 4
• Number of files: 1
• Number of rows per file: [4]
• Number of row groups per file: [1]
• Serialized metadata size per file: [1101] Bytes
############################################################
METADATA
############################################################
############################################################
COLUMN DETAILS
############################################################
• Columns:
- id
- nested.a
- name
- nested.b
- age
age id name nested.a nested.b
0 30 0 John 1 2
1 25 1 Jane 3 4
2 30 2 Jimmy 1 2
3 35 3 Jill 3 4
Basic Usage¶
Here’s how to call `update` on a ParquetDB instance. We’ll assume the dataset is already populated with some records. When data is inserted into ParquetDB, each record is assigned a unique `id` by which it can later be matched and updated. In the data above, the ids start at 0 and run sequentially to 3 through the list of dictionaries.
[2]:
update_data = [
{"id": 1, "age": 31},
{"id": 2, "age": 26},
]
db.update(update_data)
df = db.read().to_pandas()
print(df)
age id name nested.a nested.b
0 30 0 John 1 2
1 31 1 Jane 3 4
2 26 2 Jimmy 1 2
3 35 3 Jill 3 4
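Since the signature accepts a pandas.DataFrame for `data`, the same update can also be written as a DataFrame. A minimal sketch, reapplying the values above so the table stays as shown:

```python
# Sketch: per the update signature, `data` may be a pandas DataFrame.
# These are the same values applied above, so the table is unchanged.
update_df = pd.DataFrame([{"id": 1, "age": 31}, {"id": 2, "age": 26}])
db.update(update_df)
```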
Either way, the data has been updated with the new values. You can also pass nested dictionaries, and ParquetDB will update the corresponding nested fields.
[3]:
update_data = [
{"id": 0, "nested": {"a": 100}},
{"id": 3, "nested": {"a": 200}},
]
db.update(update_data)
df = db.read().to_pandas()
print(df)
age id name nested.a nested.b
0 30 0 John 100 2
1 31 1 Jane 3 4
2 26 2 Jimmy 1 2
3 35 3 Jill 200 4
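A related detail: records whose update keys match nothing in the dataset are skipped rather than inserted, as the comment in the multi-key example below notes. A minimal sketch of that behavior, assuming (per that comment) that non-matching records are silently ignored:

```python
# Sketch: id 99 doesn't exist in the dataset, so (assuming non-matching
# records are silently skipped) the table should be left unchanged.
db.update([{"id": 99, "age": 1}])
assert 99 not in db.read().to_pandas()["id"].tolist()
```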
Update on multiple keys¶
You can also select which keys to match on, and you can even match on multiple keys at once.
[4]:
db_path = "ParquetDB"
if os.path.exists(db_path):
shutil.rmtree(db_path)
db = ParquetDB(db_path)
current_data = [
{"id_1": 100, "id_2": 10, "field_1": "here"},
{"id_1": 55, "id_2": 11},
{"id_1": 33, "id_2": 12},
{"id_1": 12, "id_2": 13},
{"id_1": 33, "id_2": 50},
]
db.create(current_data)
print(db)
df = db.read().to_pandas()
print(df)
============================================================
PARQUETDB SUMMARY
============================================================
Database path: ParquetDB
• Number of columns: 4
• Number of rows: 5
• Number of files: 1
• Number of rows per file: [5]
• Number of row groups per file: [1]
• Serialized metadata size per file: [907] Bytes
############################################################
METADATA
############################################################
############################################################
COLUMN DETAILS
############################################################
• Columns:
- field_1
- id_1
- id
- id_2
field_1 id id_1 id_2
0 here 0 100 10
1 None 1 55 11
2 None 2 33 12
3 None 3 12 13
4 None 4 33 50
[5]:
incoming_data = [
    {"id_1": 100, "id_2": 10, "field_2": "there"},
    {"id_1": 5, "id_2": 5},  # (5, 5) matches no existing record, so no update is applied
    {"id_1": 33, "id_2": 13},  # (33, 13) matches nothing either: both keys must match the same record
    {"id_1": 33, "id_2": 12, "field_2": "field_2", "field_3": "field_3"},
]
db.update(incoming_data, update_keys=["id_1", "id_2"])
table = db.read()
print(table.to_pandas())
assert table["field_1"].combine_chunks().to_pylist() == ["here", None, None, None, None]
assert table["field_2"].combine_chunks().to_pylist() == [
"there",
None,
"field_2",
None,
None,
]
assert table["field_3"].combine_chunks().to_pylist() == [
None,
None,
"field_3",
None,
None,
]
field_1 field_2 field_3 id id_1 id_2
0 here there None 0 100 10
1 None None None 1 55 11
2 None field_2 field_3 2 33 12
3 None None None 3 12 13
4 None None None 4 33 50
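Notice that `field_2` and `field_3` did not exist before this update; the call added them as new columns and filled unmatched rows with `None`. Finally, since `update_keys` is typed `Union[list, str]`, a single key can be passed as a plain string. A minimal sketch, using `id_1=12`, which is unique in this dataset:

```python
# Sketch: per the signature, update_keys accepts a plain string for a
# single key. id_1=12 matches exactly one record here.
db.update(
    [{"id_1": 12, "field_2": "updated"}],
    update_keys="id_1",  # equivalent to update_keys=["id_1"]
)
```

Because `id_1 = 33` appears in two rows above, matching on `id_1` alone for that value could touch more than one record; when a key isn’t unique, prefer the multi-key form shown in this section.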