Advanced Update Operations

ParquetDB’s update method allows you to modify existing records in your dataset by matching on one or more “update keys.” Typically, the id field is used to identify which records to update, but you can also specify additional or different keys based on your data schema.

```python
def update(
    self,
    data: Union[list, dict, pd.DataFrame],
    schema: pa.Schema = None,
    metadata: dict = None,
    fields_metadata: dict = None,
    update_keys: Union[list, str] = ["id"],
    treat_fields_as_ragged=None,
    convert_to_fixed_shape: bool = True,
    normalize_config: NormalizeConfig = NormalizeConfig(),
):
    """
    Updates existing records in the database.

    Parameters
    ----------
    data : dict, list of dicts, or pandas.DataFrame
        The data to update. Each record must contain an 'id' (or specified update key)
        corresponding to the record(s) to update.
    schema : pyarrow.Schema, optional
        The schema for the data being updated. If not provided, it is inferred.
    metadata : dict, optional
        Additional metadata for the entire dataset.
    fields_metadata : dict, optional
        Additional metadata for each field/column.
    update_keys : list of str or str, optional
        Which columns to use for matching existing records. Default is 'id'.
    treat_fields_as_ragged : list of str, optional
        A list of fields to treat as ragged arrays.
    convert_to_fixed_shape : bool, optional
        If True, convert ragged arrays to a fixed shape if possible.
    normalize_config : NormalizeConfig, optional
        Configuration for the normalization process, optimizing performance
        by managing row distribution and file structure.

    Example
    -------
    >>> db.update(
    ...     data=[{"id": 1, "name": "John", "age": 30}, {"id": 2, "name": "Jane", "age": 25}]
    ... )
    """
    ...
```
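Conceptually, update aligns each incoming record to an existing row on the update keys and overwrites only the columns the record actually supplies; rows with no match are left untouched. The pandas snippet below is a minimal sketch of those matching semantics only, not of ParquetDB’s implementation:

```python
import pandas as pd

# Sketch of update-by-key semantics (not ParquetDB internals):
# align incoming rows to existing rows on the key column, then
# overwrite only the columns the incoming data provides.
existing = pd.DataFrame({"id": [0, 1], "age": [30, 25]}).set_index("id")
incoming = pd.DataFrame({"id": [1], "age": [31]}).set_index("id")
existing.update(incoming)  # non-matching rows keep their original values
print(existing.reset_index())
```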

[1]:
import pprint
import shutil
import os
import pandas as pd
import pyarrow as pa
from parquetdb import ParquetDB, NormalizeConfig

db_path = "ParquetDB"
if os.path.exists(db_path):
    shutil.rmtree(db_path)

db = ParquetDB(db_path)
data = [
    {"name": "John", "age": 30, "nested": {"a": 1, "b": 2}},
    {"name": "Jane", "age": 25, "nested": {"a": 3, "b": 4}},
    {"name": "Jimmy", "age": 30, "nested": {"a": 1, "b": 2}},
    {"name": "Jill", "age": 35, "nested": {"a": 3, "b": 4}},
]

db.create(data)
print(db)

df = db.read().to_pandas()
print(df)
============================================================
PARQUETDB SUMMARY
============================================================
Database path: ParquetDB

• Number of columns: 5
• Number of rows: 4
• Number of files: 1
• Number of rows per file: [4]
• Number of row groups per file: [1]
• Serialized metadata size per file: [1101] Bytes

############################################################
METADATA
############################################################

############################################################
COLUMN DETAILS
############################################################
• Columns:
    - id
    - nested.a
    - name
    - nested.b
    - age

   age  id   name  nested.a  nested.b
0   30   0   John         1         2
1   25   1   Jane         3         4
2   30   2  Jimmy         1         2
3   35   3   Jill         3         4
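Notice in the output that nested dictionaries are flattened into dot-separated columns (nested.a, nested.b), and that an id column was added automatically even though the input records did not contain one.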

Basic Usage

Here’s how to call update on a ParquetDB instance, assuming the dataset is already populated with some records. When data is added to ParquetDB, each record is assigned a unique id on which it can later be matched and updated. For the data above, the ids run sequentially from 0 to 3, following the order of the list of dictionaries.
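If you’re not sure which id belongs to which record, you can always read the table back and inspect it first. A minimal sketch using plain pandas on the result of db.read():

```python
# Look up the auto-assigned ids before deciding what to update.
ids = db.read().to_pandas()[["id", "name"]]
print(ids)
```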

[2]:
update_data = [
    {"id": 1, "age": 31},
    {"id": 2, "age": 26},
]

db.update(update_data)

df = db.read().to_pandas()
print(df)
   age  id   name  nested.a  nested.b
0   30   0   John         1         2
1   31   1   Jane         3         4
2   26   2  Jimmy         1         2
3   35   3   Jill         3         4

As you can see, the data has been updated with the new values. You can also pass nested dictionaries, and the corresponding nested values will be updated.

[3]:
update_data = [
    {"id": 0, "nested": {"a": 100}},
    {"id": 3, "nested": {"a": 200}},
]

db.update(update_data)

df = db.read().to_pandas()
print(df)
   age  id   name  nested.a  nested.b
0   30   0   John       100         2
1   31   1   Jane         3         4
2   26   2  Jimmy         1         2
3   35   3   Jill       200         4
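Note that a partial nested update only touches the fields you supply; sibling fields such as nested.b keep their values. A quick sanity check on the table above:

```python
# Sibling nested fields are untouched by a partial nested update.
df = db.read().to_pandas()
assert df["nested.b"].tolist() == [2, 4, 2, 4]
```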

Update on multiple keys

You can also select which columns to match existing records on, and you can even match on multiple keys at once. Note that ParquetDB still assigns its own id column; in the example below we match on the user-supplied id_1 and id_2 fields instead.

[16]:
db_path = "ParquetDB"
if os.path.exists(db_path):
    shutil.rmtree(db_path)

db = ParquetDB(db_path)
current_data = [
    {"id_1": 100, "id_2": 10, "field_1": "here"},
    {"id_1": 55, "id_2": 11},
    {"id_1": 33, "id_2": 12},
    {"id_1": 12, "id_2": 13},
    {"id_1": 33, "id_2": 50},
]

db.create(current_data)
print(db)

df = db.read().to_pandas()
print(df)
============================================================
PARQUETDB SUMMARY
============================================================
Database path: ParquetDB

• Number of columns: 4
• Number of rows: 5
• Number of files: 1
• Number of rows per file: [5]
• Number of row groups per file: [1]
• Serialized metadata size per file: [907] Bytes

############################################################
METADATA
############################################################

############################################################
COLUMN DETAILS
############################################################
• Columns:
    - field_1
    - id_1
    - id
    - id_2

  field_1  id  id_1  id_2
0    here   0   100    10
1    None   1    55    11
2    None   2    33    12
3    None   3    12    13
4    None   4    33    50
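Records that omit a field simply read back as null: only the first record supplied field_1, so every other row shows None for that column.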
[17]:
incoming_data = [
    {"id_1": 100, "id_2": 10, "field_2": "there"},
    {"id_1": 5, "id_2": 5},  # No existing row has (id_1=5, id_2=5), so no update is applied
    {"id_1": 33, "id_2": 13},  # (id_1=33, id_2=13) doesn't exist either; no update is applied
    {
        "id_1": 33,
        "id_2": 12,
        "field_2": "field_2",
        "field_3": "field_3",
    },
]


db.update(incoming_data, update_keys=["id_1", "id_2"])

table = db.read()
print(table.to_pandas())
assert table["field_1"].combine_chunks().to_pylist() == ["here", None, None, None, None]
assert table["field_2"].combine_chunks().to_pylist() == [
    "there",
    None,
    "field_2",
    None,
    None,
]
assert table["field_3"].combine_chunks().to_pylist() == [
    None,
    None,
    "field_3",
    None,
    None,
]
  field_1  field_2  field_3  id  id_1  id_2
0    here    there     None   0   100    10
1    None     None     None   1    55    11
2    None  field_2  field_3   2    33    12
3    None     None     None   3    12    13
4    None     None     None   4    33    50
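Per the signature above, update_keys also accepts a single string. A minimal sketch against the same table, assuming a single string behaves like a one-element list; note that id_1 alone is not unique here (33 appears twice), so a single-column match could touch more than one row, which is exactly why the composite key was used above:

```python
# Sketch: match on id_1 alone. id_1 == 12 identifies exactly one row here.
db.update([{"id_1": 12, "field_1": "updated"}], update_keys="id_1")
print(db.read().to_pandas())
```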