Getting Started¶
This notebook provides a quick introduction to using ParquetDB, a lightweight file-based database system that leverages the Parquet format and PyArrow under the hood.
If you have not installed ParquetDB yet, you can do so via pip:
[ ]:
!pip install parquetdb
Creating a Database¶
Initialize a ParquetDB instance by passing db_path, the directory where its Parquet files will be stored. A relative path such as "ParquetDB" resolves against the current working directory; the example below removes any existing copy first so it starts from an empty database.
[1]:
import os
import shutil
from parquetdb import ParquetDB
db_path = "ParquetDB"
if os.path.exists(db_path):
shutil.rmtree(db_path)
db = ParquetDB(db_path=db_path)
print(db)
============================================================
PARQUETDB SUMMARY
============================================================
Database path: ParquetDB
• Number of columns: 1
• Number of rows: 0
• Number of files: 1
• Number of rows per file: [0]
• Number of row groups per file: [1]
• Serialized metadata size per file: [312] Bytes
############################################################
METADATA
############################################################
############################################################
COLUMN DETAILS
############################################################
• Columns:
- id
Adding Data¶
You can add new records to your ParquetDB instance using db.create(). The data can be:
- A dictionary of field-value pairs
- A list of dictionaries
- A Pandas DataFrame
[2]:
data = [
{"name": "Charlie", "age": 28, "occupation": "Designer"},
{"name": "Diana", "age": 32, "occupation": "Product Manager"},
]
db.create(data)
print("Data added successfully!")
print(db)
Data added successfully!
============================================================
PARQUETDB SUMMARY
============================================================
Database path: ParquetDB
• Number of columns: 4
• Number of rows: 2
• Number of files: 1
• Number of rows per file: [2]
• Number of row groups per file: [1]
• Serialized metadata size per file: [896] Bytes
############################################################
METADATA
############################################################
############################################################
COLUMN DETAILS
############################################################
• Columns:
- age
- occupation
- name
- id
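The cell above passes a list of dictionaries. As a minimal sketch of the other accepted inputs, a single dictionary or a Pandas DataFrame can be passed to db.create() in the same way (the records below are illustrative, and running them adds rows that the outputs later in this notebook do not include):
import pandas as pd

# A single record as a dictionary of field-value pairs
db.create({"name": "Eve", "age": 41, "occupation": "Analyst"})

# Records supplied as a Pandas DataFrame
db.create(pd.DataFrame([{"name": "Frank", "age": 37, "occupation": "Engineer"}]))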
ParquetDB can also handle schema evolution and nested data.
[3]:
data = [
{"name": "Jimmy", "field1": {"subfield1": "value1", "subfield2": "value2"}},
]
db.create(data)
print(db)
============================================================
PARQUETDB SUMMARY
============================================================
Database path: ParquetDB
• Number of columns: 6
• Number of rows: 3
• Number of files: 2
• Number of rows per file: [2, 1]
• Number of row groups per file: [1, 1]
• Serialized metadata size per file: [1244, 1206] Bytes
############################################################
METADATA
############################################################
############################################################
COLUMN DETAILS
############################################################
• Columns:
- id
- field1.subfield2
- name
- occupation
- age
- field1.subfield1
Whenever nested data is added, ParquetDB flattens it, because read and write operations are much faster on flattened data. ParquetDB also provides an option to recover the nested structure during read operations, as sketched below.
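A minimal sketch, assuming the read() option is named rebuild_nested_struct; check the API reference for the exact keyword in your installed version:
# Assumed keyword; verify against the ParquetDB read() API reference
nested_table = db.read(rebuild_nested_struct=True)
print(nested_table.schema)  # field1 should come back as a struct column rather than flattened fields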
Reading Data¶
The db.read() method retrieves data from the ParquetDB. By default, the data is returned as a PyArrow Table.
[4]:
# Read all data
table = db.read()
print("All Employees:")
print(type(table))
print(table)
All Employees:
<class 'pyarrow.lib.Table'>
pyarrow.Table
age: int64
field1.subfield1: string
field1.subfield2: string
id: int64
name: string
occupation: string
----
age: [[28,32],[null]]
field1.subfield1: [[null,null],["value1"]]
field1.subfield2: [[null,null],["value2"]]
id: [[0,1],[2]]
name: [["Charlie","Diana"],["Jimmy"]]
occupation: [["Designer","Product Manager"],[null]]
You can convert this table to a Pandas DataFrame if you prefer. Here, split_blocks and self_destruct are set to True to avoid allocating unnecessary memory during the conversion. You can learn more about the PyArrow Pandas integration here.
[5]:
df = table.to_pandas(split_blocks=True, self_destruct=True)
You can optionally specify filters and columns to read only what you need. This functionality is powered by PyArrow and Parquet metadata, enabling efficient predicate pushdown. To learn more about accepted filter expressions, see the PyArrow documentation.
[6]:
# Read specific columns
df = db.read(columns=["name"]).to_pandas()
print("\nJust the Names:")
print(df)
# Read data with filters
from pyarrow import compute as pc
age_filter = pc.field("age") > 30
df = db.read(filters=[age_filter]).to_pandas()
print("\nEmployees older than 30:")
print(df)
Just the Names:
name
0 Charlie
1 Diana
2 Jimmy
Employees older than 30:
age field1.subfield1 field1.subfield2 id name occupation
0 32 None None 1 Diana Product Manager
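The columns and filters arguments shown above can also be combined in a single call; a minimal sketch:
# Project to two columns and filter in the same read
df = db.read(columns=["name", "age"], filters=[pc.field("age") > 30]).to_pandas()
print(df)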
Updating Data¶
You can update records by calling db.update() with a list of items. Each item to be updated must include its id field. An update can also introduce new fields, as sketched after the example below.
[7]:
update_data = [
{
"id": 1,
"age": 32,
"occupation": "Senior Engineer",
"field1": {"subfield1": "value8", "subfield2": "value9"},
},
{
"id": 2,
"age": 50,
"occupation": "Senior Engineer",
},
]
db.update(update_data)
print("Records updated successfully!")
df = db.read().to_pandas()
print(df)
Records updated successfully!
age field1.subfield1 field1.subfield2 id name occupation
0 28 None None 0 Charlie Designer
1 32 value8 value9 1 Diana Senior Engineer
2 50 value1 value2 2 Jimmy Senior Engineer
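An update can also introduce a field that is not yet in the schema. A minimal sketch, where the department field is purely illustrative and running it adds a column that the outputs below do not show:
# Add a brand-new column for record 0; other records will hold null in it
db.update([{"id": 0, "department": "Design"}])
print(db.read(columns=["id", "name", "department"]).to_pandas())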
Deleting Data¶
Delete by id¶
Remove specific records from the database by specifying their id values.
[8]:
db.delete(ids=[2])
print("Records deleted successfully!")
df = db.read().to_pandas()
print(df)
Records deleted successfully!
age field1.subfield1 field1.subfield2 id name occupation
0 28 None None 0 Charlie Designer
1 32 value8 value9 1 Diana Senior Engineer
Delete by columns¶
You can also delete columns by specifying the column names.
[9]:
db.delete(columns=["field1.subfield1"])
print("Columns deleted successfully!")
df = db.read().to_pandas()
print(df)
Columns deleted successfully!
age field1.subfield2 id name occupation
0 28 None 0 Charlie Designer
1 32 value9 1 Diana Senior Engineer
Delete by filters¶
You can also delete records by specifying filters.
[10]:
db.delete(filters=[pc.field("age") > 30])
print("Records deleted successfully!")
df = db.read().to_pandas()
print(df)
Records deleted successfully!
age field1.subfield2 id name occupation
0 28 None 0 Charlie Designer
Transform Data¶
Updating is a costly operation on very large datasets. If you know ahead of time that you are going to operate on the entire dataset, the transform method is much more efficient: instead of searching for ids, it takes the current dataset and applies the transformation to create a new dataset. In this example, a new column age_bin is created by binning the age column into 10-year buckets.
From the docs, here is the signature of the transform method:
def transform(
self,
transform_callable: Callable[[pa.Table], pa.Table],
new_db_path: Optional[str] = None,
normalize_config: NormalizeConfig = NormalizeConfig(),
) -> Optional["ParquetDB"]:
"""
Transform the entire dataset using a user-provided callable.
This function:
1. Reads the entire dataset as a PyArrow table.
2. Applies the `transform_callable`, which should accept a `pa.Table`
and return another `pa.Table`.
3. Writes out the transformed data:
- Overwrites this ParquetDB in-place (if `new_db_path=None`), or
- Creates a new ParquetDB at `new_db_path` (if `new_db_path!=None`).
[11]:
import pyarrow as pa
import numpy as np
def binning_age_column(table):
    # Work on a pandas copy of the full dataset
    df = table.to_pandas()
    # For each row, build a histogram of the single age value over 10-year bins (0-90)
    df["age_bin"] = df["age"].apply(
        lambda x: np.histogram(x, bins=range(0, 100, 10))[0]
    )
    # Return a PyArrow table so transform can write it back
    return pa.Table.from_pandas(df)
db.transform(binning_age_column)
df = db.read().to_pandas()
print(df)
age field1.subfield2 id name occupation age_bin
0 28 None 0 Charlie Designer [0, 0, 1, 0, 0, 0, 0, 0, 0]
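As the docstring above notes, passing new_db_path writes the transformed data to a new ParquetDB instead of overwriting this one. A minimal sketch, assuming the target directory is free to use (its name is illustrative):
# Write the transformed dataset into a separate database directory
new_path = "ParquetDB_binned"
db.transform(binning_age_column, new_db_path=new_path)
print(ParquetDB(db_path=new_path).read().to_pandas())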