Small Benchmark¶
This notebook runs a small benchmark comparing ParquetDB, PyArrow, and SQLite across comparable read/write/query operations. The benchmark was first conducted by Christopher Körber and adapted by Logan Lang for this notebook.
Benchmark Details¶
Data Generation: Generates datasets from 1 to 1,000,000 rows (powers of ten), each with 100 columns of integers (0–1,000,000). Integers are chosen as a basic primitive type because byte size is the main cost factor, so these results represent a lower bound on performance; more complex or larger types will incur higher cost. A sketch of the generated record layout follows this list.
ParquetDB Normalization (defaults): row-group size 50,000–100,000 rows, max rows per file 10,000,000. Tuning these can shift performance between inserts, reads, and updates.
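For context, here is a minimal sketch of the shape of the generated data. It assumes general_utils.generate_pylist_data returns a list of row dictionaries keyed col0…col99 holding random integers; the helper name generate_rows_sketch is illustrative and not part of ParquetDB.

import numpy as np

def generate_rows_sketch(n_rows, n_columns=100, min_value=0, max_value=1_000_000, prefix="col"):
    # Illustrative stand-in for general_utils.generate_pylist_data (assumed output shape):
    # a list of row dictionaries, one key per column, holding random integers.
    values = np.random.randint(min_value, max_value, size=(n_rows, n_columns))
    return [
        {f"{prefix}{j}": int(values[i, j]) for j in range(n_columns)}
        for i in range(n_rows)
    ]

rows = generate_rows_sketch(n_rows=3)
print(list(rows[0].items())[:2])  # e.g. [('col0', 417101), ('col1', 90218)]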
System Specifications¶
Operating System: Windows 10
Processor: AMD Ryzen 7 3700X 8‑Core @ 3.6 GHz (8 cores, 16 logical processors)
Memory: 128 GB DDR4‑3600 (4×32 GB DIMMs)
Storage: 2 TB SATA HDD (Model: ST2000DM008-2FR102)
Setup¶
[ ]:
!pip install parquetdb
[1]:
import os
import time
import shutil
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
from parquetdb import config
from parquetdb.utils import general_utils
bench_dir = os.path.join(config.data_dir, 'benchmarks')
sqlite_dir = os.path.join(bench_dir, 'sqlite')
pa_dir = os.path.join(bench_dir, 'pyarrow')
pq_dir = os.path.join(bench_dir, 'parquetdb')
for d in (sqlite_dir, pa_dir, pq_dir):
    os.makedirs(d, exist_ok=True)
Test Data¶
[2]:
orders=np.arange(7)
data_dict = {}
col_prefix = "col"
for order in orders:
    data_dict[order] = general_utils.generate_pylist_data(n_rows=10**order, min_value=0, max_value=1_000_000, prefix=col_prefix)
parquetdb_filters = [pa.compute.field(f"{col_prefix}1") > 100, pa.compute.field(f"{col_prefix}97") < 1000]
pyarrow_filter = (pa.compute.field(f"{col_prefix}1") > 100) & (pa.compute.field(f"{col_prefix}97") < 1000)
sql_query = f"{col_prefix}1 > 100 and {col_prefix}97 < 1000"
CREATE_TIMES={
"parquetdb":{"mean":[], "std":[]},
"pyarrow":{"mean":[], "std":[]},
"sqlite":{"mean":[], "std":[]}
}
READ_TIMES={
"parquetdb":{"mean":[], "std":[]},
"pyarrow":{"mean":[], "std":[]},
"sqlite":{"mean":[], "std":[]}
}
QUERY_TIMES={
"parquetdb":{"mean":[], "std":[]},
"pyarrow":{"mean":[], "std":[]},
"sqlite":{"mean":[], "std":[]}
}
Using PyArrow Directly¶
[3]:
pyarrow_dir = os.path.join(pa_dir, "pyarrow")
def pyarrow_benchmark_experiment(data):
    # Start from a clean directory so timings are not affected by leftover files
    if os.path.exists(pyarrow_dir):
        shutil.rmtree(pyarrow_dir)
    os.makedirs(pyarrow_dir, exist_ok=True)

    # Create: build an Arrow table (adding an explicit 'id' column) and write it to Parquet
    create_time = time.time()
    start = 0
    table = pa.Table.from_pylist(data).add_column(0, 'id', [range(start, start + len(data))])
    temp_file_path = os.path.join(pyarrow_dir, "0.parquet")
    pq.write_table(table, temp_file_path)
    create_time = time.time() - create_time
    del table

    # Read: load the full dataset back into memory
    read_time = time.time()
    dataset = ds.dataset(pyarrow_dir, format="parquet")
    table = dataset.to_table(filter=None)
    read_time = time.time() - read_time
    del table
    del dataset

    # Query: read again with a pushed-down filter expression
    query_time = time.time()
    dataset = ds.dataset(pyarrow_dir, format="parquet")
    table = dataset.to_table(filter=pyarrow_filter)
    query_time = time.time() - query_time

    return create_time, read_time, query_time
experiment_dict = {key: {} for key in range(5)}
for run_name, benchmark_dict in experiment_dict.items():
    create_times = []
    read_times = []
    query_times = []
    for order, data in data_dict.items():
        create_time, read_time, query_time = pyarrow_benchmark_experiment(data)
        create_times.append(create_time)
        read_times.append(read_time)
        query_times.append(query_time)
    tmp_dict = {
        "create_times": create_times,
        "read_times": read_times,
        "query_times": query_times
    }
    benchmark_dict[run_name] = tmp_dict

mean_create_times = []
mean_read_times = []
mean_query_times = []
std_create_times = []
std_read_times = []
std_query_times = []
for order, data in data_dict.items():
    tmp_create_times = []
    tmp_read_times = []
    tmp_query_times = []
    for run_name, benchmark_dict in experiment_dict.items():
        tmp_create_times.append(benchmark_dict[run_name]['create_times'][order])
        tmp_read_times.append(benchmark_dict[run_name]['read_times'][order])
        tmp_query_times.append(benchmark_dict[run_name]['query_times'][order])
    mean_create_times.append(float(np.mean(tmp_create_times)))
    mean_read_times.append(float(np.mean(tmp_read_times)))
    mean_query_times.append(float(np.mean(tmp_query_times)))
    std_create_times.append(float(np.std(tmp_create_times)))
    std_read_times.append(float(np.std(tmp_read_times)))
    std_query_times.append(float(np.std(tmp_query_times)))

CREATE_TIMES['pyarrow']["mean"] = mean_create_times
CREATE_TIMES['pyarrow']["std"] = std_create_times
READ_TIMES['pyarrow']["mean"] = mean_read_times
READ_TIMES['pyarrow']["std"] = std_read_times
QUERY_TIMES['pyarrow']["mean"] = mean_query_times
QUERY_TIMES['pyarrow']["std"] = std_query_times
[4]:
print("Create Times:")
print(CREATE_TIMES['pyarrow']["mean"])
print("Read Times:")
print(READ_TIMES['pyarrow']["mean"])
print("Query Times:")
print(QUERY_TIMES['pyarrow']["mean"])
Create Times:
[0.03320021629333496, 0.014103555679321289, 0.009400510787963867, 0.03350682258605957, 0.471631383895874, 7.403422498703003, 70.13993415832519]
Read Times:
[0.014600133895874024, 0.015394306182861328, 0.012600994110107422, 0.012792396545410156, 0.01582474708557129, 0.05335421562194824, 0.35906219482421875]
Query Times:
[0.005000877380371094, 0.004373693466186523, 0.004302406311035156, 0.006783628463745117, 0.009602928161621093, 0.04960236549377441, 0.30310521125793455]
SQLite¶
[5]:
import sqlite3
import traceback
import sys
sqlite_db = os.path.join(sqlite_dir, "benchmark.sqlite")

def sqlite_benchmark_experiment(data):
    # Start from a clean database file for each run
    if os.path.exists(sqlite_db):
        os.remove(sqlite_db)

    # Flatten the list of row dictionaries into tuples for executemany
    sql_data_keys = data[0].keys()
    sql_data_values = []
    for record in data:
        sql_data_values.append(tuple(record.values()))
    n_cols = len(data[0])

    # Create: build the table and bulk-insert all rows
    try:
        create_time = time.time()
        conn = sqlite3.connect(sqlite_db)
        cursor = conn.cursor()
        cursor.execute("PRAGMA synchronous = OFF")
        cursor.execute("PRAGMA journal_mode = MEMORY")
        cols = ', '.join(f'{data_key} INTEGER' for data_key in sql_data_keys)
        conn.execute(f'CREATE TABLE benchmark ({cols})')
        placeholders = ', '.join('?' for _ in range(n_cols))
        sql = f'INSERT INTO benchmark VALUES ({placeholders})'
        cursor.executemany(sql, sql_data_values)
        conn.commit()
        create_time = time.time() - create_time
    except Exception:
        print(traceback.format_exc())
        traceback.print_tb(sys.exc_info()[2])
    conn.close()

    # Read: full table scan, fetching every row into Python
    try:
        read_time = time.time()
        conn = sqlite3.connect(sqlite_db)
        cursor = conn.cursor()
        cursor.execute("SELECT * FROM benchmark")
        results = cursor.fetchall()
        read_time = time.time() - read_time
    except Exception:
        print(traceback.format_exc())
        traceback.print_tb(sys.exc_info()[2])
    conn.close()

    # Query: filtered scan using the same predicate as the Arrow-based benchmarks
    try:
        query_time = time.time()
        conn = sqlite3.connect(sqlite_db)
        cursor = conn.cursor()
        cursor.execute(f"SELECT * FROM benchmark WHERE {sql_query}")
        results = cursor.fetchall()
        query_time = time.time() - query_time
    except Exception:
        print(traceback.format_exc())
        traceback.print_tb(sys.exc_info()[2])
    conn.close()

    return create_time, read_time, query_time
experiment_dict = {key: {} for key in range(5)}
for run_name, benchmark_dict in experiment_dict.items():
    create_times = []
    read_times = []
    query_times = []
    for order, data in data_dict.items():
        create_time, read_time, query_time = sqlite_benchmark_experiment(data)
        create_times.append(create_time)
        read_times.append(read_time)
        query_times.append(query_time)
    tmp_dict = {
        "create_times": create_times,
        "read_times": read_times,
        "query_times": query_times
    }
    benchmark_dict[run_name] = tmp_dict

mean_create_times = []
mean_read_times = []
mean_query_times = []
std_create_times = []
std_read_times = []
std_query_times = []
for order, data in data_dict.items():
    tmp_create_times = []
    tmp_read_times = []
    tmp_query_times = []
    for run_name, benchmark_dict in experiment_dict.items():
        tmp_create_times.append(benchmark_dict[run_name]['create_times'][order])
        tmp_read_times.append(benchmark_dict[run_name]['read_times'][order])
        tmp_query_times.append(benchmark_dict[run_name]['query_times'][order])
    mean_create_times.append(float(np.mean(tmp_create_times)))
    mean_read_times.append(float(np.mean(tmp_read_times)))
    mean_query_times.append(float(np.mean(tmp_query_times)))
    std_create_times.append(float(np.std(tmp_create_times)))
    std_read_times.append(float(np.std(tmp_read_times)))
    std_query_times.append(float(np.std(tmp_query_times)))

CREATE_TIMES['sqlite']["mean"] = mean_create_times
CREATE_TIMES['sqlite']["std"] = std_create_times
READ_TIMES['sqlite']["mean"] = mean_read_times
READ_TIMES['sqlite']["std"] = std_read_times
QUERY_TIMES['sqlite']["mean"] = mean_query_times
QUERY_TIMES['sqlite']["std"] = std_query_times
[6]:
print("Create Times:")
print(CREATE_TIMES['sqlite']["mean"])
print("Read Times:")
print(READ_TIMES['sqlite']["mean"])
print("Query Times:")
print(QUERY_TIMES['sqlite']["mean"])
Create Times:
[0.0010138988494873048, 0.0008130550384521484, 0.0011197566986083985, 0.008905124664306641, 0.07463836669921875, 0.736366605758667, 7.4553422927856445]
Read Times:
[0.0013258934020996093, 0.0015575885772705078, 0.0031287670135498047, 0.01792774200439453, 0.16240925788879396, 1.6550938606262207, 16.552225065231323]
Query Times:
[0.00040950775146484373, 0.00030159950256347656, 0.0005053997039794922, 0.001902294158935547, 0.016704320907592773, 0.18021435737609864, 1.7924356937408448]
ParquetDB¶
[9]:
from parquetdb import ParquetDB
parquetdb_dir = os.path.join(pq_dir, "parquetdb", "BenchmarkDB")

def parquetdb_benchmark_experiment(data):
    # Brief pause, then start from a clean database directory for each run
    time.sleep(1)
    if os.path.exists(parquetdb_dir):
        shutil.rmtree(parquetdb_dir)
    os.makedirs(parquetdb_dir, exist_ok=True)
    db = ParquetDB(db_path=parquetdb_dir)

    # Create: insert all records into the database
    create_time = time.time()
    db.create(data)
    create_time = time.time() - create_time

    # Read: load the full table back into memory
    read_time = time.time()
    table = db.read(filters=None)
    read_time = time.time() - read_time
    del table

    # Query: read again with the filter expressions applied
    query_time = time.time()
    table = db.read(filters=parquetdb_filters)
    query_time = time.time() - query_time

    return create_time, read_time, query_time
experiment_dict = {key: {} for key in range(5)}
for run_name, benchmark_dict in experiment_dict.items():
    create_times = []
    read_times = []
    query_times = []
    for order, data in data_dict.items():
        create_time, read_time, query_time = parquetdb_benchmark_experiment(data)
        create_times.append(create_time)
        read_times.append(read_time)
        query_times.append(query_time)
    tmp_dict = {
        "create_times": create_times,
        "read_times": read_times,
        "query_times": query_times
    }
    benchmark_dict[run_name] = tmp_dict

mean_create_times = []
mean_read_times = []
mean_query_times = []
std_create_times = []
std_read_times = []
std_query_times = []
for order, data in data_dict.items():
    tmp_create_times = []
    tmp_read_times = []
    tmp_query_times = []
    for run_name, benchmark_dict in experiment_dict.items():
        tmp_create_times.append(benchmark_dict[run_name]['create_times'][order])
        tmp_read_times.append(benchmark_dict[run_name]['read_times'][order])
        tmp_query_times.append(benchmark_dict[run_name]['query_times'][order])
    mean_create_times.append(float(np.mean(tmp_create_times)))
    mean_read_times.append(float(np.mean(tmp_read_times)))
    mean_query_times.append(float(np.mean(tmp_query_times)))
    std_create_times.append(float(np.std(tmp_create_times)))
    std_read_times.append(float(np.std(tmp_read_times)))
    std_query_times.append(float(np.std(tmp_query_times)))

CREATE_TIMES['parquetdb']["mean"] = mean_create_times
CREATE_TIMES['parquetdb']["std"] = std_create_times
READ_TIMES['parquetdb']["mean"] = mean_read_times
READ_TIMES['parquetdb']["std"] = std_read_times
QUERY_TIMES['parquetdb']["mean"] = mean_query_times
QUERY_TIMES['parquetdb']["std"] = std_query_times
[24]:
print("Create Times:")
print(CREATE_TIMES['parquetdb']["mean"])
print("Read Times:")
print(READ_TIMES['parquetdb']["mean"])
print("Query Times:")
print(QUERY_TIMES['parquetdb']["mean"])
Create Times:
[0.07000021934509278, 0.1691793441772461, 0.06844029426574708, 0.09826879501342774, 0.4450040340423584, 3.8328524112701414, 32.39931697845459]
Read Times:
[0.019400930404663085, 0.015804147720336913, 0.016584157943725586, 0.01441502571105957, 0.018503570556640626, 0.10959677696228028, 0.3610626220703125]
Query Times:
[0.007051420211791992, 0.0063992500305175785, 0.006599617004394531, 0.008312273025512695, 0.012049388885498048, 0.04615607261657715, 0.7494649410247802]
Plotting results¶
[17]:
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from mpl_toolkits.axes_grid1.inset_locator import inset_axes
def plotting_experiments(benchmark_times: dict):
    plt.rcParams.update({
        'axes.labelsize': 18, 'axes.titlesize': 18,
        'xtick.labelsize': 14, 'ytick.labelsize': 14
    })
    fig, axes = plt.subplots(figsize=(10, 6))

    # Colors and marker styles (one color per library)
    colors = ["#e52207", "#e5a000", "#59b9de"]
    linestyle = "solid"
    markerstyle = "o"
    marker_fill = "none"
    n_rows = [10**order for order in orders]

    # Main (linear-scale) plot: mean times with standard-deviation error bars
    for i, (label, time_dict) in enumerate(benchmark_times.items()):
        mean_times = time_dict['mean']
        std_times = time_dict['std']
        axes.plot(
            n_rows,
            mean_times,
            label=label,
            color=colors[i],
            linestyle=linestyle,
            marker=markerstyle,
            fillstyle=marker_fill,
        )
        axes.errorbar(
            n_rows,
            mean_times,
            yerr=std_times,
            fmt='none',  # no line connecting error bars
            ecolor=colors[i],
            elinewidth=1.5,
            capsize=3
        )

    # Inset (log-log) plot of the same data
    scale = 36
    ax_inset = inset_axes(
        axes,
        width=f"{scale}%",
        height=f"{scale}%",
        loc="upper left",
        bbox_to_anchor=(0.05, -0.03, 1, 1),
        bbox_transform=axes.transAxes,
        borderpad=2,
    )
    for i, (label, time_dict) in enumerate(benchmark_times.items()):
        mean_times = time_dict['mean']
        std_times = time_dict['std']
        ax_inset.plot(
            n_rows,
            mean_times,
            label=label,
            color=colors[i],
            markersize=8,
            linestyle=linestyle,
            marker=markerstyle,
            fillstyle=marker_fill,
        )
        ax_inset.errorbar(
            n_rows,
            mean_times,
            yerr=std_times,
            fmt='none',  # no line connecting error bars
            ecolor=colors[i],
            elinewidth=1.5,
            capsize=3
        )

    # Axis styling for the main plot
    axes.set_xlabel("Number of Rows")
    axes.spines["left"].set_linestyle(linestyle)
    axes.spines["left"].set_linewidth(2.5)
    axes.spines["right"].set_visible(False)
    axes.tick_params(axis="both", which="major", length=10, width=2, direction="out")
    axes.grid(True)

    # Axis styling for the log-log inset
    ax_inset.grid(True)
    ax_inset.set_xscale("log")
    ax_inset.set_yscale("log")
    ax_inset.set_xlabel("Number of Rows (log)", fontsize=8)
    maj = ticker.LogLocator(numticks=9)
    minr = ticker.LogLocator(subs="all", numticks=9)
    ax_inset.xaxis.set_major_locator(maj)
    ax_inset.xaxis.set_minor_locator(minr)
    ax_inset.spines["left"].set_linestyle(linestyle)
    ax_inset.spines["left"].set_linewidth(2.5)
    ax_inset.spines["right"].set_visible(False)
    ax_inset.tick_params(axis="both", which="major", length=6, width=1.5, direction="out")
    ax_inset.tick_params(axis="x", which="minor", length=3, width=1, direction="out")
    ax_inset.tick_params(axis="y", which="minor", length=3, width=1, direction="out")

    lines1, labels1 = axes.get_legend_handles_labels()
    axes.legend(lines1, labels1, loc="upper center", bbox_to_anchor=(0.15, 0, 1, 1))
    return axes, ax_inset
Create Times¶
[18]:
axes,ax_inset=plotting_experiments(benchmark_times=CREATE_TIMES)
axes.set_title("Benchmark for Create Times")
axes.set_ylabel("Create Times (s)")
ax_inset.set_ylabel("Create Time (log)", fontsize=8, labelpad=-2)
plt.tight_layout()
plt.show()
[Figure: Benchmark for Create Times — mean create time vs. number of rows, with a log–log inset]
Read Times¶
[22]:
axes,ax_inset= plotting_experiments(benchmark_times=READ_TIMES)
axes.set_title("Benchmark for Create Times")
axes.set_ylabel("Read Times (s)")
ax_inset.set_ylabel("Read Time (log)", fontsize=8, labelpad=-2)
plt.tight_layout()
plt.show()
[Figure: Benchmark for Read Times — mean read time vs. number of rows, with a log–log inset]
Query Times¶
[23]:
axes,ax_inset= plotting_experiments(benchmark_times=QUERY_TIMES)
axes.set_title("Benchmark for Query Times")
axes.set_ylabel("Query Times (s)")
ax_inset.set_ylabel("Query Time (log)", fontsize=8, labelpad=-2)
plt.tight_layout()
plt.show()
[Figure: Benchmark for Query Times — mean query time vs. number of rows, with a log–log inset]
Discussion¶
Create Times
SQLite scales best for raw inserts: its lightweight B‑tree writer in C outpaces both columnar solutions at every scale (≤ 1 M rows).
ParquetDB is next.
PyArrow has the worst create performance: this can mainly be attributed to a difference in how the id indices are generated (the raw PyArrow path builds its id column from a Python range on every write).
Read Times
ParquetDB & PyArrow are effectively identical: both return zero‑copy Arrow tables in ~0.1 s for 1 M rows, thanks to their native columnar layout and batch I/O.
SQLite is ~40 × slower on full table scans, since it must fetch each row via a cursor and append into Python lists in a tight loop.
Query Times
ParquetDB & PyArrow again match each other closely: filtering 1 M rows takes ~0.3–0.4 s, as Arrow applies vectorized predicates.
SQLite requires ~1.8 s for the same filter, because every row check is a separate Python‑C transition and Python boolean append.
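As a quick sanity check, the following minimal sketch recomputes the relative slowdowns from the 1,000,000-row entries of the printed mean timings above (values rounded):

# Mean timings at 1,000,000 rows, copied from the printed results above (seconds)
read_1m  = {"pyarrow": 0.359, "parquetdb": 0.361, "sqlite": 16.55}
query_1m = {"pyarrow": 0.303, "parquetdb": 0.749, "sqlite": 1.79}

print(f"SQLite full-scan slowdown vs PyArrow:        ~{read_1m['sqlite'] / read_1m['pyarrow']:.0f}x")
print(f"SQLite filtered-query slowdown vs PyArrow:   ~{query_1m['sqlite'] / query_1m['pyarrow']:.1f}x")
print(f"SQLite filtered-query slowdown vs ParquetDB: ~{query_1m['sqlite'] / query_1m['parquetdb']:.1f}x")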
Developer experience & boilerplate
A raw PyArrow workflow requires explicit directory management, manual ID generation, repeated calls to pa.Table.from_pylist() and pq.write_table(), and rebuilding the dataset for each operation; this boilerplate can be tedious to write, maintain, and debug. ParquetDB abstracts all of that away: you simply call db.create(data), db.read(), or db.read(filters=…). It manages file layouts, row‑group boundaries, indexing, and state under the hood, reducing cognitive load and speeding up development. A minimal side‑by‑side sketch follows.
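The comparison below is a minimal sketch of the two workflows, using only the calls already shown in this notebook; the file path, database directory name, and toy records are illustrative.

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
from parquetdb import ParquetDB

records = [{"col0": 1, "col1": 250}, {"col0": 7, "col1": 42}]  # toy data

# Raw PyArrow: build the table, assign ids, write, then rebuild the dataset to read back
table = pa.Table.from_pylist(records).add_column(0, "id", [range(len(records))])
pq.write_table(table, "demo.parquet")
dataset = ds.dataset("demo.parquet", format="parquet")
filtered = dataset.to_table(filter=pa.compute.field("col1") > 100)

# ParquetDB: the same round trip through a single object
db = ParquetDB(db_path="DemoDB")
db.create(records)
filtered = db.read(filters=[pa.compute.field("col1") > 100])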
Key takeaways
ParquetDB’s performance is dominated by the underlying Arrow I/O path—it inherits PyArrow’s blazing read/query speed, with only a small additional cost on table creation.
For write‑heavy workloads up to ~1 M rows, SQLite still leads on raw insert throughput.
For analytics‑style workloads (full scans or vectorized filters), the columnar engines (ParquetDB/PyArrow) deliver an order of magnitude better performance by avoiding Python‑level loops.
For developer productivity, ParquetDB’s simple API can save hours of boilerplate code and eliminate error‑prone state management.