Small Benchmark

This notebook runs a small benchmark for ParquetDB, PyArrow, and SQLite across comparable read/write/query operations. The benchmark was first conducted by Christopher Körber and adapted by Logan Lang for this notebook.

Benchmark Details

  • Data Generation: Generates datasets of 1 to 1,000,000 rows (powers of 10) × 100 columns of integers (0–1,000,000). Integers are chosen as a basic primitive type, where byte size is the main cost factor, so these results represent a lower bound; more complex or larger types will incur higher cost.

  • ParquetDB Normalization (defaults): row-group size 50,000–100,000 rows, max rows per file 10,000,000. Tuning these can shift performance between inserts, reads, and updates (a short row-group illustration follows this list).

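Row-group size is a property of the Parquet files that ParquetDB writes. The short sketch below uses plain PyArrow (not ParquetDB's own normalization API) to show what the row-group knob controls; the file name and sizes here are made up for illustration.

[ ]:
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# A small table standing in for one benchmark dataset.
table = pa.table({
    "col0": np.random.randint(0, 1_000_000, size=1_000),
    "col1": np.random.randint(0, 1_000_000, size=1_000),
})

# row_group_size caps the rows per row group: smaller groups let
# selective queries skip more data, larger groups favor full scans.
pq.write_table(table, "row_group_demo.parquet", row_group_size=500)

print(pq.ParquetFile("row_group_demo.parquet").num_row_groups)  # 2
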
System Specifications

  • Operating System: Windows 10

  • Processor: AMD Ryzen 7 3700X 8‑Core @ 3.6 GHz (8 cores, 16 logical processors)

  • Memory: 128 GB DDR4‑3600 MHz (4×32 GB DIMMs)

  • Storage: SATA HDD 2TB (Model: ST2000DM008-2FR102)

Setup

[ ]:
!pip install parquetdb
[1]:
import os
import time
import shutil

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

from parquetdb import config
from parquetdb.utils import general_utils

bench_dir = os.path.join(config.data_dir, 'benchmarks')
sqlite_dir = os.path.join(bench_dir, 'sqlite')
pa_dir = os.path.join(bench_dir, 'pyarrow')
pq_dir = os.path.join(bench_dir, 'parquetdb')

for d in (sqlite_dir, pa_dir, pq_dir):
    os.makedirs(d, exist_ok=True)

Test Data

[2]:
orders=np.arange(7)
data_dict = {}

col_prefix = "col"
for order in orders:
    data_dict[order] = general_utils.generate_pylist_data(n_rows=10**order, min_value=0, max_value=1_000_000, prefix=col_prefix)

parquetdb_filters = [pa.compute.field(f"{col_prefix}1") > 100, pa.compute.field(f"{col_prefix}97") < 1000]
pyarrow_filter = (pa.compute.field(f"{col_prefix}1") > 100) & (pa.compute.field(f"{col_prefix}97") < 1000)
sql_query = f"{col_prefix}1 > 100 and {col_prefix}97 < 1000"

CREATE_TIMES={
    "parquetdb":{"mean":[], "std":[]},
    "pyarrow":{"mean":[], "std":[]},
    "sqlite":{"mean":[], "std":[]}
}

READ_TIMES={
    "parquetdb":{"mean":[], "std":[]},
    "pyarrow":{"mean":[], "std":[]},
    "sqlite":{"mean":[], "std":[]}
}

QUERY_TIMES={
    "parquetdb":{"mean":[], "std":[]},
    "pyarrow":{"mean":[], "std":[]},
    "sqlite":{"mean":[], "std":[]}
}

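Each entry of data_dict is a plain Python list of row dictionaries, which is the shape pa.Table.from_pylist expects below. A minimal stand-in for generate_pylist_data, assuming that record shape, might look like this (make_records is a hypothetical helper, not part of parquetdb):

[ ]:
import numpy as np

def make_records(n_rows, n_cols=100, prefix="col", low=0, high=1_000_000):
    # Build a list of row dicts: [{"col0": ..., "col1": ..., ...}, ...]
    values = np.random.randint(low, high, size=(n_rows, n_cols))
    return [
        {f"{prefix}{j}": int(values[i, j]) for j in range(n_cols)}
        for i in range(n_rows)
    ]

sample = make_records(3)
print(list(sample[0])[:3])  # ['col0', 'col1', 'col2']
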
Using PyArrow Directly

[3]:
pyarrow_dir = os.path.join(pa_dir, "pyarrow")

def pyarrow_benchmark_experiment(data):
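    """Time create (Parquet write), full read, and filtered query using plain PyArrow."""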

    if os.path.exists(pyarrow_dir):
        shutil.rmtree(pyarrow_dir)
    os.makedirs(pyarrow_dir, exist_ok=True)


    create_time = time.time()
    start = 0
    table = pa.Table.from_pylist(data).add_column(0, 'id', [range(start, start + len(data))])
    temp_file_path=os.path.join(pyarrow_dir, "0.parquet")
    pq.write_table(table, temp_file_path)
    create_time=time.time() - create_time
    del table

    read_time = time.time()
    dataset = ds.dataset(pyarrow_dir, format="parquet")
    table=dataset.to_table(filter=None)
    read_time = time.time() - read_time
    del table
    del dataset

    query_time = time.time()
    dataset = ds.dataset(pyarrow_dir, format="parquet")
    table=dataset.to_table(filter=pyarrow_filter)
    query_time = time.time() - query_time

    return create_time, read_time, query_time


experiment_dict = { key: {} for key in range(5)}
for run_name, benchmark_dict in experiment_dict.items():
    tmp_dict={}

    create_times = []
    read_times = []
    query_times=[]
    for order, data in data_dict.items():
        create_time, read_time, query_time = pyarrow_benchmark_experiment(data)
        create_times.append(create_time)
        read_times.append(read_time)
        query_times.append(query_time)

    tmp_dict = {
        "create_times": create_times,
        "read_times": read_times,
        "query_times": query_times
    }
    benchmark_dict[run_name] = tmp_dict

mean_create_times=[]
mean_read_times=[]
mean_query_times=[]

std_create_times=[]
std_read_times=[]
std_query_times=[]
for order, data in data_dict.items():
    tmp_create_times=[]
    tmp_read_times=[]
    tmp_query_times=[]
    for run_name, benchmark_dict in experiment_dict.items():
        tmp_create_times.append(benchmark_dict[run_name]['create_times'][order])
        tmp_read_times.append(benchmark_dict[run_name]['read_times'][order])
        tmp_query_times.append(benchmark_dict[run_name]['query_times'][order])
    mean_create_times.append(float(np.mean(tmp_create_times)))
    mean_read_times.append(float(np.mean(tmp_read_times)))
    mean_query_times.append(float(np.mean(tmp_query_times)))
    std_create_times.append(float(np.std(tmp_create_times)))
    std_read_times.append(float(np.std(tmp_read_times)))
    std_query_times.append(float(np.std(tmp_query_times)))


CREATE_TIMES['pyarrow']["mean"] = mean_create_times
CREATE_TIMES['pyarrow']["std"] = std_create_times

READ_TIMES['pyarrow']["mean"] = mean_read_times
READ_TIMES['pyarrow']["std"] = std_read_times

QUERY_TIMES['pyarrow']["mean"] = mean_query_times
QUERY_TIMES['pyarrow']["std"] = std_query_times

[4]:
print("Create Times:")
print(CREATE_TIMES['pyarrow']["mean"])
print("Read Times:")
print(READ_TIMES['pyarrow']["mean"])
print("Query Times:")
print(QUERY_TIMES['pyarrow']["mean"])
Create Times:
[0.03320021629333496, 0.014103555679321289, 0.009400510787963867, 0.03350682258605957, 0.471631383895874, 7.403422498703003, 70.13993415832519]
Read Times:
[0.014600133895874024, 0.015394306182861328, 0.012600994110107422, 0.012792396545410156, 0.01582474708557129, 0.05335421562194824, 0.35906219482421875]
Query Times:
[0.005000877380371094, 0.004373693466186523, 0.004302406311035156, 0.006783628463745117, 0.009602928161621093, 0.04960236549377441, 0.30310521125793455]

SQLite

[5]:
import sqlite3
import traceback
import sys
sqlite_db = os.path.join(sqlite_dir, "benchmark.sqlite")

def sqlite_benchmark_experiment(data):
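    """Time create (bulk INSERT), full SELECT, and filtered SELECT against a fresh SQLite file."""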
    if os.path.exists(sqlite_db): os.remove(sqlite_db)

    sql_data_keys = data[0].keys()
    sql_data_values=[]
    for record in data:
        sql_data_values.append(tuple(record.values()))
    n_cols = len(data[0])

    try:
        create_time = time.time()
        conn = sqlite3.connect(sqlite_db)
        cursor = conn.cursor()
        cursor.execute("PRAGMA synchronous = OFF")
        cursor.execute("PRAGMA journal_mode = MEMORY")
        cols = ', '.join(f'{data_key} INTEGER' for data_key in sql_data_keys)

        conn.execute(f'CREATE TABLE benchmark ({cols})')
        placeholders = ', '.join('?' for _ in range(n_cols))
        sql= f'INSERT INTO benchmark VALUES ({placeholders})'

        cursor.executemany(sql, sql_data_values)
        conn.commit()
        create_time=time.time() - create_time
    except Exception as e:
        # Get the traceback information as a formatted string
        tb_str = traceback.format_exc()
        print(tb_str)
        # Or, get the traceback object and work with it directly
        tb = sys.exc_info()[2]
        traceback.print_tb(tb)
    conn.close()


    try:
        read_time = time.time()
        conn = sqlite3.connect(sqlite_db)
        cursor = conn.cursor()
        cursor.execute("SELECT * FROM benchmark")
        results = cursor.fetchall()
        read_time = time.time() - read_time

    except Exception as e:
        # Get the traceback information as a formatted string
        tb_str = traceback.format_exc()
        print(tb_str)
        # Or, get the traceback object and work with it directly
        tb = sys.exc_info()[2]
        traceback.print_tb(tb)
    conn.close()

    try:
        query_time = time.time()
        conn = sqlite3.connect(sqlite_db)
        cursor = conn.cursor()
        cursor.execute(f"SELECT * FROM benchmark WHERE {sql_query}")
        results = cursor.fetchall()
        query_time = time.time() - query_time
    except Exception as e:
        # Get the traceback information as a formatted string
        tb_str = traceback.format_exc()
        print(tb_str)
        # Or, get the traceback object and work with it directly
        tb = sys.exc_info()[2]
        traceback.print_tb(tb)
    conn.close()


    return create_time, read_time, query_time


experiment_dict = { key: {} for key in range(5)}
for run_name, benchmark_dict in experiment_dict.items():
    tmp_dict={}

    create_times = []
    read_times = []
    query_times=[]
    for order, data in data_dict.items():
        create_time, read_time, query_time = sqlite_benchmark_experiment(data)
        create_times.append(create_time)
        read_times.append(read_time)
        query_times.append(query_time)

    tmp_dict = {
        "create_times": create_times,
        "read_times": read_times,
        "query_times": query_times
    }
    benchmark_dict[run_name] = tmp_dict

mean_create_times=[]
mean_read_times=[]
mean_query_times=[]

std_create_times=[]
std_read_times=[]
std_query_times=[]
for order, data in data_dict.items():
    tmp_create_times=[]
    tmp_read_times=[]
    tmp_query_times=[]
    for run_name, benchmark_dict in experiment_dict.items():
        tmp_create_times.append(benchmark_dict[run_name]['create_times'][order])
        tmp_read_times.append(benchmark_dict[run_name]['read_times'][order])
        tmp_query_times.append(benchmark_dict[run_name]['query_times'][order])

    mean_create_times.append(float(np.mean(tmp_create_times)))
    mean_read_times.append(float(np.mean(tmp_read_times)))
    mean_query_times.append(float(np.mean(tmp_query_times)))
    std_create_times.append(float(np.std(tmp_create_times)))
    std_read_times.append(float(np.std(tmp_read_times)))
    std_query_times.append(float(np.std(tmp_query_times)))


CREATE_TIMES['sqlite']["mean"] = mean_create_times
CREATE_TIMES['sqlite']["std"] = std_create_times

READ_TIMES['sqlite']["mean"] = mean_read_times
READ_TIMES['sqlite']["std"] = std_read_times

QUERY_TIMES['sqlite']["mean"] = mean_query_times
QUERY_TIMES['sqlite']["std"] = std_query_times
[6]:
print("Create Times:")
print(CREATE_TIMES['sqlite']["mean"])
print("Read Times:")
print(READ_TIMES['sqlite']["mean"])
print("Query Times:")
print(QUERY_TIMES['sqlite']["mean"])
Create Times:
[0.0010138988494873048, 0.0008130550384521484, 0.0011197566986083985, 0.008905124664306641, 0.07463836669921875, 0.736366605758667, 7.4553422927856445]
Read Times:
[0.0013258934020996093, 0.0015575885772705078, 0.0031287670135498047, 0.01792774200439453, 0.16240925788879396, 1.6550938606262207, 16.552225065231323]
Query Times:
[0.00040950775146484373, 0.00030159950256347656, 0.0005053997039794922, 0.001902294158935547, 0.016704320907592773, 0.18021435737609864, 1.7924356937408448]

ParquetDB

[9]:
from parquetdb import ParquetDB

parquetdb_dir = os.path.join(pq_dir, "parquetdb", "BenchmarkDB")

def parquetdb_benchmark_experiment(data):
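    """Time create, full read, and filtered read against a fresh ParquetDB database."""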

    time.sleep(1)
    if os.path.exists(parquetdb_dir):
        shutil.rmtree(parquetdb_dir)
    os.makedirs(parquetdb_dir, exist_ok=True)

    db = ParquetDB(db_path=parquetdb_dir)


    create_time = time.time()
    db.create(data)
    create_time=time.time() - create_time


    read_time = time.time()
    table=db.read(filters=None)
    read_time = time.time() - read_time
    del table


    query_time = time.time()
    table=db.read(filters=parquetdb_filters)
    query_time = time.time() - query_time

    return create_time, read_time, query_time





experiment_dict = { key: {} for key in range(5)}
for run_name, benchmark_dict in experiment_dict.items():
    tmp_dict={}

    create_times = []
    read_times = []
    query_times=[]
    for order, data in data_dict.items():
        create_time, read_time, query_time = parquetdb_benchmark_experiment(data)
        create_times.append(create_time)
        read_times.append(read_time)
        query_times.append(query_time)

    tmp_dict = {
        "create_times": create_times,
        "read_times": read_times,
        "query_times": query_times
    }
    benchmark_dict[run_name] = tmp_dict

mean_create_times=[]
mean_read_times=[]
mean_query_times=[]

std_create_times=[]
std_read_times=[]
std_query_times=[]
for order, data in data_dict.items():
    tmp_create_times=[]
    tmp_read_times=[]
    tmp_query_times=[]
    for run_name, benchmark_dict in experiment_dict.items():
        tmp_create_times.append(benchmark_dict[run_name]['create_times'][order])
        tmp_read_times.append(benchmark_dict[run_name]['read_times'][order])
        tmp_query_times.append(benchmark_dict[run_name]['query_times'][order])

    mean_create_times.append(float(np.mean(tmp_create_times)))
    mean_read_times.append(float(np.mean(tmp_read_times)))
    mean_query_times.append(float(np.mean(tmp_query_times)))
    std_create_times.append(float(np.std(tmp_create_times)))
    std_read_times.append(float(np.std(tmp_read_times)))
    std_query_times.append(float(np.std(tmp_query_times)))


CREATE_TIMES['parquetdb']["mean"] = mean_create_times
CREATE_TIMES['parquetdb']["std"] = std_create_times

READ_TIMES['parquetdb']["mean"] = mean_read_times
READ_TIMES['parquetdb']["std"] = std_read_times

QUERY_TIMES['parquetdb']["mean"] = mean_query_times
QUERY_TIMES['parquetdb']["std"] = std_query_times
[24]:
print("Create Times:")
print(CREATE_TIMES['parquetdb']["mean"])
print("Read Times:")
print(READ_TIMES['parquetdb']["mean"])
print("Query Times:")
print(QUERY_TIMES['parquetdb']["mean"])
Create Times:
[0.07000021934509278, 0.1691793441772461, 0.06844029426574708, 0.09826879501342774, 0.4450040340423584, 3.8328524112701414, 32.39931697845459]
Read Times:
[0.019400930404663085, 0.015804147720336913, 0.016584157943725586, 0.01441502571105957, 0.018503570556640626, 0.10959677696228028, 0.3610626220703125]
Query Times:
[0.007051420211791992, 0.0063992500305175785, 0.006599617004394531, 0.008312273025512695, 0.012049388885498048, 0.04615607261657715, 0.7494649410247802]

Plotting Results

[17]:
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from mpl_toolkits.axes_grid1.inset_locator import inset_axes

def plotting_experiments(benchmark_times: dict):
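    """Plot mean times vs. row count with std error bars, plus a log-log inset axes."""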

    plt.rcParams.update({
        'axes.labelsize': 18, 'axes.titlesize': 18,
        'xtick.labelsize': 14, 'ytick.labelsize': 14
    })

    fig, axes = plt.subplots(figsize=(10, 6))


    # colors, styles
    colors = ["#e52207", "#e5a000","#59b9de"]
    linestyle="solid"
    markerstyle="o"
    marker_fill="none"


    n_rows = [10**order for order in orders]
    for i, (label, time_dict)  in enumerate(benchmark_times.items()):
        mean_times=time_dict['mean']
        std_times=time_dict['std']
        axes.plot(
            n_rows,
            mean_times,
            label=label,
            color=colors[i],
            linestyle=linestyle,
            marker=markerstyle,
            fillstyle=marker_fill,
        )
        # Add error bars for standard deviation
        axes.errorbar(
            n_rows,
            mean_times,
            yerr=std_times,
            fmt='none',  # No line connecting error bars
            ecolor=colors[i],
            elinewidth=1.5,
            capsize=3
        )


    scale = 36
    ax_inset = inset_axes(
        axes,
        width=f"{scale}%",
        height=f"{scale}%",
        loc="upper left",
        bbox_to_anchor=(0.05, -0.03, 1, 1),
        bbox_transform=axes.transAxes,
        borderpad=2,
    )

    for i, (label, time_dict)  in enumerate(benchmark_times.items()):
        mean_times=time_dict['mean']
        std_times=time_dict['std']
        ax_inset.plot(
            n_rows,
            mean_times,
            label=label,
            color=colors[i],
            markersize=8,
            linestyle=linestyle,
            marker=markerstyle,
            fillstyle=marker_fill,
        )

        axes.errorbar(
            n_rows,
            mean_times,
            yerr=std_times,
            fmt='none',  # No line connecting error bars
            ecolor=colors[i],
            elinewidth=1.5,
            capsize=3
        )

    axes.set_xlabel("Number of Rows")

    axes.spines["left"].set_linestyle(linestyle)
    axes.spines["left"].set_linewidth(2.5)
    axes.spines["right"].set_visible(False)
    axes.tick_params(axis="both", which="major", length=10, width=2, direction="out")
    axes.grid(True)


    ax_inset.grid(True)
    ax_inset.set_xscale("log")
    ax_inset.set_yscale("log")
    ax_inset.set_xlabel("Number of Rows (log)", fontsize=8)



    maj = ticker.LogLocator(numticks=9)
    minr = ticker.LogLocator(subs="all", numticks=9)
    ax_inset.xaxis.set_major_locator(maj)
    ax_inset.xaxis.set_minor_locator(minr)
    ax_inset.spines["left"].set_linestyle(linestyle)
    ax_inset.spines["left"].set_linewidth(2.5)

    ax_inset.spines["right"].set_visible(False)

    ax_inset.tick_params(axis="both", which="major", length=6, width=1.5, direction="out")
    ax_inset.tick_params(axis="x", which="minor", length=3, width=1, direction="out")
    ax_inset.tick_params(axis="y", which="minor", length=3, width=1, direction="out")

    lines1, labels1 = axes.get_legend_handles_labels()
    axes.legend(lines1, labels1, loc="upper center", bbox_to_anchor=(0.15, 0, 1, 1))
    return axes, ax_inset

Create Times

[18]:
axes,ax_inset=plotting_experiments(benchmark_times=CREATE_TIMES)
axes.set_title("Benchmark for Create Times")
axes.set_ylabel("Create Times (s)")
ax_inset.set_ylabel("Create Time (log)", fontsize=8, labelpad=-2)
plt.tight_layout()
plt.show()
C:\Users\lllang\AppData\Local\Temp\ipykernel_50988\488313717.py:5: UserWarning: This figure includes Axes that are not compatible with tight_layout, so results might be incorrect.
  plt.tight_layout()
../../_images/examples_benchmarks_Small_Benchmark_18_1.png

Read Times

[22]:
axes,ax_inset= plotting_experiments(benchmark_times=READ_TIMES)
axes.set_title("Benchmark for Create Times")
axes.set_ylabel("Read Times (s)")
ax_inset.set_ylabel("Read Time (log)", fontsize=8, labelpad=-2)
plt.tight_layout()
plt.show()
C:\Users\lllang\AppData\Local\Temp\ipykernel_50988\3302302533.py:5: UserWarning: This figure includes Axes that are not compatible with tight_layout, so results might be incorrect.
  plt.tight_layout()
../../_images/examples_benchmarks_Small_Benchmark_20_1.png

Query Times

[23]:
axes,ax_inset= plotting_experiments(benchmark_times=QUERY_TIMES)
axes.set_title("Benchmark for Query Times")
axes.set_ylabel("Query Times (s)")
ax_inset.set_ylabel("Query Time (log)", fontsize=8, labelpad=-2)
plt.tight_layout()
plt.show()
C:\Users\lllang\AppData\Local\Temp\ipykernel_50988\211292255.py:5: UserWarning: This figure includes Axes that are not compatible with tight_layout, so results might be incorrect.
  plt.tight_layout()
../../_images/examples_benchmarks_Small_Benchmark_22_1.png

Discussion

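For reference, the mean times at 1,000,000 rows, taken from the outputs above:

  Engine       Create (s)   Read (s)   Query (s)
  SQLite           7.5        16.6        1.79
  PyArrow         70.1         0.36       0.30
  ParquetDB       32.4         0.36       0.75
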
  1. Create Times

    • SQLite scales best for raw inserts: its lightweight B‑tree writer in C outpaces both columnar solutions at every scale (≤ 1 M rows).

    • ParquetDB comes next: slower than SQLite for inserts, but faster than the raw PyArrow workflow at the largest sizes.

    • PyArrow shows the worst create performance: this can mainly be attributed to a difference in how the id indices are generated.

  2. Read Times

    • ParquetDB & PyArrow are effectively identical: both return zero‑copy Arrow tables in ~0.1 s for 1 M rows, thanks to their native columnar layout and batch I/O.

    • SQLite is roughly 45× slower on full table scans (≈16.6 s vs ≈0.36 s at 1 M rows), since it must fetch every row through a cursor and materialize it as a Python tuple.

  3. Query Times

    • ParquetDB & PyArrow again match each other closely: filtering 1 M rows takes ~0.3–0.4 s, as Arrow applies vectorized predicates.

    • SQLite requires ~1.8 s for the same filter, because every row check is a separate Python‑C transition and Python boolean append.

  4. Developer experience & boilerplate

    • A raw PyArrow workflow requires explicit directory management, manual ID generation, repeated calls to pa.Table.from_pylist(), pq.write_table(), and rebuilding the dataset for each operation—boilerplate that can be tedious to write, maintain, and debug.

    • ParquetDB abstracts away all of that: you simply call db.create(data), db.read(), or db.read(filters=…). It manages file layouts, row‑group boundaries, indexing, and state under the hood, reducing cognitive load and speeding up development (a minimal usage sketch follows this discussion).

  5. Key takeaways

    • ParquetDB’s performance is dominated by the underlying Arrow I/O path: it inherits PyArrow’s read/query speed, with a modest extra cost on small inserts and, at the largest sizes in this benchmark, a lower create time than the raw PyArrow workflow.

    • For write‑heavy workloads up to ~1 M rows, SQLite still leads on raw insert throughput.

    • For analytics‑style workloads (full scans or vectorized filters), the columnar engines (ParquetDB/PyArrow) deliver an order of magnitude better performance by avoiding Python‑level loops.

    • For developer productivity, ParquetDB’s simple API can save hours of boilerplate code and eliminate error‑prone state management.
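
As referenced in the developer-experience item above, here is a minimal sketch of the ParquetDB calls this notebook relies on; the database path and the example record are placeholders, and the filter mirrors the one defined in the Test Data cell.

[ ]:
import pyarrow as pa
from parquetdb import ParquetDB

db = ParquetDB(db_path="BenchmarkDB")                    # directory managed by ParquetDB
db.create([{"col1": 500, "col97": 42}])                  # insert a batch of records
full_table = db.read(filters=None)                       # full scan -> Arrow table
filtered = db.read(filters=[pa.compute.field("col1") > 100])  # filtered read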