Other Helper Methods¶
Table Joins¶
ParquetDB provides a custom ``join_tables`` function that extends beyond the built-in PyArrow `join <https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.join>`__ method, allowing you to handle custom extension types and more complex data types in your joins.
This notebook demonstrates how to perform various join operations on two PyArrow tables: left semi, right semi, left anti, right anti, inner, left outer, right outer, and full outer.
While PyArrow's built-in ``join`` is powerful, certain use cases may involve:

- Custom extension types that PyArrow doesn't support out of the box.
- Complex or nested types that require additional logic during joins (e.g., arrays of structs, custom objects).
Below is the signature and docstring of the custom ``join_tables`` function. It closely mimics the logic of PyArrow's built-in ``Table.join`` but adds:

- Index columns (``left_index`` and ``right_index``) to preserve the original row ordering.
- Logic to coalesce keys (if ``coalesce_keys=True``).
- Automatic handling of suffixes for overlapping columns (``left_suffix`` and ``right_suffix``).
- The ability to seamlessly merge custom extension types and complex data that might otherwise be incompatible with the standard PyArrow join.
def join_tables(
    left_table: pa.Table,
    right_table: pa.Table,
    left_keys,
    right_keys=None,
    join_type="left outer",
    left_suffix=None,
    right_suffix=None,
    coalesce_keys=True,
):
    """
    Custom join operation for PyArrow Tables, accommodating complex or extension types
    and additional logic for suffixes and metadata merging.

    Parameters
    ----------
    left_table : pa.Table
        The left-side table to join.
    right_table : pa.Table
        The right-side table to join.
    left_keys : list or str
        Column name(s) in the left table for the join.
    right_keys : list or str, optional
        Column name(s) in the right table for the join.
    join_type : str, optional
        Type of join to perform. E.g., 'left outer', 'right outer', 'inner', 'full outer',
        'left semi', 'right semi', 'left anti', 'right anti'. Defaults to 'left outer'.
    left_suffix : str, optional
        Suffix for overlapping column names from the left table.
    right_suffix : str, optional
        Suffix for overlapping column names from the right table.
    coalesce_keys : bool, optional
        Whether to coalesce join keys if columns have null values. Defaults to True.

    Returns
    -------
    pa.Table
        The joined table.
    """
[1]:
import pyarrow as pa
from parquetdb.utils import pyarrow_utils
# Construct two sample tables using ParquetDB-like logic
left_data = [
{"id_1": 100, "id_2": 10, "field_1": "left_1"},
{"id_1": 33, "id_2": 12},
{"id_1": 12, "id_2": 13, "field_2": "left_2"},
]
right_data = [
{"id_1": 100, "id_2": 10, "field_2": "right_1"},
{"id_1": 5, "id_2": 5},
{"id_1": 33, "id_2": 13, "extra_field": "right_extra"},
{"id_1": 33, "id_2": 12, "field_2": "right_2"},
]
# Convert to PyArrow tables
left_table = pa.Table.from_pylist(left_data)
right_table = pa.Table.from_pylist(right_data)
df_left = left_table.to_pandas()
df_right = right_table.to_pandas()
print(df_left)
print(df_right)
# Perform a left outer join using built-in PyArrow
pyarrow_join_result = right_table.join(
left_table,
keys=["id_1", "id_2"],
right_keys=["id_1", "id_2"],
join_type="left outer",
left_suffix="_right",
right_suffix="_left", # reversed to illustrate differences
)
# Perform the same join with our custom join_tables
custom_join_result = pyarrow_utils.join_tables(
right_table,
left_table,
left_keys=["id_1", "id_2"],
right_keys=["id_1", "id_2"],
join_type="left outer",
left_suffix="_right",
right_suffix="_left",
coalesce_keys=True,
)
df_custom_join = custom_join_result.to_pandas()
print(df_custom_join)
id_1 id_2 field_1
0 100 10 left_1
1 33 12 None
2 12 13 None
id_1 id_2 field_2
0 100 10 right_1
1 5 5 None
2 33 13 None
3 33 12 right_2
field_2 id_1 id_2 field_1
0 right_1 100 10 left_1
1 right_2 33 12 None
2 None 5 5 None
3 None 33 13 None
Drop Duplicates¶
ParquetDB also provides a ``drop_duplicates`` function that allows you to drop duplicate rows from a PyArrow Table based on specified keys, keeping the first occurrence.
def drop_duplicates(table, keys):
    """
    Drops duplicate rows from a PyArrow Table based on the specified keys,
    keeping the first occurrence.

    Parameters
    ----------
    table : pyarrow.Table
        The input table from which duplicates will be removed.
    keys : list of str
        A list of column names that determine the uniqueness of rows.

    Returns
    -------
    pyarrow.Table
        A new table with duplicates removed, keeping the first occurrence
        of each unique key combination.
    """
[2]:
data = [
{"id": 0, "name": "Alice", "category": 1},
{"id": 1, "name": "Bob", "category": 1},
{
"id": 2,
"name": "Bob",
"category": 1,
}, # Duplicate combination of (name, category)
{"id": 3, "name": "Charlie", "category": 2},
{
"id": 4,
"name": "Alice",
"category": 1,
}, # Another duplicate combination of (name, category)
]
# Convert to a PyArrow table
table = pa.Table.from_pylist(data)
# Specify the key columns that define uniqueness (note that "id" is excluded)
unique_keys = ["name", "category"]
# Drop duplicates
deduplicated_table = pyarrow_utils.drop_duplicates(table, unique_keys)
# Show results
print("Original Table:")
print(table.to_pandas())
print("\nDeduplicated Table (keeping first occurrence):")
print(deduplicated_table.to_pandas())
Original Table:
id name category
0 0 Alice 1
1 1 Bob 1
2 2 Bob 1
3 3 Charlie 2
4 4 Alice 1
Deduplicated Table (keeping first occurrence):
id name category
0 0 Alice 1
1 1 Bob 1
2 3 Charlie 2