Other Helper Methods

Table Joins

ParquetDB provides a custom join_tables function that extends the built-in PyArrow `join <https://arrow.apache.org/docs/python/generated/pyarrow.Table.html#pyarrow.Table.join>`__ method, allowing you to handle custom extension types and more complex data types in your joins.

This notebook demonstrates how to perform the supported join operations (left semi, right semi, left anti, right anti, inner, left outer, right outer, and full outer) on two PyArrow tables.

While PyArrow’s built-in join is powerful, certain use cases may involve:

  • Custom extension types that PyArrow doesn’t support out-of-the-box.

  • Complex or nested types that require additional logic during joins (e.g., arrays of structs, custom objects, etc.).

Below is the signature and docstring of the custom join_tables function. It closely mimics the logic of PyArrow’s built-in Table.join but adds:

  1. Index columns (left_index and right_index) to preserve the original row ordering.

  2. Logic to coalesce keys (if coalesce_keys=True).

  3. Automatic handling of suffixes for overlapping columns (left_suffix and right_suffix).

  4. The ability to seamlessly merge custom extension types and complex data that might otherwise be incompatible with the standard PyArrow join.

def join_tables(
    left_table: pa.Table,
    right_table: pa.Table,
    left_keys,
    right_keys=None,
    join_type="left outer",
    left_suffix=None,
    right_suffix=None,
    coalesce_keys=True,
):
    """
    Custom join operation for PyArrow Tables, accommodating complex or extension types
    and additional logic for suffixes and metadata merging.

    Parameters
    ----------
    left_table : pa.Table
        The left-side table to join.
    right_table : pa.Table
        The right-side table to join.
    left_keys : list or str
        Column name(s) in the left table for the join.
    right_keys : list or str, optional
        Column name(s) in the right table for the join.
    join_type : str, optional
        Type of join to perform. E.g., 'left outer', 'right outer', 'inner', 'full outer',
        'left semi', 'right semi', 'left anti', 'right anti'. Defaults to 'left outer'.
    left_suffix : str, optional
        Suffix for overlapping column names from the left table.
    right_suffix : str, optional
        Suffix for overlapping column names from the right table.
    coalesce_keys : bool, optional
        Whether to coalesce join keys if columns have null values. Defaults to True.
    """
[1]:
import pyarrow as pa
from parquetdb.utils import pyarrow_utils

# Construct two sample tables using ParquetDB-like logic
left_data = [
    {"id_1": 100, "id_2": 10, "field_1": "left_1"},
    {"id_1": 33, "id_2": 12},
    {"id_1": 12, "id_2": 13, "field_2": "left_2"},
]

right_data = [
    {"id_1": 100, "id_2": 10, "field_2": "right_1"},
    {"id_1": 5, "id_2": 5},
    {"id_1": 33, "id_2": 13, "extra_field": "right_extra"},
    {"id_1": 33, "id_2": 12, "field_2": "right_2"},
]

# Convert to PyArrow tables
left_table = pa.Table.from_pylist(left_data)
right_table = pa.Table.from_pylist(right_data)

df_left = left_table.to_pandas()
df_right = right_table.to_pandas()

print(df_left)
print(df_right)

# Perform a left outer join using the built-in PyArrow join
# (right_table plays the "left" role here, so the suffixes are swapped)
pyarrow_join_result = right_table.join(
    left_table,
    keys=["id_1", "id_2"],
    right_keys=["id_1", "id_2"],
    join_type="left outer",
    left_suffix="_right",
    right_suffix="_left",
)

# Perform the same join with our custom join_tables
custom_join_result = pyarrow_utils.join_tables(
    right_table,
    left_table,
    left_keys=["id_1", "id_2"],
    right_keys=["id_1", "id_2"],
    join_type="left outer",
    left_suffix="_right",
    right_suffix="_left",
    coalesce_keys=True,
)


df_custom_join = custom_join_result.to_pandas()

print(df_custom_join)
   id_1  id_2 field_1
0   100    10  left_1
1    33    12    None
2    12    13    None
   id_1  id_2  field_2
0   100    10  right_1
1     5     5     None
2    33    13     None
3    33    12  right_2
   field_2  id_1  id_2 field_1
0  right_1   100    10  left_1
1  right_2    33    12    None
2     None     5     5    None
3     None    33    13    None

Drop Duplicates

ParquetDB also provides a drop_duplicates function that removes duplicate rows from a PyArrow Table based on specified key columns, keeping the first occurrence.

def drop_duplicates(table, keys):
    """
    Drops duplicate rows from a PyArrow Table based on the specified keys,
    keeping the first occurrence.

    Parameters
    ----------
    table : pyarrow.Table
        The input table from which duplicates will be removed.
    keys : list of str
        A list of column names that determine the uniqueness of rows.

    Returns
    -------
    pyarrow.Table
        A new table with duplicates removed, keeping the first occurrence
        of each unique key combination.
    """
[2]:
data = [
    {"id": 0, "name": "Alice", "category": 1},
    {"id": 1, "name": "Bob", "category": 1},
    {
        "id": 2,
        "name": "Bob",
        "category": 1,
    },  # Duplicate combination of (name, category)
    {"id": 3, "name": "Charlie", "category": 2},
    {
        "id": 4,
        "name": "Alice",
        "category": 1,
    },  # Another duplicate combination of (name, category)
]

# Convert to a PyArrow table
table = pa.Table.from_pylist(data)

# Specify the key columns that define uniqueness
# (other columns, such as "id", are carried along with the kept rows)
unique_keys = ["name", "category"]

# Drop duplicates
deduplicated_table = pyarrow_utils.drop_duplicates(table, unique_keys)

# Show results
print("Original Table:")
print(table.to_pandas())

print("\nDeduplicated Table (keeping first occurrence):")
print(deduplicated_table.to_pandas())
Original Table:
   id     name  category
0   0    Alice         1
1   1      Bob         1
2   2      Bob         1
3   3  Charlie         2
4   4    Alice         1

Deduplicated Table (keeping first occurrence):
   id     name  category
0   0    Alice         1
1   1      Bob         1
2   3  Charlie         2