PyArrowUtils¶
- Check if a PyArrow type is an extension type.
- Check if a PyArrow type is a fixed shape tensor type.
- Drop duplicate rows from a PyArrow Table based on specified keys.
- Join two PyArrow tables based on specified key columns.
- Sort the fields in a PyArrow schema alphabetically by field name.
- Cast a PyArrow table to match a new schema.
- Unify two PyArrow schemas while preserving and merging their metadata.
- Delete specified columns from a PyArrow table.
- Delete rows from a table based on ID values.
- Delete rows from a table based on values in a specified field.
Detailed Documentation¶
- add_new_null_fields_in_column(column_array, field, new_type)¶
- add_new_null_fields_in_struct(column_array, new_struct_type)¶
- add_new_null_fields_in_table(table, new_schema)¶
- align_table(current_table: Table, new_schema: Schema) Table ¶
Aligns the given table to the new schema, filling in missing fields or struct fields with null values.
- Parameters:
current_table (pyarrow.Table) – The table to align.
new_schema (pyarrow.Schema) – The target schema to align the table to.
- Returns:
pyarrow.Table – The aligned table with:
- Missing fields filled with null values
- Fields ordered according to the new schema
- Struct fields aligned to match the new schema
- convert_list_column_to_fixed_tensor(table, column_name)¶
Converts a variable-sized list column in a PyArrow Table to a fixed-size list (tensor) column.
This function checks if a column in the table is of list type and converts it to a fixed-size list array (i.e., tensor) based on the dimensions of the first non-null element in the list. The fixed-size list array is then updated in the table.
Only float, integer, boolean, and decimal element types are converted, and only when the lists are homogeneous (every list has the same shape).
- Parameters:
table (pyarrow.Table) – The input PyArrow Table containing the column to be converted.
column_name (str) – The name of the column in the table which contains list values to be converted to fixed-size arrays.
- Returns:
pyarrow.Table – The updated table where the specified column has been converted to a fixed-size list array (tensor).
Examples
>>> import pyarrow as pa
>>> data = pa.array([[1, 2], [3, 4], [5, 6]], type=pa.list_(pa.int64()))
>>> table = pa.table([data], names=['col1'])
>>> modified_table = convert_list_column_to_fixed_tensor(table, 'col1')
>>> print(modified_table)
pyarrow.Table
col1: fixed_shape_tensor<list<item: int64>[2]>
----
col1: [[1, 2], [3, 4], [5, 6]]
- create_empty_batch_generator(schema: ~pyarrow.lib.Schema, columns: list = None, special_fields: list = [pyarrow.Field<id: int64>])¶
Creates a generator that yields an empty PyArrow record batch with the specified schema or a subset of its columns.
- Parameters:
schema (pa.Schema) – The schema of the dataset to mimic in the empty generator.
columns (list, optional) – List of column names to include in the empty batch. Defaults to None.
special_fields (list, optional) – A list of fields to use if the schema is empty. Defaults to a field named ‘id’ of type pa.int64().
- Returns:
generator of pa.RecordBatch – A generator yielding an empty record batch with the specified schema.
- create_empty_table(schema: ~pyarrow.lib.Schema, columns: list = None, special_fields: list = [pyarrow.Field<id: int64>]) Table ¶
Creates an empty PyArrow table with the same schema as the dataset or specific columns.
- Parameters:
schema (pa.Schema) – The schema of the dataset to mimic in the empty generator.
columns (list, optional) – List of column names to include in the empty table. Defaults to None.
special_fields (list, optional) – A list of fields to use if the schema is empty. Defaults to a field named ‘id’ of type pa.int64().
- Returns:
pa.Table – An empty PyArrow table with the specified schema.
Examples
>>> schema = pa.schema([pa.field('a', pa.int32()), pa.field('b', pa.string())])
>>> create_empty_table(schema)
pyarrow.Table
- create_nested_arrays_dict_from_flattened_table(table)¶
Reconstructs a nested dictionary of arrays from a flattened PyArrow table.
- Parameters:
table (pa.Table) – The PyArrow table with flattened field names.
- Returns:
dict – A dictionary where keys represent the nested field structure, and values are the corresponding arrays.
Examples
>>> table = pa.table([pa.array([1, 2]), pa.array([3, 4])], names=['a.b', 'a.c'])
>>> create_nested_arrays_dict_from_flattened_table(table)
{'a': {'b': <pyarrow.Array object at 0x...>, 'c': <pyarrow.Array object at 0x...>}}
- create_struct_arrays_from_dict(nested_dict)¶
Creates PyArrow StructArrays and schema from a nested dictionary.
- Parameters:
nested_dict (dict) – The dictionary where keys represent field names and values are either arrays or nested dictionaries.
- Returns:
tuple of (pa.StructArray, pa.StructType) – A tuple containing the created StructArray and its corresponding StructType schema.
Examples
>>> nested_dict = {'a': pa.array([1, 2]), 'b': {'c': pa.array([3, 4])}}
>>> create_struct_arrays_from_dict(nested_dict)
(<pyarrow.StructArray object at 0x...>, StructType(a: int64, b: StructType(c: int64)))
- delete_columns(table, columns)¶
Delete specified columns from a PyArrow table.
- Parameters:
table (pyarrow.Table) – The input table to delete columns from.
columns (list of str) – List of column names to delete.
- Returns:
pyarrow.Table – A new table with the specified columns removed.
Examples
>>> import pyarrow as pa
>>> table = pa.table({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
>>> delete_columns(table, ['b', 'c'])
pyarrow.Table
a: int64
- delete_field_values(table, values, field_name)¶
Delete rows from a table based on values in a specified field.
- Parameters:
table (pyarrow.Table) – The input table to delete rows from.
values (array-like) – List of values to delete.
field_name (str) – Name of the field to match values against.
- Returns:
pyarrow.Table – A new table with rows matching the specified values in the given field removed.
Examples
>>> import pyarrow as pa
>>> table = pa.table({'id': [1, 2, 3], 'category': ['A', 'B', 'A']})
>>> delete_field_values(table, ['A'], 'category')
pyarrow.Table
id: int64
category: string
----
id: [[2]]
category: [["B"]]
- delete_ids(table, ids)¶
Delete rows from a table based on ID values.
- Parameters:
table (pyarrow.Table) – The input table to delete rows from.
ids (array-like) – List of ID values to delete.
- Returns:
pyarrow.Table – A new table with rows matching the specified IDs removed.
Examples
>>> import pyarrow as pa
>>> table = pa.table({'id': [1, 2, 3, 4], 'value': ['a', 'b', 'c', 'd']})
>>> delete_ids(table, [2, 3])
pyarrow.Table
id: int64
value: string
----
id: [[1,4]]
value: [["a","d"]]
- drop_duplicates(table, keys)¶
Drop duplicate rows from a PyArrow Table based on specified keys.
- Parameters:
table (pyarrow.Table) – The input table from which duplicates will be removed.
keys (list of str) – A list of column names that determine the uniqueness of rows.
- Returns:
pyarrow.Table – A new table with duplicates removed, keeping the first occurrence of each unique key combination.
Notes
This function keeps the first occurrence of each unique key combination.
- fill_null_nested_structs(array)¶
Fills null values within a nested PyArrow StructArray, recursively processing any nested structs.
- Parameters:
array (pa.Array) – The PyArrow StructArray that may contain nested structs with null values.
- Returns:
pa.StructArray – A new StructArray with nulls handled recursively within nested structs.
Examples
>>> array = pa.array([{'a': 1, 'b': None}, {'a': None, 'b': {'c': 2}}],
...                  type=pa.struct([('a', pa.int32()), ('b', pa.struct([('c', pa.int32())]))]))
>>> fill_null_nested_structs(array)
<pyarrow.StructArray object at 0x...>
- fill_null_nested_structs_in_table(table)¶
Recursively fills null values within nested struct columns of a PyArrow table.
- Parameters:
table (pa.Table) – The PyArrow table to process for nested structs and null values.
- Returns:
pa.Table – A new table where nulls within nested struct columns have been handled.
Examples
>>> table = pa.Table.from_pylist([{'a': 1, 'b': None}, {'a': None, 'b': {'c': 2}}],
...     schema=pa.schema([('a', pa.int32()), ('b', pa.struct([('c', pa.int32())]))]))
>>> fill_null_nested_structs_in_table(table)
pyarrow.Table
- find_difference_between_pyarrow_schemas(schema1, schema2)¶
Finds the difference between two PyArrow schemas.
- Parameters:
schema1 (pyarrow.Schema) – The first schema to compare.
schema2 (pyarrow.Schema) – The second schema to compare.
- Returns:
set – A set of field names that are present in schema1 but not in schema2.
Examples
>>> schema1 = pa.schema([("a", pa.int32()), ("b", pa.string())])
>>> schema2 = pa.schema([("b", pa.string())])
>>> find_difference_between_pyarrow_schemas(schema1, schema2)
{'a'}
- flatten_nested_structs(array, parent_name)¶
Flattens nested structs within a PyArrow array, creating fully qualified field names.
- Parameters:
array (pa.Array) – The PyArrow StructArray containing nested fields to flatten.
parent_name (str) – The name of the parent field, used to generate fully qualified field names.
- Returns:
list of tuple – A list of tuples, where each tuple contains a flattened array and its corresponding field.
Examples
>>> array = pa.array([{'a': {'b': 1}}, {'a': {'b': 2}}],
...                  type=pa.struct([('a', pa.struct([('b', pa.int32())]))]))
>>> flatten_nested_structs(array, 'a')
[(array([1, 2], type=int32), Field<name: a.b, type: int32>)]
- flatten_table(table)¶
Flattens nested struct columns within a PyArrow table.
- Parameters:
table (pa.Table) – The PyArrow table containing nested struct columns to flatten.
- Returns:
pa.Table – A new table with flattened struct fields.
Examples
>>> table = pa.Table.from_pylist([{'a': {'b': 1}}, {'a': {'b': 2}}],
...     schema=pa.schema([('a', pa.struct([('b', pa.int32())]))]))
>>> flatten_table(table)
pyarrow.Table
- flatten_table_in_column(table, column_name)¶
Flattens a nested struct column in a PyArrow Table.
This function takes a column from a PyArrow Table that is of struct type, and flattens its fields into separate columns. The original struct column is replaced by these individual columns in the resulting table. The column names are sorted alphabetically after flattening.
This function does not work inside table_column_callbacks, as it removes existing fields of the nested structs.
- Parameters:
table (pyarrow.Table) – The input PyArrow Table containing the column to be flattened.
column_name (str) – The name of the struct column in the table to be flattened.
- Returns:
pyarrow.Table – The updated table where the nested struct column has been flattened into individual columns.
Examples
>>> import pyarrow as pa
>>> data = pa.array([{'a': 1, 'b': 2}, {'a': 3, 'b': 4}],
...                 type=pa.struct([('a', pa.int64()), ('b', pa.int64())]))
>>> table = pa.table([data], names=['col1'])
>>> modified_table = flatten_table_in_column(table, 'col1')
>>> print(modified_table)
pyarrow.Table
a: int64
b: int64
----
a: [1, 3]
b: [2, 4]
- infer_pyarrow_types(data_dict: dict)¶
Infers PyArrow types for the given dictionary of data. The function skips the ‘id’ field and infers the data types for all other keys.
- Parameters:
data_dict (dict) – A dictionary where keys represent field names and values represent data values.
- Returns:
dict – A dictionary where keys are field names and values are the inferred PyArrow data types.
Examples
>>> data_dict = {'a': 123, 'b': 'string_value', 'id': 1}
>>> infer_pyarrow_types(data_dict)
{'a': DataType(int64), 'b': DataType(string)}
- is_empty_struct_in_column(table, column_name)¶
- is_extension_type(type)¶
Check if a PyArrow type is an extension type.
- Parameters:
type (pyarrow.DataType) – The PyArrow type to check.
- Returns:
bool – True if the type is an extension type, False otherwise.
- is_fixed_shape_tensor(type)¶
Check if a PyArrow type is a fixed shape tensor type.
- Parameters:
type (pyarrow.DataType) – The PyArrow type to check.
- Returns:
bool – True if the type is a fixed shape tensor, False otherwise.
- join_tables(left_table, right_table, left_keys: List[str], right_keys: List[str] = None, join_type: str = 'left outer', left_suffix: str = None, right_suffix: str = None, coalesce_keys: bool = True)¶
Join two PyArrow tables based on specified key columns.
- Parameters:
left_table (pyarrow.Table) – The left table for the join operation.
right_table (pyarrow.Table) – The right table for the join operation.
left_keys (str or list of str) – Column name(s) from the left table to join on.
right_keys (str or list of str, optional) – Column name(s) from the right table to join on. If None, uses left_keys.
join_type (str, optional) – Type of join to perform. Default is “left outer”. Supported types: “left outer”, “right outer”, “inner”, “full outer”.
left_suffix (str, optional) – Suffix to append to overlapping column names from the left table.
right_suffix (str, optional) – Suffix to append to overlapping column names from the right table.
coalesce_keys (bool, optional) – Whether to combine join keys that appear in both tables. Default is True.
- Returns:
pyarrow.Table – A new table containing the joined data with:
- All columns from both tables (join keys coalesced if specified)
- Column name conflicts resolved using the provided suffixes
- Combined metadata from both input tables
Notes
The function:
- Preserves the original data types and metadata
- Handles overlapping column names by adding suffixes
- Removes temporary index columns used for joining
- Combines metadata from both input tables
Examples
>>> left = pa.table({'id': [1, 2, 3], 'value': ['a', 'b', 'c']})
>>> right = pa.table({'id': [2, 3, 4], 'score': [10, 20, 30]})
>>> joined = join_tables(left, right, 'id')
>>> joined.column_names
['id', 'value', 'score']
- merge_schemas(current_schema: Schema, incoming_schema: Schema) Schema ¶
Merges two PyArrow schemas, combining fields and recursively merging struct fields.
- Parameters:
current_schema (pyarrow.Schema) – The existing schema to merge.
incoming_schema (pyarrow.Schema) – The new schema to merge with the existing schema.
- Returns:
pa.Schema – A new PyArrow schema that represents the merged result of the two input schemas.
Examples
>>> current_schema = pa.schema([("a", pa.int32()), ("b", pa.string())])
>>> incoming_schema = pa.schema([("b", pa.string()), ("c", pa.float64())])
>>> merge_schemas(current_schema, incoming_schema)
Schema(a: int32, b: string, c: float64)
- merge_structs(current_type: StructType, incoming_type: StructType) StructType ¶
Recursively merges two PyArrow StructTypes.
- Parameters:
current_type (pa.StructType) – The existing struct type.
incoming_type (pa.StructType) – The new struct type to merge with the existing struct.
- Returns:
pa.StructType – A new PyArrow StructType representing the merged result of the two input structs.
Examples
>>> current = pa.struct([("a", pa.int32()), ("b", pa.string())])
>>> incoming = pa.struct([("b", pa.string()), ("c", pa.float64())])
>>> merge_structs(current, incoming)
StructType(a: int32, b: string, c: float64)
- order_fields_in_struct(column_array, new_struct_type)¶
Orders the fields in a struct array to match a new struct type.
- Parameters:
column_array (pa.Array) – The original struct array.
new_struct_type (pa.StructType) – The new struct type with the desired field order.
- Returns:
pa.Array – A new struct array with fields ordered according to new_struct_type.
Examples
>>> column_array = pa.array([{'b': 2, 'a': 1}],
...     type=pa.struct([pa.field('b', pa.int32()), pa.field('a', pa.int32())]))
>>> new_struct_type = pa.struct([pa.field('a', pa.int32()), pa.field('b', pa.int32())])
>>> order_fields_in_struct(column_array, new_struct_type)
<pyarrow.StructArray object at 0x...>
- order_fields_in_table(table, new_schema)¶
Orders the fields in a table’s struct columns to match a new schema.
- Parameters:
table (pa.Table) – The original table.
new_schema (pa.Schema) – The new schema with the desired field order.
- Returns:
pa.Table – A new table with fields ordered according to new_schema.
Examples
>>> table = pa.Table.from_pylist([{'b': 2, 'a': 1}],
...     schema=pa.schema([pa.field('b', pa.int32()), pa.field('a', pa.int32())]))
>>> new_schema = pa.schema([pa.field('a', pa.int32()), pa.field('b', pa.int32())])
>>> order_fields_in_table(table, new_schema)
pyarrow.Table
- rebuild_nested_table(table, load_format='table')¶
- replace_empty_structs(column_array: ~pyarrow.lib.Array, dummy_field=pyarrow.Field<dummy_field: int16>)¶
Replaces empty PyArrow struct arrays with a struct containing a dummy field.
- Parameters:
column_array (pa.Array) – The column array to inspect for empty structs.
dummy_field (pa.Field, optional) – The dummy field to insert into empty structs. Defaults to a field named ‘dummy_field’ with type pa.int16().
- Returns:
pa.Array – The input array with empty structs replaced by structs containing the dummy field.
Examples
>>> column_array = pa.array([{'a': 1}, {}, {'a': 2}], type=pa.struct([pa.field('a', pa.int32())]))
>>> replace_empty_structs(column_array)
<pyarrow.StructArray object at 0x...>
- replace_empty_structs_in_column(table, column_name, dummy_field=pyarrow.Field<dummy_field: int16>, is_nested=False)¶
Replaces empty struct values in a specified column of a PyArrow Table with a dummy field.
This function checks if the given column in the table is of a struct type. If it is, it replaces any empty structs in the column with a dummy struct that includes the dummy_field. The modified column is then updated in the table and returned.
- Parameters:
table (pyarrow.Table) – The input PyArrow Table containing the column to be modified.
column_name (str) – The name of the column in the table where empty structs should be replaced.
dummy_field (pyarrow.Field, optional) – A dummy field to insert into empty structs, by default pa.field(‘dummy_field’, pa.int16()).
- Returns:
pyarrow.Table – The updated table where empty structs in the specified column have been replaced with dummy structs.
Examples
>>> import pyarrow as pa
>>> data = pa.array([{'a': 1}, None, {}], type=pa.struct([('a', pa.int64())]))
>>> table = pa.table([data], names=['col1'])
>>> modified_table = replace_empty_structs_in_column(table, 'col1')
>>> print(modified_table)
pyarrow.Table
col1: struct<a: int64, dummy_field: int16>
----
col1: [{a: 1}, {dummy_field: 0}, {dummy_field: 0}]
- replace_empty_structs_in_table(table, dummy_field=pyarrow.Field<dummy_field: int16>)¶
Replaces empty struct fields in a PyArrow table with a struct containing a dummy field.
- Parameters:
table (pa.Table) – The table in which to replace empty structs.
dummy_field (pa.Field, optional) – The dummy field to insert into empty structs. Defaults to a field named ‘dummy_field’ with type pa.int16().
- Returns:
pa.Table – The table with empty struct fields replaced by structs containing the dummy field.
Examples
>>> table = pa.Table.from_pylist([{'a': {'a': 1}}, {'a': {}}, {'a': {'a': 2}}],
...     schema=pa.schema([pa.field('a', pa.struct([pa.field('a', pa.int32())]))]))
>>> replace_empty_structs_in_table(table)
pyarrow.Table
- schema_equal(schema1, schema2)¶
Compare two PyArrow schemas for equality, including metadata.
- Parameters:
schema1 (pyarrow.Schema) – The first schema to compare.
schema2 (pyarrow.Schema) – The second schema to compare.
- Returns:
bool – True if the schemas and their metadata are equal, False otherwise.
Notes
This function checks both the schema structure (fields and types) as well as the metadata dictionaries for equality. Two schemas are considered equal only if both their structure and metadata match exactly.
- sort_schema(schema)¶
Sort the fields in a PyArrow schema alphabetically by field name.
- Parameters:
schema (pyarrow.Schema) – The input schema to sort.
- Returns:
pyarrow.Schema – A new schema with fields sorted alphabetically, preserving the original metadata.
Notes
The function maintains the original schema metadata while reordering the fields. This is useful for ensuring consistent field ordering across different schema instances.
- table_column_callbacks(table, callbacks=[])¶
Applies a list of callback functions to each column in a PyArrow Table.
This function iterates over all columns in the provided table and applies each callback function from the callbacks list to each column. The callbacks are expected to modify the table in some way (e.g., transforming or updating columns), and the updated table is returned after all callbacks are applied.
- Parameters:
table (pyarrow.Table) – The input PyArrow Table to which the callback functions will be applied.
callbacks (list of callable, optional) – A list of functions to be applied to each column in the table. Each callback should take two arguments: the table and the name of the column.
- Returns:
pyarrow.Table – The updated table after applying all callback functions to each column.
Examples
>>> import pyarrow as pa
>>> def uppercase_column_names(table, column_name):
...     new_name = column_name.upper()
...     return table.rename_columns(
...         [new_name if name == column_name else name for name in table.column_names]
...     )
>>> data = pa.array([1, 2, 3])
>>> table = pa.table([data], names=['col1'])
>>> modified_table = table_column_callbacks(table, callbacks=[uppercase_column_names])
>>> print(modified_table)
pyarrow.Table
COL1: int64
----
COL1: [1, 2, 3]
- table_schema_cast(current_table, new_schema)¶
Cast a PyArrow table to match a new schema.
This function casts an existing PyArrow table to match a new schema, handling:
- Fields present in both schemas (cast to the new type)
- New fields (added as null columns)
- Special handling for fixed shape tensors
- Reordering of columns alphabetically
- Parameters:
current_table (pyarrow.Table) – The table to be cast to the new schema
new_schema (pyarrow.Schema) – The target schema to cast the table to
- Returns:
pyarrow.Table – A new table with the schema matching new_schema
- Raises:
Exception – If casting a list column fails, typically due to nested null fields
- unify_schemas(schema_list, promote_options='permissive')¶
Unify two PyArrow schemas while preserving and merging their metadata.
- Parameters:
schema_list (list of pyarrow.Schema) – List containing exactly two schemas to unify. The first schema is considered the current schema and the second is the new schema to merge with.
promote_options (str, optional) – Options for type promotion when unifying schemas. Default is “permissive”. See PyArrow documentation for valid options.
- Returns:
pyarrow.Schema – A new unified schema that:
- Contains all fields from both input schemas
- Has merged metadata from both schemas if they differ
- Has field types promoted according to promote_options
Notes
The function:
- Extracts and merges metadata from both schemas if they differ
- Uses PyArrow’s unify_schemas to combine the field definitions
- Reattaches the merged metadata to the final schema
Examples
>>> schema1 = pa.schema([('id', pa.int32()), ('name', pa.string())])
>>> schema2 = pa.schema([('id', pa.int64()), ('age', pa.int32())])
>>> unified = unify_schemas([schema1, schema2])
>>> unified.names
['id', 'name', 'age']
- update_fixed_shape_tensor(current_array, update_array)¶
Update a fixed shape tensor array with values from another array, preserving nulls.
This function updates values in current_array with non-null values from update_array. Where update_array has null values, the original values from current_array are preserved.
- Parameters:
current_array (pyarrow.Array) – The original fixed shape tensor array to be updated
update_array (pyarrow.Array) – The array containing update values. Must be the same length as current_array.
- Returns:
pyarrow.Array – A new array with values from current_array updated with non-null values from update_array
Notes
The function preserves the null/non-null status of elements:
- Where update_array has non-null values, those values replace current_array values
- Where update_array has null values, current_array values are preserved
- The returned array maintains the same length as the input arrays
Examples
>>> import pyarrow as pa
>>> current = pa.array([[1, 2], [3, 4], [5, 6]])
>>> update = pa.array([[7, 8], None, [9, 10]])
>>> result = update_fixed_shape_tensor(current, update)
>>> result.to_pylist()
[[7, 8], [3, 4], [9, 10]]
- update_flattend_table(current_table, incoming_table, update_keys: List[str] | str = ['id'])¶
Updates the current table using the values from the incoming table by flattening both tables, applying the updates, and then rebuilding the nested structure.
- Parameters:
current_table (pa.Table) – The current PyArrow table to update.
incoming_table (pa.Table) – The incoming PyArrow table containing updated values.
update_keys (List[str] or str, optional) – The key field name(s) used to match rows between the two tables. Defaults to ['id'].
- Returns:
pa.Table – The updated PyArrow table with flattened and rebuilt structure.
- update_schema(current_schema, schema=None, field_dict=None, update_metadata=True)¶
Update the schema of a given table based on a provided schema or field modifications.
This function allows updating the schema of a PyArrow table by either replacing the entire schema or modifying individual fields within the existing schema. It can take a dictionary of field names and their corresponding new field definitions to update specific fields in the schema. Alternatively, a completely new schema can be provided to replace the current one.
- Parameters:
current_schema (pa.Schema) – The current schema of the table.
schema (pa.Schema, optional) – A new schema to replace the existing schema of the table. If provided, this will completely override the current schema.
field_dict (dict, optional) – A dictionary where the keys are existing field names and the values are the new PyArrow field definitions to replace the old ones. This is used for selectively updating specific fields within the current schema.
update_metadata (bool, optional) – Whether to update the metadata of the table.
- Returns:
pa.Table – A new PyArrow table with the updated schema.