parquetdb.core.parquetdb.NormalizeConfig

class NormalizeConfig(load_format: str = 'table', batch_size: int = 131072, batch_readahead: int = 16, fragment_readahead: int = 4, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool = True, memory_pool: MemoryPool | None = None, filesystem: FileSystem | None = None, file_options: FileWriteOptions | None = None, max_partitions: int = 1024, max_open_files: int = 1024, max_rows_per_file: int = 100000, min_rows_per_group: int = 0, max_rows_per_group: int = 100000, file_visitor: Callable | None = None, existing_data_behavior: str = 'overwrite_or_ignore', create_dir: bool = True)

Configuration for the normalization process; these settings tune performance by controlling how rows are distributed across files and row groups. A usage sketch follows the variable list below.

Variables:
  • load_format (str) – The format of the output dataset. Supported formats are ‘table’ and ‘batches’. Default: ‘table’

  • batch_size (int) – The number of rows to process in each batch. Default: 131,072

  • batch_readahead (int) – The number of batches to read ahead in a file. Default: 16

  • fragment_readahead (int) – The number of files to read ahead, improving IO utilization at the cost of RAM usage. Default: 4

  • fragment_scan_options (Optional[pa.dataset.FragmentScanOptions]) – Options specific to a particular scan and fragment type, potentially changing across scans. Default: None

  • use_threads (bool) – Whether to use maximum parallelism determined by available CPU cores. Default: True

  • memory_pool (Optional[pa.MemoryPool]) – The memory pool for allocations. Uses the system’s default memory pool if not specified. Default: None

  • filesystem (Optional[pyarrow.fs.FileSystem]) – Filesystem for writing the dataset. Default: None

  • file_options (Optional[pyarrow.dataset.FileWriteOptions]) – Options for writing the dataset files. Default: None

  • max_partitions (int) – Maximum number of partitions for dataset writing. Default: 1024

  • max_open_files (int) – Maximum open files for dataset writing. Default: 1024

  • max_rows_per_file (int) – Maximum rows per file. Default: 100,000

  • min_rows_per_group (int) – Minimum rows per row group within each file. Default: 0

  • max_rows_per_group (int) – Maximum rows per row group within each file. Default: 100,000

  • file_visitor (Optional[Callable]) – Optional callback invoked for each file written during the dataset write. Default: None

  • existing_data_behavior (str) – How existing data in the dataset directory is handled. Options: ‘overwrite_or_ignore’. Default: ‘overwrite_or_ignore’

  • create_dir (bool) – Whether to create the dataset directory if it does not exist. Default: True
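
A minimal construction sketch: the fields above map directly to keyword arguments, so only the values being tuned need to be passed. The ParquetDB import and the normalize(normalize_config=...) call shown in the comments are assumptions about the surrounding parquetdb API rather than something this page defines.

  from parquetdb.core.parquetdb import NormalizeConfig

  # Tune row distribution: stream batches and cap files and row groups at
  # 50,000 rows (the defaults are 100,000 rows per file and per row group).
  config = NormalizeConfig(
      load_format="batches",   # process record batches instead of one table
      batch_size=65_536,
      max_rows_per_file=50_000,
      max_rows_per_group=50_000,
  )

  # Hypothetical call site (assumed API, shown for orientation only):
  # from parquetdb import ParquetDB
  # db = ParquetDB("path/to/my_dataset")
  # db.normalize(normalize_config=config)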

__init__(load_format: str = 'table', batch_size: int = 131072, batch_readahead: int = 16, fragment_readahead: int = 4, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool = True, memory_pool: MemoryPool | None = None, filesystem: FileSystem | None = None, file_options: FileWriteOptions | None = None, max_partitions: int = 1024, max_open_files: int = 1024, max_rows_per_file: int = 100000, min_rows_per_group: int = 0, max_rows_per_group: int = 100000, file_visitor: Callable | None = None, existing_data_behavior: str = 'overwrite_or_ignore', create_dir: bool = True) None

Methods

__init__([load_format, batch_size, ...])

Attributes

batch_readahead

batch_size

create_dir

existing_data_behavior

file_options

file_visitor

filesystem

fragment_readahead

fragment_scan_options

load_format

max_open_files

max_partitions

max_rows_per_file

max_rows_per_group

memory_pool

min_rows_per_group

use_threads
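
The PyArrow-typed fields accept ordinary PyArrow objects. A hedged sketch, assuming these values are passed through unchanged to the underlying pyarrow.dataset scan and write calls:

  import pyarrow as pa
  import pyarrow.dataset as ds
  import pyarrow.fs as pafs

  from parquetdb.core.parquetdb import NormalizeConfig

  config = NormalizeConfig(
      # Pre-buffer column chunks when scanning Parquet fragments.
      fragment_scan_options=ds.ParquetFragmentScanOptions(pre_buffer=True),
      # Explicit local filesystem; an fs.S3FileSystem could be substituted.
      filesystem=pafs.LocalFileSystem(),
      # Equivalent to the default behavior when memory_pool is None.
      memory_pool=pa.default_memory_pool(),
      use_threads=True,
  )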