parquetdb.core.parquetdb.NormalizeConfig
- class NormalizeConfig(load_format: str = 'table', batch_size: int = 131072, batch_readahead: int = 16, fragment_readahead: int = 4, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool = True, memory_pool: MemoryPool | None = None, filesystem: FileSystem | None = None, file_options: FileWriteOptions | None = None, max_partitions: int = 1024, max_open_files: int = 1024, max_rows_per_file: int = 100000, min_rows_per_group: int = 0, max_rows_per_group: int = 100000, file_visitor: Callable | None = None, existing_data_behavior: str = 'overwrite_or_ignore', create_dir: bool = True)
Configuration for the normalization process, optimizing performance by managing row distribution and file structure.
- Variables:
load_format (str) – The in-memory format used when loading data during normalization. Supported formats are ‘table’ and ‘batches’. Default: ‘table’
batch_size (int) – The number of rows to process in each batch. Default: 131,072
batch_readahead (int) – The number of batches to read ahead in a file. Default: 16
fragment_readahead (int) – The number of files to read ahead, improving IO utilization at the cost of RAM usage. Default: 4
fragment_scan_options (Optional[pa.dataset.FragmentScanOptions]) – Options specific to a particular scan and fragment type, potentially changing across scans. Default: None
use_threads (bool) – Whether to use maximum parallelism determined by available CPU cores. Default: True
memory_pool (Optional[pa.MemoryPool]) – The memory pool for allocations. Uses the system’s default memory pool if not specified. Default: None
filesystem (Optional[pa.fs.FileSystem]) – Filesystem for writing the dataset. Default: None
file_options (Optional[pa.dataset.FileWriteOptions]) – Options for writing the dataset files. Default: None
max_partitions (int) – Maximum number of partitions for dataset writing. Default: 1024
max_open_files (int) – Maximum open files for dataset writing. Default: 1024
max_rows_per_file (int) – Maximum rows per file. Default: 100,000
min_rows_per_group (int) – Minimum rows per row group within each file. Default: 0
max_rows_per_group (int) – Maximum rows per row group within each file. Default: 100,000
file_visitor (Optional[Callable]) – If set, called once for each file created during the write, as in pyarrow.dataset.write_dataset. Default: None
existing_data_behavior (str) – How to handle existing data in the dataset directory. Options: ‘error’, ‘overwrite_or_ignore’, ‘delete_matching’. Default: ‘overwrite_or_ignore’
create_dir (bool) – Whether to create the dataset directory if it does not exist. Default: True
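The settings above tune both the scan and the rewrite sides of normalization. As a minimal usage sketch (the ParquetDB constructor and its normalize() entry point are assumptions about the surrounding API, not documented on this page):

    from parquetdb import ParquetDB
    from parquetdb.core.parquetdb import NormalizeConfig

    # Favor fewer, larger files: stream batches instead of one big table
    # and raise the per-file row cap.
    config = NormalizeConfig(
        load_format='batches',
        batch_size=65_536,
        max_rows_per_file=500_000,
        max_rows_per_group=100_000,
    )

    db = ParquetDB('path/to/db')           # hypothetical dataset path
    db.normalize(normalize_config=config)  # assumed entry point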
- __init__(load_format: str = 'table', batch_size: int = 131072, batch_readahead: int = 16, fragment_readahead: int = 4, fragment_scan_options: FragmentScanOptions | None = None, use_threads: bool = True, memory_pool: MemoryPool | None = None, filesystem: FileSystem | None = None, file_options: FileWriteOptions | None = None, max_partitions: int = 1024, max_open_files: int = 1024, max_rows_per_file: int = 100000, min_rows_per_group: int = 0, max_rows_per_group: int = 100000, file_visitor: Callable | None = None, existing_data_behavior: str = 'overwrite_or_ignore', create_dir: bool = True) → None
Methods

__init__([load_format, batch_size, ...])

Attributes
batch_readahead
batch_size
create_dir
existing_data_behavior
file_options
file_visitor
filesystem
fragment_readahead
fragment_scan_options
load_format
max_open_files
max_partitions
max_rows_per_file
max_rows_per_group
memory_pool
min_rows_per_group
use_threads
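These attributes mirror the parameters of pyarrow's dataset scanner and of pyarrow.dataset.write_dataset. The sketch below shows the equivalent raw pyarrow calls; treat it as a reference for what each field tunes, and as an assumption about how NormalizeConfig is applied internally:

    import pyarrow.dataset as ds

    dataset = ds.dataset('path/to/db')  # hypothetical dataset path

    # Scan side: batch_size, batch_readahead, fragment_readahead,
    # fragment_scan_options, use_threads, memory_pool.
    batches = dataset.to_batches(
        batch_size=131_072,
        batch_readahead=16,
        fragment_readahead=4,
        use_threads=True,
    )

    # Write side: filesystem, file_options, max_partitions, max_open_files,
    # the rows-per-file and rows-per-group caps, file_visitor,
    # existing_data_behavior, and create_dir.
    ds.write_dataset(
        batches,
        'path/to/db_normalized',
        format='parquet',
        schema=dataset.schema,  # required when writing an iterable of batches
        max_partitions=1024,
        max_open_files=1024,
        max_rows_per_file=100_000,
        min_rows_per_group=0,
        max_rows_per_group=100_000,
        existing_data_behavior='overwrite_or_ignore',
        create_dir=True,
    )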