Normalization

Normalization is a crucial process for ensuring optimal performance and efficient management of data in file-based database systems like ParquetDB. In traditional databases, normalization typically refers to structuring relational tables to reduce redundancy. In ParquetDB’s context, normalization helps balance the distribution of data across multiple Parquet files, avoiding situations where files have uneven row counts.

Without proper normalization, data skew can lead to performance bottlenecks in operations such as queries, inserts, updates, or deletions. When you normalize your dataset, ParquetDB rewrites and restructures your files so that they have a more consistent number of rows, which improves parallelization and read/write speeds.

In this notebook, we will:

  1. Generate an example dataset using generate_similar_data.

  2. Demonstrate how to normalize data in ParquetDB using NormalizeConfig.

  3. Show how normalization can improve performance by ensuring each file has a balanced distribution of rows.

[2]:
import pprint
import os
import shutil
from parquetdb.utils.general_utils import generate_similar_data
from parquetdb import ParquetDB, NormalizeConfig

Generate Example Data

Below, we’ll generate a dataset that imitates real-world data variations using the generate_similar_data utility function. This function creates new data entries based on the structure of a provided template.

[3]:
# Define a simple template data entry
template_dict = {
    "float_field": 10,
    "int_field": 10,
    "name": "item",
    "nested_value": {"value": 10, "name": "item"},
    "list_field": [1, 2, 3],
}
for x in range(500):
    template_dict[f"column_{x}"] = "test"

template = [template_dict]

# Generate multiple data entries
num_entries = 100000  # Feel free to adjust this
data = generate_similar_data(template, num_entries)

print("Generated Data:")
pprint.pprint(data[0])
Generated Data:
{'column_0': 'test_74',
 'column_1': 'test_16',
 'column_10': 'test_64',
 'column_100': 'test_84',
 'column_101': 'test_76',
 'column_102': 'test_78',
 'column_103': 'test_26',
 'column_104': 'test_16',
 'column_105': 'test_70',
 'column_106': 'test_19',
 'column_107': 'test_5',
 'column_108': 'test_82',
 'column_109': 'test_54',
 'column_11': 'test_72',
 'column_110': 'test_94',
 'column_111': 'test_59',
 'column_112': 'test_31',
 'column_113': 'test_33',
 'column_114': 'test_30',
 'column_115': 'test_82',
 'column_116': 'test_2',
 'column_117': 'test_20',
 'column_118': 'test_82',
 'column_119': 'test_31',
 'column_12': 'test_26',
 'column_120': 'test_89',
 'column_121': 'test_84',
 'column_122': 'test_85',
 'column_123': 'test_81',
 'column_124': 'test_37',
 'column_125': 'test_100',
 'column_126': 'test_95',
 'column_127': 'test_54',
 'column_128': 'test_86',
 'column_129': 'test_82',
 'column_13': 'test_5',
 'column_130': 'test_62',
 'column_131': 'test_94',
 'column_132': 'test_87',
 'column_133': 'test_99',
 'column_134': 'test_73',
 'column_135': 'test_56',
 'column_136': 'test_98',
 'column_137': 'test_36',
 'column_138': 'test_4',
 'column_139': 'test_88',
 'column_14': 'test_1',
 'column_140': 'test_71',
 'column_141': 'test_35',
 'column_142': 'test_71',
 'column_143': 'test_76',
 'column_144': 'test_55',
 'column_145': 'test_55',
 'column_146': 'test_15',
 'column_147': 'test_99',
 'column_148': 'test_80',
 'column_149': 'test_61',
 'column_15': 'test_91',
 'column_150': 'test_33',
 'column_151': 'test_58',
 'column_152': 'test_75',
 'column_153': 'test_51',
 'column_154': 'test_46',
 'column_155': 'test_20',
 'column_156': 'test_44',
 'column_157': 'test_49',
 'column_158': 'test_51',
 'column_159': 'test_2',
 'column_16': 'test_4',
 'column_160': 'test_25',
 'column_161': 'test_29',
 'column_162': 'test_54',
 'column_163': 'test_3',
 'column_164': 'test_23',
 'column_165': 'test_84',
 'column_166': 'test_41',
 'column_167': 'test_45',
 'column_168': 'test_65',
 'column_169': 'test_34',
 'column_17': 'test_99',
 'column_170': 'test_60',
 'column_171': 'test_57',
 'column_172': 'test_99',
 'column_173': 'test_67',
 'column_174': 'test_25',
 'column_175': 'test_97',
 'column_176': 'test_62',
 'column_177': 'test_30',
 'column_178': 'test_21',
 'column_179': 'test_70',
 'column_18': 'test_41',
 'column_180': 'test_59',
 'column_181': 'test_15',
 'column_182': 'test_67',
 'column_183': 'test_20',
 'column_184': 'test_41',
 'column_185': 'test_41',
 'column_186': 'test_42',
 'column_187': 'test_60',
 'column_188': 'test_51',
 'column_189': 'test_93',
 'column_19': 'test_62',
 'column_190': 'test_16',
 'column_191': 'test_60',
 'column_192': 'test_32',
 'column_193': 'test_94',
 'column_194': 'test_56',
 'column_195': 'test_91',
 'column_196': 'test_29',
 'column_197': 'test_65',
 'column_198': 'test_3',
 'column_199': 'test_5',
 'column_2': 'test_36',
 'column_20': 'test_12',
 'column_200': 'test_43',
 'column_201': 'test_25',
 'column_202': 'test_25',
 'column_203': 'test_31',
 'column_204': 'test_70',
 'column_205': 'test_5',
 'column_206': 'test_24',
 'column_207': 'test_8',
 'column_208': 'test_6',
 'column_209': 'test_81',
 'column_21': 'test_66',
 'column_210': 'test_88',
 'column_211': 'test_7',
 'column_212': 'test_98',
 'column_213': 'test_83',
 'column_214': 'test_5',
 'column_215': 'test_83',
 'column_216': 'test_62',
 'column_217': 'test_70',
 'column_218': 'test_6',
 'column_219': 'test_42',
 'column_22': 'test_44',
 'column_220': 'test_68',
 'column_221': 'test_57',
 'column_222': 'test_93',
 'column_223': 'test_5',
 'column_224': 'test_20',
 'column_225': 'test_8',
 'column_226': 'test_85',
 'column_227': 'test_70',
 'column_228': 'test_72',
 'column_229': 'test_9',
 'column_23': 'test_98',
 'column_230': 'test_60',
 'column_231': 'test_76',
 'column_232': 'test_88',
 'column_233': 'test_20',
 'column_234': 'test_42',
 'column_235': 'test_33',
 'column_236': 'test_63',
 'column_237': 'test_78',
 'column_238': 'test_21',
 'column_239': 'test_11',
 'column_24': 'test_87',
 'column_240': 'test_79',
 'column_241': 'test_25',
 'column_242': 'test_82',
 'column_243': 'test_70',
 'column_244': 'test_77',
 'column_245': 'test_4',
 'column_246': 'test_30',
 'column_247': 'test_13',
 'column_248': 'test_29',
 'column_249': 'test_24',
 'column_25': 'test_42',
 'column_250': 'test_37',
 'column_251': 'test_77',
 'column_252': 'test_53',
 'column_253': 'test_52',
 'column_254': 'test_26',
 'column_255': 'test_43',
 'column_256': 'test_3',
 'column_257': 'test_4',
 'column_258': 'test_23',
 'column_259': 'test_55',
 'column_26': 'test_39',
 'column_260': 'test_66',
 'column_261': 'test_47',
 'column_262': 'test_24',
 'column_263': 'test_16',
 'column_264': 'test_78',
 'column_265': 'test_49',
 'column_266': 'test_36',
 'column_267': 'test_74',
 'column_268': 'test_31',
 'column_269': 'test_62',
 'column_27': 'test_66',
 'column_270': 'test_27',
 'column_271': 'test_2',
 'column_272': 'test_71',
 'column_273': 'test_64',
 'column_274': 'test_33',
 'column_275': 'test_83',
 'column_276': 'test_73',
 'column_277': 'test_44',
 'column_278': 'test_25',
 'column_279': 'test_79',
 'column_28': 'test_77',
 'column_280': 'test_84',
 'column_281': 'test_16',
 'column_282': 'test_42',
 'column_283': 'test_69',
 'column_284': 'test_76',
 'column_285': 'test_94',
 'column_286': 'test_75',
 'column_287': 'test_81',
 'column_288': 'test_55',
 'column_289': 'test_63',
 'column_29': 'test_45',
 'column_290': 'test_12',
 'column_291': 'test_71',
 'column_292': 'test_9',
 'column_293': 'test_5',
 'column_294': 'test_9',
 'column_295': 'test_44',
 'column_296': 'test_33',
 'column_297': 'test_70',
 'column_298': 'test_97',
 'column_299': 'test_84',
 'column_3': 'test_28',
 'column_30': 'test_91',
 'column_300': 'test_64',
 'column_301': 'test_97',
 'column_302': 'test_30',
 'column_303': 'test_28',
 'column_304': 'test_70',
 'column_305': 'test_60',
 'column_306': 'test_44',
 'column_307': 'test_7',
 'column_308': 'test_14',
 'column_309': 'test_50',
 'column_31': 'test_10',
 'column_310': 'test_72',
 'column_311': 'test_62',
 'column_312': 'test_95',
 'column_313': 'test_68',
 'column_314': 'test_77',
 'column_315': 'test_52',
 'column_316': 'test_40',
 'column_317': 'test_27',
 'column_318': 'test_6',
 'column_319': 'test_47',
 'column_32': 'test_75',
 'column_320': 'test_78',
 'column_321': 'test_98',
 'column_322': 'test_24',
 'column_323': 'test_7',
 'column_324': 'test_74',
 'column_325': 'test_4',
 'column_326': 'test_51',
 'column_327': 'test_69',
 'column_328': 'test_25',
 'column_329': 'test_33',
 'column_33': 'test_15',
 'column_330': 'test_68',
 'column_331': 'test_14',
 'column_332': 'test_12',
 'column_333': 'test_27',
 'column_334': 'test_85',
 'column_335': 'test_41',
 'column_336': 'test_92',
 'column_337': 'test_73',
 'column_338': 'test_66',
 'column_339': 'test_92',
 'column_34': 'test_2',
 'column_340': 'test_93',
 'column_341': 'test_68',
 'column_342': 'test_36',
 'column_343': 'test_35',
 'column_344': 'test_78',
 'column_345': 'test_44',
 'column_346': 'test_55',
 'column_347': 'test_87',
 'column_348': 'test_33',
 'column_349': 'test_80',
 'column_35': 'test_52',
 'column_350': 'test_81',
 'column_351': 'test_8',
 'column_352': 'test_52',
 'column_353': 'test_73',
 'column_354': 'test_23',
 'column_355': 'test_10',
 'column_356': 'test_96',
 'column_357': 'test_25',
 'column_358': 'test_33',
 'column_359': 'test_5',
 'column_36': 'test_1',
 'column_360': 'test_2',
 'column_361': 'test_67',
 'column_362': 'test_30',
 'column_363': 'test_23',
 'column_364': 'test_92',
 'column_365': 'test_12',
 'column_366': 'test_5',
 'column_367': 'test_43',
 'column_368': 'test_94',
 'column_369': 'test_21',
 'column_37': 'test_92',
 'column_370': 'test_29',
 'column_371': 'test_92',
 'column_372': 'test_11',
 'column_373': 'test_50',
 'column_374': 'test_90',
 'column_375': 'test_3',
 'column_376': 'test_45',
 'column_377': 'test_78',
 'column_378': 'test_58',
 'column_379': 'test_47',
 'column_38': 'test_70',
 'column_380': 'test_8',
 'column_381': 'test_94',
 'column_382': 'test_74',
 'column_383': 'test_98',
 'column_384': 'test_7',
 'column_385': 'test_49',
 'column_386': 'test_47',
 'column_387': 'test_38',
 'column_388': 'test_16',
 'column_389': 'test_70',
 'column_39': 'test_50',
 'column_390': 'test_28',
 'column_391': 'test_18',
 'column_392': 'test_70',
 'column_393': 'test_72',
 'column_394': 'test_59',
 'column_395': 'test_1',
 'column_396': 'test_56',
 'column_397': 'test_1',
 'column_398': 'test_11',
 'column_399': 'test_28',
 'column_4': 'test_33',
 'column_40': 'test_47',
 'column_400': 'test_72',
 'column_401': 'test_8',
 'column_402': 'test_7',
 'column_403': 'test_52',
 'column_404': 'test_13',
 'column_405': 'test_61',
 'column_406': 'test_58',
 'column_407': 'test_42',
 'column_408': 'test_82',
 'column_409': 'test_58',
 'column_41': 'test_48',
 'column_410': 'test_51',
 'column_411': 'test_46',
 'column_412': 'test_81',
 'column_413': 'test_49',
 'column_414': 'test_83',
 'column_415': 'test_88',
 'column_416': 'test_16',
 'column_417': 'test_76',
 'column_418': 'test_42',
 'column_419': 'test_30',
 'column_42': 'test_56',
 'column_420': 'test_11',
 'column_421': 'test_71',
 'column_422': 'test_47',
 'column_423': 'test_28',
 'column_424': 'test_95',
 'column_425': 'test_54',
 'column_426': 'test_47',
 'column_427': 'test_55',
 'column_428': 'test_11',
 'column_429': 'test_91',
 'column_43': 'test_18',
 'column_430': 'test_67',
 'column_431': 'test_86',
 'column_432': 'test_35',
 'column_433': 'test_90',
 'column_434': 'test_88',
 'column_435': 'test_61',
 'column_436': 'test_18',
 'column_437': 'test_100',
 'column_438': 'test_22',
 'column_439': 'test_25',
 'column_44': 'test_36',
 'column_440': 'test_6',
 'column_441': 'test_35',
 'column_442': 'test_41',
 'column_443': 'test_91',
 'column_444': 'test_47',
 'column_445': 'test_25',
 'column_446': 'test_87',
 'column_447': 'test_22',
 'column_448': 'test_73',
 'column_449': 'test_32',
 'column_45': 'test_48',
 'column_450': 'test_3',
 'column_451': 'test_11',
 'column_452': 'test_40',
 'column_453': 'test_83',
 'column_454': 'test_28',
 'column_455': 'test_10',
 'column_456': 'test_12',
 'column_457': 'test_83',
 'column_458': 'test_51',
 'column_459': 'test_90',
 'column_46': 'test_43',
 'column_460': 'test_26',
 'column_461': 'test_91',
 'column_462': 'test_34',
 'column_463': 'test_66',
 'column_464': 'test_13',
 'column_465': 'test_13',
 'column_466': 'test_65',
 'column_467': 'test_90',
 'column_468': 'test_21',
 'column_469': 'test_11',
 'column_47': 'test_56',
 'column_470': 'test_15',
 'column_471': 'test_24',
 'column_472': 'test_1',
 'column_473': 'test_85',
 'column_474': 'test_79',
 'column_475': 'test_91',
 'column_476': 'test_51',
 'column_477': 'test_2',
 'column_478': 'test_10',
 'column_479': 'test_77',
 'column_48': 'test_15',
 'column_480': 'test_26',
 'column_481': 'test_16',
 'column_482': 'test_100',
 'column_483': 'test_9',
 'column_484': 'test_19',
 'column_485': 'test_31',
 'column_486': 'test_39',
 'column_487': 'test_65',
 'column_488': 'test_37',
 'column_489': 'test_32',
 'column_49': 'test_28',
 'column_490': 'test_35',
 'column_491': 'test_61',
 'column_492': 'test_56',
 'column_493': 'test_29',
 'column_494': 'test_93',
 'column_495': 'test_49',
 'column_496': 'test_24',
 'column_497': 'test_76',
 'column_498': 'test_63',
 'column_499': 'test_19',
 'column_5': 'test_94',
 'column_50': 'test_76',
 'column_51': 'test_88',
 'column_52': 'test_24',
 'column_53': 'test_84',
 'column_54': 'test_27',
 'column_55': 'test_89',
 'column_56': 'test_78',
 'column_57': 'test_27',
 'column_58': 'test_33',
 'column_59': 'test_42',
 'column_6': 'test_88',
 'column_60': 'test_94',
 'column_61': 'test_51',
 'column_62': 'test_62',
 'column_63': 'test_93',
 'column_64': 'test_53',
 'column_65': 'test_6',
 'column_66': 'test_73',
 'column_67': 'test_94',
 'column_68': 'test_20',
 'column_69': 'test_4',
 'column_7': 'test_16',
 'column_70': 'test_7',
 'column_71': 'test_97',
 'column_72': 'test_38',
 'column_73': 'test_61',
 'column_74': 'test_58',
 'column_75': 'test_99',
 'column_76': 'test_71',
 'column_77': 'test_97',
 'column_78': 'test_19',
 'column_79': 'test_45',
 'column_8': 'test_20',
 'column_80': 'test_59',
 'column_81': 'test_2',
 'column_82': 'test_77',
 'column_83': 'test_51',
 'column_84': 'test_38',
 'column_85': 'test_92',
 'column_86': 'test_84',
 'column_87': 'test_69',
 'column_88': 'test_83',
 'column_89': 'test_95',
 'column_9': 'test_69',
 'column_90': 'test_36',
 'column_91': 'test_60',
 'column_92': 'test_59',
 'column_93': 'test_28',
 'column_94': 'test_37',
 'column_95': 'test_54',
 'column_96': 'test_86',
 'column_97': 'test_100',
 'column_98': 'test_29',
 'column_99': 'test_81',
 'float_field': 12,
 'int_field': 9,
 'list_field': [1, 2, 3],
 'name': 'item_66',
 'nested_value': {'name': 'item_6', 'value': 0}}

Next, we import the data into our database.

[4]:
db_path = "ParquetDB"
if os.path.exists(db_path):
    shutil.rmtree(db_path)
db = ParquetDB(db_path=db_path)

db.create(data)
print(db)
============================================================
PARQUETDB SUMMARY
============================================================
Database path: ParquetDB

• Number of columns: 507
• Number of rows: 100000
• Number of files: 1
• Number of rows per file: [100000]
• Number of row groups per file: [4]
• Serialized metadata size per file: [225147] Bytes

############################################################
METADATA
############################################################

############################################################
COLUMN DETAILS
############################################################
• Columns:
    - column_296
    - column_302
    - column_370
    - column_0
    - column_80
    - column_367
    - column_344
    - column_216
    - column_151
    - column_61
    - column_439
    - column_87
    - column_379
    - column_351
    - column_329
    - column_423
    - column_97
    - column_42
    - column_401
    - column_355
    - column_171
    - column_261
    - column_125
    - column_270
    - column_398
    - column_17
    - column_18
    - column_479
    - column_186
    - column_471
    - column_274
    - column_236
    - column_421
    - column_394
    - column_369
    - column_397
    - column_417
    - column_432
    - column_258
    - column_120
    - column_178
    - column_415
    - column_126
    - column_477
    - column_124
    - column_105
    - column_360
    - column_374
    - column_99
    - column_314
    - column_380
    - column_154
    - column_320
    - name
    - column_459
    - column_78
    - column_15
    - column_291
    - column_189
    - column_229
    - column_412
    - column_276
    - column_96
    - column_123
    - column_495
    - column_358
    - column_354
    - column_169
    - column_156
    - column_452
    - column_483
    - column_170
    - column_163
    - column_119
    - column_180
    - column_463
    - column_176
    - column_349
    - column_106
    - column_316
    - column_442
    - column_338
    - column_206
    - float_field
    - column_68
    - column_444
    - column_222
    - column_16
    - column_324
    - column_58
    - list_field
    - column_364
    - column_226
    - column_366
    - column_82
    - column_323
    - column_95
    - column_248
    - column_437
    - column_204
    - column_275
    - column_160
    - column_307
    - column_37
    - column_116
    - column_333
    - column_143
    - column_436
    - column_46
    - column_148
    - column_376
    - column_164
    - column_277
    - column_278
    - column_357
    - column_365
    - column_212
    - column_361
    - column_499
    - column_85
    - column_217
    - column_402
    - column_157
    - column_272
    - column_285
    - column_426
    - column_57
    - column_400
    - column_240
    - nested_value.value
    - column_127
    - column_89
    - column_468
    - id
    - column_118
    - column_469
    - column_418
    - column_490
    - column_150
    - column_430
    - column_152
    - column_455
    - column_475
    - column_172
    - column_146
    - column_8
    - column_359
    - column_420
    - column_408
    - column_84
    - column_210
    - column_309
    - column_482
    - column_102
    - column_195
    - column_223
    - column_194
    - column_200
    - column_29
    - column_158
    - column_336
    - column_207
    - column_414
    - column_22
    - column_228
    - column_407
    - column_454
    - column_234
    - column_322
    - column_64
    - column_45
    - column_310
    - column_147
    - column_348
    - column_38
    - column_115
    - column_269
    - column_167
    - column_179
    - column_447
    - column_330
    - column_419
    - column_32
    - column_94
    - column_187
    - column_203
    - column_14
    - column_337
    - column_498
    - column_98
    - column_67
    - column_263
    - column_235
    - column_249
    - column_292
    - column_63
    - column_27
    - column_53
    - column_26
    - column_450
    - column_243
    - column_44
    - column_306
    - column_331
    - column_315
    - column_108
    - column_428
    - column_383
    - column_453
    - column_232
    - column_460
    - column_413
    - column_441
    - column_3
    - column_390
    - column_60
    - column_117
    - column_303
    - column_215
    - column_91
    - column_434
    - column_52
    - column_145
    - column_294
    - column_256
    - column_470
    - column_75
    - column_202
    - column_451
    - column_347
    - column_262
    - column_144
    - column_31
    - column_472
    - column_213
    - column_218
    - column_363
    - column_456
    - column_155
    - column_73
    - column_484
    - column_23
    - column_404
    - column_431
    - column_133
    - column_221
    - column_440
    - column_279
    - column_438
    - column_429
    - column_327
    - column_231
    - column_350
    - column_392
    - column_49
    - column_25
    - column_230
    - column_142
    - column_284
    - column_191
    - column_445
    - column_140
    - column_405
    - column_55
    - column_93
    - column_461
    - column_254
    - column_340
    - column_136
    - column_39
    - column_138
    - column_435
    - column_252
    - column_128
    - column_237
    - column_192
    - column_188
    - column_406
    - column_488
    - column_224
    - column_209
    - column_250
    - column_121
    - column_389
    - column_403
    - column_50
    - column_70
    - column_290
    - column_198
    - column_388
    - column_199
    - column_36
    - column_334
    - column_35
    - column_377
    - column_40
    - column_395
    - column_185
    - column_386
    - column_100
    - column_111
    - column_10
    - column_114
    - column_239
    - column_113
    - column_433
    - column_473
    - column_326
    - column_465
    - column_7
    - column_72
    - column_51
    - column_2
    - column_165
    - column_448
    - column_183
    - column_257
    - column_308
    - column_384
    - column_131
    - column_266
    - column_5
    - column_282
    - column_493
    - column_104
    - column_181
    - column_30
    - column_491
    - column_287
    - column_153
    - column_242
    - column_298
    - column_149
    - column_265
    - column_139
    - column_339
    - column_385
    - column_168
    - column_372
    - column_411
    - column_166
    - column_141
    - column_11
    - column_83
    - column_4
    - column_47
    - column_424
    - column_362
    - column_214
    - column_244
    - column_368
    - column_474
    - column_54
    - column_409
    - column_174
    - column_271
    - column_9
    - column_129
    - column_335
    - column_356
    - column_299
    - column_325
    - column_79
    - column_193
    - column_253
    - column_137
    - column_328
    - column_173
    - column_76
    - column_56
    - column_134
    - column_159
    - column_24
    - column_297
    - column_283
    - column_449
    - column_288
    - column_494
    - column_422
    - column_496
    - column_6
    - column_74
    - column_268
    - column_427
    - column_382
    - column_462
    - column_311
    - column_273
    - column_289
    - column_373
    - column_247
    - column_238
    - column_211
    - nested_value.name
    - column_280
    - column_196
    - column_487
    - column_458
    - column_225
    - column_295
    - column_425
    - column_62
    - column_457
    - column_313
    - column_378
    - column_109
    - column_175
    - column_13
    - column_182
    - column_342
    - column_343
    - column_12
    - column_321
    - column_201
    - column_485
    - column_197
    - column_1
    - column_177
    - column_161
    - column_259
    - column_305
    - column_341
    - column_416
    - column_130
    - column_345
    - column_381
    - column_208
    - column_135
    - column_300
    - column_66
    - column_241
    - column_264
    - column_467
    - column_317
    - column_286
    - column_86
    - column_41
    - column_122
    - column_255
    - column_92
    - column_486
    - column_69
    - column_393
    - column_107
    - column_466
    - column_28
    - column_301
    - column_220
    - column_245
    - column_219
    - column_375
    - column_480
    - column_190
    - column_112
    - column_371
    - column_481
    - column_132
    - column_101
    - column_246
    - column_332
    - column_267
    - column_65
    - column_353
    - column_464
    - column_293
    - column_48
    - column_103
    - column_205
    - column_19
    - column_446
    - column_492
    - column_43
    - column_318
    - column_443
    - column_59
    - int_field
    - column_20
    - column_281
    - column_489
    - column_312
    - column_184
    - column_34
    - column_81
    - column_77
    - column_33
    - column_71
    - column_497
    - column_88
    - column_387
    - column_319
    - column_304
    - column_399
    - column_227
    - column_410
    - column_233
    - column_260
    - column_391
    - column_352
    - column_110
    - column_476
    - column_90
    - column_21
    - column_251
    - column_346
    - column_396
    - column_478
    - column_162

[5]:
data = None
df = db.read().to_pandas()
print(df)
      column_296 column_302 column_370  column_0 column_80 column_367  \
0        test_33    test_30    test_29   test_74   test_59    test_43
1        test_62    test_97    test_10   test_29   test_59    test_47
2         test_6    test_65    test_28   test_29    test_6    test_81
3        test_35    test_61    test_57   test_61   test_71    test_92
4        test_56    test_44    test_70   test_18   test_75    test_77
...          ...        ...        ...       ...       ...        ...
99995    test_43    test_48    test_46   test_77   test_26    test_20
99996     test_6    test_45    test_36  test_100   test_47    test_58
99997     test_2    test_33    test_61   test_99   test_64    test_44
99998    test_96    test_15    test_26   test_58   test_35    test_77
99999    test_63    test_87    test_39   test_38   test_59    test_83

      column_344 column_216 column_151 column_61  ... column_352 column_110  \
0        test_78    test_62    test_58   test_51  ...    test_52    test_94
1        test_21    test_39    test_50   test_78  ...    test_32    test_57
2        test_32    test_21    test_34   test_25  ...    test_70    test_46
3        test_66     test_7    test_18    test_3  ...    test_42     test_4
4        test_75    test_51    test_76   test_76  ...    test_16    test_36
...          ...        ...        ...       ...  ...        ...        ...
99995    test_71    test_11    test_43   test_73  ...    test_89    test_27
99996    test_38    test_33    test_82   test_99  ...    test_90     test_3
99997    test_26    test_29    test_19   test_74  ...    test_70    test_50
99998    test_43    test_16    test_54   test_40  ...     test_5    test_15
99999    test_82    test_53    test_46   test_71  ...    test_97    test_74

      column_476 column_90 column_21 column_251 column_346 column_396  \
0        test_51   test_36   test_66    test_77    test_55    test_56
1        test_23   test_36   test_48    test_77    test_86    test_81
2        test_66   test_63   test_71    test_82    test_89    test_16
3        test_26   test_71   test_30    test_56    test_17    test_32
4         test_6   test_66   test_14    test_43    test_23    test_16
...          ...       ...       ...        ...        ...        ...
99995    test_55    test_9   test_38    test_15    test_98    test_99
99996    test_67    test_9   test_61    test_13    test_39    test_25
99997     test_9    test_5   test_48    test_69    test_44    test_11
99998    test_10   test_92    test_5    test_31    test_44    test_84
99999    test_82   test_13   test_54    test_39    test_84    test_51

      column_478 column_162
0        test_10    test_54
1         test_4    test_44
2        test_90    test_75
3        test_17    test_29
4        test_68    test_50
...          ...        ...
99995    test_28    test_58
99996    test_89     test_4
99997    test_95    test_74
99998    test_61    test_23
99999     test_7    test_53

[100000 rows x 507 columns]

Normalize Data Using ParquetDB

Next, we’ll introduce the NormalizeConfig class, which allows you to fine-tune how normalization is performed over the various operations in ParquetDB.

The NormalizeConfig Class

@dataclass
class NormalizeConfig:
    load_format: str = "table"
    batch_size: int = 131_072
    batch_readahead: int = 16
    fragment_readahead: int = 4
    fragment_scan_options: Optional[pa.dataset.FragmentScanOptions] = None
    use_threads: bool = True
    memory_pool: Optional[pa.MemoryPool] = None
    filesystem: Optional[fs.FileSystem] = None
    file_options: Optional[ds.FileWriteOptions] = None
    use_threads: bool = config.parquetdb_config.normalize_kwargs.use_threads
    max_partitions: int = config.parquetdb_config.normalize_kwargs.max_partitions
    max_open_files: int = config.parquetdb_config.normalize_kwargs.max_open_files
    max_rows_per_file: int = config.parquetdb_config.normalize_kwargs.max_rows_per_file
    min_rows_per_group: int = (
        config.parquetdb_config.normalize_kwargs.min_rows_per_group
    )
    max_rows_per_group: int = (
        config.parquetdb_config.normalize_kwargs.max_rows_per_group
    )
    file_visitor: Optional[Callable] = None
    existing_data_behavior: str = (
        config.parquetdb_config.normalize_kwargs.existing_data_behavior
    )
    create_dir: bool = True

The most important parameters of the NormalizeConfig data class are the following:

  • ``load_format : str`` The format of the output dataset. Supported formats are 'table' and 'batches' (default: 'table').

  • ``batch_size : int, optional`` The number of rows to process in each batch (default: 131,072, as set in the class definition above).

  • ``batch_readahead : int, optional`` The number of batches to read ahead in a file (default: 16).

  • ``fragment_readahead : int, optional`` The number of files to read ahead, improving IO utilization at the cost of RAM usage (default: 4).

  • ``max_open_files : int`` Maximum open files for dataset writing (default: 1024).

  • ``max_rows_per_file : int`` Maximum rows per file (default: 10,000).

  • ``min_rows_per_group : int`` Minimum rows per row group within each file (default: 0).

  • ``max_rows_per_group : int`` Maximum rows per row group within each file (default: 10,000).

Parquet files store data in row groups, which allows for batching and parallelization.

The diagram below compares how different storage formats lay out your data in memory.

  • CSV files use a row-based layout, which is inefficient because similar data is not stored contiguously in memory.

  • Columnar storage is more efficient because it stores similar data contiguously in memory, but it is harder to batch because the data is not split into chunks.

  • Parquet files use a row-group-based layout, which stores similar data contiguously and also splits it into chunks, which makes parallelization easier.

Row Group Storage
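The three layouts described above can be sketched with plain Python containers. This is only an illustration of the ordering of values, not how Parquet or ParquetDB store bytes on disk; the helper `to_row_groups` and the tiny example table are hypothetical.

```python
# A tiny table with two columns and four rows.
rows = [
    {"id": 0, "name": "a"},
    {"id": 1, "name": "b"},
    {"id": 2, "name": "c"},
    {"id": 3, "name": "d"},
]

# Row-based layout (CSV-like): the values of one record sit next to each other.
row_layout = [value for row in rows for value in row.values()]

# Columnar layout: all values of one column sit next to each other.
column_layout = {key: [row[key] for row in rows] for key in rows[0]}


def to_row_groups(records, rows_per_group):
    """Cut records into chunks and store each chunk column by column."""
    groups = []
    for start in range(0, len(records), rows_per_group):
        chunk = records[start:start + rows_per_group]
        groups.append({key: [r[key] for r in chunk] for key in chunk[0]})
    return groups


# Row-group layout (Parquet-like): chunks of rows, each stored by column.
row_groups = to_row_groups(rows, rows_per_group=2)

print(row_layout)     # [0, 'a', 1, 'b', 2, 'c', 3, 'd']
print(column_layout)  # {'id': [0, 1, 2, 3], 'name': ['a', 'b', 'c', 'd']}
print(row_groups)     # [{'id': [0, 1], 'name': ['a', 'b']}, {'id': [2, 3], 'name': ['c', 'd']}]
```

Each row group is a self-contained columnar chunk, so separate groups can be handed to separate workers.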

Optimizing parameters like the number of rows per row group, how many row groups per file, and how many files to read ahead can help significantly improve speed and memory performance.

Let’s look at the details of the row groups of our current dataset. We can do this by calling the summary method with show_row_group_metadata=True.

[6]:
print(db.summary(show_row_group_metadata=True))
============================================================
PARQUETDB SUMMARY
============================================================
Database path: ParquetDB

• Number of columns: 507
• Number of rows: 100000
• Number of files: 1
• Number of rows per file: [100000]
• Number of row groups per file: [4]
• Number of rows per row group per file:
    - ParquetDB_0.parquet:
        - Row group 0: 32768 rows
        - Row group 1: 32768 rows
        - Row group 2: 32768 rows
        - Row group 3: 1696 rows
• Serialized metadata size per file: [225147] Bytes

############################################################
METADATA
############################################################

############################################################
COLUMN DETAILS
############################################################

Here we can see that our file has 4 row groups, with a maximum of 32,768 rows per group. Typically this is fine, but if your system can handle it, it is best to chunk the data into larger groups.

A good rule of thumb is to aim for about 2 GB per file and about 200 MB per row group. Finding the best settings for your system will require some trial and error.
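As a starting point for that trial and error, the size targets can be turned into row counts with simple arithmetic. The sketch below is plain Python and does not call ParquetDB. The helper `rows_for_target_size` is hypothetical, and the 4 KiB average row size is an assumed value that you would replace with a measurement from your own data. The estimate also ignores compression and encoding, so treat the result as a rough first guess.

```python
def rows_for_target_size(bytes_per_row, target_bytes):
    """Estimate how many rows fit in target_bytes (always at least one row)."""
    if bytes_per_row <= 0:
        raise ValueError("bytes_per_row must be positive")
    return max(1, target_bytes // bytes_per_row)


MB = 1024 ** 2
GB = 1024 ** 3

# Assumed average row size; measure this on your own dataset.
bytes_per_row = 4 * 1024  # 4 KiB

# Rule-of-thumb targets from the text: ~200 MB per row group, ~2 GB per file.
max_rows_per_group = rows_for_target_size(bytes_per_row, 200 * MB)
max_rows_per_file = rows_for_target_size(bytes_per_row, 2 * GB)

print(max_rows_per_group)  # 51200
print(max_rows_per_file)   # 524288
```

The resulting numbers could then be tried as max_rows_per_group and max_rows_per_file in a NormalizeConfig and adjusted after checking the summary.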

Let’s normalize the data with NormalizeConfig and set it to 50,000 rows per row group.

[25]:
from parquetdb import NormalizeConfig

normalize_config = NormalizeConfig(min_rows_per_group=50000, max_rows_per_group=50000)

db.normalize(normalize_config=normalize_config)

print(db.summary(show_row_group_metadata=True))
============================================================
PARQUETDB SUMMARY
============================================================
Database path: ParquetDB

• Number of columns: 507
• Number of rows: 100000
• Number of files: 1
• Number of rows per file: [100000]
• Number of row groups per file: [2]
• Number of rows per row group per file:
    - ParquetDB_0.parquet:
        - Row group 0: 50000 rows
        - Row group 1: 50000 rows
• Serialized metadata size per file: [136357] Bytes

############################################################
METADATA
############################################################

############################################################
COLUMN DETAILS
############################################################

Now we have two row groups, each with a maximum of 50,000 rows.
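The row-group counts in the two summaries follow from dividing the total number of rows by the maximum rows per group. The sketch below reproduces that arithmetic in plain Python. `split_into_row_groups` is a hypothetical helper, not part of ParquetDB, and it models only the simple case of filling each group to the maximum before starting the next one.

```python
def split_into_row_groups(total_rows, max_rows_per_group):
    """Return the row count of each group when groups are filled to the maximum."""
    if max_rows_per_group <= 0:
        raise ValueError("max_rows_per_group must be positive")
    full_groups, remainder = divmod(total_rows, max_rows_per_group)
    sizes = [max_rows_per_group] * full_groups
    if remainder:
        sizes.append(remainder)  # the last group holds the leftover rows
    return sizes


# Before normalization: groups of at most 32,768 rows.
print(split_into_row_groups(100_000, 32_768))  # [32768, 32768, 32768, 1696]

# After normalization with max_rows_per_group=50000.
print(split_into_row_groups(100_000, 50_000))  # [50000, 50000]
```

Both results match the per-row-group counts reported by the summaries above.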

In some cases, however, 50,000 rows per group might be too large, especially if you’re working with a wide dataset (e.g., ~4,000 columns). In such cases, we can reduce the number of rows per group to 10,000 for better performance.

For particularly large datasets, it’s also important to fine-tune additional parameters, such as batch_readahead, fragment_readahead, load_format="batches", and batch_size.

By default, ParquetDB uses load_format="table", meaning it will attempt to read and write all the data at once. While this approach works well for smaller datasets, it can cause performance bottlenecks when handling larger datasets.

To address this, we can set load_format="batches" and choose a batch size, for example batch_size=5000. With that setting, data is processed in chunks of 5,000 rows at a time, which keeps memory usage under control. Setting batch_readahead=2 would additionally let ParquetDB load two batches into memory ahead of processing, reducing waiting times. The cell below uses batch_size=100 and batch_readahead=16.

When reading data, ParquetDB processes files sequentially. We can control how many files are opened and read ahead with fragment_readahead. For example, fragment_readahead=2 reads two files ahead, trading some memory for better I/O utilization. The cell below uses fragment_readahead=4.

Note: The batch size can only go as high as the number of rows in a row group.
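To illustrate the note, here is a minimal plain-Python model of how batches relate to row groups. It assumes, as the note states, that a batch never spans two row groups. `plan_batches` is a hypothetical helper for illustration and does not call ParquetDB or read any files.

```python
def plan_batches(row_group_sizes, batch_size):
    """Yield the number of rows in each batch, never crossing a row-group boundary."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for group_rows in row_group_sizes:
        remaining = group_rows
        while remaining > 0:
            rows = min(batch_size, remaining)
            yield rows
            remaining -= rows


# Two row groups of 50,000 rows, read in batches of up to 20,000 rows.
print(list(plan_batches([50_000, 50_000], 20_000)))
# [20000, 20000, 10000, 20000, 20000, 10000]

# Requesting more rows than a row group holds still yields one batch per group.
print(list(plan_batches([50_000, 50_000], 80_000)))
# [50000, 50000]
```

Under this model, a batch size that divides the row-group size evenly avoids small leftover batches at the end of each group.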

[26]:
normalize_config = NormalizeConfig(
    load_format="batches",
    batch_size=100,
    batch_readahead=16,
    fragment_readahead=4,
    max_rows_per_group=10000,
    min_rows_per_group=10000,
)

db.normalize(normalize_config=normalize_config)

print(db.summary(show_row_group_metadata=True))
============================================================
PARQUETDB SUMMARY
============================================================
Database path: ParquetDB

• Number of columns: 507
• Number of rows: 100000
• Number of files: 1
• Number of rows per file: [100000]
• Number of row groups per file: [10]
• Number of rows per row group per file:
    - ParquetDB_0.parquet:
        - Row group 0: 10000 rows
        - Row group 1: 10000 rows
        - Row group 2: 10000 rows
        - Row group 3: 10000 rows
        - Row group 4: 10000 rows
        - Row group 5: 10000 rows
        - Row group 6: 10000 rows
        - Row group 7: 10000 rows
        - Row group 8: 10000 rows
        - Row group 9: 10000 rows
• Serialized metadata size per file: [497366] Bytes

############################################################
METADATA
############################################################

############################################################
COLUMN DETAILS
############################################################

Finishing

Now that we have normalized the data, we can see that the rows are evenly distributed across the row groups. Many ParquetDB methods (such as read, update, delete, transform, and update_schema) accept a normalize_config argument, which lets you fine-tune normalization during these operations.