Normalization
Normalization is a crucial process for ensuring optimal performance and efficient management of data in file-based database systems like ParquetDB. In traditional databases, normalization typically refers to structuring relational tables to reduce redundancy. In ParquetDB’s context, normalization helps balance the distribution of data across multiple Parquet files, avoiding situations where files have uneven row counts.
Without proper normalization, data skew can lead to performance bottlenecks in operations such as queries, inserts, updates, or deletions. When you normalize your dataset, ParquetDB rewrites and restructures your files so that each has a more consistent number of rows, which improves parallelization and read/write speeds.
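To make the idea of balancing row counts concrete, here is a minimal sketch in plain Python. It is not ParquetDB's implementation: the helper name rebalance_row_counts and the sample row counts are hypothetical, and the sketch simply pools all rows and re-splits them under a per-file cap.

```python
def rebalance_row_counts(rows_per_file, max_rows_per_file):
    """Return per-file row counts after pooling all rows and re-splitting them.

    Hypothetical helper for illustration only; it models the effect of a
    per-file row cap, not ParquetDB's actual rewrite logic.
    """
    if max_rows_per_file <= 0:
        raise ValueError("max_rows_per_file must be positive")
    total = sum(rows_per_file)
    full_files, remainder = divmod(total, max_rows_per_file)
    counts = [max_rows_per_file] * full_files
    if remainder:
        counts.append(remainder)  # the last file holds the leftover rows
    return counts


# A skewed layout: one large file and two small ones (made-up numbers).
skewed = [95_000, 3_000, 2_000]
print(rebalance_row_counts(skewed, max_rows_per_file=25_000))
# [25000, 25000, 25000, 25000]
```

After rebalancing, every file holds the same number of rows, so work split per file is spread evenly.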
In this notebook, we will:
- Generate an example dataset using generate_similar_data.
- Demonstrate how to normalize data in ParquetDB using NormalizeConfig.
- Show how normalization can improve performance by ensuring each file has a balanced distribution of rows.
[2]:
import pprint
import os
import shutil
from parquetdb.utils.general_utils import generate_similar_data
from parquetdb import ParquetDB, NormalizeConfig
Generate Example Data
Below, we’ll generate a dataset that imitates real-world data variations using the generate_similar_data utility function. This function creates new data entries based on the structure of a provided template.
[3]:
# Define a simple template data entry
template_dict = {
    "float_field": 10,
    "int_field": 10,
    "name": "item",
    "nested_value": {"value": 10, "name": "item"},
    "list_field": [1, 2, 3],
}
# Add 500 string columns so the dataset is wide
for x in range(500):
    template_dict[f"column_{x}"] = "test"
template = [template_dict]
# Generate multiple data entries
num_entries = 100000  # Feel free to adjust this
data = generate_similar_data(template, num_entries)
print("Generated Data:")
pprint.pprint(data[0])
Generated Data:
{'column_0': 'test_74',
'column_1': 'test_16',
'column_10': 'test_64',
'column_100': 'test_84',
'column_101': 'test_76',
'column_102': 'test_78',
'column_103': 'test_26',
'column_104': 'test_16',
'column_105': 'test_70',
'column_106': 'test_19',
'column_107': 'test_5',
'column_108': 'test_82',
'column_109': 'test_54',
'column_11': 'test_72',
'column_110': 'test_94',
'column_111': 'test_59',
'column_112': 'test_31',
'column_113': 'test_33',
'column_114': 'test_30',
'column_115': 'test_82',
'column_116': 'test_2',
'column_117': 'test_20',
'column_118': 'test_82',
'column_119': 'test_31',
'column_12': 'test_26',
'column_120': 'test_89',
'column_121': 'test_84',
'column_122': 'test_85',
'column_123': 'test_81',
'column_124': 'test_37',
'column_125': 'test_100',
'column_126': 'test_95',
'column_127': 'test_54',
'column_128': 'test_86',
'column_129': 'test_82',
'column_13': 'test_5',
'column_130': 'test_62',
'column_131': 'test_94',
'column_132': 'test_87',
'column_133': 'test_99',
'column_134': 'test_73',
'column_135': 'test_56',
'column_136': 'test_98',
'column_137': 'test_36',
'column_138': 'test_4',
'column_139': 'test_88',
'column_14': 'test_1',
'column_140': 'test_71',
'column_141': 'test_35',
'column_142': 'test_71',
'column_143': 'test_76',
'column_144': 'test_55',
'column_145': 'test_55',
'column_146': 'test_15',
'column_147': 'test_99',
'column_148': 'test_80',
'column_149': 'test_61',
'column_15': 'test_91',
'column_150': 'test_33',
'column_151': 'test_58',
'column_152': 'test_75',
'column_153': 'test_51',
'column_154': 'test_46',
'column_155': 'test_20',
'column_156': 'test_44',
'column_157': 'test_49',
'column_158': 'test_51',
'column_159': 'test_2',
'column_16': 'test_4',
'column_160': 'test_25',
'column_161': 'test_29',
'column_162': 'test_54',
'column_163': 'test_3',
'column_164': 'test_23',
'column_165': 'test_84',
'column_166': 'test_41',
'column_167': 'test_45',
'column_168': 'test_65',
'column_169': 'test_34',
'column_17': 'test_99',
'column_170': 'test_60',
'column_171': 'test_57',
'column_172': 'test_99',
'column_173': 'test_67',
'column_174': 'test_25',
'column_175': 'test_97',
'column_176': 'test_62',
'column_177': 'test_30',
'column_178': 'test_21',
'column_179': 'test_70',
'column_18': 'test_41',
'column_180': 'test_59',
'column_181': 'test_15',
'column_182': 'test_67',
'column_183': 'test_20',
'column_184': 'test_41',
'column_185': 'test_41',
'column_186': 'test_42',
'column_187': 'test_60',
'column_188': 'test_51',
'column_189': 'test_93',
'column_19': 'test_62',
'column_190': 'test_16',
'column_191': 'test_60',
'column_192': 'test_32',
'column_193': 'test_94',
'column_194': 'test_56',
'column_195': 'test_91',
'column_196': 'test_29',
'column_197': 'test_65',
'column_198': 'test_3',
'column_199': 'test_5',
'column_2': 'test_36',
'column_20': 'test_12',
'column_200': 'test_43',
'column_201': 'test_25',
'column_202': 'test_25',
'column_203': 'test_31',
'column_204': 'test_70',
'column_205': 'test_5',
'column_206': 'test_24',
'column_207': 'test_8',
'column_208': 'test_6',
'column_209': 'test_81',
'column_21': 'test_66',
'column_210': 'test_88',
'column_211': 'test_7',
'column_212': 'test_98',
'column_213': 'test_83',
'column_214': 'test_5',
'column_215': 'test_83',
'column_216': 'test_62',
'column_217': 'test_70',
'column_218': 'test_6',
'column_219': 'test_42',
'column_22': 'test_44',
'column_220': 'test_68',
'column_221': 'test_57',
'column_222': 'test_93',
'column_223': 'test_5',
'column_224': 'test_20',
'column_225': 'test_8',
'column_226': 'test_85',
'column_227': 'test_70',
'column_228': 'test_72',
'column_229': 'test_9',
'column_23': 'test_98',
'column_230': 'test_60',
'column_231': 'test_76',
'column_232': 'test_88',
'column_233': 'test_20',
'column_234': 'test_42',
'column_235': 'test_33',
'column_236': 'test_63',
'column_237': 'test_78',
'column_238': 'test_21',
'column_239': 'test_11',
'column_24': 'test_87',
'column_240': 'test_79',
'column_241': 'test_25',
'column_242': 'test_82',
'column_243': 'test_70',
'column_244': 'test_77',
'column_245': 'test_4',
'column_246': 'test_30',
'column_247': 'test_13',
'column_248': 'test_29',
'column_249': 'test_24',
'column_25': 'test_42',
'column_250': 'test_37',
'column_251': 'test_77',
'column_252': 'test_53',
'column_253': 'test_52',
'column_254': 'test_26',
'column_255': 'test_43',
'column_256': 'test_3',
'column_257': 'test_4',
'column_258': 'test_23',
'column_259': 'test_55',
'column_26': 'test_39',
'column_260': 'test_66',
'column_261': 'test_47',
'column_262': 'test_24',
'column_263': 'test_16',
'column_264': 'test_78',
'column_265': 'test_49',
'column_266': 'test_36',
'column_267': 'test_74',
'column_268': 'test_31',
'column_269': 'test_62',
'column_27': 'test_66',
'column_270': 'test_27',
'column_271': 'test_2',
'column_272': 'test_71',
'column_273': 'test_64',
'column_274': 'test_33',
'column_275': 'test_83',
'column_276': 'test_73',
'column_277': 'test_44',
'column_278': 'test_25',
'column_279': 'test_79',
'column_28': 'test_77',
'column_280': 'test_84',
'column_281': 'test_16',
'column_282': 'test_42',
'column_283': 'test_69',
'column_284': 'test_76',
'column_285': 'test_94',
'column_286': 'test_75',
'column_287': 'test_81',
'column_288': 'test_55',
'column_289': 'test_63',
'column_29': 'test_45',
'column_290': 'test_12',
'column_291': 'test_71',
'column_292': 'test_9',
'column_293': 'test_5',
'column_294': 'test_9',
'column_295': 'test_44',
'column_296': 'test_33',
'column_297': 'test_70',
'column_298': 'test_97',
'column_299': 'test_84',
'column_3': 'test_28',
'column_30': 'test_91',
'column_300': 'test_64',
'column_301': 'test_97',
'column_302': 'test_30',
'column_303': 'test_28',
'column_304': 'test_70',
'column_305': 'test_60',
'column_306': 'test_44',
'column_307': 'test_7',
'column_308': 'test_14',
'column_309': 'test_50',
'column_31': 'test_10',
'column_310': 'test_72',
'column_311': 'test_62',
'column_312': 'test_95',
'column_313': 'test_68',
'column_314': 'test_77',
'column_315': 'test_52',
'column_316': 'test_40',
'column_317': 'test_27',
'column_318': 'test_6',
'column_319': 'test_47',
'column_32': 'test_75',
'column_320': 'test_78',
'column_321': 'test_98',
'column_322': 'test_24',
'column_323': 'test_7',
'column_324': 'test_74',
'column_325': 'test_4',
'column_326': 'test_51',
'column_327': 'test_69',
'column_328': 'test_25',
'column_329': 'test_33',
'column_33': 'test_15',
'column_330': 'test_68',
'column_331': 'test_14',
'column_332': 'test_12',
'column_333': 'test_27',
'column_334': 'test_85',
'column_335': 'test_41',
'column_336': 'test_92',
'column_337': 'test_73',
'column_338': 'test_66',
'column_339': 'test_92',
'column_34': 'test_2',
'column_340': 'test_93',
'column_341': 'test_68',
'column_342': 'test_36',
'column_343': 'test_35',
'column_344': 'test_78',
'column_345': 'test_44',
'column_346': 'test_55',
'column_347': 'test_87',
'column_348': 'test_33',
'column_349': 'test_80',
'column_35': 'test_52',
'column_350': 'test_81',
'column_351': 'test_8',
'column_352': 'test_52',
'column_353': 'test_73',
'column_354': 'test_23',
'column_355': 'test_10',
'column_356': 'test_96',
'column_357': 'test_25',
'column_358': 'test_33',
'column_359': 'test_5',
'column_36': 'test_1',
'column_360': 'test_2',
'column_361': 'test_67',
'column_362': 'test_30',
'column_363': 'test_23',
'column_364': 'test_92',
'column_365': 'test_12',
'column_366': 'test_5',
'column_367': 'test_43',
'column_368': 'test_94',
'column_369': 'test_21',
'column_37': 'test_92',
'column_370': 'test_29',
'column_371': 'test_92',
'column_372': 'test_11',
'column_373': 'test_50',
'column_374': 'test_90',
'column_375': 'test_3',
'column_376': 'test_45',
'column_377': 'test_78',
'column_378': 'test_58',
'column_379': 'test_47',
'column_38': 'test_70',
'column_380': 'test_8',
'column_381': 'test_94',
'column_382': 'test_74',
'column_383': 'test_98',
'column_384': 'test_7',
'column_385': 'test_49',
'column_386': 'test_47',
'column_387': 'test_38',
'column_388': 'test_16',
'column_389': 'test_70',
'column_39': 'test_50',
'column_390': 'test_28',
'column_391': 'test_18',
'column_392': 'test_70',
'column_393': 'test_72',
'column_394': 'test_59',
'column_395': 'test_1',
'column_396': 'test_56',
'column_397': 'test_1',
'column_398': 'test_11',
'column_399': 'test_28',
'column_4': 'test_33',
'column_40': 'test_47',
'column_400': 'test_72',
'column_401': 'test_8',
'column_402': 'test_7',
'column_403': 'test_52',
'column_404': 'test_13',
'column_405': 'test_61',
'column_406': 'test_58',
'column_407': 'test_42',
'column_408': 'test_82',
'column_409': 'test_58',
'column_41': 'test_48',
'column_410': 'test_51',
'column_411': 'test_46',
'column_412': 'test_81',
'column_413': 'test_49',
'column_414': 'test_83',
'column_415': 'test_88',
'column_416': 'test_16',
'column_417': 'test_76',
'column_418': 'test_42',
'column_419': 'test_30',
'column_42': 'test_56',
'column_420': 'test_11',
'column_421': 'test_71',
'column_422': 'test_47',
'column_423': 'test_28',
'column_424': 'test_95',
'column_425': 'test_54',
'column_426': 'test_47',
'column_427': 'test_55',
'column_428': 'test_11',
'column_429': 'test_91',
'column_43': 'test_18',
'column_430': 'test_67',
'column_431': 'test_86',
'column_432': 'test_35',
'column_433': 'test_90',
'column_434': 'test_88',
'column_435': 'test_61',
'column_436': 'test_18',
'column_437': 'test_100',
'column_438': 'test_22',
'column_439': 'test_25',
'column_44': 'test_36',
'column_440': 'test_6',
'column_441': 'test_35',
'column_442': 'test_41',
'column_443': 'test_91',
'column_444': 'test_47',
'column_445': 'test_25',
'column_446': 'test_87',
'column_447': 'test_22',
'column_448': 'test_73',
'column_449': 'test_32',
'column_45': 'test_48',
'column_450': 'test_3',
'column_451': 'test_11',
'column_452': 'test_40',
'column_453': 'test_83',
'column_454': 'test_28',
'column_455': 'test_10',
'column_456': 'test_12',
'column_457': 'test_83',
'column_458': 'test_51',
'column_459': 'test_90',
'column_46': 'test_43',
'column_460': 'test_26',
'column_461': 'test_91',
'column_462': 'test_34',
'column_463': 'test_66',
'column_464': 'test_13',
'column_465': 'test_13',
'column_466': 'test_65',
'column_467': 'test_90',
'column_468': 'test_21',
'column_469': 'test_11',
'column_47': 'test_56',
'column_470': 'test_15',
'column_471': 'test_24',
'column_472': 'test_1',
'column_473': 'test_85',
'column_474': 'test_79',
'column_475': 'test_91',
'column_476': 'test_51',
'column_477': 'test_2',
'column_478': 'test_10',
'column_479': 'test_77',
'column_48': 'test_15',
'column_480': 'test_26',
'column_481': 'test_16',
'column_482': 'test_100',
'column_483': 'test_9',
'column_484': 'test_19',
'column_485': 'test_31',
'column_486': 'test_39',
'column_487': 'test_65',
'column_488': 'test_37',
'column_489': 'test_32',
'column_49': 'test_28',
'column_490': 'test_35',
'column_491': 'test_61',
'column_492': 'test_56',
'column_493': 'test_29',
'column_494': 'test_93',
'column_495': 'test_49',
'column_496': 'test_24',
'column_497': 'test_76',
'column_498': 'test_63',
'column_499': 'test_19',
'column_5': 'test_94',
'column_50': 'test_76',
'column_51': 'test_88',
'column_52': 'test_24',
'column_53': 'test_84',
'column_54': 'test_27',
'column_55': 'test_89',
'column_56': 'test_78',
'column_57': 'test_27',
'column_58': 'test_33',
'column_59': 'test_42',
'column_6': 'test_88',
'column_60': 'test_94',
'column_61': 'test_51',
'column_62': 'test_62',
'column_63': 'test_93',
'column_64': 'test_53',
'column_65': 'test_6',
'column_66': 'test_73',
'column_67': 'test_94',
'column_68': 'test_20',
'column_69': 'test_4',
'column_7': 'test_16',
'column_70': 'test_7',
'column_71': 'test_97',
'column_72': 'test_38',
'column_73': 'test_61',
'column_74': 'test_58',
'column_75': 'test_99',
'column_76': 'test_71',
'column_77': 'test_97',
'column_78': 'test_19',
'column_79': 'test_45',
'column_8': 'test_20',
'column_80': 'test_59',
'column_81': 'test_2',
'column_82': 'test_77',
'column_83': 'test_51',
'column_84': 'test_38',
'column_85': 'test_92',
'column_86': 'test_84',
'column_87': 'test_69',
'column_88': 'test_83',
'column_89': 'test_95',
'column_9': 'test_69',
'column_90': 'test_36',
'column_91': 'test_60',
'column_92': 'test_59',
'column_93': 'test_28',
'column_94': 'test_37',
'column_95': 'test_54',
'column_96': 'test_86',
'column_97': 'test_100',
'column_98': 'test_29',
'column_99': 'test_81',
'float_field': 12,
'int_field': 9,
'list_field': [1, 2, 3],
'name': 'item_66',
'nested_value': {'name': 'item_6', 'value': 0}}
Next, we import the data into our database.
[4]:
db_path = "ParquetDB"
# Start from a clean database directory
if os.path.exists(db_path):
    shutil.rmtree(db_path)
db = ParquetDB(db_path=db_path)
db.create(data)
print(db)
============================================================
PARQUETDB SUMMARY
============================================================
Database path: ParquetDB
• Number of columns: 507
• Number of rows: 100000
• Number of files: 1
• Number of rows per file: [100000]
• Number of row groups per file: [4]
• Serialized metadata size per file: [225147] Bytes
############################################################
METADATA
############################################################
############################################################
COLUMN DETAILS
############################################################
• Columns:
- column_296
- column_302
- column_370
- column_0
- column_80
- column_367
- column_344
- column_216
- column_151
- column_61
- column_439
- column_87
- column_379
- column_351
- column_329
- column_423
- column_97
- column_42
- column_401
- column_355
- column_171
- column_261
- column_125
- column_270
- column_398
- column_17
- column_18
- column_479
- column_186
- column_471
- column_274
- column_236
- column_421
- column_394
- column_369
- column_397
- column_417
- column_432
- column_258
- column_120
- column_178
- column_415
- column_126
- column_477
- column_124
- column_105
- column_360
- column_374
- column_99
- column_314
- column_380
- column_154
- column_320
- name
- column_459
- column_78
- column_15
- column_291
- column_189
- column_229
- column_412
- column_276
- column_96
- column_123
- column_495
- column_358
- column_354
- column_169
- column_156
- column_452
- column_483
- column_170
- column_163
- column_119
- column_180
- column_463
- column_176
- column_349
- column_106
- column_316
- column_442
- column_338
- column_206
- float_field
- column_68
- column_444
- column_222
- column_16
- column_324
- column_58
- list_field
- column_364
- column_226
- column_366
- column_82
- column_323
- column_95
- column_248
- column_437
- column_204
- column_275
- column_160
- column_307
- column_37
- column_116
- column_333
- column_143
- column_436
- column_46
- column_148
- column_376
- column_164
- column_277
- column_278
- column_357
- column_365
- column_212
- column_361
- column_499
- column_85
- column_217
- column_402
- column_157
- column_272
- column_285
- column_426
- column_57
- column_400
- column_240
- nested_value.value
- column_127
- column_89
- column_468
- id
- column_118
- column_469
- column_418
- column_490
- column_150
- column_430
- column_152
- column_455
- column_475
- column_172
- column_146
- column_8
- column_359
- column_420
- column_408
- column_84
- column_210
- column_309
- column_482
- column_102
- column_195
- column_223
- column_194
- column_200
- column_29
- column_158
- column_336
- column_207
- column_414
- column_22
- column_228
- column_407
- column_454
- column_234
- column_322
- column_64
- column_45
- column_310
- column_147
- column_348
- column_38
- column_115
- column_269
- column_167
- column_179
- column_447
- column_330
- column_419
- column_32
- column_94
- column_187
- column_203
- column_14
- column_337
- column_498
- column_98
- column_67
- column_263
- column_235
- column_249
- column_292
- column_63
- column_27
- column_53
- column_26
- column_450
- column_243
- column_44
- column_306
- column_331
- column_315
- column_108
- column_428
- column_383
- column_453
- column_232
- column_460
- column_413
- column_441
- column_3
- column_390
- column_60
- column_117
- column_303
- column_215
- column_91
- column_434
- column_52
- column_145
- column_294
- column_256
- column_470
- column_75
- column_202
- column_451
- column_347
- column_262
- column_144
- column_31
- column_472
- column_213
- column_218
- column_363
- column_456
- column_155
- column_73
- column_484
- column_23
- column_404
- column_431
- column_133
- column_221
- column_440
- column_279
- column_438
- column_429
- column_327
- column_231
- column_350
- column_392
- column_49
- column_25
- column_230
- column_142
- column_284
- column_191
- column_445
- column_140
- column_405
- column_55
- column_93
- column_461
- column_254
- column_340
- column_136
- column_39
- column_138
- column_435
- column_252
- column_128
- column_237
- column_192
- column_188
- column_406
- column_488
- column_224
- column_209
- column_250
- column_121
- column_389
- column_403
- column_50
- column_70
- column_290
- column_198
- column_388
- column_199
- column_36
- column_334
- column_35
- column_377
- column_40
- column_395
- column_185
- column_386
- column_100
- column_111
- column_10
- column_114
- column_239
- column_113
- column_433
- column_473
- column_326
- column_465
- column_7
- column_72
- column_51
- column_2
- column_165
- column_448
- column_183
- column_257
- column_308
- column_384
- column_131
- column_266
- column_5
- column_282
- column_493
- column_104
- column_181
- column_30
- column_491
- column_287
- column_153
- column_242
- column_298
- column_149
- column_265
- column_139
- column_339
- column_385
- column_168
- column_372
- column_411
- column_166
- column_141
- column_11
- column_83
- column_4
- column_47
- column_424
- column_362
- column_214
- column_244
- column_368
- column_474
- column_54
- column_409
- column_174
- column_271
- column_9
- column_129
- column_335
- column_356
- column_299
- column_325
- column_79
- column_193
- column_253
- column_137
- column_328
- column_173
- column_76
- column_56
- column_134
- column_159
- column_24
- column_297
- column_283
- column_449
- column_288
- column_494
- column_422
- column_496
- column_6
- column_74
- column_268
- column_427
- column_382
- column_462
- column_311
- column_273
- column_289
- column_373
- column_247
- column_238
- column_211
- nested_value.name
- column_280
- column_196
- column_487
- column_458
- column_225
- column_295
- column_425
- column_62
- column_457
- column_313
- column_378
- column_109
- column_175
- column_13
- column_182
- column_342
- column_343
- column_12
- column_321
- column_201
- column_485
- column_197
- column_1
- column_177
- column_161
- column_259
- column_305
- column_341
- column_416
- column_130
- column_345
- column_381
- column_208
- column_135
- column_300
- column_66
- column_241
- column_264
- column_467
- column_317
- column_286
- column_86
- column_41
- column_122
- column_255
- column_92
- column_486
- column_69
- column_393
- column_107
- column_466
- column_28
- column_301
- column_220
- column_245
- column_219
- column_375
- column_480
- column_190
- column_112
- column_371
- column_481
- column_132
- column_101
- column_246
- column_332
- column_267
- column_65
- column_353
- column_464
- column_293
- column_48
- column_103
- column_205
- column_19
- column_446
- column_492
- column_43
- column_318
- column_443
- column_59
- int_field
- column_20
- column_281
- column_489
- column_312
- column_184
- column_34
- column_81
- column_77
- column_33
- column_71
- column_497
- column_88
- column_387
- column_319
- column_304
- column_399
- column_227
- column_410
- column_233
- column_260
- column_391
- column_352
- column_110
- column_476
- column_90
- column_21
- column_251
- column_346
- column_396
- column_478
- column_162
[5]:
data = None
df = db.read().to_pandas()
print(df)
column_296 column_302 column_370 column_0 column_80 column_367 \
0 test_33 test_30 test_29 test_74 test_59 test_43
1 test_62 test_97 test_10 test_29 test_59 test_47
2 test_6 test_65 test_28 test_29 test_6 test_81
3 test_35 test_61 test_57 test_61 test_71 test_92
4 test_56 test_44 test_70 test_18 test_75 test_77
... ... ... ... ... ... ...
99995 test_43 test_48 test_46 test_77 test_26 test_20
99996 test_6 test_45 test_36 test_100 test_47 test_58
99997 test_2 test_33 test_61 test_99 test_64 test_44
99998 test_96 test_15 test_26 test_58 test_35 test_77
99999 test_63 test_87 test_39 test_38 test_59 test_83
column_344 column_216 column_151 column_61 ... column_352 column_110 \
0 test_78 test_62 test_58 test_51 ... test_52 test_94
1 test_21 test_39 test_50 test_78 ... test_32 test_57
2 test_32 test_21 test_34 test_25 ... test_70 test_46
3 test_66 test_7 test_18 test_3 ... test_42 test_4
4 test_75 test_51 test_76 test_76 ... test_16 test_36
... ... ... ... ... ... ... ...
99995 test_71 test_11 test_43 test_73 ... test_89 test_27
99996 test_38 test_33 test_82 test_99 ... test_90 test_3
99997 test_26 test_29 test_19 test_74 ... test_70 test_50
99998 test_43 test_16 test_54 test_40 ... test_5 test_15
99999 test_82 test_53 test_46 test_71 ... test_97 test_74
column_476 column_90 column_21 column_251 column_346 column_396 \
0 test_51 test_36 test_66 test_77 test_55 test_56
1 test_23 test_36 test_48 test_77 test_86 test_81
2 test_66 test_63 test_71 test_82 test_89 test_16
3 test_26 test_71 test_30 test_56 test_17 test_32
4 test_6 test_66 test_14 test_43 test_23 test_16
... ... ... ... ... ... ...
99995 test_55 test_9 test_38 test_15 test_98 test_99
99996 test_67 test_9 test_61 test_13 test_39 test_25
99997 test_9 test_5 test_48 test_69 test_44 test_11
99998 test_10 test_92 test_5 test_31 test_44 test_84
99999 test_82 test_13 test_54 test_39 test_84 test_51
column_478 column_162
0 test_10 test_54
1 test_4 test_44
2 test_90 test_75
3 test_17 test_29
4 test_68 test_50
... ... ...
99995 test_28 test_58
99996 test_89 test_4
99997 test_95 test_74
99998 test_61 test_23
99999 test_7 test_53
[100000 rows x 507 columns]
Normalize Data Using ParquetDB
Next, we’ll introduce the NormalizeConfig class, which allows you to fine-tune how normalization is performed over the various operations in ParquetDB.
The NormalizeConfig Class
@dataclass
class NormalizeConfig:
    load_format: str = "table"
    batch_size: int = 131_072
    batch_readahead: int = 16
    fragment_readahead: int = 4
    fragment_scan_options: Optional[pa.dataset.FragmentScanOptions] = None
    use_threads: bool = True
    memory_pool: Optional[pa.MemoryPool] = None
    filesystem: Optional[fs.FileSystem] = None
    file_options: Optional[ds.FileWriteOptions] = None
    use_threads: bool = config.parquetdb_config.normalize_kwargs.use_threads
    max_partitions: int = config.parquetdb_config.normalize_kwargs.max_partitions
    max_open_files: int = config.parquetdb_config.normalize_kwargs.max_open_files
    max_rows_per_file: int = config.parquetdb_config.normalize_kwargs.max_rows_per_file
    min_rows_per_group: int = (
        config.parquetdb_config.normalize_kwargs.min_rows_per_group
    )
    max_rows_per_group: int = (
        config.parquetdb_config.normalize_kwargs.max_rows_per_group
    )
    file_visitor: Optional[Callable] = None
    existing_data_behavior: str = (
        config.parquetdb_config.normalize_kwargs.existing_data_behavior
    )
    create_dir: bool = True
The most important parameters of the NormalizeConfig data class are the following:
- load_format : str. The format of the output dataset. Supported formats are 'table' and 'batches' (default: 'table').
- batch_size : int, optional. The number of rows to process in each batch (default: 131_072).
- batch_readahead : int, optional. The number of batches to read ahead in a file (default: 16).
- fragment_readahead : int, optional. The number of files to read ahead, improving I/O utilization at the cost of RAM usage (default: 4).
- max_open_files : int. Maximum open files for dataset writing (default: 1024).
- max_rows_per_file : int. Maximum rows per file (default: 10,000).
- min_rows_per_group : int. Minimum rows per row group within each file (default: 0).
- max_rows_per_group : int. Maximum rows per row group within each file (default: 10,000).
Parquet files store data in row groups, which allows for batching and parallelization.
Below is a diagram of what your data will look like in memory.
CSV files use a row-based layout, which is inefficient because similar data is not stored contiguously in memory.
Columnar storage is more efficient because it stores similar data contiguously in memory. However, it is harder to batch because the data is not stored in chunks.
Parquet files use a row-group-based layout, which stores similar data contiguously and also stores it in chunks, making it well suited to parallelization.
Optimizing parameters such as the number of rows per row group, the number of row groups per file, and the number of files to read ahead can significantly improve speed and memory performance.
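The three layouts described above can be sketched with plain Python lists. This is only a toy model of value ordering, assuming a tiny made-up table; the function names are hypothetical, and the real Parquet format adds encoding, compression, and metadata on top.

```python
# A tiny made-up table with four rows and three columns.
rows = [
    {"id": 0, "name": "a", "value": 1.0},
    {"id": 1, "name": "b", "value": 2.0},
    {"id": 2, "name": "c", "value": 3.0},
    {"id": 3, "name": "d", "value": 4.0},
]


def row_layout(rows):
    """Row-based (CSV-like): all values of one record sit next to each other."""
    return [value for row in rows for value in row.values()]


def columnar_layout(rows):
    """Columnar: all values of one column sit next to each other."""
    columns = list(rows[0].keys())
    return [row[col] for col in columns for row in rows]


def row_group_layout(rows, rows_per_group):
    """Row groups: split rows into chunks, then store each chunk column by column."""
    return [
        columnar_layout(rows[start:start + rows_per_group])
        for start in range(0, len(rows), rows_per_group)
    ]


print(row_layout(rows))
# [0, 'a', 1.0, 1, 'b', 2.0, 2, 'c', 3.0, 3, 'd', 4.0]
print(columnar_layout(rows))
# [0, 1, 2, 3, 'a', 'b', 'c', 'd', 1.0, 2.0, 3.0, 4.0]
print(row_group_layout(rows, rows_per_group=2))
# [[0, 1, 'a', 'b', 1.0, 2.0], [2, 3, 'c', 'd', 3.0, 4.0]]
```

Each inner list in the last result is an independent chunk, which is what lets row groups be read or processed in parallel.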
Let’s look at the details of the row groups of our current dataset. We can do this by calling the summary method with show_row_group_metadata=True.
[6]:
print(db.summary(show_row_group_metadata=True))
============================================================
PARQUETDB SUMMARY
============================================================
Database path: ParquetDB
• Number of columns: 507
• Number of rows: 100000
• Number of files: 1
• Number of rows per file: [100000]
• Number of row groups per file: [4]
• Number of rows per row group per file:
- ParquetDB_0.parquet:
- Row group 0: 32768 rows
- Row group 1: 32768 rows
- Row group 2: 32768 rows
- Row group 3: 1696 rows
• Serialized metadata size per file: [225147] Bytes
############################################################
METADATA
############################################################
############################################################
COLUMN DETAILS
############################################################
Here we can see that our first file has 4 row groups, with a maximum of 32,768 rows per group. Typically, this is fine, but if your system can handle it, it is best to chunk the data into larger groups.
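The row-group sizes in the summary above are consistent with simple greedy chunking: fill each group up to the cap and put the leftover rows in the last group. The sketch below reproduces that arithmetic in plain Python. The function name is hypothetical, and this is an assumption about the observed split, not the actual writer logic.

```python
def split_into_row_groups(num_rows, max_rows_per_group):
    """Greedily fill row groups with up to max_rows_per_group rows each.

    Illustrative arithmetic only; the real Parquet writer may behave
    differently in other situations.
    """
    if max_rows_per_group <= 0:
        raise ValueError("max_rows_per_group must be positive")
    sizes = []
    remaining = num_rows
    while remaining > 0:
        size = min(remaining, max_rows_per_group)
        sizes.append(size)
        remaining -= size
    return sizes


print(split_into_row_groups(100_000, 32_768))
# [32768, 32768, 32768, 1696]
```

Under the same assumption, a larger cap produces fewer and more even groups: a cap of 50,000 rows would give two groups of 50,000 rows for this dataset.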
A good rule of thumb for these settings is about 2 GB per file and about 200 MB per row group. Finding the best settings for your system will require some trial and error.
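As a rough starting point for that trial and error, the size targets can be converted into row counts. The sketch below assumes a hypothetical average of 4,000 bytes per row and decimal units (1 MB = 10**6 bytes). The helper name is made up, and real on-disk sizes depend on encoding and compression, so treat the result as an estimate only.

```python
def rows_for_target_size(target_bytes, avg_row_bytes):
    """Estimate how many rows fit in target_bytes at avg_row_bytes per row."""
    if avg_row_bytes <= 0:
        raise ValueError("avg_row_bytes must be positive")
    return max(1, target_bytes // avg_row_bytes)


avg_row_bytes = 4_000  # hypothetical; measure this for your own data

rows_per_group = rows_for_target_size(200 * 10**6, avg_row_bytes)  # ~200 MB
rows_per_file = rows_for_target_size(2 * 10**9, avg_row_bytes)     # ~2 GB

print(rows_per_group)  # 50000
print(rows_per_file)   # 500000
print(rows_per_file // rows_per_group)  # 10 row groups per file
```

Estimates like these could then serve as first guesses for max_rows_per_group and max_rows_per_file, to be refined by measuring actual file sizes.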
Let’s normalize the data with NormalizeConfig and change it so that there are 50,000 rows per row group.
[25]:
from parquetdb import NormalizeConfig
normalize_config = NormalizeConfig(min_rows_per_group=50000, max_rows_per_group=50000)
db.normalize(normalize_config=normalize_config)
print(db.summary(show_row_group_metadata=True))
============================================================
PARQUETDB SUMMARY
============================================================
Database path: ParquetDB
• Number of columns: 507
• Number of rows: 100000
• Number of files: 1
• Number of rows per file: [100000]
• Number of row groups per file: [2]
• Number of rows per row group per file:
- ParquetDB_0.parquet:
- Row group 0: 50000 rows
- Row group 1: 50000 rows
• Serialized metadata size per file: [136357] Bytes
############################################################
METADATA
############################################################
############################################################
COLUMN DETAILS
############################################################
Now we have two row groups, each with a maximum of 50,000 rows.
In some cases, however, this might be too large—especially if you’re working with a wide dataset (e.g., ~4,000 columns). To handle such cases, we can reduce the number of rows per group to 10,000 for better performance.
For particularly large datasets, it’s also important to fine-tune additional parameters, such as batch_readahead, fragment_readahead, load_format="batches", and batch_size.
By default, ParquetDB uses load_format="table", meaning it will attempt to read and write all the data at once. While this approach works well for smaller datasets, it can cause performance bottlenecks when handling larger datasets.
To address this, we can set load_format="batches" and define a batch size, for example batch_size=5000. This configuration ensures that data is processed in chunks of 5,000 rows at a time, improving memory management. Additionally, setting batch_readahead=2 allows ParquetDB to load two batches into memory ahead of processing, further enhancing performance by reducing waiting times.
When reading data, ParquetDB processes files sequentially. To optimize this process, we can control how many files are opened and read ahead by setting fragment_readahead=2. This ensures that the system reads two files ahead, balancing I/O performance and memory usage.
Note: The batch size can only go as high as the number of rows in a row group.
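The interaction between batch size, row groups, and readahead can be sketched in plain Python. This is a simplified model under the assumption stated in the note above (a batch never spans two row groups). The function names are hypothetical, and the actual scanning is performed by the underlying Parquet reader, not by code like this.

```python
from collections import deque
from itertools import islice


def iter_batches(row_group_sizes, batch_size):
    """Yield (row_group_index, num_rows) batches that never span row groups."""
    for index, group_rows in enumerate(row_group_sizes):
        for start in range(0, group_rows, batch_size):
            yield index, min(batch_size, group_rows - start)


def with_readahead(batches, readahead):
    """Keep up to `readahead` batches buffered ahead of the consumer."""
    iterator = iter(batches)
    buffer = deque(islice(iterator, max(1, readahead)))
    while buffer:
        yield buffer.popleft()
        buffer.extend(islice(iterator, 1))  # refill the buffer by one batch


row_groups = [10_000, 10_000]  # two row groups of 10,000 rows each

print(list(with_readahead(iter_batches(row_groups, batch_size=5_000), readahead=2)))
# [(0, 5000), (0, 5000), (1, 5000), (1, 5000)]

# A batch size larger than a row group is capped at the row group size.
print(list(iter_batches(row_groups, batch_size=50_000)))
# [(0, 10000), (1, 10000)]
```

A larger readahead keeps more batches in memory at once, which trades RAM for fewer stalls while waiting on I/O.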
[26]:
normalize_config = NormalizeConfig(
    load_format="batches",
    batch_size=100,
    batch_readahead=16,
    fragment_readahead=4,
    max_rows_per_group=10000,
    min_rows_per_group=10000,
)
db.normalize(normalize_config=normalize_config)
print(db.summary(show_row_group_metadata=True))
============================================================
PARQUETDB SUMMARY
============================================================
Database path: ParquetDB
• Number of columns: 507
• Number of rows: 100000
• Number of files: 1
• Number of rows per file: [100000]
• Number of row groups per file: [10]
• Number of rows per row group per file:
- ParquetDB_0.parquet:
- Row group 0: 10000 rows
- Row group 1: 10000 rows
- Row group 2: 10000 rows
- Row group 3: 10000 rows
- Row group 4: 10000 rows
- Row group 5: 10000 rows
- Row group 6: 10000 rows
- Row group 7: 10000 rows
- Row group 8: 10000 rows
- Row group 9: 10000 rows
• Serialized metadata size per file: [497366] Bytes
############################################################
METADATA
############################################################
############################################################
COLUMN DETAILS
############################################################
Finishing
Now that we have normalized the data, we can see that it is more evenly distributed across the row groups. Many ParquetDB methods, such as read, update, delete, transform, and update_schema, accept a normalize_config argument, which allows you to fine-tune the normalization process during these operations.