Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/08/24 03:38:40 UTC

[GitHub] [hudi] nsivabalan edited a comment on issue #3400: [SUPPORT] CoW table data size increasing x times the original data size for x number of runs

nsivabalan edited a comment on issue #3400:
URL: https://github.com/apache/hudi/issues/3400#issuecomment-904295065


   OK. Let me walk through how COW (copy-on-write) works; it should help explain the increase in data size.
   
   Let's say that on the first ingest, these are the data files created within Hudi. For simplicity, let's assume just 1 partition.
   
   df1_v1: 1 GB
   df2_v1: 1 GB
   df3_v1: 1 GB
   
   3 data files are created at version 1, and each is 1 GB in size.
   
   Now, let's say we do an incremental load (inserts + updates). The updated records belong only to df1 and df3.
   With COW, Hudi creates a newer version of each data file touched. So the final state would be:
   // old data file versions.
   df1_v1
   df2_v1
   df3_v1
   // newer version of existing data files
   df1_v2
   df3_v2
   // new data files for new inserts
   df4_v1
   
   Now total size = df1_v1 + df2_v1 + df3_v1 + df1_v2 + df3_v2 + df4_v1.
   So the total disk size occupied by Hudi depends on how updates are spread across data files. For instance, if a commit updates all data files (df1, df2, df3), it writes a full new copy of each one, so the first such commit roughly doubles the size on disk, and each later one adds roughly another full copy until older versions are cleaned. But if you have 1000s of data files and only some of them get updated, the growth will be smaller. And depending on whether you have small file handling enabled or not, inserts could get bin-packed into existing data files or routed to new data files.
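
   The arithmetic above can be sketched in a few lines of Python. This is only a toy model of the example, not Hudi code: the names apply_commit and total_size are made up for illustration, and it assumes every new file version (and df4) is 1 GB, which the example does not actually state.

```python
def apply_commit(table, updated, inserted, size_gb=1):
    """Return the table state after one copy-on-write commit.

    `table` maps a data file name (e.g. "df1") to the list of sizes of its
    versions on disk. Each updated file gets a full new version; the older
    versions stay on disk until a cleaner removes them.
    Assumption: every version written is `size_gb` GB.
    """
    new_table = {name: list(versions) for name, versions in table.items()}
    for name in updated:
        new_table[name].append(size_gb)   # whole file rewritten as a new version
    for name in inserted:
        new_table[name] = [size_gb]       # new data file at version 1
    return new_table


def total_size(table):
    """Sum the sizes of all versions of all data files (in GB)."""
    return sum(sum(versions) for versions in table.values())


# First ingest: df1, df2, df3 at version 1.
first = apply_commit({}, updated=[], inserted=["df1", "df2", "df3"])
# Incremental load: updates touch df1 and df3, new inserts land in df4.
second = apply_commit(first, updated=["df1", "df3"], inserted=["df4"])

print(total_size(first))   # 3
print(total_size(second))  # 6
```

   Under those assumptions the table holds 4 GB of live data after the second commit but occupies 6 GB on disk, because df1_v1 and df3_v1 are still there.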
   Also, you have set max commits retained = 5, so older file versions may start to get deleted only after 5 commits.
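
   To see why the size should level off rather than grow forever, here is a rough simulation. It is a deliberately simplified model and not Hudi's actual cleaner logic: it assumes every commit rewrites every data file, each version is 1 GB, and only the `retained` most recent versions of each file are kept. The function name simulate and its parameters are hypothetical.

```python
def simulate(num_commits, num_files=3, size_gb=1, retained=5):
    """Return the total size on disk (GB) after each commit.

    Simplified model (an assumption, not Hudi's real cleaning policy):
    every commit rewrites every file, and only the `retained` most recent
    versions of each file survive cleaning.
    """
    versions_per_file = 0
    sizes = []
    for _ in range(num_commits):
        versions_per_file += 1                                # COW writes a new version
        versions_per_file = min(versions_per_file, retained)  # oldest version is cleaned
        sizes.append(num_files * versions_per_file * size_gb)
    return sizes


print(simulate(8))  # [3, 6, 9, 12, 15, 15, 15, 15]
```

   In this model the footprint grows by one full copy per commit for the first 5 commits and then stays flat, which matches the pattern of "x times the size for x runs" while the number of runs is still small.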
   
   As for the difference between insert_overwrite and bulk_insert with save mode Overwrite: with insert_overwrite, Hudi cleans up the invalidated files asynchronously. In other words, the files are not deleted synchronously.
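
   For reference, the two write paths being compared could be configured roughly as below. This is a sketch under assumptions: the helper hudi_write_options and the table name "my_table" are made up, and a real write also needs record key, partition path, and similar settings, which are omitted here. The Spark calls are shown only as comments because no SparkSession or DataFrame is created in this snippet.

```python
def hudi_write_options(table_name, operation):
    """Build a minimal option dict for a Hudi Spark datasource write.

    Hypothetical helper for illustration; only the operation differs
    between the two cases discussed above.
    """
    allowed = {"insert_overwrite", "bulk_insert"}
    if operation not in allowed:
        raise ValueError(f"unsupported operation for this sketch: {operation}")
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": operation,
    }


insert_overwrite_opts = hudi_write_options("my_table", "insert_overwrite")
bulk_insert_opts = hudi_write_options("my_table", "bulk_insert")

# With a DataFrame `df` and a table path `base_path` (both assumed, not
# defined here), the two writes would look roughly like:
#   df.write.format("hudi").options(**insert_overwrite_opts).mode("append").save(base_path)
#   df.write.format("hudi").options(**bulk_insert_opts).mode("overwrite").save(base_path)
```

   With the first form, the replaced files stay on disk until cleaning catches up, so the directory size can look inflated right after the write.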


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org