Posted to commits@hudi.apache.org by "gudladona (via GitHub)" <gi...@apache.org> on 2023/03/17 23:12:06 UTC

[GitHub] [hudi] gudladona commented on issue #8199: [SUPPORT] OOM during a Sync/Async clean operation

gudladona commented on issue #8199:
URL: https://github.com/apache/hudi/issues/8199#issuecomment-1474500670

   We may have some indicators of what is causing this problem.
   
   We have a small file limit of 100 MB. This appears to work well (it produces larger files and cleans up smaller ones) for an average partition that meets the size requirements.
   
   However, for a very busy/high-volume partition, it seems to over-bucket the inserts into many files: based on the average record size and the size of the new inserts, the writer would always exceed the file size limit, causing it to spill into a new file group.
   
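   To make the over-bucketing concrete, here is a minimal sketch of the logic described above (assumed, simplified arithmetic — not Hudi's actual `UpsertPartitioner` code; all sizes and the helper name are illustrative):

   ```python
   # Hypothetical simplification of insert bucketing: fill existing small
   # files up to the max file size first, then open a NEW file group for
   # every remaining chunk of roughly one max-file-size worth of records.
   def bucket_inserts(total_insert_bytes, avg_record_size, max_file_bytes, small_file_sizes):
       records = total_insert_bytes // avg_record_size
       assigned = 0
       # Top up each small file until it would hit the max file size.
       for size in small_file_sizes:
           capacity = max(0, (max_file_bytes - size) // avg_record_size)
           assigned += min(capacity, records - assigned)
       remaining = records - assigned
       if remaining <= 0:
           return 0
       records_per_new_file = max_file_bytes // avg_record_size
       # Ceiling division: each leftover chunk becomes a new file group.
       return -(-remaining // records_per_new_file)

   # A busy partition: 25 GB of inserts, ~100-byte records, 120 MB target
   # files, and only five 50 MB small files to top up -> hundreds of new
   # file groups in a single commit, matching the counts below.
   print(bucket_inserts(25 * 1024**3, 100, 120 * 1024**2, [50 * 1024**2] * 5))
   ```

   The point of the sketch: once the incoming insert volume dwarfs the spare capacity in existing small files, the file-group count per commit grows linearly with the insert volume.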
   For example, here are the file-group counts per instant (commit) in this partition:
   
   ```
   aws s3 ls s3://<prefix>/<table>/<tenant>/date=20230316/ | awk -F _ '{print $3}' | sort | uniq -c | sort -nk1  | tail
    167 20230316203454183.parquet
    168 20230316195218670.parquet
    168 20230316201208079.parquet
    170 20230316200728433.parquet
    175 20230316210557345.parquet
    180 20230316130454342.parquet
    182 20230316212237421.parquet
    211 20230316192405566.parquet
    245 20230316210251305.parquet
    263 20230316204926437.parquet
   ```
   
   As we can see, the sheer number of small files in this partition is producing a huge JSON response from the driver, thereby triggering OOM errors.
   
   We need help figuring out how to tune this.
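
   As a starting point for that tuning, a hedged sketch of the writer options we might experiment with (the config keys are standard Hudi write configs; the values are assumptions to try, not recommendations):

   ```python
   # Candidate Hudi write options to reduce new-file-group fan-out.
   # Values here are illustrative starting points, not verified settings.
   hudi_tuning_options = {
       # Raise the ceiling so each file group absorbs more inserts before
       # the writer spills into a brand-new file group.
       "hoodie.parquet.max.file.size": str(256 * 1024 * 1024),      # 256 MB
       # Files below this size remain candidates for top-up on later commits.
       "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),   # current 100 MB
       # If the auto-derived average record size is off for this table,
       # pinning it changes how many records fit into each bucket.
       "hoodie.copyonwrite.record.size.estimate": "1024",
       # Retaining fewer commits shrinks the file-group metadata the
       # driver must hold when planning a clean.
       "hoodie.cleaner.commits.retained": "10",
   }
   # Applied on the Spark writer, e.g.:
   # df.write.format("hudi").options(**hudi_tuning_options).mode("append").save(path)
   ```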


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org