Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/11/21 19:50:44 UTC

[GitHub] [hudi] Limess edited a comment on issue #3933: [SUPPORT] Large amount of disk spill on initial upsert/bulk insert

Limess edited a comment on issue #3933:
URL: https://github.com/apache/hudi/issues/3933#issuecomment-974883342


   Thanks!
   
   We're using bulk insert for this job and are happy with its performance compared to a regular upsert.
   
   Re: parallelism, we bumped this up after:
   1. Reading https://cwiki.apache.org/confluence/display/HUDI/Tuning+Guide (I guess this is now out of date; see the sketch after this list):
   	> We're setting parallelism based on the Tuning Guide, which states to set it such that it's at least input_data_size/500MB.
   2. Observing the disk spill: we found that increasing parallelism reduced it.
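   
   For concreteness, here's a minimal PySpark sketch of that sizing rule. It's untested; the paths, table name, and record-key/precombine fields are hypothetical, the `hoodie.*` option keys are real Hudi configs, and the FileSystem call through the JVM gateway is just one way to measure input size.
   
   ```python
   # Derive bulk-insert parallelism from input size, following the Tuning
   # Guide's "at least input_data_size / 500MB" rule of thumb.
   input_path = "s3://bucket/input/"        # hypothetical input location
   base_path = "s3://bucket/hudi/my_table"  # hypothetical Hudi table path
   
   jvm = spark._jvm
   fs = jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
   input_size_bytes = fs.getContentSummary(jvm.org.apache.hadoop.fs.Path(input_path)).getLength()
   
   parallelism = max(1, input_size_bytes // (500 * 1024 * 1024))
   
   (df.write.format("hudi")
      .option("hoodie.table.name", "my_table")                        # hypothetical
      .option("hoodie.datasource.write.operation", "bulk_insert")
      .option("hoodie.datasource.write.recordkey.field", "id")        # hypothetical
      .option("hoodie.datasource.write.precombine.field", "ts")       # hypothetical
      .option("hoodie.bulkinsert.shuffle.parallelism", str(parallelism))
      .mode("append")
      .save(base_path))
   ```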
   
   Smaller parquet files don't matter too much. If clustering can later fix the small-file/sorting problems, that sounds like a good thing to look at down the line (we haven't investigated clustering at all yet).
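   
   If we do go that route, I'd expect it to look roughly like the sketch below (untested; the max-commits, file-size, and sort-column values are placeholder guesses, though the option keys are Hudi's inline clustering configs):
   
   ```python
   # Inline clustering: after every N commits, rewrite small files into
   # larger ones, optionally sorting records while doing so.
   clustering_opts = {
       "hoodie.clustering.inline": "true",
       "hoodie.clustering.inline.max.commits": "4",  # placeholder
       # files under ~300MB are clustering candidates; placeholder value
       "hoodie.clustering.plan.strategy.small.file.limit": str(300 * 1024 * 1024),
       # aim for ~1GB output files; placeholder value
       "hoodie.clustering.plan.strategy.target.file.max.bytes": str(1024 * 1024 * 1024),
       "hoodie.clustering.plan.strategy.sort.columns": "event_ts",  # hypothetical sort column
   }
   
   (df.write.format("hudi")
      .options(**clustering_opts)
      .option("hoodie.table.name", "my_table")  # hypothetical
      .mode("append")
      .save(base_path))
   ```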


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org