Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/09/22 13:35:30 UTC

[GitHub] [hudi] Rap70r commented on issue #3697: [SUPPORT] Performance Tuning: How to speed up stages?

Rap70r commented on issue #3697:
URL: https://github.com/apache/hudi/issues/3697#issuecomment-924937292


   Hello @xushiyan, thank you for getting back to me.
   To clarify, the data size above (1714 MB, 1.4 million records) is the usual incremental data size we expect on each upsert cycle. The total size of the entire dataset on S3 for this particular Hudi collection is 6.2 GB, with approximately 60 million records.
   We used to have around 230 partitions, but the time taken by "UpsertPartitioner" increased significantly as each partition grew to over 100 MB. Given this data size, what partition count would you recommend?
   Also, would you recommend increasing the number of partitions to something like 5K while keeping the same instance type? Wouldn't that allow smaller instance types to process the small partitions faster? Or should we reduce the number of partitions and use a larger instance type?
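   As a rough back-of-the-envelope comparison of the two partition counts mentioned above, the sketch below computes the average data per partition. It assumes data is spread evenly across partitions, which real tables rarely match, and `avg_partition_mb` is a hypothetical helper for illustration, not part of Hudi.

```python
# Figures taken from this comment; an even spread across partitions is assumed.
TOTAL_DATASET_MB = 6.2 * 1024   # 6.2 GB total on S3
INCREMENT_MB = 1714             # incremental data per upsert cycle


def avg_partition_mb(total_mb: float, num_partitions: int) -> float:
    """Average size per partition in MB, assuming an even spread."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    return total_mb / num_partitions


# Compare the previous layout (230 partitions) with the proposed one (5K).
for n in (230, 5000):
    stored = avg_partition_mb(TOTAL_DATASET_MB, n)
    incoming = avg_partition_mb(INCREMENT_MB, n)
    print(f"{n} partitions: ~{stored:.2f} MB stored, ~{incoming:.2f} MB incoming each")
```

   The numbers are only indicative: skewed partitions can be far larger than the average.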
   On your second point, we use 25 Task instances of type c5.xlarge (4 vCores, 8 GiB memory each). With the configs above, we get around 20 executors. What instance type and size would you recommend for this data size? I was under the impression that C5 instances are generally recommended for this kind of work.
   On your third point, we set hoodie.upsert.shuffle.parallelism to 3000. Should we increase it?
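   To put the parallelism setting in perspective, here is a minimal sketch of how much incremental data each shuffle task would receive at 3000, again assuming an even spread. `per_task_load` is a hypothetical helper, and the options dictionary only illustrates how the setting might be passed to a writer; it is not a complete Hudi configuration.

```python
import math
from typing import Tuple

RECORDS_PER_CYCLE = 1_400_000   # incremental records per upsert cycle (from this comment)
INCREMENT_MB = 1714             # incremental data size per cycle (from this comment)


def per_task_load(parallelism: int) -> Tuple[int, float]:
    """Return (records, MB) handled by each shuffle task, assuming an even spread."""
    if parallelism <= 0:
        raise ValueError("parallelism must be positive")
    records = math.ceil(RECORDS_PER_CYCLE / parallelism)
    megabytes = INCREMENT_MB / parallelism
    return records, megabytes


# The value currently in use, expressed as a writer option (illustrative only).
hudi_options = {"hoodie.upsert.shuffle.parallelism": "3000"}

parallelism = int(hudi_options["hoodie.upsert.shuffle.parallelism"])
records, megabytes = per_task_load(parallelism)
print(f"parallelism={parallelism}: ~{records} records, ~{megabytes:.2f} MB per task")
```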
   Finally, is there a way to increase the number of files under each partition? Would that help?
   
   Thank you


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org