Posted to commits@hudi.apache.org by "TengHuo (via GitHub)" <gi...@apache.org> on 2023/03/06 09:44:37 UTC

[GitHub] [hudi] TengHuo commented on pull request #6802: [HUDI-4924] Auto-tune dedup parallelism

TengHuo commented on PR #6802:
URL: https://github.com/apache/hudi/pull/6802#issuecomment-1455802492

   Hi @yihua 
   
   We found an issue in our DeltaStreamer pipeline recently. Our Kafka-to-Hudi DeltaStreamer pipeline started running slower after we upgraded from 0.10 to 0.12. After investigating, we found that the slowdown was caused by this slow stage: `Building workload profile`.
   
   <img width="1782" alt="slow_build_workload_profile" src="https://user-images.githubusercontent.com/7539060/223072527-ea56fae5-d2d3-4843-a3f9-8db008c8f7bb.png">
   
   The parallelism of this stage was 10, which comes from this line: https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/HoodieWriteHelper.java#L64
   
   So, even if we set the config `hoodie.upsert.shuffle.parallelism` to 1000, it is ignored: the dedup stage instead uses the parallelism of the input records, which equals the number of Kafka topic partitions.
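   For illustration only, here is a minimal sketch of the auto-tuning idea suggested by the PR title (the method name and signature are hypothetical, not Hudi's actual API): rather than blindly reusing the input RDD's partition count for the dedup shuffle, take the larger of that count and the configured `hoodie.upsert.shuffle.parallelism`:

   ```java
   // Hypothetical sketch, not the actual Hudi code.
   public class DedupParallelism {

       // inputPartitions: partition count of the incoming records RDD
       // (for a Kafka source, the number of topic partitions).
       // configuredParallelism: value of hoodie.upsert.shuffle.parallelism.
       static int deduceDedupParallelism(int inputPartitions, int configuredParallelism) {
           // Take the larger of the two, so a small Kafka topic (e.g. 10
           // partitions) no longer caps the "Building workload profile" stage.
           return Math.max(inputPartitions, configuredParallelism);
       }

       public static void main(String[] args) {
           // With 10 Kafka partitions and parallelism configured to 1000,
           // the dedup stage would shuffle with 1000 tasks instead of 10.
           System.out.println(deduceDedupParallelism(10, 1000));
       }
   }
   ```

   The max (rather than always taking the configured value) keeps the existing behavior for wide inputs, where the input already has more partitions than the configured parallelism.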
   
   May I ask if there is any way we can improve this? Thanks
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org