Posted to commits@hudi.apache.org by "yihua (via GitHub)" <gi...@apache.org> on 2023/01/23 18:33:20 UTC

[GitHub] [hudi] yihua commented on pull request #7723: [HUDI-5363] Removing default value for shuffle parallelism configs

yihua commented on PR #7723:
URL: https://github.com/apache/hudi/pull/7723#issuecomment-1400797484

   > The only structural change compared to the current state is that we are no longer overriding parallelism with the default value of 200. If the user specifies the config, it will still take precedence.
   > 
   > I was able to confirm in multiple benchmarks that avoiding setting parallelism to an arbitrary value (200) brings considerable performance benefits:
   > 
   > 1. In the case of bulk-insert: we follow the natural partitioning of the dataset (i.e., we end up with as many partitions as there are Parquet row-groups).
   > 2. In the case of upsert/insert: we might fall back to `spark.default.parallelism`, which is deduced dynamically based on the number of cores available to the cluster and also seems superior to the existing behavior.
   
   @alexeykudinkin These are good scenarios to validate.  Could you also attach screenshots of the Spark UI here?
   
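   For context on the two scenarios quoted above, here is a minimal sketch (not part of the PR) of a Hudi Spark datasource write that pins the shuffle-parallelism configs explicitly; an explicitly set value still takes precedence, while omitting it lets bulk_insert follow the input's natural partitioning and upsert/insert fall back to `spark.default.parallelism`. The table name, record/precombine keys, and paths are illustrative placeholders.
   
   ```scala
   import org.apache.spark.sql.{SaveMode, SparkSession}
   
   val spark = SparkSession.builder()
     .appName("hudi-shuffle-parallelism-sketch")
     .getOrCreate()
   
   // Hypothetical input dataset; its Parquet row-group layout is what
   // drives the natural partitioning used by bulk_insert.
   val df = spark.read.parquet("/tmp/source_data")
   
   df.write.format("hudi")
     .option("hoodie.table.name", "my_table")                   // illustrative table name
     .option("hoodie.datasource.write.recordkey.field", "id")   // illustrative record key
     .option("hoodie.datasource.write.precombine.field", "ts")  // illustrative precombine field
     .option("hoodie.datasource.write.operation", "upsert")
     // Explicitly set shuffle parallelism still takes precedence over the dynamic fallback:
     .option("hoodie.upsert.shuffle.parallelism", "400")
     .option("hoodie.insert.shuffle.parallelism", "400")
     .mode(SaveMode.Append)
     .save("/tmp/hudi/my_table")                                // illustrative base path
   ```
   
   Dropping the two `*.shuffle.parallelism` options from the sketch would exercise the new behavior described in the PR, where Spark's dynamically deduced parallelism is used instead of a hard-coded 200.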


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org