You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@hudi.apache.org by "alexeykudinkin (via GitHub)" <gi...@apache.org> on 2023/01/21 03:10:24 UTC

[GitHub] [hudi] alexeykudinkin commented on pull request #7723: [HUDI-5363] Removing default value for shuffle parallelism configs

alexeykudinkin commented on PR #7723:
URL: https://github.com/apache/hudi/pull/7723#issuecomment-1399162347

> 1: I am not sure if ParallelismHelper is really required as base class. we can simplify it as a static method which takes in a config property and numPartitions and returns deduced value.

Unfortunately, we have to abstract it in a way that would make it work w/ Java/Flink/Spark.

> 2: while I agree w/ the intent of the patch itself, we might need to think through if this will work for all cases and whether we are confident about switching this few days before code freeze. for eg, would this play a part in bloom indexes. 200 has been working w/o any issues even for larger batch for the most part. will think about the change and will post an update if I can think of scenarios that might be of interest to us.

The only structural change compared to current state is that we're not going to be overriding parallelism w/ default value of 200. If user specifies the config, it will still take precedence.

I was able to confirm in multiple benchmarks that avoiding setting parallelism w/ random value (200) brings considerable performance benefits:

1. In case of bulk-insert: in that case we will follow natural partitioning of the dataset (ie we will have as many partitions as there are Parquet row-groups)
2. In case of upsert/insert: in this case we might fallback to `spark.default.parallelism` which is deduce dynamically based on the # of cores available to the cluster which also seems superior to the existing state.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org