Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/21 23:51:30 UTC

[GitHub] [spark] HeartSaVioR edited a comment on pull request #35574: [SPARK-38237][SQL][SS] Allow `HashPartitioning` to satisfy `ClusteredDistribution` only with full clustering keys

HeartSaVioR edited a comment on pull request #35574:
URL: https://github.com/apache/spark/pull/35574#issuecomment-1047303366


   Thinking out loud, there may be other ways to solve this. One rough idea:
   
   The loose requirement of ClusteredDistribution aims to avoid shuffles as much as possible, even when the number of partitions is quite small. If we could inject a shuffle at runtime based on stats we could be fully adaptive, but I wouldn't expect that. Instead, having a threshold (minimum) on the number of partitions (set either heuristically or via config) doesn't sound crazy to me; see the sketch just below.
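   
   To make the idea concrete, here is a minimal, self-contained sketch. The types are simplified stand-ins for Spark's internal `HashPartitioning`/`ClusteredDistribution`, not the real API, and `minPartitionsToAvoidShuffle` is a hypothetical config that does not exist in Spark today:
   
   ```scala
   object MinPartitionThresholdSketch {
     // Hypothetical knob: below this partition count, don't trust the
     // existing partitioning and keep the shuffle.
     val minPartitionsToAvoidShuffle = 200
   
     case class ClusteredDistribution(clusteringKeys: Set[String])
   
     case class HashPartitioning(keys: Set[String], numPartitions: Int) {
       def satisfies(d: ClusteredDistribution): Boolean =
         keys.nonEmpty &&
           keys.subsetOf(d.clusteringKeys) &&           // today's loose (subset) rule
           numPartitions >= minPartitionsToAvoidShuffle // proposed extra gate
     }
   
     def main(args: Array[String]): Unit = {
       val dist = ClusteredDistribution(Set("a", "b"))
       // Subset of the keys but only 5 partitions: would now force a shuffle.
       println(HashPartitioning(Set("a"), 5).satisfies(dist))        // false
       // Enough partitions: shuffle is still avoided, as today.
       println(HashPartitioning(Set("a", "b"), 400).satisfies(dist)) // true
     }
   }
   ```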
   
   Rationale: ClusteredDistribution can require an exact number of partitions, but if I checked correctly, nothing uses that except AQE (and there it is only used for the physical shuffle node). We simply treat the current partitioning as ideal whenever it satisfies the distribution requirement, so adjusting the default number of shuffle partitions has no effect, because no shuffle is planned at all. Having a threshold (minimum) on the number of partitions would solve the majority of issues. It still doesn't cover the case where the child is partitioned on a subset of the grouping keys and happens to have plenty of partitions but is skewed (see the sketch below). That is a really unusual case, though, and it is probably OK to ask end users to handle it manually.
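   
   For illustration, a small runnable sketch of that remaining corner case (the dataset, column names, and partition counts are made up for the demo): every row shares one value of `a`, so hash-partitioning on `a` alone gives many partitions but all data in one of them, and a threshold on partition count wouldn't catch it. The manual fix is to repartition on the full grouping keys before aggregating:
   
   ```scala
   import org.apache.spark.sql.SparkSession
   import scala.util.Random
   
   object SkewedSubsetKeysDemo {
     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder()
         .master("local[*]").appName("skew-demo").getOrCreate()
       import spark.implicits._
   
       // 400 partitions keyed on "a" only, but a single hot key:
       // many partitions, all rows in one of them.
       val df = Seq.fill(100000)(("hot", Random.nextInt(10), 1))
         .toDF("a", "b", "v")
         .repartition(400, $"a")
   
       // Under the loose subset rule the aggregation could reuse the skewed
       // partitioning above; repartitioning on the full grouping keys
       // spreads the work before the aggregation runs.
       df.repartition(400, $"a", $"b")
         .groupBy($"a", $"b").sum("v")
         .show()
   
       spark.stop()
     }
   }
   ```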


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org