
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/17 17:39:06 UTC

[GitHub] [spark] viirya commented on pull request #35552: [SPARK-38237][SQL][SS] Rename back StatefulOpClusteredDistribution to HashClusteredDistribution

viirya commented on pull request #35552:
URL: https://github.com/apache/spark/pull/35552#issuecomment-1043228323


   > We figured out that HashClusteredDistribution is still desirable in some cases even without stateful operators; HashPartitioning with subset of grouping keys can satisfy ClusteredDistribution, which means the cardinality of the subset of grouping keys technically defines the max parallelism. Increasing the number of partitions does not always help to solve the skew of the partitions.
   
   I think this is understandable. It would be better if you could provide an example in the description. But I'm a bit confused about how it links to this renaming effort. Do you mean that, because `StatefulOpClusteredDistribution` is not only for stateful operations, you propose to rename it back? As it was removed and renamed before, is there any place that needs to use `HashClusteredDistribution` now?
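
To make the quoted point concrete, here is a minimal sketch in plain Python. It is not Spark code and does not use Spark's hash function. The column names `country` and `user_id`, the helper names, and the CRC32-based hash are all assumptions made for illustration. It counts how many partitions actually receive rows when data is hash-partitioned on a subset of the grouping keys versus on all of them.

```python
import zlib
from collections import Counter


def partition_id(key, num_partitions):
    """Illustrative stand-in for hash partitioning (NOT Spark's hash):
    a deterministic CRC32 of the key, modulo the partition count."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_partitions


def rows_per_partition(rows, key_columns, num_partitions):
    """Return a Counter mapping partition id -> number of rows placed there."""
    return Counter(
        partition_id(tuple(row[c] for c in key_columns), num_partitions)
        for row in rows
    )


# Hypothetical data: the grouping keys are (country, user_id),
# but country has only 3 distinct values.
rows = [
    {"country": country, "user_id": user_id}
    for country in ("US", "KR", "DE")
    for user_id in range(1000)
]

for num_partitions in (8, 64, 512):
    subset = rows_per_partition(rows, ["country"], num_partitions)
    full = rows_per_partition(rows, ["country", "user_id"], num_partitions)
    print(
        f"partitions={num_partitions} "
        f"non-empty(subset keys)={len(subset)} "
        f"non-empty(all keys)={len(full)} "
        f"largest(subset keys)={max(subset.values())}"
    )
```

Under these assumptions, partitioning on `country` alone can never fill more than three partitions, and the largest partition holds at least 1000 rows, however large the partition count is. That is the sense in which the cardinality of the subset bounds the parallelism.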


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


