You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/19 01:30:55 UTC

[GitHub] [spark] sigmod commented on a change in pull request #35574: [SPARK-38237][SQL][SS] Allow operator with `ClusteredDistribution` to require full clustering keys

sigmod commented on a change in pull request #35574:
URL: https://github.com/apache/spark/pull/35574#discussion_r810427748



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala
##########
@@ -56,7 +57,23 @@ case class EnsureRequirements(
     // Ensure that the operator's children satisfy their output distribution requirements.
     var children = originalChildren.zip(requiredChildDistributions).map {
       case (child, distribution) if child.outputPartitioning.satisfies(distribution) =>
-        child
+        (child.outputPartitioning, distribution) match {
+          case (p: HashPartitioning, d: ClusteredDistribution) =>
+            if (conf.getConf(SQLConf.REQUIRE_ALL_CLUSTER_KEYS_FOR_SOLE_PARTITION) &&
+              requiredChildDistributions.size == 1 && !p.isPartitionedOnFullKeys(d)) {
+              // Add an extra shuffle for `ClusteredDistribution` even though its child

Review comment:
       The `if` branch actually means the requirement is not satisfied.
   
   Either adding back `HashClusteredPartition` or adding a boolean into `ClusteredPartition` seems neater for me, because there're other existing call sites of `satisfies` and might be more future call sites such that we don't want to replicate special the logic there, e.g.,
   https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ValidateRequirements.scala#L34-L50
   
   I'd prefer to adding `HashClusteredDistribution` back as it's conceptually a different requirement from `ClusteredDistribution`. Dynamic dispatching is more preferable than if-else for polymorphism, in general.
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org