You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/07 23:09:32 UTC

[GitHub] [spark] HeartSaVioR commented on a change in pull request #35419: [SPARK-38124][SQL][SS] Introduce StatefulOpClusteredDistribution and apply to all stateful operators

HeartSaVioR commented on a change in pull request #35419:
URL: https://github.com/apache/spark/pull/35419#discussion_r801135672



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
##########
@@ -90,6 +90,34 @@ case class ClusteredDistribution(
   }
 }
 
+/**
+ * Represents the requirement of distribution on the stateful operator.
+ *
+ * Each partition in stateful operator initializes state store(s), which are independent with state
+ * store(s) in other partitions. Since it is not possible to repartition the data in state store,
+ * Spark should make sure the physical partitioning of the stateful operator is unchanged across
+ * Spark versions. Violation of this requirement may bring silent correctness issue.
+ *
+ * Since this distribution relies on [[HashPartitioning]] on the physical partitioning of the
+ * stateful operator, only [[HashPartitioning]] can satisfy this distribution.
+ */
+case class StatefulOpClusteredDistribution(

Review comment:
       > Do we also need to update HashShuffleSpec so that two HashPartitionings can be compatible with each other when checking against StatefulOpClusteredDistributions? this is the previous behavior where Spark would avoid shuffle if both sides of the streaming join are co-partitioned.
   
   Each input must follow the required distribution provided from stateful operator to respect the requirement of state partitioning. state partitioning is the first class, so even both sides of the streaming join are co-partitioned, Spark must perform shuffle if they don't match with state partitioning. (If that was the previous behavior, we broke something.)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org