You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/14 10:52:51 UTC

[GitHub] [spark] HeartSaVioR opened a new pull request #35512: [SPARK-38124][SS][FOLLOWUP] Document the current challenge on fixing distribution of stateful operator

HeartSaVioR opened a new pull request #35512:
URL: https://github.com/apache/spark/pull/35512

### What changes were proposed in this pull request?

This PR proposes to add the context of current challenge on fixing distribution of stateful operator, even the distribution is a sort of "broken" now.

This PR addresses the review comment https://github.com/apache/spark/pull/35419#discussion_r801343068

### Why are the changes needed?

In SPARK-38124 we figured out the existing long-standing problem in stateful operator, but it is not easy to fix since the fix may break the existing query if the fix is not carefully designed. Anyone should also be pretty much careful when touching the required distribution. We want to document this explicitly to help others to be careful whenever someone is around the codebase.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Code comment only changes.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sunchao commented on a change in pull request #35512: [SPARK-38124][SS][FOLLOWUP] Document the current challenge on fixing distribution of stateful operator

Posted by GitBox <gi...@apache.org>.

sunchao commented on a change in pull request #35512:
URL: https://github.com/apache/spark/pull/35512#discussion_r806123708



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
##########
@@ -101,6 +101,14 @@ case class ClusteredDistribution(
  * Since this distribution relies on [[HashPartitioning]] on the physical partitioning of the
  * stateful operator, only [[HashPartitioning]] (and HashPartitioning in
  * [[PartitioningCollection]]) can satisfy this distribution.
+ *
+ * NOTE: This is applied only stream-stream join as of now. For other stateful operators, we have

Review comment:
       "applied only to stream-stream join"?

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
##########
@@ -101,6 +101,14 @@ case class ClusteredDistribution(
  * Since this distribution relies on [[HashPartitioning]] on the physical partitioning of the
  * stateful operator, only [[HashPartitioning]] (and HashPartitioning in
  * [[PartitioningCollection]]) can satisfy this distribution.
+ *
+ * NOTE: This is applied only stream-stream join as of now. For other stateful operators, we have
+ * been using ClusteredDistribution, which could construct the physical partitioning of the state
+ * in different way. (ClusteredDistribution requires relaxed condition and multiple
+ * partitionings can satisfy the requirement.) We need to construct the way to fix this with
+ * minimizing possibility to break the existing checkpoints.
+ *
+ * TODO: SPARK-38204 to address above note.

Review comment:
       nit nit: I saw we usually use the pattern `TODO(SPARK-38204)` 

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
##########
@@ -101,6 +101,14 @@ case class ClusteredDistribution(
  * Since this distribution relies on [[HashPartitioning]] on the physical partitioning of the
  * stateful operator, only [[HashPartitioning]] (and HashPartitioning in
  * [[PartitioningCollection]]) can satisfy this distribution.
+ *
+ * NOTE: This is applied only stream-stream join as of now. For other stateful operators, we have
+ * been using ClusteredDistribution, which could construct the physical partitioning of the state
+ * in different way. (ClusteredDistribution requires relaxed condition and multiple

Review comment:
       maybe "in different way (ClusteredDistribution requires) ...": no dot after "way".




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] sunchao commented on a change in pull request #35512: [SPARK-38124][SS][FOLLOWUP] Document the current challenge on fixing distribution of stateful operator

Posted by GitBox <gi...@apache.org>.

sunchao commented on a change in pull request #35512:
URL: https://github.com/apache/spark/pull/35512#discussion_r806123708



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
##########
@@ -101,6 +101,14 @@ case class ClusteredDistribution(
  * Since this distribution relies on [[HashPartitioning]] on the physical partitioning of the
  * stateful operator, only [[HashPartitioning]] (and HashPartitioning in
  * [[PartitioningCollection]]) can satisfy this distribution.
+ *
+ * NOTE: This is applied only stream-stream join as of now. For other stateful operators, we have

Review comment:
       "applied only to stream-stream join"?

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
##########
@@ -101,6 +101,14 @@ case class ClusteredDistribution(
  * Since this distribution relies on [[HashPartitioning]] on the physical partitioning of the
  * stateful operator, only [[HashPartitioning]] (and HashPartitioning in
  * [[PartitioningCollection]]) can satisfy this distribution.
+ *
+ * NOTE: This is applied only stream-stream join as of now. For other stateful operators, we have
+ * been using ClusteredDistribution, which could construct the physical partitioning of the state
+ * in different way. (ClusteredDistribution requires relaxed condition and multiple
+ * partitionings can satisfy the requirement.) We need to construct the way to fix this with
+ * minimizing possibility to break the existing checkpoints.
+ *
+ * TODO: SPARK-38204 to address above note.

Review comment:
       nit nit: I saw we usually use the pattern `TODO(SPARK-38204)` 

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
##########
@@ -101,6 +101,14 @@ case class ClusteredDistribution(
  * Since this distribution relies on [[HashPartitioning]] on the physical partitioning of the
  * stateful operator, only [[HashPartitioning]] (and HashPartitioning in
  * [[PartitioningCollection]]) can satisfy this distribution.
+ *
+ * NOTE: This is applied only stream-stream join as of now. For other stateful operators, we have
+ * been using ClusteredDistribution, which could construct the physical partitioning of the state
+ * in different way. (ClusteredDistribution requires relaxed condition and multiple

Review comment:
       maybe "in different way (ClusteredDistribution requires) ...": no dot after "way".




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HeartSaVioR commented on a change in pull request #35512: [SPARK-38124][SS][FOLLOWUP] Document the current challenge on fixing distribution of stateful operator

Posted by GitBox <gi...@apache.org>.

HeartSaVioR commented on a change in pull request #35512:
URL: https://github.com/apache/spark/pull/35512#discussion_r806210501



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
##########
@@ -101,6 +101,14 @@ case class ClusteredDistribution(
  * Since this distribution relies on [[HashPartitioning]] on the physical partitioning of the
  * stateful operator, only [[HashPartitioning]] (and HashPartitioning in
  * [[PartitioningCollection]]) can satisfy this distribution.
+ *
+ * NOTE: This is applied only stream-stream join as of now. For other stateful operators, we have
+ * been using ClusteredDistribution, which could construct the physical partitioning of the state
+ * in different way. (ClusteredDistribution requires relaxed condition and multiple
+ * partitionings can satisfy the requirement.) We need to construct the way to fix this with
+ * minimizing possibility to break the existing checkpoints.
+ *
+ * TODO: SPARK-38204 to address above note.

Review comment:
       We seem to use both, but I see more usages on () so I'll follow it. Thanks!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HeartSaVioR commented on a change in pull request #35512: [SPARK-38124][SS][FOLLOWUP] Document the current challenge on fixing distribution of stateful operator

Posted by GitBox <gi...@apache.org>.

HeartSaVioR commented on a change in pull request #35512:
URL: https://github.com/apache/spark/pull/35512#discussion_r806210501



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala
##########
@@ -101,6 +101,14 @@ case class ClusteredDistribution(
  * Since this distribution relies on [[HashPartitioning]] on the physical partitioning of the
  * stateful operator, only [[HashPartitioning]] (and HashPartitioning in
  * [[PartitioningCollection]]) can satisfy this distribution.
+ *
+ * NOTE: This is applied only stream-stream join as of now. For other stateful operators, we have
+ * been using ClusteredDistribution, which could construct the physical partitioning of the state
+ * in different way. (ClusteredDistribution requires relaxed condition and multiple
+ * partitionings can satisfy the requirement.) We need to construct the way to fix this with
+ * minimizing possibility to break the existing checkpoints.
+ *
+ * TODO: SPARK-38204 to address above note.

Review comment:
       We seem to use both, but I see more usages on () so I'll follow it. Thanks!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HeartSaVioR commented on pull request #35512: [SPARK-38124][SS][FOLLOWUP] Document the current challenge on fixing distribution of stateful operator

Posted by GitBox <gi...@apache.org>.

HeartSaVioR commented on pull request #35512:
URL: https://github.com/apache/spark/pull/35512#issuecomment-1039793243


   Thanks! Merging to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HeartSaVioR closed pull request #35512: [SPARK-38124][SS][FOLLOWUP] Document the current challenge on fixing distribution of stateful operator

Posted by GitBox <gi...@apache.org>.

HeartSaVioR closed pull request #35512:
URL: https://github.com/apache/spark/pull/35512


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HeartSaVioR commented on pull request #35512: [SPARK-38124][SS][FOLLOWUP] Document the current challenge on fixing distribution of stateful operator

Posted by GitBox <gi...@apache.org>.

HeartSaVioR commented on pull request #35512:
URL: https://github.com/apache/spark/pull/35512#issuecomment-1039793243


   Thanks! Merging to master.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HeartSaVioR closed pull request #35512: [SPARK-38124][SS][FOLLOWUP] Document the current challenge on fixing distribution of stateful operator

Posted by GitBox <gi...@apache.org>.

HeartSaVioR closed pull request #35512:
URL: https://github.com/apache/spark/pull/35512


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] HeartSaVioR commented on pull request #35512: [SPARK-38124][SS][FOLLOWUP] Document the current challenge on fixing distribution of stateful operator

Posted by GitBox <gi...@apache.org>.

HeartSaVioR commented on pull request #35512:
URL: https://github.com/apache/spark/pull/35512#issuecomment-1038940024


   cc. @cloud-fan @viirya @sunchao @xuanyuanking @c21 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org