You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/03/18 03:04:11 UTC

[GitHub] [spark] HeartSaVioR opened a new pull request #35908: [SPARK-38204][SS][3.2] Use HashClusteredDistribution for stateful operators with respecting backward compatibility

HeartSaVioR opened a new pull request #35908:
URL: https://github.com/apache/spark/pull/35908


   ### What changes were proposed in this pull request?
   
   This PR proposes to use HashClusteredDistribution for stateful operators which requires exact order of clustering keys without allowing sub-clustering keys, so that stateful operators will have consistent partitioning across lifetime of the query. 
   (It doesn't cover the case grouping keys are changed. We have state schema checker verifying on the changes, but changing name is allowed so swapping keys with same data type is still allowed. So there are still grey areas.)
   
   The change will break the existing queries having checkpoint in prior to Spark 3.2.2 and bring silent correctness issues. To remedy the problem, we introduce a new internal config `spark.sql.streaming.statefulOperator.useStrictDistribution`, which defaults to true for new queries but defaults to false for queries starting from checkpoint in prior to Spark 3.2.2. If the new config is set to false, stateful operator will use ClusteredDistribution which retains the old requirement of child distribution. 
   
   Note that in this change we don't fix the root problem against old checkpoints. Long-term fix should be crafted carefully, after collecting evidence on the impact of SPARK-38204. (e.g. how many queries on end users would encounter SPARK-38204.)
   
   This PR adds E2E tests for the cases which trigger SPARK-38204, and verify the behavior with new query (3.2.2) & old query (in prior to 3.2.2).
   
   ### Why are the changes needed?
   
   Please refer the description of JIRA issue [SPARK-38024](https://issues.apache.org/jira/browse/SPARK-38204) for details, since the description is quite long to include here.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, stateful operators no longer accept the child output partitioning having subset of grouping keys and trigger additional shuffle. This will ensure consistent partitioning with stateful operators across lifetime of the query.
   
   ### How was this patch tested?
   
   New UTs including backward compatibility are added.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #35908: [SPARK-38204][SS][3.2] Use HashClusteredDistribution for stateful operators with respecting backward compatibility

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #35908:
URL: https://github.com/apache/spark/pull/35908#issuecomment-1075806762


   Addressed via [c6dd39d](https://github.com/apache/spark/commit/c6dd39da4132fc4861ae5fed2684d63d817c0ba2)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR closed pull request #35908: [SPARK-38204][SS][3.2] Use HashClusteredDistribution for stateful operators with respecting backward compatibility

Posted by GitBox <gi...@apache.org>.
HeartSaVioR closed pull request #35908:
URL: https://github.com/apache/spark/pull/35908


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #35908: [SPARK-38204][SS][3.2] Use HashClusteredDistribution for stateful operators with respecting backward compatibility

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #35908:
URL: https://github.com/apache/spark/pull/35908#issuecomment-1071982909


   cc. @viirya @xuanyuanking @c21 
   This is a backport PR of #35673 against 3.2 branch. I had to use HashClusteredDistribution here because porting back the change on StatefulOpClusteredDistribution required more changes and expected.
   Please take a look. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR edited a comment on pull request #35908: [SPARK-38204][SS][3.2] Use HashClusteredDistribution for stateful operators with respecting backward compatibility

Posted by GitBox <gi...@apache.org>.
HeartSaVioR edited a comment on pull request #35908:
URL: https://github.com/apache/spark/pull/35908#issuecomment-1071982909


   cc. @viirya @xuanyuanking @c21 
   This is a backport PR of #35673 against 3.2 branch. I had to use HashClusteredDistribution here because porting back the change on StatefulOpClusteredDistribution required more changes than I expected.
   Please take a look. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #35908: [SPARK-38204][SS][3.2] Use HashClusteredDistribution for stateful operators with respecting backward compatibility

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #35908:
URL: https://github.com/apache/spark/pull/35908#issuecomment-1071982909


   cc. @viirya @xuanyuanking @c21 
   This is a backport PR of #35673 against 3.2 branch. I had to use HashClusteredDistribution here because porting back the change on StatefulOpClusteredDistribution required more changes and expected.
   Please take a look. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR edited a comment on pull request #35908: [SPARK-38204][SS][3.2] Use HashClusteredDistribution for stateful operators with respecting backward compatibility

Posted by GitBox <gi...@apache.org>.
HeartSaVioR edited a comment on pull request #35908:
URL: https://github.com/apache/spark/pull/35908#issuecomment-1071982909


   cc. @viirya @xuanyuanking @c21 
   This is a backport PR of #35673 against 3.2 branch. I had to use HashClusteredDistribution here because porting back the change on StatefulOpClusteredDistribution required more changes than I expected.
   Please take a look. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #35908: [SPARK-38204][SS][3.2] Use HashClusteredDistribution for stateful operators with respecting backward compatibility

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #35908:
URL: https://github.com/apache/spark/pull/35908#issuecomment-1075805861


   Thanks for reviewing! Merging to 3.2.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org