You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/09/02 01:42:52 UTC

[GitHub] [spark] liwensun opened a new pull request #29623: [SPARK-32776] Limit in streaming should not be optimized away by PropagateEmptyRelation

liwensun opened a new pull request #29623:
URL: https://github.com/apache/spark/pull/29623


   ### What changes were proposed in this pull request?
   
   PropagateEmptyRelation will not be applied to the LIMIT operator in streaming queries.
   
   ### Why are the changes needed?
   
   Right now, the limit operator in a streaming query may get optimized away when the relation is empty. This can be problematic for stateful streaming, as this empty batch will not write any state store files, and the next batch will fail when trying to read these state store files and throw a file not found error.
   
   We should not let PropagateEmptyRelation optimize away the Limit operator for streaming queries.
   
   This PR is intended to apply a small and safe fix for PropagateEmptyRelation. A fundamental fix that can prevent this from happening again in the future and in other optimizer rules is more desirable, but that's a much larger task.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   unit tests.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29623: [SPARK-32776] Limit in streaming should not be optimized away by PropagateEmptyRelation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29623:
URL: https://github.com/apache/spark/pull/29623#issuecomment-685394528






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #29623: [SPARK-32776] Limit in streaming should not be optimized away by PropagateEmptyRelation

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #29623:
URL: https://github.com/apache/spark/pull/29623#discussion_r481858591



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/PropagateEmptyRelationSuite.scala
##########
@@ -257,4 +257,10 @@ class PropagateEmptyRelationSuite extends PlanTest {
     val optimized = Optimize.execute(query.analyze)
     assert(optimized.resolved)
   }
+
+  test("should not optimize away limit if streaming") {

Review comment:
       shall we add a JIRA prefix?
   ```suggestion
     test("SPARK-32776: should not optimize away limit if streaming") {
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #29623: [SPARK-32776] Limit in streaming should not be optimized away by PropagateEmptyRelation

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #29623:
URL: https://github.com/apache/spark/pull/29623#discussion_r481858247



##########
File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/PropagateEmptyRelationSuite.scala
##########
@@ -257,4 +257,10 @@ class PropagateEmptyRelationSuite extends PlanTest {
     val optimized = Optimize.execute(query.analyze)
     assert(optimized.resolved)
   }
+
+  test("should not optimize away limit if streaming") {

Review comment:
       shall we add a JIRA prefix? `test("SPARK-32776: should not optimize away limit if streaming") {`




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #29623: [SPARK-32776][SS] Limit in streaming should not be optimized away by PropagateEmptyRelation

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29623:
URL: https://github.com/apache/spark/pull/29623#issuecomment-685462151


   I am just going to merge since the PR is ready to go.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29623: [SPARK-32776] Limit in streaming should not be optimized away by PropagateEmptyRelation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29623:
URL: https://github.com/apache/spark/pull/29623#issuecomment-685231472






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon closed pull request #29623: [SPARK-32776][SS] Limit in streaming should not be optimized away by PropagateEmptyRelation

Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #29623:
URL: https://github.com/apache/spark/pull/29623


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29623: [SPARK-32776] Limit in streaming should not be optimized away by PropagateEmptyRelation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29623:
URL: https://github.com/apache/spark/pull/29623#issuecomment-685231472






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29623: [SPARK-32776] Limit in streaming should not be optimized away by PropagateEmptyRelation

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29623:
URL: https://github.com/apache/spark/pull/29623#issuecomment-685394528






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29623: [SPARK-32776] Limit in streaming should not be optimized away by PropagateEmptyRelation

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29623:
URL: https://github.com/apache/spark/pull/29623#issuecomment-685233452


   **[Test build #128174 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128174/testReport)** for PR 29623 at commit [`6ae55d4`](https://github.com/apache/spark/commit/6ae55d4030e42030658618327eb3924f60f752e4).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29623: [SPARK-32776] Limit in streaming should not be optimized away by PropagateEmptyRelation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29623:
URL: https://github.com/apache/spark/pull/29623#issuecomment-685389784


   **[Test build #128174 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128174/testReport)** for PR 29623 at commit [`6ae55d4`](https://github.com/apache/spark/commit/6ae55d4030e42030658618327eb3924f60f752e4).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #29623: [SPARK-32776] Limit in streaming should not be optimized away by PropagateEmptyRelation

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #29623:
URL: https://github.com/apache/spark/pull/29623#issuecomment-685267625


   cc @HeartSaVioR 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29623: [SPARK-32776] Limit in streaming should not be optimized away by PropagateEmptyRelation

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29623:
URL: https://github.com/apache/spark/pull/29623#issuecomment-685233452


   **[Test build #128174 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128174/testReport)** for PR 29623 at commit [`6ae55d4`](https://github.com/apache/spark/commit/6ae55d4030e42030658618327eb3924f60f752e4).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] liwensun commented on pull request #29623: [SPARK-32776] Limit in streaming should not be optimized away by PropagateEmptyRelation

Posted by GitBox <gi...@apache.org>.
liwensun commented on pull request #29623:
URL: https://github.com/apache/spark/pull/29623#issuecomment-685231370


   @cloud-fan 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HeartSaVioR commented on pull request #29623: [SPARK-32776] Limit in streaming should not be optimized away by PropagateEmptyRelation

Posted by GitBox <gi...@apache.org>.
HeartSaVioR commented on pull request #29623:
URL: https://github.com/apache/spark/pull/29623#issuecomment-685328147


   Btw, looks like we are missing to update the guide doc once we address limitation. According to the guide, limit is not supported in streaming dataset.
   http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#unsupported-operations


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #29623: [SPARK-32776][SS] Limit in streaming should not be optimized away by PropagateEmptyRelation

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29623:
URL: https://github.com/apache/spark/pull/29623#issuecomment-685463270


   Merged to master and branch-3.0.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #29623: [SPARK-32776] Limit in streaming should not be optimized away by PropagateEmptyRelation

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #29623:
URL: https://github.com/apache/spark/pull/29623#discussion_r481860009



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamingQuerySuite.scala
##########
@@ -1141,6 +1143,42 @@ class StreamingQuerySuite extends StreamTest with BeforeAndAfter with Logging wi
     }
   }
 
+  testQuietly("limit on empty batch should not cause state store error") {

Review comment:
       ```suggestion
     testQuietly("SPARK-32776: limit on empty batch should not cause state store error") {
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org