Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/06/17 01:58:16 UTC

[GitHub] [spark] viirya opened a new pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

viirya opened a new pull request #28846:
URL: https://github.com/apache/spark/pull/28846


   
   ### What changes were proposed in this pull request?
   
   This patch changes how AQE creates query stages. Instead of creating query stages in a batch, it creates and materializes them incrementally, so that re-optimization can kick in earlier. This can avoid unnecessary local shuffles.
   
   ### Why are the changes needed?
   
   AQE currently creates query stages in batches. For example, both children of a sort merge join are materialized as query stages in one batch. AQE then re-optimizes the plan and may turn the sort merge join into a broadcast join. Apart from the broadcast exchange, the other side of the join no longer needs an exchange, but that exchange has already been materialized. AQE currently wraps the materialized exchange with a local shuffle reader, but this still incurs unnecessary I/O. Creating query stages incrementally avoids this unnecessary local shuffle.
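
To make the contrast concrete, here is a minimal Python sketch of the two strategies. It is a toy model, not Spark code: `Side`, `plan_batch`, `plan_incremental`, and `BROADCAST_THRESHOLD` are hypothetical names introduced only for illustration, and the size check stands in for whatever statistics AQE reads from a completed stage.

```python
from dataclasses import dataclass

BROADCAST_THRESHOLD = 10 * 1024 * 1024  # hypothetical threshold, in bytes


@dataclass
class Side:
    name: str
    size_bytes: int  # in this toy model, only "known" once the side is materialized


def choose_join(known_sizes):
    """Pick a broadcast join if any already-materialized side is small enough."""
    if any(size <= BROADCAST_THRESHOLD for size in known_sizes):
        return "BroadcastHashJoin"
    return "SortMergeJoin"


def plan_batch(left, right):
    """Materialize both shuffle exchanges first, then re-optimize."""
    shuffled = [left.name, right.name]  # both exchanges are already written
    return choose_join([left.size_bytes, right.size_bytes]), shuffled


def plan_incremental(first, second):
    """Materialize one side, re-optimize, and only then decide about the other."""
    shuffled = [first.name]
    if choose_join([first.size_bytes]) == "BroadcastHashJoin":
        return "BroadcastHashJoin", shuffled  # the second exchange is never created
    shuffled.append(second.name)
    return choose_join([first.size_bytes, second.size_bytes]), shuffled
```

With a small `first` side, `plan_incremental` shuffles only that side, while `plan_batch` always shuffles both. If the large side happens to go first, the incremental variant ends up shuffling both sides as well, one after the other.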
   
   ### Does this PR introduce _any_ user-facing change?
   
   No
   
   ### How was this patch tested?
   
   Unit tests.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-645195541








[GitHub] [spark] AmplabJenkins removed a comment on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-670576359








[GitHub] [spark] viirya commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-645634571


   Let me use an example to elaborate. The query `SELECT * FROM testData join testData2 ON key = a where value = '1'` is one of the test cases in `AdaptiveQueryExecSuite`.
   
   The adaptivePlan in current master:
   ```
    *(3) BroadcastHashJoin [key#13], [a#23], Inner, BuildLeft
   :- BroadcastQueryStage 2
   :  +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))), [id=#144]
   :     +- CustomShuffleReader local
   :        +- ShuffleQueryStage 0
   :           +- Exchange hashpartitioning(key#13, 5), true, [id=#110]
   :              +- *(1) Filter (isnotnull(value#14) AND (value#14 = 1))
   :                 +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).value, true, false) AS value#14]                                                                                                  
   :                    +- Scan[obj#12]
   +- CustomShuffleReader local
      +- ShuffleQueryStage 1
         +- Exchange hashpartitioning(a#23, 5), true, [id=#121]
            +- *(2) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#23, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#24]
               +- Scan[obj#22]
   ```
   
   In the adaptivePlan above, AQE produces two `ShuffleQueryStage`s because both exchanges were materialized in a batch before AQE re-optimized the query. Although AQE can optimize the SortMergeJoin to a BroadcastHashJoin, the exchanges were already materialized, and the only thing AQE can do is read them with a local reader.
   
   The adaptivePlan with this change:
   ```
   *(2) BroadcastHashJoin [key#13], [a#23], Inner, BuildLeft                                  
   :- BroadcastQueryStage 1
   :  +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))), [id=#137]
   :     +- CustomShuffleReader local
   :        +- ShuffleQueryStage 0
   :           +- Exchange hashpartitioning(key#13, 5), true, [id=#110]
   :              +- *(1) Filter (isnotnull(value#14) AND (value#14 = 1))
   :                 +- *(1) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData, true])).value, true, false) AS value#14]
   :                    +- Scan[obj#12]
   +- *(2) SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#23, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#24]
      +- Scan[obj#22]
   ```
   
   With this change, AQE materializes only one exchange and then optimizes the SortMergeJoin to a BroadcastHashJoin. After that, there is no need to produce the other exchange.




[GitHub] [spark] AmplabJenkins commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-670620998








[GitHub] [spark] cloud-fan commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-647270331


   And as @maryannxue said, you may trigger the large side first, in which case holding off the other side doesn't make sense. Ideally we should trigger both sides and cancel the large side if the small side completes very quickly. It would be great if you could explore the cancellation approach.
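
The cancellation idea can be sketched outside Spark with plain Python threads. This is only an illustration under stated assumptions: `run_stage`, `join_with_cancellation`, and `BROADCAST_THRESHOLD` are hypothetical names, and cooperative cancellation through a `threading.Event` stands in for whatever mechanism Spark would use to cancel a running stage.

```python
import threading
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

BROADCAST_THRESHOLD = 10 * 1024 * 1024  # hypothetical threshold, in bytes


def run_stage(rows, bytes_per_row, cancel):
    """Pretend to write shuffle output row by row; stop early if cancelled.

    Returns (bytes_written, completed).
    """
    written = 0
    for _ in range(rows):
        # The short wait stands in for per-row work and doubles as the cancel check.
        if cancel.wait(0.001):
            return written, False
        written += bytes_per_row
    return written, True


def join_with_cancellation(rows_by_side, bytes_per_row=100):
    """Trigger every side at once; cancel the rest if the first finisher is small."""
    cancels = {name: threading.Event() for name in rows_by_side}
    with ThreadPoolExecutor(max_workers=len(rows_by_side)) as pool:
        futures = {
            pool.submit(run_stage, rows, bytes_per_row, cancels[name]): name
            for name, rows in rows_by_side.items()
        }
        done, pending = wait(futures, return_when=FIRST_COMPLETED)
        first_bytes, _ = next(iter(done)).result()
        if first_bytes <= BROADCAST_THRESHOLD:
            for future in pending:
                cancels[futures[future]].set()
        results = {name: future.result() for future, name in futures.items()}
    broadcastable = [
        name for name, (size, ok) in results.items()
        if ok and size <= BROADCAST_THRESHOLD
    ]
    join = "BroadcastHashJoin" if broadcastable else "SortMergeJoin"
    completed = sorted(name for name, (_, ok) in results.items() if ok)
    return join, completed
```

Calling `join_with_cancellation({"small": 5, "large": 5000})` lets both sides start together, so neither is held off, and stops the large side as soon as the small one turns out to be broadcastable. When no side is small enough, nothing is cancelled and both shuffles run to completion in parallel.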




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-670620998


   Merged build finished. Test FAILed.




[GitHub] [spark] cloud-fan edited a comment on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
cloud-fan edited a comment on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-646127629


   > It creates one stage first and then re-optimize the join.
   
   This is the confusing part. Creating a stage is not enough: we must wait for it to complete before we know its size and can optimize the join to a broadcast join.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-645328325








[GitHub] [spark] viirya commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-647281932


   @cloud-fan Thanks for clarifying. The idea sounds worth exploring, as it could avoid the local shuffle while keeping AQE's current design of running independent stages in parallel. I will explore the possibility.




[GitHub] [spark] viirya edited a comment on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
viirya edited a comment on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-646089689


   This change does not create and materialize all query stages of the join in a batch. It creates and materializes one stage first and then re-optimizes the join. So once the join becomes a broadcast join, the unnecessary exchange is never created.




[GitHub] [spark] AmplabJenkins commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-645328325








[GitHub] [spark] viirya commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-646172235


   I updated the previous comment: "It creates and materializes one stage first..."




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-670621001


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/127207/




[GitHub] [spark] cloud-fan commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-646191835


   > It creates and materializes
   
   Do you mean triggering the materialization, or waiting for it to complete?




[GitHub] [spark] JkSelf commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
JkSelf commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-647347069


   In our previous 3TB TPC-DS benchmark, the performance improvement came mainly from two features: coalescing shuffle partitions and converting SMJ to BHJ. The results are [here](https://docs.google.com/spreadsheets/d/1uija2AFblciMcYzU4jnPiy6I8mU8-M0-HwSNNns5aLU/edit?usp=sharing) for reference.




[GitHub] [spark] SparkQA commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-645149223


   **[Test build #124152 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124152/testReport)** for PR 28846 at commit [`523e1d5`](https://github.com/apache/spark/commit/523e1d592beddf90331f77f57aff50af9dfea12b).




[GitHub] [spark] viirya commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-645195049


   retest this please




[GitHub] [spark] SparkQA commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-645146152


   **[Test build #124149 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124149/testReport)** for PR 28846 at commit [`e171a6c`](https://github.com/apache/spark/commit/e171a6cf65a6d29ed6bdc2d961effded185f9cbd).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] maryannxue commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
maryannxue commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-646235383


   The question is: this means you hold off the other stage, right? Wouldn't that cause regressions? If the join eventually stays an SMJ instead of becoming a BHJ, one of the stages will have been delayed.
   And how do you know whether you are starting the larger stage or the smaller one first?
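
The regression concern can be put into rough numbers with an idealized latency model in Python. The function names and timings are hypothetical, and the model assumes the cluster has enough resources to run both stages concurrently, which a real scheduler may not provide.

```python
def batch_latency(t_left, t_right):
    """Both stages are submitted together and, in this idealized model, overlap fully."""
    return max(t_left, t_right)


def incremental_latency(t_first, t_second, first_is_broadcastable):
    """The second stage is held off until the first one completes.

    Its shuffle is skipped only if the first side turns out to be small enough
    to broadcast; otherwise the two stages run back to back.
    """
    if first_is_broadcastable:
        return t_first
    return t_first + t_second
```

For example, with stage times of 10 and 60 seconds, the batch approach waits about 60 seconds. The incremental approach waits about 10 seconds if the 10-second side goes first and is broadcastable, but about 70 seconds if the join stays an SMJ.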




[GitHub] [spark] cloud-fan commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-645837939


   How do you achieve it? Do you hold off the execution of one query stage and wait until the other query stage completes?




[GitHub] [spark] SparkQA commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-645102759


   **[Test build #124149 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124149/testReport)** for PR 28846 at commit [`e171a6c`](https://github.com/apache/spark/commit/e171a6cf65a6d29ed6bdc2d961effded185f9cbd).




[GitHub] [spark] SparkQA commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-645192949


   **[Test build #124152 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124152/testReport)** for PR 28846 at commit [`523e1d5`](https://github.com/apache/spark/commit/523e1d592beddf90331f77f57aff50af9dfea12b).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] viirya edited a comment on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
viirya edited a comment on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-646172235


   I updated the previous comment: "It creates and materializes one stage first..."
   
   You can see from the query plan in the previous comment that it optimizes the join to a broadcast join.




[GitHub] [spark] AmplabJenkins commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-645146298








[GitHub] [spark] SparkQA commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-645327299


   **[Test build #124159 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124159/testReport)** for PR 28846 at commit [`523e1d5`](https://github.com/apache/spark/commit/523e1d592beddf90331f77f57aff50af9dfea12b).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] viirya commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-646241038


   Does triggering all stages of a join actually mean they run at the same time? I think it only means they are submitted to the scheduler. When the stages start running depends on resource provisioning. If the first stage uses all resources, I think the later stage still has to be held off?
   
   This is also related to one question I have: is the speed-up of AQE gained by triggering all stages together (not holding off other stages, as you said), or by optimizing the join from SMJ to BHJ (considering only the join case)? I may misunderstand, but before AQE was added to Spark SQL, I think we didn't trigger all stages like that either, right?




[GitHub] [spark] viirya edited a comment on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
viirya edited a comment on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-646089689


   This change does not create and materialize all query stages of the join in a batch. It creates and materializes one stage first and then re-optimizes the join. So once it turns the join into a broadcast join, it won't create the unnecessary exchange.
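
   The difference between the two strategies can be illustrated with a toy model in plain Python. This is only a sketch, not Spark code: `Side`, `plan_batch`, `plan_incremental`, and `BROADCAST_THRESHOLD` are hypothetical names, the threshold merely stands in for `spark.sql.autoBroadcastJoinThreshold`, and the assumption that the left side is materialized first is made purely for illustration.

```python
from dataclasses import dataclass

# Hypothetical stand-in for spark.sql.autoBroadcastJoinThreshold (10 MB here).
BROADCAST_THRESHOLD = 10 * 1024 * 1024


@dataclass
class Side:
    name: str
    size_bytes: int  # in this toy model, only "known" once the side is materialized


def choose_join(size_bytes):
    """Pick the join strategy from the size of a materialized side."""
    return "BroadcastHashJoin" if size_bytes <= BROADCAST_THRESHOLD else "SortMergeJoin"


def plan_batch(left, right):
    """Batch creation: both exchanges exist before the join is re-optimized."""
    exchanges = [left.name, right.name]  # both sides are shuffled up front
    join = choose_join(min(left.size_bytes, right.size_bytes))
    return join, exchanges


def plan_incremental(left, right):
    """Incremental creation: materialize one side, re-optimize, then decide."""
    exchanges = [left.name]  # assumption: the left side is materialized first
    if choose_join(left.size_bytes) == "BroadcastHashJoin":
        # The join is already a broadcast join, so no exchange is added for the other side.
        return "BroadcastHashJoin", exchanges
    exchanges.append(right.name)
    return choose_join(right.size_bytes), exchanges
```

   With a small left side, `plan_batch` still lists an exchange for both sides, while `plan_incremental` lists only the one it had to materialize before re-optimizing.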




[GitHub] [spark] SparkQA commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-670575829


   **[Test build #127207 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127207/testReport)** for PR 28846 at commit [`6bb0b63`](https://github.com/apache/spark/commit/6bb0b6331d32f5517e482df3cf1c24fe91b97836).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-645146303


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124149/
   Test FAILed.




[GitHub] [spark] viirya commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-646089689


   This change does not create and materialize all query stages of the join in a batch. It creates one stage first and then re-optimizes the join. So once it turns the join into a broadcast join, it won't create the unnecessary exchange.




[GitHub] [spark] viirya edited a comment on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
viirya edited a comment on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-646241038


   Does triggering all stages of a join actually mean they run at the same time? I think it only means they are submitted to the scheduler. When the stages start running depends on resource provisioning. If the first stage uses all resources, I think the later stage still has to be held off?
   
   This is also related to one question I have: is the speed-up of AQE gained by triggering all stages together (not holding off other stages, as you said), or by optimizing the join from SMJ to BHJ (if we only consider the join case)? I may misunderstand, but before AQE was added to Spark SQL, I think we didn't trigger all stages like that either, right?




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-645193007


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/124152/
   Test FAILed.




[GitHub] [spark] cloud-fan commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-647269450


   > If first stage uses all resources, I think later stage still needs to held off?
   
   That's true, but that's an assumption. It's also possible that these 2 jobs indeed run together.
   
   > the speed-up of AQE is gained by triggering all stages (not holding off other stage as you said) together, or optimizing join from SMJ to BHJ (if we only consider join case)
   
   In the benchmark, the default parallelism takes all the CPU cores. I think most of the perf gain comes from shuffle partition coalescing and SMJ -> BHJ. cc @JkSelf
   
   That said, by design AQE triggers all independent stages at the same time to maximize parallelism. This helps when resources are sufficient (or with auto-scaling). I don't think we should change this design.
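
   The distinction between submitting stages together and having them actually run together can be shown with a small analogy in plain Python. This is a sketch only and says nothing about Spark's scheduler internals: `overlapped` is a hypothetical helper, the thread pool stands in for cluster resources, and the two tasks stand in for independent stages.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def overlapped(num_slots, timeout=0.5):
    """Submit two independent 'stages' at once; return True if they ran concurrently."""
    barrier = threading.Barrier(2)

    def stage():
        try:
            # Passes only if the other stage is running at the same time.
            barrier.wait(timeout=timeout)
            return True
        except threading.BrokenBarrierError:
            # The other stage never got a slot while this one was running.
            return False

    with ThreadPoolExecutor(max_workers=num_slots) as pool:
        # Both stages are handed to the pool immediately, whatever its size.
        futures = [pool.submit(stage) for _ in range(2)]
        return all(f.result() for f in futures)
```

   `overlapped(2)` returns `True` because both tasks get a slot, while `overlapped(1)` returns `False`: the second task is queued until the first one gives up its slot, even though both were submitted together.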




[GitHub] [spark] dongjoon-hyun commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-670569547


   Hi, @viirya. Could you rebase this PR onto the `master` branch, please?




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-645193001


   Merged build finished. Test FAILed.




[GitHub] [spark] github-actions[bot] commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-727668820


   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!




[GitHub] [spark] cloud-fan commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-646127629


   > It creates one stage first and then re-optimize the join.
   
   This is the confusing part. Creating a stage is not enough; we must wait for it to complete, and only then can we know the size and optimize the join to a broadcast join.




[GitHub] [spark] SparkQA removed a comment on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-645102759


   **[Test build #124149 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124149/testReport)** for PR 28846 at commit [`e171a6c`](https://github.com/apache/spark/commit/e171a6cf65a6d29ed6bdc2d961effded185f9cbd).




[GitHub] [spark] viirya commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
viirya commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-646213264


   It does need to wait for the stage to complete; this is how AQE works.




[GitHub] [spark] SparkQA commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-645198156


   **[Test build #124159 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124159/testReport)** for PR 28846 at commit [`523e1d5`](https://github.com/apache/spark/commit/523e1d592beddf90331f77f57aff50af9dfea12b).




[GitHub] [spark] SparkQA removed a comment on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-645149223


   **[Test build #124152 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124152/testReport)** for PR 28846 at commit [`523e1d5`](https://github.com/apache/spark/commit/523e1d592beddf90331f77f57aff50af9dfea12b).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-645102991








[GitHub] [spark] AmplabJenkins commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-645102991








[GitHub] [spark] viirya edited a comment on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
viirya edited a comment on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-646213264


   It does need to wait for the stage to complete; this is how AQE works. As you said, we need to know the size of the exchange.




[GitHub] [spark] SparkQA removed a comment on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-670575829


   **[Test build #127207 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127207/testReport)** for PR 28846 at commit [`6bb0b63`](https://github.com/apache/spark/commit/6bb0b6331d32f5517e482df3cf1c24fe91b97836).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-645149534








[GitHub] [spark] SparkQA commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-670620880


   **[Test build #127207 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127207/testReport)** for PR 28846 at commit [`6bb0b63`](https://github.com/apache/spark/commit/6bb0b6331d32f5517e482df3cf1c24fe91b97836).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-670576359








[GitHub] [spark] AmplabJenkins commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-645149534








[GitHub] [spark] github-actions[bot] closed pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #28846:
URL: https://github.com/apache/spark/pull/28846


   




[GitHub] [spark] SparkQA removed a comment on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-645198156


   **[Test build #124159 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124159/testReport)** for PR 28846 at commit [`523e1d5`](https://github.com/apache/spark/commit/523e1d592beddf90331f77f57aff50af9dfea12b).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-645146298


   Merged build finished. Test FAILed.




[GitHub] [spark] cloud-fan edited a comment on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
cloud-fan edited a comment on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-645380472


   Can you elaborate more? How does this optimization help to plan a broadcast join?
   
   > For example, the children of a sort merge join will be materialized as query stages in a batch. Then AQE brings the optimization in and optimize sort merge join to broadcast join.
   
   AQE needs to wait for the stage to finish so that it knows the size and can change SMJ to BHJ. How can we avoid unnecessary I/O after the stage is finished?




[GitHub] [spark] cloud-fan commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-645380472


   Can you elaborate more? How does this optimization help to plan a broadcast join?
   
   > For example, the children of a sort merge join will be materialized as query stages in a batch. Then AQE brings the optimization in and optimize sort merge join to broadcast join.
   
   AQE needs to wait for the stage to finish so that it knows the size and can change SMJ to BHJ. How can we avoid unnecessary I/O after the stage is finished?




[GitHub] [spark] AmplabJenkins removed a comment on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-645195541








[GitHub] [spark] AmplabJenkins commented on pull request #28846: [SPARK-32012][SQL] Incrementally create and materialize query stage to avoid unnecessary local shuffle

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #28846:
URL: https://github.com/apache/spark/pull/28846#issuecomment-645193001





