Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/02/18 06:01:59 UTC

[GitHub] [spark] JkSelf opened a new pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

JkSelf opened a new pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616
 
 
   
   ### What changes were proposed in this pull request?
   This PR adds a user guide for AQE, including the detailed configurations for the three main features in AQE.
   
   
   ### Why are the changes needed?
   Add the detailed configurations.
   
   
   ### Does this PR introduce any user-facing change?
   No
   
   
   ### How was this patch tested?
   This PR only adds documentation, so no unit tests are needed.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-591734035
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r381285141
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,75 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partition number, optimizing local shuffle reader and optimizing skewed join.
+ ### Coalescing Post Shuffle Partition Number
+ This feature coalesces the post shuffle partitions based on the map output statistics when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled` configuration properties are both enabled. There are four following sub-configurations in this optimization rule. And this feature can bring about 1.28x performance gain with query 38 in 3TB TPC-DS.
 
 Review comment:
   `And this feature can bring about 1.28x performance gain with query 38 in 3TB TPC-DS.` This is not useful... how about something like
   ```
   This feature simplifies the tuning of shuffle partitions number when running queries. You don't need to
   set a proper shuffle partition number that just fits your data. You just need to set a large enough number and
   Spark can pick the proper shuffle partition number at runtime.
   ```
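
To illustrate the idea behind the suggested wording, here is a minimal Python sketch of coalescing adjacent post-shuffle partitions toward a target size. It is not Spark's implementation: the function name `coalesce_partitions`, the sample sizes, and the target of 64 are hypothetical, and the real rule also takes the configured minimum partition number into account. The sketch only shows why a generously large initial partition number can be shrunk safely at runtime.

```python
from typing import List, Tuple


def coalesce_partitions(sizes: List[int], target_size: int) -> List[Tuple[int, int]]:
    """Greedily merge adjacent partitions; returns [start, end) index ranges.

    `sizes` stands in for per-partition sizes taken from map output
    statistics (hypothetical input for this sketch).
    """
    ranges: List[Tuple[int, int]] = []
    start = 0      # index of the first partition in the current group
    current = 0    # accumulated size of the current group
    for i, size in enumerate(sizes):
        # Close the current group if adding this partition would exceed the target.
        if i > start and current + size > target_size:
            ranges.append((start, i))
            start = i
            current = 0
        current += size
    if sizes:
        ranges.append((start, len(sizes)))
    return ranges


# Six small initial partitions collapse into three groups.
print(coalesce_partitions([10, 20, 30, 70, 5, 5], 64))
```

A partition that is already larger than the target (70 in the sample) stays in a group of its own; coalescing only merges, it never splits.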



[GitHub] [spark] cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r380469743
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,75 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that make use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. There are three mainly feature in AQE, including coalescing post partition number, optimizing local shuffle reader and optimizing skewed join.
 
 Review comment:
   post-shuffle partitions number



[GitHub] [spark] cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r391447525
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,61 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partitions, local shuffle reader optimization and skewed join optimization.
+ ### Coalescing Post Shuffle Partition Number
+ This feature coalesces the post shuffle partitions based on the map output statistics when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configuration properties are both enabled. There are four following sub-configurations in this optimization rule. This feature simplifies the tuning of shuffle partitions number when running queries. You don't need to set a proper shuffle partition number to fit your dataset. You just need to set a large enough number and Spark can pick the proper shuffle partition number at runtime.
+ <table class="table">
+   <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.enabled</code></td>
+     <td>true</td>
+     <td>
+       When true and <code>spark.sql.adaptive.enabled</code> is enabled, spark will reduce the post shuffle partitions number based on the map output statistics.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.minPartitionNum</code></td>
+     <td>1</td>
+     <td>
+       The advisory minimum number of post-shuffle partitions used when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled. It is suggested to be almost 2~3x of the parallelism when doing benchmark.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.initialPartitionNum</code></td>
+     <td>200</td>
+     <td>
+       The advisory number of post-shuffle partitions used in adaptive execution. This is used as the initial number of pre-shuffle partitions. By default it equals to <code>spark.sql.shuffle.partitions</code>.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code></td>
+     <td>67108864 (64 MB)</td>
+     <td>
+       The target post-shuffle input size in bytes of a task when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled.
+     </td>
+   </tr>
+ </table>
+ 
+ ### Optimize Local Shuffle Reader
+ This feature optimize the shuffle reader to local shuffle reader when converting the sort merge join to broadcast hash join in runtime and no additional shuffle introduced. It takes effect when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.localShuffleReader.enabled` configuration properties are both enabled. This feature can improve the performance by saving the network overhead of shuffle process.
+ ### Optimize Skewed Join
+ This feature choose the skewed partition and creates multi tasks to handle the skewed partition when both enable `spark.sql.adaptive.enabled` and `spark.sql.adaptive.skewJoin.enabled`. There are two following sub-configurations in this optimization rule. Data skew can severely downgrade performance of join queries. And this feature can split the skewed partition into multi parallel tasks instead of original 1 task to reduce the overhead of skewed join.
 
 Review comment:
   Let's make it simple:
   
   Data skew can severely downgrade the performance of join queries. This feature dynamically handles skew in sort-merge join by splitting (and replicating if needed) skewed tasks into roughly evenly sized tasks. It takes effect when both abc and xyz configurations are enabled. 
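
The splitting described in this suggestion can be sketched roughly as follows. This is an illustration under stated assumptions, not Spark's algorithm: the function `plan_skew_splits` and its parameters `factor`, `threshold`, and `target_size` are hypothetical names, the skew test (larger than a multiple of the median and larger than an absolute threshold) is one plausible heuristic, and replication of the other join side is not modelled.

```python
import math
from statistics import median
from typing import Dict, List


def plan_skew_splits(
    sizes: List[int], factor: float, threshold: int, target_size: int
) -> Dict[int, int]:
    """Map each skewed partition index to the number of tasks it is split into."""
    med = median(sizes)
    plan: Dict[int, int] = {}
    for idx, size in enumerate(sizes):
        # Hypothetical skew test: much larger than the median AND above a floor.
        if size > factor * med and size > threshold:
            # Split into roughly evenly sized tasks of about `target_size` each.
            plan[idx] = math.ceil(size / target_size)
    return plan


# Partition 3 is far larger than its siblings, so only it is split.
print(plan_skew_splits([100, 120, 90, 2000, 110], factor=5, threshold=500, target_size=256))
```

Using both a relative and an absolute condition keeps small but uneven partitions from being split needlessly.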



[GitHub] [spark] SparkQA commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-591810675
 
 
   **[Test build #119003 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119003/testReport)** for PR 27616 at commit [`8ff8b71`](https://github.com/apache/spark/commit/8ff8b712725c69f68b6953b22c26896c4e0760b0).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-591811430
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] SparkQA commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-598493733
 
 
   **[Test build #119733 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119733/testReport)** for PR 27616 at commit [`9edc644`](https://github.com/apache/spark/commit/9edc644a66858b42ed6428caf33da3076bda35f0).



[GitHub] [spark] cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r392834156
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,63 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing coalescing post-shuffle partitions, converting sort-merge join to broadcast join, and skewed join optimization.
+
+### Coalescing Post Shuffle Partition Number
+This feature coalesces the post shuffle partitions based on the map output statistics when both `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configuration properties are enabled. There are four following sub-configurations in this optimization rule. This feature simplifies the tuning of shuffle partition number when running queries. You do not need to set a proper shuffle partition number to fit your dataset. Spark can pick the proper shuffle partition number at runtime once you set a large enough initial number of shuffle partitions via `spark.sql.adaptive.coalescePartitions.initialPartitionNum` configuration.
 
 Review comment:
   `There are four following sub-configurations in this optimization rule.` Can we remove this sentence? It does not look useful, as users can see all the configs in the following table.



[GitHub] [spark] AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-599533110
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] SparkQA removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-599347058
 
 
   **[Test build #119832 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119832/testReport)** for PR 27616 at commit [`13aa51b`](https://github.com/apache/spark/commit/13aa51b245ecabf5ab38ba2b446196db5a79cb4e).



[GitHub] [spark] cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r380469633
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,75 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that make use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. There are three mainly feature in AQE, including coalescing post partition number, optimizing local shuffle reader and optimizing skewed join.
 
 Review comment:
   `As of Spark 3.0, there are three major features ...`



[GitHub] [spark] SparkQA commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-595655574
 
 
   **[Test build #119455 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119455/testReport)** for PR 27616 at commit [`4fdbe10`](https://github.com/apache/spark/commit/4fdbe10686476acf1effc420efa41faa3243cd6e).



[GitHub] [spark] cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r392832667
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,63 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing coalescing post-shuffle partitions, converting sort-merge join to broadcast join, and skewed join optimization.
+
+### Coalescing Post Shuffle Partition Number
 
 Review comment:
   `Coalescing Post Shuffle Partitions`



[GitHub] [spark] AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-599380125
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/24573/
   Test PASSed.



[GitHub] [spark] cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r391441866
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,61 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partitions, local shuffle reader optimization and skewed join optimization.
+ ### Coalescing Post Shuffle Partition Number
+ This feature coalesces the post shuffle partitions based on the map output statistics when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configuration properties are both enabled. There are four following sub-configurations in this optimization rule. This feature simplifies the tuning of shuffle partitions number when running queries. You don't need to set a proper shuffle partition number to fit your dataset. You just need to set a large enough number and Spark can pick the proper shuffle partition number at runtime.
 
 Review comment:
   `You just need to set a large enough number` -> `You just need to set a large enough initial number of shuffle partitions via xxx config, and ...`
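
As a concrete reading of this suggestion, the sketch below collects the three properties named in this PR into a dictionary and renders them as `--conf key=value` arguments for `spark-submit`. The value `1000` and the helper `to_submit_args` are assumptions made for illustration; an appropriate initial partition number depends on the workload.

```python
from typing import Dict, List

# Property names come from the proposed guide; the values are illustrative only.
aqe_conf: Dict[str, str] = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    # Deliberately large starting point; AQE may coalesce it at runtime.
    "spark.sql.adaptive.coalescePartitions.initialPartitionNum": "1000",
}


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render a config mapping as a flat list of spark-submit arguments."""
    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_submit_args(aqe_conf)))
```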



[GitHub] [spark] AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-599380117
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r390167314
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,61 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partitions, local shuffle reader optimization and skewed join optimization.
+ ### Coalescing Post Shuffle Partition Number
+ This feature coalesces the post shuffle partitions based on the map output statistics when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configuration properties are both enabled. There are four following sub-configurations in this optimization rule. This feature simplifies the tuning of shuffle partitions number when running queries. You don't need to set a proper shuffle partition number to fit your dataset. You just need to set a large enough number and Spark can pick the proper shuffle partition number at runtime.
 
 Review comment:
   shuffle partitions number -> a shuffle partition number
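
For readers of the quoted paragraph, a rough sketch of what "coalescing post-shuffle partitions based on map output statistics" amounts to. This is a simplified greedy model, not Spark's actual implementation: the function name, the way `min_partition_num` shrinks the target, and the sample sizes are all assumptions for illustration.

```python
import math


def coalesce_partitions(sizes, advisory_size, min_partition_num=1):
    """Greedily merge contiguous partitions so each group stays near the target size.

    Returns a list of (start, end) index ranges, with `end` exclusive.
    """
    total = sum(sizes)
    # Assumption: shrink the target when needed so that roughly
    # min_partition_num groups remain after merging.
    target = min(advisory_size, max(1, math.ceil(total / min_partition_num)))
    groups = []
    start, running = 0, 0
    for i, size in enumerate(sizes):
        # Close the current group if adding this partition would overshoot the target.
        if i > start and running + size > target:
            groups.append((start, i))
            start, running = i, 0
        running += size
    if sizes:
        groups.append((start, len(sizes)))
    return groups


# Six small shuffle partitions (sizes in MB) with a 64 MB advisory size.
print(coalesce_partitions([10, 20, 30, 40, 5, 5], advisory_size=64))
```

Only contiguous partitions are merged, which matches the wording "coalesce contiguous shuffle partitions" used later in this thread.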



[GitHub] [spark] AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-599376434
 
 
   Merged build finished. Test FAILed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-587315290
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-595762987
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119455/
   Test PASSed.



[GitHub] [spark] SparkQA commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-598552907
 
 
   **[Test build #119733 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119733/testReport)** for PR 27616 at commit [`9edc644`](https://github.com/apache/spark/commit/9edc644a66858b42ed6428caf33da3076bda35f0).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-587298476
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118613/
   Test PASSed.



[GitHub] [spark] cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r392836209
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,63 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing coalescing post-shuffle partitions, converting sort-merge join to broadcast join, and skewed join optimization.
+
+### Coalescing Post Shuffle Partition Number
+This feature coalesces the post shuffle partitions based on the map output statistics when both `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configuration properties are enabled. There are four following sub-configurations in this optimization rule. This feature simplifies the tuning of shuffle partition number when running queries. You do not need to set a proper shuffle partition number to fit your dataset. Spark can pick the proper shuffle partition number at runtime once you set a large enough initial number of shuffle partitions via `spark.sql.adaptive.coalescePartitions.initialPartitionNum` configuration.
+ <table class="table">
+   <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.enabled</code></td>
+     <td>true</td>
+     <td>
+       When true and <code>spark.sql.adaptive.enabled</code> is true, Spark will coalesce contiguous shuffle partitions according to the target size (specified by <code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code>), to avoid too many small tasks.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.minPartitionNum</code></td>
+     <td>Default Parallelism</td>
+     <td>
+       The minimum number of shuffle partitions after coalescing. If not set, the default value is the default parallelism of the Spark cluster. This configuration only has an effect when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.initialPartitionNum</code></td>
+     <td>200</td>
+     <td>
+       The initial number of shuffle partitions before coalescing. By default it equals to <code>spark.sql.shuffle.partitions</code>. This configuration only has an effect when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code></td>
+     <td>64 MB</td>
+     <td>
+       The advisory size in bytes of the shuffle partition during adaptive optimization (when <code>spark.sql.adaptive.enabled</code> is true). It takes effect when Spark coalesces small shuffle partitions or splits skewed shuffle partition.
+     </td>
+   </tr>
+ </table>
+ 
+### Optimize Local Shuffle Reader
+AQE converts the sort merge join to broad cast hash join when the runtime statistics of any join side is smaller than the broadcast hash join threshold. This feature can optimize the shuffle reader to local shuffle reader after converting the sort merge join to broadcast hash join at runtime and if no additional shuffle is introduced. It takes effect when both `spark.sql.adaptive.enabled` and `spark.sql.adaptive.localShuffleReader.enabled` configuration properties are enabled. This feature can improve the performance by saving the network overhead of shuffle process.
 
 Review comment:
   ```
   AQE converts sort-merge join to broadcast hash join when the runtime statistics of
   any join side are smaller than the broadcast hash join threshold. This is not as efficient as
   planning a broadcast hash join in the first place, but it's better than continuing with the
   sort-merge join, as we can skip sorting both join sides and read shuffle files locally to save network
   traffic (if `spark.sql.adaptive.localShuffleReader.enabled` is true).
   ```
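
The decision described in this comment can be sketched as a small function. This is a toy model under stated assumptions: the function name and return values are hypothetical, and the 10 MB default stands in for the broadcast hash join threshold (Spark reads it from `spark.sql.autoBroadcastJoinThreshold`); the real planner considers more than two byte counts.

```python
def choose_join_strategy(left_bytes, right_bytes,
                         broadcast_threshold=10 * 1024 * 1024):
    """Return (strategy, side_to_broadcast) from runtime size statistics."""
    # Look at the smaller side first: it is the cheaper one to broadcast.
    side, size = min((("left", left_bytes), ("right", right_bytes)),
                     key=lambda pair: pair[1])
    if size < broadcast_threshold:
        # Broadcasting avoids sorting both sides of the join.
        return ("broadcast_hash_join", side)
    return ("sort_merge_join", None)


MB = 1024 * 1024
print(choose_join_strategy(500 * MB, 2 * MB))
print(choose_join_strategy(500 * MB, 400 * MB))
```

The point of the runtime check is that the 2 MB figure may only become known after the shuffle stage has run, for example when a selective filter shrinks one side far below its static estimate.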



[GitHub] [spark] AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-591815017
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119002/
   Test PASSed.



[GitHub] [spark] JkSelf commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
JkSelf commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-599377361
 
 
   retest this please



[GitHub] [spark] AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-595653407
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r390160787
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,61 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partitions, local shuffle reader optimization and skewed join optimization.
 
 Review comment:
   Where is the description for `skewed join optimization`?



[GitHub] [spark] cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r391442548
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,61 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partitions, local shuffle reader optimization and skewed join optimization.
+ ### Coalescing Post Shuffle Partition Number
+ This feature coalesces the post shuffle partitions based on the map output statistics when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configuration properties are both enabled. There are four following sub-configurations in this optimization rule. This feature simplifies the tuning of shuffle partitions number when running queries. You don't need to set a proper shuffle partition number to fit your dataset. You just need to set a large enough number and Spark can pick the proper shuffle partition number at runtime.
+ <table class="table">
+   <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.enabled</code></td>
+     <td>true</td>
+     <td>
+       When true and <code>spark.sql.adaptive.enabled</code> is enabled, spark will reduce the post shuffle partitions number based on the map output statistics.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.minPartitionNum</code></td>
+     <td>1</td>
+     <td>
+       The advisory minimum number of post-shuffle partitions used when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled. It is suggested to be almost 2~3x of the parallelism when doing benchmark.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.initialPartitionNum</code></td>
+     <td>200</td>
+     <td>
+       The advisory number of post-shuffle partitions used in adaptive execution. This is used as the initial number of pre-shuffle partitions. By default it equals to <code>spark.sql.shuffle.partitions</code>.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code></td>
+     <td>67108864 (64 MB)</td>
 
 Review comment:
   We can just write `64 MB`
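
As a quick check that the two spellings in the quoted table cell agree, the arithmetic is 64 × 1024 × 1024 = 67108864. A small sketch of that conversion follows; the helper `format_bytes` is a hypothetical name for this illustration and assumes binary (1024-based) units.

```python
def format_bytes(num_bytes):
    """Format a byte count using binary (1024-based) units."""
    units = ["B", "KB", "MB", "GB", "TB"]
    value = float(num_bytes)
    for unit in units[:-1]:
        if value < 1024:
            return f"{value:g} {unit}"
        value /= 1024
    return f"{value:g} {units[-1]}"


print(format_bytes(67108864))
```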



[GitHub] [spark] SparkQA commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-595762187
 
 
   **[Test build #119455 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119455/testReport)** for PR 27616 at commit [`4fdbe10`](https://github.com/apache/spark/commit/4fdbe10686476acf1effc420efa41faa3243cd6e).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r390164628
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,61 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partitions, local shuffle reader optimization and skewed join optimization.
+ ### Coalescing Post Shuffle Partition Number
+ This feature coalesces the post shuffle partitions based on the map output statistics when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configuration properties are both enabled. There are four following sub-configurations in this optimization rule. This feature simplifies the tuning of shuffle partitions number when running queries. You don't need to set a proper shuffle partition number to fit your dataset. You just need to set a large enough number and Spark can pick the proper shuffle partition number at runtime.
+ <table class="table">
+   <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.enabled</code></td>
+     <td>true</td>
+     <td>
+       When true and <code>spark.sql.adaptive.enabled</code> is enabled, spark will reduce the post shuffle partitions number based on the map output statistics.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.minPartitionNum</code></td>
+     <td>1</td>
+     <td>
+       The advisory minimum number of post-shuffle partitions used when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled. It is suggested to be almost 2~3x of the parallelism when doing benchmark.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.initialPartitionNum</code></td>
+     <td>200</td>
+     <td>
+       The advisory number of post-shuffle partitions used in adaptive execution. This is used as the initial number of pre-shuffle partitions. By default it equals to <code>spark.sql.shuffle.partitions</code>.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code></td>
+     <td>67108864 (64 MB)</td>
+     <td>
+       The target post-shuffle input size in bytes of a task when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled.
+     </td>
+   </tr>
+ </table>
+ 
+ ### Optimize Local Shuffle Reader
+ This feature optimize the shuffle reader to local shuffle reader when converting the sort merge join to broadcast hash join in runtime and no additional shuffle introduced. It takes effect when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.localShuffleReader.enabled` configuration properties are both enabled. This feature can improve the performance by saving the network overhead of shuffle process.
+ ### Optimize Skewed Join
+ This feature choose the skewed partition and creates multi tasks to handle the skewed partition when both enable `spark.sql.adaptive.enabled` and `spark.sql.adaptive.skewJoin.enabled`. There are two following sub-configurations in this optimization rule. Data skew can severely downgrade performance of join queries. And this feature can split the skewed partition into multi parallel tasks instead of original 1 task to reduce the overhead of skewed join.
 
 Review comment:
   And this feature can split the skewed partition into multi parallel tasks instead of original 1 task to reduce the overhead of skewed join ->
   
   This feature splits the skewed partition into multiple parallel tasks instead of the original single task in order to reduce the overhead of skewed join.
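
To illustrate "multiple parallel tasks instead of the original single task", here is a simplified sketch that splits one skewed partition by grouping the map outputs that feed it. The function name, the greedy grouping, and the sample sizes are assumptions for illustration; this does not claim to reproduce Spark's actual skew-handling code.

```python
def split_skewed_partition(map_output_sizes, target_size):
    """Group consecutive map outputs of one skewed partition into tasks.

    Returns a list of tasks, each a list of mapper indices whose combined
    size stays near target_size.
    """
    tasks, current, running = [], [], 0
    for mapper, size in enumerate(map_output_sizes):
        # Start a new task once the current one would exceed the target.
        if current and running + size > target_size:
            tasks.append(current)
            current, running = [], 0
        current.append(mapper)
        running += size
    if current:
        tasks.append(current)
    return tasks


# One 150 MB partition fed by five 30 MB map outputs, with a 64 MB target.
print(split_skewed_partition([30, 30, 30, 30, 30], target_size=64))
```

Each resulting task joins its slice of the skewed partition against the matching partition on the other side, so the work that a single straggler task would have done is spread across several tasks.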



[GitHub] [spark] HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r390166302
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,61 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partitions, local shuffle reader optimization and skewed join optimization.
+ ### Coalescing Post Shuffle Partition Number
+ This feature coalesces the post shuffle partitions based on the map output statistics when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configuration properties are both enabled. There are four following sub-configurations in this optimization rule. This feature simplifies the tuning of shuffle partitions number when running queries. You don't need to set a proper shuffle partition number to fit your dataset. You just need to set a large enough number and Spark can pick the proper shuffle partition number at runtime.
 
 Review comment:
   when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configuration properties are both enabled. ->
   
   when both `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configurations are enabled.



[GitHub] [spark] cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r381281574
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,75 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partition number, optimizing local shuffle reader and optimizing skewed join.
 
 Review comment:
   nit: `coalescing post-shuffle partitions, local shuffle reader optimization and skewed join optimization.`



[GitHub] [spark] cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r380469954
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,75 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that make use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. There are three mainly feature in AQE, including coalescing post partition number, optimizing local shuffle reader and optimizing skewed join.
+ ### Coalescing Post Shuffle Partition Num
 
 Review comment:
   Number



[GitHub] [spark] cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r392834301
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,63 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing coalescing post-shuffle partitions, converting sort-merge join to broadcast join, and skewed join optimization.
+
+### Coalescing Post Shuffle Partition Number
+This feature coalesces the post shuffle partitions based on the map output statistics when both `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configuration properties are enabled. There are four following sub-configurations in this optimization rule. This feature simplifies the tuning of shuffle partition number when running queries. You do not need to set a proper shuffle partition number to fit your dataset. Spark can pick the proper shuffle partition number at runtime once you set a large enough initial number of shuffle partitions via `spark.sql.adaptive.coalescePartitions.initialPartitionNum` configuration.
+ <table class="table">
+   <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.enabled</code></td>
+     <td>true</td>
+     <td>
+       When true and <code>spark.sql.adaptive.enabled</code> is true, Spark will coalesce contiguous shuffle partitions according to the target size (specified by <code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code>), to avoid too many small tasks.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.minPartitionNum</code></td>
+     <td>Default Parallelism</td>
+     <td>
+       The minimum number of shuffle partitions after coalescing. If not set, the default value is the default parallelism of the Spark cluster. This configuration only has an effect when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.initialPartitionNum</code></td>
+     <td>200</td>
+     <td>
+       The initial number of shuffle partitions before coalescing. By default it equals to <code>spark.sql.shuffle.partitions</code>. This configuration only has an effect when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code></td>
+     <td>64 MB</td>
+     <td>
+       The advisory size in bytes of the shuffle partition during adaptive optimization (when <code>spark.sql.adaptive.enabled</code> is true). It takes effect when Spark coalesces small shuffle partitions or splits skewed shuffle partition.
+     </td>
+   </tr>
+ </table>
+ 
+### Optimize Local Shuffle Reader
 
 Review comment:
   Converting sort-merge join to broadcast join
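
   Reading aid for the quoted hunk above (not part of the PR): the table describes coalescing contiguous shuffle partitions toward a target size given by `spark.sql.adaptive.advisoryPartitionSizeInBytes`. The block below is a minimal sketch of that idea in plain Python, assuming a simple greedy pass. The function name `coalesce_contiguous` and the sample sizes are hypothetical, and Spark's actual rule may differ (for example, it also honors `spark.sql.adaptive.coalescePartitions.minPartitionNum`).

```python
from typing import List


def coalesce_contiguous(sizes: List[int], target_size: int) -> List[List[int]]:
    """Greedily group contiguous partition indices.

    A group is closed as soon as adding the next partition would push it
    past target_size. A single partition larger than target_size ends up
    alone in its own group. This is an illustrative assumption, not
    Spark's implementation.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for index, size in enumerate(sizes):
        # Close the running group if this partition would overflow it.
        if current and current_size + size > target_size:
            groups.append(current)
            current = []
            current_size = 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Hypothetical map-output sizes in MB, with a 64 MB advisory target.
example_groups = coalesce_contiguous([10, 20, 30, 70, 5, 5], 64)
```

   With these sample sizes, six small post-shuffle partitions become three reduce tasks, which is the "avoid too many small tasks" effect the table mentions.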



[GitHub] [spark] SparkQA removed a comment on issue #27616: [SPARK-30864][SQL][DOC] add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27616: [SPARK-30864][SQL][DOC] add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-599536311
 
 
   **[Test build #119868 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119868/testReport)** for PR 27616 at commit [`8289200`](https://github.com/apache/spark/commit/82892008104b2b35e1c3260889fd65fa4004c101).



[GitHub] [spark] AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-599347352
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r392166793
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,63 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partitions, local shuffle reader optimization and skewed join optimization.
 
 Review comment:
   `local shuffle reader optimization` is just one improvement of `converting sort-merge join to broadcast join`



[GitHub] [spark] SparkQA commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-587319348
 
 
   **[Test build #118618 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118618/testReport)** for PR 27616 at commit [`727f57f`](https://github.com/apache/spark/commit/727f57f1bfba53a486f87776646d160eb8061258).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r380470558
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,75 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that make use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. There are three mainly feature in AQE, including coalescing post partition number, optimizing local shuffle reader and optimizing skewed join.
+ ### Coalescing Post Shuffle Partition Num
+ This feature coalesces the post shuffle partitions based on the map output statistics when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled` configuration properties are both enabled. There are four following sub-configurations in this optimization rule. 
 
 Review comment:
   shall we introduce the benefits of this feature?



[GitHub] [spark] AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-591734040
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23750/
   Test PASSed.



[GitHub] [spark] SparkQA commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-587314887
 
 
   **[Test build #118618 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118618/testReport)** for PR 27616 at commit [`727f57f`](https://github.com/apache/spark/commit/727f57f1bfba53a486f87776646d160eb8061258).



[GitHub] [spark] HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r390162568
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,61 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partitions, local shuffle reader optimization and skewed join optimization.
+ ### Coalescing Post Shuffle Partition Number
+ This feature coalesces the post shuffle partitions based on the map output statistics when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configuration properties are both enabled. There are four following sub-configurations in this optimization rule. This feature simplifies the tuning of shuffle partitions number when running queries. You don't need to set a proper shuffle partition number to fit your dataset. You just need to set a large enough number and Spark can pick the proper shuffle partition number at runtime.
+ <table class="table">
+   <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.enabled</code></td>
+     <td>true</td>
+     <td>
+       When true and <code>spark.sql.adaptive.enabled</code> is enabled, spark will reduce the post shuffle partitions number based on the map output statistics.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.minPartitionNum</code></td>
+     <td>1</td>
+     <td>
+       The advisory minimum number of post-shuffle partitions used when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled. It is suggested to be almost 2~3x of the parallelism when doing benchmark.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.initialPartitionNum</code></td>
+     <td>200</td>
+     <td>
+       The advisory number of post-shuffle partitions used in adaptive execution. This is used as the initial number of pre-shuffle partitions. By default it equals to <code>spark.sql.shuffle.partitions</code>.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code></td>
+     <td>67108864 (64 MB)</td>
+     <td>
+       The target post-shuffle input size in bytes of a task when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled.
+     </td>
+   </tr>
+ </table>
+ 
+ ### Optimize Local Shuffle Reader
+ This feature optimize the shuffle reader to local shuffle reader when converting the sort merge join to broadcast hash join in runtime and no additional shuffle introduced. It takes effect when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.localShuffleReader.enabled` configuration properties are both enabled. This feature can improve the performance by saving the network overhead of shuffle process.
+ ### Optimize Skewed Join
+ This feature choose the skewed partition and creates multi tasks to handle the skewed partition when both enable `spark.sql.adaptive.enabled` and `spark.sql.adaptive.skewJoin.enabled`. There are two following sub-configurations in this optimization rule. Data skew can severely downgrade performance of join queries. And this feature can split the skewed partition into multi parallel tasks instead of original 1 task to reduce the overhead of skewed join.
 
 Review comment:
   I think you should state "Data skew can severely downgrade performance of join queries" first if the intent is to describe the background.
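
   Reading aid for the quoted hunk above (not part of the PR): the "Optimize Skewed Join" paragraph says a skewed partition is handled by several parallel tasks instead of one. The block below is a minimal sketch of that splitting step in plain Python, assuming the split is driven only by an advisory size in bytes. The function name `split_skewed_partition` and the numbers are hypothetical, and the way Spark detects skew and chooses split points is not shown here.

```python
from typing import List


def split_skewed_partition(partition_size: int, advisory_size: int) -> List[int]:
    """Return the sizes of the sub-tasks for one skewed partition.

    The partition is cut into chunks of advisory_size, plus one smaller
    trailing chunk for the remainder. Illustrative assumption only.
    """
    if advisory_size <= 0:
        raise ValueError("advisory_size must be positive")
    full_chunks, remainder = divmod(partition_size, advisory_size)
    chunks = [advisory_size] * full_chunks
    if remainder:
        chunks.append(remainder)
    return chunks


# A hypothetical 200 MB skewed partition with a 64 MB advisory size.
example_chunks = split_skewed_partition(200, 64)
```

   In this sample the single 200 MB task becomes four smaller tasks, so the slowest task in the join stage handles at most 64 MB.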



[GitHub] [spark] SparkQA removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-599379649
 
 
   **[Test build #119843 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119843/testReport)** for PR 27616 at commit [`13aa51b`](https://github.com/apache/spark/commit/13aa51b245ecabf5ab38ba2b446196db5a79cb4e).



[GitHub] [spark] SparkQA commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-591733607
 
 
   **[Test build #119002 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119002/testReport)** for PR 27616 at commit [`52fcdf3`](https://github.com/apache/spark/commit/52fcdf3c40cb36928a4cad7b01e9396fbeb9870f).



[GitHub] [spark] AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-587294911
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r380469476
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,75 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that make use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. There are three mainly feature in AQE, including coalescing post partition number, optimizing local shuffle reader and optimizing skewed join.
 
 Review comment:
   makes use



[GitHub] [spark] AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-599533122
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/24597/
   Test PASSed.



[GitHub] [spark] SparkQA commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-599515684
 
 
   **[Test build #119843 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119843/testReport)** for PR 27616 at commit [`13aa51b`](https://github.com/apache/spark/commit/13aa51b245ecabf5ab38ba2b446196db5a79cb4e).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-587315297
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23371/
   Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-587298472
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-599347356
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/24562/
   Test PASSed.



[GitHub] [spark] cloud-fan commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-598052103
 
 
   A high-level comment: "Optimize Local Shuffle Reader" should be a sub-feature of converting SMJ to BHJ. We should mention that instead.
   
   



[GitHub] [spark] AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-598553535
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119733/
   Test PASSed.



[GitHub] [spark] cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r392832517
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,63 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing coalescing post-shuffle partitions, converting sort-merge join to broadcast join, and skewed join optimization.
 
 Review comment:
   `skew join` is actually the more commonly used term (rather than `skewed join`).



[GitHub] [spark] cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r391442367
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,61 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partitions, local shuffle reader optimization and skewed join optimization.
+ ### Coalescing Post Shuffle Partition Number
+ This feature coalesces the post shuffle partitions based on the map output statistics when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configuration properties are both enabled. There are four following sub-configurations in this optimization rule. This feature simplifies the tuning of shuffle partitions number when running queries. You don't need to set a proper shuffle partition number to fit your dataset. You just need to set a large enough number and Spark can pick the proper shuffle partition number at runtime.
+ <table class="table">
+   <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.enabled</code></td>
+     <td>true</td>
+     <td>
+       When true and <code>spark.sql.adaptive.enabled</code> is enabled, spark will reduce the post shuffle partitions number based on the map output statistics.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.minPartitionNum</code></td>
+     <td>1</td>
+     <td>
+       The advisory minimum number of post-shuffle partitions used when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled. It is suggested to be almost 2~3x of the parallelism when doing benchmark.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.initialPartitionNum</code></td>
+     <td>200</td>
+     <td>
+       The advisory number of post-shuffle partitions used in adaptive execution. This is used as the initial number of pre-shuffle partitions. By default it equals to <code>spark.sql.shuffle.partitions</code>.
 
 Review comment:
   Can we copy the doc text from `SQLConf` here?
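
   For readers following the quoted hunk, the coalescing behaviour it documents can be sketched in a few lines of plain Python. This is a simplified illustration, not Spark's implementation: `coalesce_partitions` and its parameter names are hypothetical, and shrinking the target size to honour the minimum partition number is an assumption made for this sketch.

```python
from math import ceil
from typing import List


def coalesce_partitions(sizes: List[int],
                        advisory_size: int,
                        min_partitions: int = 1) -> List[List[int]]:
    """Group contiguous partition indices so each group stays near a target size."""
    if not sizes:
        return []
    min_partitions = max(1, min_partitions)
    # Assumption: lower the target when the advisory size alone would
    # leave fewer groups than the requested minimum.
    target = min(advisory_size, max(1, ceil(sum(sizes) / min_partitions)))

    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for index, size in enumerate(sizes):
        # Close the current group if adding this partition would exceed the target.
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    groups.append(current)
    return groups
```

   For example, six partitions of sizes `[10, 20, 30, 40, 5, 5]` with an advisory size of 64 collapse into `[[0, 1, 2], [3, 4, 5]]`. Raising `min_partitions` to 3 lowers the target and yields more, smaller groups.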



[GitHub] [spark] AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-599347356
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/24562/
   Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-595762974
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-598553530
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r390161131
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,61 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partitions, local shuffle reader optimization and skewed join optimization.
+ ### Coalescing Post Shuffle Partition Number
+ This feature coalesces the post shuffle partitions based on the map output statistics when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configuration properties are both enabled. There are four following sub-configurations in this optimization rule. This feature simplifies the tuning of shuffle partitions number when running queries. You don't need to set a proper shuffle partition number to fit your dataset. You just need to set a large enough number and Spark can pick the proper shuffle partition number at runtime.
+ <table class="table">
+   <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.enabled</code></td>
+     <td>true</td>
+     <td>
+       When true and <code>spark.sql.adaptive.enabled</code> is enabled, spark will reduce the post shuffle partitions number based on the map output statistics.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.minPartitionNum</code></td>
+     <td>1</td>
+     <td>
+       The advisory minimum number of post-shuffle partitions used when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled. It is suggested to be almost 2~3x of the parallelism when doing benchmark.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.initialPartitionNum</code></td>
+     <td>200</td>
+     <td>
+       The advisory number of post-shuffle partitions used in adaptive execution. This is used as the initial number of pre-shuffle partitions. By default it equals to <code>spark.sql.shuffle.partitions</code>.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code></td>
+     <td>67108864 (64 MB)</td>
+     <td>
+       The target post-shuffle input size in bytes of a task when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled.
+     </td>
+   </tr>
+ </table>
+ 
+ ### Optimize Local Shuffle Reader
+ This feature optimize the shuffle reader to local shuffle reader when converting the sort merge join to broadcast hash join in runtime and no additional shuffle introduced. It takes effect when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.localShuffleReader.enabled` configuration properties are both enabled. This feature can improve the performance by saving the network overhead of shuffle process.
+ ### Optimize Skewed Join
+ This feature choose the skewed partition and creates multi tasks to handle the skewed partition when both enable `spark.sql.adaptive.enabled` and `spark.sql.adaptive.skewJoin.enabled`. There are two following sub-configurations in this optimization rule. Data skew can severely downgrade performance of join queries. And this feature can split the skewed partition into multi parallel tasks instead of original 1 task to reduce the overhead of skewed join.
 
 Review comment:
   `choose` -> `chooses`
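
   As a side note on what "creates multi tasks to handle the skewed partition" could look like, below is a minimal Python sketch. It assumes the split happens along map-output boundaries and uses a single target size. `split_skewed_partition` and its parameters are hypothetical names, and the quoted hunk does not confirm that Spark splits this way.

```python
from typing import List, Tuple


def split_skewed_partition(map_output_sizes: List[int],
                           target_size: int) -> List[Tuple[int, int]]:
    """Return half-open (start, end) ranges of map outputs, one range per task."""
    ranges: List[Tuple[int, int]] = []
    start = 0
    running = 0
    for i, size in enumerate(map_output_sizes):
        # Start a new task once the current range would exceed the target.
        if i > start and running + size > target_size:
            ranges.append((start, i))
            start, running = i, 0
        running += size
    if map_output_sizes:
        ranges.append((start, len(map_output_sizes)))
    return ranges
```

   A skewed partition fed by five map outputs of size 30 each, with a target of 64, would be read by three tasks covering `(0, 2)`, `(2, 4)` and `(4, 5)` instead of by a single task.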



[GitHub] [spark] cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r380471798
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,75 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that make use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. There are three mainly feature in AQE, including coalescing post partition number, optimizing local shuffle reader and optimizing skewed join.
+ ### Coalescing Post Shuffle Partition Num
+ This feature coalesces the post shuffle partitions based on the map output statistics when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled` configuration properties are both enabled. There are four following sub-configurations in this optimization rule. 
+ <table class="table">
+   <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+   <tr>
+     <td><code>spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled</code></td>
+     <td>true</td>
+     <td>
+       When true and <code>spark.sql.adaptive.enabled</code> is enabled, spark will reduce the post shuffle partitions number based on the map output statistics.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.shuffle.minNumPostShufflePartitions</code></td>
+     <td>1</td>
+     <td>
+       The advisory minimum number of post-shuffle partitions used when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled</code> are both enabled. It is suggested to be almost 2~3x of the parallelism when doing benchmark.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.shuffle.maxNumPostShufflePartitions</code></td>
+     <td>Int.MaxValue</td>
+     <td>
+       The advisory maximum number of post-shuffle partitions used in adaptive execution. This is used as the initial number of pre-shuffle partitions. By default it equals to <code>spark.sql.shuffle.partitions</code>.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.shuffle.targetPostShuffleInputSize</code></td>
+     <td>67108864 (64 MB)</td>
+     <td>
+       The target post-shuffle input size in bytes of a task when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled</code> are both enabled.
+     </td>
+   </tr>
+ </table>
+ 
+ ### Optimize Local Shuffle Reader
+ This feature optimize the shuffle reader to local shuffle reader when converting the sort merge join to broadcast hash join in runtime and no additional shuffle introduced. It takes effect when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.shuffle.localShuffleReader.enabled` configuration properties are both enabled.
 
 Review comment:
   Ditto: users care more about the benefit than about the mechanism.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-595762987
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119455/
   Test PASSed.



[GitHub] [spark] SparkQA removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-591735908
 
 
   **[Test build #119003 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119003/testReport)** for PR 27616 at commit [`8ff8b71`](https://github.com/apache/spark/commit/8ff8b712725c69f68b6953b22c26896c4e0760b0).



[GitHub] [spark] AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-587319492
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118618/
   Test PASSed.



[GitHub] [spark] SparkQA commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-591735908
 
 
   **[Test build #119003 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119003/testReport)** for PR 27616 at commit [`8ff8b71`](https://github.com/apache/spark/commit/8ff8b712725c69f68b6953b22c26896c4e0760b0).



[GitHub] [spark] SparkQA removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-587294511
 
 
   **[Test build #118613 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118613/testReport)** for PR 27616 at commit [`19a381b`](https://github.com/apache/spark/commit/19a381b2d5af5128e821d233f5c997730a6d8c36).



[GitHub] [spark] AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-591734035
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] JkSelf commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
JkSelf commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-598493569
 
 
   @HyukjinKwon @cloud-fan  Thanks for your review. Updated.



[GitHub] [spark] cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r392836777
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,63 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing coalescing post-shuffle partitions, converting sort-merge join to broadcast join, and skewed join optimization.
+
+### Coalescing Post Shuffle Partition Number
+This feature coalesces the post shuffle partitions based on the map output statistics when both `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configuration properties are enabled. There are four following sub-configurations in this optimization rule. This feature simplifies the tuning of shuffle partition number when running queries. You do not need to set a proper shuffle partition number to fit your dataset. Spark can pick the proper shuffle partition number at runtime once you set a large enough initial number of shuffle partitions via `spark.sql.adaptive.coalescePartitions.initialPartitionNum` configuration.
+ <table class="table">
+   <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.enabled</code></td>
+     <td>true</td>
+     <td>
+       When true and <code>spark.sql.adaptive.enabled</code> is true, Spark will coalesce contiguous shuffle partitions according to the target size (specified by <code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code>), to avoid too many small tasks.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.minPartitionNum</code></td>
+     <td>Default Parallelism</td>
+     <td>
+       The minimum number of shuffle partitions after coalescing. If not set, the default value is the default parallelism of the Spark cluster. This configuration only has an effect when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.initialPartitionNum</code></td>
+     <td>200</td>
+     <td>
+       The initial number of shuffle partitions before coalescing. By default it equals to <code>spark.sql.shuffle.partitions</code>. This configuration only has an effect when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code></td>
+     <td>64 MB</td>
+     <td>
+       The advisory size in bytes of the shuffle partition during adaptive optimization (when <code>spark.sql.adaptive.enabled</code> is true). It takes effect when Spark coalesces small shuffle partitions or splits skewed shuffle partition.
+     </td>
+   </tr>
+ </table>
+ 
+### Optimize Local Shuffle Reader
+AQE converts the sort merge join to broad cast hash join when the runtime statistics of any join side is smaller than the broadcast hash join threshold. This feature can optimize the shuffle reader to local shuffle reader after converting the sort merge join to broadcast hash join at runtime and if no additional shuffle is introduced. It takes effect when both `spark.sql.adaptive.enabled` and `spark.sql.adaptive.localShuffleReader.enabled` configuration properties are enabled. This feature can improve the performance by saving the network overhead of shuffle process.
+
+### Optimize Skewed Join
 
 Review comment:
   Optimizing Skew Join



[GitHub] [spark] AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-599517876
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-598494071
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/24462/
   Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-591811439
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119003/
   Test PASSed.



[GitHub] [spark] HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r390160787
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,61 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether to turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partitions, local shuffle reader optimization and skewed join optimization.
 
 Review comment:
   Where is the description for `skewed join optimization`?



[GitHub] [spark] AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-598494068
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27616: [SPARK-30864][SQL][DOC] add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27616: [SPARK-30864][SQL][DOC] add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-599686305
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119868/
   Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-591815007
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-587319486
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r392834301
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,63 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether to turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partitions, converting sort-merge join to broadcast join, and skewed join optimization.
+
+### Coalescing Post Shuffle Partition Number
+This feature coalesces the post shuffle partitions based on the map output statistics when both `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configuration properties are enabled. There are four following sub-configurations in this optimization rule. This feature simplifies the tuning of shuffle partition number when running queries. You do not need to set a proper shuffle partition number to fit your dataset. Spark can pick the proper shuffle partition number at runtime once you set a large enough initial number of shuffle partitions via `spark.sql.adaptive.coalescePartitions.initialPartitionNum` configuration.
+ <table class="table">
+   <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.enabled</code></td>
+     <td>true</td>
+     <td>
+       When true and <code>spark.sql.adaptive.enabled</code> is true, Spark will coalesce contiguous shuffle partitions according to the target size (specified by <code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code>), to avoid too many small tasks.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.minPartitionNum</code></td>
+     <td>Default Parallelism</td>
+     <td>
+       The minimum number of shuffle partitions after coalescing. If not set, the default value is the default parallelism of the Spark cluster. This configuration only has an effect when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.initialPartitionNum</code></td>
+     <td>200</td>
+     <td>
+       The initial number of shuffle partitions before coalescing. By default it equals to <code>spark.sql.shuffle.partitions</code>. This configuration only has an effect when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code></td>
+     <td>64 MB</td>
+     <td>
+       The advisory size in bytes of the shuffle partition during adaptive optimization (when <code>spark.sql.adaptive.enabled</code> is true). It takes effect when Spark coalesces small shuffle partitions or splits skewed shuffle partition.
+     </td>
+   </tr>
+ </table>
+ 
+### Optimize Local Shuffle Reader
 
 Review comment:
   Converting sort-merge join to broadcast hash join
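
   For readers of this thread, the coalescing behaviour described in the quoted table (contiguous shuffle partitions merged toward `spark.sql.adaptive.advisoryPartitionSizeInBytes`, with `minPartitionNum` as a lower bound) can be sketched roughly as below. This is only an illustration under stated assumptions, not Spark's implementation: the function name `coalesce_partitions`, its parameters, and the way the target is lowered to respect the minimum are all hypothetical.

   ```python
   from typing import List


   def coalesce_partitions(sizes: List[int],
                           advisory_size: int,
                           min_partitions: int = 1) -> List[List[int]]:
       """Group contiguous partition indices so each group stays near advisory_size.

       Hypothetical helper for illustration; not part of any Spark API.
       """
       if not sizes:
           return []
       total = sum(sizes)
       # Assumption: lower the target so that roughly min_partitions groups remain.
       # This is a best-effort bound, not a strict guarantee.
       per_partition_cap = max(1, -(-total // min_partitions))  # ceiling division
       target = min(advisory_size, per_partition_cap)

       groups: List[List[int]] = []
       current: List[int] = []
       current_size = 0
       for index, size in enumerate(sizes):
           # Close the current group when adding this partition would exceed the target.
           if current and current_size + size > target:
               groups.append(current)
               current, current_size = [], 0
           current.append(index)
           current_size += size
       groups.append(current)
       return groups


   # Six small map-output partitions, advisory size of 64 units.
   print(coalesce_partitions([10, 20, 30, 64, 5, 5], advisory_size=64))
   ```

   Only neighbouring partitions are merged in this sketch, which mirrors the word "contiguous" in the table text; a partition that is already at or above the target stays on its own.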



[GitHub] [spark] SparkQA commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-599376253
 
 
   **[Test build #119832 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119832/testReport)** for PR 27616 at commit [`13aa51b`](https://github.com/apache/spark/commit/13aa51b245ecabf5ab38ba2b446196db5a79cb4e).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-598493733
 
 
   **[Test build #119733 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119733/testReport)** for PR 27616 at commit [`9edc644`](https://github.com/apache/spark/commit/9edc644a66858b42ed6428caf33da3076bda35f0).



[GitHub] [spark] cloud-fan commented on issue #27616: [SPARK-30864][SQL][DOC] add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on issue #27616: [SPARK-30864][SQL][DOC] add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-599602843
 
 
   thanks, merging to master/3.0!



[GitHub] [spark] JkSelf commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
JkSelf commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-587293946
 
 
   cc @cloud-fan 



[GitHub] [spark] cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r392836209
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,63 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether to turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partitions, converting sort-merge join to broadcast join, and skewed join optimization.
+
+### Coalescing Post Shuffle Partition Number
+This feature coalesces the post shuffle partitions based on the map output statistics when both `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configuration properties are enabled. There are four following sub-configurations in this optimization rule. This feature simplifies the tuning of shuffle partition number when running queries. You do not need to set a proper shuffle partition number to fit your dataset. Spark can pick the proper shuffle partition number at runtime once you set a large enough initial number of shuffle partitions via `spark.sql.adaptive.coalescePartitions.initialPartitionNum` configuration.
+ <table class="table">
+   <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.enabled</code></td>
+     <td>true</td>
+     <td>
+       When true and <code>spark.sql.adaptive.enabled</code> is true, Spark will coalesce contiguous shuffle partitions according to the target size (specified by <code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code>), to avoid too many small tasks.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.minPartitionNum</code></td>
+     <td>Default Parallelism</td>
+     <td>
+       The minimum number of shuffle partitions after coalescing. If not set, the default value is the default parallelism of the Spark cluster. This configuration only has an effect when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.initialPartitionNum</code></td>
+     <td>200</td>
+     <td>
+       The initial number of shuffle partitions before coalescing. By default it equals to <code>spark.sql.shuffle.partitions</code>. This configuration only has an effect when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code></td>
+     <td>64 MB</td>
+     <td>
+       The advisory size in bytes of the shuffle partition during adaptive optimization (when <code>spark.sql.adaptive.enabled</code> is true). It takes effect when Spark coalesces small shuffle partitions or splits skewed shuffle partition.
+     </td>
+   </tr>
+ </table>
+ 
+### Optimize Local Shuffle Reader
+AQE converts the sort merge join to broadcast hash join when the runtime statistics of any join side is smaller than the broadcast hash join threshold. This feature can optimize the shuffle reader to local shuffle reader after converting the sort merge join to broadcast hash join at runtime and if no additional shuffle is introduced. It takes effect when both `spark.sql.adaptive.enabled` and `spark.sql.adaptive.localShuffleReader.enabled` configuration properties are enabled. This feature can improve the performance by saving the network overhead of shuffle process.
 
 Review comment:
   ```
   AQE converts sort-merge join to broadcast hash join when the runtime statistics of
   any join side is smaller than the broadcast hash join threshold. This is not as efficient as
   planning a broadcast hash join in the first place, but it's better than continuing with the
   sort-merge join, as we can skip the sorting and read shuffle files locally to save network
   traffic (if `spark.sql.adaptive.localShuffleReader.enabled` is true).
   ```
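
   The decision described in the suggested wording can be sketched as a small rule. This is a toy model under assumptions, not Spark's planner: `JoinSide`, `choose_join_strategy`, and the parameter `broadcast_threshold_bytes` are hypothetical names, and picking the smaller qualifying side as the build side is an assumption made for the example.

   ```python
   from dataclasses import dataclass
   from typing import Optional, Tuple


   @dataclass
   class JoinSide:
       """Hypothetical record of one join input and its runtime size."""
       name: str
       runtime_size_bytes: int


   def choose_join_strategy(left: JoinSide,
                            right: JoinSide,
                            broadcast_threshold_bytes: int,
                            local_shuffle_reader_enabled: bool
                            ) -> Tuple[str, Optional[str], bool]:
       """Return (strategy, build side name, whether shuffle files are read locally)."""
       # Sides whose runtime size is below the threshold can be broadcast.
       candidates = [side for side in (left, right)
                     if side.runtime_size_bytes < broadcast_threshold_bytes]
       if not candidates:
           # Neither side is small enough: keep the planned sort-merge join.
           return ("sort_merge_join", None, False)
       # Assumption: broadcast the smaller qualifying side.
       build_side = min(candidates, key=lambda side: side.runtime_size_bytes)
       # Local reads only apply after the conversion, and only if the flag is on.
       return ("broadcast_hash_join", build_side.name, local_shuffle_reader_enabled)


   mb = 1024 * 1024
   print(choose_join_strategy(JoinSide("left", 5 * mb), JoinSide("right", 500 * mb),
                              broadcast_threshold_bytes=10 * mb,
                              local_shuffle_reader_enabled=True))
   ```

   The third element of the result corresponds to the parenthetical in the suggestion: the local shuffle read is a follow-up saving that only matters once the join has been converted.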



[GitHub] [spark] SparkQA commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-599379649
 
 
   **[Test build #119843 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119843/testReport)** for PR 27616 at commit [`13aa51b`](https://github.com/apache/spark/commit/13aa51b245ecabf5ab38ba2b446196db5a79cb4e).



[GitHub] [spark] AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-591815007
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] HyukjinKwon commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-596965096
 
 
   cc @maryannxue 



[GitHub] [spark] cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r381281759
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,75 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether to turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partition number, optimizing local shuffle reader and optimizing skewed join.
+ ### Coalescing Post Shuffle Partition Number
+ This feature coalesces the post shuffle partitions based on the map output statistics when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled` configuration properties are both enabled. There are four following sub-configurations in this optimization rule. And this feature can bring about 1.28x performance gain with query 38 in 3TB TPC-DS.
 
 Review comment:
   let's update to the latest config names.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-599380125
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/24573/
   Test PASSed.



[GitHub] [spark] HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r390164791
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,61 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether to turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partitions, local shuffle reader optimization and skewed join optimization.
+ ### Coalescing Post Shuffle Partition Number
+ This feature coalesces the post shuffle partitions based on the map output statistics when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configuration properties are both enabled. There are four following sub-configurations in this optimization rule. This feature simplifies the tuning of shuffle partitions number when running queries. You don't need to set a proper shuffle partition number to fit your dataset. You just need to set a large enough number and Spark can pick the proper shuffle partition number at runtime.
+ <table class="table">
+   <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.enabled</code></td>
+     <td>true</td>
+     <td>
+       When true and <code>spark.sql.adaptive.enabled</code> is enabled, spark will reduce the post shuffle partitions number based on the map output statistics.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.minPartitionNum</code></td>
+     <td>1</td>
+     <td>
+       The advisory minimum number of post-shuffle partitions used when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled. It is suggested to be almost 2~3x of the parallelism when doing benchmark.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.initialPartitionNum</code></td>
+     <td>200</td>
+     <td>
+       The advisory number of post-shuffle partitions used in adaptive execution. This is used as the initial number of pre-shuffle partitions. By default it equals to <code>spark.sql.shuffle.partitions</code>.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code></td>
+     <td>67108864 (64 MB)</td>
+     <td>
+       The target post-shuffle input size in bytes of a task when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled.
+     </td>
+   </tr>
+ </table>
+ 
+ ### Optimize Local Shuffle Reader
+ This feature optimize the shuffle reader to local shuffle reader when converting the sort merge join to broadcast hash join in runtime and no additional shuffle introduced. It takes effect when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.localShuffleReader.enabled` configuration properties are both enabled. This feature can improve the performance by saving the network overhead of shuffle process.
 
 Review comment:
   `optimize` -> `optimizes`

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r390158893
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,61 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partitions, local shuffle reader optimization and skewed join optimization.
+ ### Coalescing Post Shuffle Partition Number
 
 Review comment:
   There's a leading space here which should be removed. Also, a newline should be inserted before '###'.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-595653415
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/24189/
   Test PASSed.



[GitHub] [spark] HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r390168547
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,61 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partitions, local shuffle reader optimization and skewed join optimization.
+ ### Coalescing Post Shuffle Partition Number
+ This feature coalesces the post shuffle partitions based on the map output statistics when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configuration properties are both enabled. There are four following sub-configurations in this optimization rule. This feature simplifies the tuning of shuffle partitions number when running queries. You don't need to set a proper shuffle partition number to fit your dataset. You just need to set a large enough number and Spark can pick the proper shuffle partition number at runtime.
 
 Review comment:
   > You don't need to set a proper shuffle partition number to fit your dataset. You just need to set a large enough number and Spark can pick the proper shuffle partition number at runtime.
   
   This sounds like a marketing phrase. I would write something like: Spark picks a proper shuffle partition number at runtime once you set a large enough initial number.
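
   To make the "set a large enough number, then coalesce at runtime" idea concrete, here is a minimal sketch of coalescing adjacent partitions by size. It is not Spark's implementation: the function name `coalesce_partitions` and its parameters are hypothetical, with `target_size` playing the role of the advisory partition size and `min_partitions` the role of the advisory minimum partition number from the table above.

```python
from typing import List


def coalesce_partitions(
    sizes: List[int], target_size: int, min_partitions: int = 1
) -> List[List[int]]:
    """Greedily group adjacent partitions (by index) by size.

    Illustrative only; names and the exact rule are assumptions,
    not Spark's actual algorithm.
    """
    if not sizes:
        return []

    # Lower the target when it would leave too few groups. This only
    # nudges the result toward min_partitions; like the config, it is
    # advisory rather than a strict guarantee.
    total = sum(sizes)
    per_group_cap = max(1, -(-total // min_partitions))  # ceiling division
    effective_target = min(target_size, per_group_cap)

    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for index, size in enumerate(sizes):
        # Close the current group if adding this partition would overshoot.
        if current and current_size + size > effective_target:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    groups.append(current)
    return groups


if __name__ == "__main__":
    # Six small post-shuffle partitions and a target of 64 size units.
    print(coalesce_partitions([10, 20, 30, 40, 5, 5], target_size=64))
```

   The point of the sketch is that the initial partition count only needs to be large enough: the groups are derived from the observed sizes, not from the initial count.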



[GitHub] [spark] AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-599517894
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119843/
   Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-599347352
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] cloud-fan closed pull request #27616: [SPARK-30864][SQL][DOC] add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #27616: [SPARK-30864][SQL][DOC] add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616
 
 
   



[GitHub] [spark] AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-587294911
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-591811439
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119003/
   Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-595762974
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r390161407
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,61 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partitions, local shuffle reader optimization and skewed join optimization.
+ ### Coalescing Post Shuffle Partition Number
+ This feature coalesces the post shuffle partitions based on the map output statistics when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configuration properties are both enabled. There are four following sub-configurations in this optimization rule. This feature simplifies the tuning of shuffle partitions number when running queries. You don't need to set a proper shuffle partition number to fit your dataset. You just need to set a large enough number and Spark can pick the proper shuffle partition number at runtime.
+ <table class="table">
+   <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.enabled</code></td>
+     <td>true</td>
+     <td>
+       When true and <code>spark.sql.adaptive.enabled</code> is enabled, spark will reduce the post shuffle partitions number based on the map output statistics.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.minPartitionNum</code></td>
+     <td>1</td>
+     <td>
+       The advisory minimum number of post-shuffle partitions used when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled. It is suggested to be almost 2~3x of the parallelism when doing benchmark.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.initialPartitionNum</code></td>
+     <td>200</td>
+     <td>
+       The advisory number of post-shuffle partitions used in adaptive execution. This is used as the initial number of pre-shuffle partitions. By default it equals to <code>spark.sql.shuffle.partitions</code>.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code></td>
+     <td>67108864 (64 MB)</td>
+     <td>
+       The target post-shuffle input size in bytes of a task when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled.
+     </td>
+   </tr>
+ </table>
+ 
+ ### Optimize Local Shuffle Reader
+ This feature optimize the shuffle reader to local shuffle reader when converting the sort merge join to broadcast hash join in runtime and no additional shuffle introduced. It takes effect when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.localShuffleReader.enabled` configuration properties are both enabled. This feature can improve the performance by saving the network overhead of shuffle process.
+ ### Optimize Skewed Join
+ This feature choose the skewed partition and creates multi tasks to handle the skewed partition when both enable `spark.sql.adaptive.enabled` and `spark.sql.adaptive.skewJoin.enabled`. There are two following sub-configurations in this optimization rule. Data skew can severely downgrade performance of join queries. And this feature can split the skewed partition into multi parallel tasks instead of original 1 task to reduce the overhead of skewed join.
 
 Review comment:
   when enabling both `spark.sql.adaptive.enabled` and `spark.sql.adaptive.skewJoin.enabled`.
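
   As a rough illustration of "split the skewed partition into multiple parallel tasks", the sketch below flags partitions that are much larger than the median and decides how many tasks each should get. Everything here is an assumption for illustration: the function `plan_skew_splits`, the median-based rule, and the parameters `skew_factor` and `min_skew_size` are hypothetical and do not correspond to the sub-configurations of this PR, which are not shown in this hunk.

```python
import math
from statistics import median
from typing import Dict, List


def plan_skew_splits(
    sizes: List[int], target_size: int, skew_factor: float, min_skew_size: int
) -> Dict[int, int]:
    """Return {partition index: number of tasks} for partitions judged skewed.

    Illustrative rule (an assumption, not Spark's): a partition is skewed
    when it exceeds both skew_factor * median size and min_skew_size.
    """
    if not sizes:
        return {}

    typical = median(sizes)
    plan: Dict[int, int] = {}
    for index, size in enumerate(sizes):
        is_skewed = size > skew_factor * typical and size > min_skew_size
        if is_skewed:
            # Split into roughly target-sized chunks, at least two tasks.
            plan[index] = max(2, math.ceil(size / target_size))
    return plan


if __name__ == "__main__":
    # One partition (index 3) is far larger than the rest.
    print(plan_skew_splits([10, 12, 11, 200, 9], target_size=50,
                           skew_factor=5, min_skew_size=100))
```

   Partitions that are not flagged keep their single task, so only the skewed side of the join pays the cost of extra tasks.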



[GitHub] [spark] SparkQA commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-587298385
 
 
   **[Test build #118613 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118613/testReport)** for PR 27616 at commit [`19a381b`](https://github.com/apache/spark/commit/19a381b2d5af5128e821d233f5c997730a6d8c36).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-587298476
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118613/
   Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-595653415
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/24189/
   Test PASSed.



[GitHub] [spark] SparkQA commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-599347058
 
 
   **[Test build #119832 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119832/testReport)** for PR 27616 at commit [`13aa51b`](https://github.com/apache/spark/commit/13aa51b245ecabf5ab38ba2b446196db5a79cb4e).



[GitHub] [spark] JkSelf commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
JkSelf commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r380492530
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,75 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that make use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. There are three mainly feature in AQE, including coalescing post partition number, optimizing local shuffle reader and optimizing skewed join.
 
 Review comment:
   updated.



[GitHub] [spark] HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r390165994
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,61 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partitions, local shuffle reader optimization and skewed join optimization.
+ ### Coalescing Post Shuffle Partition Number
+ This feature coalesces the post shuffle partitions based on the map output statistics when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configuration properties are both enabled. There are four following sub-configurations in this optimization rule. This feature simplifies the tuning of shuffle partitions number when running queries. You don't need to set a proper shuffle partition number to fit your dataset. You just need to set a large enough number and Spark can pick the proper shuffle partition number at runtime.
+ <table class="table">
+   <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.enabled</code></td>
+     <td>true</td>
+     <td>
+       When true and <code>spark.sql.adaptive.enabled</code> is enabled, spark will reduce the post shuffle partitions number based on the map output statistics.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.minPartitionNum</code></td>
+     <td>1</td>
+     <td>
+       The advisory minimum number of post-shuffle partitions used when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled. It is suggested to be almost 2~3x of the parallelism when doing benchmark.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.initialPartitionNum</code></td>
+     <td>200</td>
+     <td>
+       The advisory number of post-shuffle partitions used in adaptive execution. This is used as the initial number of pre-shuffle partitions. By default it equals to <code>spark.sql.shuffle.partitions</code>.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code></td>
+     <td>67108864 (64 MB)</td>
+     <td>
+       The target post-shuffle input size in bytes of a task when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled.
+     </td>
+   </tr>
+ </table>
+ 
+ ### Optimize Local Shuffle Reader
+ This feature optimize the shuffle reader to local shuffle reader when converting the sort merge join to broadcast hash join in runtime and no additional shuffle introduced. It takes effect when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.localShuffleReader.enabled` configuration properties are both enabled. This feature can improve the performance by saving the network overhead of shuffle process.
 
 Review comment:
   when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.localShuffleReader.enabled` configuration properties are both enabled. ->
   
   when both `spark.sql.adaptive.enabled` and `spark.sql.adaptive.localShuffleReader.enabled` configurations are enabled.



[GitHub] [spark] SparkQA removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-591733607
 
 
   **[Test build #119002 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119002/testReport)** for PR 27616 at commit [`52fcdf3`](https://github.com/apache/spark/commit/52fcdf3c40cb36928a4cad7b01e9396fbeb9870f).



[GitHub] [spark] cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r392833049
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,63 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing coalescing post-shuffle partitions, converting sort-merge join to broadcast join, and skewed join optimization.
+
+### Coalescing Post Shuffle Partition Number
+This feature coalesces the post shuffle partitions based on the map output statistics when both `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configuration properties are enabled. There are four following sub-configurations in this optimization rule. This feature simplifies the tuning of shuffle partition number when running queries. You do not need to set a proper shuffle partition number to fit your dataset. Spark can pick the proper shuffle partition number at runtime once you set a large enough initial number of shuffle partitions via `spark.sql.adaptive.coalescePartitions.initialPartitionNum` configuration.
 
 Review comment:
   `when both xxx and yyy configuration properties are enabled` reads a little weird. How about `when both xxx and yyy configurations are true`?



[GitHub] [spark] AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-587319492
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118618/



[GitHub] [spark] SparkQA commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-599536311
 
 
   **[Test build #119868 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119868/testReport)** for PR 27616 at commit [`8289200`](https://github.com/apache/spark/commit/82892008104b2b35e1c3260889fd65fa4004c101).



[GitHub] [spark] cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r392834386
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,63 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing coalescing post-shuffle partitions, converting sort-merge join to broadcast join, and skewed join optimization.
+
+### Coalescing Post Shuffle Partition Number
+This feature coalesces the post shuffle partitions based on the map output statistics when both `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configuration properties are enabled. There are four following sub-configurations in this optimization rule. This feature simplifies the tuning of shuffle partition number when running queries. You do not need to set a proper shuffle partition number to fit your dataset. Spark can pick the proper shuffle partition number at runtime once you set a large enough initial number of shuffle partitions via `spark.sql.adaptive.coalescePartitions.initialPartitionNum` configuration.
+ <table class="table">
+   <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.enabled</code></td>
+     <td>true</td>
+     <td>
+       When true and <code>spark.sql.adaptive.enabled</code> is true, Spark will coalesce contiguous shuffle partitions according to the target size (specified by <code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code>), to avoid too many small tasks.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.minPartitionNum</code></td>
+     <td>Default Parallelism</td>
+     <td>
+       The minimum number of shuffle partitions after coalescing. If not set, the default value is the default parallelism of the Spark cluster. This configuration only has an effect when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.initialPartitionNum</code></td>
+     <td>200</td>
+     <td>
+       The initial number of shuffle partitions before coalescing. By default it equals to <code>spark.sql.shuffle.partitions</code>. This configuration only has an effect when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code></td>
+     <td>64 MB</td>
+     <td>
+       The advisory size in bytes of the shuffle partition during adaptive optimization (when <code>spark.sql.adaptive.enabled</code> is true). It takes effect when Spark coalesces small shuffle partitions or splits skewed shuffle partition.
+     </td>
+   </tr>
+ </table>
+ 
+### Optimize Local Shuffle Reader
+AQE converts the sort merge join to broad cast hash join when the runtime statistics of any join side is smaller than the broadcast hash join threshold. This feature can optimize the shuffle reader to local shuffle reader after converting the sort merge join to broadcast hash join at runtime and if no additional shuffle is introduced. It takes effect when both `spark.sql.adaptive.enabled` and `spark.sql.adaptive.localShuffleReader.enabled` configuration properties are enabled. This feature can improve the performance by saving the network overhead of shuffle process.
 
 Review comment:
   `broad cast` -> `broadcast`
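
   For readers of the hunk above, the runtime conversion it describes can be sketched in plain Python. This is a simplified illustration rather than Spark's planner logic: `choose_join` and its return values are hypothetical, and the 10 MB constant assumes the default of `spark.sql.autoBroadcastJoinThreshold`. The join side whose runtime size falls below the threshold becomes the broadcast side; otherwise the sort merge join is kept.

```python
# Assumed default of spark.sql.autoBroadcastJoinThreshold (10 MB).
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


def choose_join(left_bytes, right_bytes, threshold=BROADCAST_THRESHOLD_BYTES):
    """Pick a join strategy from runtime sizes (hypothetical helper).

    Returns (strategy, build_side); build_side is None for sort merge join.
    """
    # Consider the smaller side first; on a tie the left side is chosen.
    side, size = min((("left", left_bytes), ("right", right_bytes)),
                     key=lambda pair: pair[1])
    if size < threshold:
        return ("broadcast_hash_join", side)
    return ("sort_merge_join", None)


if __name__ == "__main__":
    mb = 1024 * 1024
    print(choose_join(5 * mb, 500 * mb))
    print(choose_join(50 * mb, 500 * mb))
```

   A negative threshold never matches in this sketch, so the sort merge join is always kept in that case.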



[GitHub] [spark] AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-587315297
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23371/



[GitHub] [spark] AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-598553535
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119733/



[GitHub] [spark] AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-587294916
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23366/



[GitHub] [spark] AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-599376437
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119832/



[GitHub] [spark] HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r390160958
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,61 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partitions, local shuffle reader optimization and skewed join optimization.
+ ### Coalescing Post Shuffle Partition Number
+ This feature coalesces the post shuffle partitions based on the map output statistics when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configuration properties are both enabled. There are four following sub-configurations in this optimization rule. This feature simplifies the tuning of shuffle partitions number when running queries. You don't need to set a proper shuffle partition number to fit your dataset. You just need to set a large enough number and Spark can pick the proper shuffle partition number at runtime.
+ <table class="table">
+   <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.enabled</code></td>
+     <td>true</td>
+     <td>
+       When true and <code>spark.sql.adaptive.enabled</code> is enabled, spark will reduce the post shuffle partitions number based on the map output statistics.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.minPartitionNum</code></td>
+     <td>1</td>
+     <td>
+       The advisory minimum number of post-shuffle partitions used when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled. It is suggested to be almost 2~3x of the parallelism when doing benchmark.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.initialPartitionNum</code></td>
+     <td>200</td>
+     <td>
+       The advisory number of post-shuffle partitions used in adaptive execution. This is used as the initial number of pre-shuffle partitions. By default it equals to <code>spark.sql.shuffle.partitions</code>.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code></td>
+     <td>67108864 (64 MB)</td>
+     <td>
+       The target post-shuffle input size in bytes of a task when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled.
+     </td>
+   </tr>
+ </table>
+ 
+ ### Optimize Local Shuffle Reader
+ This feature optimize the shuffle reader to local shuffle reader when converting the sort merge join to broadcast hash join in runtime and no additional shuffle introduced. It takes effect when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.localShuffleReader.enabled` configuration properties are both enabled. This feature can improve the performance by saving the network overhead of shuffle process.
+ ### Optimize Skewed Join
 
 Review comment:
   ditto



[GitHub] [spark] cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r381286065
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,75 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partition number, optimizing local shuffle reader and optimizing skewed join.
+ ### Coalescing Post Shuffle Partition Number
+ This feature coalesces the post shuffle partitions based on the map output statistics when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled` configuration properties are both enabled. There are four following sub-configurations in this optimization rule. And this feature can bring about 1.28x performance gain with query 38 in 3TB TPC-DS.
+ <table class="table">
+   <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+   <tr>
+     <td><code>spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled</code></td>
+     <td>true</td>
+     <td>
+       When true and <code>spark.sql.adaptive.enabled</code> is enabled, spark will reduce the post shuffle partitions number based on the map output statistics.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.shuffle.minNumPostShufflePartitions</code></td>
+     <td>1</td>
+     <td>
+       The advisory minimum number of post-shuffle partitions used when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled</code> are both enabled. It is suggested to be almost 2~3x of the parallelism when doing benchmark.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.shuffle.maxNumPostShufflePartitions</code></td>
+     <td>Int.MaxValue</td>
+     <td>
+       The advisory maximum number of post-shuffle partitions used in adaptive execution. This is used as the initial number of pre-shuffle partitions. By default it equals to <code>spark.sql.shuffle.partitions</code>.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.shuffle.targetPostShuffleInputSize</code></td>
+     <td>67108864 (64 MB)</td>
+     <td>
+       The target post-shuffle input size in bytes of a task when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled</code> are both enabled.
+     </td>
+   </tr>
+ </table>
+ 
+ ### Optimize Local Shuffle Reader
+ This feature optimize the shuffle reader to local shuffle reader when converting the sort merge join to broadcast hash join in runtime and no additional shuffle introduced. It takes effect when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.shuffle.localShuffleReader.enabled` configuration properties are both enabled. This feature and coalescing post shuffle partition number feature can bring about 1.76x performance gain with query 77 in 3TB TPC-DS.  
 
 Review comment:
   ditto, don't put perf numbers in a user guide. Just briefly explain how it affects user queries, e.g. it saves network traffic.



[GitHub] [spark] AmplabJenkins commented on issue #27616: [SPARK-30864][SQL][DOC] add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27616: [SPARK-30864][SQL][DOC] add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-599686296
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r392166603
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,63 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partitions, local shuffle reader optimization and skewed join optimization.
 
 Review comment:
   including coalescing post-shuffle partitions, converting sort-merge join to broadcast join, and skewed join optimization.



[GitHub] [spark] HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r390167314
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,61 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partitions, local shuffle reader optimization and skewed join optimization.
+ ### Coalescing Post Shuffle Partition Number
+ This feature coalesces the post shuffle partitions based on the map output statistics when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configuration properties are both enabled. There are four following sub-configurations in this optimization rule. This feature simplifies the tuning of shuffle partitions number when running queries. You don't need to set a proper shuffle partition number to fit your dataset. You just need to set a large enough number and Spark can pick the proper shuffle partition number at runtime.
 
 Review comment:
   shuffle partitions number -> shuffle partition number



[GitHub] [spark] AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-595653407
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-599517894
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119843/



[GitHub] [spark] SparkQA commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-591814237
 
 
   **[Test build #119002 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119002/testReport)** for PR 27616 at commit [`52fcdf3`](https://github.com/apache/spark/commit/52fcdf3c40cb36928a4cad7b01e9396fbeb9870f).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27616: [SPARK-30864][SQL][DOC] add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27616: [SPARK-30864][SQL][DOC] add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-599686296
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r392832278
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,63 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing coalescing post-shuffle partitions, converting sort-merge join to broadcast join, and skewed join optimization.
 
 Review comment:
   typo: `coalescing coalescing`



[GitHub] [spark] AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-599380117
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] SparkQA removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-595655574
 
 
   **[Test build #119455 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119455/testReport)** for PR 27616 at commit [`4fdbe10`](https://github.com/apache/spark/commit/4fdbe10686476acf1effc420efa41faa3243cd6e).



[GitHub] [spark] HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r390167605
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,61 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partitions, local shuffle reader optimization and skewed join optimization.
+ ### Coalescing Post Shuffle Partition Number
+ This feature coalesces the post shuffle partitions based on the map output statistics when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configuration properties are both enabled. There are four following sub-configurations in this optimization rule. This feature simplifies the tuning of shuffle partitions number when running queries. You don't need to set a proper shuffle partition number to fit your dataset. You just need to set a large enough number and Spark can pick the proper shuffle partition number at runtime.
 
 Review comment:
   shuffle partitions number -> the shuffle partition number



[GitHub] [spark] SparkQA commented on issue #27616: [SPARK-30864][SQL][DOC] add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27616: [SPARK-30864][SQL][DOC] add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-599684525
 
 
   **[Test build #119868 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/119868/testReport)** for PR 27616 at commit [`8289200`](https://github.com/apache/spark/commit/82892008104b2b35e1c3260889fd65fa4004c101).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-599376437
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119832/
   Test FAILed.



[GitHub] [spark] SparkQA removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-587314887
 
 
   **[Test build #118618 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118618/testReport)** for PR 27616 at commit [`727f57f`](https://github.com/apache/spark/commit/727f57f1bfba53a486f87776646d160eb8061258).



[GitHub] [spark] AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-587298472
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-591811430
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-591815017
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119002/
   Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-587319486
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] SparkQA commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-587294511
 
 
   **[Test build #118613 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118613/testReport)** for PR 27616 at commit [`19a381b`](https://github.com/apache/spark/commit/19a381b2d5af5128e821d233f5c997730a6d8c36).



[GitHub] [spark] HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r390167314
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,61 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partitions, local shuffle reader optimization and skewed join optimization.
+ ### Coalescing Post Shuffle Partition Number
+ This feature coalesces the post shuffle partitions based on the map output statistics when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configuration properties are both enabled. There are four following sub-configurations in this optimization rule. This feature simplifies the tuning of shuffle partitions number when running queries. You don't need to set a proper shuffle partition number to fit your dataset. You just need to set a large enough number and Spark can pick the proper shuffle partition number at runtime.
 
 Review comment:
   shuffle partitions number -> a shuffle partition number



[GitHub] [spark] AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-598553530
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-599533110
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] JkSelf commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
JkSelf commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r380492640
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,75 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that make use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. There are three mainly feature in AQE, including coalescing post partition number, optimizing local shuffle reader and optimizing skewed join.
+ ### Coalescing Post Shuffle Partition Num
+ This feature coalesces the post shuffle partitions based on the map output statistics when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled` configuration properties are both enabled. There are four following sub-configurations in this optimization rule. 
+ <table class="table">
+   <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+   <tr>
+     <td><code>spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled</code></td>
+     <td>true</td>
+     <td>
+       When true and <code>spark.sql.adaptive.enabled</code> is enabled, spark will reduce the post shuffle partitions number based on the map output statistics.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.shuffle.minNumPostShufflePartitions</code></td>
+     <td>1</td>
+     <td>
+       The advisory minimum number of post-shuffle partitions used when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled</code> are both enabled. It is suggested to be almost 2~3x of the parallelism when doing benchmark.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.shuffle.maxNumPostShufflePartitions</code></td>
+     <td>Int.MaxValue</td>
+     <td>
+       The advisory maximum number of post-shuffle partitions used in adaptive execution. This is used as the initial number of pre-shuffle partitions. By default it equals to <code>spark.sql.shuffle.partitions</code>.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.shuffle.targetPostShuffleInputSize</code></td>
+     <td>67108864 (64 MB)</td>
+     <td>
+       The target post-shuffle input size in bytes of a task when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled</code> are both enabled.
+     </td>
+   </tr>
+ </table>
+ 
+ ### Optimize Local Shuffle Reader
+ This feature optimize the shuffle reader to local shuffle reader when converting the sort merge join to broadcast hash join in runtime and no additional shuffle introduced. It takes effect when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.shuffle.localShuffleReader.enabled` configuration properties are both enabled.
 
 Review comment:
   Add performance data for all three features.



[GitHub] [spark] HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r390165372
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,61 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partitions, local shuffle reader optimization and skewed join optimization.
+ ### Coalescing Post Shuffle Partition Number
+ This feature coalesces the post shuffle partitions based on the map output statistics when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configuration properties are both enabled. There are four following sub-configurations in this optimization rule. This feature simplifies the tuning of shuffle partitions number when running queries. You don't need to set a proper shuffle partition number to fit your dataset. You just need to set a large enough number and Spark can pick the proper shuffle partition number at runtime.
+ <table class="table">
+   <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.enabled</code></td>
+     <td>true</td>
+     <td>
+       When true and <code>spark.sql.adaptive.enabled</code> is enabled, spark will reduce the post shuffle partitions number based on the map output statistics.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.minPartitionNum</code></td>
+     <td>1</td>
+     <td>
+       The advisory minimum number of post-shuffle partitions used when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled. It is suggested to be almost 2~3x of the parallelism when doing benchmark.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.initialPartitionNum</code></td>
+     <td>200</td>
+     <td>
+       The advisory number of post-shuffle partitions used in adaptive execution. This is used as the initial number of pre-shuffle partitions. By default it equals to <code>spark.sql.shuffle.partitions</code>.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code></td>
+     <td>67108864 (64 MB)</td>
+     <td>
+       The target post-shuffle input size in bytes of a task when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled.
+     </td>
+   </tr>
+ </table>
+ 
+ ### Optimize Local Shuffle Reader
+ This feature optimize the shuffle reader to local shuffle reader when converting the sort merge join to broadcast hash join in runtime and no additional shuffle introduced. It takes effect when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.localShuffleReader.enabled` configuration properties are both enabled. This feature can improve the performance by saving the network overhead of shuffle process.
 
 Review comment:
   `no additional shuffle introduced` -> `no additional shuffle is introduced`



[GitHub] [spark] HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r390161183
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,61 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partitions, local shuffle reader optimization and skewed join optimization.
+ ### Coalescing Post Shuffle Partition Number
+ This feature coalesces the post shuffle partitions based on the map output statistics when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configuration properties are both enabled. There are four following sub-configurations in this optimization rule. This feature simplifies the tuning of shuffle partitions number when running queries. You don't need to set a proper shuffle partition number to fit your dataset. You just need to set a large enough number and Spark can pick the proper shuffle partition number at runtime.
+ <table class="table">
+   <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.enabled</code></td>
+     <td>true</td>
+     <td>
+       When true and <code>spark.sql.adaptive.enabled</code> is enabled, spark will reduce the post shuffle partitions number based on the map output statistics.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.minPartitionNum</code></td>
+     <td>1</td>
+     <td>
+       The advisory minimum number of post-shuffle partitions used when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled. It is suggested to be almost 2~3x of the parallelism when doing benchmark.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.initialPartitionNum</code></td>
+     <td>200</td>
+     <td>
+       The advisory number of post-shuffle partitions used in adaptive execution. This is used as the initial number of pre-shuffle partitions. By default it equals to <code>spark.sql.shuffle.partitions</code>.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code></td>
+     <td>67108864 (64 MB)</td>
+     <td>
+       The target post-shuffle input size in bytes of a task when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled.
+     </td>
+   </tr>
+ </table>
+ 
+ ### Optimize Local Shuffle Reader
+ This feature optimize the shuffle reader to local shuffle reader when converting the sort merge join to broadcast hash join in runtime and no additional shuffle introduced. It takes effect when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.localShuffleReader.enabled` configuration properties are both enabled. This feature can improve the performance by saving the network overhead of shuffle process.
+ ### Optimize Skewed Join
+ This feature choose the skewed partition and creates multi tasks to handle the skewed partition when both enable `spark.sql.adaptive.enabled` and `spark.sql.adaptive.skewJoin.enabled`. There are two following sub-configurations in this optimization rule. Data skew can severely downgrade performance of join queries. And this feature can split the skewed partition into multi parallel tasks instead of original 1 task to reduce the overhead of skewed join.
 
 Review comment:
   `multi` -> `multiple`



[GitHub] [spark] AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-587315290
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-599517876
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-587294916
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23366/
   Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-591734040
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23750/
   Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27616: [SPARK-30864][SQL][DOC] add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27616: [SPARK-30864][SQL][DOC] add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-599686305
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/119868/
   Test PASSed.



[GitHub] [spark] cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r391442719
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,61 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partitions, local shuffle reader optimization and skewed join optimization.
+ ### Coalescing Post Shuffle Partition Number
+ This feature coalesces the post shuffle partitions based on the map output statistics when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configuration properties are both enabled. There are four following sub-configurations in this optimization rule. This feature simplifies the tuning of shuffle partitions number when running queries. You don't need to set a proper shuffle partition number to fit your dataset. You just need to set a large enough number and Spark can pick the proper shuffle partition number at runtime.
+ <table class="table">
+   <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.enabled</code></td>
+     <td>true</td>
+     <td>
+       When true and <code>spark.sql.adaptive.enabled</code> is enabled, spark will reduce the post shuffle partitions number based on the map output statistics.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.minPartitionNum</code></td>
+     <td>1</td>
+     <td>
+       The advisory minimum number of post-shuffle partitions used when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled. It is suggested to be almost 2~3x of the parallelism when doing benchmark.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.initialPartitionNum</code></td>
+     <td>200</td>
+     <td>
+       The advisory number of post-shuffle partitions used in adaptive execution. This is used as the initial number of pre-shuffle partitions. By default it equals to <code>spark.sql.shuffle.partitions</code>.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code></td>
+     <td>67108864 (64 MB)</td>
+     <td>
+       The target post-shuffle input size in bytes of a task when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled.
 
 Review comment:
   Ditto: let's make sure all the config docs here match the ones in `SQLConf`.
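
   For readers following the thread, the properties in the quoted table could be collected in one place roughly as follows. This is only a sketch: the property names and default values are copied from this revision of the diff and may not match `SQLConf` or a released Spark version, and the helper `to_submit_args` is a hypothetical name introduced here. It renders the settings as `--conf key=value` arguments of the kind `spark-submit` accepts.

```python
# Property names and defaults as quoted in this revision of the diff.
# They are assumptions for illustration and may differ from SQLConf.
aqe_conf = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.minPartitionNum": "1",
    "spark.sql.adaptive.coalescePartitions.initialPartitionNum": "200",
    "spark.sql.adaptive.advisoryPartitionSizeInBytes": str(64 * 1024 * 1024),
}


def to_submit_args(conf):
    """Render a config dict as a flat list of `--conf key=value` arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Print one argument pair per line, ready to paste after `spark-submit`.
    rendered = to_submit_args(aqe_conf)
    for flag, setting in zip(rendered[0::2], rendered[1::2]):
        print(flag, setting)
```

   Keeping the values in a single dict makes it easier to compare them against `SQLConf` when the defaults change.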



[GitHub] [spark] AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-599533122
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/24597/
   Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-598494071
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/24462/
   Test PASSed.



[GitHub] [spark] HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r390158976
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,61 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partitions, local shuffle reader optimization and skewed join optimization.
+ ### Coalescing Post Shuffle Partition Number
+ This feature coalesces the post shuffle partitions based on the map output statistics when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configuration properties are both enabled. There are four following sub-configurations in this optimization rule. This feature simplifies the tuning of shuffle partitions number when running queries. You don't need to set a proper shuffle partition number to fit your dataset. You just need to set a large enough number and Spark can pick the proper shuffle partition number at runtime.
+ <table class="table">
+   <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.enabled</code></td>
+     <td>true</td>
+     <td>
+       When true and <code>spark.sql.adaptive.enabled</code> is enabled, spark will reduce the post shuffle partitions number based on the map output statistics.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.minPartitionNum</code></td>
+     <td>1</td>
+     <td>
+       The advisory minimum number of post-shuffle partitions used when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled. It is suggested to be almost 2~3x of the parallelism when doing benchmark.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.coalescePartitions.initialPartitionNum</code></td>
+     <td>200</td>
+     <td>
+       The advisory number of post-shuffle partitions used in adaptive execution. This is used as the initial number of pre-shuffle partitions. By default it equals to <code>spark.sql.shuffle.partitions</code>.
+     </td>
+   </tr>
+   <tr>
+     <td><code>spark.sql.adaptive.advisoryPartitionSizeInBytes</code></td>
+     <td>67108864 (64 MB)</td>
+     <td>
+       The target post-shuffle input size in bytes of a task when <code>spark.sql.adaptive.enabled</code> and <code>spark.sql.adaptive.coalescePartitions.enabled</code> are both enabled.
+     </td>
+   </tr>
+ </table>
+ 
+ ### Optimize Local Shuffle Reader
 
 Review comment:
   There's another leading space here.



[GitHub] [spark] cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r380469743
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,75 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that make use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. There are three mainly feature in AQE, including coalescing post partition number, optimizing local shuffle reader and optimizing skewed join.
 
 Review comment:
   post-shuffle partition number



[GitHub] [spark] AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-599376434
 
 
   Merged build finished. Test FAILed.



[GitHub] [spark] AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#issuecomment-598494068
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on a change in pull request #27616: [SPARK-30864] [SQL]add the user guide for Adaptive Query Execution
URL: https://github.com/apache/spark/pull/27616#discussion_r390166463
 
 

 ##########
 File path: docs/sql-performance-tuning.md
 ##########
 @@ -186,3 +186,61 @@ The "REPARTITION_BY_RANGE" hint must have column names and a partition number is
     SELECT /*+ REPARTITION(3, c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t
     SELECT /*+ REPARTITION_BY_RANGE(3, c) */ * FROM t
+
+## Adaptive Query Execution
+Adaptive Query Execution (AQE) is an optimization technique in Spark SQL that makes use of the runtime statistics to choose the most efficient query execution plan. AQE is disabled by default. Spark SQL can use the umbrella configuration of `spark.sql.adaptive.enabled` to control whether turn it on/off. As of Spark 3.0, there are three major features in AQE, including coalescing post-shuffle partitions, local shuffle reader optimization and skewed join optimization.
+ ### Coalescing Post Shuffle Partition Number
+ This feature coalesces the post shuffle partitions based on the map output statistics when `spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled` configuration properties are both enabled. There are four following sub-configurations in this optimization rule. This feature simplifies the tuning of shuffle partitions number when running queries. You don't need to set a proper shuffle partition number to fit your dataset. You just need to set a large enough number and Spark can pick the proper shuffle partition number at runtime.
 
 Review comment:
   Please avoid contractions in the public documentation: don't -> do not
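
   As an aside for readers of the quoted paragraph, the idea of coalescing post-shuffle partitions from map output statistics can be illustrated with a small model. This is a simplified sketch and not Spark's implementation: the function `coalesce_partitions` and its parameters are hypothetical, `target_size` stands in for the advisory partition size, and `min_partition_num` stands in for the advisory minimum partition number. It greedily merges adjacent partitions until adding the next one would exceed the target.

```python
from typing import List


def coalesce_partitions(
    sizes: List[int], target_size: int, min_partition_num: int = 1
) -> List[List[int]]:
    """Group adjacent shuffle partition indices so each group stays near target_size.

    Simplified illustration only; this is not Spark's actual algorithm.
    """
    if not sizes:
        return []

    # Lower the target when the data is small, so that the result is not
    # collapsed into far fewer groups than min_partition_num (advisory only).
    total = sum(sizes)
    per_group_cap = max(1, -(-total // min_partition_num))  # ceiling division
    target = min(target_size, per_group_cap)

    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for index, size in enumerate(sizes):
        # Close the current group if adding this partition would exceed the target.
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    groups.append(current)
    return groups


if __name__ == "__main__":
    # Five initial partitions with sizes in arbitrary units and a target of 64.
    print(coalesce_partitions([10, 20, 30, 40, 50], target_size=64))
```

   In this model a large initial partition count is harmless, because small neighbouring partitions are merged at runtime. That is the tuning simplification the paragraph describes.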
