Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/02/21 13:27:11 UTC

[GitHub] [spark] cloud-fan opened a new pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

cloud-fan opened a new pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669
 
 
   
   ### What changes were proposed in this pull request?
   Use the average size of the non-skewed partitions as the target size when splitting skewed partitions, instead of `ADAPTIVE_EXECUTION_SKEWED_PARTITION_SIZE_THRESHOLD`.
   
   ### Why are the changes needed?
   The goal of skew join optimization is to make the data distribution more even, so it makes more sense to use the average size of the non-skewed partitions as the target size.
   
   ### Does this PR introduce any user-facing change?
   no
   
   ### How was this patch tested?
   existing tests

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590736143
 
 
   Merged build finished. Test FAILed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590425152
 
 
   Merged build finished. Test FAILed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590328823
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] dongjoon-hyun commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#discussion_r384162163
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala
 ##########
 @@ -34,6 +34,30 @@ import org.apache.spark.sql.execution.exchange.{EnsureRequirements, ShuffleExcha
 import org.apache.spark.sql.execution.joins.SortMergeJoinExec
 import org.apache.spark.sql.internal.SQLConf
 
+/**
+ * A rule to optimize skewed joins to avoid straggler tasks whose share of data are significantly
+ * larger than those of the rest of the tasks.
+ *
+ * The general idea is to divide each skew partition into smaller partitions and replicate its
+ * matching partition on the other side of the join so that they can run in parallel tasks.
+ * Note that when matching partitions from the left side and the right side both have skew,
+ * it will become a cartesian product of splits from left and right joining together.
+ *
+ * For example, assume the Sort-Merge join has 4 partitions:
+ * left:  [L1, L2, L3, L4]
+ * right: [R1, R2, R3, R4]
+ *
+ * Let's say L2, L4 and R3, R4 are skewed, and each of them get split into 2 sub-partitions. This
+ * is scheduled to run 4 tasks at the beginning: (L1, R1), (L2, R2), (L2, R2), (L2, R2).
 
 Review comment:
   This seems to be a mistake. Did you want to say the following?
   ```
   - (L1, R1), (L2, R2), (L2, R2), (L2, R2).
   + (L1, R1), (L2, R2), (L3, R3), (L4, R4).
   ```
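To make the example in the doc comment concrete, here is a minimal sketch that enumerates the join tasks after splitting. It assumes the reading suggested above (the initial tasks are (L1, R1) through (L4, R4)) and that each skewed partition is split into exactly 2 sub-partitions. The object name, the helper names and the `L2-1` style labels are illustrative and are not taken from the PR.

```scala
object SkewJoinTasks {
  // Labels for one partition on one side: a skewed partition i becomes
  // "<prefix>i-1" .. "<prefix>i-n"; a non-skewed partition stays "<prefix>i".
  def splits(prefix: String, index: Int, skewed: Set[Int], numSplits: Int): Seq[String] =
    if (skewed.contains(index)) (1 to numSplits).map(s => s"$prefix$index-$s")
    else Seq(s"$prefix$index")

  // For every partition id, pair each left split with each right split.
  // When both sides are skewed this is a cartesian product of the splits.
  def tasks(
      numPartitions: Int,
      leftSkewed: Set[Int],
      rightSkewed: Set[Int],
      numSplits: Int): Seq[(String, String)] =
    for {
      i <- 1 to numPartitions
      l <- splits("L", i, leftSkewed, numSplits)
      r <- splits("R", i, rightSkewed, numSplits)
    } yield (l, r)
}
```

Calling `SkewJoinTasks.tasks(4, Set(2, 4), Set(3, 4), 2)` yields 9 tasks under these assumptions: one for partition 1, two each for partitions 2 and 3, and four for partition 4, where both sides are skewed.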



[GitHub] [spark] AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590328836
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23619/
   Test PASSed.



[GitHub] [spark] SparkQA commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590421669
 
 
   **[Test build #118877 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118877/testReport)** for PR 27669 at commit [`5ef5837`](https://github.com/apache/spark/commit/5ef58370bfd8a637e7738f4734fb8a181df976d6).



[GitHub] [spark] cloud-fan commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#discussion_r383270850
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala
 ##########
 @@ -61,6 +84,19 @@ case class OptimizeSkewedJoin(conf: SQLConf) extends Rule[SparkPlan] {
     }
   }
 
+  /**
+   * The goal of skew join optimization is to make the data distribution more even. The target size
+   * to split skewed partitions is the average size of non-skewed partition, or the
+   * target post-shuffle partition size if avg size is smaller than it.
+   */
+  private def targetSize(stats: MapOutputStatistics, medianSize: Long): Long = {
+    val targetPostShuffleSize = conf.getConf(SQLConf.SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE)
+    val nonSkewSizes = stats.bytesByPartitionId.filterNot(isSkewed(_, medianSize))
 
 Review comment:
   This method is called before entering the main loop, so the calculation is not repeated.
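The diff above is cut off after the first two lines of `targetSize`, so the following is only a standalone sketch of what the doc comment describes: take the average size of the non-skewed partitions, but never go below the target post-shuffle size. The four-argument `isSkewed`, the explicit parameters that replace `conf` and `MapOutputStatistics`, and the fallback when every partition is skewed are assumptions made for illustration.

```scala
object TargetSizeSketch {
  // Assumed skew test: larger than factor * median AND larger than an absolute threshold.
  def isSkewed(size: Long, medianSize: Long, skewFactor: Int, skewThreshold: Long): Boolean =
    size > medianSize * skewFactor && size > skewThreshold

  def targetSize(
      bytesByPartitionId: Array[Long],
      medianSize: Long,
      skewFactor: Int,
      skewThreshold: Long,
      targetPostShuffleSize: Long): Long = {
    val nonSkewSizes =
      bytesByPartitionId.filterNot(isSkewed(_, medianSize, skewFactor, skewThreshold))
    // Assumption: if every partition is skewed, fall back to the post-shuffle target.
    if (nonSkewSizes.isEmpty) targetPostShuffleSize
    else math.max(targetPostShuffleSize, nonSkewSizes.sum / nonSkewSizes.length)
  }
}
```

Because the result depends only on the shuffle statistics, it can be computed once per shuffle stage and passed to the per-partition splitting logic.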



[GitHub] [spark] maryannxue commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
maryannxue commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#discussion_r382680164
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
 ##########
 @@ -445,6 +445,7 @@ object SQLConf {
         " this factor multiple the median partition size and also larger than " +
         s" ${ADAPTIVE_EXECUTION_SKEWED_PARTITION_SIZE_THRESHOLD.key}")
       .intConf
+      .checkValue(_ > 0, "The skew factor must be positive.")
 
 Review comment:
   We no longer use `ADAPTIVE_EXECUTION_SKEWED_PARTITION_SIZE_THRESHOLD`, right? Let's remove it.
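For context, the config description quoted in the diff suggests that a partition counts as skewed when it is larger than the skew factor times the median partition size and also larger than the size threshold. A rough sketch of that check follows, with the positivity requirement from the added `checkValue` mirrored as a `require`. All names and the median calculation are illustrative assumptions, not the PR's code.

```scala
object SkewDetectionSketch {
  // Median of the partition sizes, floored at 1 so that factor * median is never 0.
  def medianSize(sizes: Array[Long]): Long = {
    require(sizes.nonEmpty, "need at least one partition")
    val sorted = sizes.sorted
    val n = sorted.length
    val median =
      if (n % 2 == 0) (sorted(n / 2) + sorted(n / 2 - 1)) / 2 else sorted(n / 2)
    math.max(median, 1L)
  }

  // Indices of partitions exceeding both the relative and the absolute limit.
  def skewedIndices(sizes: Array[Long], skewFactor: Int, skewThreshold: Long): Seq[Int] = {
    // With a factor <= 0 the relative test would pass for every non-empty partition.
    require(skewFactor > 0, "The skew factor must be positive.")
    val median = medianSize(sizes)
    sizes.indices.filter(i => sizes(i) > median * skewFactor && sizes(i) > skewThreshold)
  }
}
```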



[GitHub] [spark] AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-589781892
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590422536
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23627/
   Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590422517
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] maryannxue commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
maryannxue commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#discussion_r383654941
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala
 ##########
 @@ -61,6 +85,19 @@ case class OptimizeSkewedJoin(conf: SQLConf) extends Rule[SparkPlan] {
     }
   }
 
+  /**
+   * The goal of skew join optimization is to make the data distribution more even. The target size
+   * to split skewed partitions is the average size of non-skewed partition, or the
+   * target post-shuffle partition size if avg size is smaller than it.
+   */
+  private def targetSize(stats: MapOutputStatistics, medianSize: Long): Long = {
+    val targetPostShuffleSize = conf.getConf(SQLConf.SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE)
 
 Review comment:
   The problem with the old approach was that a skewed partition's size after the split could be much smaller than the non-skewed partition size. Being small is not a problem in itself, but more splits come at a price, especially when both sides are skewed; and if the non-skewed partitions take longer to finish anyway, that price is not worth paying.
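The cost argument can be illustrated with some back-of-the-envelope arithmetic. The sketch below assumes idealised splits, where a partition of a given size is cut into `ceil(size / target)` pieces, and ignores the fact that real splits follow mapper boundaries. The names are hypothetical.

```scala
object SplitCountSketch {
  // Idealised split count: ceil(partitionSize / targetSize), and at least 1.
  def numSplits(partitionSize: Long, targetSize: Long): Long = {
    require(targetSize > 0, "target size must be positive")
    math.max(1L, (partitionSize + targetSize - 1) / targetSize)
  }

  // When both sides are skewed, every left split is joined with every right split.
  def joinTasks(leftSize: Long, rightSize: Long, targetSize: Long): Long =
    numSplits(leftSize, targetSize) * numSplits(rightSize, targetSize)
}
```

With a hypothetical 10 GiB skewed partition on both sides, a 64 MiB target gives 160 × 160 = 25,600 tasks, while a 1 GiB target gives 10 × 10 = 100. If the non-skewed partitions are themselves around 1 GiB, the finer split adds tasks without shortening the stage.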



[GitHub] [spark] AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590759332
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590425163
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118877/
   Test FAILed.



[GitHub] [spark] maryannxue commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
maryannxue commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#discussion_r383362699
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala
 ##########
 @@ -72,12 +109,14 @@ case class OptimizeSkewedJoin(conf: SQLConf) extends Rule[SparkPlan] {
   /**
    * Split the skewed partition based on the map size and the max split number.
    */
-  private def getMapStartIndices(stage: ShuffleQueryStageExec, partitionId: Int): Array[Int] = {
+  private def getMapStartIndices(
+      stage: ShuffleQueryStageExec,
+      partitionId: Int,
+      targetSize: Long): Array[Int] = {
     val shuffleId = stage.shuffle.shuffleDependency.shuffleHandle.shuffleId
     val mapPartitionSizes = getMapSizesForReduceId(shuffleId, partitionId)
     val avgPartitionSize = mapPartitionSizes.sum / mapPartitionSizes.length
-    val advisoryPartitionSize = math.max(avgPartitionSize,
-      conf.getConf(SQLConf.ADAPTIVE_EXECUTION_SKEWED_PARTITION_SIZE_THRESHOLD))
+    val advisoryPartitionSize = math.max(avgPartitionSize, targetSize)
 
 Review comment:
   Do we still need `avgPartitionSize`? We shouldn't have a problem even if `targetSize` is smaller than the average, right? The worst case would always be one mapper per task.
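The "one mapper per task" worst case can be seen in a small sketch of greedy splitting by mapper output size. This is an assumed reconstruction of the general technique, not the body of `getMapStartIndices` from the PR: consecutive mapper outputs are grouped until adding the next one would exceed the target, and the start index of each group is recorded.

```scala
import scala.collection.mutable.ArrayBuffer

object MapSplitSketch {
  // Returns the mapper index at which each split starts; split 0 always starts at 0.
  def mapStartIndices(mapSizes: Array[Long], targetSize: Long): Array[Int] = {
    val starts = ArrayBuffer(0)
    var currentSize = 0L
    for (i <- mapSizes.indices) {
      // Open a new split when this mapper's output would push the current split
      // over the target. The first mapper always stays in split 0.
      if (i > 0 && currentSize + mapSizes(i) > targetSize) {
        starts += i
        currentSize = mapSizes(i)
      } else {
        currentSize += mapSizes(i)
      }
    }
    starts.toArray
  }
}
```

With mapper sizes `10, 20, 30, 40, 50` and a target of 60, the splits start at mappers 0, 3 and 4. With a target smaller than every mapper output, each mapper gets its own split, which is the lower bound the comment refers to.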



[GitHub] [spark] AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590715339
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] maryannxue commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
maryannxue commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#discussion_r382657413
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala
 ##########
 @@ -240,6 +280,10 @@ case class OptimizeSkewedJoin(conf: SQLConf) extends Rule[SparkPlan] {
       rightStats: MapOutputStatistics,
       nonSkewPartitionIndices: Seq[Int]): Seq[ShufflePartitionSpec] = {
     assert(nonSkewPartitionIndices.nonEmpty)
+    if (!conf.getConf(SQLConf.REDUCE_POST_SHUFFLE_PARTITIONS_ENABLED)) {
 
 Review comment:
   nit: can we combine this `if` with the one below, `if (nonSkewPartitionIndices.length == 1)`?
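A combined guard might look roughly like the sketch below. What each branch returns is an assumption, since the diff does not show the bodies: here the guarded branch emits one range per partition, and the other branch greedily coalesces adjacent partitions up to a target size. `PartitionRange` is a hypothetical stand-in for the real partition spec type, and the explicit parameters replace `conf` and the shuffle statistics.

```scala
import scala.collection.mutable.ArrayBuffer

object CoalesceGuardSketch {
  // Hypothetical stand-in for a partition spec: reducer ids in [start, end).
  final case class PartitionRange(start: Int, end: Int)

  def nonSkewSpecs(
      indices: Seq[Int],      // contiguous, ascending non-skewed partition ids
      sizes: Int => Long,     // combined size of a partition id across both sides
      coalesceEnabled: Boolean,
      targetSize: Long): Seq[PartitionRange] = {
    assert(indices.nonEmpty)
    if (!coalesceEnabled || indices.length == 1) {
      // Nothing to coalesce, or coalescing is switched off: keep partitions as they are.
      indices.map(i => PartitionRange(i, i + 1))
    } else {
      val specs = ArrayBuffer[PartitionRange]()
      var start = indices.head
      var acc = 0L
      for (i <- indices) {
        // Close the current range when this partition would exceed the target.
        if (i > start && acc + sizes(i) > targetSize) {
          specs += PartitionRange(start, i)
          start = i
          acc = sizes(i)
        } else {
          acc += sizes(i)
        }
      }
      specs += PartitionRange(start, indices.last + 1)
      specs.toSeq
    }
  }
}
```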



[GitHub] [spark] cloud-fan commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#discussion_r382580651
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala
 ##########
 @@ -240,6 +280,10 @@ case class OptimizeSkewedJoin(conf: SQLConf) extends Rule[SparkPlan] {
       rightStats: MapOutputStatistics,
       nonSkewPartitionIndices: Seq[Int]): Seq[ShufflePartitionSpec] = {
     assert(nonSkewPartitionIndices.nonEmpty)
+    if (!conf.getConf(SQLConf.REDUCE_POST_SHUFFLE_PARTITIONS_ENABLED)) {
 
 Review comment:
   Not related to this PR, but a small fix: we shouldn't coalesce partitions if the config is off.



[GitHub] [spark] AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-589781892
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] SparkQA commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590331403
 
 
   **[Test build #118870 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118870/testReport)** for PR 27669 at commit [`1b38b71`](https://github.com/apache/spark/commit/1b38b71db2010e44aa4452afdbf8f7cc24bfd915).
    * This patch **fails to build**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590422536
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23627/
   Test PASSed.



[GitHub] [spark] maryannxue commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
maryannxue commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#discussion_r382881205
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
 ##########
 @@ -445,6 +445,7 @@ object SQLConf {
         " this factor multiple the median partition size and also larger than " +
         s" ${ADAPTIVE_EXECUTION_SKEWED_PARTITION_SIZE_THRESHOLD.key}")
       .intConf
+      .checkValue(_ > 0, "The skew factor must be positive.")
 
 Review comment:
   We are using https://github.com/apache/spark/pull/27669/files#diff-2d6bea6eed43ca6f37fe3531cb574069R93 now, as we are trying to use the same target partition size for both coalesced non-skew partitions and skew partitions after the split when the avg non-skew size is small.



[GitHub] [spark] SparkQA commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590758787
 
 
   **[Test build #118916 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118916/testReport)** for PR 27669 at commit [`4755526`](https://github.com/apache/spark/commit/4755526bd20f44aa24f91a120c98518330081398).



[GitHub] [spark] AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590895953
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118916/
   Test PASSed.



[GitHub] [spark] SparkQA removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590758787
 
 
   **[Test build #118916 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118916/testReport)** for PR 27669 at commit [`4755526`](https://github.com/apache/spark/commit/4755526bd20f44aa24f91a120c98518330081398).



[GitHub] [spark] SparkQA commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590328368
 
 
   **[Test build #118870 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118870/testReport)** for PR 27669 at commit [`1b38b71`](https://github.com/apache/spark/commit/1b38b71db2010e44aa4452afdbf8f7cc24bfd915).



[GitHub] [spark] AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590736153
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118901/
   Test FAILed.



[GitHub] [spark] SparkQA removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590714862
 
 
   **[Test build #118901 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118901/testReport)** for PR 27669 at commit [`4755526`](https://github.com/apache/spark/commit/4755526bd20f44aa24f91a120c98518330081398).



[GitHub] [spark] SparkQA commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-589781157
 
 
   **[Test build #118793 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118793/testReport)** for PR 27669 at commit [`4a64c0e`](https://github.com/apache/spark/commit/4a64c0ea1472a9c65156ead6c82dc424ef8f9591).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] SparkQA removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590421669
 
 
   **[Test build #118877 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118877/testReport)** for PR 27669 at commit [`5ef5837`](https://github.com/apache/spark/commit/5ef58370bfd8a637e7738f4734fb8a181df976d6).



[GitHub] [spark] maryannxue commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
maryannxue commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#discussion_r387341483
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala
 ##########
 @@ -61,6 +85,19 @@ case class OptimizeSkewedJoin(conf: SQLConf) extends Rule[SparkPlan] {
     }
   }
 
+  /**
+   * The goal of skew join optimization is to make the data distribution more even. The target size
+   * to split skewed partitions is the average size of non-skewed partition, or the
+   * target post-shuffle partition size if avg size is smaller than it.
+   */
+  private def targetSize(stats: MapOutputStatistics, medianSize: Long): Long = {
+    val targetPostShuffleSize = conf.getConf(SQLConf.SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE)
 
 Review comment:
   After coming across the config description of `ADAPTIVE_EXECUTION_SKEWED_PARTITION_SIZE_THRESHOLD`, I think I get @JkSelf's point. According to the description, it is meant to test whether a partition is skewed, but the way it is actually used in this class, it acts more like the target size for splitting the skewed partitions.
   So we need two changes here:
   1. Bring this conf back and use it in `isSkewed` instead.
   2. If the entire "skewed" partition ends up not being split at all because its size is smaller than the target size, we need to avoid adding the SkewDesc for that partition.
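
A minimal sketch of the two points above, not the code in this PR. It assumes that skew is tested against both the factor-scaled median and the size threshold (as the config description says), and it simplifies splitting to a plain division by the target size. The object name, `numSplits`, and `shouldAddSkewDesc` are hypothetical.

```scala
object SkewCheckSketch {
  // Point 1: a partition counts as skewed only if it exceeds both the
  // factor-scaled median and the absolute size threshold (assumed inputs).
  def isSkewed(size: Long, medianSize: Long, skewFactor: Int, sizeThreshold: Long): Boolean =
    size > medianSize * skewFactor && size > sizeThreshold

  // Simplified split count: ceiling of size / targetSize, never below 1.
  // The real rule splits along map output boundaries; this ignores that.
  def numSplits(size: Long, targetSize: Long): Int = {
    require(targetSize > 0, "targetSize must be positive")
    math.max(1, math.ceil(size.toDouble / targetSize).toInt)
  }

  // Point 2: only record a skew description when the partition is really split.
  def shouldAddSkewDesc(size: Long, targetSize: Long): Boolean =
    numSplits(size, targetSize) > 1
}
```

With these definitions, a partition that passes `isSkewed` but is smaller than the target size yields a single split, so no skew description would be added for it.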



[GitHub] [spark] AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590736143
 
 
   Merged build finished. Test FAILed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-589781899
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118793/
   Test PASSed.



[GitHub] [spark] JkSelf commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
JkSelf commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#discussion_r382878571
 
 

 ##########
 File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
 ##########
 @@ -445,6 +445,7 @@ object SQLConf {
         " this factor multiple the median partition size and also larger than " +
         s" ${ADAPTIVE_EXECUTION_SKEWED_PARTITION_SIZE_THRESHOLD.key}")
       .intConf
+      .checkValue(_ > 0, "The skew factor must be positive.")
 
 Review comment:
   When `nonSkewSizes` is very small, `targetSize` will also be small without this config. Won't the skewed partitions then be split into many small tasks? So we may still need this config.
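
To make the concern concrete, here is a rough arithmetic sketch. It assumes a skewed partition is cut into `ceil(size / targetSize)` pieces, which is a simplification of the actual splitting logic. The object name and the sizes used are illustrative only.

```scala
object SmallTargetSketch {
  // Ceiling division: how many pieces a skewed partition of `skewedSize` bytes
  // produces when each piece should hold at most `targetSize` bytes.
  def numSplits(skewedSize: Long, targetSize: Long): Long = {
    require(targetSize > 0, "targetSize must be positive")
    (skewedSize + targetSize - 1) / targetSize
  }

  val oneGiB: Long = 1L << 30
  val oneMiB: Long = 1L << 20

  // Target driven only by a tiny non-skew average (1 MiB): many small tasks.
  val splitsWithoutFloor: Long = numSplits(oneGiB, oneMiB)
  // Target bounded below by a 64 MiB floor: far fewer tasks.
  val splitsWithFloor: Long = numSplits(oneGiB, 64 * oneMiB)
}
```

Under these assumed sizes, the unbounded target produces 1024 pieces while the 64 MiB floor produces 16, which is the effect a lower bound on the target size is meant to achieve.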



[GitHub] [spark] cloud-fan commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590756268
 
 
   retest this please



[GitHub] [spark] AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-589655456
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] maryannxue commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
maryannxue commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#discussion_r383358437
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala
 ##########
 @@ -61,6 +84,19 @@ case class OptimizeSkewedJoin(conf: SQLConf) extends Rule[SparkPlan] {
     }
   }
 
+  /**
+   * The goal of skew join optimization is to make the data distribution more even. The target size
+   * to split skewed partitions is the average size of non-skewed partition, or the
+   * target post-shuffle partition size if avg size is smaller than it.
+   */
+  private def targetSize(stats: MapOutputStatistics, medianSize: Long): Long = {
+    val targetPostShuffleSize = conf.getConf(SQLConf.SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE)
+    val nonSkewSizes = stats.bytesByPartitionId.filterNot(isSkewed(_, medianSize))
 
 Review comment:
   yeah, you are right!



[GitHub] [spark] AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590759346
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23665/
   Test PASSed.



[GitHub] [spark] SparkQA commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-589654980
 
 
   **[Test build #118793 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118793/testReport)** for PR 27669 at commit [`4a64c0e`](https://github.com/apache/spark/commit/4a64c0ea1472a9c65156ead6c82dc424ef8f9591).



[GitHub] [spark] AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590331433
 
 
   Merged build finished. Test FAILed.



[GitHub] [spark] SparkQA removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590328368
 
 
   **[Test build #118870 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118870/testReport)** for PR 27669 at commit [`1b38b71`](https://github.com/apache/spark/commit/1b38b71db2010e44aa4452afdbf8f7cc24bfd915).



[GitHub] [spark] cloud-fan commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-589652426
 
 
   @JkSelf @maryannxue 



[GitHub] [spark] AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-589655456
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590895953
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118916/
   Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590331440
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118870/
   Test FAILed.



[GitHub] [spark] AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-589781899
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118793/
   Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590328836
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23619/
   Test PASSed.



[GitHub] [spark] maryannxue commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
maryannxue commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#discussion_r382661964
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala
 ##########
 @@ -61,6 +84,19 @@ case class OptimizeSkewedJoin(conf: SQLConf) extends Rule[SparkPlan] {
     }
   }
 
+  /**
+   * The goal of skew join optimization is to make the data distribution more even. The target size
+   * to split skewed partitions is the average size of non-skewed partition, or the
+   * target post-shuffle partition size if avg size is smaller than it.
+   */
+  private def targetSize(stats: MapOutputStatistics, medianSize: Long): Long = {
+    val targetPostShuffleSize = conf.getConf(SQLConf.SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE)
+    val nonSkewSizes = stats.bytesByPartitionId.filterNot(isSkewed(_, medianSize))
 
 Review comment:
   We can move this repeated calculation of the non-skew average size out of this method, leaving just two local variables before the main loop below: `targetPostShuffleSizeConf` and `nonSkewAvgSize`. Then we only need `max(targetPostShuffleSizeConf, nonSkewAvgSize)`, without a method call.
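
The suggestion could look roughly like the sketch below. It is standalone Scala rather than the actual rule: partition sizes are passed in as a plain `Seq[Long]`, the skew test is supplied as a function, and the object and helper names are hypothetical. Only `targetPostShuffleSizeConf` and `nonSkewAvgSize` come from the comment above.

```scala
object TargetSizeSketch {
  // Average size of the partitions that the supplied predicate does not flag
  // as skewed. Returns 0 when every partition is skewed (an assumed fallback).
  def nonSkewAvgSize(sizes: Seq[Long], isSkewed: Long => Boolean): Long = {
    val nonSkewSizes = sizes.filterNot(isSkewed)
    if (nonSkewSizes.isEmpty) 0L else nonSkewSizes.sum / nonSkewSizes.length
  }

  // The target size is the larger of the configured post-shuffle size and the
  // non-skew average, so a tiny average cannot drive the target below the conf.
  def targetSize(targetPostShuffleSizeConf: Long, nonSkewAvgSize: Long): Long =
    math.max(targetPostShuffleSizeConf, nonSkewAvgSize)
}
```

Computing `nonSkewAvgSize` once per side before the loop means the loop body only needs the `max` expression.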



[GitHub] [spark] AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590331433
 
 
   Merged build finished. Test FAILed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590895942
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] SparkQA commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590736046
 
 
   **[Test build #118901 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118901/testReport)** for PR 27669 at commit [`4755526`](https://github.com/apache/spark/commit/4755526bd20f44aa24f91a120c98518330081398).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590425163
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118877/
   Test FAILed.



[GitHub] [spark] maryannxue commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
maryannxue commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#discussion_r382679191
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala
 ##########
 @@ -34,6 +34,29 @@ import org.apache.spark.sql.execution.exchange.{EnsureRequirements, ShuffleExcha
 import org.apache.spark.sql.execution.joins.SortMergeJoinExec
 import org.apache.spark.sql.internal.SQLConf
 
+/**
+ * A rule to optimize skewed joins to avoid one or a little tasks processing most of the data.
+ *
+ * The general idea is to treat a Sort-Merge join as many sub-joins that each sub-join processes
 
 Review comment:
   It's not accurate to say "sub-join" here. How about:
   The general idea is to divide each skew partition into smaller partitions and replicate its matching partition on the other side of the join so that they can run in parallel tasks. Note that when matching partitions from the left side and the right side both have skew, it will become a cartesian product of splits from left and right joining together.
   
   And let's replace the term "sub-join" accordingly in the example below as well.
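
To illustrate the wording proposed above, here is a small sketch that enumerates the resulting tasks for the four-partition example in the doc comment, where L2, L4, R3 and R4 are each split in two. `SkewSplitPairsSketch` and its helpers are hypothetical names for illustration only; the real rule works on shuffle partition specs rather than strings:

```scala
object SkewSplitPairsSketch {
  // Names of the splits of one partition on one side of the join.
  // A non-skewed partition has a single "split": the partition itself.
  def splits(side: String, index: Int, numSplits: Int): Seq[String] =
    if (numSplits <= 1) Seq(s"$side$index")
    else (1 to numSplits).map(i => s"$side$index-$i")

  // Pair every left split with every right split of the matching partition.
  // If only one side is skewed, the other side's partition is replicated;
  // if both are skewed, the result is a cartesian product of the splits.
  def tasks(leftSplits: Seq[Int], rightSplits: Seq[Int]): Seq[(String, String)] =
    leftSplits.zip(rightSplits).zipWithIndex.flatMap { case ((l, r), i) =>
      for {
        left <- splits("L", i + 1, l)
        right <- splits("R", i + 1, r)
      } yield (left, right)
    }
}
```

Calling `SkewSplitPairsSketch.tasks(Seq(1, 2, 1, 2), Seq(1, 1, 2, 2))` yields nine pairs, matching the count in the doc comment's example.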



[GitHub] [spark] maryannxue commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
maryannxue commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#discussion_r382668795
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala
 ##########
 @@ -72,14 +108,16 @@ case class OptimizeSkewedJoin(conf: SQLConf) extends Rule[SparkPlan] {
   /**
    * Split the skewed partition based on the map size and the max split number.
    */
-  private def getMapStartIndices(stage: ShuffleQueryStageExec, partitionId: Int): Array[Int] = {
+  private def getMapStartIndices(
+      stage: ShuffleQueryStageExec,
+      partitionId: Int,
+      targetSize: Long): Array[Int] = {
     val shuffleId = stage.shuffle.shuffleDependency.shuffleHandle.shuffleId
     val mapPartitionSizes = getMapSizesForReduceId(shuffleId, partitionId)
     val maxSplits = math.min(conf.getConf(
       SQLConf.ADAPTIVE_EXECUTION_SKEWED_PARTITION_MAX_SPLITS), mapPartitionSizes.length)
     val avgPartitionSize = mapPartitionSizes.sum / maxSplits
-    val advisoryPartitionSize = math.max(avgPartitionSize,
 
 Review comment:
   Can we remove `maxSplits` and `avgPartitionSize` in this PR?
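
For context, a sketch of how the split could be driven by `targetSize` alone once `maxSplits` and `avgPartitionSize` are gone. This is an assumption about one possible shape, not the code in the PR: `MapSplitSketch` is a hypothetical name, and the greedy grouping below is only one way to turn per-map sizes into start indices:

```scala
import scala.collection.mutable.ArrayBuffer

object MapSplitSketch {
  // Greedily groups consecutive map outputs so that each group stays at or below
  // targetSize where possible, and returns the index of the first map output
  // in each group. A single map output larger than targetSize forms its own group.
  def mapStartIndices(mapPartitionSizes: Array[Long], targetSize: Long): Array[Int] = {
    val startIndices = ArrayBuffer(0)
    var currentSize = 0L
    for (i <- mapPartitionSizes.indices) {
      val size = mapPartitionSizes(i)
      // Start a new group when adding this map output would exceed the target,
      // unless the current group holds no data yet.
      if (currentSize > 0 && currentSize + size > targetSize) {
        startIndices += i
        currentSize = size
      } else {
        currentSize += size
      }
    }
    startIndices.toArray
  }
}
```

The number of splits then falls out of the data and the target size instead of being capped by a separate config.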



[GitHub] [spark] AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590425152
 
 
   Merged build finished. Test FAILed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590715354
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23650/
   Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590759346
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23665/
   Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590331440
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118870/
   Test FAILed.



[GitHub] [spark] AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590715339
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590759332
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] maryannxue commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
maryannxue commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#discussion_r382666207
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala
 ##########
 @@ -34,6 +34,29 @@ import org.apache.spark.sql.execution.exchange.{EnsureRequirements, ShuffleExcha
 import org.apache.spark.sql.execution.joins.SortMergeJoinExec
 import org.apache.spark.sql.internal.SQLConf
 
+/**
+ * A rule to optimize skewed joins to avoid one or a little tasks processing most of the data.
+ *
+ * The general idea is to treat a Sort-Merge join as many sub-joins that each sub-join processes
+ * the data of a left-side partition and a right-side partition. For each sub-join, split the skewed
+ * partition into sub-partitions and do a cartesian product of sub-partitions from left and
+ * right sides.
+ *
+ * For example, assume the Sort-Merge join has 4 partitions:
+ * left:  [L1, L2, L3, L4]
+ * right: [R1, R2, R3, R4]
+ *
+ * Let's say L2, L4 and R3, R4 are skewed, and each of them get split into 2 sub-partitions. This
+ * has 4 sub-joins at the beginning: (L1, R1), (L2, R2), (L2, R2), (L2, R2).
+ * This rule expands it to 9 sub-joins:
+ * (L1, R1),
+ * (L2-1, R2), (L2-2, R2),
+ * (L3, R3-1), (L3, R3-2),
+ * (L4-1, R4-1), (L4-2, R4-1), (L4-1, R4-2), (L4-2, R4-2)
+ * Each sub-join is executed as a Spark task physically, so we end up with more parallelism.
+ *
+ * Note that, this rule also coalesces non-skewed partitions like `ReduceNumShufflePartitions`.
 
 Review comment:
   nit: when ... is enabled.



[GitHub] [spark] AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590328823
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] gatorsmile closed pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
gatorsmile closed pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669
 
 
   



[GitHub] [spark] JkSelf commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
JkSelf commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#discussion_r383604131
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala
 ##########
 @@ -61,6 +85,19 @@ case class OptimizeSkewedJoin(conf: SQLConf) extends Rule[SparkPlan] {
     }
   }
 
+  /**
+   * The goal of skew join optimization is to make the data distribution more even. The target size
+   * to split skewed partitions is the average size of non-skewed partition, or the
+   * target post-shuffle partition size if avg size is smaller than it.
+   */
+  private def targetSize(stats: MapOutputStatistics, medianSize: Long): Long = {
+    val targetPostShuffleSize = conf.getConf(SQLConf.SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE)
 
 Review comment:
   When a user enables skewed join optimization and wants to change the skew condition by adjusting `targetPostShuffleSize`, using `SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE` here may also affect the number of tasks in the map stage. Would it be better to use the `ADAPTIVE_EXECUTION_SKEWED_PARTITION_SIZE_THRESHOLD` config to set `targetPostShuffleSize` in skewed join optimization?



[GitHub] [spark] cloud-fan commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#discussion_r384250111
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala
 ##########
 @@ -34,6 +34,30 @@ import org.apache.spark.sql.execution.exchange.{EnsureRequirements, ShuffleExcha
 import org.apache.spark.sql.execution.joins.SortMergeJoinExec
 import org.apache.spark.sql.internal.SQLConf
 
+/**
+ * A rule to optimize skewed joins to avoid straggler tasks whose share of data are significantly
+ * larger than those of the rest of the tasks.
+ *
+ * The general idea is to divide each skew partition into smaller partitions and replicate its
+ * matching partition on the other side of the join so that they can run in parallel tasks.
+ * Note that when matching partitions from the left side and the right side both have skew,
+ * it will become a cartesian product of splits from left and right joining together.
+ *
+ * For example, assume the Sort-Merge join has 4 partitions:
+ * left:  [L1, L2, L3, L4]
+ * right: [R1, R2, R3, R4]
+ *
+ * Let's say L2, L4 and R3, R4 are skewed, and each of them get split into 2 sub-partitions. This
+ * is scheduled to run 4 tasks at the beginning: (L1, R1), (L2, R2), (L2, R2), (L2, R2).
 
 Review comment:
   ah yes! will fix it soon



[GitHub] [spark] AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590422517
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-589655459
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23544/
   Test PASSed.



[GitHub] [spark] SparkQA removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-589654980
 
 
   **[Test build #118793 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118793/testReport)** for PR 27669 at commit [`4a64c0e`](https://github.com/apache/spark/commit/4a64c0ea1472a9c65156ead6c82dc424ef8f9591).



[GitHub] [spark] maryannxue commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
maryannxue commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#discussion_r382669699
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala
 ##########
 @@ -34,6 +34,29 @@ import org.apache.spark.sql.execution.exchange.{EnsureRequirements, ShuffleExcha
 import org.apache.spark.sql.execution.joins.SortMergeJoinExec
 import org.apache.spark.sql.internal.SQLConf
 
+/**
+ * A rule to optimize skewed joins to avoid one or a little tasks processing most of the data.
 
 Review comment:
   A rule to optimize skewed joins to avoid straggler tasks whose share of data are significantly larger than those of the rest of the tasks.



[GitHub] [spark] AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-589655459
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23544/
   Test PASSed.



[GitHub] [spark] AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590895942
 
 
   Merged build finished. Test PASSed.



[GitHub] [spark] SparkQA commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590894862
 
 
   **[Test build #118916 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118916/testReport)** for PR 27669 at commit [`4755526`](https://github.com/apache/spark/commit/4755526bd20f44aa24f91a120c98518330081398).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] maryannxue commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
maryannxue commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#discussion_r383653912
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala
 ##########
 @@ -61,6 +85,19 @@ case class OptimizeSkewedJoin(conf: SQLConf) extends Rule[SparkPlan] {
     }
   }
 
+  /**
+   * The goal of skew join optimization is to make the data distribution more even. The target size
+   * to split skewed partitions is the average size of non-skewed partition, or the
+   * target post-shuffle partition size if avg size is smaller than it.
+   */
+  private def targetSize(stats: MapOutputStatistics, medianSize: Long): Long = {
+    val targetPostShuffleSize = conf.getConf(SQLConf.SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE)
 
 Review comment:
   Why would a user want the partition size after the split to differ from the size of the non-skewed partitions? The goal of this rule is to bring all partitions to around the same size if possible...



[GitHub] [spark] SparkQA commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590714862
 
 
   **[Test build #118901 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118901/testReport)** for PR 27669 at commit [`4755526`](https://github.com/apache/spark/commit/4755526bd20f44aa24f91a120c98518330081398).



[GitHub] [spark] JkSelf commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
JkSelf commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#discussion_r382878634
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala
 ##########
 @@ -72,14 +108,16 @@ case class OptimizeSkewedJoin(conf: SQLConf) extends Rule[SparkPlan] {
   /**
    * Split the skewed partition based on the map size and the max split number.
    */
-  private def getMapStartIndices(stage: ShuffleQueryStageExec, partitionId: Int): Array[Int] = {
+  private def getMapStartIndices(
+      stage: ShuffleQueryStageExec,
+      partitionId: Int,
+      targetSize: Long): Array[Int] = {
     val shuffleId = stage.shuffle.shuffleDependency.shuffleHandle.shuffleId
     val mapPartitionSizes = getMapSizesForReduceId(shuffleId, partitionId)
     val maxSplits = math.min(conf.getConf(
       SQLConf.ADAPTIVE_EXECUTION_SKEWED_PARTITION_MAX_SPLITS), mapPartitionSizes.length)
     val avgPartitionSize = mapPartitionSizes.sum / maxSplits
-    val advisoryPartitionSize = math.max(avgPartitionSize,
 
 Review comment:
   remove the max splits in [PR#27673](https://github.com/apache/spark/pull/27673)
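
   The hunk above passes a `targetSize` into `getMapStartIndices`. As a rough illustration of how one skewed reduce partition could be split by map output size once a target size is known, here is a minimal sketch. It is not the code from this PR: `MapSplitSketch` and `mapStartIndices` are hypothetical names, the greedy packing strategy is an assumption, and the max-splits cap shown in the hunk is deliberately left out.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper, not part of Spark: greedily groups consecutive map outputs
// so that each group stays at or below targetSize where possible.
object MapSplitSketch {
  // mapSizes(i) is the number of bytes map task i wrote for the skewed reduce partition.
  // Returns the map index at which each split starts; the first split always starts at 0.
  def mapStartIndices(mapSizes: Array[Long], targetSize: Long): Array[Int] = {
    val starts = ArrayBuffer(0)
    var currentSize = 0L
    for (i <- mapSizes.indices) {
      val next = mapSizes(i)
      if (i > 0 && currentSize + next > targetSize) {
        // Adding this map output would exceed the target, so start a new split here.
        starts += i
        currentSize = next
      } else {
        currentSize += next
      }
    }
    starts.toArray
  }
}
```

   A single map output larger than `targetSize` still ends up in a split of its own, so the target is a goal rather than a hard limit in this sketch.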



[GitHub] [spark] AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590736153
 
 
   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118901/



[GitHub] [spark] gatorsmile commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
gatorsmile commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-591096662
 
 
   Thanks! Merged to master/3.0



[GitHub] [spark] dongjoon-hyun commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-591102738
 
 
   cc @dbtsai 



[GitHub] [spark] AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590715354
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23650/



[GitHub] [spark] SparkQA commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
SparkQA commented on issue #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#issuecomment-590425115
 
 
   **[Test build #118877 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118877/testReport)** for PR 27669 at commit [`5ef5837`](https://github.com/apache/spark/commit/5ef58370bfd8a637e7738f4734fb8a181df976d6).
    * This patch **fails to build**.
    * This patch merges cleanly.
    * This patch adds no public classes.



[GitHub] [spark] cloud-fan commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#discussion_r383738095
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala
 ##########
 @@ -61,6 +85,19 @@ case class OptimizeSkewedJoin(conf: SQLConf) extends Rule[SparkPlan] {
     }
   }
 
+  /**
+   * The goal of skew join optimization is to make the data distribution more even. The target size
+   * to split skewed partitions is the average size of non-skewed partition, or the
+   * target post-shuffle partition size if avg size is smaller than it.
+   */
+  private def targetSize(stats: MapOutputStatistics, medianSize: Long): Long = {
+    val targetPostShuffleSize = conf.getConf(SQLConf.SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE)
 
 Review comment:
   @JkSelf do you have any real-world use cases for it? I noticed it as well, but I have the same feeling as @maryannxue: why would users set a different value?
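
   For readers following the thread, the rule described in the doc comment above (use the average size of the non-skewed partitions, but never less than the target post-shuffle size) can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `SkewTargetSizeSketch`, the `factor` and `thresholdBytes` parameters, and the exact skew test are hypothetical stand-ins for whatever the rule reads from `SQLConf`.

```scala
// Hypothetical, self-contained sketch; the names and the skew test are assumptions.
object SkewTargetSizeSketch {
  // Assumed skew test: a partition is skewed when it is larger than both
  // factor * medianSize and an absolute threshold in bytes.
  def isSkewed(size: Long, medianSize: Long, factor: Int, thresholdBytes: Long): Boolean =
    size > medianSize * factor && size > thresholdBytes

  // Target size for splitting: the average size of the non-skewed partitions,
  // or targetPostShuffleSize when that average is smaller.
  def targetSize(
      partitionSizes: Seq[Long],
      medianSize: Long,
      targetPostShuffleSize: Long,
      factor: Int,
      thresholdBytes: Long): Long = {
    val nonSkewed = partitionSizes.filterNot(isSkewed(_, medianSize, factor, thresholdBytes))
    // With a median-based skew test, at least one partition should be non-skewed.
    require(nonSkewed.nonEmpty, "expected at least one non-skewed partition")
    math.max(targetPostShuffleSize, nonSkewed.sum / nonSkewed.length)
  }
}
```

   With partition sizes 100, 120, 80 and 1000, a median of 110, a factor of 5 and a threshold of 500, only the 1000-byte partition counts as skewed, so the average of the rest (100) is used unless the target post-shuffle size is larger.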



[GitHub] [spark] maryannxue commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions

Posted by GitBox <gi...@apache.org>.
maryannxue commented on a change in pull request #27669: [SPARK-30918][SQL] improve the splitting of skewed partitions
URL: https://github.com/apache/spark/pull/27669#discussion_r383359019
 
 

 ##########
 File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala
 ##########
 @@ -236,7 +277,8 @@ case class OptimizeSkewedJoin(conf: SQLConf) extends Rule[SparkPlan] {
       rightStats: MapOutputStatistics,
       nonSkewPartitionIndices: Seq[Int]): Seq[ShufflePartitionSpec] = {
     assert(nonSkewPartitionIndices.nonEmpty)
-    if (nonSkewPartitionIndices.length == 1) {
+    val isEnabled = conf.getConf(SQLConf.REDUCE_POST_SHUFFLE_PARTITIONS_ENABLED)
 
 Review comment:
   nit: isCoalesceEnabled
