Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/12/15 03:44:25 UTC

[GitHub] [spark] zhengruifeng commented on pull request #34602: [SPARK-37328][SQL] Fix bug that OptimizeSkewedJoin may not work after it was moved from queryStageOptimizerRules to queryStagePreparationRules.

zhengruifeng commented on pull request #34602:
URL: https://github.com/apache/spark/pull/34602#issuecomment-994264151


   @advancedxy  Sorry for the late reply, and thanks for pinging me.
   
   I did a quick test with https://github.com/apache/spark/pull/33893
   
   Unfortunately, https://github.com/apache/spark/pull/33893 failed to handle this case, because `Exchange` nodes rather than `ShuffleQueryStage` nodes were passed to the rule. https://github.com/apache/spark/pull/33893 now expects all leaves to be `QueryStageExec`.
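   
   As a rough illustration of this failure mode, here is a minimal, Spark-free sketch of an "all leaves below the joins must be query stages" check. Everything in it (`PlanNode`, `Join`, `Exchange`, `LeafScan`, `QueryStage`, `unsupportedOperators`) is a hypothetical stand-in, not the actual classes or logic of either PR.
   
   ```scala
   // Hypothetical, simplified plan model; these are NOT Spark's real classes.
   object LeafCheckSketch {
     sealed trait PlanNode
     final case class Join(left: PlanNode, right: PlanNode) extends PlanNode
     final case class Exchange(child: PlanNode) extends PlanNode
     final case class LeafScan(table: String) extends PlanNode
     final case class QueryStage(id: Int) extends PlanNode

     // Walk down through the joins. A query stage is an acceptable leaf;
     // any other operator found below a join is reported as unsupported.
     def unsupportedOperators(plan: PlanNode): Seq[String] = plan match {
       case QueryStage(_)     => Seq.empty
       case Join(left, right) => unsupportedOperators(left) ++ unsupportedOperators(right)
       case Exchange(child)   => "Exchange" +: unsupportedOperators(child)
       case LeafScan(_)       => Seq("LeafScan")
     }

     // Plan seen before any stage is materialized: the joins still sit on raw exchanges.
     val beforeStages: PlanNode =
       Join(Exchange(LeafScan("skewData1")), Exchange(LeafScan("skewData2")))

     // Plan seen once both inputs have become query stages.
     val afterStages: PlanNode = Join(QueryStage(0), QueryStage(1))
   }
   ```
   
   In this toy model, `unsupportedOperators(beforeStages)` is non-empty while `unsupportedOperators(afterStages)` is empty, so a check of this shape would skip the first plan.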
   
   Test code:
   ```
   spark.conf.set("spark.sql.adaptive.enabled", true)
   spark.conf.set("spark.sql.adaptive.skewJoin.enabled", true)
   spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
   spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", false)
   spark.conf.set("spark.sql.shuffle.partitions", 10)
   spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "100")
   spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "100")
   
   spark.range(0, 1000, 1, 10).selectExpr("id % 3 as key1", "id % 3 as value1").createOrReplaceTempView("skewData1")
   spark.range(0, 1000, 1, 10).selectExpr("id % 1 as key2", "id as value2").createOrReplaceTempView("skewData2")
   spark.range(0, 1000, 1, 10).selectExpr("id % 1 as key3", "id as value3").createOrReplaceTempView("skewData3")
   
   
   spark.sql("SELECT key1 FROM skewData1 JOIN skewData2 ON key1 = key2 JOIN skewData3 ON value2 = value3").write.mode("overwrite").parquet("/tmp/tmp1.parquet")
   ```
   
   
   Related log:
   
   ```
   21/12/15 11:15:33 DEBUG SparkSqlParser: Parsing command: SELECT key1 FROM skewData1 JOIN skewData2 ON key1 = key2 JOIN skewData3 ON value2 = value3
   21/12/15 11:15:34 DEBUG OptimizeSkewedJoin: Optimizing Project #75: ShuffledJoins: [SortMergeJoin, SortMergeJoin]
   21/12/15 11:15:34 DEBUG OptimizeSkewedJoin: Optimizing Project #75: Do NOT support operators [Exchange, Exchange, Range, Exchange, Range, Exchange, Range]
   21/12/15 11:15:35 DEBUG OptimizeSkewedJoin: Optimizing Project #161: ShuffledJoins: [SortMergeJoin, SortMergeJoin]
   21/12/15 11:15:35 DEBUG OptimizeSkewedJoin: Optimizing Project #161: Do NOT support operators [Exchange]
   21/12/15 11:15:35 DEBUG OptimizeSkewedJoin: Optimizing Project #200: ShuffledJoins: [SortMergeJoin, SortMergeJoin]
   21/12/15 11:15:35 DEBUG OptimizeSkewedJoin: Optimizing Project #200: Do NOT support operators [Exchange]
   21/12/15 11:15:35 DEBUG OptimizeSkewedJoin: Optimizing Project #247: ShuffledJoins: [SortMergeJoin]
   21/12/15 11:15:36 DEBUG OptimizeSkewedJoin: Optimizing Project #261: ShuffledJoins: [SortMergeJoin]
   21/12/15 11:15:36 DEBUG OptimizeSkewedJoin: Optimizing Project #261: ShuffleQueryStages: [3, 2]
   21/12/15 11:15:36 DEBUG OptimizeSkewedJoin: Optimizing Project #261: Splittable ShuffleQueryStages: [3, 2]
   21/12/15 11:15:36 DEBUG OptimizeSkewedJoin: Optimizing Project #261: Optimizing ShuffleQueryStage #3 in skew join, size info: median size: 21184, max size: 26854, min size: 18341, avg size: 21544
   21/12/15 11:15:36 DEBUG OptimizeSkewedJoin: Optimizing Project #261: Optimizing ShuffleQueryStage #2 in skew join, size info: median size: 1227, max size: 1308, min size: 1142, avg size: 1230
   21/12/15 11:15:36 DEBUG OptimizeSkewedJoin: Optimizing Project #261: Totally 0 skew partitions found
   ```
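   
   The final "0 skew partitions" line is consistent with the size info printed for Project #261. As a rough check, the sketch below assumes the usual criterion (a partition counts as skewed only if it is larger than both `spark.sql.adaptive.skewJoin.skewedPartitionFactor` times the median and `skewedPartitionThresholdInBytes`) and the default factor of 5, which the test code above does not override. The names `SkewCheckSketch` and `isSkewed` are hypothetical, and the rule under test may apply a different condition.
   
   ```scala
   object SkewCheckSketch {
     // Assumed criterion: size > factor * median AND size > thresholdBytes.
     def isSkewed(size: Long, median: Long, factor: Double, thresholdBytes: Long): Boolean =
       size > median * factor && size > thresholdBytes

     val factor = 5.0      // assumed default skewedPartitionFactor
     val threshold = 100L  // skewedPartitionThresholdInBytes from the test code

     // Largest partition of each stage, taken from the log above.
     val stage3MaxSkewed: Boolean = isSkewed(26854L, 21184L, factor, threshold) // false: 26854 < 5 * 21184
     val stage2MaxSkewed: Boolean = isSkewed(1308L, 1227L, factor, threshold)   // false: 1308 < 5 * 1227
   }
   ```
   
   Under these assumptions even the largest partition of either stage is well below five times its median, so no partition in the second join would be flagged.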


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


