You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/06/11 12:13:58 UTC

[GitHub] [spark] wangyum opened a new pull request, #36845: [SPARK-39447][SQL] Only non-broadcast query stage can propagate empty relation

wangyum opened a new pull request, #36845:
URL: https://github.com/apache/spark/pull/36845

   ### What changes were proposed in this pull request?
   
   This PR updates `AQEPropagateEmptyRelation` to make it only non-broadcast query stage can propagate empty relation.
   
   ### Why are the changes needed?
   
   Fix bug, otherwise TPC-DS q5 will fail if run on empty dataset.
   ```
   Caused by: java.lang.AssertionError: assertion failed
   	at scala.Predef$.assert(Predef.scala:208)
   	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.$anonfun$doExecuteBroadcast$1(AdaptiveSparkPlanExec.scala:364)
   	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.withFinalPlanUpdate(AdaptiveSparkPlanExec.scala:371)
   	at org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanExec.doExecuteBroadcast(AdaptiveSparkPlanExec.scala:363)
   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeBroadcast$1(SparkPlan.scala:207)
   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:232)
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   A simple query can't  reproduce the issue. So I updated `TPCDSQuerySuite` to cover this case.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maryannxue commented on pull request #36845: [SPARK-39447][SQL] Avoid AssertionError in AdaptiveSparkPlanExec.doExecuteBroadcast

Posted by GitBox <gi...@apache.org>.
maryannxue commented on PR #36845:
URL: https://github.com/apache/spark/pull/36845#issuecomment-1162588818

   @ulysses-you Please see if https://github.com/apache/spark/pull/36953 works for you. If no, you can add your specific check into the framework once it is merged. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ulysses-you commented on pull request #36845: [SPARK-39447][SQL] Only non-broadcast query stage can propagate empty relation

Posted by GitBox <gi...@apache.org>.
ulysses-you commented on PR #36845:
URL: https://github.com/apache/spark/pull/36845#issuecomment-1153742980

   I can re-produce it by: 
   ```sql
   CREATE TABLE t1(c1 int) USING PARQUET PARTITIONED BY (p1 string);
   CREATE TABLE t2(c2 int) USING PARQUET PARTITIONED BY (p2 string);
   
   SELECT * from (
   SELECT /*+ merge(t1) */ p1 FROM t1 JOIN t2 ON c1 = c2
   ) x JOIN t2 ON p1 = p2
   WHERE
   c2 > 0
   ```
   
   The reason is, AQE + DPP will insert a broadcast exchange at the top of `AdaptiveSparkPlanExec` when it is broadcast reusable. There exists some hacky code for this behavior during AQE `re-optimize`:
   
   ```scala
   // When both enabling AQE and DPP, `PlanAdaptiveDynamicPruningFilters` rule will
   // add the `BroadcastExchangeExec` node manually in the DPP subquery,
   // not through `EnsureRequirements` rule. Therefore, when the DPP subquery is complicated
   // and need to be re-optimized, AQE also need to manually insert the `BroadcastExchangeExec`
   // node to prevent the loss of the `BroadcastExchangeExec` node in DPP subquery.
   // Here, we also need to avoid to insert the `BroadcastExchangeExec` node when the newPlan
   // is already the `BroadcastExchangeExec` plan after apply the `LogicalQueryStageStrategy` rule.
   val finalPlan = currentPhysicalPlan match {
     case b: BroadcastExchangeLike
       if (!newPlan.isInstanceOf[BroadcastExchangeLike]) => b.withNewChildren(Seq(newPlan))
     case _ => newPlan
   }
   ```
   
   However, this code does not match if the top level broadcast exchange is wrapped by query stage. This case will happen if the broadcast exchange which is added by DPP is running before than the normal broadcast exchange(e.g. introduced by join).
   
   So we can match `BroadcastQueryStage(_, ReusedExchangeExec, _)` and skip the optimization. It is no meaning to optimize a child inside a reused exchange which is only for broadcast.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #36845: [SPARK-39447][SQL] Avoid AssertionError in AdaptiveSparkPlanExec.doExecuteBroadcast

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #36845:
URL: https://github.com/apache/spark/pull/36845#issuecomment-1161499556

   WDYT @maryannxue ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] maryannxue commented on pull request #36845: [SPARK-39447][SQL] Only non-broadcast query stage can propagate empty relation

Posted by GitBox <gi...@apache.org>.
maryannxue commented on PR #36845:
URL: https://github.com/apache/spark/pull/36845#issuecomment-1154125105

   @wangyum I don't think this fix will solve all problems. Empty relation propagation can cause invalid plans in other ways, e.g., a `BroadcastExchangeExec` ending up as the child of a random plan node other than BHJ or BNLJ. So you can approach this problem from a different angle. Have an invalid plan checker to detect those patterns and throw a specific exception that can be caught by `AdaptiveSparkPlanExec`'s replanning method and so it can void the latest plan and keep the old valid plan in use.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ulysses-you commented on pull request #36845: [SPARK-39447][SQL] Avoid AssertionError in AdaptiveSparkPlanExec.doExecuteBroadcast

Posted by GitBox <gi...@apache.org>.
ulysses-you commented on PR #36845:
URL: https://github.com/apache/spark/pull/36845#issuecomment-1159881833

   looks a correct fix to me, cc @cloud-fan


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #36845: [SPARK-39447][SQL] Only non-broadcast query stage can propagate empty relation

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on PR #36845:
URL: https://github.com/apache/spark/pull/36845#issuecomment-1153340424

   cc @maryannxue FYI


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ulysses-you commented on pull request #36845: [SPARK-39447][SQL] Avoid AssertionError in AdaptiveSparkPlanExec.doExecuteBroadcast

Posted by GitBox <gi...@apache.org>.
ulysses-you commented on PR #36845:
URL: https://github.com/apache/spark/pull/36845#issuecomment-1165287956

   After offline discuss with @wangyum , I will take over it. see https://github.com/apache/spark/pull/36974


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum closed pull request #36845: [SPARK-39447][SQL] Avoid AssertionError in AdaptiveSparkPlanExec.doExecuteBroadcast

Posted by GitBox <gi...@apache.org>.
wangyum closed pull request #36845: [SPARK-39447][SQL] Avoid AssertionError in AdaptiveSparkPlanExec.doExecuteBroadcast
URL: https://github.com/apache/spark/pull/36845


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] wangyum commented on pull request #36845: [SPARK-39447][SQL] Avoid AssertionError in AdaptiveSparkPlanExec.doExecuteBroadcast

Posted by GitBox <gi...@apache.org>.
wangyum commented on PR #36845:
URL: https://github.com/apache/spark/pull/36845#issuecomment-1165289356

   Thank you @ulysses-you 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ulysses-you commented on a diff in pull request #36845: [SPARK-39447][SQL] Avoid AssertionError in AdaptiveSparkPlanExec.doExecuteBroadcast

Posted by GitBox <gi...@apache.org>.
ulysses-you commented on code in PR #36845:
URL: https://github.com/apache/spark/pull/36845#discussion_r901197878


##########
sql/core/src/test/scala/org/apache/spark/sql/DynamicPartitionPruningSuite.scala:
##########
@@ -1616,6 +1616,23 @@ abstract class DynamicPartitionPruningSuiteBase
       assert(collectDynamicPruningExpressions(df.queryExecution.executedPlan).size === 1)
     }
   }
+
+  test("SPARK-39447: Avoid AssertionError in AdaptiveSparkPlanExec.doExecuteBroadcast") {

Review Comment:
   seems we can put this in `DynamicPartitionPruningV1SuiteAEOn` so we can avoid running redundantly



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #36845: [SPARK-39447][SQL] Avoid AssertionError in AdaptiveSparkPlanExec.doExecuteBroadcast

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #36845:
URL: https://github.com/apache/spark/pull/36845#issuecomment-1165178491

   @wangyum does the test pass now? If not, we should still implement your fix with the new framework @maryannxue added.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org