You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "bersprockets (via GitHub)" <gi...@apache.org> on 2023/03/27 22:02:21 UTC

[GitHub] [spark] bersprockets opened a new pull request, #40569: [SPARK-42937][SQL] `PlanSubqueries` should set `InSubqueryExec#shouldBroadcast` to true

bersprockets opened a new pull request, #40569:
URL: https://github.com/apache/spark/pull/40569

   ### What changes were proposed in this pull request?
   
   Change `PlanSubqueries` to set `shouldBroadcast` to true when instantiating an `InSubqueryExec` instance.
   
   ### Why are the changes needed?
   
   The below left outer join gets an error:
   ```
   create or replace temp view v1 as
   select * from values
   (1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
   (2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2),
   (3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
   as v1(key, value1, value2, value3, value4, value5, value6, value7, value8, value9, value10);
   
   create or replace temp view v2 as
   select * from values
   (1, 2),
   (3, 8),
   (7, 9)
   as v2(a, b);
   
   create or replace temp view v3 as
   select * from values
   (3),
   (8)
   as v3(col1);
   
   set spark.sql.codegen.maxFields=10; -- let's make maxFields 10 instead of 100
   set spark.sql.adaptive.enabled=false;
   
   select *
   from v1
   left outer join v2
   on key = a
   and key in (select col1 from v3);
   ```
   The join fails during predicate codegen:
   ```
   23/03/27 12:24:12 WARN Predicate: Expr codegen error and falling back to interpreter mode
   java.lang.IllegalArgumentException: requirement failed: input[0, int, false] IN subquery#34 has not finished
   	at scala.Predef$.require(Predef.scala:281)
   	at org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144)
   	at org.apache.spark.sql.execution.InSubqueryExec.doGenCode(subquery.scala:156)
   	at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:201)
   	at scala.Option.getOrElse(Option.scala:189)
   	at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:196)
   	at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.$anonfun$generateExpressions$2(CodeGenerator.scala:1278)
   	at scala.collection.immutable.List.map(List.scala:293)
   	at org.apache.spark.sql.catalyst.expressions.codegen.CodegenContext.generateExpressions(CodeGenerator.scala:1278)
   	at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.create(GeneratePredicate.scala:41)
   	at org.apache.spark.sql.catalyst.expressions.codegen.GeneratePredicate$.generate(GeneratePredicate.scala:33)
   	at org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:73)
   	at org.apache.spark.sql.catalyst.expressions.Predicate$.createCodeGeneratedObject(predicates.scala:70)
   	at org.apache.spark.sql.catalyst.expressions.CodeGeneratorWithInterpretedFallback.createObject(CodeGeneratorWithInterpretedFallback.scala:51)
   	at org.apache.spark.sql.catalyst.expressions.Predicate$.create(predicates.scala:86)
   	at org.apache.spark.sql.execution.joins.HashJoin.boundCondition(HashJoin.scala:146)
   	at org.apache.spark.sql.execution.joins.HashJoin.boundCondition$(HashJoin.scala:140)
   	at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition$lzycompute(BroadcastHashJoinExec.scala:40)
   	at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.boundCondition(BroadcastHashJoinExec.scala:40)
   ```
   It fails again after fallback to interpreter mode:
   ```
   23/03/27 12:24:12 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
   java.lang.IllegalArgumentException: requirement failed: input[0, int, false] IN subquery#34 has not finished
   	at scala.Predef$.require(Predef.scala:281)
   	at org.apache.spark.sql.execution.InSubqueryExec.prepareResult(subquery.scala:144)
   	at org.apache.spark.sql.execution.InSubqueryExec.eval(subquery.scala:151)
   	at org.apache.spark.sql.catalyst.expressions.InterpretedPredicate.eval(predicates.scala:52)
   	at org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2(HashJoin.scala:146)
   	at org.apache.spark.sql.execution.joins.HashJoin.$anonfun$boundCondition$2$adapted(HashJoin.scala:146)
   	at org.apache.spark.sql.execution.joins.HashJoin.$anonfun$outerJoin$1(HashJoin.scala:205)
   ```
   Both the predicate codegen and the evaluation fail for the same reason: `PlanSubqueries` creates `InSubqueryExec` with `shouldBroadcast=false`. The driver waits for the subquery to finish, but it's the executor that uses the results of the subquery (for predicate codegen or evaluation). Because `shouldBroadcast` is set to false, the result is stored in a transient field (`InSubqueryExec#result`), so the result of the subquery is not serialized when the `InSubqueryExec` instance is sent to the executor.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   New unit test.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #40569: [SPARK-42937][SQL] `PlanSubqueries` should set `InSubqueryExec#shouldBroadcast` to true

Posted by "HyukjinKwon (via GitHub)" <gi...@apache.org>.
HyukjinKwon commented on PR #40569:
URL: https://github.com/apache/spark/pull/40569#issuecomment-1486087962

   cc @allisonwang-db 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] allisonwang-db commented on a diff in pull request #40569: [SPARK-42937][SQL] `PlanSubqueries` should set `InSubqueryExec#shouldBroadcast` to true

Posted by "allisonwang-db (via GitHub)" <gi...@apache.org>.
allisonwang-db commented on code in PR #40569:
URL: https://github.com/apache/spark/pull/40569#discussion_r1149956290


##########
sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala:
##########
@@ -2695,4 +2695,26 @@ class SubquerySuite extends QueryTest
       }
     }
   }
+
+  test("SPARK-42937: Outer join with subquery in condition") {
+    withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false",

Review Comment:
   Why does this fail only when AQE is disabled?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] ulysses-you commented on pull request #40569: [SPARK-42937][SQL] `PlanSubqueries` should set `InSubqueryExec#shouldBroadcast` to true

Posted by "ulysses-you (via GitHub)" <gi...@apache.org>.
ulysses-you commented on PR #40569:
URL: https://github.com/apache/spark/pull/40569#issuecomment-1486295355

   lgtm, it seems we missed update non-AQE part when we optimize out the DDP broadcast value in https://github.com/apache/spark/pull/34051.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] bersprockets commented on a diff in pull request #40569: [SPARK-42937][SQL] `PlanSubqueries` should set `InSubqueryExec#shouldBroadcast` to true

Posted by "bersprockets (via GitHub)" <gi...@apache.org>.
bersprockets commented on code in PR #40569:
URL: https://github.com/apache/spark/pull/40569#discussion_r1149995693


##########
sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala:
##########
@@ -2695,4 +2695,26 @@ class SubquerySuite extends QueryTest
       }
     }
   }
+
+  test("SPARK-42937: Outer join with subquery in condition") {
+    withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false",

Review Comment:
   @allisonwang-db 
   
   When adaptive execution is enabled, `PlanAdaptiveSubqueries` always sets `shouldBroadcast=true`, so the subquery's result is available on the executor, if needed.
   
   https://github.com/apache/spark/blob/31965a06c9f85abf2296971237b1f88065eb67c2/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/PlanAdaptiveSubqueries.scala#L46



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #40569: [SPARK-42937][SQL] `PlanSubqueries` should set `InSubqueryExec#shouldBroadcast` to true

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #40569:
URL: https://github.com/apache/spark/pull/40569#issuecomment-1486818021

   According to the Affected Versions in Jira, I merged this to master/3.4/3.3.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun closed pull request #40569: [SPARK-42937][SQL] `PlanSubqueries` should set `InSubqueryExec#shouldBroadcast` to true

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun closed pull request #40569: [SPARK-42937][SQL] `PlanSubqueries` should set `InSubqueryExec#shouldBroadcast` to true
URL: https://github.com/apache/spark/pull/40569


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] bersprockets commented on pull request #40569: [SPARK-42937][SQL] `PlanSubqueries` should set `InSubqueryExec#shouldBroadcast` to true

Posted by "bersprockets (via GitHub)" <gi...@apache.org>.
bersprockets commented on PR #40569:
URL: https://github.com/apache/spark/pull/40569#issuecomment-1486268124

   I updated the description to mention that the issue occurs only when both wholestage codegen and adaptive execution is disabled.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org