You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by "bersprockets (via GitHub)" <gi...@apache.org> on 2023/12/06 23:49:04 UTC

[PR] [SPARK-45580][SQL][3.4] Handle case where a nested subquery becomes an existence join [spark]

bersprockets opened a new pull request, #44219:
URL: https://github.com/apache/spark/pull/44219

   ### What changes were proposed in this pull request?
   
   This is a back-port of https://github.com/apache/spark/pull/44193.
   
   In `RewritePredicateSubquery`, prune existence flags from the final join when `rewriteExistentialExpr` returns an existence join. This change prunes the flags (attributes with the name "exists") by adding a `Project` node.
   
   For example:
   ```
   Join LeftSemi, ((a#13 = c1#15) OR exists#19)
   :- Join ExistenceJoin(exists#19), (a#13 = col1#17)
   :  :- LocalRelation [a#13]
   :  +- LocalRelation [col1#17]
   +- LocalRelation [c1#15]
   ```
   becomes
   ```
   Project [a#13]
   +- Join LeftSemi, ((a#13 = c1#15) OR exists#19)
      :- Join ExistenceJoin(exists#19), (a#13 = col1#17)
      :  :- LocalRelation [a#13]
      :  +- LocalRelation [col1#17]
      +- LocalRelation [c1#15]
   ```
   This change always adds the `Project` node, whether `rewriteExistentialExpr` returns an existence join or not. In the case when `rewriteExistentialExpr` does not return an existence join, `RemoveNoopOperators` will remove the unneeded `Project` node.
   
   ### Why are the changes needed?
   
   This query returns an extraneous boolean column when run in spark-sql:
   ```
   create or replace temp view t1(a) as values (1), (2), (3), (7);
   create or replace temp view t2(c1) as values (1), (2), (3);
   create or replace temp view t3(col1) as values (3), (9);
   
   select *
   from t1
   where exists (
     select c1
     from t2
     where a = c1
     or a in (select col1 from t3)
   );
   
   1	false
   2	false
   3	true
   ```
   (Note: the above query will not have the extraneous boolean column when run from the Dataset API. That is because the Dataset API truncates the rows based on the schema of the analyzed plan. The bug occurs during optimization).
   
   This query fails when run in either spark-sql or using the Dataset API:
   ```
   select (
     select *
     from t1
     where exists (
       select c1
       from t2
       where a = c1
       or a in (select col1 from t3)
     )
     limit 1
   )
   from range(1);
   
   java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; something went wrong in analysis
   ```
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, except for the removal of the extraneous boolean flag and the fix to the error condition.
   
   ### How was this patch tested?
   
   New unit test.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45580][SQL][3.4] Handle case where a nested subquery becomes an existence join [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun closed pull request #44219: [SPARK-45580][SQL][3.4] Handle case where a nested subquery becomes an existence join
URL: https://github.com/apache/spark/pull/44219


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Re: [PR] [SPARK-45580][SQL][3.4] Handle case where a nested subquery becomes an existence join [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on PR #44219:
URL: https://github.com/apache/spark/pull/44219#issuecomment-1844193661

   Merged to branch-3.4 for Apache Spark 3.4.3.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org