You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "angerszhu (Jira)" <ji...@apache.org> on 2019/11/08 01:20:00 UTC

[jira] [Updated] (SPARK-29769) Spark SQL cannot handle "exists/not exists" condition when using "JOIN"

     [ https://issues.apache.org/jira/browse/SPARK-29769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angerszhu updated SPARK-29769:
------------------------------
    Description: 
 In origin master, we can'y run sql  use `EXISTS/NOT EXISTS` in Join's on condition:
{code}
create temporary view s1 as select * from values
    (1), (3), (5), (7), (9)
  as s1(id);

create temporary view s2 as select * from values
    (1), (3), (4), (6), (9)
  as s2(id);

create temporary view s3 as select * from values
    (3), (4), (6), (9)
  as s3(id);

 explain extended SELECT s1.id, s2.id as id2 FROM s1
 LEFT OUTER JOIN s2 ON s1.id = s2.id
 AND EXISTS (SELECT * FROM s3 WHERE s3.id > 6)
 
we will get

== Parsed Logical Plan ==
'Project ['s1.id, 's2.id AS id2#4]
+- 'Join LeftOuter, (('s1.id = 's2.id) && exists#3 [])
   :  +- 'Project [*]
   :     +- 'Filter ('s3.id > 6)
   :        +- 'UnresolvedRelation `s3`
   :- 'UnresolvedRelation `s1`
   +- 'UnresolvedRelation `s2`

== Analyzed Logical Plan ==
org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 pos 27;
'Project ['s1.id, 's2.id AS id2#4]
+- 'Join LeftOuter, ((id#0 = id#1) && exists#3 [])
   :  +- 'Project [*]
   :     +- 'Filter ('s3.id > 6)
   :        +- 'UnresolvedRelation `s3`
   :- SubqueryAlias `s1`
   :  +- Project [id#0]
   :     +- SubqueryAlias `s1`
   :        +- LocalRelation [id#0]
   +- SubqueryAlias `s2`
      +- Project [id#1]
         +- SubqueryAlias `s2`
            +- LocalRelation [id#1]

org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 pos 27;
'Project ['s1.id, 's2.id AS id2#4]
+- 'Join LeftOuter, ((id#0 = id#1) && exists#3 [])
   :  +- 'Project [*]
   :     +- 'Filter ('s3.id > 6)
   :        +- 'UnresolvedRelation `s3`
   :- SubqueryAlias `s1`
   :  +- Project [id#0]
   :     +- SubqueryAlias `s1`
   :        +- LocalRelation [id#0]
   +- SubqueryAlias `s2`
      +- Project [id#1]
         +- SubqueryAlias `s2`
            +- LocalRelation [id#1]

== Optimized Logical Plan ==
org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 pos 27;
'Project ['s1.id, 's2.id AS id2#4]
+- 'Join LeftOuter, ((id#0 = id#1) && exists#3 [])
   :  +- 'Project [*]
   :     +- 'Filter ('s3.id > 6)
   :        +- 'UnresolvedRelation `s3`
   :- SubqueryAlias `s1`
   :  +- Project [id#0]
   :     +- SubqueryAlias `s1`
   :        +- LocalRelation [id#0]
   +- SubqueryAlias `s2`
      +- Project [id#1]
         +- SubqueryAlias `s2`
            +- LocalRelation [id#1]

== Physical Plan ==
org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 pos 27;
'Project ['s1.id, 's2.id AS id2#4]
+- 'Join LeftOuter, ((id#0 = id#1) && exists#3 [])
   :  +- 'Project [*]
   :     +- 'Filter ('s3.id > 6)
   :        +- 'UnresolvedRelation `s3`
   :- SubqueryAlias `s1`
   :  +- Project [id#0]
   :     +- SubqueryAlias `s1`
   :        +- LocalRelation [id#0]
   +- SubqueryAlias `s2`
      +- Project [id#1]
         +- SubqueryAlias `s2`
            +- LocalRelation [id#1]
Time taken: 1.455 seconds, Fetched 1 row(s)
{code}

Since in analyzer , it won't solve join's condition's SubQuery in *Analyzer.ResolveSubquery*, then table *s3* was unresolved. 

After pr https://github.com/apache/spark/pull/25854/files

We will solve subqueries in join condition and it will pass analyzer level.

In current master, if we run sql above, we will get
{code}
 == Parsed Logical Plan ==
'Project ['s1.id, 's2.id AS id2#291]
+- 'Join LeftOuter, (('s1.id = 's2.id) AND exists#290 [])
   :  +- 'Project [*]
   :     +- 'Filter ('s3.id > 6)
   :        +- 'UnresolvedRelation [s3]
   :- 'UnresolvedRelation [s1]
   +- 'UnresolvedRelation [s2]

== Analyzed Logical Plan ==
id: int, id2: int
Project [id#244, id#250 AS id2#291]
+- Join LeftOuter, ((id#244 = id#250) AND exists#290 [])
   :  +- Project [id#256]
   :     +- Filter (id#256 > 6)
   :        +- SubqueryAlias `s3`
   :           +- Project [value#253 AS id#256]
   :              +- LocalRelation [value#253]
   :- SubqueryAlias `s1`
   :  +- Project [value#241 AS id#244]
   :     +- LocalRelation [value#241]
   +- SubqueryAlias `s2`
      +- Project [value#247 AS id#250]
         +- LocalRelation [value#247]

== Optimized Logical Plan ==
Project [id#244, id#250 AS id2#291]
+- Join LeftOuter, (exists#290 [] AND (id#244 = id#250))
   :  +- Project [value#253 AS id#256]
   :     +- Filter (value#253 > 6)
   :        +- LocalRelation [value#253]
   :- Project [value#241 AS id#244]
   :  +- LocalRelation [value#241]
   +- Project [value#247 AS id#250]
      +- LocalRelation [value#247]

== Physical Plan ==
*(2) Project [id#244, id#250 AS id2#291]
+- *(2) BroadcastHashJoin [id#244], [id#250], LeftOuter, BuildRight, exists#290 []
   :  +- Project [value#253 AS id#256]
   :     +- Filter (value#253 > 6)
   :        +- LocalRelation [value#253]
   :- *(2) Project [value#241 AS id#244]
   :  +- *(2) LocalTableScan [value#241]
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))), [id=#670]
      +- *(1) Project [value#247 AS id#250]
         +- *(1) LocalTableScan [value#247]
{code}

We can see that,  exists subquery was resolved but it haven't been changed to a SparkPlan. Then the PhysicalPlan can't execute.
You can see a LocalRelation on PhysicalPlan
{code}
+- *(2) BroadcastHashJoin [id#244], [id#250], LeftOuter, BuildRight, exists#290 []
   :  +- Project [value#253 AS id#256]
   :     +- Filter (value#253 > 6)
   :        +- LocalRelation [value#253]
{code}


Final I found that we can't use {color:red}EXISTS/NOT EXISTS{color} as *on* condition in *{color:red}LEFTE OUTER JOIN/ FULL OUTER JOIN / LEFT ANTI JOIN{color}*

> Spark SQL cannot handle "exists/not exists" condition when using "JOIN"
> -----------------------------------------------------------------------
>
>                 Key: SPARK-29769
>                 URL: https://issues.apache.org/jira/browse/SPARK-29769
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0, 3.0.0
>            Reporter: angerszhu
>            Priority: Major
>
>  In origin master, we can'y run sql  use `EXISTS/NOT EXISTS` in Join's on condition:
> {code}
> create temporary view s1 as select * from values
>     (1), (3), (5), (7), (9)
>   as s1(id);
> create temporary view s2 as select * from values
>     (1), (3), (4), (6), (9)
>   as s2(id);
> create temporary view s3 as select * from values
>     (3), (4), (6), (9)
>   as s3(id);
>  explain extended SELECT s1.id, s2.id as id2 FROM s1
>  LEFT OUTER JOIN s2 ON s1.id = s2.id
>  AND EXISTS (SELECT * FROM s3 WHERE s3.id > 6)
>  
> we will get
> == Parsed Logical Plan ==
> 'Project ['s1.id, 's2.id AS id2#4]
> +- 'Join LeftOuter, (('s1.id = 's2.id) && exists#3 [])
>    :  +- 'Project [*]
>    :     +- 'Filter ('s3.id > 6)
>    :        +- 'UnresolvedRelation `s3`
>    :- 'UnresolvedRelation `s1`
>    +- 'UnresolvedRelation `s2`
> == Analyzed Logical Plan ==
> org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 pos 27;
> 'Project ['s1.id, 's2.id AS id2#4]
> +- 'Join LeftOuter, ((id#0 = id#1) && exists#3 [])
>    :  +- 'Project [*]
>    :     +- 'Filter ('s3.id > 6)
>    :        +- 'UnresolvedRelation `s3`
>    :- SubqueryAlias `s1`
>    :  +- Project [id#0]
>    :     +- SubqueryAlias `s1`
>    :        +- LocalRelation [id#0]
>    +- SubqueryAlias `s2`
>       +- Project [id#1]
>          +- SubqueryAlias `s2`
>             +- LocalRelation [id#1]
> org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 pos 27;
> 'Project ['s1.id, 's2.id AS id2#4]
> +- 'Join LeftOuter, ((id#0 = id#1) && exists#3 [])
>    :  +- 'Project [*]
>    :     +- 'Filter ('s3.id > 6)
>    :        +- 'UnresolvedRelation `s3`
>    :- SubqueryAlias `s1`
>    :  +- Project [id#0]
>    :     +- SubqueryAlias `s1`
>    :        +- LocalRelation [id#0]
>    +- SubqueryAlias `s2`
>       +- Project [id#1]
>          +- SubqueryAlias `s2`
>             +- LocalRelation [id#1]
> == Optimized Logical Plan ==
> org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 pos 27;
> 'Project ['s1.id, 's2.id AS id2#4]
> +- 'Join LeftOuter, ((id#0 = id#1) && exists#3 [])
>    :  +- 'Project [*]
>    :     +- 'Filter ('s3.id > 6)
>    :        +- 'UnresolvedRelation `s3`
>    :- SubqueryAlias `s1`
>    :  +- Project [id#0]
>    :     +- SubqueryAlias `s1`
>    :        +- LocalRelation [id#0]
>    +- SubqueryAlias `s2`
>       +- Project [id#1]
>          +- SubqueryAlias `s2`
>             +- LocalRelation [id#1]
> == Physical Plan ==
> org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 pos 27;
> 'Project ['s1.id, 's2.id AS id2#4]
> +- 'Join LeftOuter, ((id#0 = id#1) && exists#3 [])
>    :  +- 'Project [*]
>    :     +- 'Filter ('s3.id > 6)
>    :        +- 'UnresolvedRelation `s3`
>    :- SubqueryAlias `s1`
>    :  +- Project [id#0]
>    :     +- SubqueryAlias `s1`
>    :        +- LocalRelation [id#0]
>    +- SubqueryAlias `s2`
>       +- Project [id#1]
>          +- SubqueryAlias `s2`
>             +- LocalRelation [id#1]
> Time taken: 1.455 seconds, Fetched 1 row(s)
> {code}
> Since in analyzer , it won't solve join's condition's SubQuery in *Analyzer.ResolveSubquery*, then table *s3* was unresolved. 
> After pr https://github.com/apache/spark/pull/25854/files
> We will solve subqueries in join condition and it will pass analyzer level.
> In current master, if we run sql above, we will get
> {code}
>  == Parsed Logical Plan ==
> 'Project ['s1.id, 's2.id AS id2#291]
> +- 'Join LeftOuter, (('s1.id = 's2.id) AND exists#290 [])
>    :  +- 'Project [*]
>    :     +- 'Filter ('s3.id > 6)
>    :        +- 'UnresolvedRelation [s3]
>    :- 'UnresolvedRelation [s1]
>    +- 'UnresolvedRelation [s2]
> == Analyzed Logical Plan ==
> id: int, id2: int
> Project [id#244, id#250 AS id2#291]
> +- Join LeftOuter, ((id#244 = id#250) AND exists#290 [])
>    :  +- Project [id#256]
>    :     +- Filter (id#256 > 6)
>    :        +- SubqueryAlias `s3`
>    :           +- Project [value#253 AS id#256]
>    :              +- LocalRelation [value#253]
>    :- SubqueryAlias `s1`
>    :  +- Project [value#241 AS id#244]
>    :     +- LocalRelation [value#241]
>    +- SubqueryAlias `s2`
>       +- Project [value#247 AS id#250]
>          +- LocalRelation [value#247]
> == Optimized Logical Plan ==
> Project [id#244, id#250 AS id2#291]
> +- Join LeftOuter, (exists#290 [] AND (id#244 = id#250))
>    :  +- Project [value#253 AS id#256]
>    :     +- Filter (value#253 > 6)
>    :        +- LocalRelation [value#253]
>    :- Project [value#241 AS id#244]
>    :  +- LocalRelation [value#241]
>    +- Project [value#247 AS id#250]
>       +- LocalRelation [value#247]
> == Physical Plan ==
> *(2) Project [id#244, id#250 AS id2#291]
> +- *(2) BroadcastHashJoin [id#244], [id#250], LeftOuter, BuildRight, exists#290 []
>    :  +- Project [value#253 AS id#256]
>    :     +- Filter (value#253 > 6)
>    :        +- LocalRelation [value#253]
>    :- *(2) Project [value#241 AS id#244]
>    :  +- *(2) LocalTableScan [value#241]
>    +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))), [id=#670]
>       +- *(1) Project [value#247 AS id#250]
>          +- *(1) LocalTableScan [value#247]
> {code}
> We can see that,  exists subquery was resolved but it haven't been changed to a SparkPlan. Then the PhysicalPlan can't execute.
> You can see a LocalRelation on PhysicalPlan
> {code}
> +- *(2) BroadcastHashJoin [id#244], [id#250], LeftOuter, BuildRight, exists#290 []
>    :  +- Project [value#253 AS id#256]
>    :     +- Filter (value#253 > 6)
>    :        +- LocalRelation [value#253]
> {code}
> Final I found that we can't use {color:red}EXISTS/NOT EXISTS{color} as *on* condition in *{color:red}LEFTE OUTER JOIN/ FULL OUTER JOIN / LEFT ANTI JOIN{color}*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org