You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "angerszhu (Jira)" <ji...@apache.org> on 2019/11/08 01:20:00 UTC
[jira] [Updated] (SPARK-29769) Spark SQL cannot handle "exists/not
exists" condition when using "JOIN"
[ https://issues.apache.org/jira/browse/SPARK-29769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
angerszhu updated SPARK-29769:
------------------------------
Description:
In origin master, we can'y run sql use `EXISTS/NOT EXISTS` in Join's on condition:
{code}
create temporary view s1 as select * from values
(1), (3), (5), (7), (9)
as s1(id);
create temporary view s2 as select * from values
(1), (3), (4), (6), (9)
as s2(id);
create temporary view s3 as select * from values
(3), (4), (6), (9)
as s3(id);
explain extended SELECT s1.id, s2.id as id2 FROM s1
LEFT OUTER JOIN s2 ON s1.id = s2.id
AND EXISTS (SELECT * FROM s3 WHERE s3.id > 6)
we will get
== Parsed Logical Plan ==
'Project ['s1.id, 's2.id AS id2#4]
+- 'Join LeftOuter, (('s1.id = 's2.id) && exists#3 [])
: +- 'Project [*]
: +- 'Filter ('s3.id > 6)
: +- 'UnresolvedRelation `s3`
:- 'UnresolvedRelation `s1`
+- 'UnresolvedRelation `s2`
== Analyzed Logical Plan ==
org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 pos 27;
'Project ['s1.id, 's2.id AS id2#4]
+- 'Join LeftOuter, ((id#0 = id#1) && exists#3 [])
: +- 'Project [*]
: +- 'Filter ('s3.id > 6)
: +- 'UnresolvedRelation `s3`
:- SubqueryAlias `s1`
: +- Project [id#0]
: +- SubqueryAlias `s1`
: +- LocalRelation [id#0]
+- SubqueryAlias `s2`
+- Project [id#1]
+- SubqueryAlias `s2`
+- LocalRelation [id#1]
org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 pos 27;
'Project ['s1.id, 's2.id AS id2#4]
+- 'Join LeftOuter, ((id#0 = id#1) && exists#3 [])
: +- 'Project [*]
: +- 'Filter ('s3.id > 6)
: +- 'UnresolvedRelation `s3`
:- SubqueryAlias `s1`
: +- Project [id#0]
: +- SubqueryAlias `s1`
: +- LocalRelation [id#0]
+- SubqueryAlias `s2`
+- Project [id#1]
+- SubqueryAlias `s2`
+- LocalRelation [id#1]
== Optimized Logical Plan ==
org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 pos 27;
'Project ['s1.id, 's2.id AS id2#4]
+- 'Join LeftOuter, ((id#0 = id#1) && exists#3 [])
: +- 'Project [*]
: +- 'Filter ('s3.id > 6)
: +- 'UnresolvedRelation `s3`
:- SubqueryAlias `s1`
: +- Project [id#0]
: +- SubqueryAlias `s1`
: +- LocalRelation [id#0]
+- SubqueryAlias `s2`
+- Project [id#1]
+- SubqueryAlias `s2`
+- LocalRelation [id#1]
== Physical Plan ==
org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 pos 27;
'Project ['s1.id, 's2.id AS id2#4]
+- 'Join LeftOuter, ((id#0 = id#1) && exists#3 [])
: +- 'Project [*]
: +- 'Filter ('s3.id > 6)
: +- 'UnresolvedRelation `s3`
:- SubqueryAlias `s1`
: +- Project [id#0]
: +- SubqueryAlias `s1`
: +- LocalRelation [id#0]
+- SubqueryAlias `s2`
+- Project [id#1]
+- SubqueryAlias `s2`
+- LocalRelation [id#1]
Time taken: 1.455 seconds, Fetched 1 row(s)
{code}
Since in analyzer , it won't solve join's condition's SubQuery in *Analyzer.ResolveSubquery*, then table *s3* was unresolved.
After pr https://github.com/apache/spark/pull/25854/files
We will solve subqueries in join condition and it will pass analyzer level.
In current master, if we run sql above, we will get
{code}
== Parsed Logical Plan ==
'Project ['s1.id, 's2.id AS id2#291]
+- 'Join LeftOuter, (('s1.id = 's2.id) AND exists#290 [])
: +- 'Project [*]
: +- 'Filter ('s3.id > 6)
: +- 'UnresolvedRelation [s3]
:- 'UnresolvedRelation [s1]
+- 'UnresolvedRelation [s2]
== Analyzed Logical Plan ==
id: int, id2: int
Project [id#244, id#250 AS id2#291]
+- Join LeftOuter, ((id#244 = id#250) AND exists#290 [])
: +- Project [id#256]
: +- Filter (id#256 > 6)
: +- SubqueryAlias `s3`
: +- Project [value#253 AS id#256]
: +- LocalRelation [value#253]
:- SubqueryAlias `s1`
: +- Project [value#241 AS id#244]
: +- LocalRelation [value#241]
+- SubqueryAlias `s2`
+- Project [value#247 AS id#250]
+- LocalRelation [value#247]
== Optimized Logical Plan ==
Project [id#244, id#250 AS id2#291]
+- Join LeftOuter, (exists#290 [] AND (id#244 = id#250))
: +- Project [value#253 AS id#256]
: +- Filter (value#253 > 6)
: +- LocalRelation [value#253]
:- Project [value#241 AS id#244]
: +- LocalRelation [value#241]
+- Project [value#247 AS id#250]
+- LocalRelation [value#247]
== Physical Plan ==
*(2) Project [id#244, id#250 AS id2#291]
+- *(2) BroadcastHashJoin [id#244], [id#250], LeftOuter, BuildRight, exists#290 []
: +- Project [value#253 AS id#256]
: +- Filter (value#253 > 6)
: +- LocalRelation [value#253]
:- *(2) Project [value#241 AS id#244]
: +- *(2) LocalTableScan [value#241]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))), [id=#670]
+- *(1) Project [value#247 AS id#250]
+- *(1) LocalTableScan [value#247]
{code}
We can see that, exists subquery was resolved but it haven't been changed to a SparkPlan. Then the PhysicalPlan can't execute.
You can see a LocalRelation on PhysicalPlan
{code}
+- *(2) BroadcastHashJoin [id#244], [id#250], LeftOuter, BuildRight, exists#290 []
: +- Project [value#253 AS id#256]
: +- Filter (value#253 > 6)
: +- LocalRelation [value#253]
{code}
Final I found that we can't use {color:red}EXISTS/NOT EXISTS{color} as *on* condition in *{color:red}LEFTE OUTER JOIN/ FULL OUTER JOIN / LEFT ANTI JOIN{color}*
> Spark SQL cannot handle "exists/not exists" condition when using "JOIN"
> -----------------------------------------------------------------------
>
> Key: SPARK-29769
> URL: https://issues.apache.org/jira/browse/SPARK-29769
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0, 3.0.0
> Reporter: angerszhu
> Priority: Major
>
> In origin master, we can'y run sql use `EXISTS/NOT EXISTS` in Join's on condition:
> {code}
> create temporary view s1 as select * from values
> (1), (3), (5), (7), (9)
> as s1(id);
> create temporary view s2 as select * from values
> (1), (3), (4), (6), (9)
> as s2(id);
> create temporary view s3 as select * from values
> (3), (4), (6), (9)
> as s3(id);
> explain extended SELECT s1.id, s2.id as id2 FROM s1
> LEFT OUTER JOIN s2 ON s1.id = s2.id
> AND EXISTS (SELECT * FROM s3 WHERE s3.id > 6)
>
> we will get
> == Parsed Logical Plan ==
> 'Project ['s1.id, 's2.id AS id2#4]
> +- 'Join LeftOuter, (('s1.id = 's2.id) && exists#3 [])
> : +- 'Project [*]
> : +- 'Filter ('s3.id > 6)
> : +- 'UnresolvedRelation `s3`
> :- 'UnresolvedRelation `s1`
> +- 'UnresolvedRelation `s2`
> == Analyzed Logical Plan ==
> org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 pos 27;
> 'Project ['s1.id, 's2.id AS id2#4]
> +- 'Join LeftOuter, ((id#0 = id#1) && exists#3 [])
> : +- 'Project [*]
> : +- 'Filter ('s3.id > 6)
> : +- 'UnresolvedRelation `s3`
> :- SubqueryAlias `s1`
> : +- Project [id#0]
> : +- SubqueryAlias `s1`
> : +- LocalRelation [id#0]
> +- SubqueryAlias `s2`
> +- Project [id#1]
> +- SubqueryAlias `s2`
> +- LocalRelation [id#1]
> org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 pos 27;
> 'Project ['s1.id, 's2.id AS id2#4]
> +- 'Join LeftOuter, ((id#0 = id#1) && exists#3 [])
> : +- 'Project [*]
> : +- 'Filter ('s3.id > 6)
> : +- 'UnresolvedRelation `s3`
> :- SubqueryAlias `s1`
> : +- Project [id#0]
> : +- SubqueryAlias `s1`
> : +- LocalRelation [id#0]
> +- SubqueryAlias `s2`
> +- Project [id#1]
> +- SubqueryAlias `s2`
> +- LocalRelation [id#1]
> == Optimized Logical Plan ==
> org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 pos 27;
> 'Project ['s1.id, 's2.id AS id2#4]
> +- 'Join LeftOuter, ((id#0 = id#1) && exists#3 [])
> : +- 'Project [*]
> : +- 'Filter ('s3.id > 6)
> : +- 'UnresolvedRelation `s3`
> :- SubqueryAlias `s1`
> : +- Project [id#0]
> : +- SubqueryAlias `s1`
> : +- LocalRelation [id#0]
> +- SubqueryAlias `s2`
> +- Project [id#1]
> +- SubqueryAlias `s2`
> +- LocalRelation [id#1]
> == Physical Plan ==
> org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 pos 27;
> 'Project ['s1.id, 's2.id AS id2#4]
> +- 'Join LeftOuter, ((id#0 = id#1) && exists#3 [])
> : +- 'Project [*]
> : +- 'Filter ('s3.id > 6)
> : +- 'UnresolvedRelation `s3`
> :- SubqueryAlias `s1`
> : +- Project [id#0]
> : +- SubqueryAlias `s1`
> : +- LocalRelation [id#0]
> +- SubqueryAlias `s2`
> +- Project [id#1]
> +- SubqueryAlias `s2`
> +- LocalRelation [id#1]
> Time taken: 1.455 seconds, Fetched 1 row(s)
> {code}
> Since in analyzer , it won't solve join's condition's SubQuery in *Analyzer.ResolveSubquery*, then table *s3* was unresolved.
> After pr https://github.com/apache/spark/pull/25854/files
> We will solve subqueries in join condition and it will pass analyzer level.
> In current master, if we run sql above, we will get
> {code}
> == Parsed Logical Plan ==
> 'Project ['s1.id, 's2.id AS id2#291]
> +- 'Join LeftOuter, (('s1.id = 's2.id) AND exists#290 [])
> : +- 'Project [*]
> : +- 'Filter ('s3.id > 6)
> : +- 'UnresolvedRelation [s3]
> :- 'UnresolvedRelation [s1]
> +- 'UnresolvedRelation [s2]
> == Analyzed Logical Plan ==
> id: int, id2: int
> Project [id#244, id#250 AS id2#291]
> +- Join LeftOuter, ((id#244 = id#250) AND exists#290 [])
> : +- Project [id#256]
> : +- Filter (id#256 > 6)
> : +- SubqueryAlias `s3`
> : +- Project [value#253 AS id#256]
> : +- LocalRelation [value#253]
> :- SubqueryAlias `s1`
> : +- Project [value#241 AS id#244]
> : +- LocalRelation [value#241]
> +- SubqueryAlias `s2`
> +- Project [value#247 AS id#250]
> +- LocalRelation [value#247]
> == Optimized Logical Plan ==
> Project [id#244, id#250 AS id2#291]
> +- Join LeftOuter, (exists#290 [] AND (id#244 = id#250))
> : +- Project [value#253 AS id#256]
> : +- Filter (value#253 > 6)
> : +- LocalRelation [value#253]
> :- Project [value#241 AS id#244]
> : +- LocalRelation [value#241]
> +- Project [value#247 AS id#250]
> +- LocalRelation [value#247]
> == Physical Plan ==
> *(2) Project [id#244, id#250 AS id2#291]
> +- *(2) BroadcastHashJoin [id#244], [id#250], LeftOuter, BuildRight, exists#290 []
> : +- Project [value#253 AS id#256]
> : +- Filter (value#253 > 6)
> : +- LocalRelation [value#253]
> :- *(2) Project [value#241 AS id#244]
> : +- *(2) LocalTableScan [value#241]
> +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))), [id=#670]
> +- *(1) Project [value#247 AS id#250]
> +- *(1) LocalTableScan [value#247]
> {code}
> We can see that, exists subquery was resolved but it haven't been changed to a SparkPlan. Then the PhysicalPlan can't execute.
> You can see a LocalRelation on PhysicalPlan
> {code}
> +- *(2) BroadcastHashJoin [id#244], [id#250], LeftOuter, BuildRight, exists#290 []
> : +- Project [value#253 AS id#256]
> : +- Filter (value#253 > 6)
> : +- LocalRelation [value#253]
> {code}
> Final I found that we can't use {color:red}EXISTS/NOT EXISTS{color} as *on* condition in *{color:red}LEFTE OUTER JOIN/ FULL OUTER JOIN / LEFT ANTI JOIN{color}*
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org