You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "angerszhu (Jira)" <ji...@apache.org> on 2021/07/23 07:30:00 UTC
[jira] [Resolved] (SPARK-29769) Spark SQL cannot handle "exists/not
exists" condition when using "JOIN"
[ https://issues.apache.org/jira/browse/SPARK-29769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
angerszhu resolved SPARK-29769.
-------------------------------
Resolution: Implemented
> Spark SQL cannot handle "exists/not exists" condition when using "JOIN"
> -----------------------------------------------------------------------
>
> Key: SPARK-29769
> URL: https://issues.apache.org/jira/browse/SPARK-29769
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0, 3.0.0
> Reporter: angerszhu
> Priority: Major
>
> In origin master, we can'y run sql use `EXISTS/NOT EXISTS` in Join's on condition:
> {code}
> create temporary view s1 as select * from values
> (1), (3), (5), (7), (9)
> as s1(id);
> create temporary view s2 as select * from values
> (1), (3), (4), (6), (9)
> as s2(id);
> create temporary view s3 as select * from values
> (3), (4), (6), (9)
> as s3(id);
> explain extended SELECT s1.id, s2.id as id2 FROM s1
> LEFT OUTER JOIN s2 ON s1.id = s2.id
> AND EXISTS (SELECT * FROM s3 WHERE s3.id > 6)
>
> we will get
> == Parsed Logical Plan ==
> 'Project ['s1.id, 's2.id AS id2#4]
> +- 'Join LeftOuter, (('s1.id = 's2.id) && exists#3 [])
> : +- 'Project [*]
> : +- 'Filter ('s3.id > 6)
> : +- 'UnresolvedRelation `s3`
> :- 'UnresolvedRelation `s1`
> +- 'UnresolvedRelation `s2`
> == Analyzed Logical Plan ==
> org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 pos 27;
> 'Project ['s1.id, 's2.id AS id2#4]
> +- 'Join LeftOuter, ((id#0 = id#1) && exists#3 [])
> : +- 'Project [*]
> : +- 'Filter ('s3.id > 6)
> : +- 'UnresolvedRelation `s3`
> :- SubqueryAlias `s1`
> : +- Project [id#0]
> : +- SubqueryAlias `s1`
> : +- LocalRelation [id#0]
> +- SubqueryAlias `s2`
> +- Project [id#1]
> +- SubqueryAlias `s2`
> +- LocalRelation [id#1]
> org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 pos 27;
> 'Project ['s1.id, 's2.id AS id2#4]
> +- 'Join LeftOuter, ((id#0 = id#1) && exists#3 [])
> : +- 'Project [*]
> : +- 'Filter ('s3.id > 6)
> : +- 'UnresolvedRelation `s3`
> :- SubqueryAlias `s1`
> : +- Project [id#0]
> : +- SubqueryAlias `s1`
> : +- LocalRelation [id#0]
> +- SubqueryAlias `s2`
> +- Project [id#1]
> +- SubqueryAlias `s2`
> +- LocalRelation [id#1]
> == Optimized Logical Plan ==
> org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 pos 27;
> 'Project ['s1.id, 's2.id AS id2#4]
> +- 'Join LeftOuter, ((id#0 = id#1) && exists#3 [])
> : +- 'Project [*]
> : +- 'Filter ('s3.id > 6)
> : +- 'UnresolvedRelation `s3`
> :- SubqueryAlias `s1`
> : +- Project [id#0]
> : +- SubqueryAlias `s1`
> : +- LocalRelation [id#0]
> +- SubqueryAlias `s2`
> +- Project [id#1]
> +- SubqueryAlias `s2`
> +- LocalRelation [id#1]
> == Physical Plan ==
> org.apache.spark.sql.AnalysisException: Table or view not found: `s3`; line 3 pos 27;
> 'Project ['s1.id, 's2.id AS id2#4]
> +- 'Join LeftOuter, ((id#0 = id#1) && exists#3 [])
> : +- 'Project [*]
> : +- 'Filter ('s3.id > 6)
> : +- 'UnresolvedRelation `s3`
> :- SubqueryAlias `s1`
> : +- Project [id#0]
> : +- SubqueryAlias `s1`
> : +- LocalRelation [id#0]
> +- SubqueryAlias `s2`
> +- Project [id#1]
> +- SubqueryAlias `s2`
> +- LocalRelation [id#1]
> Time taken: 1.455 seconds, Fetched 1 row(s)
> {code}
> Since in analyzer , it won't solve join's condition's SubQuery in *Analyzer.ResolveSubquery*, then table *s3* was unresolved.
> After pr https://github.com/apache/spark/pull/25854/files
> We will solve subqueries in join condition and it will pass analyzer level.
> In current master, if we run sql above, we will get
> {code}
> == Parsed Logical Plan ==
> 'Project ['s1.id, 's2.id AS id2#291]
> +- 'Join LeftOuter, (('s1.id = 's2.id) AND exists#290 [])
> : +- 'Project [*]
> : +- 'Filter ('s3.id > 6)
> : +- 'UnresolvedRelation [s3]
> :- 'UnresolvedRelation [s1]
> +- 'UnresolvedRelation [s2]
> == Analyzed Logical Plan ==
> id: int, id2: int
> Project [id#244, id#250 AS id2#291]
> +- Join LeftOuter, ((id#244 = id#250) AND exists#290 [])
> : +- Project [id#256]
> : +- Filter (id#256 > 6)
> : +- SubqueryAlias `s3`
> : +- Project [value#253 AS id#256]
> : +- LocalRelation [value#253]
> :- SubqueryAlias `s1`
> : +- Project [value#241 AS id#244]
> : +- LocalRelation [value#241]
> +- SubqueryAlias `s2`
> +- Project [value#247 AS id#250]
> +- LocalRelation [value#247]
> == Optimized Logical Plan ==
> Project [id#244, id#250 AS id2#291]
> +- Join LeftOuter, (exists#290 [] AND (id#244 = id#250))
> : +- Project [value#253 AS id#256]
> : +- Filter (value#253 > 6)
> : +- LocalRelation [value#253]
> :- Project [value#241 AS id#244]
> : +- LocalRelation [value#241]
> +- Project [value#247 AS id#250]
> +- LocalRelation [value#247]
> == Physical Plan ==
> *(2) Project [id#244, id#250 AS id2#291]
> +- *(2) BroadcastHashJoin [id#244], [id#250], LeftOuter, BuildRight, exists#290 []
> : +- Project [value#253 AS id#256]
> : +- Filter (value#253 > 6)
> : +- LocalRelation [value#253]
> :- *(2) Project [value#241 AS id#244]
> : +- *(2) LocalTableScan [value#241]
> +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))), [id=#670]
> +- *(1) Project [value#247 AS id#250]
> +- *(1) LocalTableScan [value#247]
> {code}
> We can see that, exists subquery was resolved but it haven't been changed to a SparkPlan. Then the PhysicalPlan can't execute.
> You can see a LocalRelation on PhysicalPlan
> {code}
> +- *(2) BroadcastHashJoin [id#244], [id#250], LeftOuter, BuildRight, exists#290 []
> : +- Project [value#253 AS id#256]
> : +- Filter (value#253 > 6)
> : +- LocalRelation [value#253]
> {code}
> Final I found that we can't use {color:red}EXISTS/NOT EXISTS{color} as *on* condition in *{color:red}LEFTE OUTER JOIN/ FULL OUTER JOIN / LEFT ANTI JOIN{color}*
> Because in Optimizer's *PushPredicateThroughJoin*, it can't cover these case.
> Make a pr for support these join type.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org