You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Nattavut Sutyanyong (JIRA)" <ji...@apache.org> on 2017/03/22 13:28:41 UTC

[jira] [Commented] (SPARK-19712) EXISTS and Left Semi join do not produce the same plan

    [ https://issues.apache.org/jira/browse/SPARK-19712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15936303#comment-15936303 ] 

Nattavut Sutyanyong commented on SPARK-19712:
---------------------------------------------

Another scenario of a missed opportunity to convert left outer join to inner join:

Query #1 - Exists subquery correlated to the right table of a left outer join
{code}
sql("select * from (select t1a, t2b from t1 left join t2 on t1a = t2a) tx where exists (select 1 from t3 where tx.t2b = t3.t3b)").explain(true)

== Optimized Logical Plan ==
Project [t1a#286, t2b#290]
+- Join LeftSemi, (t2b#290 = t3b#293)
   :- Join LeftOuter, (t1a#286 = t2a#289)
   :  :- Project [t1a#286]
   :  :  +- Relation[t1a#286,t1b#287,t1c#288] parquet
   :  +- Project [t2a#289, t2b#290]
   :     +- Relation[t2a#289,t2b#290,t2c#291] parquet
   +- Project [1 AS 1#298, t3b#293]
      +- Relation[t3a#292,t3b#293,t3c#294] parquet
{code}

Query #2 - A semantically equivalent query using left semi join
{code}
sql("select * from (select t1a, t2b from t1 left join t2 on t1a = t2a) tx left semi join t3 on tx.t2b = t3.t3b").explain(true)

== Optimized Logical Plan ==
Join LeftSemi, (t2b#248 = t3b#251)
:- Project [t1a#244, t2b#248]
:  +- Join Inner, (t1a#244 = t2a#247)
:     :- Project [t1a#244]
:     :  +- Filter isnotnull(t1a#244)
:     :     +- Relation[t1a#244,t1b#245,t1c#246] parquet
:     +- Project [t2a#247, t2b#248]
:        +- Filter (isnotnull(t2b#248) && isnotnull(t2a#247))
:           +- Relation[t2a#247,t2b#248,t2c#249] parquet
+- Project [t3b#251]
   +- Relation[t3a#250,t3b#251,t3c#252] parquet

{code}

In Query #2, the left outer join is rewritten to inner join, which can employ more join choices, and could trigger other optimizations.

> EXISTS and Left Semi join do not produce the same plan
> ------------------------------------------------------
>
>                 Key: SPARK-19712
>                 URL: https://issues.apache.org/jira/browse/SPARK-19712
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Nattavut Sutyanyong
>
> This problem was found during the development of SPARK-18874.
> The EXISTS form in the following query:
> {{sql("select * from t1 inner join t2 on t1.t1a=t2.t2a where exists (select 1 from t3 where t1.t1b=t3.t3b)")}}
> gives the optimized plan below:
> {code}
> == Optimized Logical Plan ==
> Join Inner, (t1a#7 = t2a#25)
> :- Join LeftSemi, (t1b#8 = t3b#58)
> :  :- Filter isnotnull(t1a#7)
> :  :  +- Relation[t1a#7,t1b#8,t1c#9] parquet
> :  +- Project [1 AS 1#271, t3b#58]
> :     +- Relation[t3a#57,t3b#58,t3c#59] parquet
> +- Filter isnotnull(t2a#25)
>    +- Relation[t2a#25,t2b#26,t2c#27] parquet
> {code}
> whereas a semantically equivalent Left Semi join query below:
> {{sql("select * from t1 inner join t2 on t1.t1a=t2.t2a left semi join t3 on t1.t1b=t3.t3b")}}
> gives the following optimized plan:
> {code}
> == Optimized Logical Plan ==
> Join LeftSemi, (t1b#8 = t3b#58)
> :- Join Inner, (t1a#7 = t2a#25)
> :  :- Filter (isnotnull(t1b#8) && isnotnull(t1a#7))
> :  :  +- Relation[t1a#7,t1b#8,t1c#9] parquet
> :  +- Filter isnotnull(t2a#25)
> :     +- Relation[t2a#25,t2b#26,t2c#27] parquet
> +- Project [t3b#58]
>    +- Relation[t3a#57,t3b#58,t3c#59] parquet
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org