You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Reynold Xin (JIRA)" <ji...@apache.org> on 2016/01/12 03:13:39 UTC

[jira] [Closed] (SPARK-12602) Join Reordering: Pushing Inner Join Through Outer Join

     [ https://issues.apache.org/jira/browse/SPARK-12602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin closed SPARK-12602.
-------------------------------
    Resolution: Later

Closing as later. Will revisit when we have more review bandwidth.


> Join Reordering: Pushing Inner Join Through Outer Join
> ------------------------------------------------------
>
>                 Key: SPARK-12602
>                 URL: https://issues.apache.org/jira/browse/SPARK-12602
>             Project: Spark
>          Issue Type: Improvement
>          Components: Optimizer, SQL
>    Affects Versions: 1.6.0
>            Reporter: Xiao Li
>            Priority: Critical
>
> If applicable, we can push Inner Join through Outer Join. The basic idea is built on the associativity property of outer and inner joins:
> {code}
> R1 inner (R2 left R3 on p23) on p12 = (R1 inner R2 on p12) left R3 on p23
> R1 inner (R2 right R3 on p23) on p13 = R2 right (R1 inner R3 on p13) on p23 = (R1 inner R3 on p13) left R2 on p23
> (R1 left R2 on p12) inner R3 on p13 = (R1 inner R3 on p13) left R2 on p12
> (R1 right R2 on p12) inner R3 on p23 = R1 right (R2 inner R3 on p23) on p12 = (R2 inner R3 on p23) left R1 on p12
> {code}
> The reordering can reduce the number of processed rows since the Inner Join always can generate less (or equivalent) rows than Left/Right Outer Join. This change can improve the query performance in most cases.
> When cost-based optimization is available, we can switch the order of tables in each join type based on their costs. The order of joined tables in the inner join does not affect the results and the right outer join can be changed to the left outer join. This part is out of scope here.
> For example, given the following eligible query:
> {code}df.join(df2, $"a.int" === $"b.int", "right").join(df3, $"c.int" === $"b.int", "inner"){code}
> Before the fix, the logical plan is like
> {code}
> Join Inner, Some((int#15 = int#9))
> :- Join RightOuter, Some((int#3 = int#9))
> :  :- LocalRelation [int#3,int2#4,str#5], [[1,2,1],[3,4,3]]
> :  +- LocalRelation [int#9,int2#10,str#11], [[1,3,1],[5,6,5]]
> +- LocalRelation [int#15,int2#16,str#17], [[1,9,8],[5,0,4]]
> {code}
> After the fix, the logical plan is like
> {code}
> Join LeftOuter, Some((int#3 = int#9))
> :- Join Inner, Some((int#15 = int#9))
> :  :- LocalRelation [int#9,int2#10,str#11], [[1,3,1],[5,6,5]]
> :  +- LocalRelation [int#15,int2#16,str#17], [[1,9,8],[5,0,4]]
> +- LocalRelation [int#3,int2#4,str#5], [[1,2,1],[3,4,3]]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org