Posted to issues@spark.apache.org by "Reynold Xin (JIRA)" <ji...@apache.org> on 2015/12/01 07:41:11 UTC

[jira] [Commented] (SPARK-12032) Filter can't be pushed down to correct Join because of bad order of Join

    [ https://issues.apache.org/jira/browse/SPARK-12032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15033187#comment-15033187 ] 

Reynold Xin commented on SPARK-12032:
-------------------------------------

[~marmbrus] do you mean the selinger algo?


> Filter can't be pushed down to correct Join because of bad order of Join
> ------------------------------------------------------------------------
>
>                 Key: SPARK-12032
>                 URL: https://issues.apache.org/jira/browse/SPARK-12032
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Davies Liu
>            Assignee: Davies Liu
>            Priority: Critical
>
> For this query:
> {code}
>   select d.d_year, count(*) cnt
>    FROM store_sales, date_dim d, customer c
>    WHERE ss_customer_sk = c.c_customer_sk AND c.c_first_shipto_date_sk = d.d_date_sk
>    group by d.d_year
> {code}
> Current optimized plan is
> {code}
> == Optimized Logical Plan ==
> Aggregate [d_year#147], [d_year#147,(count(1),mode=Complete,isDistinct=false) AS cnt#425L]
>  Project [d_year#147]
>   Join Inner, Some(((ss_customer_sk#283 = c_customer_sk#101) && (c_first_shipto_date_sk#106 = d_date_sk#141)))
>    Project [d_date_sk#141,d_year#147,ss_customer_sk#283]
>     Join Inner, None
>      Project [ss_customer_sk#283]
>       Relation[] ParquetRelation[store_sales]
>      Project [d_date_sk#141,d_year#147]
>       Relation[] ParquetRelation[date_dim]
>    Project [c_customer_sk#101,c_first_shipto_date_sk#106]
>     Relation[] ParquetRelation[customer]
> {code}
> It will join store_sales and date_dim together without any join condition; the condition c.c_first_shipto_date_sk = d.d_date_sk is not pushed down to that join because of the bad join order.
> The optimizer should re-order the joins so that date_dim is joined after customer; then the condition can be pushed down correctly.
> The plan should be 
> {code}
> Aggregate [d_year#147], [d_year#147,(count(1),mode=Complete,isDistinct=false) AS cnt#425L]
>  Project [d_year#147]
>   Join Inner, Some((c_first_shipto_date_sk#106 = d_date_sk#141))
>    Project [c_first_shipto_date_sk#106]
>     Join Inner, Some((ss_customer_sk#283 = c_customer_sk#101))
>      Project [ss_customer_sk#283]
>       Relation[store_sales]
>      Project [c_first_shipto_date_sk#106,c_customer_sk#101]
>       Relation[customer]
>    Project [d_year#147,d_date_sk#141]
>     Relation[date_dim]
> {code}
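The reordering idea in the quoted issue can be sketched as a greedy heuristic: always join next a relation that shares an equi-join predicate with what has already been joined, so no join is left condition-less. The sketch below is a hedged illustration of that idea only, not Spark's actual optimizer code; the function name and data structures are hypothetical.

```python
def reorder_joins(relations, predicates):
    """Order `relations` so that each join after the first is connected by
    at least one predicate, avoiding condition-less (cross) joins when
    possible.

    relations:  relation names in the original FROM-clause order.
    predicates: (left_rel, right_rel) pairs, one per equi-join condition
                (column names omitted for brevity).
    """
    remaining = list(relations)
    ordered = [remaining.pop(0)]  # keep the first relation as the anchor
    while remaining:
        joined = set(ordered)
        # Prefer a relation connected by a predicate to what is joined so far.
        for i, rel in enumerate(remaining):
            if any({a, b} & joined and rel in (a, b) for a, b in predicates):
                ordered.append(remaining.pop(i))
                break
        else:
            # No connected relation: forced cross join (the original problem).
            ordered.append(remaining.pop(0))
    return ordered

# The query from the issue: FROM store_sales, date_dim d, customer c
rels = ["store_sales", "date_dim", "customer"]
preds = [("store_sales", "customer"),  # ss_customer_sk = c_customer_sk
         ("customer", "date_dim")]     # c_first_shipto_date_sk = d_date_sk

print(reorder_joins(rels, preds))
# -> ['store_sales', 'customer', 'date_dim']
```

With this ordering, store_sales joins customer on ss_customer_sk = c_customer_sk first, and date_dim is joined last on c_first_shipto_date_sk = d_date_sk, matching the desired plan above.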



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org