Posted to issues@spark.apache.org by "Reynold Xin (JIRA)" <ji...@apache.org> on 2016/10/07 18:03:20 UTC

[jira] [Closed] (SPARK-17375) Star Join Optimization

     [ https://issues.apache.org/jira/browse/SPARK-17375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin closed SPARK-17375.
-------------------------------
    Resolution: Duplicate

Marking this as duplicate of SPARK-17626

> Star Join Optimization
> ----------------------
>
>                 Key: SPARK-17375
>                 URL: https://issues.apache.org/jira/browse/SPARK-17375
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Yan
>
> The star schema is the simplest style of data mart schema and is the approach often seen in BI/Decision Support systems. Star Join is a popular SQL query pattern that joins one or a few fact tables with a few dimension tables in a star schema. Star Join query optimizations aim to improve the performance and resource usage of star joins.
> Currently, Spark SQL optimizes such queries by broadcasting the dimension tables, which are usually small after filtering and projection, to avoid the costly shuffling of the fact table and the "reduce" operations based on the join keys.
> This improvement proposal tries to further improve broadcast star joins in two areas:
> 1) avoid materializing intermediate rows that would never make it into the final result set because a later join with a more restrictive dimension eliminates them;
> 2) avoid performance variations among different join orders. This could also be largely achieved by using cost analysis and heuristics to select a reasonably optimal join order, but here we are trying to achieve a similar improvement without relying on such information.
> A preliminary test against a small TPCDS 1GB data set indicates a 5%-40% improvement (with codegen disabled in both tests) over multiple broadcast joins on one query (Q27) that inner joins 4 dimension tables with one fact table. The large variation (5%-40%) is due to the different join orderings of the 4 broadcast joins. Tests using larger data sets and other TPCDS queries are yet to be performed.
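
The two goals in the description above can be illustrated with a small, self-contained sketch in plain Python. It is not Spark code and not the reporter's implementation. The toy fact table, the dimension key sets, and the function names `cascaded_joins` and `single_pass_star_join` are all hypothetical, and each "broadcast" dimension is reduced to the set of keys that survive its filter. The sketch counts how many rows each strategy materializes: a chain of joins depends on the join order, while probing all dimensions per fact row does not.

```python
from itertools import permutations

# Hypothetical toy data: one fact table and three "broadcast" dimensions,
# each already filtered down to the set of join keys that survive.
FACT = [
    {"item": i % 10, "store": i % 4, "date": i % 7, "qty": i}
    for i in range(1000)
]
DIMS = {
    "item": set(range(10)),  # unrestrictive: every fact row matches
    "store": {0, 1},         # half of the fact rows match
    "date": {3},             # most restrictive: one row in seven matches
}


def cascaded_joins(fact, dims, order):
    """Join one dimension at a time, materializing every intermediate result."""
    rows, materialized = fact, 0
    for key in order:
        rows = [r for r in rows if r[key] in dims[key]]
        materialized += len(rows)  # rows written out by this join step
    return rows, materialized


def single_pass_star_join(fact, dims):
    """Probe all dimensions for each fact row; emit only rows matching all."""
    rows = [r for r in fact if all(r[k] in keys for k, keys in dims.items())]
    return rows, len(rows)


if __name__ == "__main__":
    expected, emitted = single_pass_star_join(FACT, DIMS)
    print("single pass:", emitted, "rows materialized")
    for order in permutations(DIMS):
        result, materialized = cascaded_joins(FACT, DIMS, order)
        assert result == expected  # same answer regardless of join order
        print(" -> ".join(order), ":", materialized, "rows materialized")
```

In this toy setup every join order returns the same rows, but the cascaded version materializes far more intermediate rows when the least restrictive dimension is joined first. The single-pass version always materializes only the final rows. This is the effect the proposal aims for without relying on cost-based join ordering.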



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
