You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Yan (JIRA)" <ji...@apache.org> on 2016/09/02 08:03:21 UTC

[jira] [Created] (SPARK-17375) Star Join Optimization

Yan created SPARK-17375:
---------------------------

             Summary: Star Join Optimization
                 Key: SPARK-17375
                 URL: https://issues.apache.org/jira/browse/SPARK-17375
             Project: Spark
          Issue Type: Improvement
          Components: SQL
            Reporter: Yan


The star schema is the simplest style of data mart schema and is the approach 
often seen in BI/Decision Support systems. Star Join is a popular SQL query pattern that joins one or (a few) fact tables with a few dimension tables in star schemas. Star Join Query Optimizations aim to optimize the performance and use of resource for the star joins.

Currently the existing Spark SQL optimization works on broadcasting the usually small (after filtering and projection) dimension tables to avoid costly shuffling of fact table and the "reduce" operations based on the join keys.

This improvement proposal tries to further improve the broadcast star joins in the two areas:
1) avoid materialization of the intermediate rows that otherwise could eventually not make to the final result row set after further joined with other dimensions that are more restricting;
2) avoid the performance variations among different join orders. This could also have been largely achieved by cost analysis and heuristics and selecting a reasonably optimal join order. But we are here trying to achieve similar improvement without relying on such info.

A preliminary test against a small TPCDS 1GB data set indicates between 5%-40% improvement (with codegen disabled on both tests) vs. the multiple broadcast joins on one Query (Q27) that inner joins 4 dimension table with one fact table. The large variation (5%-40%) is due to  the different join ordering of the 4 broadcast joins. Tests using larger data sets and other TPCDS queries are yet to be performed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org