You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Sameer Agarwal (JIRA)" <ji...@apache.org> on 2018/01/08 20:42:01 UTC
[jira] [Resolved] (SPARK-17626) TPC-DS performance improvements using star-schema heuristics

     [ https://issues.apache.org/jira/browse/SPARK-17626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sameer Agarwal resolved SPARK-17626.
------------------------------------
          Resolution: Done
            Assignee: Ioana Delaney
       Fix Version/s: 2.2.0
    Target Version/s: 2.2.0  (was: 2.3.0)

> TPC-DS performance improvements using star-schema heuristics
> ------------------------------------------------------------
>
>                 Key: SPARK-17626
>                 URL: https://issues.apache.org/jira/browse/SPARK-17626
>             Project: Spark
>          Issue Type: Umbrella
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Ioana Delaney
>            Assignee: Ioana Delaney
>            Priority: Critical
>             Fix For: 2.2.0
>
>         Attachments: StarSchemaJoinReordering.pptx
>
>
> *TPC-DS performance improvements using star-schema heuristics*
> \\
> \\
> TPC-DS consists of multiple snowflake schema, which are multiple star schema with dimensions linking to dimensions. A star schema consists of a fact table referencing a number of dimension tables. Fact table holds the main data about a business. Dimension table, a usually smaller table, describes data reflecting the dimension/attribute of a business.
> \\
> \\
> As part of the benchmark performance investigation, we observed a pattern of sub-optimal execution plans of large fact tables joins. Manual rewrite of some of the queries into selective fact-dimensions joins resulted in significant performance improvement. This prompted us to develop a simple join reordering algorithm based on star schema detection. The performance testing using *1TB TPC-DS workload* shows an overall improvement of *19%*. 
> \\
> \\
> *Summary of the results:*
> {code}
> Passed                 99
> Failed                  0
> Total q time (s)   14,962
> Max time            1,467
> Min time                3
> Mean time             145
> Geomean                44
> {code}
> *Compared to baseline* (Negative = improvement; Positive = Degradation):
> {code}
> End to end improved (%)              -19%	
> Mean time improved (%)               -19%
> Geomean improved (%)                 -24%
> End to end improved (seconds)      -3,603
> Number of queries improved (>10%)      45
> Number of queries degraded (>10%)       6
> Number of queries unchanged            48
> Top 10 queries improved (%)          -20%
> {code}
> Cluster: 20-node cluster with each node having:
> * 10 2TB hard disks in a JBOD configuration, 2 Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz processors, 128 GB RAM, 10Gigabit Ethernet.
> * Total memory for the cluster: 2.5TB
> * Total storage: 400TB
> * Total CPU cores: 480
> Hadoop stack: IBM Open Platform with Apache Hadoop v4.2. Apache Spark 2.0 GA
> Database info:
> * Schema: TPCDS 
> * Scale factor: 1TB total space
> * Storage format: Parquet with Snappy compression
> Our investigation and results are included in the attached document.
> There are two parts to this improvement:
> # Join reordering using star schema detection
> # New selectivity hint to specify the selectivity of the predicates over base tables. Selectivity hint is optional and it was not used in the above TPC-DS tests. 
> \\



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org