Posted to issues@spark.apache.org by "Davies Liu (JIRA)" <ji...@apache.org> on 2016/04/22 22:05:12 UTC
[jira] [Resolved] (SPARK-13347) Reuse the shuffle for duplicated exchange
[ https://issues.apache.org/jira/browse/SPARK-13347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Davies Liu resolved SPARK-13347.
--------------------------------
Resolution: Duplicate
Assignee: Davies Liu
Fix Version/s: 2.0.0
> Reuse the shuffle for duplicated exchange
> -----------------------------------------
>
> Key: SPARK-13347
> URL: https://issues.apache.org/jira/browse/SPARK-13347
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Davies Liu
> Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> In TPCDS query 47, the same exchange is used three times; we should reuse the ShuffledRowRDD to skip the duplicated stages.
> {code}
> with v1 as (
>   select i_category, i_brand,
>          s_store_name, s_company_name,
>          d_year, d_moy,
>          sum(ss_sales_price) sum_sales,
>          avg(sum(ss_sales_price)) over
>            (partition by i_category, i_brand,
>                          s_store_name, s_company_name, d_year)
>            avg_monthly_sales,
>          rank() over
>            (partition by i_category, i_brand,
>                          s_store_name, s_company_name
>             order by d_year, d_moy) rn
>   from item, store_sales, date_dim, store
>   where ss_item_sk = i_item_sk and
>         ss_sold_date_sk = d_date_sk and
>         ss_store_sk = s_store_sk and
>         (
>           d_year = 1999 or
>           (d_year = 1999-1 and d_moy = 12) or
>           (d_year = 1999+1 and d_moy = 1)
>         )
>   group by i_category, i_brand,
>            s_store_name, s_company_name,
>            d_year, d_moy),
> v2 as (
>   select v1.i_category, v1.i_brand, v1.s_store_name, v1.s_company_name,
>          v1.d_year, v1.d_moy, v1.avg_monthly_sales, v1.sum_sales,
>          v1_lag.sum_sales psum, v1_lead.sum_sales nsum
>   from v1, v1 v1_lag, v1 v1_lead
>   where v1.i_category = v1_lag.i_category and
>         v1.i_category = v1_lead.i_category and
>         v1.i_brand = v1_lag.i_brand and
>         v1.i_brand = v1_lead.i_brand and
>         v1.s_store_name = v1_lag.s_store_name and
>         v1.s_store_name = v1_lead.s_store_name and
>         v1.s_company_name = v1_lag.s_company_name and
>         v1.s_company_name = v1_lead.s_company_name and
>         v1.rn = v1_lag.rn + 1 and
>         v1.rn = v1_lead.rn - 1)
> select * from v2
> where d_year = 1999 and
>       avg_monthly_sales > 0 and
>       case when avg_monthly_sales > 0
>            then abs(sum_sales - avg_monthly_sales) / avg_monthly_sales
>            else null end > 0.1
> order by sum_sales - avg_monthly_sales, 3
> limit 100
> {code}
> Since the SparkPlan is a tree (not a DAG), we can only do this in SparkPlan.execute() or in a final rule.
> We also need a way to tell whether two SparkPlans produce the same result (they may have different exprIds, so we should compare them after binding).
> A quick experiment showed that we could get a 2X improvement on this query.
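The idea above can be sketched as a tree rewrite: canonicalize away transient details such as expression IDs, then replace later copies of an identical exchange subtree with the first one. The following Python is a minimal illustration of that mechanism only, not Spark's actual implementation; the `Plan` structure, `canonical`, and `reuse_exchanges` names are hypothetical.

```python
# Sketch (assumed names, not Spark code): deduplicate "Exchange" subtrees
# in a physical-plan tree by comparing canonicalized forms.
from dataclasses import dataclass

@dataclass(frozen=True)
class Plan:
    op: str                # e.g. "Exchange", "Join", "Scan"
    exprs: tuple = ()      # expressions as (name, expr_id) pairs
    children: tuple = ()

def canonical(plan):
    """Strip expression IDs so two plans computing the same result compare
    equal -- the 'compare after binding' step mentioned in the ticket."""
    return (plan.op,
            tuple(name for name, _expr_id in plan.exprs),
            tuple(canonical(c) for c in plan.children))

def reuse_exchanges(plan, seen=None):
    """Bottom-up rule: keep the first occurrence of each canonical Exchange;
    replace duplicates with the already-planned instance."""
    if seen is None:
        seen = {}
    children = tuple(reuse_exchanges(c, seen) for c in plan.children)
    plan = Plan(plan.op, plan.exprs, children)
    if plan.op == "Exchange":
        key = canonical(plan)
        if key in seen:
            return seen[key]   # reuse the same shuffle output, skip the stage
        seen[key] = plan
    return plan

# The same aggregate feeding two consumers, as in query 47's self-join of v1:
scan = Plan("Scan", (("ss_sales_price", 1),))
ex_a = Plan("Exchange", (("sum_sales", 2),), (scan,))
ex_b = Plan("Exchange", (("sum_sales", 7),), (scan,))  # differs only in exprId
joined = Plan("Join", (), (ex_a, ex_b))
optimized = reuse_exchanges(joined)
# Both join children are now the very same Exchange instance:
print(optimized.children[0] is optimized.children[1])  # True
```

In real Spark the reuse happens at execution planning, where the duplicated stage's shuffle files are read again instead of being recomputed; this sketch only shows the subtree-matching step.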
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org