You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Davies Liu (JIRA)" <ji...@apache.org> on 2016/04/22 22:05:12 UTC
[jira] [Resolved] (SPARK-13347) Reuse the shuffle for duplicated exchange

     [ https://issues.apache.org/jira/browse/SPARK-13347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Davies Liu resolved SPARK-13347.
--------------------------------
       Resolution: Duplicate
         Assignee: Davies Liu
    Fix Version/s: 2.0.0

> Reuse the shuffle for duplicated exchange
> -----------------------------------------
>
>                 Key: SPARK-13347
>                 URL: https://issues.apache.org/jira/browse/SPARK-13347
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Davies Liu
>            Assignee: Davies Liu
>             Fix For: 2.0.0
>
>
> In TPCDS query 47, the same exchange is used three times, we should re-use the ShuffleRowRDD to skip the duplicated stages.
> {code}
>  with v1 as(
>  select i_category, i_brand,
>         s_store_name, s_company_name,
>         d_year, d_moy,
>         sum(ss_sales_price) sum_sales,
>         avg(sum(ss_sales_price)) over
>           (partition by i_category, i_brand,
>                      s_store_name, s_company_name, d_year)
>           avg_monthly_sales,
>         rank() over
>           (partition by i_category, i_brand,
>                      s_store_name, s_company_name
>            order by d_year, d_moy) rn
>  from item, store_sales, date_dim, store
>  where ss_item_sk = i_item_sk and
>        ss_sold_date_sk = d_date_sk and
>        ss_store_sk = s_store_sk and
>        (
>          d_year = 1999 or
>          ( d_year = 1999-1 and d_moy =12) or
>          ( d_year = 1999+1 and d_moy =1)
>        )
>  group by i_category, i_brand,
>           s_store_name, s_company_name,
>           d_year, d_moy),
>  v2 as(
>  select v1.i_category, v1.i_brand, v1.s_store_name, v1.s_company_name, v1.d_year,
>                      v1.d_moy, v1.avg_monthly_sales ,v1.sum_sales, v1_lag.sum_sales psum,
>                      v1_lead.sum_sales nsum
>  from v1, v1 v1_lag, v1 v1_lead
>  where v1.i_category = v1_lag.i_category and
>        v1.i_category = v1_lead.i_category and
>        v1.i_brand = v1_lag.i_brand and
>        v1.i_brand = v1_lead.i_brand and
>        v1.s_store_name = v1_lag.s_store_name and
>        v1.s_store_name = v1_lead.s_store_name and
>        v1.s_company_name = v1_lag.s_company_name and
>        v1.s_company_name = v1_lead.s_company_name and
>        v1.rn = v1_lag.rn + 1 and
>        v1.rn = v1_lead.rn - 1)
>  select * from v2
>  where  d_year = 1999 and
>         avg_monthly_sales > 0 and
>         case when avg_monthly_sales > 0 then abs(sum_sales - avg_monthly_sales) / avg_monthly_sales else null end > 0.1
>  order by sum_sales - avg_monthly_sales, 3
>  limit 100
> {code}
> Since the SparkPlan is just a tree (not DAG), we can only do this in SparkPlan.execute() or final rule.
> And we should also have a way to compare two SparkPlan whether they have same result or not (they may have different exprId, we should compare them after bind).
> An quick experiment showed that we could have 2X improvement on this query.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org