You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@calcite.apache.org by "Julian Hyde (JIRA)" <ji...@apache.org> on 2015/04/07 20:17:12 UTC

[jira] [Resolved] (CALCITE-366) Support Aggregate push down in bushy joins

     [ https://issues.apache.org/jira/browse/CALCITE-366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julian Hyde resolved CALCITE-366.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 1.2.0-incubating

Fixed in http://git-wip-us.apache.org/repos/asf/incubator-calcite/commit/e48c7627.

> Support Aggregate push down in bushy joins
> ------------------------------------------
>
>                 Key: CALCITE-366
>                 URL: https://issues.apache.org/jira/browse/CALCITE-366
>             Project: Calcite
>          Issue Type: Bug
>            Reporter: Mostafa Mokhtar
>            Assignee: Jesus Camacho Rodriguez
>              Labels: hive
>             Fix For: 1.2.0-incubating
>
>
> Currently OPTIQ doesn't push down aggregates for bushy joins and this can be a performance limitation in some queries.
> Ideally before we push down the aggregate we should take into the account the NDV and reduction of the join which we are trying to push the aggregate underneath.
> On Hive to measure potential gains I wrote a query which has the following business question based on the TPC-DS schema:
> Identify customers from specific gender with good credit rating that spent more than X amount of money during a specific year.
> This query has 3 main components
> 1) Hub containing store_sales & date_dim
> 2) Hub containing customer and customer_demographics
> 3) Aggregate
> And for that query I have 2 rewrites :
> 1) Rewrite that generates a bushy join plan
> 2) Rewrite that generates a bushy join plan and pushes the aggregate down
> Results show that aggregate push down in a bushy plan cuts down the query runtime by half.
> The aggregate push down reduces the intermediate data in the query from 384GB to 22GB.
> Base query finishes in 4,781 seconds with 2,253 GB of intermediate data	
> Bushy query finished in 1,374	seconds with 384GB of  intermediate data
> Bushy query + aggregate push down finished in 605 with only 22GB of  intermediate data.
> Base query
> {code}
> "select 
>     c_customer_sk, sum(ss_net_paid) as net_paid
> from
>     customer,
>     customer_demographics,
>     store_sales,
>     date_dim
> where
>     c_current_cdemo_sk = cd_demo_sk
>         and cd_credit_rating = 'Good'
>         and cd_gender = 'F'
>         and ss_customer_sk = c_customer_sk
>         and ss_customer_sk is not null
>         and d_year = 2001
>         and d_date_sk = ss_sold_date_sk
> group by c_customer_sk
> having net_paid > 0
> order by net_paid desc
> limit 100;"
> {code}
> Bushy query
> {code}
> with ss as 
> ( 
> select 
>     ss_net_paid, ss_customer_sk
> from
>     store_sales,date_dim
> where
>     d_year = 2001
>         and d_date_sk = ss_sold_date_sk
>         and ss_customer_sk is not null
> ), cc as 
> (
> select 
>     c_customer_sk
> from
>     customer,customer_demographics
> where
>     c_current_cdemo_sk = cd_demo_sk
>         and cd_credit_rating = 'Good'
>         and cd_gender = 'F'
> )
> select 
>     c_customer_sk, sum(ss_net_paid) as net_paid
> from 
>     cc,ss
> where
>     ss_customer_sk = c_customer_sk
> group by c_customer_sk
> having net_paid > 0
> order by net_paid desc
> limit 100;
> {code}
> Bushy query + aggregate push down
> {code}
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)