You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@calcite.apache.org by "Julian Hyde (JIRA)" <ji...@apache.org> on 2015/04/07 20:17:12 UTC
[jira] [Resolved] (CALCITE-366) Support Aggregate push down in
bushy joins
[ https://issues.apache.org/jira/browse/CALCITE-366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julian Hyde resolved CALCITE-366.
---------------------------------
Resolution: Fixed
Fix Version/s: 1.2.0-incubating
Fixed in http://git-wip-us.apache.org/repos/asf/incubator-calcite/commit/e48c7627.
> Support Aggregate push down in bushy joins
> ------------------------------------------
>
> Key: CALCITE-366
> URL: https://issues.apache.org/jira/browse/CALCITE-366
> Project: Calcite
> Issue Type: Bug
> Reporter: Mostafa Mokhtar
> Assignee: Jesus Camacho Rodriguez
> Labels: hive
> Fix For: 1.2.0-incubating
>
>
> Currently OPTIQ doesn't push down aggregates for bushy joins and this can be a performance limitation in some queries.
> Ideally before we push down the aggregate we should take into the account the NDV and reduction of the join which we are trying to push the aggregate underneath.
> On Hive to measure potential gains I wrote a query which has the following business question based on the TPC-DS schema:
> Identify customers from specific gender with good credit rating that spent more than X amount of money during a specific year.
> This query has 3 main components
> 1) Hub containing store_sales & date_dim
> 2) Hub containing customer and customer_demographics
> 3) Aggregate
> And for that query I have 2 rewrites :
> 1) Rewrite that generates a bushy join plan
> 2) Rewrite that generates a bushy join plan and pushes the aggregate down
> Results show that aggregate push down in a bushy plan cuts down the query runtime by half.
> The aggregate push down reduces the intermediate data in the query from 384GB to 22GB.
> Base query finishes in 4,781 seconds with 2,253 GB of intermediate data
> Bushy query finished in 1,374 seconds with 384GB of intermediate data
> Bushy query + aggregate push down finished in 605 with only 22GB of intermediate data.
> Base query
> {code}
> "select
> c_customer_sk, sum(ss_net_paid) as net_paid
> from
> customer,
> customer_demographics,
> store_sales,
> date_dim
> where
> c_current_cdemo_sk = cd_demo_sk
> and cd_credit_rating = 'Good'
> and cd_gender = 'F'
> and ss_customer_sk = c_customer_sk
> and ss_customer_sk is not null
> and d_year = 2001
> and d_date_sk = ss_sold_date_sk
> group by c_customer_sk
> having net_paid > 0
> order by net_paid desc
> limit 100;"
> {code}
> Bushy query
> {code}
> with ss as
> (
> select
> ss_net_paid, ss_customer_sk
> from
> store_sales,date_dim
> where
> d_year = 2001
> and d_date_sk = ss_sold_date_sk
> and ss_customer_sk is not null
> ), cc as
> (
> select
> c_customer_sk
> from
> customer,customer_demographics
> where
> c_current_cdemo_sk = cd_demo_sk
> and cd_credit_rating = 'Good'
> and cd_gender = 'F'
> )
> select
> c_customer_sk, sum(ss_net_paid) as net_paid
> from
> cc,ss
> where
> ss_customer_sk = c_customer_sk
> group by c_customer_sk
> having net_paid > 0
> order by net_paid desc
> limit 100;
> {code}
> Bushy query + aggregate push down
> {code}
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)