You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@calcite.apache.org by "Mostafa Mokhtar (JIRA)" <ji...@apache.org> on 2014/08/08 23:56:12 UTC
[jira] [Created] (OPTIQ-366) Support Aggregate push down in bushy
joins
Mostafa Mokhtar created OPTIQ-366:
-------------------------------------
Summary: Support Aggregate push down in bushy joins
Key: OPTIQ-366
URL: https://issues.apache.org/jira/browse/OPTIQ-366
Project: Optiq
Issue Type: Bug
Reporter: Mostafa Mokhtar
Assignee: Julian Hyde
Currently OPTIQ doesn't push down aggregates for bushy joins and this can be a performance limitation in some queries.
Ideally before we push down the aggregate we should take into the account the NDV and reduction of the join which we are trying to push the aggregate underneath.
On Hive to measure potential gains I wrote a query which has the following business question based on the TPC-DS schema:
Identify customers from specific gender with good credit rating that spent more than X amount of money during a specific year.
This query has 3 main components
1) Hub containing store_sales & date_dim
2) Hub containing customer and customer_demographics
3) Aggregate
And for that query I have 2 rewrites :
1) Rewrite that generates a bushy join plan
2) Rewrite that generates a bushy join plan and pushes the aggregate down
Results show that aggregate push down in a bushy plan cuts down the query runtime by half.
The aggregate push down reduces the intermediate data in the query from 384GB to 22GB.
Base query finishes in 4,781 seconds with 2,253 GB of intermediate data
Bushy query finished in 1,374 seconds with 384GB of intermediate data
Bushy query + aggregate push down finished in 605 with only 22GB of intermediate data.
Base query
{code}
"select
c_customer_sk, sum(ss_net_paid) as net_paid
from
customer,
customer_demographics,
store_sales,
date_dim
where
c_current_cdemo_sk = cd_demo_sk
and cd_credit_rating = 'Good'
and cd_gender = 'F'
and ss_customer_sk = c_customer_sk
and ss_customer_sk is not null
and d_year = 2001
and d_date_sk = ss_sold_date_sk
group by c_customer_sk
having net_paid > 0
order by net_paid desc
limit 100;"
{code}
Bushy query
{code}
with ss as
(
select
ss_net_paid, ss_customer_sk
from
store_sales,date_dim
where
d_year = 2001
and d_date_sk = ss_sold_date_sk
and ss_customer_sk is not null
), cc as
(
select
c_customer_sk
from
customer,customer_demographics
where
c_current_cdemo_sk = cd_demo_sk
and cd_credit_rating = 'Good'
and cd_gender = 'F'
)
select
c_customer_sk, sum(ss_net_paid) as net_paid
from
cc,ss
where
ss_customer_sk = c_customer_sk
group by c_customer_sk
having net_paid > 0
order by net_paid desc
limit 100;
{code}
Bushy query + aggregate push down
{code}
{code}
--
This message was sent by Atlassian JIRA
(v6.2#6252)