You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@calcite.apache.org by "Mostafa Mokhtar (JIRA)" <ji...@apache.org> on 2014/08/08 23:56:12 UTC

[jira] [Created] (OPTIQ-366) Support Aggregate push down in bushy joins

Mostafa Mokhtar created OPTIQ-366:
-------------------------------------

             Summary: Support Aggregate push down in bushy joins
                 Key: OPTIQ-366
                 URL: https://issues.apache.org/jira/browse/OPTIQ-366
             Project: Optiq
          Issue Type: Bug
            Reporter: Mostafa Mokhtar
            Assignee: Julian Hyde


Currently OPTIQ doesn't push down aggregates for bushy joins and this can be a performance limitation in some queries.

Ideally before we push down the aggregate we should take into the account the NDV and reduction of the join which we are trying to push the aggregate underneath.

On Hive to measure potential gains I wrote a query which has the following business question based on the TPC-DS schema:
Identify customers from specific gender with good credit rating that spent more than X amount of money during a specific year.

This query has 3 main components
1) Hub containing store_sales & date_dim
2) Hub containing customer and customer_demographics
3) Aggregate

And for that query I have 2 rewrites :
1) Rewrite that generates a bushy join plan
2) Rewrite that generates a bushy join plan and pushes the aggregate down

Results show that aggregate push down in a bushy plan cuts down the query runtime by half.
The aggregate push down reduces the intermediate data in the query from 384GB to 22GB.

Base query finishes in 4,781 seconds with 2,253 GB of intermediate data	
Bushy query finished in 1,374	seconds with 384GB of  intermediate data
Bushy query + aggregate push down finished in 605 with only 22GB of  intermediate data.

Base query
{code}
"select 
    c_customer_sk, sum(ss_net_paid) as net_paid
from
    customer,
    customer_demographics,
    store_sales,
    date_dim
where
    c_current_cdemo_sk = cd_demo_sk
        and cd_credit_rating = 'Good'
        and cd_gender = 'F'
        and ss_customer_sk = c_customer_sk
        and ss_customer_sk is not null
        and d_year = 2001
        and d_date_sk = ss_sold_date_sk
group by c_customer_sk
having net_paid > 0
order by net_paid desc
limit 100;"
{code}

Bushy query
{code}
with ss as 
( 
select 
    ss_net_paid, ss_customer_sk
from
    store_sales,date_dim
where
    d_year = 2001
        and d_date_sk = ss_sold_date_sk
        and ss_customer_sk is not null
), cc as 
(
select 
    c_customer_sk
from
    customer,customer_demographics
where
    c_current_cdemo_sk = cd_demo_sk
        and cd_credit_rating = 'Good'
        and cd_gender = 'F'
)
select 
    c_customer_sk, sum(ss_net_paid) as net_paid
from 
    cc,ss
where
    ss_customer_sk = c_customer_sk
group by c_customer_sk
having net_paid > 0
order by net_paid desc
limit 100;
{code}

Bushy query + aggregate push down
{code}
{code}




--
This message was sent by Atlassian JIRA
(v6.2#6252)