Posted to issues@calcite.apache.org by "Vladimir Sitnikov (JIRA)" <ji...@apache.org> on 2014/11/18 17:48:34 UTC

[jira] [Commented] (CALCITE-468) Introduce semi join reduction optimization in Calcite

    [ https://issues.apache.org/jira/browse/CALCITE-468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14216404#comment-14216404 ] 

Vladimir Sitnikov commented on CALCITE-468:
-------------------------------------------

I wonder whether we could use a Bloom filter to trim the store_returns/catalog_sales/catalog_returns tables.
Unlike an exact key list, a Bloom filter has a fixed size, so this might work for large inputs as well.
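
To make the suggestion concrete, here is a minimal sketch, not Calcite or Hive code. It assumes the filter is built from the i_item_sk values that survive the item predicates and is then used to drop store_returns rows before the join. The BloomFilter class, the trim helper, the bit and hash counts, and the toy data are all hypothetical and chosen only for illustration.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, a small chance of false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions from salted BLAKE2b digests of the key.
        data = repr(key).encode()
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(data, digest_size=8, salt=bytes([seed])).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def might_contain(self, key):
        return all(self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(key))


def trim(rows, key_of, bloom):
    """Drop rows whose join key is definitely absent from the filter."""
    return [row for row in rows if bloom.might_contain(key_of(row))]


# Hypothetical toy data: item keys that passed the i_color / i_current_price filter.
item_keys = [3, 7, 11]
store_returns = [{"sr_item_sk": k} for k in range(1000)]

bloom = BloomFilter()
for k in item_keys:
    bloom.add(k)

trimmed = trim(store_returns, lambda row: row["sr_item_sk"], bloom)
kept_keys = {row["sr_item_sk"] for row in trimmed}
```

Because a Bloom filter can report false positives, the trimmed input may still contain a few non-matching rows. The real join has to run afterwards, so the filter only reduces its input and never changes the result.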


> Introduce semi join reduction optimization in Calcite 
> ------------------------------------------------------
>
>                 Key: CALCITE-468
>                 URL: https://issues.apache.org/jira/browse/CALCITE-468
>             Project: Calcite
>          Issue Type: Bug
>            Reporter: Mostafa Mokhtar
>            Assignee: Laljo John Pullokkaran
>              Labels: hive
>         Attachments: BaselineTree.png, SemiJoinReductionGains.png, SemiJoinReductionTreee.png
>
>
> The basic idea is to apply join predicates early in a plan in order to reduce the size of intermediate query results and, thus, reduce the cost of other operations. In other words, the idea is to apply the same join predicates two or more times in a query plan
> in order to reduce the communication costs of a distributed system. Semi-join reducers are only effective if the (redundant) semi-joins are cheap and result in a significant reduction in the size of intermediate
> query results.
> I propose to extend the query optimizer to integrate semi-join reduction and
> join ordering, etc. into a single query optimization step.
> Several TPC-DS queries, such as 24, 64 and 80, run very slowly due to the lack of semi-join reduction optimization in Calcite.
> A rewrite of Q64 that simulates semi-join reduction produced a 4x gain in total time.
> {code}
> Query                 Total time   CPU       Intermediate rows (Million)
> Baseline              1,377        356,900   23,940
> Semi Join Reduction   343          47,253    23
> {code}
> Q64 subset 
> {code}
> select 
>     count(*)
> FROM
>     store_sales
>         JOIN
>     item ON store_sales.ss_item_sk = item.i_item_sk
>         JOIN
>     store_returns ON store_sales.ss_item_sk = store_returns.sr_item_sk
>         JOIN
>     (select 
>         cs_item_sk
>     from
>         catalog_sales
>     JOIN catalog_returns ON catalog_sales.cs_item_sk = catalog_returns.cr_item_sk
>     group by cs_item_sk
>     having
>         sum(cs_ext_list_price) > 2 * sum(cr_refunded_cash + cr_reversed_charge + cr_store_credit)) cs_ui
>         ON store_sales.ss_item_sk = cs_ui.cs_item_sk
> WHERE
>     i_color in ('maroon' , 'burnished',
>         'dim',
>         'steel',
>         'navajo',
>         'chocolate')
>         and i_current_price between 35 and 35 + 10
>         and i_current_price between 35 + 1 and 35 + 15
> {code}
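
The effect described in the issue can be reproduced on toy data. The sketch below is an illustration only and is not the rewrite that was actually measured. It assumes a simplified three-table version of the query, and the helpers semi_join and hash_join and all row values are hypothetical. It compares the intermediate row count when the fact tables are joined first against the count when both are first reduced by the selected item keys.

```python
def semi_join(rows, key, keys):
    """Keep only rows whose `key` value appears in `keys` (left semi-join)."""
    return [row for row in rows if row[key] in keys]


def hash_join(left, left_key, right, right_key):
    """Plain in-memory equi-join: build a hash index on `right`, probe with `left`."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[left_key], [])]


# Hypothetical toy tables: item keys 0..3, each appearing twice in both fact tables.
item = [
    {"i_item_sk": 1, "i_color": "maroon"},
    {"i_item_sk": 2, "i_color": "red"},
]
store_sales = [{"ss_item_sk": k % 4} for k in range(8)]
store_returns = [{"sr_item_sk": k % 4} for k in range(8)]

# Item keys that survive the WHERE clause (only the color predicate is modeled here).
selected = {row["i_item_sk"] for row in item if row["i_color"] == "maroon"}

# Baseline: join the fact tables first, apply the item restriction last.
# i_item_sk is unique, so filtering by `selected` stands in for the join with item.
baseline_intermediate = hash_join(store_sales, "ss_item_sk", store_returns, "sr_item_sk")
baseline = semi_join(baseline_intermediate, "ss_item_sk", selected)

# Semi-join reduction: apply the same predicate to both inputs before joining.
reduced_sales = semi_join(store_sales, "ss_item_sk", selected)
reduced_returns = semi_join(store_returns, "sr_item_sk", selected)
reduced = hash_join(reduced_sales, "ss_item_sk", reduced_returns, "sr_item_sk")

print(len(baseline_intermediate), len(reduced_sales), len(reduced))
```

Both plans return the same rows. The baseline materializes the full fact-to-fact join before discarding most of it, while the reduced plan only ever joins rows that can contribute to the result.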


