You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "weijie.tong (JIRA)" <ji...@apache.org> on 2018/11/12 07:52:00 UTC

[jira] [Closed] (DRILL-6727) JPPD does not eliminate rows using the bloom filter if a HashJoin is involved

     [ https://issues.apache.org/jira/browse/DRILL-6727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

weijie.tong closed DRILL-6727.
------------------------------
    Resolution: Fixed

has fixed at DRILL-6792.

> JPPD does not eliminate rows using the bloom filter if a HashJoin is involved
> -----------------------------------------------------------------------------
>
>                 Key: DRILL-6727
>                 URL: https://issues.apache.org/jira/browse/DRILL-6727
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Flow
>    Affects Versions: 1.15.0
>            Reporter: Kunal Khatua
>            Assignee: weijie.tong
>            Priority: Critical
>         Attachments: bcastJoin-JPPD_2477fb99-36cb-9bc2-b7fb-c81a52b256d2.json, bcastJoin-default_2477fa68-a31e-3b97-5469-373845c2b763.json, hashJoin-JPPD_2477f6f7-14e0-ca23-d9f7-6b0273c20964.json, hashJoin-default_2477f5e8-fff2-fc83-d251-d8be8f92820b.json
>
>
> When testing a simple join between 2 tables, it appears that the Bloom-filter based predicate pushdown will work only for broadcast joins, but not for hash-based joins.
> Since the purpose of the filter is to reduce the number of records being hashed across the fragments, the runtime does not improve.
> Join Query (TPCH dataset):
> {code:sql}
> select
> l.l_orderkey
> , sum(l.l_extendedprice * (1 - l.l_discount)) as revenue
> , o.o_orderdate
> , o.o_shippriority
> from
> orders o
> , lineitem l
> where
> l.l_orderkey = o.o_orderkey
> and o.o_orderdate = date '1994-08-26'
> and MOD(o.o_custkey,10) = 1
> group by
> l.l_orderkey
> , o.o_orderdate
> , o.o_shippriority
> order by
> revenue desc
> , o.o_orderdate limit 10;
> {code}
> This generates an output of about 6K rows from the build side, with the expectation of 10M rows being joined from the probe side.
> Following are the results of the following query:
> || Join Mode || Profile || Runtime || Status ||
> |BCastJoin w/o JPPD |  [^bcastJoin-default_2477fa68-a31e-3b97-5469-373845c2b763.json]  | 3.148sec | As expected. 600M rows are scanned and probed against the locally available hash table. |
> |BCastJoin w/ JPPD |  [^bcastJoin-JPPD_2477fb99-36cb-9bc2-b7fb-c81a52b256d2.json]  | 3.570sec | 04-xx-06 shows a reduction in rows. 600M rows are scanned, but only 10M rows are probed against the locally available hash table. |
> |
> |HashJoin w/o JPPD |  [^hashJoin-default_2477f5e8-fff2-fc83-d251-d8be8f92820b.json]  | 5.861sec | As expected. 600M rows are scanned and probed against the hash table. |
> |HashJoin w/ JPPD |  [^hashJoin-JPPD_2477f6f7-14e0-ca23-d9f7-6b0273c20964.json]  | 8.376sec | 03-xx-07 is not seeing a reduction in rows. All 600M rows are scanned and probed against the hash table. |
> There are a few possibilities of why the RuntimeFilter does not eliminate any rows when a HashJoin is involved.
> 1. The RuntimeFilter operator does not have a bloom-filter
> 2. The RuntimeFilter receives the bloom-filter after the scan completes, because the foreman has not finished building and distributing the global bloom-filter
> 3. The RuntimeFilter receives the bloom-filter during the scan, but does not apply it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)