You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2020/06/05 23:02:00 UTC
[jira] [Commented] (IMPALA-3741) Push bloom filters to Kudu scanners

    [ https://issues.apache.org/jira/browse/IMPALA-3741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127141#comment-17127141 ] 

ASF subversion and git services commented on IMPALA-3741:
---------------------------------------------------------

Commit c62a6808fc379f812a2eaa2363d3c3c18a1836b0 in impala's branch refs/heads/master from wzhou-code
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=c62a680 ]

IMPALA-3741 [part 2]: Push runtime bloom filter to Kudu

Defined the BloomFilter class as the wrapper of kudu::BlockBloomFilter.
impala::BloomFilter build runtime bloom filter in kudu::BlockBloomFilter
APIs with FastHash as default hash algorithm.
Removed the duplicated functions from impala::BloomFillter class.
Pushed down bloom filter to Kudu through Kudu clinet API.

Added a new query option ENABLED_RUNTIME_FILTER_TYPES to set enabled
runtime filter types, which only affect Kudu scan node now. By default,
bloom filter is not enabled, only min-max filter will be enabled for
Kudu. With this option, user could enable bloom filter, min-max filter,
or both bloom and min-max runtime filters.

Added new test cases in PlannerTest and end-end runtime_filters test
for pushing down bloom filter to Kudu.
Added test cases to compare the number of rows returned from Kudu
scan when appling different types of runtime filter on same queries.
Updated bloom-filter-benchmark due to the bloom-filter implementation
change.

Bump Kudu version to d652cab17.

Testing:
 - Passed all exhaustive tests.

Performance benchmark:
 - Ran single_node_perf_run.py on TPC-H with scale as 30 for parquet
   and Kudu. Verified that new hash function and bloom-filter
   implementation don't cause regressions for HDFS bloom filters.
   For Kudu, there is one regression for query TPCH-Q9 and there
   are improvement for about 8 queris when appling both bloom and
   min-max filters. The bloom filter reduce the number of rows
   returned from Kudu scan, hence reduce the cost for aggregation
   and hash join. But bloom filter evaluation add extra cost for
   Kudu scan, which offset the gain on aggregation and join.
   Kudu scan need to be optimized for bloom filter in following
   tasks.
 - Ran bloom-filter microbenchmarks and verified that there is no
   regression for Insert/Find/Union functions with or without AVX2
   due to bloom-filter implementation changes. There is small
   performance degradation for Init function, but this function is
   not in hot path.

Change-Id: I9100076f68ea299ddb6ec8bc027cac7a47f5d754
Reviewed-on: http://gerrit.cloudera.org:8080/15683
Reviewed-by: Thomas Tauber-Marshall <tm...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Push bloom filters to Kudu scanners
> -----------------------------------
>
>                 Key: IMPALA-3741
>                 URL: https://issues.apache.org/jira/browse/IMPALA-3741
>             Project: IMPALA
>          Issue Type: Task
>          Components: Backend
>    Affects Versions: Kudu_Impala
>            Reporter: Matthew Jacobs
>            Assignee: Wenzhe Zhou
>            Priority: Major
>              Labels: kudu, performance
>
> Impala relies on bloom filters to reduce number of rows from coming out of the scan node for selective joins. 
> Queries get up to 20x speedup, not having bloom filter support in Kudu will create a big performance gap between Parquet and Kudu.
> https://github.com/cloudera/Impala/blob/cdh5-trunk/be/src/util/bloom-filter.h



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org