You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues-all@impala.apache.org by "Tim Armstrong (Jira)" <ji...@apache.org> on 2020/08/26 00:03:00 UTC

[jira] [Commented] (IMPALA-5633) Bloom filters underestimate false positive probability for high NDV

    [ https://issues.apache.org/jira/browse/IMPALA-5633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17184795#comment-17184795 ] 

Tim Armstrong commented on IMPALA-5633:
---------------------------------------

[~jbapple] I know you're looking at some related things in Kudu now. Do you have any idea why max_filter_error_rate might be set to 0.75 - that seems awfully high.

On some queries - TPC-DS Q84 for example, it seems to be too aggressive and is choosing filters that are too small and end up being less effective than desired. E.g. a 4kb filter for household_demographics.hd_demo_sk, which has 7200 distinct values.

> Bloom filters underestimate false positive probability for high NDV
> -------------------------------------------------------------------
>
>                 Key: IMPALA-5633
>                 URL: https://issues.apache.org/jira/browse/IMPALA-5633
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Perf Investigation
>            Reporter: Jim Apple
>            Assignee: Jim Apple
>            Priority: Minor
>
> {{bloom-filter.cc}} says:
> {noformat}
> fpp = (1 - exp(-BUCKET_WORDS * ndv/space))^BUCKET_WORDS
> where space is in bits.
> {noformat}
> This is true only discounting the false positive rate induced by hash collisions. Using {{w}}}-bit hashes, with {{n}} distinct values gives a false positive rate of
> {noformat}
> n / exp2(w) + (1.0 - n / exp2(w)) * ((1 - exp(-BUCKET_WORDS * ndv/space))^BUCKET_WORDS)
> {noformat}
> This starts to become a factor as {{n}} approaches {{1 << w}}. It also suggests using bitmaps rather than Bloom filters for large {{n}}, since the false positive rate for a bitmap is
> {noformat}
> n / exp2(w) + (1.0 - n / exp2(w)) * (1 - exp(-ndv/space))
> {noformat}
> This is lower than the current BF false positive rate for high {{n}} and low relative space (aka, high false positive probability).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org