You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@impala.apache.org by "Jim Apple (JIRA)" <ji...@apache.org> on 2017/07/09 05:03:01 UTC

[jira] [Created] (IMPALA-5633) Bloom filters underestimate false positive probability for high NDV

Jim Apple created IMPALA-5633:
---------------------------------

             Summary: Bloom filters underestimate false positive probability for high NDV
                 Key: IMPALA-5633
                 URL: https://issues.apache.org/jira/browse/IMPALA-5633
             Project: IMPALA
          Issue Type: Bug
          Components: Perf Investigation
            Reporter: Jim Apple
            Assignee: Jim Apple
            Priority: Minor


{{bloom-filter.cc}} says:

{noformat}
fpp = (1 - exp(-BUCKET_WORDS * ndv/space))^BUCKET_WORDS

where space is in bits.
{noformat}

This is true only discounting the false positive rate induced by hash collisions. Using {{w}}}-bit hashes, with {{n}} distinct values gives a false positive rate of

{noformat}
n / exp2(w) + (1.0 - n / exp2(w)) * ((1 - exp(-BUCKET_WORDS * ndv/space))^BUCKET_WORDS)
{noformat}

This starts to become a factor as {{n}} approaches {{1 << w}}. It also suggests using bitmaps rather than Bloom filters for large {{n}}, since the false positive rate for a bitmap is

{noformat}
n / exp2(w) + (1.0 - n / exp2(w)) * (1 - exp(-ndv/space))
{noformat}

This is lower than the current BF false positive rate for high {{n}} and low relative space (aka, high false positive probability).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)