You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@impala.apache.org by "Jim Apple (JIRA)" <ji...@apache.org> on 2017/07/09 05:03:01 UTC
[jira] [Created] (IMPALA-5633) Bloom filters underestimate false
positive probability for high NDV
Jim Apple created IMPALA-5633:
---------------------------------
Summary: Bloom filters underestimate false positive probability for high NDV
Key: IMPALA-5633
URL: https://issues.apache.org/jira/browse/IMPALA-5633
Project: IMPALA
Issue Type: Bug
Components: Perf Investigation
Reporter: Jim Apple
Assignee: Jim Apple
Priority: Minor
{{bloom-filter.cc}} says:
{noformat}
fpp = (1 - exp(-BUCKET_WORDS * ndv/space))^BUCKET_WORDS
where space is in bits.
{noformat}
This is true only discounting the false positive rate induced by hash collisions. Using {{w}}}-bit hashes, with {{n}} distinct values gives a false positive rate of
{noformat}
n / exp2(w) + (1.0 - n / exp2(w)) * ((1 - exp(-BUCKET_WORDS * ndv/space))^BUCKET_WORDS)
{noformat}
This starts to become a factor as {{n}} approaches {{1 << w}}. It also suggests using bitmaps rather than Bloom filters for large {{n}}, since the false positive rate for a bitmap is
{noformat}
n / exp2(w) + (1.0 - n / exp2(w)) * (1 - exp(-ndv/space))
{noformat}
This is lower than the current BF false positive rate for high {{n}} and low relative space (aka, high false positive probability).
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)