You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues-all@impala.apache.org by "Jim Apple (Jira)" <ji...@apache.org> on 2020/08/26 03:39:00 UTC

[jira] [Commented] (IMPALA-5633) Bloom filters underestimate false positive probability

    [ https://issues.apache.org/jira/browse/IMPALA-5633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17184908#comment-17184908 ] 

Jim Apple commented on IMPALA-5633:
-----------------------------------

I think it's just a mistake. IIRC, [~mmokhtar] and I had a brief discussion about it a few years ago and he thought it was an error as well.

At the time, I wrote a patch to change it to 1% and ran some perf tests, which were inconclusive, likely due to me not picking judiciously which tests would demonstrate a difference, so the patch went in my backlog and has yet to escape.

I'd be delighted to +2 a patch that changed it and demonstrated the perf impact. It might be good to combine it with a patch to allow non-power-of-two sizes, which can be done without a modulo via

{code:c}
uint64_t libfilter_block_index(uint64_t hash, uint64_t num_buckets) {
  return (((unsigned __int128)hash) * ((unsigned __int128)num_buckets)) >> 64;
}
{code}


> Bloom filters underestimate false positive probability
> ------------------------------------------------------
>
>                 Key: IMPALA-5633
>                 URL: https://issues.apache.org/jira/browse/IMPALA-5633
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Perf Investigation
>            Reporter: Jim Apple
>            Assignee: Jim Apple
>            Priority: Minor
>
> Block Bloom filters have a higher false positive rate than standard Bloom filter, due to the uneven distribution of keys between buckets. We should change the code to match the theory, using an approximation from the paper that introduced block Bloom filters, "Cache-, Hash- and Space-Efficient Bloom Filters" by Putze et al.
> For a false positive probability of 1%, this would increase filter size by about 10% and a decrease filter false positive probability by 50%. However, this is obscured by the coarseness of the fact that filters are constrained to have a size in bytes that is a power of two. Loosening that restriction is potential future work.
> See https://github.com/apache/kudu/commit/d1190c2b06a6eef91b21fd4a0b5fb76534b4e9f9



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org