You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by Prashant Malik <pm...@gmail.com> on 2009/04/12 04:53:22 UTC

Re: [jira] Commented: (CASSANDRA-68) Bloom filters have much higher false-positive rate than expected

The results are a bit counter intuitive here I would have expected it to be
faster with the same FP rate but   I am not sure why it is slower if you are
just using a couple of hash functions and using double hashing.

We had done these experiments a while back and found results matching with
the theoretical tables.

I am sorry I haven't looked at the test code but have you tried it with
large strings as keys ? e.g 128 byte keys , also with Longs.

But this is an interesting exercise for sure :).

Thanks
Prashant

On Sat, Apr 11, 2009 at 4:31 PM, Sandeep Tata (JIRA) <ji...@apache.org>wrote:

>
>    [
> https://issues.apache.org/jira/browse/CASSANDRA-68?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12698145#action_12698145]
>
> Sandeep Tata commented on CASSANDRA-68:
> ---------------------------------------
>
> I ran some more tests, here's what I found:
>
> Old code:
> Test                                             MAX_FP Actual FP
> FalsePositivesInt/Random     0.1            0.142
> FalsePositivesInt/Random     0.01         (pass)
> FalsePositivesInt/Random     0.001       (pass)
> Words                                         0.1            0.15
> Words                                         0.01          (pass)
> Words                                         0.001        0.0013
>
> New code:
>
> FalsePositivesInt/Random     0.1           (pass)
> FalsePositivesInt/Random     0.01         (pass)
> FalsePositivesInt/Random     0.001       (pass)
> Words                                         0.1            (pass)
> Words                                         0.01          (pass)
> Words                                         0.001        (pass)
>
> The old bloomfilter certainly reports up to 50% more than expected false
> positive rates for some cases. The new bloomfilter is more predictable, it
> always passes.
>
> On my machine, some quick-n-crude tests show that the new bloom-filter is
> about 4x slower. (I tested at FP rate = 0.01) . When you take into account
> the fact that the penalty for a false positive at least 3 orders of
> magnitude more expensive than the actual hash calculation (an FP usually
> means you'll end up hitting disk unnecessarily), it makes sense to use it
> even when you set the FP rate to 0.001. It is even more useful at higher
> rates.
>
> > Bloom filters have much higher false-positive rate than expected
> > ----------------------------------------------------------------
> >
> >                 Key: CASSANDRA-68
> >                 URL: https://issues.apache.org/jira/browse/CASSANDRA-68
> >             Project: Cassandra
> >          Issue Type: Bug
> >            Reporter: Jonathan Ellis
> >            Assignee: Jonathan Ellis
> >             Fix For: 0.3
> >
> >         Attachments:
> 0001-r-m-unused-code-including-entire-CountingBloomFilte.patch,
> 0002-replace-JenkinsHash-w-MurmurHash.-its-hash-distrib.patch,
> 0003-rename-BloomFilter.fill-add.patch,
> 0004-rewrite-bloom-filters-to-use-murmur-hash-and-combina.patch,
> 0004-v3.patch, 0004a-tests.patch, 0004b-code.patch,
> 0005-switch-back-to-old-hash-generation-code-to-demonstra.patch,
> fp_test_for_old_code.patch, fp_test_for_old_code_v2.patch, words.gz
> >
> >
> > Gory details:
> http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>

Re: [jira] Commented: (CASSANDRA-68) Bloom filters have much higher false-positive rate than expected

Posted by Jonathan Ellis <jb...@gmail.com>.
On Sat, Apr 11, 2009 at 9:53 PM, Prashant Malik <pm...@gmail.com> wrote:
> The results are a bit counter intuitive here I would have expected it to be
> faster with the same FP rate but   I am not sure why it is slower if you are
> just using a couple of hash functions and using double hashing.

Murmur is a higher-quality hash and takes more operations to achieve
its better key distribution.  But since the new implementation always
uses two calls to Murmur no matter how many hashes are needed it is
virtually constant time.

> I am sorry I haven't looked at the test code but have you tried it with
> large strings as keys ? e.g 128 byte keys , also with Longs.

The random strings generated are 128 bytes.

-Jonathan

Re: [jira] Commented: (CASSANDRA-68) Bloom filters have much higher false-positive rate than expected

Posted by Jonathan Ellis <jb...@gmail.com>.
On Sat, Apr 11, 2009 at 9:53 PM, Prashant Malik <pm...@gmail.com> wrote:
> The results are a bit counter intuitive here I would have expected it to be
> faster with the same FP rate but   I am not sure why it is slower if you are
> just using a couple of hash functions and using double hashing.

Murmur is a higher-quality hash and takes more operations to achieve
its better key distribution.  But since the new implementation always
uses two calls to Murmur no matter how many hashes are needed it is
virtually constant time.

> I am sorry I haven't looked at the test code but have you tried it with
> large strings as keys ? e.g 128 byte keys , also with Longs.

The random strings generated are 128 bytes.

-Jonathan