You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "Ryan King (JIRA)" <ji...@apache.org> on 2010/10/20 23:12:27 UTC

[jira] Commented: (CASSANDRA-1555) Considerations for larger bloom filters

    [ https://issues.apache.org/jira/browse/CASSANDRA-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923172#action_12923172 ] 

Ryan King commented on CASSANDRA-1555:
--------------------------------------

We have a patch to deal with having larger bitsets. I'm in the process of making it into a patchset but wanted to make some comments and get feedback first.

Its not based on the lucent bitset or EWAH for a couple reasons (which might not be totally legit):

EWAH looks like it might be useful for reducing on-disk storage for BFs, but it doesn't appear to be useful for in-memory usage because the cost of checking a bit is linear, not constant. That seems like a serious performance degradation that's not even worth considering. And reducing the storage footprint for BFs seems like the least of our concerns at this point (not that we should never do it).

Lucene's OpenBitSet is implemented as an array of longs and looks genuinely useful and would allow us to have BFs of up to 64\*2\*\*32-1 bits (274,877,906,943). Our implementation [http://github.com/kakugawa/cassandra/compare/fbb7c26acde572abf625...cassandra-0.6-counts-bf#diff-5] uses an array of BitSets and can therefore handle up to 2\*\*30*2\*\*31 bits (2.30584301 x 10\*\*18). That's quite a bit more bits, but I'm not sure if its worth it if the lucene approach can cover us.

> Considerations for larger bloom filters
> ---------------------------------------
>
>                 Key: CASSANDRA-1555
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1555
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Stu Hood
>             Fix For: 0.8
>
>
> To (optimally) support SSTables larger than 143 million keys, we need to support bloom filters larger than 2 GB, which java.util.BitSet can't handle directly.
> A few options:
> * Switch to a BitSet class which supports 2^63 bits (Lucene's OpenBitSet)
> * Partition the java.util.BitSet behind our current BloomFilter
> ** Straightforward bit partitioning: bit N is in bitset N // 2^31
> ** Separate equally sized complete bloom filters for member ranges, which can be used independently or OR'd together under memory pressure.
> All of these options require new approaches to serialization.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.