Posted to user@cassandra.apache.org by Weijun Li <we...@gmail.com> on 2010/05/04 22:47:03 UTC

BloomFilter is taking too much memory

Hello,

We stored about 47 million keys on one Cassandra node, and this is what a
memory dump shows for one of the SSTableReader instances:

    SSTableReader: 386MB. Of this 386MB, IndexSummary takes about 231MB
and BloomFilter takes 155MB, with a huge embedded array, long[19.4mil].

It seems that BloomFilter is taking too much memory. If this is the case,
BloomFilter seems to be redundant given the size of the index.

So is this the intended behavior? Is there a formula to estimate how much
memory a BloomFilter needs?

Thanks,

-Weijun

Re: BloomFilter is taking too much memory

Posted by Weijun Li <we...@gmail.com>.
More insight into this sstable: the ArrayList in IndexSummary has 644195
entries, so the total number of entries in this sstable is 644195*128=~82mil.
The problem is that the total number of bits in its BloomFilter
(long[19400551] inside BitSet) is 19400551*64=1241635264, which means each
key takes ~15 bits. This seems to be in line with the number of buckets set
in the sstable writer. I'm making a change to make this bucket count
configurable so as to have more control over memory usage.
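
As a rough check of the arithmetic above, here is a minimal Python sketch. It
only reuses the figures quoted in this message; the function and variable
names are illustrative and are not taken from the Cassandra source.

```python
# Figures quoted in the message above (heap-dump observations).
SUMMARY_ENTRIES = 644_195    # entries in the IndexSummary ArrayList
INDEX_INTERVAL = 128         # one summary entry per 128 keys
BLOOM_LONGS = 19_400_551     # length of the long[] inside the BitSet


def bloom_filter_stats(summary_entries, index_interval, bloom_longs):
    """Estimate key count, filter size, and bits per key from heap-dump figures."""
    estimated_keys = summary_entries * index_interval
    filter_bits = bloom_longs * 64   # each Java long holds 64 bits
    filter_bytes = bloom_longs * 8   # each Java long occupies 8 bytes
    return {
        "estimated_keys": estimated_keys,
        "filter_bits": filter_bits,
        "filter_bytes": filter_bytes,
        "bits_per_key": filter_bits / estimated_keys,
    }


stats = bloom_filter_stats(SUMMARY_ENTRIES, INDEX_INTERVAL, BLOOM_LONGS)
for name, value in stats.items():
    print(f"{name}: {value}")
```

The estimated key count assumes every summary entry covers a full interval of
128 keys, so it is an upper bound rather than an exact count.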

-Weijun

On Tue, May 4, 2010 at 1:50 PM, Jonathan Ellis <jb...@gmail.com> wrote:

> BloomFilter is not redundant, because it stores information about
> _all_ keys while the index summary stores only every 128th key.

Re: BloomFilter is taking too much memory

Posted by Jonathan Ellis <jb...@gmail.com>.
BloomFilter is not redundant, because it stores information about
_all_ keys while the index summary stores only every 128th key.
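
On the question of a sizing formula: the sketch below uses the standard
textbook Bloom filter relations, not anything confirmed in this thread about
Cassandra's own bucket calculation. The function names are hypothetical. It
shows how bits per key, hash count, and false-positive rate relate, and what
15 bits per key costs for roughly 82 million keys.

```python
import math


def bits_per_key_for(fp_rate):
    """Bits per key needed for a target false-positive rate (optimal hash count)."""
    return -math.log(fp_rate) / (math.log(2) ** 2)


def optimal_hash_count(bits_per_key):
    """Hash count that minimises false positives for a given bits-per-key budget."""
    return max(1, round(bits_per_key * math.log(2)))


def false_positive_rate(bits_per_key, hash_count):
    """Approximate false-positive rate of a filter filled to its design capacity."""
    return (1 - math.exp(-hash_count / bits_per_key)) ** hash_count


def filter_size_bytes(num_keys, bits_per_key):
    """Memory needed for the bit array alone, ignoring object overhead."""
    return math.ceil(num_keys * bits_per_key / 8)


# Figures from this thread: ~82 million keys at ~15 bits per key.
keys = 644_195 * 128
k = optimal_hash_count(15)
print("hash functions:", k)
print("false-positive rate:", false_positive_rate(15, k))
print("filter bytes:", filter_size_bytes(keys, 15))
print("bits/key for a 1% rate:", bits_per_key_for(0.01))
```

Under these assumptions, 15 bits per key corresponds to a false-positive rate
well under 1%, so lowering the bits per key trades memory for extra disk
reads on keys that are not in the sstable.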

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com