
Posted to user@cassandra.apache.org by Erik Forsberg <fo...@opera.com> on 2012/03/21 16:27:48 UTC

On Bloom filters and Key Cache

Hi!

We're currently testing Cassandra with a large number of row keys per
node - nodetool cfstats estimates the number of keys at roughly
700M per node. This seems to have caused very large heap consumption.

After reading 
http://wiki.apache.org/cassandra/LargeDataSetConsiderations I think I've 
tracked this down to the bloom filter, and the sampled index entries.

Regarding bloom filters, have I understood correctly that they are 
stored on Heap, and that the "Bloom Filter Space Used" reported by 
'nodetool cfstats' is an approximation of the heap space used by bloom 
filters? It reports the on-disk size, but if I understand 
CASSANDRA-3497, the on-disk size is smaller than the on-Heap size?

I understand that increasing bloom_filter_fp_chance will decrease the 
bloom filter size, but at the cost of worse performance when asking for 
keys that don't exist. I do have a fair amount of queries for keys that 
don't exist.

How much will increasing the key cache help, i.e. decrease bloom filter 
size but increase key cache size? Will the key cache cache negative 
results, i.e. the fact that a key didn't exist?

Regards,
\EF

Re: On Bloom filters and Key Cache

Posted by aaron morton <aa...@thelastpickle.com>.

> Regarding bloom filters, have I understood correctly that they are stored on Heap,
Yes.

> and that the "Bloom Filter Space Used" reported by 'nodetool cfstats' is an approximation of the heap space used by bloom filters?
Yes, it's the serialised on-disk size. This will be smaller than the in-memory size. My guess is it's not a huge difference, as there are not a lot of objects involved.

How big are the -Filter files in the data directory?
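
One way to answer that is to sum the file sizes directly. The following is a minimal sketch, assuming the bloom filter components are the files whose names end in "-Filter.db" somewhere under the data directory; the function name is made up for illustration, and the directory you pass in depends on your data_file_directories setting.

```python
import os


def total_filter_bytes(data_dir):
    """Return the combined size in bytes of all *-Filter.db files under data_dir.

    Assumption: bloom filter components are named '<...>-Filter.db'.
    """
    total = 0
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            if name.endswith("-Filter.db"):
                total += os.path.getsize(os.path.join(root, name))
    return total
```

Calling total_filter_bytes("/var/lib/cassandra/data") (a commonly used location, but an assumption here) and dividing by 2**20 gives the on-disk total in MiB, which can then be compared with the "Bloom Filter Space Used" figures from nodetool cfstats.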

> How much will increasing the key cache help, i.e. decrease bloom filter size but increase key cache size? Will the key cache cache negative results, i.e. the fact that a key didn't exist?
The key cache does not cache negative hits. 

If you have a high proportion of requests for data that does not exist, you *might* be better off increasing the index sampling (https://github.com/apache/cassandra/blob/cassandra-1.0/conf/cassandra.yaml#L400) and/or reducing or disabling the key cache, while keeping the default bloom filter false positive rate.

Requests for rows that do not exist will mostly be handled by the bloom filter lookup in memory.
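
To get a feel for the size versus false-positive trade-off discussed above, here is a rough sketch using the textbook Bloom filter sizing formula, m = -n * ln(p) / (ln 2)^2 bits for n keys and false-positive chance p. This is an assumption for illustration only: Cassandra's own sizing logic may pick different values, and every SSTable carries its own filter, so real totals will not match exactly. The function names are hypothetical; the 700M key count is the figure from the original question.

```python
import math


def bloom_filter_bits(n_keys, fp_chance):
    """Textbook optimal bit count: m = -n * ln(p) / (ln 2)^2."""
    return int(math.ceil(-n_keys * math.log(fp_chance) / (math.log(2) ** 2)))


def bloom_filter_mib(n_keys, fp_chance):
    """Same estimate expressed in MiB (8 bits per byte, 2**20 bytes per MiB)."""
    return bloom_filter_bits(n_keys, fp_chance) / 8.0 / 2 ** 20


if __name__ == "__main__":
    n_keys = 700 * 10 ** 6  # roughly the per-node key count from the question
    for fp_chance in (0.001, 0.01, 0.1):
        print("fp_chance=%g -> about %.0f MiB"
              % (fp_chance, bloom_filter_mib(n_keys, fp_chance)))
```

Under this formula, each tenfold increase in the false-positive chance removes the same fixed number of bits per key, so the memory saving has to be weighed against the extra disk reads for keys that do not exist.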

Cheers


-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com
