Posted to dev@cassandra.apache.org by Todd Burruss <bb...@expedia.com> on 2012/01/13 01:07:13 UTC

Cache Row Size

I'm using ConcurrentLinkedHashCacheProvider and my data on disk is about 4gb, but the RAM used by the cache is around 25gb.  I have 70k columns per row, and only about 2500 rows – so a lot more columns than rows.  Has there been any discussion or a JIRA about reducing the size of the cache?  I can understand the overhead for column names, etc., but the ratio seems a bit distorted.
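
For reference, a quick back-of-the-envelope pass over those numbers (a rough sketch only; the row and column counts are the approximate figures above, not exact):

    public class CacheFootprint {
        public static void main(String[] args) {
            long rows = 2500L;                                    // ~2500 rows
            long columnsPerRow = 70000L;                          // ~70k columns per row
            long totalColumns = rows * columnsPerRow;             // ~175,000,000 columns
            double onDiskBytes = 4e9;                             // ~4gb on disk
            double inCacheBytes = 25e9;                           // ~25gb in the row cache
            System.out.printf("bytes/column on disk:  %.1f%n", onDiskBytes / totalColumns);
            System.out.printf("bytes/column in cache: %.1f%n", inCacheBytes / totalColumns);
            System.out.printf("cache/disk blow-up:    %.1fx%n", inCacheBytes / onDiskBytes);
        }
    }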

I'm tracing through the code, so any pointers to help me understand are appreciated.

thx

Re: Cache Row Size

Posted by Todd Burruss <bb...@expedia.com>.
thx for the info.  I'm a bit leery of memcached (or any out-of-process
cache) because of coherency issues:

https://issues.apache.org/jira/browse/CASSANDRA-2701



On 1/12/12 5:50 PM, "Bruno Leonardo Gonçalves" <br...@gmail.com> wrote:

>Twitter engineers reported a similar experience [1] (slide 32). They
>managed to reduce memory usage by 45% with a cache provider backed by
>Memcached. Lately I've been worrying a lot about the bloat of Java
>objects. On 64-bit servers, have you tried the JVM option
>-XX:+UseCompressedOops? This presentation [2] made me more worried. But
>do let us know how it goes. :-)
>
>[1] http://www.scribd.com/doc/59830692/Cassandra-at-Twitter
>[2]
>http://www.cs.virginia.edu/kim/publicity/pldi09tutorials/memory-efficient-java-tutorial.pdf
>
>--
>Bruno Leonardo Gonçalves
>
>
>On Thu, Jan 12, 2012 at 22:07, Todd Burruss <bb...@expedia.com> wrote:
>
>> I'm using ConcurrentLinkedHashCacheProvider and my data on disk is about
>> 4gb, but the RAM used by the cache is around 25gb.  I have 70k columns per
>> row, and only about 2500 rows – so a lot more columns than rows.  has there
>> been any discussion or JIRAs discussing reducing the size of the cache?  I
>> can understand the overhead for column names, etc, but the ratio seems a
>> bit distorted.
>>
>> I'm tracing through the code, so any pointers to help me understand is
>> appreciated
>>
>> thx
>>


Re: Cache Row Size

Posted by Bruno Leonardo Gonçalves <br...@gmail.com>.
Twitter engineers reported a similar experience [1] (slide 32). They
managed to reduce memory usage by 45% with a cache provider backed by
Memcached. Lately I've been worrying a lot about the bloat of Java
objects. On 64-bit servers, have you tried the JVM option
-XX:+UseCompressedOops? This presentation [2] made me more worried. But
do let us know how it goes. :-)

[1] http://www.scribd.com/doc/59830692/Cassandra-at-Twitter
[2]
http://www.cs.virginia.edu/kim/publicity/pldi09tutorials/memory-efficient-java-tutorial.pdf
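
If it helps, this small sketch (assuming a HotSpot JVM, where the com.sun.management diagnostic MBean is available) prints whether compressed oops are actually in effect:

    import java.lang.management.ManagementFactory;
    import com.sun.management.HotSpotDiagnosticMXBean;

    // Prints the effective value of -XX:+UseCompressedOops on a HotSpot JVM.
    // Sketch only; the diagnostic MXBean is HotSpot-specific.
    public class CheckCompressedOops {
        public static void main(String[] args) throws Exception {
            HotSpotDiagnosticMXBean hotspot = ManagementFactory.newPlatformMXBeanProxy(
                    ManagementFactory.getPlatformMBeanServer(),
                    "com.sun.management:type=HotSpotDiagnostic",
                    HotSpotDiagnosticMXBean.class);
            System.out.println("UseCompressedOops = "
                    + hotspot.getVMOption("UseCompressedOops").getValue());
        }
    }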

--
Bruno Leonardo Gonçalves


On Thu, Jan 12, 2012 at 22:07, Todd Burruss <bb...@expedia.com> wrote:

> I'm using ConcurrentLinkedHashCacheProvider and my data on disk is about
> 4gb, but the RAM used by the cache is around 25gb.  I have 70k columns per
> row, and only about 2500 rows – so a lot more columns than rows.  has there
> been any discussion or JIRAs discussing reducing the size of the cache?  I
> can understand the overhead for column names, etc, but the ratio seems a
> bit distorted.
>
> I'm tracing through the code, so any pointers to help me understand is
> appreciated
>
> thx
>

Re: Cache Row Size

Posted by Jonathan Ellis <jb...@gmail.com>.
Interesting.  I'm not sure what to do with that information, but interesting. :)

2012/1/16 Todd Burruss <bb...@expedia.com>:
> I did a little more digging and a lot of the "overhead" I see in the cache
> is from the usage of ByteBuffer.  Each ByteBuffer takes 48 bytes,
> regardless of the data it represents.  so for a single IColumn stored in
> the cache, 96 bytes (one for name, one for value) are for ByteBuffer's
> needs.
>
> converting to byte[] would save a significant chunk of memory.  however I
> know the investment in ByteBuffer is significant.  creating a cache
> provider that persists the values as byte[] instead of ByteBuffer is easy,
> somewhat like the Serializing cache provider, by creating a copy of the
> row on "put".  however, saving the keys as byte[] instead of ByteBuffer
> runs a bit deeper through the code.  not sure if I want to go there.
>
> since I am randomly accessing the columns within wide rows, I need *all*
> the rows to be cached to get good performance. this is the reason for my
> desire to save as much RAM as possible.  according to my calculations, if
> convert to byte[] this will save nearly 8gb of RAM out of the approx 25gb
> the cache is currently using.
>
> the easy fix is to simply buy more RAM and/or more machines, but wanted to
> get any feedback to see if there's something to my findings.
>
> thx
>
> fyi ... I also created some cache providers using Ehcache and
> LinkedHashMap and both exhibit about the same memory usage (in my use
> case) as ConcurrentLinkedHashCache.
>
>
>
>
> On 1/12/12 9:02 PM, "Jonathan Ellis" <jb...@gmail.com> wrote:
>
>>The serializing cache is basically optimal.  Your problem is really
>>that row cache is not designed for wide rows at all.  See
>>https://issues.apache.org/jira/browse/CASSANDRA-1956
>>
>>On Thu, Jan 12, 2012 at 10:46 PM, Todd Burruss <bb...@expedia.com>
>>wrote:
>>> after looking through the code it seems fairly straight forward to create
>>> some different cache providers and try some things.
>>>
>>> has anyone tried ehcache w/o persistence?  I see this JIRA
>>> https://issues.apache.org/jira/browse/CASSANDRA-1945 but the main
>>> complaint was the disk serialization, which I don't think anyone wants.
>>>
>>>
>>> On 1/12/12 6:18 PM, "Jonathan Ellis" <jb...@gmail.com> wrote:
>>>
>>>>8x is pretty normal for JVM and bookkeeping overhead with the CLHCP.
>>>>
>>>>The SerializedCacheProvider is the default in 1.0 and is much
>>>>lighter-weight.
>>>>
>>>>On Thu, Jan 12, 2012 at 6:07 PM, Todd Burruss <bb...@expedia.com>
>>>>wrote:
>>>>> I'm using ConcurrentLinkedHashCacheProvider and my data on disk is
>>>>>about 4gb, but the RAM used by the cache is around 25gb.  I have 70k
>>>>>columns per row, and only about 2500 rows – so a lot more columns than
>>>>>rows.  has there been any discussion or JIRAs discussing reducing the
>>>>>size of the cache?  I can understand the overhead for column names, etc,
>>>>>but the ratio seems a bit distorted.
>>>>>
>>>>> I'm tracing through the code, so any pointers to help me understand is
>>>>>appreciated
>>>>>
>>>>> thx
>>>>
>>>>
>>>>
>>>>--
>>>>Jonathan Ellis
>>>>Project Chair, Apache Cassandra
>>>>co-founder of DataStax, the source for professional Cassandra support
>>>>http://www.datastax.com
>>>
>>
>>
>>
>>--
>>Jonathan Ellis
>>Project Chair, Apache Cassandra
>>co-founder of DataStax, the source for professional Cassandra support
>>http://www.datastax.com
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

Re: Cache Row Size

Posted by Todd Burruss <bb...@expedia.com>.
I did a little more digging, and a lot of the "overhead" I see in the cache
is from the use of ByteBuffer.  Each ByteBuffer takes 48 bytes, regardless
of the data it represents, so for a single IColumn stored in the cache,
96 bytes (one buffer for the name, one for the value) go to ByteBuffer's
needs.

Converting to byte[] would save a significant chunk of memory; however, I
know the investment in ByteBuffer is significant.  Creating a cache
provider that persists the values as byte[] instead of ByteBuffer is easy,
somewhat like the serializing cache provider, by creating a copy of the
row on "put".  However, saving the keys as byte[] instead of ByteBuffer
runs a bit deeper through the code, and I'm not sure I want to go there.
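
To make that concrete, here is roughly what I mean by copying values out of ByteBuffer on "put".  A minimal sketch only, not the actual provider; the class and method names are made up for illustration, and it only handles values (keys stay as-is):

    import java.nio.ByteBuffer;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative only: stores values as plain byte[] copies instead of
    // keeping ByteBuffer wrappers on the heap.
    public class ColumnValueCache {
        private final ConcurrentHashMap<String, byte[]> values =
                new ConcurrentHashMap<String, byte[]>();

        // Copy the buffer's remaining bytes into a byte[] on put, so the cache
        // holds one array per value rather than a ByteBuffer wrapper per value.
        public void put(String key, ByteBuffer value) {
            ByteBuffer dup = value.duplicate();   // don't disturb the caller's position
            byte[] copy = new byte[dup.remaining()];
            dup.get(copy);
            values.put(key, copy);
        }

        // Re-wrap on read; the wrapper is short-lived and collected quickly.
        public ByteBuffer get(String key) {
            byte[] copy = values.get(key);
            return copy == null ? null : ByteBuffer.wrap(copy);
        }
    }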

Since I am randomly accessing the columns within wide rows, I need *all*
the rows to be cached to get good performance; this is the reason for my
desire to save as much RAM as possible.  According to my calculations,
converting to byte[] would save nearly 8gb of RAM out of the approx 25gb
the cache is currently using.

The easy fix is simply to buy more RAM and/or more machines, but I wanted
to get feedback to see if there's something to my findings.

thx

fyi ... I also created some cache providers using Ehcache and
LinkedHashMap and both exhibit about the same memory usage (in my use
case) as ConcurrentLinkedHashCache.
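
For what it's worth, the LinkedHashMap variant was basically the textbook access-order LRU, something like this sketch (capacity and names are just for illustration; it is not thread-safe on its own):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Textbook LRU: access-order LinkedHashMap that evicts the eldest entry
    // once the capacity is exceeded.  Synchronize externally for concurrent use.
    public class LruRowCache<K, V> extends LinkedHashMap<K, V> {
        private final int capacity;

        public LruRowCache(int capacity) {
            super(16, 0.75f, true);   // accessOrder = true gives LRU iteration order
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > capacity;
        }
    }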




On 1/12/12 9:02 PM, "Jonathan Ellis" <jb...@gmail.com> wrote:

>The serializing cache is basically optimal.  Your problem is really
>that row cache is not designed for wide rows at all.  See
>https://issues.apache.org/jira/browse/CASSANDRA-1956
>
>On Thu, Jan 12, 2012 at 10:46 PM, Todd Burruss <bb...@expedia.com>
>wrote:
>> after looking through the code it seems fairly straight forward to create
>> some different cache providers and try some things.
>>
>> has anyone tried ehcache w/o persistence?  I see this JIRA
>> https://issues.apache.org/jira/browse/CASSANDRA-1945 but the main
>> complaint was the disk serialization, which I don't think anyone wants.
>>
>>
>> On 1/12/12 6:18 PM, "Jonathan Ellis" <jb...@gmail.com> wrote:
>>
>>>8x is pretty normal for JVM and bookkeeping overhead with the CLHCP.
>>>
>>>The SerializedCacheProvider is the default in 1.0 and is much
>>>lighter-weight.
>>>
>>>On Thu, Jan 12, 2012 at 6:07 PM, Todd Burruss <bb...@expedia.com>
>>>wrote:
>>>> I'm using ConcurrentLinkedHashCacheProvider and my data on disk is
>>>>about 4gb, but the RAM used by the cache is around 25gb.  I have 70k
>>>>columns per row, and only about 2500 rows – so a lot more columns than
>>>>rows.  has there been any discussion or JIRAs discussing reducing the
>>>>size of the cache?  I can understand the overhead for column names, etc,
>>>>but the ratio seems a bit distorted.
>>>>
>>>> I'm tracing through the code, so any pointers to help me understand is
>>>>appreciated
>>>>
>>>> thx
>>>
>>>
>>>
>>>--
>>>Jonathan Ellis
>>>Project Chair, Apache Cassandra
>>>co-founder of DataStax, the source for professional Cassandra support
>>>http://www.datastax.com
>>
>
>
>
>-- 
>Jonathan Ellis
>Project Chair, Apache Cassandra
>co-founder of DataStax, the source for professional Cassandra support
>http://www.datastax.com


Re: Cache Row Size

Posted by Jonathan Ellis <jb...@gmail.com>.
The serializing cache is basically optimal.  Your problem is really
that row cache is not designed for wide rows at all.  See
https://issues.apache.org/jira/browse/CASSANDRA-1956

On Thu, Jan 12, 2012 at 10:46 PM, Todd Burruss <bb...@expedia.com> wrote:
> after looking through the code it seems fairly straight forward to create
> some different cache providers and try some things.
>
> has anyone tried ehcache w/o persistence?  I see this JIRA
> https://issues.apache.org/jira/browse/CASSANDRA-1945 but the main
> complaint was the disk serialization, which I don't think anyone wants.
>
>
> On 1/12/12 6:18 PM, "Jonathan Ellis" <jb...@gmail.com> wrote:
>
>>8x is pretty normal for JVM and bookkeeping overhead with the CLHCP.
>>
>>The SerializedCacheProvider is the default in 1.0 and is much
>>lighter-weight.
>>
>>On Thu, Jan 12, 2012 at 6:07 PM, Todd Burruss <bb...@expedia.com>
>>wrote:
>>> I'm using ConcurrentLinkedHashCacheProvider and my data on disk is
>>>about 4gb, but the RAM used by the cache is around 25gb.  I have 70k
>>>columns per row, and only about 2500 rows – so a lot more columns than
>>>rows.  has there been any discussion or JIRAs discussing reducing the
>>>size of the cache?  I can understand the overhead for column names, etc,
>>>but the ratio seems a bit distorted.
>>>
>>> I'm tracing through the code, so any pointers to help me understand is
>>>appreciated
>>>
>>> thx
>>
>>
>>
>>--
>>Jonathan Ellis
>>Project Chair, Apache Cassandra
>>co-founder of DataStax, the source for professional Cassandra support
>>http://www.datastax.com
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

Re: Cache Row Size

Posted by Todd Burruss <bb...@expedia.com>.
After looking through the code, it seems fairly straightforward to create
some different cache providers and try some things.

Has anyone tried Ehcache without persistence?  I see this JIRA,
https://issues.apache.org/jira/browse/CASSANDRA-1945, but the main
complaint there was the disk serialization, which I don't think anyone wants.
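
In case anyone wants to try it, this is roughly the shape of a memory-only Ehcache setup I have in mind (a sketch assuming the Ehcache 2.x API; the cache name and sizes are placeholders):

    import net.sf.ehcache.Cache;
    import net.sf.ehcache.CacheManager;
    import net.sf.ehcache.Element;
    import net.sf.ehcache.config.CacheConfiguration;

    // Memory-only Ehcache: no disk overflow, no persistence.
    public class MemoryOnlyEhcacheDemo {
        public static void main(String[] args) {
            CacheManager manager = CacheManager.create();
            CacheConfiguration config = new CacheConfiguration("rowCache", 100000)
                    .overflowToDisk(false)   // keep everything on-heap
                    .eternal(true);          // no TTL/TTI expiry
            Cache cache = new Cache(config);
            manager.addCache(cache);

            cache.put(new Element("row-key", new byte[]{1, 2, 3}));
            Element hit = cache.get("row-key");
            System.out.println("cached bytes: " + ((byte[]) hit.getObjectValue()).length);

            manager.shutdown();
        }
    }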


On 1/12/12 6:18 PM, "Jonathan Ellis" <jb...@gmail.com> wrote:

>8x is pretty normal for JVM and bookkeeping overhead with the CLHCP.
>
>The SerializedCacheProvider is the default in 1.0 and is much
>lighter-weight.
>
>On Thu, Jan 12, 2012 at 6:07 PM, Todd Burruss <bb...@expedia.com>
>wrote:
>> I'm using ConcurrentLinkedHashCacheProvider and my data on disk is
>>about 4gb, but the RAM used by the cache is around 25gb.  I have 70k
>>columns per row, and only about 2500 rows – so a lot more columns than
>>rows.  has there been any discussion or JIRAs discussing reducing the
>>size of the cache?  I can understand the overhead for column names, etc,
>>but the ratio seems a bit distorted.
>>
>> I'm tracing through the code, so any pointers to help me understand is
>>appreciated
>>
>> thx
>
>
>
>-- 
>Jonathan Ellis
>Project Chair, Apache Cassandra
>co-founder of DataStax, the source for professional Cassandra support
>http://www.datastax.com


Re: Cache Row Size

Posted by Jonathan Ellis <jb...@gmail.com>.
8x is pretty normal for JVM and bookkeeping overhead with the CLHCP.

The SerializingCacheProvider is the default in 1.0 and is much lighter-weight.
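
If it helps to picture why it is lighter, the idea is roughly this toy sketch: keep each cached row as one serialized blob instead of a graph of per-column objects.  Plain Java serialization is used here as a stand-in for Cassandra's own serializers, and the class name is made up:

    import java.io.*;
    import java.util.concurrent.ConcurrentHashMap;

    // Toy illustration of the serializing-cache idea: one byte[] blob per row,
    // so the GC sees far fewer live objects per cached row.
    public class SerializingRowCache<K, V extends Serializable> {
        private final ConcurrentHashMap<K, byte[]> blobs = new ConcurrentHashMap<K, byte[]>();

        public void put(K key, V row) throws IOException {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(bytes);
            out.writeObject(row);        // serialize the whole row into one blob
            out.close();
            blobs.put(key, bytes.toByteArray());
        }

        @SuppressWarnings("unchecked")
        public V get(K key) throws IOException, ClassNotFoundException {
            byte[] blob = blobs.get(key);
            if (blob == null) return null;
            ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(blob));
            return (V) in.readObject();  // deserialize a fresh copy on every read
        }
    }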

On Thu, Jan 12, 2012 at 6:07 PM, Todd Burruss <bb...@expedia.com> wrote:
> I'm using ConcurrentLinkedHashCacheProvider and my data on disk is about 4gb, but the RAM used by the cache is around 25gb.  I have 70k columns per row, and only about 2500 rows – so a lot more columns than rows.  has there been any discussion or JIRAs discussing reducing the size of the cache?  I can understand the overhead for column names, etc, but the ratio seems a bit distorted.
>
> I'm tracing through the code, so any pointers to help me understand is appreciated
>
> thx



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com