Posted to user@cassandra.apache.org by Bryan Talbot <bt...@aeriagames.com> on 2012/10/18 20:06:06 UTC

constant CMS GC using CPU time

In a 4 node cluster running Cassandra 1.1.5 with sun jvm 1.6.0_29-b11
(64-bit), the nodes are often getting "stuck" in a state where CMS
collections of the old space are constantly running.

The JVM configuration is using the standard settings in cassandra-env --
relevant settings are included below.  The max heap is currently set to 5
GB with 800MB for new size.  I don't believe that the cluster is overly
busy, and it seems to be performing well enough apart from this issue.  When
nodes get into this state they never seem to leave it (by freeing up old
space memory) without restarting cassandra.  They typically enter this
state while running "nodetool repair -pr" but once they start doing this,
restarting them only "fixes" it for a couple of hours.

Compactions are completing and are generally not queued up.  All CF are
using STCS.  The busiest CF consumes about 100GB of space on disk, is write
heavy, and all columns have a TTL of 3 days.  Overall, there are 41 CF
including those used for system keyspace and secondary indexes.  The number
of SSTables per node currently varies from 185-212.

Other than frequent log warnings about "GCInspector  - Heap is 0.xxx full..."
and "StorageService  - Flushing CFS(...) to relieve memory pressure" there
are no other log entries to indicate there is a problem.

Does the memory needed vary depending on the amount of data stored?  If so,
how can I predict how much jvm space is needed?  I don't want to make the
heap too large as that's bad too.  Maybe there's a memory leak related to
compaction that doesn't allow meta-data to be purged?


-Bryan


12 GB of RAM in host with ~6 GB used by java and ~6 GB for OS and buffer
cache.
$> free -m
             total       used       free     shared    buffers     cached
Mem:         12001      11870        131          0          4       5778
-/+ buffers/cache:       6087       5914
Swap:            0          0          0


jvm settings in cassandra-env
MAX_HEAP_SIZE="5G"
HEAP_NEWSIZE="800M"

# GC tuning options
JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"


jstat shows about 12 full collections per minute with old heap usage
constantly over 75% so CMS is always over the
CMSInitiatingOccupancyFraction threshold.

$> jstat -gcutil -t 22917 5000 4
Timestamp         S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT
       132063.0  34.70   0.00  26.03  82.29  59.88  21580  506.887 17523 3078.941 3585.829
       132068.0  34.70   0.00  50.02  81.23  59.88  21580  506.887 17524 3079.220 3586.107
       132073.1   0.00  24.92  46.87  81.41  59.88  21581  506.932 17525 3079.583 3586.515
       132078.1   0.00  24.92  64.71  81.40  59.88  21581  506.932 17527 3079.853 3586.785


Other hosts not currently experiencing the high CPU load have a heap less
than .75 full.

$> jstat -gcutil -t 6063 5000 4
Timestamp         S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT
       520731.6   0.00  12.70  36.37  71.33  59.26  46453 1688.809 14785 2130.779 3819.588
       520736.5   0.00  12.70  53.25  71.33  59.26  46453 1688.809 14785 2130.779 3819.588
       520741.5   0.00  12.70  68.92  71.33  59.26  46453 1688.809 14785 2130.779 3819.588
       520746.5   0.00  12.70  83.11  71.33  59.26  46453 1688.809 14785 2130.779 3819.588

Re: constant CMS GC using CPU time

Posted by aaron morton <aa...@thelastpickle.com>.
> How does compaction_throughput relate to memory usage?  
It reduces the rate of memory allocation. 
e.g. Say ParNew can normally keep up with the rate of allocation without stopping for too long: the rate of promotion stays lowish and everything is allocated in Eden. If the allocation rate gets higher, ParNew runs more frequently and objects that don't really need to be there may be promoted to tenured space.

>  I assumed that was more for IO tuning.  I noticed that lowering concurrent_compactors to 4 (from default of 8) lowered the memory used during compactions.
Similar thing to above. This may reduce the number of rows held in memory at any instant for compaction. 

Only rows less than in_memory_compaction_limit are loaded into memory during compaction. So reducing that may reduce the memory usage.
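For reference, all three of those knobs live in cassandra.yaml on 1.1.x. A hedged sketch of "turn them down" follows; the values are only illustrative, not a recommendation for this cluster:

# cassandra.yaml -- illustrative values only
compaction_throughput_mb_per_sec: 8    # default is 16; lower throttles compaction and its allocation churn
concurrent_compactors: 4               # Bryan reports lower memory use going from 8 to 4 on this hardware
in_memory_compaction_limit_in_mb: 32   # default is 64; rows larger than this use the slower two-pass compaction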

>  Since then I've reduced the TTL to 1 hour and set gc_grace_seconds to 0 so the number of rows and data dropped to a level it can handle.
Cool. Sorry it took so long to get there. 


-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 26/10/2012, at 8:08 AM, Bryan Talbot <bt...@aeriagames.com> wrote:

> On Thu, Oct 25, 2012 at 4:15 AM, aaron morton <aa...@thelastpickle.com> wrote:
>> This sounds very much like "my heap is so consumed by (mostly) bloom
>> filters that I am in steady state GC thrash."
>> 
>> Yes, I think that was at least part of the issue.
> 
> The rough numbers I've used to estimate working set are:
> 
> * bloom filter size for 400M rows at 0.00074 fp without java fudge (they are just a big array) 714 MB
> * memtable size 1024 MB 
> * index sampling:
> 	*  24 bytes + key (16 bytes for UUID) = 32 bytes 
> 	* 400M / 128 default sampling = 3,125,000
> 	*  3,125,000 * 32 = 95 MB
> 	* java fudge X5 or X10 = 475MB to 950MB
> * ignoring row cache and key cache
>  
> So the high side number is 2213 to 2,688. High because the fudge is a delicious sticky guess and the memtable space would rarely be full. 
> 
> On a 5120 MB heap, with 800MB new you have roughly  4300 MB tenured  (some goes to perm) and 75% of that is 3,225 MB. Not terrible but it depends on the working set and how quickly stuff gets tenured which depends on the workload. 
> 
> These values seem reasonable and in line with what I was seeing.  There are other CF and apps sharing this cluster but this one was the largest.  
> 
> 
>   
> 
> You can confirm these guesses somewhat manually by enabling all the GC logging in cassandra-env.sh. Restart the node and let it operate normally, probably best to keep repair off.
> 
> 
> 
> I was using jstat to monitor gc activity and some snippets from that are in my original email in this thread.  The key behavior was that full gc was running pretty often and never able to reclaim much (if any) space.
> 
> 
>  
> 
> There are a few things you could try:
> 
> * increase the JVM heap by say 1Gb and see how it goes
> * increase bloom filter false positive,  try 0.1 first (see http://www.datastax.com/docs/1.1/configuration/storage_configuration#bloom-filter-fp-chance) 
> * increase index_interval sampling in yaml.  
> * decreasing compaction_throughput and in_memory_compaction_limit can lessen the additional memory pressure compaction adds. 
> * disable caches or ensure off heap caches are used.
> 
> I've done several of these already in addition to changing the app to reduce the number of rows retained.  How does compaction_throughput relate to memory usage?  I assumed that was more for IO tuning.  I noticed that lowering concurrent_compactors to 4 (from default of 8) lowered the memory used during compactions.  in_memory_compaction_limit_in_mb seems to only be used for wide rows and this CF didn't have any wider than in_memory_compaction_limit_in_mb.  My multithreaded_compaction is still false.
> 
>  
> 
> Watching the gc logs and the cassandra log is a great way to get a feel for what works in your situation. Also take note of any scheduled processing your app does which may impact things, and look for poorly performing queries. 
> 
> Finally this book is a good reference on Java GC http://amzn.com/0137142528 
> 
> For my understanding what was the average row size for the 400 million keys ? 
> 
> 
> 
> The compacted row mean size for the CF is 8815 (as reported by cfstats) but that comes out to be much larger than the real load per node I was seeing.  Each node had about 200GB of data for the CF with 4 nodes in the cluster and RF=3.  At the time, the TTL for all columns was 3 days and gc_grace_seconds was 5 days.  Since then I've reduced the TTL to 1 hour and set gc_grace_seconds to 0 so the number of rows and data dropped to a level it can handle.
> 
> 
> -Bryan
> 


Re: constant CMS GC using CPU time

Posted by Bryan Talbot <bt...@aeriagames.com>.
On Thu, Oct 25, 2012 at 4:15 AM, aaron morton <aa...@thelastpickle.com>wrote:

>  This sounds very much like "my heap is so consumed by (mostly) bloom
>> filters that I am in steady state GC thrash."
>>
>
> Yes, I think that was at least part of the issue.
>
>
> The rough numbers I've used to estimate working set are:
>
> * bloom filter size for 400M rows at 0.00074 fp without java fudge (they
> are just a big array) 714 MB
> * memtable size 1024 MB
> * index sampling:
> *  24 bytes + key (16 bytes for UUID) = 32 bytes
>  * 400M / 128 default sampling = 3,125,000
> *  3,125,000 * 32 = 95 MB
>  * java fudge X5 or X10 = 475MB to 950MB
> * ignoring row cache and key cache
>
> So the high side number is 2213 to 2,688. High because the fudge is a
> delicious sticky guess and the memtable space would rarely be full.
>
> On a 5120 MB heap, with 800MB new you have roughly  4300 MB tenured  (some
> goes to perm) and 75% of that is 3,225 MB. Not terrible but it depends on
> the working set and how quickly stuff gets tenured which depends on the
> workload.
>

These values seem reasonable and in line with what I was seeing.  There are
other CF and apps sharing this cluster but this one was the largest.




>
> You can confirm these guesses somewhat manually by enabling all the GC
> logging in cassandra-env.sh. Restart the node and let it operate normally,
> probably best to keep repair off.
>
>
>
I was using jstat to monitor gc activity and some snippets from that are in
my original email in this thread.  The key behavior was that full gc was
running pretty often and never able to reclaim much (if any) space.




>
> There are a few things you could try:
>
> * increase the JVM heap by say 1Gb and see how it goes
> * increase bloom filter false positive,  try 0.1 first (see
> http://www.datastax.com/docs/1.1/configuration/storage_configuration#bloom-filter-fp-chance
> )
> * increase index_interval sampling in yaml.
> * decreasing compaction_throughput and in_memory_compaction_limit can
> lessen the additional memory pressure compaction adds.
> * disable caches or ensure off heap caches are used.
>

I've done several of these already in addition to changing the app to
reduce the number of rows retained.  How does compaction_throughput relate
to memory usage?  I assumed that was more for IO tuning.  I noticed that
lowering concurrent_compactors to 4 (from default of 8) lowered the memory
used during compactions.  in_memory_compaction_limit_in_mb seems to only be
used for wide rows and this CF didn't have any wider
than in_memory_compaction_limit_in_mb.  My multithreaded_compaction is
still false.



>
> Watching the gc logs and the cassandra log is a great way to get a feel
> for what works in your situation. Also take note of any scheduled
> processing your app does which may impact things, and look for poorly
> performing queries.
>
> Finally this book is a good reference on Java GC
> http://amzn.com/0137142528
>
> For my understanding what was the average row size for the 400 million
> keys ?
>
>

The compacted row mean size for the CF is 8815 bytes (as reported by cfstats) but
that comes out to be much larger than the real load per node I was seeing.
 Each node had about 200GB of data for the CF with 4 nodes in the cluster
and RF=3.  At the time, the TTL for all columns was 3 days and
gc_grace_seconds was 5 days.  Since then I've reduced the TTL to 1 hour and
set gc_grace_seconds to 0 so the number of rows and data dropped to a level
it can handle.


-Bryan

Re: constant CMS GC using CPU time

Posted by aaron morton <aa...@thelastpickle.com>.
> This sounds very much like "my heap is so consumed by (mostly) bloom
> filters that I am in steady state GC thrash."
> 
> Yes, I think that was at least part of the issue.

The rough numbers I've used to estimate working set are:

* bloom filter size for 400M rows at 0.00074 fp without java fudge (they are just a big array) 714 MB
* memtable size 1024 MB 
* index sampling:
	*  24 bytes + key (16 bytes for UUID) = 32 bytes 
	* 400M / 128 default sampling = 3,125,000
	*  3,125,000 * 32 = 95 MB
	* java fudge X5 or X10 = 475MB to 950MB
* ignoring row cache and key cache
 
So the high-side number is 2,213 to 2,688 MB. High because the fudge factor is a delicious sticky guess and the memtable space would rarely be full. 

On a 5120 MB heap with 800MB new, you have roughly 4300 MB tenured (some goes to perm) and 75% of that is 3,225 MB. Not terrible, but it depends on the working set and how quickly stuff gets tenured, which depends on the workload. 
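For anyone who wants to replay that arithmetic, here is a minimal Python sketch. It assumes the standard Bloom filter sizing formula m = -n*ln(p)/(ln 2)^2 and the inputs above (400M keys, fp chance 0.000744, default index_interval of 128, ~32 bytes per sample, 1 GB of memtable space); the x5/x10 factor is only a JVM-overhead guess, so the totals are estimates, not measurements.

import math

MB = 1024 * 1024

def bloom_filter_bytes(n_keys, fp_chance):
    # Standard Bloom filter sizing: m = -n * ln(p) / (ln 2)^2 bits.
    # Cassandra rounds its bucket counts internally, so treat this as an estimate.
    bits = -n_keys * math.log(fp_chance) / (math.log(2) ** 2)
    return bits / 8

n_rows = 400e6                                      # keys per node, from the thread
bloom = bloom_filter_bytes(n_rows, 0.000744) / MB   # ~715 MB (the thread's 714 MB)
index = (n_rows / 128) * 32 / MB                    # ~95 MB of raw index samples
memtable = 1024                                     # memtable space in MB

low = bloom + memtable + index * 5                  # ~2,216 MB (the thread's 2,213)
high = bloom + memtable + index * 10                # ~2,693 MB (the thread's 2,688)
print("bloom=%.0f MB, low=%.0f MB, high=%.0f MB" % (bloom, low, high))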

You can confirm these guesses somewhat manually by enabling all the GC logging in cassandra-env.sh. Restart the node and let it operate normally, probably best to keep repair off.
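For what it's worth, "all the GC logging" here usually means adding (or uncommenting) flags along these lines in cassandra-env.sh; the exact flag set and the log path are up to you, but these are the standard HotSpot options that produce output like the samples below:

JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintHeapAtGC"
JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"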

This is a sample (partial) GC log from before CMS kicks in; note the concurrent mark-sweep used size. Your values may differ; this has non-default settings:

Heap after GC invocations=9947 (full 182):
 par new generation   total 1024000K, used 101882K [0x00000006fae00000, 0x0000000745e00000, 0x0000000745e00000)
  eden space 819200K,   0% used [0x00000006fae00000, 0x00000006fae00000, 0x000000072ce00000)
  from space 204800K,  49% used [0x0000000739600000, 0x000000073f97eaf8, 0x0000000745e00000)
  to   space 204800K,   0% used [0x000000072ce00000, 0x000000072ce00000, 0x0000000739600000)
 concurrent mark-sweep generation total 2965504K, used 2309885K [0x0000000745e00000, 0x00000007fae00000, 0x00000007fae00000)
 concurrent-mark-sweep perm gen total 38052K, used 22811K [0x00000007fae00000, 0x00000007fd329000, 0x0000000800000000)
}

This is when it starts; see the (full X) count increase:

2012-10-25T03:32:44.664-0500: 76691.891: [GC [1 CMS-initial-mark: 2309885K(2965504K)] 2411929K(3989504K), 0.0047910 secs] [Times: user=0.01 sys=0.00, real=0.01 secs] 
Total time for which application threads were stopped: 0.0059850 seconds

(other CMS type logs)

{Heap before GC invocations=9947 (full 183):
 par new generation   total 1024000K, used 921082K [0x00000006fae00000, 0x0000000745e00000, 0x0000000745e00000)
  eden space 819200K, 100% used [0x00000006fae00000, 0x000000072ce00000, 0x000000072ce00000)
  from space 204800K,  49% used [0x0000000739600000, 0x000000073f97eaf8, 0x0000000745e00000)
  to   space 204800K,   0% used [0x000000072ce00000, 0x000000072ce00000, 0x0000000739600000)
 concurrent mark-sweep generation total 2965504K, used 2206292K [0x0000000745e00000, 0x00000007fae00000, 0x00000007fae00000)
 concurrent-mark-sweep perm gen total 38052K, used 22811K [0x00000007fae00000, 0x00000007fd329000, 0x0000000800000000)

A couple of log messages later, the concurrent mark-sweep used size is down:

{Heap before GC invocations=9948 (full 183):
 par new generation   total 1024000K, used 938695K [0x00000006fae00000, 0x0000000745e00000, 0x0000000745e00000)
  eden space 819200K, 100% used [0x00000006fae00000, 0x000000072ce00000, 0x000000072ce00000)
  from space 204800K,  58% used [0x000000072ce00000, 0x00000007342b1f00, 0x0000000739600000)
  to   space 204800K,   0% used [0x0000000739600000, 0x0000000739600000, 0x0000000745e00000)
 concurrent mark-sweep generation total 2965504K, used 1096146K [0x0000000745e00000, 0x00000007fae00000, 0x00000007fae00000)
 concurrent-mark-sweep perm gen total 38052K, used 22811K [0x00000007fae00000, 0x00000007fd329000, 0x0000000800000000)
2012-10-25T03:32:50.479-0500: 76697.706: [GC Before GC:


There are a few things you could try:

* increase the JVM heap by say 1Gb and see how it goes
* increase the bloom filter false positive chance; try 0.1 first (see http://www.datastax.com/docs/1.1/configuration/storage_configuration#bloom-filter-fp-chance and the example after this list) 
* increase index_interval sampling in the yaml.  
* decrease compaction_throughput and in_memory_compaction_limit to lessen the additional memory pressure compaction adds. 
* disable caches or ensure off heap caches are used.
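As a concrete sketch of the first three items (the CF name is hypothetical, and the cassandra-cli syntax should be double-checked against the DataStax page linked above):

In cassandra-env.sh (bump the heap by ~1 GB):
  MAX_HEAP_SIZE="6G"

In cassandra-cli (relax the bloom filter on a write-mostly CF):
  update column family big_cf with bloom_filter_fp_chance = 0.1;

In cassandra.yaml (sample the primary index less densely; 128 is the default):
  index_interval: 256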

Watching the gc logs and the cassandra log is a great way to get a feel for what works in your situation. Also take note of any scheduled processing your app does which may impact things, and look for poorly performing queries. 

Finally this book is a good reference on Java GC http://amzn.com/0137142528 

For my understanding, what was the average row size for the 400 million keys? 

Hope that helps. 

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 25/10/2012, at 1:22 PM, Bryan Talbot <bt...@aeriagames.com> wrote:

> On Wed, Oct 24, 2012 at 2:38 PM, Rob Coli <rc...@palominodb.com> wrote:
> On Mon, Oct 22, 2012 at 8:38 AM, Bryan Talbot <bt...@aeriagames.com> wrote:
> > The nodes with the most data used the most memory.  All nodes are affected
> > eventually not just one.  The GC was on-going even when the nodes were not
> > compacting or running a heavy application load -- even when the main app was
> > paused constant the GC continued.
> 
> This sounds very much like "my heap is so consumed by (mostly) bloom
> filters that I am in steady state GC thrash."
> 
> Yes, I think that was at least part of the issue.
> 
>  
> 
> Do you have heap graphs which show a healthy sawtooth GC cycle which
> then more or less flatlines?
> 
> 
> 
> I didn't save any graphs but that is what they would look like.  I was using jstat to monitor gc activity. 
> 
> -Bryan
> 


Re: constant CMS GC using CPU time

Posted by Bryan Talbot <bt...@aeriagames.com>.
On Wed, Oct 24, 2012 at 2:38 PM, Rob Coli <rc...@palominodb.com> wrote:

> On Mon, Oct 22, 2012 at 8:38 AM, Bryan Talbot <bt...@aeriagames.com>
> wrote:
> > The nodes with the most data used the most memory.  All nodes are
> affected
> > eventually not just one.  The GC was on-going even when the nodes were
> not
> > compacting or running a heavy application load -- even when the main app
> was
> > paused constant the GC continued.
>
> This sounds very much like "my heap is so consumed by (mostly) bloom
> filters that I am in steady state GC thrash."
>

Yes, I think that was at least part of the issue.



>
> Do you have heap graphs which show a healthy sawtooth GC cycle which
> then more or less flatlines?
>
>

I didn't save any graphs but that is what they would look like.  I was
using jstat to monitor gc activity.

-Bryan

Re: constant CMS GC using CPU time

Posted by Rob Coli <rc...@palominodb.com>.
On Mon, Oct 22, 2012 at 8:38 AM, Bryan Talbot <bt...@aeriagames.com> wrote:
> The nodes with the most data used the most memory.  All nodes are affected
> eventually not just one.  The GC was on-going even when the nodes were not
> compacting or running a heavy application load -- even when the main app was
> paused constant the GC continued.

This sounds very much like "my heap is so consumed by (mostly) bloom
filters that I am in steady state GC thrash."

Do you have heap graphs which show a healthy sawtooth GC cycle which
then more or less flatlines?

=Rob

-- 
=Robert Coli
AIM&GTALK - rcoli@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb

Re: constant CMS GC using CPU time

Posted by aaron morton <aa...@thelastpickle.com>.
> Regarding memory usage after a repair ... Are the merkle trees kept around?
> 

They should not be.

Cheers


-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 24/10/2012, at 4:51 PM, B. Todd Burruss <bt...@gmail.com> wrote:

> Regarding memory usage after a repair ... Are the merkle trees kept around?
> 
> On Oct 23, 2012 3:00 PM, "Bryan Talbot" <bt...@aeriagames.com> wrote:
> On Mon, Oct 22, 2012 at 6:05 PM, aaron morton <aa...@thelastpickle.com> wrote:
>> The GC was on-going even when the nodes were not compacting or running a heavy application load -- even when the main app was paused constant the GC continued.
> If you restart a node is the onset of GC activity correlated to some event?
> 
> Yes and no.  When the nodes were generally under the .75 occupancy threshold a weekly "repair -pr" job would cause them to go over the threshold and then stay there even after the repair had completed and there were no ongoing compactions.  It acts as though at least some substantial amount of memory used during repair was never dereferenced once the repair was complete.
> 
> Once one CF in particular grew larger the constant GC would start up pretty soon (less than 90 minutes) after a node restart even without a repair.
> 
> 
>  
>  
>> As a test we dropped the largest CF and the memory usage immediately dropped to acceptable levels and the constant GC stopped.  So it's definitely related to data load.  memtable size is 1 GB, row cache is disabled and key cache is small (default).
> How many keys did the CF have per node? 
> I dismissed the memory used to  hold bloom filters and index sampling. That memory is not considered part of the memtable size, and will end up in the tenured heap. It is generally only a problem with very large key counts per node. 
> 
> 
> I've changed the app to retain less data for that CF but I think that it was about 400M rows per node.  Row keys are a TimeUUID.  All of the rows are write-once, never updated, and rarely read.  There are no secondary indexes for this particular CF.
> 
> 
>  
>>  They were 2+ GB (as reported by nodetool cfstats anyway).  It looks like the default bloom_filter_fp_chance defaults to 0.0 
> The default should be 0.000744.
> 
> If the chance is zero or null this code should run when a new SSTable is written 
>   // paranoia -- we've had bugs in the thrift <-> avro <-> CfDef dance before, let's not let that break things
>                 logger.error("Bloom filter FP chance of zero isn't supposed to happen");
> 
> Were the CF's migrated from an old version ?
> 
> 
> Yes, the CF were created in 1.0.9, then migrated to 1.0.11 and finally to 1.1.5 with a "upgradesstables" run at each upgrade along the way.
> 
> I could not find a way to view the current bloom_filter_fp_chance settings when they are at a default value.  JMX reports the actual fp rate and if a specific rate is set for a CF that shows up in "describe table" but I couldn't find out how to tell what the default was.  I didn't inspect the source.
> 
>  
>> Is there any way to predict how much memory the bloom filters will consume if the size of the row keys, number or rows is known, and fp chance is known?
> 
> See o.a.c.utils.BloomFilter.getFilter() in the code 
> This http://hur.st/bloomfilter appears to give similar results. 
> 
> 
> 
> 
> Ahh, very helpful.  This indicates that 714MB would be used for the bloom filter for that one CF.
> 
> JMX / cfstats reports "Bloom Filter Space Used" but the MBean method name (getBloomFilterDiskSpaceUsed) indicates this is the on-disk space. If on-disk and in-memory space used is similar then summing up all the "Bloom Filter Space Used" says they're currently consuming 1-2 GB of the heap which is substantial.
> 
> If a CF is rarely read is it safe to set bloom_filter_fp_chance to 1.0?  It just means more trips to SSTable indexes for a read correct?  Trade RAM for time (disk I/O).
> 
> -Bryan
> 


Re: constant CMS GC using CPU time

Posted by "B. Todd Burruss" <bt...@gmail.com>.
Regarding memory usage after a repair ... Are the merkle trees kept around?
On Oct 23, 2012 3:00 PM, "Bryan Talbot" <bt...@aeriagames.com> wrote:

> On Mon, Oct 22, 2012 at 6:05 PM, aaron morton <aa...@thelastpickle.com>wrote:
>
>> The GC was on-going even when the nodes were not compacting or running a
>> heavy application load -- even when the main app was paused constant the GC
>> continued.
>>
>> If you restart a node is the onset of GC activity correlated to some
>> event?
>>
>
> Yes and no.  When the nodes were generally under the
> .75 occupancy threshold a weekly "repair -pr" job would cause them to go
> over the threshold and then stay there even after the repair had completed
> and there were no ongoing compactions.  It acts as though at least some
> substantial amount of memory used during repair was never dereferenced once
> the repair was complete.
>
> Once one CF in particular grew larger the constant GC would start up
> pretty soon (less than 90 minutes) after a node restart even without a
> repair.
>
>
>
>
>>
>>
>> As a test we dropped the largest CF and the memory
>> usage immediately dropped to acceptable levels and the constant GC stopped.
>>  So it's definitely related to data load.  memtable size is 1 GB, row cache
>> is disabled and key cache is small (default).
>>
>> How many keys did the CF have per node?
>> I dismissed the memory used to  hold bloom filters and index sampling.
>> That memory is not considered part of the memtable size, and will end up in
>> the tenured heap. It is generally only a problem with very large key counts
>> per node.
>>
>>
> I've changed the app to retain less data for that CF but I think that it
> was about 400M rows per node.  Row keys are a TimeUUID.  All of the rows
> are write-once, never updated, and rarely read.  There are no secondary
> indexes for this particular CF.
>
>
>
>
>>  They were 2+ GB (as reported by nodetool cfstats anyway).  It looks like
>> the default bloom_filter_fp_chance defaults to 0.0
>>
>> The default should be 0.000744.
>>
>> If the chance is zero or null this code should run when a new SSTable is
>> written
>>   // paranoia -- we've had bugs in the thrift <-> avro <-> CfDef dance
>> before, let's not let that break things
>>                 logger.error("Bloom filter FP chance of zero isn't
>> supposed to happen");
>>
>> Were the CF's migrated from an old version ?
>>
>>
> Yes, the CF were created in 1.0.9, then migrated to 1.0.11 and finally to
> 1.1.5 with a "upgradesstables" run at each upgrade along the way.
>
> I could not find a way to view the current bloom_filter_fp_chance settings
> when they are at a default value.  JMX reports the actual fp rate and if a
> specific rate is set for a CF that shows up in "describe table" but I
> couldn't find out how to tell what the default was.  I didn't inspect the
> source.
>
>
>
>> Is there any way to predict how much memory the bloom filters will
>> consume if the size of the row keys, number or rows is known, and fp chance
>> is known?
>>
>>
>> See o.a.c.utils.BloomFilter.getFilter() in the code
>> This http://hur.st/bloomfilter appears to give similar results.
>>
>>
>>
>
> Ahh, very helpful.  This indicates that 714MB would be used for the bloom
> filter for that one CF.
>
> JMX / cfstats reports "Bloom Filter Space Used" but the MBean method name
> (getBloomFilterDiskSpaceUsed) indicates this is the on-disk space. If
> on-disk and in-memory space used is similar then summing up all the "Bloom
> Filter Space Used" says they're currently consuming 1-2 GB of the heap
> which is substantial.
>
> If a CF is rarely read is it safe to set bloom_filter_fp_chance to 1.0?
>  It just means more trips to SSTable indexes for a read correct?  Trade RAM
> for time (disk I/O).
>
> -Bryan
>
>

Re: constant CMS GC using CPU time

Posted by Bryan Talbot <bt...@aeriagames.com>.
On Mon, Oct 22, 2012 at 6:05 PM, aaron morton <aa...@thelastpickle.com>wrote:

> The GC was on-going even when the nodes were not compacting or running a
> heavy application load -- even when the main app was paused constant the GC
> continued.
>
> If you restart a node is the onset of GC activity correlated to some event?
>

Yes and no.  When the nodes were generally under the
.75 occupancy threshold a weekly "repair -pr" job would cause them to go
over the threshold and then stay there even after the repair had completed
and there were no ongoing compactions.  It acts as though at least some
substantial amount of memory used during repair was never dereferenced once
the repair was complete.

Once one CF in particular grew larger the constant GC would start up pretty
soon (less than 90 minutes) after a node restart even without a repair.




>
>
> As a test we dropped the largest CF and the memory
> usage immediately dropped to acceptable levels and the constant GC stopped.
>  So it's definitely related to data load.  memtable size is 1 GB, row cache
> is disabled and key cache is small (default).
>
> How many keys did the CF have per node?
> I dismissed the memory used to  hold bloom filters and index sampling.
> That memory is not considered part of the memtable size, and will end up in
> the tenured heap. It is generally only a problem with very large key counts
> per node.
>
>
I've changed the app to retain less data for that CF but I think that it
was about 400M rows per node.  Row keys are a TimeUUID.  All of the rows
are write-once, never updated, and rarely read.  There are no secondary
indexes for this particular CF.




>  They were 2+ GB (as reported by nodetool cfstats anyway).  It looks like
> the default bloom_filter_fp_chance defaults to 0.0
>
> The default should be 0.000744.
>
> If the chance is zero or null this code should run when a new SSTable is
> written
>   // paranoia -- we've had bugs in the thrift <-> avro <-> CfDef dance
> before, let's not let that break things
>                 logger.error("Bloom filter FP chance of zero isn't
> supposed to happen");
>
> Were the CF's migrated from an old version ?
>
>
Yes, the CF were created in 1.0.9, then migrated to 1.0.11 and finally to
1.1.5 with a "upgradesstables" run at each upgrade along the way.

I could not find a way to view the current bloom_filter_fp_chance settings
when they are at the default value.  JMX reports the actual fp rate, and if a
specific rate is set for a CF it shows up in "describe table", but I
couldn't find out how to tell what the default was.  I didn't inspect the
source.



> Is there any way to predict how much memory the bloom filters will consume
> if the size of the row keys, number or rows is known, and fp chance is
> known?
>
>
> See o.a.c.utils.BloomFilter.getFilter() in the code
> This http://hur.st/bloomfilter appears to give similar results.
>
>
>

Ahh, very helpful.  This indicates that 714MB would be used for the bloom
filter for that one CF.

JMX / cfstats reports "Bloom Filter Space Used", but the MBean method name
(getBloomFilterDiskSpaceUsed) indicates this is the on-disk space. If the
on-disk and in-memory space used are similar, then summing up all the "Bloom
Filter Space Used" values says they're currently consuming 1-2 GB of the heap,
which is substantial.
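A quick way to get that sum, assuming the 1.1 cfstats output format with lines like "Bloom Filter Space Used: <bytes>":

nodetool cfstats | awk '/Bloom Filter Space Used/ { sum += $NF } END { printf "%.1f MB\n", sum / 1024 / 1024 }'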

If a CF is rarely read, is it safe to set bloom_filter_fp_chance to 1.0?  It
just means more trips to the SSTable indexes for a read, correct?  Trade RAM
for time (disk I/O).

-Bryan

Re: constant CMS GC using CPU time

Posted by Bryan Talbot <bt...@aeriagames.com>.
These GC settings are the default (recommended?) settings from
cassandra-env.  I added the UseCompressedOops.

-Bryan


On Mon, Oct 22, 2012 at 6:15 PM, Will @ SOHO <wi...@voodoolunchbox.com>wrote:

>  On 10/22/2012 09:05 PM, aaron morton wrote:
>
>  # GC tuning options
> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
>  JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
> JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"
>
>  You are too far behind the reference JVM's. Parallel GC is the preferred
> and highest performing form in the current Security Baseline version of the
> JVM's.
>



-- 
Bryan Talbot
Architect / Platform team lead, Aeria Games and Entertainment
Silicon Valley | Berlin | Tokyo | Sao Paulo

Re: constant CMS GC using CPU time

Posted by "Will @ SOHO" <wi...@voodoolunchbox.com>.
On 10/22/2012 09:05 PM, aaron morton wrote:
> # GC tuning options
> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
> JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"
>
You are too far behind the reference JVMs. Parallel GC is the preferred 
and highest-performing collector in the current Security Baseline version of 
the JVMs.

Re: constant CMS GC using CPU time

Posted by aaron morton <aa...@thelastpickle.com>.
> The GC was on-going even when the nodes were not compacting or running a heavy application load -- even when the main app was paused constant the GC continued.
If you restart a node, is the onset of GC activity correlated to some event?
 
> As a test we dropped the largest CF and the memory usage immediately dropped to acceptable levels and the constant GC stopped.  So it's definitely related to data load.  memtable size is 1 GB, row cache is disabled and key cache is small (default).
How many keys did the CF have per node? 
I dismissed the memory used to  hold bloom filters and index sampling. That memory is not considered part of the memtable size, and will end up in the tenured heap. It is generally only a problem with very large key counts per node. 

>  They were 2+ GB (as reported by nodetool cfstats anyway).  It looks like the default bloom_filter_fp_chance defaults to 0.0 
The default should be 0.000744.

If the chance is zero or null this code should run when a new SSTable is written 
  // paranoia -- we've had bugs in the thrift <-> avro <-> CfDef dance before, let's not let that break things
                logger.error("Bloom filter FP chance of zero isn't supposed to happen");

Were the CFs migrated from an old version?

> Is there any way to predict how much memory the bloom filters will consume if the size of the row keys, number or rows is known, and fp chance is known?

See o.a.c.utils.BloomFilter.getFilter() in the code 
This http://hur.st/bloomfilter appears to give similar results. 
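For a rough prediction without reading the code, the standard Bloom filter sizing formula gives the bit count from the key count n and the false-positive chance p:

    m = -n * ln(p) / (ln 2)^2   bits

For n = 400 million keys and p = 0.000744 that works out to about 6.0e9 bits, i.e. roughly 715 MB, which is in line with the ~714 MB figure discussed elsewhere in this thread. Cassandra's implementation rounds its bucket counts, so treat this as an estimate rather than an exact number.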

Cheers
 

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 23/10/2012, at 4:38 AM, Bryan Talbot <bt...@aeriagames.com> wrote:

> The memory usage was correlated with the size of the data set.  The nodes were a bit unbalanced which is normal due to variations in compactions.  The nodes with the most data used the most memory.  All nodes are affected eventually not just one.  The GC was on-going even when the nodes were not compacting or running a heavy application load -- even when the main app was paused constant the GC continued.
> 
> As a test we dropped the largest CF and the memory usage immediately dropped to acceptable levels and the constant GC stopped.  So it's definitely related to data load.  memtable size is 1 GB, row cache is disabled and key cache is small (default).
> 
> I believe one culprit turned out to be the bloom filters.  They were 2+ GB (as reported by nodetool cfstats anyway).  It looks like the default bloom_filter_fp_chance defaults to 0.0 even though guides recommend 0.10 as the minimum value.  Raising that to 0.20 for some write-mostly CF reduced memory used by 1GB or so.
> 
> Is there any way to predict how much memory the bloom filters will consume if the size of the row keys, number or rows is known, and fp chance is known?
> 
> -Bryan
> 
> 
> 
> On Mon, Oct 22, 2012 at 12:25 AM, aaron morton <aa...@thelastpickle.com> wrote:
> If you are using the default settings I would try to correlate the GC activity with some application activity before tweaking.
> 
> If this is happening on one machine out of 4 ensure that client load is distributed evenly. 
> 
> See if the raise in GC activity us related to Compaction, repair or an increase in throughput. OpsCentre or some other monitoring can help with the last one. Your mention of TTL makes me think compaction may be doing a bit of work churning through rows. 
>   
> Some things I've done in the past before looking at heap settings:
> * reduce compaction_throughput to reduce the memory churn
> * reduce in_memory_compaction_limit 
> * if needed reduce concurrent_compactors
> 
>> Currently it seems like the memory used scales with the amount of bytes stored and not with how busy the server actually is.  That's not such a good thing.
> The memtable_total_space_in_mb in yaml tells C* how much memory to devote to the memtables. That with the global row cache setting says how much memory will be used with regard to "storing" data and it will not increase inline with the static data load.
> 
> Now days GC issues are typically due to more dynamic forces, like compaction, repair and throughput. 
>  
> Hope that helps. 
> 
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 20/10/2012, at 6:59 AM, Bryan Talbot <bt...@aeriagames.com> wrote:
> 
>> ok, let me try asking the question a different way ...
>> 
>> How does cassandra use memory and how can I plan how much is needed?  I have a 1 GB memtable and 5 GB total heap and that's still not enough even though the number of concurrent connections and garbage generation rate is fairly low.
>> 
>> If I were using mysql or oracle, I could compute how much memory could be used by N concurrent connections, how much is allocated for caching, temp spaces, etc.  How can I do this for cassandra?  Currently it seems like the memory used scales with the amount of bytes stored and not with how busy the server actually is.  That's not such a good thing.
>> 
>> -Bryan
>> 
>> 
>> 
>> On Thu, Oct 18, 2012 at 11:06 AM, Bryan Talbot <bt...@aeriagames.com> wrote:
>> In a 4 node cluster running Cassandra 1.1.5 with sun jvm 1.6.0_29-b11 (64-bit), the nodes are often getting "stuck" in state where CMS collections of the old space are constantly running.  
>> 
>> The JVM configuration is using the standard settings in cassandra-env -- relevant settings are included below.  The max heap is currently set to 5 GB with 800MB for new size.  I don't believe that the cluster is overly busy and seems to be performing well enough other than this issue.  When nodes get into this state they never seem to leave it (by freeing up old space memory) without restarting cassandra.  They typically enter this state while running "nodetool repair -pr" but once they start doing this, restarting them only "fixes" it for a couple of hours.
>> 
>> Compactions are completing and are generally not queued up.  All CF are using STCS.  The busiest CF consumes about 100GB of space on disk, is write heavy, and all columns have a TTL of 3 days.  Overall, there are 41 CF including those used for system keyspace and secondary indexes.  The number of SSTables per node currently varies from 185-212.
>> 
>> Other than frequent log warnings about "GCInspector  - Heap is 0.xxx full..." and "StorageService  - Flushing CFS(...) to relieve memory pressure" there are no other log entries to indicate there is a problem.
>> 
>> Does the memory needed vary depending on the amount of data stored?  If so, how can I predict how much jvm space is needed?  I don't want to make the heap too large as that's bad too.  Maybe there's a memory leak related to compaction that doesn't allow meta-data to be purged?
>> 
>> 
>> -Bryan
>> 
>> 
>> 12 GB of RAM in host with ~6 GB used by java and ~6 GB for OS and buffer cache.
>> $> free -m
>>              total       used       free     shared    buffers     cached
>> Mem:         12001      11870        131          0          4       5778
>> -/+ buffers/cache:       6087       5914
>> Swap:            0          0          0
>> 
>> 
>> jvm settings in cassandra-env
>> MAX_HEAP_SIZE="5G"
>> HEAP_NEWSIZE="800M"
>> 
>> # GC tuning options
>> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC" 
>> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC" 
>> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled" 
>> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8" 
>> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
>> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
>> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
>> JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"
>> 
>> 
>> jstat shows about 12 full collections per minute with old heap usage constantly over 75% so CMS is always over the CMSInitiatingOccupancyFraction threshold.
>> 
>> $> jstat -gcutil -t 22917 5000 4
>> Timestamp         S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT   
>>        132063.0  34.70   0.00  26.03  82.29  59.88  21580  506.887 17523 3078.941 3585.829
>>        132068.0  34.70   0.00  50.02  81.23  59.88  21580  506.887 17524 3079.220 3586.107
>>        132073.1   0.00  24.92  46.87  81.41  59.88  21581  506.932 17525 3079.583 3586.515
>>        132078.1   0.00  24.92  64.71  81.40  59.88  21581  506.932 17527 3079.853 3586.785
>> 
>> 
>> Other hosts not currently experiencing the high CPU load have a heap less than .75 full.
>> 
>> $> jstat -gcutil -t 6063 5000 4
>> Timestamp         S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT
>>        520731.6   0.00  12.70  36.37  71.33  59.26  46453 1688.809 14785 2130.779 3819.588
>>        520736.5   0.00  12.70  53.25  71.33  59.26  46453 1688.809 14785 2130.779 3819.588
>>        520741.5   0.00  12.70  68.92  71.33  59.26  46453 1688.809 14785 2130.779 3819.588
>>        520746.5   0.00  12.70  83.11  71.33  59.26  46453 1688.809 14785 2130.779 3819.588
>> 
>> 
>> 
>> 
>> 
> 
> 
> 
> 
> -- 
> Bryan Talbot
> Architect / Platform team lead, Aeria Games and Entertainment
> Silicon Valley | Berlin | Tokyo | Sao Paulo
> 
> 


Re: constant CMS GC using CPU time

Posted by Bryan Talbot <bt...@aeriagames.com>.
The memory usage was correlated with the size of the data set.  The nodes
were a bit unbalanced, which is normal due to variations in compactions.
 The nodes with the most data used the most memory.  All nodes are affected
eventually, not just one.  The GC was ongoing even when the nodes were not
compacting or running a heavy application load -- even when the main app
was paused, the constant GC continued.

As a test we dropped the largest CF and the memory
usage immediately dropped to acceptable levels and the constant GC stopped.
 So it's definitely related to data load.  memtable size is 1 GB, row cache
is disabled and key cache is small (default).

I believe one culprit turned out to be the bloom filters.  They were 2+ GB
(as reported by nodetool cfstats anyway).  It looks like
bloom_filter_fp_chance defaults to 0.0 even though guides recommend 0.10 as
the minimum value.  Raising it to 0.20 for some write-mostly CF reduced
memory used by 1GB or so.

Is there any way to predict how much memory the bloom filters will consume
if the size of the row keys, number of rows, and fp chance are known?

-Bryan



On Mon, Oct 22, 2012 at 12:25 AM, aaron morton <aa...@thelastpickle.com>wrote:

> If you are using the default settings I would try to correlate the GC
> activity with some application activity before tweaking.
>
> If this is happening on one machine out of 4 ensure that client load is
> distributed evenly.
>
> See if the raise in GC activity us related to Compaction, repair or an
> increase in throughput. OpsCentre or some other monitoring can help with
> the last one. Your mention of TTL makes me think compaction may be doing a
> bit of work churning through rows.
>
> Some things I've done in the past before looking at heap settings:
> * reduce compaction_throughput to reduce the memory churn
> * reduce in_memory_compaction_limit
> * if needed reduce concurrent_compactors
>
> Currently it seems like the memory used scales with the amount of bytes
> stored and not with how busy the server actually is.  That's not such a
> good thing.
>
> The memtable_total_space_in_mb in yaml tells C* how much memory to devote
> to the memtables. That with the global row cache setting says how much
> memory will be used with regard to "storing" data and it will not increase
> inline with the static data load.
>
> Now days GC issues are typically due to more dynamic forces, like
> compaction, repair and throughput.
>
> Hope that helps.
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 20/10/2012, at 6:59 AM, Bryan Talbot <bt...@aeriagames.com> wrote:
>
> ok, let me try asking the question a different way ...
>
> How does cassandra use memory and how can I plan how much is needed?  I
> have a 1 GB memtable and 5 GB total heap and that's still not enough even
> though the number of concurrent connections and garbage generation rate is
> fairly low.
>
> If I were using mysql or oracle, I could compute how much memory could be
> used by N concurrent connections, how much is allocated for caching, temp
> spaces, etc.  How can I do this for cassandra?  Currently it seems like the
> memory used scales with the amount of bytes stored and not with how busy
> the server actually is.  That's not such a good thing.
>
> -Bryan
>
>
>
> On Thu, Oct 18, 2012 at 11:06 AM, Bryan Talbot <bt...@aeriagames.com>wrote:
>
>> In a 4 node cluster running Cassandra 1.1.5 with sun jvm 1.6.0_29-b11
>> (64-bit), the nodes are often getting "stuck" in state where CMS
>> collections of the old space are constantly running.
>>
>> The JVM configuration is using the standard settings in cassandra-env --
>> relevant settings are included below.  The max heap is currently set to 5
>> GB with 800MB for new size.  I don't believe that the cluster is overly
>> busy and seems to be performing well enough other than this issue.  When
>> nodes get into this state they never seem to leave it (by freeing up old
>> space memory) without restarting cassandra.  They typically enter this
>> state while running "nodetool repair -pr" but once they start doing this,
>> restarting them only "fixes" it for a couple of hours.
>>
>> Compactions are completing and are generally not queued up.  All CF are
>> using STCS.  The busiest CF consumes about 100GB of space on disk, is write
>> heavy, and all columns have a TTL of 3 days.  Overall, there are 41 CF
>> including those used for system keyspace and secondary indexes.  The number
>> of SSTables per node currently varies from 185-212.
>>
>> Other than frequent log warnings about "GCInspector  - Heap is 0.xxx
>> full..." and "StorageService  - Flushing CFS(...) to relieve memory
>> pressure" there are no other log entries to indicate there is a problem.
>>
>> Does the memory needed vary depending on the amount of data stored?  If
>> so, how can I predict how much jvm space is needed?  I don't want to make
>> the heap too large as that's bad too.  Maybe there's a memory leak related
>> to compaction that doesn't allow meta-data to be purged?
>>
>>
>> -Bryan
>>
>>
>> 12 GB of RAM in host with ~6 GB used by java and ~6 GB for OS and buffer
>> cache.
>> $> free -m
>>              total       used       free     shared    buffers     cached
>> Mem:         12001      11870        131          0          4       5778
>> -/+ buffers/cache:       6087       5914
>> Swap:            0          0          0
>>
>>
>> jvm settings in cassandra-env
>> MAX_HEAP_SIZE="5G"
>> HEAP_NEWSIZE="800M"
>>
>> # GC tuning options
>> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
>> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
>> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
>> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
>> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
>> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
>> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
>> JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"
>>
>>
>> jstat shows about 12 full collections per minute with old heap usage
>> constantly over 75% so CMS is always over the
>> CMSInitiatingOccupancyFraction threshold.
>>
>> $> jstat -gcutil -t 22917 5000 4
>> Timestamp         S0     S1     E      O      P     YGC     YGCT    FGC
>>  FGCT     GCT
>>        132063.0  34.70   0.00  26.03  82.29  59.88  21580  506.887 17523
>> 3078.941 3585.829
>>        132068.0  34.70   0.00  50.02  81.23  59.88  21580  506.887 17524
>> 3079.220 3586.107
>>        132073.1   0.00  24.92  46.87  81.41  59.88  21581  506.932 17525
>> 3079.583 3586.515
>>        132078.1   0.00  24.92  64.71  81.40  59.88  21581  506.932 17527
>> 3079.853 3586.785
>>
>>
>> Other hosts not currently experiencing the high CPU load have a heap less
>> than .75 full.
>>
>> $> jstat -gcutil -t 6063 5000 4
>> Timestamp         S0     S1     E      O      P     YGC     YGCT    FGC
>>  FGCT     GCT
>>        520731.6   0.00  12.70  36.37  71.33  59.26  46453 1688.809 14785
>> 2130.779 3819.588
>>        520736.5   0.00  12.70  53.25  71.33  59.26  46453 1688.809 14785
>> 2130.779 3819.588
>>        520741.5   0.00  12.70  68.92  71.33  59.26  46453 1688.809 14785
>> 2130.779 3819.588
>>        520746.5   0.00  12.70  83.11  71.33  59.26  46453 1688.809 14785
>> 2130.779 3819.588
>>
>>
>>
>>
>
>
>


-- 
Bryan Talbot
Architect / Platform team lead, Aeria Games and Entertainment
Silicon Valley | Berlin | Tokyo | Sao Paulo

Re: constant CMS GC using CPU time

Posted by aaron morton <aa...@thelastpickle.com>.
If you are using the default settings, I would try to correlate the GC activity with some application activity before tweaking.

If this is happening on one machine out of 4, ensure that client load is distributed evenly. 

See if the rise in GC activity is related to compaction, repair or an increase in throughput. OpsCenter or some other monitoring can help with the last one. Your mention of TTL makes me think compaction may be doing a bit of work churning through rows. 
  
Some things I've done in the past before looking at heap settings:
* reduce compaction_throughput to reduce the memory churn
* reduce in_memory_compaction_limit 
* if needed reduce concurrent_compactors

> Currently it seems like the memory used scales with the amount of bytes stored and not with how busy the server actually is.  That's not such a good thing.
The memtable_total_space_in_mb setting in the yaml tells C* how much memory to devote to the memtables. That, together with the global row cache setting, determines how much memory will be used for "storing" data, and it will not increase in line with the static data load.
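For concreteness, both of those live in cassandra.yaml on 1.1.x; the sketch below just mirrors what this thread describes (1 GB of memtable space, row cache disabled, small default key cache) rather than being a recommendation:

memtable_total_space_in_mb: 1024
row_cache_size_in_mb: 0          # row cache disabled
key_cache_size_in_mb:            # left empty = the default, which stays small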

Nowadays GC issues are typically due to more dynamic forces, like compaction, repair and throughput. 
 
Hope that helps. 

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 20/10/2012, at 6:59 AM, Bryan Talbot <bt...@aeriagames.com> wrote:

> ok, let me try asking the question a different way ...
> 
> How does cassandra use memory and how can I plan how much is needed?  I have a 1 GB memtable and 5 GB total heap and that's still not enough even though the number of concurrent connections and garbage generation rate is fairly low.
> 
> If I were using mysql or oracle, I could compute how much memory could be used by N concurrent connections, how much is allocated for caching, temp spaces, etc.  How can I do this for cassandra?  Currently it seems like the memory used scales with the amount of bytes stored and not with how busy the server actually is.  That's not such a good thing.
> 
> -Bryan
> 
> 
> 
> On Thu, Oct 18, 2012 at 11:06 AM, Bryan Talbot <bt...@aeriagames.com> wrote:
> In a 4 node cluster running Cassandra 1.1.5 with sun jvm 1.6.0_29-b11 (64-bit), the nodes are often getting "stuck" in state where CMS collections of the old space are constantly running.  
> 
> The JVM configuration is using the standard settings in cassandra-env -- relevant settings are included below.  The max heap is currently set to 5 GB with 800MB for new size.  I don't believe that the cluster is overly busy and seems to be performing well enough other than this issue.  When nodes get into this state they never seem to leave it (by freeing up old space memory) without restarting cassandra.  They typically enter this state while running "nodetool repair -pr" but once they start doing this, restarting them only "fixes" it for a couple of hours.
> 
> Compactions are completing and are generally not queued up.  All CF are using STCS.  The busiest CF consumes about 100GB of space on disk, is write heavy, and all columns have a TTL of 3 days.  Overall, there are 41 CF including those used for system keyspace and secondary indexes.  The number of SSTables per node currently varies from 185-212.
> 
> Other than frequent log warnings about "GCInspector  - Heap is 0.xxx full..." and "StorageService  - Flushing CFS(...) to relieve memory pressure" there are no other log entries to indicate there is a problem.
> 
> Does the memory needed vary depending on the amount of data stored?  If so, how can I predict how much jvm space is needed?  I don't want to make the heap too large as that's bad too.  Maybe there's a memory leak related to compaction that doesn't allow meta-data to be purged?
> 
> 
> -Bryan
> 
> 
> 12 GB of RAM in host with ~6 GB used by java and ~6 GB for OS and buffer cache.
> $> free -m
>              total       used       free     shared    buffers     cached
> Mem:         12001      11870        131          0          4       5778
> -/+ buffers/cache:       6087       5914
> Swap:            0          0          0
> 
> 
> jvm settings in cassandra-env
> MAX_HEAP_SIZE="5G"
> HEAP_NEWSIZE="800M"
> 
> # GC tuning options
> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC" 
> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC" 
> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled" 
> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8" 
> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
> JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"
> 
> 
> jstat shows about 12 full collections per minute with old heap usage constantly over 75% so CMS is always over the CMSInitiatingOccupancyFraction threshold.
> 
> $> jstat -gcutil -t 22917 5000 4
> Timestamp         S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT   
>        132063.0  34.70   0.00  26.03  82.29  59.88  21580  506.887 17523 3078.941 3585.829
>        132068.0  34.70   0.00  50.02  81.23  59.88  21580  506.887 17524 3079.220 3586.107
>        132073.1   0.00  24.92  46.87  81.41  59.88  21581  506.932 17525 3079.583 3586.515
>        132078.1   0.00  24.92  64.71  81.40  59.88  21581  506.932 17527 3079.853 3586.785
> 
> 
> Other hosts not currently experiencing the high CPU load have a heap less than .75 full.
> 
> $> jstat -gcutil -t 6063 5000 4
> Timestamp         S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT
>        520731.6   0.00  12.70  36.37  71.33  59.26  46453 1688.809 14785 2130.779 3819.588
>        520736.5   0.00  12.70  53.25  71.33  59.26  46453 1688.809 14785 2130.779 3819.588
>        520741.5   0.00  12.70  68.92  71.33  59.26  46453 1688.809 14785 2130.779 3819.588
>        520746.5   0.00  12.70  83.11  71.33  59.26  46453 1688.809 14785 2130.779 3819.588
> 
> 
> 
> 
> 


Re: constant CMS GC using CPU time

Posted by Bryan Talbot <bt...@aeriagames.com>.
ok, let me try asking the question a different way ...

How does cassandra use memory and how can I plan how much is needed?  I
have a 1 GB memtable and 5 GB total heap and that's still not enough even
though the number of concurrent connections and garbage generation rate is
fairly low.

If I were using mysql or oracle, I could compute how much memory could be
used by N concurrent connections, how much is allocated for caching, temp
spaces, etc.  How can I do this for cassandra?  Currently it seems like the
memory used scales with the amount of bytes stored and not with how busy
the server actually is.  That's not such a good thing.

-Bryan



On Thu, Oct 18, 2012 at 11:06 AM, Bryan Talbot <bt...@aeriagames.com>wrote:

> In a 4 node cluster running Cassandra 1.1.5 with sun jvm 1.6.0_29-b11
> (64-bit), the nodes are often getting "stuck" in state where CMS
> collections of the old space are constantly running.
>
> The JVM configuration is using the standard settings in cassandra-env --
> relevant settings are included below.  The max heap is currently set to 5
> GB with 800MB for new size.  I don't believe that the cluster is overly
> busy and seems to be performing well enough other than this issue.  When
> nodes get into this state they never seem to leave it (by freeing up old
> space memory) without restarting cassandra.  They typically enter this
> state while running "nodetool repair -pr" but once they start doing this,
> restarting them only "fixes" it for a couple of hours.
>
> Compactions are completing and are generally not queued up.  All CF are
> using STCS.  The busiest CF consumes about 100GB of space on disk, is write
> heavy, and all columns have a TTL of 3 days.  Overall, there are 41 CF
> including those used for system keyspace and secondary indexes.  The number
> of SSTables per node currently varies from 185-212.
>
> Other than frequent log warnings about "GCInspector  - Heap is 0.xxx
> full..." and "StorageService  - Flushing CFS(...) to relieve memory
> pressure" there are no other log entries to indicate there is a problem.
>
> Does the memory needed vary depending on the amount of data stored?  If
> so, how can I predict how much jvm space is needed?  I don't want to make
> the heap too large as that's bad too.  Maybe there's a memory leak related
> to compaction that doesn't allow meta-data to be purged?
>
>
> -Bryan
>
>
> 12 GB of RAM in host with ~6 GB used by java and ~6 GB for OS and buffer
> cache.
> $> free -m
>              total       used       free     shared    buffers     cached
> Mem:         12001      11870        131          0          4       5778
> -/+ buffers/cache:       6087       5914
> Swap:            0          0          0
>
>
> jvm settings in cassandra-env
> MAX_HEAP_SIZE="5G"
> HEAP_NEWSIZE="800M"
>
> # GC tuning options
> JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
> JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
> JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
> JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
> JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
> JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
> JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
> JVM_OPTS="$JVM_OPTS -XX:+UseCompressedOops"
>
>
> jstat shows about 12 full collections per minute with old heap usage
> constantly over 75% so CMS is always over the
> CMSInitiatingOccupancyFraction threshold.
>
> $> jstat -gcutil -t 22917 5000 4
> Timestamp         S0     S1     E      O      P     YGC     YGCT    FGC
>  FGCT     GCT
>        132063.0  34.70   0.00  26.03  82.29  59.88  21580  506.887 17523
> 3078.941 3585.829
>        132068.0  34.70   0.00  50.02  81.23  59.88  21580  506.887 17524
> 3079.220 3586.107
>        132073.1   0.00  24.92  46.87  81.41  59.88  21581  506.932 17525
> 3079.583 3586.515
>        132078.1   0.00  24.92  64.71  81.40  59.88  21581  506.932 17527
> 3079.853 3586.785
>
>
> Other hosts not currently experiencing the high CPU load have a heap less
> than .75 full.
>
> $> jstat -gcutil -t 6063 5000 4
> Timestamp         S0     S1     E      O      P     YGC     YGCT    FGC
>  FGCT     GCT
>        520731.6   0.00  12.70  36.37  71.33  59.26  46453 1688.809 14785
> 2130.779 3819.588
>        520736.5   0.00  12.70  53.25  71.33  59.26  46453 1688.809 14785
> 2130.779 3819.588
>        520741.5   0.00  12.70  68.92  71.33  59.26  46453 1688.809 14785
> 2130.779 3819.588
>        520746.5   0.00  12.70  83.11  71.33  59.26  46453 1688.809 14785
> 2130.779 3819.588
>
>
>
>

Re: constant CMS GC using CPU time

Posted by Radim Kolar <hs...@filez.com>.
On 18.10.2012 20:06, Bryan Talbot wrote:
> In a 4 node cluster running Cassandra 1.1.5 with sun jvm 1.6.0_29-b11 
> (64-bit), the nodes are often getting "stuck" in state where CMS 
> collections of the old space are constantly running.
You need more Java heap memory.
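For completeness, the heap is sized in cassandra-env.sh; a sketch of what "more heap" looks like (sizes are only examples, and as noted earlier in the thread a bigger heap brings its own GC-pause downsides, so the bloom filter and index sampling changes are usually the better first move):

MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="800M"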