Posted to user@cassandra.apache.org by Dan Hendry <da...@gmail.com> on 2011/01/17 18:03:12 UTC

Cassandra GC Settings

I am having some reliability problems in my Cassandra cluster which I am
almost certain are due to GC. I was about to start delving into the guts of
the problem by turning on GC logging, but I have never done any serious Java
GC tuning before (time to learn, I guess). As a first step, however, I was
hoping to gain some insight into the GC settings shipped with Cassandra 0.7.
I realize it's a pretty complicated problem, but I was specifically interested
in knowing about:

-XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1
-XX:CMSInitiatingOccupancyFraction=75

Why are these set the way they are? What specifically was used to determine
these settings? Was it purely experimental, or was there a specific,
undesirable behavior that adding these settings corrected? From my various
web wanderings, I read the survivor ratio and tenuring threshold settings as
"Cassandra creates mostly long lived objects, with objects being promoted
very quickly from the young generation to the old generation". Furthermore,
the CMSInitiatingOccupancyFraction of 75 (from a JVM default of 68) means
"start gc in the old generation later", presumably to allow Cassandra to use
more of the old generation heap without needlessly trying to free up used
space (?). Please correct me if I am misinterpreting these settings.

One of the issues I have been having is extreme node instability when
running a major compaction. After 20-30 seconds of operation, the node
spends 30+ seconds in (what I believe to be) GC. Now I have tried halving
all memtable thresholds to reduce overall heap memory usage but that has not
seemed to help with the instability. After one of these blips, I often see
log entries as follows:

 INFO [ScheduledTasks:1] 2011-01-17 10:41:21,961 GCInspector.java (line 133)
GC for ParNew: 215 ms, 45084168 reclaimed leaving 11068700368 used; max is
12783583232
 INFO [ScheduledTasks:1] 2011-01-17 10:41:28,033 GCInspector.java (line 133)
GC for ParNew: 234 ms, 40401120 reclaimed leaving 12144504848 used; max is
12783583232
 INFO [ScheduledTasks:1] 2011-01-17 10:42:15,911 GCInspector.java (line 133)
GC for ConcurrentMarkSweep: 45828 ms, 3350764696 reclaimed leaving
9224048472 used; max is 12783583232

Given that the 3 GB of garbage collected via ConcurrentMarkSweep was
generated in < 30 seconds, one of the first things I was going to try was
increasing the survivor ratio (to 16) and increasing the MaxTenuringThreshold
(to 5) to try to keep more objects in the young generation, where they get
cleaned up faster. As a more general approach to solving my problem, I was
also going to reduce the CMSInitiatingOccupancyFraction to 65. Does this
seem reasonable? Obviously, the best answer is to just try it, but I hesitate
to start playing with settings when I have only the vaguest notions of what they
do and little concept of why they are there in the first place.
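
Concretely, the change I had in mind would be something along these lines in
conf/cassandra-env.sh (just a sketch of the values described above, nothing I
have tested yet):

-XX:SurvivorRatio=16
-XX:MaxTenuringThreshold=5
-XX:CMSInitiatingOccupancyFraction=65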

Thanks for any help

Re: Cassandra GC Settings

Posted by SriSatish Ambati <sr...@gmail.com>.
Dan,

Please kindly attach your:
1) java -version
2) full commandline settings, heap sizes.
3) gc log from one of the nodes via:

-XX:+PrintTenuringDistribution \
-XX:+PrintGCDetails \
-XX:+PrintGCTimeStamps \
-Xloggc:/var/log/cassandra/gc.log \

4) number of cores on your system. How busy is the system?
5) Any workload specifics for your particular usecase?
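
Regarding (3): assuming the stock conf/cassandra-env.sh layout, those logging
options can simply be appended to JVM_OPTS, for example:

JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCTimeStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"

-XX:+PrintGCApplicationStoppedTime is an extra flag not listed above, but it
is worth having because it reports the actual stop-the-world time at each
safepoint rather than just collection times.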

While some of this is workload specific:

If you are seeing too frequent & very long CMS collection times:

C1) Upping MaxTenuringThreshold to 5/10/15 will reduce the frequent promotion
forced by the current setting of 1.

C2) Increasing the young generation size with -Xmn512m or -Xmn1g will help
induce more ParNew activity.

C3) If you have enough cores to handle multi-threaded ParNew, I'd also add
-XX:ParallelGCThreads=4 (or 8) depending on your situation.

Reducing CMSInitiatingOccupancyFraction (and other thresholds) will trigger
CMS at a lower occupancy (counteracting some of the measures above), but it
might help offset concurrent mode failures or promotion failures if you are
seeing them in the logs.

Anyway, that is a simplistic analysis: I'd try changes (C1), (C2), and (C3), and
then revisit further tuning as necessary.
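
As a rough sketch (the exact values depend on your heap size and workload),
(C1)-(C3) together would look something like this in conf/cassandra-env.sh:

JVM_OPTS="$JVM_OPTS -Xmn512M"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=5"
JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=8"
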
thanks,
Sri

Re: Cassandra GC Settings

Posted by Jonathan Ellis <jb...@gmail.com>.
On Mon, Jan 17, 2011 at 11:58 AM, Peter Schuller
<pe...@infidyne.com> wrote:
> 45 seconds is pretty significant even for a 12 gig heap

Note that you really need to uncomment the -XX:+PrintGC* arguments to
get a detailed GC log from the JVM before taking guesses at this; the
numbers GCInspector can get are NOT pause times. High numbers for CMS
are perfectly normal there, but in a healthy system they are
concurrent (the C in CMS), meaning Cassandra can keep running for
almost all of that time.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Cassandra GC Settings

Posted by Robert Coli <rc...@digg.com>.
On 1/17/11, Dan Hendry <da...@gmail.com> wrote:
> Since applying these settings, the one time I saw the same type of behavior
> as before, the following appeared in the GC log.
>
>    (concurrent mode failure): 9418431K->6267709K(11841536K), 26.4973750
> secs] 9777393K->6267709K(12290944K), [CMS Perm : 20477K->20443K(34188K)],
> 26.7595510 secs] [Times: user=31.75 sys=0.00, real=26.76 secs]

The symptoms described in both of your mails which mention
pathological cases sound like your heap may simply be too small for
your actual working set. Compaction triggers extra memory pressure and
you then OOM or the concurrent mark sweep fails and you thrash.

It is also worth noting that major compaction reduces the
effectiveness of various caches (internal and OS-level), and it is
somewhat likely that your node has internally backed-up threadpools
during and immediately after compaction. Some work has been done
recently to improve these characteristics, but I don't think those
changes are in the 0.7.0 release.

> INFO [ScheduledTasks:1] 2011-01-17 10:42:15,911 GCInspector.java (line 133) GC for ConcurrentMarkSweep: 45828 ms, 3350764696 reclaimed leaving 9224048472 used; max is 12783583232

As the amount of headroom available to the CMS collector decreases, it
tends to take longer and longer to reclaim less and less memory. ~46
seconds to recover ~3 GB (leaving ~9 GB used out of a ~12 GB max heap)
suggests that your working set has put you into this grey area where
it eventually works but sucks really badly. This is just before the
state where it permanently locks up the JVM and/or OOMs.

Have you sized your memtables and caches so that you have meaningful
heap headroom when your caches are full?
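
As a back-of-envelope illustration (these numbers are hypothetical, not taken
from your cluster): ten busy column families with 128 MB memtable thresholds
is ~1.3 GB of memtable data, and transiently closer to double that while
flushes are pending; add a couple of GB of key/row cache plus normal JVM
overhead and a 12 GB heap has much less compaction headroom than it appears
to.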

=Rob

Re: Cassandra GC Settings

Posted by Peter Schuller <pe...@infidyne.com>.
> Now, a full stop of the application was what I was seeing extensively before
> (100-200 times over the course of a major compaction as reported by
> gossipers on other nodes). I have also just noticed that the previous
> instability (ie application stops) correlated with the compaction of a few
> column families characterized by fairly fat rows (10 mb mean size, max sizes
> 150-200 mb, up to a million+ columns per row). My theory is that each row
> being compacted with the old settings was being promoted to the old
> generation, thereby running the heap out of space and causing a stop the
> world gc. With the new settings, rows being compacted typically remain in
> the young generation, allowing them to be cleaned up more quickly with less
> effort on the part of the garbage collector. Does this theory sound
> reasonable?

Sounds reasonable, I think. In addition to sizing the young gen, decreasing

   in_memory_compaction_limit_in_mb: 64

from its default of 64 might help here, I suppose.
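
For example (the exact value is just an illustration; the right number
depends on how fat your rows are):

   in_memory_compaction_limit_in_mb: 16

Rows bigger than this limit go through the slower two-pass compaction path
instead of being built up in memory, which trades compaction speed for less
heap pressure.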

-- 
/ Peter Schuller

Re: Cassandra GC Settings

Posted by SriSatish Ambati <sr...@gmail.com>.
Thanks, Dan:

Yes, -Xmn512M or -Xmn1G sizes the young generation explicitly and takes
adaptive resizing out of the picture. (If at all possible, send your gc log
over and we can analyze the promotion failure a little more closely.)
The low load implies that you are able to use the parallel threads
effectively.
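
As a rough sanity check of the new numbers: with -Xmn512M and SurvivorRatio=5
the young generation is split 5:1:1, so eden is about 5/7 of 512 MB (~366 MB)
and each survivor space is ~73 MB. With the default TargetSurvivorRatio of 50,
the desired survivor size works out to roughly 38 MB and the ParNew capacity
(eden plus one survivor) to roughly 449400K, which matches the "Desired
survivor size 38338560 bytes" and "449408K(449408K)" figures in your
promotion-failure excerpt, so the flags are clearly taking effect.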

cheers,
Sri

Re: Cassandra GC Settings

Posted by Dan Hendry <da...@gmail.com>.
Thanks for all the info, I think I have been able to sort out my issue. The
new settings I am using are:

-Xmn512M (Very important I think)
-XX:SurvivorRatio=5 (Not very important I think)
-XX:MaxTenuringThreshold=5
-XX:ParallelGCThreads=8
-XX:CMSInitiatingOccupancyFraction=75

Since applying these settings, the one time I saw the same type of behavior
as before, the following appeared in the GC log.

  Total time for which application threads were stopped: 0.6830080 seconds
  1368.201: [GC 1368.201: [ParNew (promotion failed)
  Desired survivor size 38338560 bytes, new threshold 1 (max 5)
  - age   1:   55799736 bytes,   55799736 total
  : 449408K->449408K(449408K), 0.2618690 secs]1368.463: [CMS1372.459:
[CMS-concurrent-mark: 7.930/9.109 secs] [Times: user=28.31 sys=0.66, real=9.11 secs]
   (concurrent mode failure): 9418431K->6267709K(11841536K), 26.4973750
secs] 9777393K->6267709K(12290944K), [CMS Perm : 20477K->20443K(34188K)],
26.7595510 secs] [Times: user=31.75 sys=0.00, real=26.76 secs]
  Total time for which application threads were stopped: 26.7617560 seconds

Now, a full stop of the application was what I was seeing extensively before
(100-200 times over the course of a major compaction as reported by
gossipers on other nodes). I have also just noticed that the previous
instability (ie application stops) correlated with the compaction of a few
column families characterized by fairly fat rows (10 mb mean size, max sizes
150-200 mb, up to a million+ columns per row). My theory is that each row
being compacted with the old settings was being promoted to the old
generation, thereby running the heap out of space and causing a stop the
world gc. With the new settings, rows being compacted typically remain in
the young generation, allowing them to be cleaned up more quickly with less
effort on the part of the garbage collector. Does this theory sound
reasonable?

Answering some of the other questions:

> disk bound or CPU bound during compaction?

... Neither (?). Iowait is 10-20%, disk utilization rarely jumps above 60%,
and CPU %idle is about 60%. I would have said that I was memory bound, but now
I think compaction is bounded by being single-threaded.

> are you sure you're not swapping a bit?

Only if JNA is not doing its job

> Number of cores on your system. How busy is the system?

8, load averages typically < 4, so not terribly busy I would say.

Re: Cassandra GC Settings

Posted by Peter Schuller <pe...@infidyne.com>.
> very quickly from the young generation to the old generation". Furthermore,
> the CMSInitiatingOccupancyFraction of 75 (from a JVM default of 68) means
> "start gc in the old generation later", presumably to allow Cassandra to use
> more of the old generation heap without needlessly trying to free up used
> space (?). Please correct me if I am misinterpreting these settings.

Note the use of -XX:+UseCMSInitiatingOccupancyOnly which causes the
JVM to always trigger on that occupancy fraction rather than only do
it for the first trigger (or something along those lines) and then
switch to heuristics. Presumably (though I don't specifically know the
history of this particular option being added) it is more important to
avoid doing full GCs at all than to super-optimally tweak the trigger
for maximum throughput.

The heuristics tend to cut it pretty close, and setting a conservative
fixed occupancy trigger probably greatly lessens the chance of falling
back to a full gc in production.
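
In other words, the two options are intended to be used as a pair, roughly:

-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly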

> One of the issues I have been having is extreme node instability when
> running a major compaction. After 20-30 seconds of operation, the node
> spends 30+ seconds in (what I believe to be) GC. Now I have tried halving
> all memtable thresholds to reduce overall heap memory usage but that has not
> seemed to help with the instability. After one of these blips, I often see
> log entries as follows:
>  INFO [ScheduledTasks:1] 2011-01-17 10:41:21,961 GCInspector.java (line 133)
> GC for ParNew: 215 ms, 45084168 reclaimed leaving 11068700368 used; max is
> 12783583232
>  INFO [ScheduledTasks:1] 2011-01-17 10:41:28,033 GCInspector.java (line 133)
> GC for ParNew: 234 ms, 40401120 reclaimed leaving 12144504848 used; max is
> 12783583232
>  INFO [ScheduledTasks:1] 2011-01-17 10:42:15,911 GCInspector.java (line 133)
> GC for ConcurrentMarkSweep: 45828 ms, 3350764696 reclaimed leaving
> 9224048472 used; max is 12783583232

45 seconds is pretty significant even for a 12 gig heap unless you're
really CPU loaded so that there is heavy contention over the CPU.
While I don't see anything obviously extreme; are you sure you're not
swapping a bit?

Also, what do you mean by node instability - does it *completely* stop
responding during these periods, or does it flap in and out of the
cluster while still responding?

Are your nodes disk bound or CPU bound during compaction?

-- 
/ Peter Schuller