You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@cassandra.apache.org by Rene Kochen <Re...@emea.schange.com> on 2012/01/19 12:27:29 UTC

RE: Garbage collection freezes cassandra node

Thanks for your comments. The application is indeed suffering from a freezing Cassandra node. Queries are taking longer than 10 seconds at the moment of a full garbage collect.

Here is an example from the logs. I have a three node cluster. At some point I see on a node the following log:

21:53:35,986 InetAddress /172.16.107.46 is now dead.

On node "172.16.107.46", I see the following:

21:53:27.192+0100: 1335393.834: [GC 1335393.834: [ParNew (promotion failed): 319468K->324959K(345024K), 0.1304456 secs]1335393.964: [CMS: 6000844K->3298251K(8005248K), 10.8526193 secs] 6310427K->3298251K(8350272K), [CMS Perm : 26355K->26346K(44268K)], 10.9832679 secs] [Times: user=11.15 sys=0.03, real=10.98 secs] 
21:53:38,174 GC for ConcurrentMarkSweep: 10856 ms for 1 collections, 3389079904 used; max is 8550678528

I have not yet tested the "XX:+DisableExplicitGC" switch.

Is the right thing to do to decrease the CMSInitiatingOccupancyFraction setting?

Thanks!

Rene

-----Original Message-----
From: scode@scode.org [mailto:scode@scode.org] On Behalf Of Peter Schuller
Sent: dinsdag 20 december 2011 6:38
To: user@cassandra.apache.org
Subject: Re: Garbage collection freezes cassandra node

I should add: If you are indeed actually pausing due to "promotion
failed" or "concurrent mode failure" (which you will see in the GC log
if you enable it with the options I suggested), the first thing I
would try to mitigate is:

* Decrease the occupancy trigger (search for "occupancy") of CMS to a
lower percentage, making the concurrent mark phase start earlier.
* Increase heap size significantly (probably not necessary based on
your graph, but for good measure).

If it then goes away, report back and we can perhaps figure out
details. There are other things that can be done.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)

Re: Garbage collection freezes cassandra node

Posted by Mohit Anchlia <mo...@gmail.com>.
What's the version of Java do you use? Can you try reducing NewSize
and increasing Old generation? If you are on old version of Java I
also recommend upgrading that version.

On Thu, Jan 19, 2012 at 3:27 AM, Rene Kochen
<Re...@emea.schange.com> wrote:
> Thanks for your comments. The application is indeed suffering from a freezing Cassandra node. Queries are taking longer than 10 seconds at the moment of a full garbage collect.
>
> Here is an example from the logs. I have a three node cluster. At some point I see on a node the following log:
>
> 21:53:35,986 InetAddress /172.16.107.46 is now dead.
>
> On node "172.16.107.46", I see the following:
>
> 21:53:27.192+0100: 1335393.834: [GC 1335393.834: [ParNew (promotion failed): 319468K->324959K(345024K), 0.1304456 secs]1335393.964: [CMS: 6000844K->3298251K(8005248K), 10.8526193 secs] 6310427K->3298251K(8350272K), [CMS Perm : 26355K->26346K(44268K)], 10.9832679 secs] [Times: user=11.15 sys=0.03, real=10.98 secs]
> 21:53:38,174 GC for ConcurrentMarkSweep: 10856 ms for 1 collections, 3389079904 used; max is 8550678528
>
> I have not yet tested the "XX:+DisableExplicitGC" switch.
>
> Is the right thing to do to decrease the CMSInitiatingOccupancyFraction setting?
>
> Thanks!
>
> Rene
>
> -----Original Message-----
> From: scode@scode.org [mailto:scode@scode.org] On Behalf Of Peter Schuller
> Sent: dinsdag 20 december 2011 6:38
> To: user@cassandra.apache.org
> Subject: Re: Garbage collection freezes cassandra node
>
> I should add: If you are indeed actually pausing due to "promotion
> failed" or "concurrent mode failure" (which you will see in the GC log
> if you enable it with the options I suggested), the first thing I
> would try to mitigate is:
>
> * Decrease the occupancy trigger (search for "occupancy") of CMS to a
> lower percentage, making the concurrent mark phase start earlier.
> * Increase heap size significantly (probably not necessary based on
> your graph, but for good measure).
>
> If it then goes away, report back and we can perhaps figure out
> details. There are other things that can be done.
>
> --
> / Peter Schuller (@scode, http://worldmodscode.wordpress.com)

Re: Garbage collection freezes cassandra node

Posted by Peter Schuller <pe...@infidyne.com>.
> Thanks for this very helpful info. It is indeed a production site which I cannot easily upgrade. I will try the various gc knobs and post any positive results.

*IF* your data size, or at least hot set, is small enough that you're
not extremely reliant on the current size of page cache, and in terms
of short-term relief, I recommend:

* Significantly increasing the heap size. Like double it or more.
* Decrease the occupancy trigger such that it kicks in around the time
it already does (in terms of amount of heap usage used).
* Increase the young generation size (to lessen promotion into old-gen).

Experiment on a single node, making sure you're not causing too much
disk I/O by stealing memory otherwise used by page cache. Once you
have something that works you might try slowly going back down.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)

RE: Garbage collection freezes cassandra node

Posted by Rene Kochen <Re...@emea.schange.com>.
Thanks for this very helpful info. It is indeed a production site which I cannot easily upgrade. I will try the various gc knobs and post any positive results.

-----Original Message-----
From: scode@scode.org [mailto:scode@scode.org] On Behalf Of Peter Schuller
Sent: vrijdag 20 januari 2012 4:23
To: user@cassandra.apache.org
Subject: Re: Garbage collection freezes cassandra node

> On node "172.16.107.46", I see the following:
>
> 21:53:27.192+0100: 1335393.834: [GC 1335393.834: [ParNew (promotion failed): 319468K->324959K(345024K), 0.1304456 secs]1335393.964: [CMS: 6000844K->3298251K(8005248K), 10.8526193 secs] 6310427K->3298251K(8350272K), [CMS Perm : 26355K->26346K(44268K)], 10.9832679 secs] [Times: user=11.15 sys=0.03, real=10.98 secs]
> 21:53:38,174 GC for ConcurrentMarkSweep: 10856 ms for 1 collections, 3389079904 used; max is 8550678528
>
> I have not yet tested the "XX:+DisableExplicitGC" switch.
>
> Is the right thing to do to decrease the CMSInitiatingOccupancyFraction setting?

* Increasing the total heap size can definitely help; the only kink is
that if you need to increase the heap size unacceptably much, it is
not helpful.
* Decreasing the occupancy trigger can help yes, but you will get very
much diminishing returns as your trigger fraction approaches the
actual live size of data on the heap.
* I just re-checked your original message - you're on Cassandra 0.7? I
*strongly* suggest upgrading to 1.x. In general that holds true, but
also specifically relating to this are significant improvements in
memory allocation behavior that significantly reduces the probability
and/or frequency of promotion failures and full gcs.
* Increasing the size of the young generation can help by causing less
promotion to old-gen (see the cassandra.in.sh script or equivalent of
for Windows).
* Increasing the amount of parallel threads used by CMS can help CMS
complete it's marking phase quicker, but at the cost of a greater
impact on the mutator (cassandra).

I think the most important thing is - upgrade to 1.x before you run
these benchmarks. Particularly detailed tuning of GC issues is pretty
useless on 0.7 given the significant changes in 1.0. Don't even bother
spending time on this until you're on 1.0, unless this is about a
production cluster that you cannot upgrade for some reason.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)

Re: Garbage collection freezes cassandra node

Posted by Peter Schuller <pe...@infidyne.com>.
> On node "172.16.107.46", I see the following:
>
> 21:53:27.192+0100: 1335393.834: [GC 1335393.834: [ParNew (promotion failed): 319468K->324959K(345024K), 0.1304456 secs]1335393.964: [CMS: 6000844K->3298251K(8005248K), 10.8526193 secs] 6310427K->3298251K(8350272K), [CMS Perm : 26355K->26346K(44268K)], 10.9832679 secs] [Times: user=11.15 sys=0.03, real=10.98 secs]
> 21:53:38,174 GC for ConcurrentMarkSweep: 10856 ms for 1 collections, 3389079904 used; max is 8550678528
>
> I have not yet tested the "XX:+DisableExplicitGC" switch.
>
> Is the right thing to do to decrease the CMSInitiatingOccupancyFraction setting?

* Increasing the total heap size can definitely help; the only kink is
that if you need to increase the heap size unacceptably much, it is
not helpful.
* Decreasing the occupancy trigger can help yes, but you will get very
much diminishing returns as your trigger fraction approaches the
actual live size of data on the heap.
* I just re-checked your original message - you're on Cassandra 0.7? I
*strongly* suggest upgrading to 1.x. In general that holds true, but
also specifically relating to this are significant improvements in
memory allocation behavior that significantly reduces the probability
and/or frequency of promotion failures and full gcs.
* Increasing the size of the young generation can help by causing less
promotion to old-gen (see the cassandra.in.sh script or equivalent of
for Windows).
* Increasing the amount of parallel threads used by CMS can help CMS
complete it's marking phase quicker, but at the cost of a greater
impact on the mutator (cassandra).

I think the most important thing is - upgrade to 1.x before you run
these benchmarks. Particularly detailed tuning of GC issues is pretty
useless on 0.7 given the significant changes in 1.0. Don't even bother
spending time on this until you're on 1.0, unless this is about a
production cluster that you cannot upgrade for some reason.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)