You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@cassandra.apache.org by Dan van Kley <dv...@salesforce.com> on 2014/10/24 19:35:05 UTC

Re: Intermittent long application pauses on nodes

I'm also curious to know if this was ever resolved or if there's any other
recommended steps to take to continue to track it down. I'm seeing the same
issue in our production cluster, which is running Cassandra 2.0.10 and JVM
1.7u71, using the CMS collector. Just as described above, the issue is long
"Total time for which application threads were stopped" pauses that are not
a direct result of GC pauses (ParNew, initial mark or remark). When I
enabled the safepoint logging I saw the same result, long "sync" pause
times with short spin and block times, usually with the "RevokeBias"
description. We're seeing pause times sometimes in excess of 10 seconds, so
it's a pretty debilitating issue. Our machines are not swapping (or even
close to it) or having other load issues when these pauses occur. Any ideas
would be very appreciated. Thanks!

Re: Intermittent long application pauses on nodes

Posted by graham sanderson <gr...@vast.com>.
I have to admit that I haven’t tried the SafepointTimeout (I just noticed that it was actually a production VM option in the JVM code, after my initial suggestions below for debugging without it).

There doesn’t seem to be an obvious bug in SafepointTimeout, though I may not be looking at the same version of the JVM source. That said there IS some code executed after waiting (up to the timeout) for threads to reach safe point before it actually records the sync time, so maybe that code is slow for some reason. If you were running with TraceSafepoint you’d see this (but that requires a debug build of the JVM) because there is an extra trace statement here.
 
I’d either try some suggestions below (though if what I said above is the case they may not help you as much since the other threads may be well behaved, though they may still give you some insight), but firstly, for sure, I’d try to see if SafepointTimeout is working at all by setting a ridiculously low timeout delay and seeing if you catch anything.

> On Oct 31, 2014, at 2:42 PM, Dan van Kley <dv...@salesforce.com> wrote:
> 
> Well I tried the SafepointTimeout option, but unfortunately it seems like the long safepoint syncs don't actually trigger the SafepointTimeout mechanism, so we didn't get any logs on it. It's possible I'm just doing it wrong, I used the following options:
> 
> JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions -XX:LogFile=/var/log/cassandra/stdout.log -XX:+LogVMOutput"
> JVM_OPTS="$JVM_OPTS -XX:+PrintSafepointStatistics"
> JVM_OPTS="$JVM_OPTS -XX:PrintSafepointStatisticsCount=1"
> JVM_OPTS="$JVM_OPTS -XX:SafepointTimeoutDelay=4000"
> JVM_OPTS="$JVM_OPTS -XX:+SafepointTimeout”
> 
> and saw the safepoint logging as usual in that stdout.log file, but no timeout logging in either that file or the GC log when safepoint syncs exceeded the timeout. It also seems possible that SafepointTimeout doesn't work on long syncs, see http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2013-April/006945.html <http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2013-April/006945.html>.
> 
> That being the case, any other ideas or suggestions would be appreciated. Thanks!
> 
> On Mon, Oct 27, 2014 at 9:44 AM, Dan van Kley <dvankley@salesforce.com <ma...@salesforce.com>> wrote:
> Excellent, thanks for the tips, Graham. I'll give SafepointTimeout a try and see if that gives us anything to act on.
> 
> On Fri, Oct 24, 2014 at 3:52 PM, graham sanderson <graham@vast.com <ma...@vast.com>> wrote:
> And -XX:SafepointTimeoutDelay=xxx
> 
> to set how long before it dumps output (defaults to 10000 I believe)…
> 
> Note it doesn’t actually timeout by default, it just prints the problematic threads after that time and keeps on waiting
> 
>> On Oct 24, 2014, at 2:44 PM, graham sanderson <graham@vast.com <ma...@vast.com>> wrote:
>> 
>> Actually - there is 
>> 
>> -XX:+SafepointTimeout
>> 
>> which will print out offending threads (assuming you reach a 10 second pause)…
>> 
>> That is probably your best bet.
>> 
>>> On Oct 24, 2014, at 2:38 PM, graham sanderson <graham@vast.com <ma...@vast.com>> wrote:
>>> 
>>> This certainly sounds like a JVM bug.
>>> 
>>> We are running C* 2.0.9 on pretty high end machines with pretty large heaps, and don’t seem to have seen this (note we are on 7u67, so that might be an interesting data point, though since the old thread predated that probably not)
>>> 
>>> 1) From the app/java side, I’d obviously see if you can identify anything which always coincides with this - repair, compaction etc
>>> 2) From the VM side (given that this as Benedict mentioned) some threads are taking a long time to rendezvous at the safe point, and it is probably not application threads, I’d look what GC threads, compiler threads etc might be doing. As mentioned it shouldn’t be anything to do with operations which run at a safe point anyway (e.g. scavenge)
>>> 	a) So look at what CMS is doing at the time and see if you can correlate
>>> 	b) Check Oracle for related bugs - didn’t obviously see any, but there have been some complaints related to compilation and safe points
>>> 	c) Add any compilation tracing you can
>>> 	d) Kind of important here - see if you can figure out via dtrace, system tap, gdb or whatever, what the threads are doing when this happens. Sadly it doesn’t look like you can figure out when this is happening (until afterwards) unless you have access to a debug JVM build (and can turn on -XX:+TraceSafepoint and look for a safe point start without a corresponding update within a time period) - if you don’t have access to that, I guess you could try and get a dump every 2-3 seconds (you should catch a 9 second pause eventually!)
>>> 
>>>> On Oct 24, 2014, at 12:35 PM, Dan van Kley <dvankley@salesforce.com <ma...@salesforce.com>> wrote:
>>>> 
>>>> I'm also curious to know if this was ever resolved or if there's any other recommended steps to take to continue to track it down. I'm seeing the same issue in our production cluster, which is running Cassandra 2.0.10 and JVM 1.7u71, using the CMS collector. Just as described above, the issue is long "Total time for which application threads were stopped" pauses that are not a direct result of GC pauses (ParNew, initial mark or remark). When I enabled the safepoint logging I saw the same result, long "sync" pause times with short spin and block times, usually with the "RevokeBias" description. We're seeing pause times sometimes in excess of 10 seconds, so it's a pretty debilitating issue. Our machines are not swapping (or even close to it) or having other load issues when these pauses occur. Any ideas would be very appreciated. Thanks!
>>> 
>> 
> 
> 
> 


Re: Intermittent long application pauses on nodes

Posted by Dan van Kley <dv...@salesforce.com>.
Well I tried the SafepointTimeout option, but unfortunately it seems like
the long safepoint syncs don't actually trigger the SafepointTimeout
mechanism, so we didn't get any logs on it. It's possible I'm just doing it
wrong, I used the following options:

JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions
-XX:LogFile=/var/log/cassandra/stdout.log -XX:+LogVMOutput"
JVM_OPTS="$JVM_OPTS -XX:+PrintSafepointStatistics"
JVM_OPTS="$JVM_OPTS -XX:PrintSafepointStatisticsCount=1"
JVM_OPTS="$JVM_OPTS -XX:SafepointTimeoutDelay=4000"
JVM_OPTS="$JVM_OPTS -XX:+SafepointTimeout”

and saw the safepoint logging as usual in that stdout.log file, but no
timeout logging in either that file or the GC log when safepoint syncs
exceeded the timeout. It also seems possible that SafepointTimeout doesn't
work on long syncs, see
http://mail.openjdk.java.net/pipermail/hotspot-runtime-dev/2013-April/006945.html
.

That being the case, any other ideas or suggestions would be appreciated.
Thanks!

On Mon, Oct 27, 2014 at 9:44 AM, Dan van Kley <dv...@salesforce.com>
wrote:

> Excellent, thanks for the tips, Graham. I'll give SafepointTimeout a try
> and see if that gives us anything to act on.
>
> On Fri, Oct 24, 2014 at 3:52 PM, graham sanderson <gr...@vast.com> wrote:
>
>> And -XX:SafepointTimeoutDelay=xxx
>>
>> to set how long before it dumps output (defaults to 10000 I believe)…
>>
>> Note it doesn’t actually timeout by default, it just prints the
>> problematic threads after that time and keeps on waiting
>>
>> On Oct 24, 2014, at 2:44 PM, graham sanderson <gr...@vast.com> wrote:
>>
>> Actually - there is
>>
>> -XX:+SafepointTimeout
>>
>> which will print out offending threads (assuming you reach a 10 second
>> pause)…
>>
>> That is probably your best bet.
>>
>> On Oct 24, 2014, at 2:38 PM, graham sanderson <gr...@vast.com> wrote:
>>
>> This certainly *sounds* like a JVM bug.
>>
>> We are running C* 2.0.9 on pretty high end machines with pretty large
>> heaps, and don’t seem to have seen this (note we are on 7u67, so that might
>> be an interesting data point, though since the old thread predated that
>> probably not)
>>
>> 1) From the app/java side, I’d obviously see if you can identify anything
>> which always coincides with this - repair, compaction etc
>> 2) From the VM side (given that this as Benedict mentioned) some threads
>> are taking a long time to rendezvous at the safe point, and it is probably
>> not application threads, I’d look what GC threads, compiler threads etc
>> might be doing. As mentioned it shouldn’t be anything to do with operations
>> which run at a safe point anyway (e.g. scavenge)
>> a) So look at what CMS is doing at the time and see if you can correlate
>> b) Check Oracle for related bugs - didn’t obviously see any, but there
>> have been some complaints related to compilation and safe points
>> c) Add any compilation tracing you can
>> d) Kind of important here - see if you can figure out via dtrace, system
>> tap, gdb or whatever, what the threads are doing when this happens. Sadly
>> it doesn’t look like you can figure out when this is happening (until
>> afterwards) unless you have access to a debug JVM build (and can turn on
>> -XX:+TraceSafepoint and look for a safe point start without a corresponding
>> update within a time period) - if you don’t have access to that, I guess
>> you could try and get a dump every 2-3 seconds (you should catch a 9 second
>> pause eventually!)
>>
>> On Oct 24, 2014, at 12:35 PM, Dan van Kley <dv...@salesforce.com>
>> wrote:
>>
>> I'm also curious to know if this was ever resolved or if there's any
>> other recommended steps to take to continue to track it down. I'm seeing
>> the same issue in our production cluster, which is running Cassandra 2.0.10
>> and JVM 1.7u71, using the CMS collector. Just as described above, the issue
>> is long "Total time for which application threads were stopped" pauses that
>> are not a direct result of GC pauses (ParNew, initial mark or remark). When
>> I enabled the safepoint logging I saw the same result, long "sync" pause
>> times with short spin and block times, usually with the "RevokeBias"
>> description. We're seeing pause times sometimes in excess of 10 seconds, so
>> it's a pretty debilitating issue. Our machines are not swapping (or even
>> close to it) or having other load issues when these pauses occur. Any ideas
>> would be very appreciated. Thanks!
>>
>>
>>
>>
>>
>

Re: Intermittent long application pauses on nodes

Posted by Dan van Kley <dv...@salesforce.com>.
Excellent, thanks for the tips, Graham. I'll give SafepointTimeout a try
and see if that gives us anything to act on.

On Fri, Oct 24, 2014 at 3:52 PM, graham sanderson <gr...@vast.com> wrote:

> And -XX:SafepointTimeoutDelay=xxx
>
> to set how long before it dumps output (defaults to 10000 I believe)…
>
> Note it doesn’t actually timeout by default, it just prints the
> problematic threads after that time and keeps on waiting
>
> On Oct 24, 2014, at 2:44 PM, graham sanderson <gr...@vast.com> wrote:
>
> Actually - there is
>
> -XX:+SafepointTimeout
>
> which will print out offending threads (assuming you reach a 10 second
> pause)…
>
> That is probably your best bet.
>
> On Oct 24, 2014, at 2:38 PM, graham sanderson <gr...@vast.com> wrote:
>
> This certainly *sounds* like a JVM bug.
>
> We are running C* 2.0.9 on pretty high end machines with pretty large
> heaps, and don’t seem to have seen this (note we are on 7u67, so that might
> be an interesting data point, though since the old thread predated that
> probably not)
>
> 1) From the app/java side, I’d obviously see if you can identify anything
> which always coincides with this - repair, compaction etc
> 2) From the VM side (given that this as Benedict mentioned) some threads
> are taking a long time to rendezvous at the safe point, and it is probably
> not application threads, I’d look what GC threads, compiler threads etc
> might be doing. As mentioned it shouldn’t be anything to do with operations
> which run at a safe point anyway (e.g. scavenge)
> a) So look at what CMS is doing at the time and see if you can correlate
> b) Check Oracle for related bugs - didn’t obviously see any, but there
> have been some complaints related to compilation and safe points
> c) Add any compilation tracing you can
> d) Kind of important here - see if you can figure out via dtrace, system
> tap, gdb or whatever, what the threads are doing when this happens. Sadly
> it doesn’t look like you can figure out when this is happening (until
> afterwards) unless you have access to a debug JVM build (and can turn on
> -XX:+TraceSafepoint and look for a safe point start without a corresponding
> update within a time period) - if you don’t have access to that, I guess
> you could try and get a dump every 2-3 seconds (you should catch a 9 second
> pause eventually!)
>
> On Oct 24, 2014, at 12:35 PM, Dan van Kley <dv...@salesforce.com>
> wrote:
>
> I'm also curious to know if this was ever resolved or if there's any other
> recommended steps to take to continue to track it down. I'm seeing the same
> issue in our production cluster, which is running Cassandra 2.0.10 and JVM
> 1.7u71, using the CMS collector. Just as described above, the issue is long
> "Total time for which application threads were stopped" pauses that are not
> a direct result of GC pauses (ParNew, initial mark or remark). When I
> enabled the safepoint logging I saw the same result, long "sync" pause
> times with short spin and block times, usually with the "RevokeBias"
> description. We're seeing pause times sometimes in excess of 10 seconds, so
> it's a pretty debilitating issue. Our machines are not swapping (or even
> close to it) or having other load issues when these pauses occur. Any ideas
> would be very appreciated. Thanks!
>
>
>
>
>

Re: Intermittent long application pauses on nodes

Posted by graham sanderson <gr...@vast.com>.
And -XX:SafepointTimeoutDelay=xxx

to set how long before it dumps output (defaults to 10000 I believe)…

Note it doesn’t actually timeout by default, it just prints the problematic threads after that time and keeps on waiting

> On Oct 24, 2014, at 2:44 PM, graham sanderson <gr...@vast.com> wrote:
> 
> Actually - there is 
> 
> -XX:+SafepointTimeout
> 
> which will print out offending threads (assuming you reach a 10 second pause)…
> 
> That is probably your best bet.
> 
>> On Oct 24, 2014, at 2:38 PM, graham sanderson <graham@vast.com <ma...@vast.com>> wrote:
>> 
>> This certainly sounds like a JVM bug.
>> 
>> We are running C* 2.0.9 on pretty high end machines with pretty large heaps, and don’t seem to have seen this (note we are on 7u67, so that might be an interesting data point, though since the old thread predated that probably not)
>> 
>> 1) From the app/java side, I’d obviously see if you can identify anything which always coincides with this - repair, compaction etc
>> 2) From the VM side (given that this as Benedict mentioned) some threads are taking a long time to rendezvous at the safe point, and it is probably not application threads, I’d look what GC threads, compiler threads etc might be doing. As mentioned it shouldn’t be anything to do with operations which run at a safe point anyway (e.g. scavenge)
>> 	a) So look at what CMS is doing at the time and see if you can correlate
>> 	b) Check Oracle for related bugs - didn’t obviously see any, but there have been some complaints related to compilation and safe points
>> 	c) Add any compilation tracing you can
>> 	d) Kind of important here - see if you can figure out via dtrace, system tap, gdb or whatever, what the threads are doing when this happens. Sadly it doesn’t look like you can figure out when this is happening (until afterwards) unless you have access to a debug JVM build (and can turn on -XX:+TraceSafepoint and look for a safe point start without a corresponding update within a time period) - if you don’t have access to that, I guess you could try and get a dump every 2-3 seconds (you should catch a 9 second pause eventually!)
>> 
>>> On Oct 24, 2014, at 12:35 PM, Dan van Kley <dvankley@salesforce.com <ma...@salesforce.com>> wrote:
>>> 
>>> I'm also curious to know if this was ever resolved or if there's any other recommended steps to take to continue to track it down. I'm seeing the same issue in our production cluster, which is running Cassandra 2.0.10 and JVM 1.7u71, using the CMS collector. Just as described above, the issue is long "Total time for which application threads were stopped" pauses that are not a direct result of GC pauses (ParNew, initial mark or remark). When I enabled the safepoint logging I saw the same result, long "sync" pause times with short spin and block times, usually with the "RevokeBias" description. We're seeing pause times sometimes in excess of 10 seconds, so it's a pretty debilitating issue. Our machines are not swapping (or even close to it) or having other load issues when these pauses occur. Any ideas would be very appreciated. Thanks!
>> 
> 


Re: Intermittent long application pauses on nodes

Posted by graham sanderson <gr...@vast.com>.
Actually - there is 

-XX:+SafepointTimeout

which will print out offending threads (assuming you reach a 10 second pause)…

That is probably your best bet.

> On Oct 24, 2014, at 2:38 PM, graham sanderson <gr...@vast.com> wrote:
> 
> This certainly sounds like a JVM bug.
> 
> We are running C* 2.0.9 on pretty high end machines with pretty large heaps, and don’t seem to have seen this (note we are on 7u67, so that might be an interesting data point, though since the old thread predated that probably not)
> 
> 1) From the app/java side, I’d obviously see if you can identify anything which always coincides with this - repair, compaction etc
> 2) From the VM side (given that this as Benedict mentioned) some threads are taking a long time to rendezvous at the safe point, and it is probably not application threads, I’d look what GC threads, compiler threads etc might be doing. As mentioned it shouldn’t be anything to do with operations which run at a safe point anyway (e.g. scavenge)
> 	a) So look at what CMS is doing at the time and see if you can correlate
> 	b) Check Oracle for related bugs - didn’t obviously see any, but there have been some complaints related to compilation and safe points
> 	c) Add any compilation tracing you can
> 	d) Kind of important here - see if you can figure out via dtrace, system tap, gdb or whatever, what the threads are doing when this happens. Sadly it doesn’t look like you can figure out when this is happening (until afterwards) unless you have access to a debug JVM build (and can turn on -XX:+TraceSafepoint and look for a safe point start without a corresponding update within a time period) - if you don’t have access to that, I guess you could try and get a dump every 2-3 seconds (you should catch a 9 second pause eventually!)
> 
>> On Oct 24, 2014, at 12:35 PM, Dan van Kley <dvankley@salesforce.com <ma...@salesforce.com>> wrote:
>> 
>> I'm also curious to know if this was ever resolved or if there's any other recommended steps to take to continue to track it down. I'm seeing the same issue in our production cluster, which is running Cassandra 2.0.10 and JVM 1.7u71, using the CMS collector. Just as described above, the issue is long "Total time for which application threads were stopped" pauses that are not a direct result of GC pauses (ParNew, initial mark or remark). When I enabled the safepoint logging I saw the same result, long "sync" pause times with short spin and block times, usually with the "RevokeBias" description. We're seeing pause times sometimes in excess of 10 seconds, so it's a pretty debilitating issue. Our machines are not swapping (or even close to it) or having other load issues when these pauses occur. Any ideas would be very appreciated. Thanks!
> 


Re: Intermittent long application pauses on nodes

Posted by graham sanderson <gr...@vast.com>.
This certainly sounds like a JVM bug.

We are running C* 2.0.9 on pretty high end machines with pretty large heaps, and don’t seem to have seen this (note we are on 7u67, so that might be an interesting data point, though since the old thread predated that probably not)

1) From the app/java side, I’d obviously see if you can identify anything which always coincides with this - repair, compaction etc
2) From the VM side (given that this as Benedict mentioned) some threads are taking a long time to rendezvous at the safe point, and it is probably not application threads, I’d look what GC threads, compiler threads etc might be doing. As mentioned it shouldn’t be anything to do with operations which run at a safe point anyway (e.g. scavenge)
	a) So look at what CMS is doing at the time and see if you can correlate
	b) Check Oracle for related bugs - didn’t obviously see any, but there have been some complaints related to compilation and safe points
	c) Add any compilation tracing you can
	d) Kind of important here - see if you can figure out via dtrace, system tap, gdb or whatever, what the threads are doing when this happens. Sadly it doesn’t look like you can figure out when this is happening (until afterwards) unless you have access to a debug JVM build (and can turn on -XX:+TraceSafepoint and look for a safe point start without a corresponding update within a time period) - if you don’t have access to that, I guess you could try and get a dump every 2-3 seconds (you should catch a 9 second pause eventually!)

> On Oct 24, 2014, at 12:35 PM, Dan van Kley <dv...@salesforce.com> wrote:
> 
> I'm also curious to know if this was ever resolved or if there's any other recommended steps to take to continue to track it down. I'm seeing the same issue in our production cluster, which is running Cassandra 2.0.10 and JVM 1.7u71, using the CMS collector. Just as described above, the issue is long "Total time for which application threads were stopped" pauses that are not a direct result of GC pauses (ParNew, initial mark or remark). When I enabled the safepoint logging I saw the same result, long "sync" pause times with short spin and block times, usually with the "RevokeBias" description. We're seeing pause times sometimes in excess of 10 seconds, so it's a pretty debilitating issue. Our machines are not swapping (or even close to it) or having other load issues when these pauses occur. Any ideas would be very appreciated. Thanks!