You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@cassandra.apache.org by Joel Samuelsson <sa...@gmail.com> on 2014/01/29 17:23:48 UTC

Weird GC

Hi,

We've been trying to figure out why we have so long and frequent
stop-the-world GC even though we have basically no load.

Today we got a log of a weird GC that I wonder if you have any theories of
why it might have happened.

A plot of our heap at the time, paired with the GC time from the Cassandra
log:
http://imgur.com/vw5rOzj
-The blue line is the ratio of Eden space used (i.e. 1.0 = full)
-The red line is the ratio of Survivor0 space used
-The green line is the ratio of Survivor1 space used
-The teal line is the ratio of Old Gen space used
-The pink line shows during which period of time a GC happened (from the
Cassandra log)

Eden space is filling up and being cleared as expected in the first and
last hill but on the middle one, it takes two seconds to clear Eden (note
that Eden has ratio 1 for 2 seconds). Neither the survivor spaces nor old
generation increase significantly afterwards.

Any ideas of why this might be happening?
We have swap disabled, JNA enabled, no CPU spikes at the time, no disk I/O
spikes at the time. What else could be causing this?

/Joel Samuelsson

Re: Weird GC

Posted by Benedict Elliott Smith <be...@datastax.com>.
You should expect to see lines of output like:

         vmop                    [threads: total initially_running
wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
0.436: Deoptimize                       [      10          0
   0    ]      [     0     0     0     0     0    ]  0
1.437: no vm operation                  [      18          0
   1    ]      [     0     0     0     0     0    ]  0
1.762: Deoptimize                       [      21          0
   0    ]      [     0     0     0     0     0    ]  0
2.764: no vm operation                  [     160          0
   1    ]      [     0     0     0     0     0    ]  0
2.876: Deoptimize                       [     161          0
   0    ]      [     0     0     0     0     0    ]  0
4.503: EnableBiasedLocking              [     164          0
   0    ]      [     0     0     0     0     0    ]  0
6.916: RevokeBias                       [     164          0
   0    ]      [     0     0     0     0     0    ]  0

You're looking for any of these lines printed at or around one of your
unexpectedly long pauses.




On 31 January 2014 10:40, Joel Samuelsson <sa...@gmail.com> wrote:

> Thanks for your help.
>
> I've added those flags as well as some others I saw in another thread that
> redirects stdout to a file. What information is it that you need?
>
>
> 2014-01-29 Benedict Elliott Smith <be...@datastax.com>:
>
> It's possible the time attributed to GC is actually spent somewhere else;
>> a multitude of tasks may occur during the same safepoint as a GC. We've
>> seen some batch revoke of biased locks take a long time, for instance;
>> *if* this is happening in your case, and we can track down which
>> objects, I would consider it a bug and we may be able to fix it.
>>
>> -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1
>>
>>
>> On 29 January 2014 16:23, Joel Samuelsson <sa...@gmail.com>wrote:
>>
>>> Hi,
>>>
>>> We've been trying to figure out why we have so long and frequent
>>> stop-the-world GC even though we have basically no load.
>>>
>>> Today we got a log of a weird GC that I wonder if you have any theories
>>> of why it might have happened.
>>>
>>> A plot of our heap at the time, paired with the GC time from the
>>> Cassandra log:
>>> http://imgur.com/vw5rOzj
>>> -The blue line is the ratio of Eden space used (i.e. 1.0 = full)
>>> -The red line is the ratio of Survivor0 space used
>>> -The green line is the ratio of Survivor1 space used
>>> -The teal line is the ratio of Old Gen space used
>>> -The pink line shows during which period of time a GC happened (from the
>>> Cassandra log)
>>>
>>> Eden space is filling up and being cleared as expected in the first and
>>> last hill but on the middle one, it takes two seconds to clear Eden (note
>>> that Eden has ratio 1 for 2 seconds). Neither the survivor spaces nor old
>>> generation increase significantly afterwards.
>>>
>>> Any ideas of why this might be happening?
>>> We have swap disabled, JNA enabled, no CPU spikes at the time, no disk
>>> I/O spikes at the time. What else could be causing this?
>>>
>>> /Joel Samuelsson
>>>
>>
>>
>

Re: Weird GC

Posted by Joel Samuelsson <sa...@gmail.com>.
Thanks for your help.

I've added those flags as well as some others I saw in another thread that
redirects stdout to a file. What information is it that you need?


2014-01-29 Benedict Elliott Smith <be...@datastax.com>:

> It's possible the time attributed to GC is actually spent somewhere else;
> a multitude of tasks may occur during the same safepoint as a GC. We've
> seen some batch revoke of biased locks take a long time, for instance;
> *if* this is happening in your case, and we can track down which objects,
> I would consider it a bug and we may be able to fix it.
>
> -XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1
>
>
> On 29 January 2014 16:23, Joel Samuelsson <sa...@gmail.com>wrote:
>
>> Hi,
>>
>> We've been trying to figure out why we have so long and frequent
>> stop-the-world GC even though we have basically no load.
>>
>> Today we got a log of a weird GC that I wonder if you have any theories
>> of why it might have happened.
>>
>> A plot of our heap at the time, paired with the GC time from the
>> Cassandra log:
>> http://imgur.com/vw5rOzj
>> -The blue line is the ratio of Eden space used (i.e. 1.0 = full)
>> -The red line is the ratio of Survivor0 space used
>> -The green line is the ratio of Survivor1 space used
>> -The teal line is the ratio of Old Gen space used
>> -The pink line shows during which period of time a GC happened (from the
>> Cassandra log)
>>
>> Eden space is filling up and being cleared as expected in the first and
>> last hill but on the middle one, it takes two seconds to clear Eden (note
>> that Eden has ratio 1 for 2 seconds). Neither the survivor spaces nor old
>> generation increase significantly afterwards.
>>
>> Any ideas of why this might be happening?
>> We have swap disabled, JNA enabled, no CPU spikes at the time, no disk
>> I/O spikes at the time. What else could be causing this?
>>
>> /Joel Samuelsson
>>
>
>

Re: Weird GC

Posted by Benedict Elliott Smith <be...@datastax.com>.
It's possible the time attributed to GC is actually spent somewhere else; a
multitude of tasks may occur during the same safepoint as a GC. We've seen
some batch revoke of biased locks take a long time, for instance; *if* this
is happening in your case, and we can track down which objects, I would
consider it a bug and we may be able to fix it.

-XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1


On 29 January 2014 16:23, Joel Samuelsson <sa...@gmail.com> wrote:

> Hi,
>
> We've been trying to figure out why we have so long and frequent
> stop-the-world GC even though we have basically no load.
>
> Today we got a log of a weird GC that I wonder if you have any theories of
> why it might have happened.
>
> A plot of our heap at the time, paired with the GC time from the Cassandra
> log:
> http://imgur.com/vw5rOzj
> -The blue line is the ratio of Eden space used (i.e. 1.0 = full)
> -The red line is the ratio of Survivor0 space used
> -The green line is the ratio of Survivor1 space used
> -The teal line is the ratio of Old Gen space used
> -The pink line shows during which period of time a GC happened (from the
> Cassandra log)
>
> Eden space is filling up and being cleared as expected in the first and
> last hill but on the middle one, it takes two seconds to clear Eden (note
> that Eden has ratio 1 for 2 seconds). Neither the survivor spaces nor old
> generation increase significantly afterwards.
>
> Any ideas of why this might be happening?
> We have swap disabled, JNA enabled, no CPU spikes at the time, no disk I/O
> spikes at the time. What else could be causing this?
>
> /Joel Samuelsson
>