Posted to user@cassandra.apache.org by Alain RODRIGUEZ <ar...@gmail.com> on 2014/06/18 11:39:12 UTC

restarting node makes cpu load of the entire cluster rise

Hi guys

Using 1.2.11, when I try to do a rolling restart of the cluster, any node I
restart makes the whole cluster's CPU load increase, reaching a "red" state
in OpsCenter (load goes from 3-4 to 20+). This happens once the node is back online.

The restarted node uses 100% CPU for 5-10 minutes and sometimes drops
mutations.

I have tried throttling hinted handoff to 256 (instead of 1024), yet it
doesn't seem to help that much.
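
For reference, the handoff throttle mentioned above is the cassandra.yaml
setting below (a sketch, assuming the 1.2 option name; it needs a node
restart to take effect):

  # cassandra.yaml -- default is 1024 KB/s per hint delivery thread
  hinted_handoff_throttle_in_kb: 256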

Disks are not the bottleneck. ParNew GC increases a bit, but nothing
problematic, I think.

Basically, what could be happening on node restart? What is taking that much
CPU on every machine? There is no steal or iowait.

What can I try to tune?

Re: restarting node makes cpu load of the entire cluster rise

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
Here is more info around ParNew. In these screenshots we can see 2 groups
of servers.

The servers with the lower GC counts and time spent in GC are using a NEW
size of 1600M, the higher ones are using 400M (set in cassandra-env.sh).
Every spike corresponds to a server coming back into the ring.
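
If it helps, the same ParNew counts and times can also be watched outside of
OpsCenter with jstat (a rough sketch; the PID lookup assumes the standard
CassandraDaemon process name):

  pid=$(pgrep -f CassandraDaemon)
  # YGC / YGCT columns = young-gen (ParNew) collection count and total time
  jstat -gcutil "$pid" 5000 20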

Hope that this extra info will help.


2014-06-18 14:43 GMT+02:00 Alain RODRIGUEZ <ar...@gmail.com>:

> This last command was supposed to be a best practice a few years ago, hope
> it is still the case. I just added the recent "nodetool disablebinary"
> part...
>
>
> 2014-06-18 14:36 GMT+02:00 Alain RODRIGUEZ <ar...@gmail.com>:
>
> Thanks a lot for taking time to check the log.
>>
>> We just switched from 400M to 1600M NEW size in the cassandra-env.sh. It
>> reduced our latency and the PARNEW GC time / second significantly...
>> (described here
>> http://tech.shift.com/post/74311817513/cassandra-tuning-the-jvm-for-read-heavy-workloads
>> )
>>
>> Even when we had 400M the restart was behaving this way.
>>
>> We stop the node using : nodetool disablegossip && nodetool disablethrift
>> && nodetool disablebinary && sleep 10 && nodetool drain && sleep 30 &&
>> service cassandra stop
>>
>>
>> 2014-06-18 14:23 GMT+02:00 Jonathan Lacefield <jl...@datastax.com>:
>>
>> There are several long Parnew pauses that were recorded during startup.
>>>  The young gen size looks large too, if I am reading that line correctly.
>>>  Did you happen to overwrite the default settings for MAX_HEAP and/or NEW
>>> size in the cassandra-env.sh?  The large you gen size, set via the env.sh
>>> file, could be causing longer than typical pauses, which could make your
>>> node appear to be unresponsive and have high CPU (CPU for the ParNew GC
>>> event).
>>>
>>> Check out this one - INFO 11:42:51,939 GC for ParNew: 2148 ms for 2
>>> collections, 1256307568 used; max is 8422162432
>>> That is a 2 second GC pause.  That's very high for ParNew.  We typically
>>> want a lot of tiny ParNew events as opposed to large, and less frequent,
>>> ParNew events.
>>>
>>> One other thing that was noticed, was that the node had a lot of log
>>> segment replay's during startup.  You could avoid these, or minimize them,
>>> by preforming a flush or drain before stopping and starting Cassandra.
>>>  This will flush memtables and clear your log segments.
>>>
>>>
>>>
>>> Jonathan Lacefield
>>> Solutions Architect, DataStax
>>> (404) 822 3487
>>>  <http://www.linkedin.com/in/jlacefield>
>>>
>>> <http://www.datastax.com/cassandrasummit14>
>>>
>>>
>>>
>>> On Wed, Jun 18, 2014 at 8:05 AM, Alain RODRIGUEZ <ar...@gmail.com>
>>> wrote:
>>>
>>>> A simple restart of a node with no changes give this result.
>>>>
>>>> logs output : https://gist.github.com/arodrime/db9ab152071d1ad39f26
>>>>
>>>> Here are some screenshot:
>>>>
>>>> - htop from a node immediatly after restarting
>>>> - opscenter ring view (show load cpu on all nodes)
>>>> - opscenter dashboard shows the impact of a restart on latency (can
>>>> affect writes or reads, it depends, reaction seems to be quite random)
>>>>
>>>>
>>>> 2014-06-18 13:35 GMT+02:00 Jonathan Lacefield <jl...@datastax.com>
>>>> :
>>>>
>>>> Hello
>>>>>
>>>>>   Have you checked the log file to see what's happening during startup
>>>>> ?   What caused the rolling restart?  Did you preform an upgrade or
>>>>> change a config?
>>>>>
>>>>> > On Jun 18, 2014, at 5:40 AM, Alain RODRIGUEZ <ar...@gmail.com>
>>>>> wrote:
>>>>> >
>>>>> > Hi guys
>>>>> >
>>>>> > Using 1.2.11, when I try to rolling restart the cluster, any node I
>>>>> restart makes the whole cluster cpu load to increase, reaching a "red"
>>>>> state in opscenter (load from 3-4 to 20+). This happens once the node is
>>>>> back online.
>>>>> >
>>>>> > The restarted node uses 100 % cpu for 5 - 10 min and sometimes drop
>>>>> mutations.
>>>>> >
>>>>> > I have tried to throttle handoff to 256 (instead of 1024), yet it
>>>>> doesn't seems to help that much.
>>>>> >
>>>>> > Disks are not the bottleneck. PARNEW GC increase a bit, but nothing
>>>>> problematic I think.
>>>>> >
>>>>> > Basically, what could be happening on node restart ? What is taking
>>>>> that much CPU on every machine ? There is no steal or iowait.
>>>>> >
>>>>> > What can I try to tune ?
>>>>> >
>>>>>
>>>>
>>>>
>>>
>>
>

Re: restarting node makes cpu load of the entire cluster rise

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
This last command was supposed to be a best practice a few years ago; I hope
it is still the case. I just added the recent "nodetool disablebinary"
part...


2014-06-18 14:36 GMT+02:00 Alain RODRIGUEZ <ar...@gmail.com>:

> Thanks a lot for taking time to check the log.
>
> We just switched from 400M to 1600M NEW size in the cassandra-env.sh. It
> reduced our latency and the PARNEW GC time / second significantly...
> (described here
> http://tech.shift.com/post/74311817513/cassandra-tuning-the-jvm-for-read-heavy-workloads
> )
>
> Even when we had 400M the restart was behaving this way.
>
> We stop the node using : nodetool disablegossip && nodetool disablethrift
> && nodetool disablebinary && sleep 10 && nodetool drain && sleep 30 &&
> service cassandra stop
>
>
> 2014-06-18 14:23 GMT+02:00 Jonathan Lacefield <jl...@datastax.com>:
>
> There are several long Parnew pauses that were recorded during startup.
>>  The young gen size looks large too, if I am reading that line correctly.
>>  Did you happen to overwrite the default settings for MAX_HEAP and/or NEW
>> size in the cassandra-env.sh?  The large you gen size, set via the env.sh
>> file, could be causing longer than typical pauses, which could make your
>> node appear to be unresponsive and have high CPU (CPU for the ParNew GC
>> event).
>>
>> Check out this one - INFO 11:42:51,939 GC for ParNew: 2148 ms for 2
>> collections, 1256307568 used; max is 8422162432
>> That is a 2 second GC pause.  That's very high for ParNew.  We typically
>> want a lot of tiny ParNew events as opposed to large, and less frequent,
>> ParNew events.
>>
>> One other thing that was noticed, was that the node had a lot of log
>> segment replay's during startup.  You could avoid these, or minimize them,
>> by preforming a flush or drain before stopping and starting Cassandra.
>>  This will flush memtables and clear your log segments.
>>
>>
>>
>> Jonathan Lacefield
>> Solutions Architect, DataStax
>> (404) 822 3487
>>  <http://www.linkedin.com/in/jlacefield>
>>
>> <http://www.datastax.com/cassandrasummit14>
>>
>>
>>
>> On Wed, Jun 18, 2014 at 8:05 AM, Alain RODRIGUEZ <ar...@gmail.com>
>> wrote:
>>
>>> A simple restart of a node with no changes give this result.
>>>
>>> logs output : https://gist.github.com/arodrime/db9ab152071d1ad39f26
>>>
>>> Here are some screenshot:
>>>
>>> - htop from a node immediatly after restarting
>>> - opscenter ring view (show load cpu on all nodes)
>>> - opscenter dashboard shows the impact of a restart on latency (can
>>> affect writes or reads, it depends, reaction seems to be quite random)
>>>
>>>
>>> 2014-06-18 13:35 GMT+02:00 Jonathan Lacefield <jl...@datastax.com>:
>>>
>>> Hello
>>>>
>>>>   Have you checked the log file to see what's happening during startup
>>>> ?   What caused the rolling restart?  Did you preform an upgrade or
>>>> change a config?
>>>>
>>>> > On Jun 18, 2014, at 5:40 AM, Alain RODRIGUEZ <ar...@gmail.com>
>>>> wrote:
>>>> >
>>>> > Hi guys
>>>> >
>>>> > Using 1.2.11, when I try to rolling restart the cluster, any node I
>>>> restart makes the whole cluster cpu load to increase, reaching a "red"
>>>> state in opscenter (load from 3-4 to 20+). This happens once the node is
>>>> back online.
>>>> >
>>>> > The restarted node uses 100 % cpu for 5 - 10 min and sometimes drop
>>>> mutations.
>>>> >
>>>> > I have tried to throttle handoff to 256 (instead of 1024), yet it
>>>> doesn't seems to help that much.
>>>> >
>>>> > Disks are not the bottleneck. PARNEW GC increase a bit, but nothing
>>>> problematic I think.
>>>> >
>>>> > Basically, what could be happening on node restart ? What is taking
>>>> that much CPU on every machine ? There is no steal or iowait.
>>>> >
>>>> > What can I try to tune ?
>>>> >
>>>>
>>>
>>>
>>
>

Re: restarting node makes cpu load of the entire cluster rise

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
Does anyone have any clue about what is happening in our cluster, given the
information above?

What other information could help you to help me? :-D




2014-06-18 21:07 GMT+02:00 Robert Coli <rc...@eventbrite.com>:

> On Wed, Jun 18, 2014 at 5:36 AM, Alain RODRIGUEZ <ar...@gmail.com>
> wrote:
>
>> We stop the node using : nodetool disablegossip && nodetool disablethrift
>> && nodetool disablebinary && sleep 10 && nodetool drain && sleep 30 &&
>> service cassandra stop
>>
>
> The stuff before "nodetool drain" here is redundant and doesn't actually
> do what you are expecting it to do.
>
> https://issues.apache.org/jira/browse/CASSANDRA-4162
>
> =Rob
>
>

Re: restarting node makes cpu load of the entire cluster rise

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
Hi guys, just for the record, in case someone hits this issue in the future:
it is a bug, fixed in 1.2.12.

I recommend that anyone on this version upgrade their cluster before getting
totally stuck (when adding new machines / a DC / altering keyspaces / ...),
basically anything that uses gossip. We are going to plan a full downtime to
be able to recover, since we are completely stuck.

DataStax helped to find the issue, thanks DuyHai!

For more information => https://issues.apache.org/jira/browse/CASSANDRA-6297
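
For anyone doing the same, the per-node step we are planning looks roughly
like this (a sketch assuming a Debian package install; adjust the package
command to your own setup):

  nodetool drain                     # flush memtables, stop accepting traffic
  sudo service cassandra stop
  sudo apt-get install cassandra     # upgrade to 1.2.12 or later
  sudo service cassandra start
  nodetool version && nodetool ring  # confirm the node rejoined on the new version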






2014-06-26 16:16 GMT+02:00 Jonathan Lacefield <jl...@datastax.com>:

> Hello Alain,
>
>   I'm not sure of the root cause of this item.  It may be helpful to use
> DEBUG and start the node to see what's happening as well as watch
> compaction stats or tpstats to understand what is taxing your system.
>
>   The log file you provided shows a large ParNew while replaying commit
> log segments.  Does your app insert very large "rows" or have individual
> columns that are large?
>
>   I quickly reviewed Changes/txt
> https://github.com/apache/cassandra/blob/cassandra-1.2/CHANGES.txt to see
> if anything jumps out as a culprit, but didn't spot anything.
>
>   Sorry i can't be of more help with this one.  It may take some hands-on
> investigation or maybe someone else in the community has experienced this
> issue and can provide feedback.
>
> Thanks,
>
> Jonathan
>
> Jonathan Lacefield
> Solutions Architect, DataStax
> (404) 822 3487
>  <http://www.linkedin.com/in/jlacefield>
>
> <http://www.datastax.com/cassandrasummit14>
>
>
>
> On Wed, Jun 18, 2014 at 3:07 PM, Robert Coli <rc...@eventbrite.com> wrote:
>
>> On Wed, Jun 18, 2014 at 5:36 AM, Alain RODRIGUEZ <ar...@gmail.com>
>> wrote:
>>
>>> We stop the node using : nodetool disablegossip && nodetool
>>> disablethrift && nodetool disablebinary && sleep 10 && nodetool drain &&
>>> sleep 30 && service cassandra stop
>>>
>>
>> The stuff before "nodetool drain" here is redundant and doesn't actually
>> do what you are expecting it to do.
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-4162
>>
>> =Rob
>>
>>
>
>

Re: restarting node makes cpu load of the entire cluster rise

Posted by Jonathan Lacefield <jl...@datastax.com>.
Hello Alain,

  I'm not sure of the root cause of this one.  It may be helpful to enable
DEBUG logging and restart the node to see what's happening, as well as to
watch compaction stats or tpstats to understand what is taxing your system.
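
For example (a sketch; the log4j file path assumes a standard 1.2 layout):

  nodetool compactionstats   # pending and active compactions
  nodetool tpstats           # thread pools: look at pending, blocked, dropped
  # for DEBUG, edit conf/log4j-server.properties:
  #   log4j.rootLogger=DEBUG,stdout,R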

  The log file you provided shows a large ParNew while replaying commit log
segments.  Does your app insert very large "rows" or have individual
columns that are large?

  I quickly reviewed CHANGES.txt
(https://github.com/apache/cassandra/blob/cassandra-1.2/CHANGES.txt) to see
if anything jumps out as a culprit, but didn't spot anything.

  Sorry I can't be of more help with this one.  It may take some hands-on
investigation, or maybe someone else in the community has experienced this
issue and can provide feedback.

Thanks,

Jonathan

Jonathan Lacefield
Solutions Architect, DataStax
(404) 822 3487
<http://www.linkedin.com/in/jlacefield>

<http://www.datastax.com/cassandrasummit14>



On Wed, Jun 18, 2014 at 3:07 PM, Robert Coli <rc...@eventbrite.com> wrote:

> On Wed, Jun 18, 2014 at 5:36 AM, Alain RODRIGUEZ <ar...@gmail.com>
> wrote:
>
>> We stop the node using : nodetool disablegossip && nodetool disablethrift
>> && nodetool disablebinary && sleep 10 && nodetool drain && sleep 30 &&
>> service cassandra stop
>>
>
> The stuff before "nodetool drain" here is redundant and doesn't actually
> do what you are expecting it to do.
>
> https://issues.apache.org/jira/browse/CASSANDRA-4162
>
> =Rob
>
>

Re: restarting node makes cpu load of the entire cluster rise

Posted by Robert Coli <rc...@eventbrite.com>.
On Wed, Jun 18, 2014 at 5:36 AM, Alain RODRIGUEZ <ar...@gmail.com> wrote:

> We stop the node using : nodetool disablegossip && nodetool disablethrift
> && nodetool disablebinary && sleep 10 && nodetool drain && sleep 30 &&
> service cassandra stop
>

The stuff before "nodetool drain" here is redundant and doesn't actually do
what you are expecting it to do.

https://issues.apache.org/jira/browse/CASSANDRA-4162
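
Something like this should be enough (a sketch; drain already flushes
memtables and stops the node from accepting connections):

  nodetool drain && sleep 30 && service cassandra stop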

=Rob

Re: restarting node makes cpu load of the entire cluster rise

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
Thanks a lot for taking the time to check the log.

We just switched from a 400M to a 1600M NEW size in cassandra-env.sh. It
reduced our latency and the ParNew GC time per second significantly
(described here:
http://tech.shift.com/post/74311817513/cassandra-tuning-the-jvm-for-read-heavy-workloads
)
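
For reference, the change was along these lines in cassandra-env.sh (a
sketch; the max heap value is just what the ~8 GB "max" in the GC log line
suggests):

  MAX_HEAP_SIZE="8G"
  HEAP_NEWSIZE="1600M"   # was 400M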

Even when we had 400M, the restart behaved this way.

We stop the node using: nodetool disablegossip && nodetool disablethrift
&& nodetool disablebinary && sleep 10 && nodetool drain && sleep 30 &&
service cassandra stop


2014-06-18 14:23 GMT+02:00 Jonathan Lacefield <jl...@datastax.com>:

> There are several long Parnew pauses that were recorded during startup.
>  The young gen size looks large too, if I am reading that line correctly.
>  Did you happen to overwrite the default settings for MAX_HEAP and/or NEW
> size in the cassandra-env.sh?  The large you gen size, set via the env.sh
> file, could be causing longer than typical pauses, which could make your
> node appear to be unresponsive and have high CPU (CPU for the ParNew GC
> event).
>
> Check out this one - INFO 11:42:51,939 GC for ParNew: 2148 ms for 2
> collections, 1256307568 used; max is 8422162432
> That is a 2 second GC pause.  That's very high for ParNew.  We typically
> want a lot of tiny ParNew events as opposed to large, and less frequent,
> ParNew events.
>
> One other thing that was noticed, was that the node had a lot of log
> segment replay's during startup.  You could avoid these, or minimize them,
> by preforming a flush or drain before stopping and starting Cassandra.
>  This will flush memtables and clear your log segments.
>
>
>
> Jonathan Lacefield
> Solutions Architect, DataStax
> (404) 822 3487
>  <http://www.linkedin.com/in/jlacefield>
>
> <http://www.datastax.com/cassandrasummit14>
>
>
>
> On Wed, Jun 18, 2014 at 8:05 AM, Alain RODRIGUEZ <ar...@gmail.com>
> wrote:
>
>> A simple restart of a node with no changes give this result.
>>
>> logs output : https://gist.github.com/arodrime/db9ab152071d1ad39f26
>>
>> Here are some screenshot:
>>
>> - htop from a node immediatly after restarting
>> - opscenter ring view (show load cpu on all nodes)
>> - opscenter dashboard shows the impact of a restart on latency (can
>> affect writes or reads, it depends, reaction seems to be quite random)
>>
>>
>> 2014-06-18 13:35 GMT+02:00 Jonathan Lacefield <jl...@datastax.com>:
>>
>> Hello
>>>
>>>   Have you checked the log file to see what's happening during startup
>>> ?   What caused the rolling restart?  Did you preform an upgrade or
>>> change a config?
>>>
>>> > On Jun 18, 2014, at 5:40 AM, Alain RODRIGUEZ <ar...@gmail.com>
>>> wrote:
>>> >
>>> > Hi guys
>>> >
>>> > Using 1.2.11, when I try to rolling restart the cluster, any node I
>>> restart makes the whole cluster cpu load to increase, reaching a "red"
>>> state in opscenter (load from 3-4 to 20+). This happens once the node is
>>> back online.
>>> >
>>> > The restarted node uses 100 % cpu for 5 - 10 min and sometimes drop
>>> mutations.
>>> >
>>> > I have tried to throttle handoff to 256 (instead of 1024), yet it
>>> doesn't seems to help that much.
>>> >
>>> > Disks are not the bottleneck. PARNEW GC increase a bit, but nothing
>>> problematic I think.
>>> >
>>> > Basically, what could be happening on node restart ? What is taking
>>> that much CPU on every machine ? There is no steal or iowait.
>>> >
>>> > What can I try to tune ?
>>> >
>>>
>>
>>
>

Re: restarting node makes cpu load of the entire cluster rise

Posted by Jonathan Lacefield <jl...@datastax.com>.
There are several long ParNew pauses recorded during startup.  The young gen
size looks large too, if I am reading that line correctly.  Did you happen to
override the default settings for MAX_HEAP and/or NEW size in
cassandra-env.sh?  A large young gen size, set via the env.sh file, could be
causing longer-than-typical pauses, which could make your node appear
unresponsive and show high CPU (CPU from the ParNew GC event).
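
A quick way to check (a sketch, assuming a package install path):

  # any lines found here mean the defaults were explicitly overridden
  grep -E '^(MAX_HEAP_SIZE|HEAP_NEWSIZE)=' /etc/cassandra/cassandra-env.sh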

Check out this one: INFO 11:42:51,939 GC for ParNew: 2148 ms for 2
collections, 1256307568 used; max is 8422162432
That is a 2-second GC pause (with roughly 1.2 GB of an ~8 GB heap in use),
which is very high for ParNew.  We typically want many tiny ParNew events
rather than large, less frequent ones.

One other thing that was noticed is that the node had a lot of commit log
segment replays during startup.  You could avoid or minimize these by
performing a flush or drain before stopping and restarting Cassandra.
 This will flush memtables and clear your log segments.
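
For example (a sketch of the pre-stop step):

  # write memtables to SSTables so there is little commit log left to replay
  nodetool flush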



Jonathan Lacefield
Solutions Architect, DataStax
(404) 822 3487
<http://www.linkedin.com/in/jlacefield>

<http://www.datastax.com/cassandrasummit14>



On Wed, Jun 18, 2014 at 8:05 AM, Alain RODRIGUEZ <ar...@gmail.com> wrote:

> A simple restart of a node with no changes give this result.
>
> logs output : https://gist.github.com/arodrime/db9ab152071d1ad39f26
>
> Here are some screenshot:
>
> - htop from a node immediatly after restarting
> - opscenter ring view (show load cpu on all nodes)
> - opscenter dashboard shows the impact of a restart on latency (can affect
> writes or reads, it depends, reaction seems to be quite random)
>
>
> 2014-06-18 13:35 GMT+02:00 Jonathan Lacefield <jl...@datastax.com>:
>
> Hello
>>
>>   Have you checked the log file to see what's happening during startup
>> ?   What caused the rolling restart?  Did you preform an upgrade or
>> change a config?
>>
>> > On Jun 18, 2014, at 5:40 AM, Alain RODRIGUEZ <ar...@gmail.com>
>> wrote:
>> >
>> > Hi guys
>> >
>> > Using 1.2.11, when I try to rolling restart the cluster, any node I
>> restart makes the whole cluster cpu load to increase, reaching a "red"
>> state in opscenter (load from 3-4 to 20+). This happens once the node is
>> back online.
>> >
>> > The restarted node uses 100 % cpu for 5 - 10 min and sometimes drop
>> mutations.
>> >
>> > I have tried to throttle handoff to 256 (instead of 1024), yet it
>> doesn't seems to help that much.
>> >
>> > Disks are not the bottleneck. PARNEW GC increase a bit, but nothing
>> problematic I think.
>> >
>> > Basically, what could be happening on node restart ? What is taking
>> that much CPU on every machine ? There is no steal or iowait.
>> >
>> > What can I try to tune ?
>> >
>>
>
>

Re: restarting node makes cpu load of the entire cluster rise

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
A simple restart of a node with no changes gives this result.

Log output: https://gist.github.com/arodrime/db9ab152071d1ad39f26

Here are some screenshots:

- htop from a node immediately after restarting (see the thread-level sketch
below)
- OpsCenter ring view (shows the CPU load on all nodes)
- OpsCenter dashboard showing the impact of a restart on latency (it can
affect writes or reads; which one seems to be quite random)
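
To go with the htop view, the hot threads can be mapped back to what
Cassandra is doing (a sketch; the PID lookup assumes the standard process
name):

  pid=$(pgrep -f CassandraDaemon)
  top -H -p "$pid"       # per-thread CPU; note the busiest TIDs
  printf '%x\n' <TID>    # convert a busy TID to hex...
  jstack "$pid" | grep -A 20 'nid=0x<hex tid>'   # ...and find its stack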


2014-06-18 13:35 GMT+02:00 Jonathan Lacefield <jl...@datastax.com>:

> Hello
>
>   Have you checked the log file to see what's happening during startup
> ?   What caused the rolling restart?  Did you preform an upgrade or
> change a config?
>
> > On Jun 18, 2014, at 5:40 AM, Alain RODRIGUEZ <ar...@gmail.com> wrote:
> >
> > Hi guys
> >
> > Using 1.2.11, when I try to rolling restart the cluster, any node I
> restart makes the whole cluster cpu load to increase, reaching a "red"
> state in opscenter (load from 3-4 to 20+). This happens once the node is
> back online.
> >
> > The restarted node uses 100 % cpu for 5 - 10 min and sometimes drop
> mutations.
> >
> > I have tried to throttle handoff to 256 (instead of 1024), yet it
> doesn't seems to help that much.
> >
> > Disks are not the bottleneck. PARNEW GC increase a bit, but nothing
> problematic I think.
> >
> > Basically, what could be happening on node restart ? What is taking that
> much CPU on every machine ? There is no steal or iowait.
> >
> > What can I try to tune ?
> >
>

Re: restarting node makes cpu load of the entire cluster rise

Posted by Jonathan Lacefield <jl...@datastax.com>.
Hello

  Have you checked the log file to see what's happening during startup?
What caused the rolling restart?  Did you perform an upgrade or
change a config?

> On Jun 18, 2014, at 5:40 AM, Alain RODRIGUEZ <ar...@gmail.com> wrote:
>
> Hi guys
>
> Using 1.2.11, when I try to rolling restart the cluster, any node I restart makes the whole cluster cpu load to increase, reaching a "red" state in opscenter (load from 3-4 to 20+). This happens once the node is back online.
>
> The restarted node uses 100 % cpu for 5 - 10 min and sometimes drop mutations.
>
> I have tried to throttle handoff to 256 (instead of 1024), yet it doesn't seems to help that much.
>
> Disks are not the bottleneck. PARNEW GC increase a bit, but nothing problematic I think.
>
> Basically, what could be happening on node restart ? What is taking that much CPU on every machine ? There is no steal or iowait.
>
> What can I try to tune ?
>