Posted to user@cassandra.apache.org by Oleksandr Shulgin <ol...@zalando.de> on 2018/09/26 07:34:11 UTC

Odd CPU utilization spikes on 1 node out of 30 during repair

Hello,

On our production cluster of 30 Apache Cassandra 3.0.17 nodes we have
observed that only one node started to show about 2 times the CPU
utilization as compared to the rest (see screenshot): up to 30% vs. ~15% on
average for the other nodes.

This started more or less immediately after the repair was started (using
Cassandra Reaper, parallel, non-incremental) and lasted until we restarted
this node.  After the restart the CPU use is in line with the rest of the
nodes.

All other metrics that we are monitoring for these nodes were in line with
the rest of the cluster.

The logs on the node don't show anything odd: no extra warn/error/info
messages, and no more minor or major GC runs than on the other nodes during
the time we were observing this behavior.

What could be the reason for this behavior?  How should we debug it if that
happens next time instead of just restarting?
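One possible approach for next time, instead of restarting, is to map the hottest OS threads inside the Cassandra JVM to Java stack traces. A sketch (the TID value 12345 is a placeholder; substitute the busiest thread id that top reports):

```shell
# Find the Cassandra JVM and list its threads sorted by CPU usage.
CASS_PID=$(pgrep -f CassandraDaemon)
top -H -b -n 1 -p "$CASS_PID" | head -20

# Convert the hottest thread id (TID column) to the hex "nid" that
# jstack uses to label threads; e.g. TID 12345 becomes nid=0x3039.
printf 'nid=0x%x\n' 12345

# Locate that thread's stack in a thread dump to see what it is doing.
jstack "$CASS_PID" | grep -A 20 'nid=0x3039'
```

Comparing a few dumps taken several seconds apart usually shows whether the CPU is going into repair/validation, compaction, GC, or something else entirely.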

Cheers,
--
Alex

Re: Odd CPU utilization spikes on 1 node out of 30 during repair

Posted by Oleksandr Shulgin <ol...@zalando.de>.
On Thu, Sep 27, 2018 at 2:24 AM Anup Shirolkar <
anup.shirolkar@instaclustr.com> wrote:

>
> Most of the things look ok from your setup.
>
> You can enable debug logs for the duration of the repair.
> This will help identify whether you are hitting a bug or some other cause
> of unusual behaviour.
>
> Just a remote possibility: do you have other things running on the nodes
> besides Cassandra?  Do they consume additional CPU at times?
> You can check per-process CPU consumption to keep an eye on non-Cassandra
> processes.
>

That's a good point.  These instances are dedicated to running Cassandra,
so we didn't think to check whether any other processes might be the
cause...  There are, of course, some additional processes (like a metrics
exporter and log-shipping agents), but they normally do not contribute any
visible amount of CPU utilization.
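For the next occurrence, per-process CPU history is easy to capture with standard Linux tools; a sketch, assuming the sysstat package is installed:

```shell
# Sample per-process CPU usage every 5 seconds, 12 times (one minute);
# any non-Cassandra process burning CPU will show up here.
pidstat -u 5 12

# Or a one-shot snapshot of the current top CPU consumers.
ps -eo pid,comm,%cpu --sort=-%cpu | head -10
```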

Cheers,
--
Alex

Re: Odd CPU utilization spikes on 1 node out of 30 during repair

Posted by Anup Shirolkar <an...@instaclustr.com>.
Hi,

Most of the things look ok from your setup.

You can enable debug logs for the duration of the repair.
This will help identify whether you are hitting a bug or some other cause
of unusual behaviour.

Just a remote possibility: do you have other things running on the nodes
besides Cassandra?  Do they consume additional CPU at times?
You can check per-process CPU consumption to keep an eye on non-Cassandra
processes.
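Repair-related debug logging can be toggled at runtime, without a restart; a sketch using nodetool (the repair package name is the standard Cassandra 3.x one):

```shell
# Raise the repair code path to DEBUG for the duration of the repair run.
nodetool setlogginglevel org.apache.cassandra.repair DEBUG

# Afterwards, reset all logging levels to the configured defaults.
nodetool setlogginglevel
```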


Regards,

Anup Shirolkar

On Wed, 26 Sep 2018 at 21:32, Oleksandr Shulgin <
oleksandr.shulgin@zalando.de> wrote:

> On Wed, Sep 26, 2018 at 1:07 PM Anup Shirolkar <
> anup.shirolkar@instaclustr.com> wrote:
>
>>
>> Looking at the information you have provided, the increased CPU
>> utilisation could be because of the repair running on the node.
>> Repairs are resource-intensive operations.
>>
>> Restarting the node should have halted the repair operation, getting the
>> CPU back to normal.
>>
>
> The repair was running on all nodes at the same time, yet only one node
> had CPU utilization significantly different from the rest of the nodes.
> As I've mentioned: we are running non-incremental parallel repair using
> Cassandra Reaper.
> After the node was restarted, new repair tasks were given to it by Reaper
> and it was doing repair as previously, but this time without exhibiting
> the odd behavior.
>
> In some cases, repairs trigger additional operations, e.g. compactions and
>> anti-compactions.
>> These operations could cause extra CPU utilisation.
>> What is the compaction strategy used on the majority of keyspaces?
>>
>
> For the 2 tables involved in this regular repair we are using
> TimeWindowCompactionStrategy with time windows of 30 days.
>
> Talking about CPU utilisation *percentage*: although it has doubled, the
>> absolute increase is 15 percentage points.
>> It would be interesting to know the number of CPU cores on these nodes to
>> judge the absolute increase in CPU utilisation.
>>
>
> All nodes are using the same hardware on AWS EC2: r4.xlarge, with 4
> vCPUs.
>
> You should try to find the root cause behind the behaviour and decide on a
>> course of action.
>>
>
> Sure, that's why I was asking for ideas how to find the root cause. :-)
>
> Effective use of monitoring and logs can help you identify the root cause.
>>
>
> As I've mentioned, we do have monitoring and I've checked the logs, but
> that hasn't helped to identify the issue so far.
>
> Regards,
> --
> Alex
>
>

Re: Odd CPU utilization spikes on 1 node out of 30 during repair

Posted by Oleksandr Shulgin <ol...@zalando.de>.
On Wed, Sep 26, 2018 at 1:07 PM Anup Shirolkar <
anup.shirolkar@instaclustr.com> wrote:

>
> Looking at the information you have provided, the increased CPU
> utilisation could be because of the repair running on the node.
> Repairs are resource-intensive operations.
>
> Restarting the node should have halted the repair operation, getting the
> CPU back to normal.
>

The repair was running on all nodes at the same time, yet only one node
had CPU utilization significantly different from the rest of the nodes.
As I've mentioned: we are running non-incremental parallel repair using
Cassandra Reaper.
After the node was restarted, new repair tasks were given to it by Reaper
and it was doing repair as previously, but this time without exhibiting
the odd behavior.

In some cases, repairs trigger additional operations, e.g. compactions and
> anti-compactions.
> These operations could cause extra CPU utilisation.
> What is the compaction strategy used on the majority of keyspaces?
>

For the 2 tables involved in this regular repair we are using
TimeWindowCompactionStrategy with time windows of 30 days.
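For reference, a 30-day TWCS window like ours is declared per table via the compaction options; a sketch (the keyspace/table name metrics.events is made up):

```shell
# Hypothetical table; shows the standard TWCS options for a 30-day window.
cqlsh -e "ALTER TABLE metrics.events
          WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                             'compaction_window_unit': 'DAYS',
                             'compaction_window_size': 30};"
```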

Talking about CPU utilisation *percentage*: although it has doubled, the
> absolute increase is 15 percentage points.
> It would be interesting to know the number of CPU cores on these nodes to
> judge the absolute increase in CPU utilisation.
>

All nodes are using the same hardware on AWS EC2: r4.xlarge, with 4
vCPUs.

You should try to find the root cause behind the behaviour and decide on a
> course of action.
>

Sure, that's why I was asking for ideas how to find the root cause. :-)

Effective use of monitoring and logs can help you identify the root cause.
>

As I've mentioned, we do have monitoring and I've checked the logs, but
that hasn't helped to identify the issue so far.

Regards,
--
Alex

Re: Odd CPU utilization spikes on 1 node out of 30 during repair

Posted by Anup Shirolkar <an...@instaclustr.com>.
Hi,

Looking at the information you have provided, the increased CPU utilisation
could be because of the repair running on the node.
Repairs are resource-intensive operations.

Restarting the node should have halted the repair operation, getting the
CPU back to normal.

If you regularly run repairs but have observed this increase in CPU
utilisation for the first time, it could be an area of concern.
Otherwise, repairs utilising extra CPU is normal.

In some cases, repairs trigger additional operations, e.g. compactions and
anti-compactions.
These operations could cause extra CPU utilisation.
What is the compaction strategy used on the majority of keyspaces?

Talking about CPU utilisation *percentage*: although it has doubled, the
absolute increase is 15 percentage points.
It would be interesting to know the number of CPU cores on these nodes to
judge the absolute increase in CPU utilisation.

You should try to find the root cause behind the behaviour and decide on a
course of action.
Effective use of monitoring and logs can help you identify the root cause.

Regards,
Anup

On Wed, 26 Sep 2018 at 17:34, Oleksandr Shulgin <
oleksandr.shulgin@zalando.de> wrote:

> Hello,
>
> On our production cluster of 30 Apache Cassandra 3.0.17 nodes we have
> observed that only one node started to show about 2 times the CPU
> utilization as compared to the rest (see screenshot): up to 30% vs. ~15% on
> average for the other nodes.
>
> This started more or less immediately after the repair was started (using
> Cassandra Reaper, parallel, non-incremental) and lasted until we restarted
> this node.  After the restart the CPU use is in line with the rest of the
> nodes.
>
> All other metrics that we are monitoring for these nodes were in line with
> the rest of the cluster.
>
> The logs on the node don't show anything odd: no extra warn/error/info
> messages, and no more minor or major GC runs than on the other nodes during
> the time we were observing this behavior.
>
> What could be the reason for this behavior?  How should we debug it if
> that happens next time instead of just restarting?
>
> Cheers,
> --
> Alex
>
>



-- 

Anup Shirolkar

Consultant

+61 420 602 338
