Posted to user@cassandra.apache.org by onmstester onmstester <on...@zoho.com.INVALID> on 2018/10/20 10:18:55 UTC

High CPU usage on some of the nodes due to message coalesce

3 nodes in my cluster have 100% CPU usage, and most of it is used by org.apache.cassandra.util.coalesceInternal and SepWorker.run. The most active threads are the messaging-service-incoming threads. The other nodes are normal. The cluster has 30 nodes, using a rack-aware strategy with 10 racks, each having 3 nodes; the problematic nodes are all configured for one rack. Under normal write load, system.log reports many dropped hint messages (cross node). There are also a lot of ParNew GCs of about 700-1000 ms, and the dedicated commit log disk is utilized at about 80-90%. On startup of these 3 nodes there are a lot of "updating topology" log entries (thousands of them pending). Using iperf, I'm sure the network is OK. Checking NTP and mutations on each node, load is balanced among the nodes. We are using Apache Cassandra 3.11.2. I cannot figure out the root cause of the problem, although there are some obvious symptoms.

Best Regards
Sent using Zoho Mail
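PS: since coalesceInternal appears to belong to the outbound message coalescing code, it may be worth checking the coalescing settings in cassandra.yaml on the hot nodes. A minimal sketch of the relevant knobs in 3.11 (the values shown are only illustrative, not a recommendation):

    # cassandra.yaml -- outbound message coalescing knobs (illustrative values)
    otc_coalescing_strategy: DISABLED      # or TIMEHORIZON / MOVINGAVERAGE / FIXED
    otc_coalescing_window_us: 200          # max time (us) to wait for more messages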

Fwd: Re: Re: High CPU usage on some of the nodes due to message coalesce

Posted by onmstester onmstester <on...@zoho.com.INVALID>.
Any cron or other scheduler running on those nodes? No.
Lots of Java processes running simultaneously? No, just Apache Cassandra.
Heavy repair continuously running? None.
Lots of pending compactions? None; the CPU goes to 100% in the first seconds of insert (write load), so no memtable has been flushed yet.
Is the number of CPU cores the same in all the nodes? Yes, 12.
Did you try rebooting one of the nodes? Yes, I cold-rebooted all of them once, no luck!
Thanks for your time
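PS: for what it's worth, those checks can be repeated from the command line; a quick sketch, assuming nodetool is on each node's PATH:

    nodetool tpstats          # thread-pool stats, including dropped message counts
    nodetool compactionstats  # pending and active compactions
    nodetool gcstats          # GC pause statistics since the last call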

Re: Re: High CPU usage on some of the nodes due to message coalesce

Posted by shalom sagges <sh...@gmail.com>.
I guess the code experts could shed more light on
org.apache.cassandra.util.coalesceInternal and SepWorker.run.
I'll just add anything I can think of....

Any cron or other scheduler running on those nodes?
Lots of Java processes running simultaneously?
Heavy repair continuously running?
Lots of pending compactions?
Is the number of CPU cores the same in all the nodes?
Did you try rebooting one of the nodes?
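A few shell checks that could answer the questions above (just a sketch, assuming standard Linux tooling plus nodetool on the node):

    crontab -l                                  # any cron jobs scheduled for this user?
    ps -eo pid,pcpu,comm --sort=-pcpu | head    # other heavy processes besides Cassandra?
    nodetool compactionstats                    # pending compactions
    nodetool netstats                           # streaming sessions (e.g. repair) in flight
    nproc                                       # number of CPU cores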


On Sun, Oct 21, 2018 at 4:55 PM onmstester onmstester
<on...@zoho.com.invalid> wrote:

>
> What takes the most CPU? System or User?
>
>
>  most of it is used by org.apache.cassandra.util.coalesceInternal and
> SepWorker.run
>
> Did you try removing a problematic node and installing a brand new one
> (instead of re-adding)?
>
> I did not install a new node, but did remove the problematic node, and CPU
> load across the whole cluster became normal again
>
> When you decommissioned these nodes, did the high CPU "move" to other
> nodes (probably data model/query issues) or was it completely gone? (server
> issues)
>
> it was completely gone
>
>

Re: Re: High CPU usage on some of the nodes due to message coalesce

Posted by onmstester onmstester <on...@zoho.com.INVALID>.
What takes the most CPU? System or User? Most of it is used by org.apache.cassandra.util.coalesceInternal and SepWorker.run.
Did you try removing a problematic node and installing a brand new one (instead of re-adding)? I did not install a new node, but did remove the problematic node, and CPU load across the whole cluster became normal again.
When you decommissioned these nodes, did the high CPU "move" to other nodes (probably data model/query issues) or was it completely gone? (server issues) It was completely gone.

Re: Re: High CPU usage on some of the nodes due to message coalesce

Posted by shalom sagges <sh...@gmail.com>.
What takes the most CPU? System or User?
Did you try removing a problematic node and installing a brand new one
(instead of re-adding)?
When you decommissioned these nodes, did the high CPU "move" to other nodes
(probably data model/query issues) or was it completely gone? (server
issues)


On Sun, Oct 21, 2018 at 3:52 PM onmstester onmstester
<on...@zoho.com.invalid> wrote:

> I don't think the root cause is related to the Cassandra config, because the
> nodes are homogeneous and the config is the same for all of them (16GB heap
> with default GC); also the mutation counter and Native Transport counter are
> the same on all of the nodes, but only these 3 nodes are experiencing 100% CPU
> usage (the others have less than 20% CPU usage)
> I even decommissioned these 3 nodes from the cluster and re-added them, but
> still the same
> The cluster is OK without these 3 nodes (in a state where these nodes are
> decommissioned)
>
> Sent using Zoho Mail <https://www.zoho.com/mail/>
>
>
> ============ Forwarded message ============
> From : Chris Lohfink <cl...@apple.com>
> To : <us...@cassandra.apache.org>
> Date : Sat, 20 Oct 2018 23:24:03 +0330
> Subject : Re: High CPU usage on some of the nodes due to message coalesce
> ============ Forwarded message ============
>
> 1s young GCs are horrible and likely the cause of *some* of your bad metrics.
> How large are your mutations/query results, and what GC/heap settings are
> you using?
>
> You can use https://github.com/aragozin/jvm-tools to see the threads
> generating allocation pressure and using the CPU (ttop) and what garbage is
> being created (hh --dead-young).
>
> Just a shot in the dark, I would *guess* you have rather large mutations
> putting pressure on the commitlog and heap. G1 with a larger heap might help in
> that scenario to reduce fragmentation and adjust its eden and survivor
> regions to the allocation rate better (but give it a bigger reserve space),
> but there's a limit to what can help if you can't change your workload.
> Without more info on schema etc. it's hard to tell, but maybe that can help
> give you some ideas on places to look. It could just as likely be repair
> coordination, wide partition reads, or compactions, so you need to look more at
> what within the app is causing the pressure to know if it's possible to
> improve with settings or if the load your application is producing exceeds
> what your cluster can handle (needs more nodes).
>
> Chris
>
> On Oct 20, 2018, at 5:18 AM, onmstester onmstester <
> onmstester@zoho.com.INVALID> wrote:
>
> 3 nodes in my cluster have 100% CPU usage, and most of it is used by
> org.apache.cassandra.util.coalesceInternal and SepWorker.run.
> The most active threads are the messaging-service-incoming threads.
> The other nodes are normal. The cluster has 30 nodes, using a rack-aware
> strategy with 10 racks, each having 3 nodes. The problematic nodes are all
> configured for one rack. Under normal write load, system.log reports many
> dropped hint messages (cross node). There are also a lot of ParNew GCs of
> about 700-1000 ms, and the dedicated commit log disk is utilized at about
> 80-90%. On startup of these 3 nodes there are a lot of "updating topology"
> log entries (thousands of them pending).
> Using iperf, I'm sure that the network is OK.
> Checking NTP and mutations on each node, load is balanced among the nodes.
> We are using Apache Cassandra 3.11.2.
> I cannot figure out the root cause of the problem, although there are
> some obvious symptoms.
>
> Best Regards
>
> Sent using Zoho Mail <https://www.zoho.com/mail/>
>
>
>
>

Fwd: Re: High CPU usage on some of the nodes due to message coalesce

Posted by onmstester onmstester <on...@zoho.com.INVALID>.
I don't think the root cause is related to the Cassandra config, because the nodes are homogeneous and the config is the same for all of them (16GB heap with default GC); also the mutation counter and Native Transport counter are the same on all of the nodes, but only these 3 nodes are experiencing 100% CPU usage (the others have less than 20% CPU usage). I even decommissioned these 3 nodes from the cluster and re-added them, but still the same. The cluster is OK without these 3 nodes (in a state where these nodes are decommissioned).

Sent using Zoho Mail

============ Forwarded message ============
From : Chris Lohfink <cl...@apple.com>
To : <us...@cassandra.apache.org>
Date : Sat, 20 Oct 2018 23:24:03 +0330
Subject : Re: High CPU usage on some of the nodes due to message coalesce
============ Forwarded message ============

1s young GCs are horrible and likely the cause of some of your bad metrics. How large are your mutations/query results, and what GC/heap settings are you using?

You can use https://github.com/aragozin/jvm-tools to see the threads generating allocation pressure and using the CPU (ttop) and what garbage is being created (hh --dead-young).

Just a shot in the dark, I would guess you have rather large mutations putting pressure on the commitlog and heap. G1 with a larger heap might help in that scenario to reduce fragmentation and adjust its eden and survivor regions to the allocation rate better (but give it a bigger reserve space), but there's a limit to what can help if you can't change your workload. Without more info on schema etc. it's hard to tell, but maybe that can help give you some ideas on places to look. It could just as likely be repair coordination, wide partition reads, or compactions, so you need to look more at what within the app is causing the pressure to know if it's possible to improve with settings or if the load your application is producing exceeds what your cluster can handle (needs more nodes).

Chris

On Oct 20, 2018, at 5:18 AM, onmstester onmstester <on...@zoho.com.INVALID> wrote:

3 nodes in my cluster have 100% CPU usage, and most of it is used by org.apache.cassandra.util.coalesceInternal and SepWorker.run. The most active threads are the messaging-service-incoming threads. The other nodes are normal. The cluster has 30 nodes, using a rack-aware strategy with 10 racks, each having 3 nodes. The problematic nodes are all configured for one rack. Under normal write load, system.log reports many dropped hint messages (cross node). There are also a lot of ParNew GCs of about 700-1000 ms, and the dedicated commit log disk is utilized at about 80-90%. On startup of these 3 nodes there are a lot of "updating topology" log entries (thousands of them pending). Using iperf, I'm sure that the network is OK. Checking NTP and mutations on each node, load is balanced among the nodes. We are using Apache Cassandra 3.11.2. I cannot figure out the root cause of the problem, although there are some obvious symptoms.

Best Regards
Sent using Zoho Mail

Re: High CPU usage on some of the nodes due to message coalesce

Posted by Chris Lohfink <cl...@apple.com>.
1s young GCs are horrible and likely the cause of some of your bad metrics. How large are your mutations/query results, and what GC/heap settings are you using?

You can use https://github.com/aragozin/jvm-tools to see the threads generating allocation pressure and using the CPU (ttop) and what garbage is being created (hh --dead-young).
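For example, something along these lines (a sketch, assuming the sjk fat jar built from that repo and the Cassandra JVM's PID):

    # top threads by CPU usage
    java -jar sjk.jar ttop -p <cassandra_pid> -o CPU -n 20
    # class histogram of objects that died young (allocation garbage)
    java -jar sjk.jar hh -p <cassandra_pid> --dead-young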

Just a shot in the dark, I would guess you have rather large mutations putting pressure on the commitlog and heap. G1 with a larger heap might help in that scenario to reduce fragmentation and adjust its eden and survivor regions to the allocation rate better (but give it a bigger reserve space), but there's a limit to what can help if you can't change your workload. Without more info on schema etc. it's hard to tell, but maybe that can help give you some ideas on places to look. It could just as likely be repair coordination, wide partition reads, or compactions, so you need to look more at what within the app is causing the pressure to know if it's possible to improve with settings or if the load your application is producing exceeds what your cluster can handle (needs more nodes).
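As a rough illustration (not tuned values, just where those switches live in conf/jvm.options on 3.11; keep whatever heap size you already run):

    # conf/jvm.options -- switching from CMS/ParNew to G1 (illustrative values only)
    #-XX:+UseParNewGC                # comment out the whole CMS/ParNew section
    #-XX:+UseConcMarkSweepGC
    -XX:+UseG1GC
    -XX:MaxGCPauseMillis=300
    -XX:G1ReservePercent=20          # the "bigger reserve space"
    -Xms16G
    -Xmx16G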

Chris

> On Oct 20, 2018, at 5:18 AM, onmstester onmstester <on...@zoho.com.INVALID> wrote:
> 
> 3 nodes in my cluster have 100% CPU usage, and most of it is used by org.apache.cassandra.util.coalesceInternal and SepWorker.run.
> The most active threads are the messaging-service-incoming threads.
> The other nodes are normal. The cluster has 30 nodes, using a rack-aware strategy with 10 racks, each having 3 nodes. The problematic nodes are all configured for one rack. Under normal write load, system.log reports many dropped hint messages (cross node). There are also a lot of ParNew GCs of about 700-1000 ms, and the dedicated commit log disk is utilized at about 80-90%. On startup of these 3 nodes there are a lot of "updating topology" log entries (thousands of them pending).
> Using iperf, I'm sure that the network is OK.
> Checking NTP and mutations on each node, load is balanced among the nodes.
> We are using Apache Cassandra 3.11.2.
> I cannot figure out the root cause of the problem, although there are some obvious symptoms.
> 
> Best Regards
> Sent using Zoho Mail <https://www.zoho.com/mail/>
> 
>