Posted to user@cassandra.apache.org by Roman Tkachenko <ro...@mailgunhq.com> on 2015/09/10 07:05:40 UTC

High CPU usage on some of nodes

Hey guys,

We've been having issues in the past couple of days with CPU usage / load
average suddenly skyrocketing on some nodes of the cluster, affecting
performance significantly so that the majority of requests start timing out.
It can go on for several hours, with CPU spiking through the roof then coming
back down to normal and so on. Weirdly, it affects only a subset of nodes and
it's always the same ones. The boxes Cassandra is running on are pretty beefy,
24 cores, and these CPU spikes go up to >1000%.

What is the best way to debug this kind of issue and find out what
Cassandra is doing during spikes like this? It doesn't seem to be compaction
related, as sometimes during these spikes "nodetool compactionstats" says no
compactions are running.

Thanks!

Re: High CPU usage on some of nodes

Posted by Graham Sanderson <gr...@vast.com>.
Again, I haven’t read this thread from the beginning so I don’t know which node is which, but if nodes pause for long GCs, other nodes will likely be saving hints (assuming you are writing at the time), and those hints will be delivered once the machines become responsive again. I’m just guessing, though. Take a look at the hinting metrics.
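
For example (a rough sketch, assuming 2.1-era nodetool and the default log location; paths are illustrative):

# Hint delivery thread pool activity on each node
nodetool tpstats | grep -i hint

# Hint delivery messages on the nodes that were saving hints
grep -i "hinted handoff" /var/log/cassandra/system.log | tail -20
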
> On Sep 11, 2015, at 2:45 PM, Roman Tkachenko <ro...@mailgunhq.com> wrote:
> 
> I have another datapoint from our monitoring system that shows huge outbound network traffic increase for the affected boxes during these spikes:
> 
> <Screen Shot 2015-09-11 at 12.35.16 PM.png>
> 
> Looking at inbound traffic, it is increased on nodes other than these (purple, yellow and blue) so it does look like some kind of excessive internode communication is going on between these 3 nodes and the rest of the cluster.
> 
> What could these network spikes be a sign of?
> 
> 
> On Thu, Sep 10, 2015 at 12:00 PM, Graham Sanderson <graham@vast.com> wrote:
> Haven’t been following this thread, but we run beefy machines with 8gig new gen, 12 gig old gen (down from 16g since moving memtables off heap, we can probably go lower)…
> 
> Apart from making sure you have all the latest -XX: flags from cassandra-env.sh (and MALLOC_ARENA_MAX), I personally would recommend running latest 2.1.x with
> 
> memory_allocator: JEMallocAllocator
> memtable_allocation_type: offheap_objects
> 
> Some people will probably disagree, but it works great for us (rare long pauses sub 2 secs), and if you’re seeing slow GC because of promotion failure of objects 131074 dwords big, then I definitely suggest you give it a try.
> 
>> On Sep 10, 2015, at 1:43 PM, Robert Coli <rcoli@eventbrite.com> wrote:
>> 
>> On Thu, Sep 10, 2015 at 10:54 AM, Roman Tkachenko <roman@mailgunhq.com> wrote:
>> [5 second CMS GC] Is my best shot to play with JVM settings trying to tune garbage collection then?
>> 
>> Yep. As a minor note, if the machines are that beefy, they probably have a lot of RAM, you might wish to consider trying G1 GC and a larger heap.
>> 
>> =Rob
>> 
>>  
> 
> 


Re: High CPU usage on some of nodes

Posted by Roman Tkachenko <ro...@mailgunhq.com>.
I have another datapoint from our monitoring system that shows huge
outbound network traffic increase for the affected boxes during these
spikes:

[image: Inline image 1]

Looking at inbound traffic, it is increased on nodes other than these
(purple, yellow and blue) so it does look like some kind of excessive
internode communication is going on between these 3 nodes and the rest of
the cluster.

What could these network spikes be a sign of?


On Thu, Sep 10, 2015 at 12:00 PM, Graham Sanderson <gr...@vast.com> wrote:

> Haven’t been following this thread, but we run beefy machines with 8gig
> new gen, 12 gig old gen (down from 16g since moving memtables off heap, we
> can probably go lower)…
>
> Apart from making sure you have all the latest -XX: flags from
> cassandra-env.sh (and MALLOC_ARENA_MAX), I personally would recommend
> running latest 2.1.x with
>
> memory_allocator: JEMallocAllocator
> memtable_allocation_type: offheap_objects
>
> Some people will probably disagree, but it works great for us (rare long
> pauses sub 2 secs), and if you’re seeing slow GC because of promotion
> failure of objects 131074 dwords big, then I definitely suggest you give it
> a try.
>
> On Sep 10, 2015, at 1:43 PM, Robert Coli <rc...@eventbrite.com> wrote:
>
> On Thu, Sep 10, 2015 at 10:54 AM, Roman Tkachenko <ro...@mailgunhq.com>
> wrote:
>>
>> [5 second CMS GC] Is my best shot to play with JVM settings trying to
>> tune garbage collection then?
>>
>
> Yep. As a minor note, if the machines are that beefy, they probably have a
> lot of RAM, you might wish to consider trying G1 GC and a larger heap.
>
> =Rob
>
>
>
>
>

Re: High CPU usage on some of nodes

Posted by Graham Sanderson <gr...@vast.com>.
Haven’t been following this thread, but we run beefy machines with 8gig new gen, 12 gig old gen (down from 16g since moving memtables off heap, we can probably go lower)…

Apart from making sure you have all the latest -XX: flags from cassandra-env.sh (and MALLOC_ARENA_MAX), I personally would recommend running latest 2.1.x with

memory_allocator: JEMallocAllocator
memtable_allocation_type: offheap_objects

Some people will probably disagree, but it works great for us (rare long pauses sub 2 secs), and if you’re seeing slow GC because of promotion failure of objects 131074 dwords big, then I definitely suggest you give it a try.
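
For reference, a minimal sketch of where that lives on a stock 2.1 install (jemalloc itself has to be installed on the box for the allocator setting to take effect; the arena value below is illustrative):

# conf/cassandra-env.sh -- cap glibc malloc arenas to keep native memory growth in check
export MALLOC_ARENA_MAX=4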

> On Sep 10, 2015, at 1:43 PM, Robert Coli <rc...@eventbrite.com> wrote:
> 
> On Thu, Sep 10, 2015 at 10:54 AM, Roman Tkachenko <roman@mailgunhq.com> wrote:
> [5 second CMS GC] Is my best shot to play with JVM settings trying to tune garbage collection then?
> 
> Yep. As a minor note, if the machines are that beefy, they probably have a lot of RAM, you might wish to consider trying G1 GC and a larger heap.
> 
> =Rob
> 
>  


Re: High CPU usage on some of nodes

Posted by Robert Coli <rc...@eventbrite.com>.
On Thu, Sep 10, 2015 at 10:54 AM, Roman Tkachenko <ro...@mailgunhq.com>
wrote:
>
> [5 second CMS GC] Is my best shot to play with JVM settings trying to tune
> garbage collection then?
>

Yep. As a minor note, if the machines are that beefy, they probably have a
lot of RAM, you might wish to consider trying G1 GC and a larger heap.
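
A minimal sketch of the G1 side of that in conf/cassandra-env.sh, assuming the stock 2.0/2.1 script (the pause target is illustrative, not a recommendation; the CMS/ParNew -XX flags and the -Xmn new-gen sizing would need to be removed, and MAX_HEAP_SIZE raised separately):

# conf/cassandra-env.sh -- switch the collector to G1
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"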

=Rob

Re: High CPU usage on some of nodes

Posted by Jeff Jirsa <je...@crowdstrike.com>.
With a 5s collection, the problem is almost certainly GC. 

GC pressure can be caused by a number of things, including normal read/write loads, but ALSO compaction calculation (pre-2.1.9 / #9882) and very large partitions (trying to load a very large partition with something like row cache in 2.0 and earlier, or issuing a full row read where the row is larger than you expect). 

You can try to tune the GC behavior, but the underlying problem may be something like a bad data model (which Samuel suggested), and no amount of GC tuning is going to fix trying to do bad things with very big rows. 
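
A quick way to check whether very large partitions are in play (a sketch, assuming 2.1-era nodetool; keyspace and table names are placeholders):

# Per-table stats; look for the compacted partition/row maximum size line
nodetool cfstats <keyspace>.<table>

# Distribution of partition sizes and cell counts per read
nodetool cfhistograms <keyspace> <table>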



From:  Roman Tkachenko
Reply-To:  "user@cassandra.apache.org"
Date:  Thursday, September 10, 2015 at 10:54 AM
To:  "user@cassandra.apache.org"
Subject:  Re: High CPU usage on some of nodes

Thanks for the responses guys. 

I also suspected GC and I guess it could be it, since during the spikes logs are filled with messages like "GC for ConcurrentMarkSweep: 5908 ms for 1 collections, 1986282520 used; max is 8375238656", often right before messages about dropped queries, unlike other, unaffected, nodes that only have "GC for ParNew: 230 ms for 1 collections, 4418571760 used; max is 8375238656" type of messages.

Is my best shot to play with JVM settings trying to tune garbage collection then?


On Thu, Sep 10, 2015 at 6:52 AM, Samuel CARRIERE <sa...@urssaf.fr> wrote:
Hi Roman, 
If it affects only a subset of nodes and it's always the same ones, it could be a "problem" with your data model: maybe some (too) wide rows on these nodes.
If one of your rows is too wide, deserialising the column index of that row can take a lot of resources (disk, RAM, and CPU).
If you are using leveled compaction strategy and you see abnormally big sstables on those nodes, it could be a clue.
Regards, 
Samuel 

Robert Wille <rw...@fold3.com> wrote on 10/09/2015 15:27:41:

> From: Robert Wille <rw...@fold3.com>
> To: "user@cassandra.apache.org" <us...@cassandra.apache.org>,
> Date: 10/09/2015 15:30
> Subject: Re: High CPU usage on some of nodes
> 
> It sounds like it's probably GC. Grep for GC in system.log to verify.
> If it is GC, there are a myriad of issues that could cause it, but 
> at least you’ve narrowed it down.
> 
> On Sep 9, 2015, at 11:05 PM, Roman Tkachenko <ro...@mailgunhq.com> wrote:
> 
> > Hey guys,
> > 
> > We've been having issues in the past couple of days with CPU usage
> / load average suddenly skyrocketing on some nodes of the cluster, 
> affecting performance significantly so majority of requests start 
> timing out. It can go on for several hours, with CPU spiking through
> the roof then coming back down to norm and so on. Weirdly, it 
> affects only a subset of nodes and it's always the same ones. The 
> boxes Cassandra is running on are pretty beefy, 24 cores, and these 
> CPU spikes go up to >1000%.
> > 
> > What is the best way to debug such kind of issues and find out 
> what Cassandra is doing during spikes like this? Doesn't seem to be 
> compaction related as sometimes during these spikes "nodetool 
> compactionstats" says no compactions are running.
> > 
> > Thanks!
> > 
> 



Re: High CPU usage on some of nodes

Posted by Roman Tkachenko <ro...@mailgunhq.com>.
Thanks for the responses guys.

I also suspected GC, and I guess it could be it, since during the spikes the
logs are filled with messages like "GC for ConcurrentMarkSweep: 5908 ms for
1 collections, 1986282520 used; max is 8375238656", often right before
messages about dropped queries, unlike other, unaffected nodes that only
have "GC for ParNew: 230 ms for 1 collections, 4418571760 used; max is
8375238656" type of messages.

Is my best shot to play with JVM settings trying to tune garbage collection
then?


On Thu, Sep 10, 2015 at 6:52 AM, Samuel CARRIERE <sa...@urssaf.fr>
wrote:

> Hi Roman,
> If it affects only a subset of nodes and it's always the same ones, it
> could be a "problem" with your data model: maybe some (too) wide rows on
> these nodes.
> If one of your rows is too wide, deserialising the column index
> of that row can take a lot of resources (disk, RAM, and CPU).
> If you are using leveled compaction strategy and you see abnormally big
> sstables on those nodes, it could be a clue.
> Regards,
> Samuel
>
> Robert Wille <rw...@fold3.com> wrote on 10/09/2015 15:27:41:
>
> > From: Robert Wille <rw...@fold3.com>
> > To: "user@cassandra.apache.org" <us...@cassandra.apache.org>,
> > Date: 10/09/2015 15:30
> > Subject: Re: High CPU usage on some of nodes
> >
> > It sounds like it's probably GC. Grep for GC in system.log to verify.
> > If it is GC, there are a myriad of issues that could cause it, but
> > at least you’ve narrowed it down.
> >
> > On Sep 9, 2015, at 11:05 PM, Roman Tkachenko <ro...@mailgunhq.com>
> wrote:
> >
> > > Hey guys,
> > >
> > > We've been having issues in the past couple of days with CPU usage
> > / load average suddenly skyrocketing on some nodes of the cluster,
> > affecting performance significantly so majority of requests start
> > timing out. It can go on for several hours, with CPU spiking through
> > the roof then coming back down to norm and so on. Weirdly, it
> > affects only a subset of nodes and it's always the same ones. The
> > boxes Cassandra is running on are pretty beefy, 24 cores, and these
> > CPU spikes go up to >1000%.
> > >
> > > What is the best way to debug such kind of issues and find out
> > what Cassandra is doing during spikes like this? Doesn't seem to be
> > compaction related as sometimes during these spikes "nodetool
> > compactionstats" says no compactions are running.
> > >
> > > Thanks!
> > >
> >
>

Re: High CPU usage on some of nodes

Posted by Samuel CARRIERE <sa...@urssaf.fr>.
Hi Roman,
If it affects only a subset of nodes and it's always the same ones, it
could be a "problem" with your data model: maybe some (too) wide rows on
these nodes.
If one of your rows is too wide, deserialising the column index
of that row can take a lot of resources (disk, RAM, and CPU).
If you are using leveled compaction strategy and you see abnormally big
sstables on those nodes, it could be a clue.
Regards,
Samuel
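
One way to spot that on the suspect nodes (a sketch; the data path and names below are illustrative):

# With leveled compaction, sstables should be roughly uniform in size,
# so a handful of much bigger ones points at very wide rows
ls -lhS /var/lib/cassandra/data/<keyspace>/<table>*/*-Data.db | head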

Robert Wille <rw...@fold3.com> wrote on 10/09/2015 15:27:41:

> From: Robert Wille <rw...@fold3.com>
> To: "user@cassandra.apache.org" <us...@cassandra.apache.org>,
> Date: 10/09/2015 15:30
> Subject: Re: High CPU usage on some of nodes
> 
> It sounds like it's probably GC. Grep for GC in system.log to verify.
> If it is GC, there are a myriad of issues that could cause it, but
> at least you've narrowed it down.
> 
> On Sep 9, 2015, at 11:05 PM, Roman Tkachenko <ro...@mailgunhq.com> 
wrote:
> 
> > Hey guys,
> > 
> > We've been having issues in the past couple of days with CPU usage
> / load average suddenly skyrocketing on some nodes of the cluster, 
> affecting performance significantly so majority of requests start 
> timing out. It can go on for several hours, with CPU spiking through
> the roof then coming back down to norm and so on. Weirdly, it 
> affects only a subset of nodes and it's always the same ones. The 
> boxes Cassandra is running on are pretty beefy, 24 cores, and these 
> CPU spikes go up to >1000%.
> > 
> > What is the best way to debug such kind of issues and find out 
> what Cassandra is doing during spikes like this? Doesn't seem to be 
> compaction related as sometimes during these spikes "nodetool 
> compactionstats" says no compactions are running.
> > 
> > Thanks!
> > 
> 

Re: High CPU usage on some of nodes

Posted by Robert Wille <rw...@fold3.com>.
It sounds like it's probably GC. Grep for GC in system.log to verify. If it is GC, there are a myriad of issues that could cause it, but at least you’ve narrowed it down.
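
For example, assuming the default log location (path is illustrative):

# GCInspector lines report pause length and heap use; long ConcurrentMarkSweep
# pauses (as opposed to short ParNew ones) are the red flag
grep GCInspector /var/log/cassandra/system.log | grep ConcurrentMarkSweep | tail -20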

On Sep 9, 2015, at 11:05 PM, Roman Tkachenko <ro...@mailgunhq.com> wrote:

> Hey guys,
> 
> We've been having issues in the past couple of days with CPU usage / load average suddenly skyrocketing on some nodes of the cluster, affecting performance significantly so majority of requests start timing out. It can go on for several hours, with CPU spiking through the roof then coming back down to norm and so on. Weirdly, it affects only a subset of nodes and it's always the same ones. The boxes Cassandra is running on are pretty beefy, 24 cores, and these CPU spikes go up to >1000%.
> 
> What is the best way to debug such kind of issues and find out what Cassandra is doing during spikes like this? Doesn't seem to be compaction related as sometimes during these spikes "nodetool compactionstats" says no compactions are running.
> 
> Thanks!
> 


Re: High CPU usage on some of nodes

Posted by Otis Gospodnetić <ot...@gmail.com>.
A quick and dirty way is to run jstack a few times and see if you can spot
some common methods where code is spending time.
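
A rough sketch of that (assuming a single Cassandra JVM on the box; loop count, sleep, and file names are arbitrary):

# Take a handful of thread dumps a few seconds apart
CASS_PID=$(pgrep -f CassandraDaemon | head -1)
for i in 1 2 3 4 5; do
  jstack "$CASS_PID" > /tmp/jstack.$i.txt
  sleep 5
done

# Optionally tie dumps to CPU: the hex of a busy thread id from "top -H"
# matches the nid= field in the jstack output
top -b -H -n 1 -p "$CASS_PID" | head -30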

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Thu, Sep 10, 2015 at 1:05 AM, Roman Tkachenko <ro...@mailgunhq.com>
wrote:

> Hey guys,
>
> We've been having issues in the past couple of days with CPU usage / load
> average suddenly skyrocketing on some nodes of the cluster, affecting
> performance significantly so majority of requests start timing out. It can
> go on for several hours, with CPU spiking through the roof then coming back
> down to norm and so on. Weirdly, it affects only a subset of nodes and it's
> always the same ones. The boxes Cassandra is running on are pretty beefy,
> 24 cores, and these CPU spikes go up to >1000%.
>
> What is the best way to debug such kind of issues and find out what
> Cassandra is doing during spikes like this? Doesn't seem to be compaction
> related as sometimes during these spikes "nodetool compactionstats" says no
> compactions are running.
>
> Thanks!
>
>