Posted to user@cassandra.apache.org by Bryan Cheng <br...@blockcypher.com> on 2015/07/22 23:55:28 UTC

Cassandra compaction appears to stall, node becomes partially unresponsive

Hi there,

Within our Cassandra cluster, we're observing, on occasion, one or two
nodes at a time becoming partially unresponsive.

We're running 2.1.7 across the entire cluster.

nodetool still reports the node as being healthy, and it does respond to
some local queries; however, the CPU is pegged at 100%. One common thread
(heh) each time this happens is that there always seem to be one or more
compaction threads running (via nodetool tpstats), and some appear to be
stuck (active count doesn't change, pending count doesn't decrease). A
request for compactionstats hangs with no response.

Each time we've seen this, the only thing that appears to resolve the issue
is a restart of the Cassandra process; the restart does not appear to be
clean, and requires one or more attempts (or a -9 on occasion).

There does not seem to be any pattern to what machines are affected; the
nodes thus far have been different instances on different physical machines
and on different racks.

Has anyone seen this before? Alternatively, when this happens again, what
data can we collect that would help with the debugging process (in addition
to tpstats)?
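
For concreteness, here is a rough sketch of what we're thinking of capturing
the next time a node gets into this state; the output paths and the PID lookup
are assumptions about our particular setup, so treat it as a sketch rather than
something we've tested in production:

    #!/usr/bin/env bash
    # Sketch only: paths and PID discovery are assumptions about our setup.
    TS=$(date +%Y%m%dT%H%M%S)
    OUT="/tmp/cassandra-stall-$TS"
    mkdir -p "$OUT"

    PID=$(pgrep -f CassandraDaemon | head -n1)

    # Thread pool state; compactionstats has hung on us before, so bound it.
    nodetool tpstats                    > "$OUT/tpstats.txt" 2>&1
    timeout 30 nodetool compactionstats > "$OUT/compactionstats.txt" 2>&1

    # Per-thread CPU usage and a JVM thread dump (run as the cassandra user).
    top -b -H -n 1 -p "$PID"            > "$OUT/top-threads.txt"
    jstack "$PID"                       > "$OUT/jstack.txt" 2>&1

If there's anything else worth adding to that list, suggestions are very welcome.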

Thanks in advance,

Bryan

Re: Cassandra compaction appears to stall, node becomes partially unresponsive

Posted by Aiman Parvaiz <ai...@flipagram.com>.
I faced something similar in the past, and the reason for nodes becoming unresponsive intermittently was long GC pauses. That's why I wanted to bring this to your attention in case GC pauses are a potential cause.

Sent from my iPhone

> On Jul 22, 2015, at 4:32 PM, Bryan Cheng <br...@blockcypher.com> wrote:
> 
> Aiman,
> 
> Your post made me look back at our data a bit. The most recent occurrence of this incident was not preceded by any abnormal GC activity; however, the previous occurrence (which took place a few days ago) did correspond to a massive, order-of-magnitude increase in both ParNew and CMS collection times which lasted ~17 hours.
> 
> Was there something in particular that links GC to these stalls? At this point in time, we cannot identify any particular reason for either that GC spike or the subsequent apparent compaction stall, although it did not seem to have any effect on our usage of the cluster.
> 
>> On Wed, Jul 22, 2015 at 3:35 PM, Bryan Cheng <br...@blockcypher.com> wrote:
>> Hi Aiman,
>> 
>> We previously had issues with GC, but since upgrading to 2.1.7 things seem a lot healthier.
>> 
>> We collect GC statistics through collectd via the garbage collector MBean; ParNew GCs report sub-500ms collection times on average (I believe accumulated per minute?), and CMS peaks at about 300ms collection time when it runs.
>> 
>>> On Wed, Jul 22, 2015 at 3:22 PM, Aiman Parvaiz <ai...@flipagram.com> wrote:
>>> Hi Bryan
>>> How's GC behaving on these boxes?
>>> 
>>>> On Wed, Jul 22, 2015 at 2:55 PM, Bryan Cheng <br...@blockcypher.com> wrote:
>>>> Hi there,
>>>> 
>>>> Within our Cassandra cluster, we're observing, on occasion, one or two nodes at a time becoming partially unresponsive.
>>>> 
>>>> We're running 2.1.7 across the entire cluster.
>>>> 
>>>> nodetool still reports the node as being healthy, and it does respond to some local queries; however, the CPU is pegged at 100%. One common thread (heh) each time this happens is that there always seem to be one or more compaction threads running (via nodetool tpstats), and some appear to be stuck (active count doesn't change, pending count doesn't decrease). A request for compactionstats hangs with no response.
>>>> 
>>>> Each time we've seen this, the only thing that appears to resolve the issue is a restart of the Cassandra process; the restart does not appear to be clean, and requires one or more attempts (or a -9 on occasion).
>>>> 
>>>> There does not seem to be any pattern to what machines are affected; the nodes thus far have been different instances on different physical machines and on different racks.
>>>> 
>>>> Has anyone seen this before? Alternatively, when this happens again, what data can we collect that would help with the debugging process (in addition to tpstats)?
>>>> 
>>>> Thanks in advance,
>>>> 
>>>> Bryan
>>> 
>>> 
>>> 
>>> -- 
>>> Aiman Parvaiz
>>> Lead Systems Architect
>>> aiman@flipagram.com
>>> cell: 213-300-6377
>>> http://flipagram.com/apz
> 

Re: Cassandra compaction appears to stall, node becomes partially unresponsive

Posted by Bryan Cheng <br...@blockcypher.com>.
Aiman,

Your post made me look back at our data a bit. The most recent occurrence
of this incident was not preceded by any abnormal GC activity; however, the
previous occurrence (which took place a few days ago) did correspond to a
massive, order-of-magnitude increase in both ParNew and CMS collection
times which lasted ~17 hours.

Was there something in particular that links GC to these stalls? At this
point in time, we cannot identify any particular reason for either that GC
spike or the subsequent apparent compaction stall, although it did not seem
to have any effect on our usage of the cluster.
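
In case it helps correlate the next spike, one thing we're considering is
turning on more detailed GC logging via cassandra-env.sh; a sketch is below
(the log path is a guess for our layout, and the flags are the standard
HotSpot ones on Java 7/8):

    # Sketch for cassandra-env.sh; adjust the log path to your install.
    JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
    JVM_OPTS="$JVM_OPTS -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=10M"

PrintGCApplicationStoppedTime in particular should tell us whether the pauses
line up with the windows where the node stops responding.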

On Wed, Jul 22, 2015 at 3:35 PM, Bryan Cheng <br...@blockcypher.com> wrote:

> Hi Aiman,
>
> We previously had issues with GC, but since upgrading to 2.1.7 things seem
> a lot healthier.
>
> We collect GC statistics through collectd via the garbage collector MBean;
> ParNew GCs report sub-500ms collection times on average (I believe
> accumulated per minute?), and CMS peaks at about 300ms collection time when
> it runs.
>
> On Wed, Jul 22, 2015 at 3:22 PM, Aiman Parvaiz <ai...@flipagram.com>
> wrote:
>
>> Hi Bryan
>> How's GC behaving on these boxes?
>>
>> On Wed, Jul 22, 2015 at 2:55 PM, Bryan Cheng <br...@blockcypher.com>
>> wrote:
>>
>>> Hi there,
>>>
>>> Within our Cassandra cluster, we're observing, on occasion, one or two
>>> nodes at a time becoming partially unresponsive.
>>>
>>> We're running 2.1.7 across the entire cluster.
>>>
>>> nodetool still reports the node as being healthy, and it does respond to
>>> some local queries; however, the CPU is pegged at 100%. One common thread
>>> (heh) each time this happens is that there always seem to be one or more
>>> compaction threads running (via nodetool tpstats), and some appear to be
>>> stuck (active count doesn't change, pending count doesn't decrease). A
>>> request for compactionstats hangs with no response.
>>>
>>> Each time we've seen this, the only thing that appears to resolve the
>>> issue is a restart of the Cassandra process; the restart does not appear to
>>> be clean, and requires one or more attempts (or a -9 on occasion).
>>>
>>> There does not seem to be any pattern to what machines are affected; the
>>> nodes thus far have been different instances on different physical machines
>>> and on different racks.
>>>
>>> Has anyone seen this before? Alternatively, when this happens again,
>>> what data can we collect that would help with the debugging process (in
>>> addition to tpstats)?
>>>
>>> Thanks in advance,
>>>
>>> Bryan
>>>
>>
>>
>>
>> --
>> *Aiman Parvaiz*
>> Lead Systems Architect
>> aiman@flipagram.com
>> cell: 213-300-6377
>> http://flipagram.com/apz
>>
>
>

Re: Cassandra compaction appears to stall, node becomes partially unresponsive

Posted by Bryan Cheng <br...@blockcypher.com>.
Hi Aiman,

We previously had issues with GC, but since upgrading to 2.1.7 things seem
a lot healthier.

We collect GC statistics through collectd via the garbage collector MBean;
ParNew GCs report sub-500ms collection times on average (I believe
accumulated per minute?), and CMS peaks at about 300ms collection time when
it runs.
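
If it's useful for comparison, a quick way to cross-check what collectd
reports against what the JVM itself sees is something like the following
(the PID lookup is an assumption about our setup, and jstat needs to run as
the same user as the Cassandra process):

    # YGCT / FGCT are cumulative collection times in seconds, so the deltas
    # between samples show time spent in ParNew / CMS over each interval.
    PID=$(pgrep -f CassandraDaemon | head -n1)
    jstat -gcutil "$PID" 10s 30   # sample every 10 seconds, 30 samples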

On Wed, Jul 22, 2015 at 3:22 PM, Aiman Parvaiz <ai...@flipagram.com> wrote:

> Hi Bryan
> How's GC behaving on these boxes?
>
> On Wed, Jul 22, 2015 at 2:55 PM, Bryan Cheng <br...@blockcypher.com>
> wrote:
>
>> Hi there,
>>
>> Within our Cassandra cluster, we're observing, on occasion, one or two
>> nodes at a time becoming partially unresponsive.
>>
>> We're running 2.1.7 across the entire cluster.
>>
>> nodetool still reports the node as being healthy, and it does respond to
>> some local queries; however, the CPU is pegged at 100%. One common thread
>> (heh) each time this happens is that there always seem to be one or more
>> compaction threads running (via nodetool tpstats), and some appear to be
>> stuck (active count doesn't change, pending count doesn't decrease). A
>> request for compactionstats hangs with no response.
>>
>> Each time we've seen this, the only thing that appears to resolve the
>> issue is a restart of the Cassandra process; the restart does not appear to
>> be clean, and requires one or more attempts (or a -9 on occasion).
>>
>> There does not seem to be any pattern to what machines are affected; the
>> nodes thus far have been different instances on different physical machines
>> and on different racks.
>>
>> Has anyone seen this before? Alternatively, when this happens again, what
>> data can we collect that would help with the debugging process (in addition
>> to tpstats)?
>>
>> Thanks in advance,
>>
>> Bryan
>>
>
>
>
> --
> *Aiman Parvaiz*
> Lead Systems Architect
> aiman@flipagram.com
> cell: 213-300-6377
> http://flipagram.com/apz
>

Re: Cassandra compaction appears to stall, node becomes partially unresponsive

Posted by Aiman Parvaiz <ai...@flipagram.com>.
Hi Bryan
How's GC behaving on these boxes?

On Wed, Jul 22, 2015 at 2:55 PM, Bryan Cheng <br...@blockcypher.com> wrote:

> Hi there,
>
> Within our Cassandra cluster, we're observing, on occasion, one or two
> nodes at a time becoming partially unresponsive.
>
> We're running 2.1.7 across the entire cluster.
>
> nodetool still reports the node as being healthy, and it does respond to
> some local queries; however, the CPU is pegged at 100%. One common thread
> (heh) each time this happens is that there always seem to be one or more
> compaction threads running (via nodetool tpstats), and some appear to be
> stuck (active count doesn't change, pending count doesn't decrease). A
> request for compactionstats hangs with no response.
>
> Each time we've seen this, the only thing that appears to resolve the
> issue is a restart of the Cassandra process; the restart does not appear to
> be clean, and requires one or more attempts (or a -9 on occasion).
>
> There does not seem to be any pattern to what machines are affected; the
> nodes thus far have been different instances on different physical machines
> and on different racks.
>
> Has anyone seen this before? Alternatively, when this happens again, what
> data can we collect that would help with the debugging process (in addition
> to tpstats)?
>
> Thanks in advance,
>
> Bryan
>



-- 
*Aiman Parvaiz*
Lead Systems Architect
aiman@flipagram.com
cell: 213-300-6377
http://flipagram.com/apz

Re: Cassandra compaction appears to stall, node becomes partially unresponsive

Posted by Bryan Cheng <br...@blockcypher.com>.
Robert, thanks for these references! We're not using DTCS, so 9056 and 8243
seem out, but I'll take a look at 9577 (I also looked at the referenced
thread on this list, which seems to have some interesting data).
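
In the meantime, if it recurs we'll try to tie the pegged CPU to specific
threads in a dump; roughly like this (the PID lookup and the thread id below
are placeholders for our setup):

    PID=$(pgrep -f CassandraDaemon | head -n1)

    # Per-thread CPU; the PID column of `top -H` is the Linux thread id (tid).
    top -b -H -n 1 -p "$PID" | head -n 40

    # Take the hottest tid and look it up in a thread dump: jstack prints
    # each thread's native id in hex as nid=0x<tid>.
    TID=12345   # placeholder: substitute the hottest thread id from top
    jstack "$PID" | grep -A 20 "nid=0x$(printf '%x' "$TID")"

That should at least tell us whether the CPU is burning inside
CompactionExecutor or somewhere else entirely.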

On Wed, Jul 22, 2015 at 5:33 PM, Robert Coli <rc...@eventbrite.com> wrote:

> On Wed, Jul 22, 2015 at 2:55 PM, Bryan Cheng <br...@blockcypher.com>
> wrote:
>
>> nodetool still reports the node as being healthy, and it does respond to
>> some local queries; however, the CPU is pegged at 100%. One common thread
>> (heh) each time this happens is that there always seem to be one or more
>> compaction threads running (via nodetool tpstats), and some appear to be
>> stuck (active count doesn't change, pending count doesn't decrease). A
>> request for compactionstats hangs with no response.
>>
>
> I've heard other reports of compaction appearing to stall in 2.1.7...
> wondering if you're affected by any of these...
>
> https://issues.apache.org/jira/browse/CASSANDRA-9577 or
> https://issues.apache.org/jira/browse/CASSANDRA-9056 or
> https://issues.apache.org/jira/browse/CASSANDRA-8243 (the latter two should
> not be in 2.1.7)
>
> =Rob
>
>

Re: Cassandra compaction appears to stall, node becomes partially unresponsive

Posted by Robert Coli <rc...@eventbrite.com>.
On Wed, Jul 22, 2015 at 2:55 PM, Bryan Cheng <br...@blockcypher.com> wrote:

> nodetool still reports the node as being healthy, and it does respond to
> some local queries; however, the CPU is pegged at 100%. One common thread
> (heh) each time this happens is that there always seem to be one or more
> compaction threads running (via nodetool tpstats), and some appear to be
> stuck (active count doesn't change, pending count doesn't decrease). A
> request for compactionstats hangs with no response.
>

I've heard other reports of compaction appearing to stall in 2.1.7...
wondering if you're affected by any of these...

https://issues.apache.org/jira/browse/CASSANDRA-9577 or
https://issues.apache.org/jira/browse/CASSANDRA-9056 or
https://issues.apache.org/jira/browse/CASSANDRA-8243 (the latter two should not
be in 2.1.7)

=Rob