Posted to user@cassandra.apache.org by Stan Lemon <sl...@salesforce.com> on 2014/11/26 05:07:17 UTC

High cpu usage & segfaulting

We are using v2.0.11 and have seen several instances in our 24-node cluster
where a node becomes unresponsive. When we look into it, we find a
Cassandra process chewing up a lot of CPU. There are no other indications
in the logs as to what might be happening; however, if we strace the
process that is chewing up CPU, we see a segmentation fault:

--- SIGSEGV (Segmentation fault) @ 0 (0) ---
rt_sigreturn(0x7fd61110f862)            = 30618997712
futex(0x7fd614844054, FUTEX_WAIT_PRIVATE, 27333, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x7fd614844028, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fd6148e2e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fd6148e2e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7fd6148e2e28, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fd614844054, FUTEX_WAIT_PRIVATE, 27335, NULL) = 0
futex(0x7fd614844028, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x7fd6148e2e54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7fd6148e2e50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x7fd6148e2e28, FUTEX_WAKE_PRIVATE, 1) = 1

And this happens over and over again while running strace.

Has anyone seen this? Does anyone have any ideas what might be happening,
or how we could debug it further?

Thanks for your help,

Stan

Re: High cpu usage & segfaulting

Posted by Robert Coli <rc...@eventbrite.com>.

On Tue, Nov 25, 2014 at 8:07 PM, Stan Lemon <sl...@salesforce.com> wrote:

> We are using v2.0.11 and have seen several instances in our 24 node
> cluster where the node becomes unresponsive, when we look into it we find
> that there is a cassandra process chewing up a lot of CPU. There are no
> other indications in logs or anything as to what might be happening,
> however if we strace the process that is chewing up CPU we see a
> segmentation fault:
>

> Has anyone seen this? Does anyone have any ideas what might be happening,
> or how we could debug it further?
>

Does it go away when you restart the node?

First, you should do the standard checks for whether this is GC pre-fail,
which looks like a flattop on heap consumption graphs combined with a spike
in GC duration.
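
As a rough illustration of that check, here is a minimal Java sketch that
samples heap occupancy and cumulative GC time through the standard
java.lang.management MXBeans. It inspects the JVM it runs in; pointing it at
a Cassandra node would mean reading the same MXBeans over a JMX connection,
which is not shown here. The class name GcPreFailCheck, the 90% heap
threshold, and the "GC took at least half the interval" rule are assumptions
chosen for illustration, not values from this thread.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Hypothetical helper: samples heap occupancy and GC time of the JVM it runs in. */
class GcPreFailCheck {

    /** Fraction of the maximum heap in use, or -1 if the maximum is undefined. */
    public static double heapUsedFraction(long used, long max) {
        return max > 0 ? (double) used / max : -1.0;
    }

    /** Total milliseconds spent in GC so far, summed over all collectors. */
    public static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    /**
     * Illustrative rule (assumed thresholds): heap at or above 90% of its
     * maximum while GC consumed at least half of the sampling interval.
     */
    public static boolean looksLikePreFail(double heapFraction, long gcMillisDelta, long intervalMillis) {
        return heapFraction >= 0.90 && gcMillisDelta * 2 >= intervalMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        long intervalMillis = 5000; // assumed sampling interval
        long before = totalGcMillis();
        Thread.sleep(intervalMillis);
        long gcDelta = totalGcMillis() - before;

        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        double fraction = heapUsedFraction(heap.getUsed(), heap.getMax());

        System.out.printf("heap used fraction: %.2f, GC time in last %d ms: %d ms%n",
                fraction, intervalMillis, gcDelta);
        System.out.println("possible GC pre-fail: "
                + looksLikePreFail(fraction, gcDelta, intervalMillis));
    }
}
```

A single sample proves little; the flattop only shows up when the heap
fraction stays high across many consecutive samples, so in practice these
numbers would be graphed over time.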

If you don't find that or OOM log messages, your version is new enough that
I would file a JIRA at http://issues.apache.org

=Rob

Re: High cpu usage & segfaulting

Posted by Stan Lemon <sl...@salesforce.com>.

Thanks everyone for the feedback. So some additional details...

1. Definitely using Oracle JDK (1.7.0_71-b14)
2. Yes, the segfaulting does go away after a restart
3. No OOM log messages when this occurs
4. We are seeing many GC pauses that take a long time, as in over 2 seconds
- we are aware that our GC performance is bad and we believe this is
because of IO, which we are addressing. However, we are seeing this runaway
CPU usage during low-load times and even when we took the cluster
completely out of use.

Thanks again,
Stan


Re: High cpu usage & segfaulting

Posted by Tyler Hobbs <ty...@datastax.com>.

When I see a segfault, my first reaction is to always suspect OpenJDK.  Are
you using OpenJDK or the Oracle JDK?  If you're using the former, I
recommend the latter.


-- 
Tyler Hobbs
DataStax <http://datastax.com/>

Re: High cpu usage & segfaulting

Posted by Otis Gospodnetic <ot...@gmail.com>.

Hi Stan,

Put some monitoring on this.  The first thing I think of when I hear
"chewing up CPU" for Java apps is GC.  In SPM <http://sematext.com/spm/>
you can easily see individual JVM memory pools and see if any of them are
at (close to) 100%.  You can typically correlate that to increased GC times
and counts.  I'd look at that before looking at strace and such.
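
For readers without a monitoring tool at hand, the same per-pool numbers can
be read from the JVM's standard MXBeans. The following is a minimal Java
sketch, not SPM's implementation: it prints how full each memory pool is and
the count and total time of each collector for the JVM it runs in. The class
name MemoryPoolReport is hypothetical, and reading these values from a
running Cassandra node would require a JMX connection that is not shown.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

/** Hypothetical helper: reports memory pool occupancy and GC totals for the local JVM. */
class MemoryPoolReport {

    /** Percent of a pool's maximum in use; -1 when the pool has no defined maximum. */
    public static double percentUsed(long used, long max) {
        return max > 0 ? 100.0 * used / max : -1.0;
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            double pct = percentUsed(usage.getUsed(), usage.getMax());
            if (pct < 0) {
                System.out.printf("%-30s max undefined, used=%d bytes%n",
                        pool.getName(), usage.getUsed());
            } else {
                System.out.printf("%-30s %6.1f%% used%n", pool.getName(), pct);
            }
        }
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both values are cumulative since JVM start; -1 means "not reported".
            System.out.printf("%-30s count=%d time=%d ms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```

A pool sitting near 100% while the collector counts and times climb quickly
between runs is the correlation described above.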

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/

