Posted to commits@cassandra.apache.org by "Paul Pham (JIRA)" <ji...@apache.org> on 2014/01/16 23:49:22 UTC

[jira] [Commented] (CASSANDRA-6255) Exception count not incremented on OutOfMemoryError (HSHA)

    [ https://issues.apache.org/jira/browse/CASSANDRA-6255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13874096#comment-13874096 ] 

Paul Pham commented on CASSANDRA-6255:
--------------------------------------

We've experienced almost precisely the same issue: Cassandra stopped listening on 9160 after an apparent out-of-memory problem, and we also saw the "invalid frame size of 0" error, which we've observed to a lesser degree in the past (though trending up as of late) and have never been able to root-cause.

One other interesting data point is that all 3 nodes in our main datacenter hit this same issue at the same time. Our working theory is that starting many applications simultaneously was more than the current cluster could handle (with both reads and writes using a consistency level of TWO).

Relevant log snippet:
{code}
INFO [ScheduledTasks:1] 2014-01-16 19:01:45,208 GCInspector.java (line 119) GC for ConcurrentMarkSweep: 921 ms for 1 collections, 1389728032 used; max is 2105540608
ERROR [Selector-Thread-0] 2014-01-16 19:01:48,813 CustomTHsHaServer.java (line 187) Uncaught Exception:
java.lang.OutOfMemoryError: Java heap space
        at java.nio.HeapByteBuffer.<init>(Unknown Source)
        at java.nio.ByteBuffer.allocate(Unknown Source)
        at org.apache.thrift.server.TNonblockingServer$FrameBuffer.read(TNonblockingServer.java:491)
        at org.apache.cassandra.thrift.CustomTHsHaServer$SelectorThread.handleRead(CustomTHsHaServer.java:273)
        at org.apache.cassandra.thrift.CustomTHsHaServer$SelectorThread.select(CustomTHsHaServer.java:224)
        at org.apache.cassandra.thrift.CustomTHsHaServer$SelectorThread.run(CustomTHsHaServer.java:182)
 INFO [ScheduledTasks:1] 2014-01-16 19:01:48,822 GCInspector.java (line 119) GC for ConcurrentMarkSweep: 3125 ms for 2 collections, 1390753448 used; max is 2105540608
{code}

Strangely, you can see that we were not actually close to our max heap limit, although usage was inching toward our configured CMSInitiatingOccupancyFraction threshold, which works out to roughly 1.5 GB of the 2 GB heap.
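For reference - assuming a stock cassandra-env.sh with no local overrides, which is an assumption on my part - the relevant CMS settings typically look like the following, and 75% of a 2 GB heap is where that ~1.5 GB figure comes from:
{code}
# Illustrative CMS settings as they appear in a stock cassandra-env.sh (assumption: no local overrides).
# 0.75 * 2105540608 bytes is roughly 1.58 GB, so CMS starts collecting around the usage seen in the log above.
JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
{code}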

We were ready to chalk this up to increased load on the Cassandra cluster (and a need for beefier or more nodes). But I'd also echo Dan's concern that the health reporting is neither sufficient nor accurate for this particular failure mode, which I think is the main ask of this ticket. We were only able to detect the issue because we had a monit check verifying the expected listening ports.
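To make that ask concrete, here is a minimal sketch of the behaviour being requested. This is not the real CustomTHsHaServer code; the class and counter names are invented for illustration. The point is simply that the selector loop catches Throwable rather than Exception, so an OutOfMemoryError would still bump an exception counter of the kind nodetool info surfaces as "Exceptions":
{code}
// Illustrative sketch only - not Cassandra's actual code.
// SelectorLoopSketch and "exceptions" are hypothetical stand-ins.
import java.util.concurrent.atomic.AtomicLong;

public class SelectorLoopSketch implements Runnable
{
    // Hypothetical stand-in for a node-wide counter of the kind reported by nodetool info.
    public static final AtomicLong exceptions = new AtomicLong();

    public void run()
    {
        while (!Thread.currentThread().isInterrupted())
        {
            try
            {
                selectOnce(); // hypothetical: one pass of read/write handling
            }
            catch (Throwable t) // Throwable, not Exception, so OutOfMemoryError is counted as well
            {
                exceptions.incrementAndGet();
                System.err.println("Uncaught exception in selector loop: " + t);
            }
        }
    }

    private void selectOnce()
    {
        // channel handling would live here
    }
}
{code}
With something along those lines, a monitor polling the exception count could have flagged this node, instead of the port check being the only thing that noticed.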



> Exception count not incremented on OutOfMemoryError (HSHA)
> ----------------------------------------------------------
>
>                 Key: CASSANDRA-6255
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6255
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: Oracle java version "1.7.0_15"
> rpc_server_type: hsha
>            Reporter: Dan Hendry
>            Priority: Minor
>
> One of our nodes decided to stop listening on 9160 (netstat -l was showing nothing and telnet was reporting connection refused). Nodetool status showed no hosts down and on the offending node nodetool info gave the following:
> {noformat}
> nodetool info
> Token            : (invoke with -T/--tokens to see all 256 tokens)
> ID               : (removed)
> Gossip active    : true
> Thrift active    : true
> Native Transport active: false
> Load             : 2.05 TB
> Generation No    : 1382536528
> Uptime (seconds) : 432970
> Heap Memory (MB) : 8098.05 / 14131.25
> Data Center      : DC1
> Rack             : RAC2
> Exceptions       : 0
> Key Cache        : size 536854996 (bytes), capacity 536870912 (bytes), 41383646 hits, 1710831591 requests, 0.024 recent hit rate, 0 save period in seconds
> Row Cache        : size 0 (bytes), capacity 0 (bytes), 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
> {noformat}
> After looking at the cassandra log, I saw a bunch of the following:
> {noformat}
> ERROR [Selector-Thread-16] 2013-10-27 17:36:00,370 CustomTHsHaServer.java (line 187) Uncaught Exception: 
> java.lang.OutOfMemoryError: unable to create new native thread
>         at java.lang.Thread.start0(Native Method)
>         at java.lang.Thread.start(Thread.java:691)
>         at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
>         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1371)
>         at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.execute(DebuggableThreadPoolExecutor.java:145)
>         at org.apache.cassandra.thrift.CustomTHsHaServer.requestInvoke(CustomTHsHaServer.java:337)
>         at org.apache.cassandra.thrift.CustomTHsHaServer$SelectorThread.handleRead(CustomTHsHaServer.java:281)
>         at org.apache.cassandra.thrift.CustomTHsHaServer$SelectorThread.select(CustomTHsHaServer.java:224)
>         at org.apache.cassandra.thrift.CustomTHsHaServer$SelectorThread.run(CustomTHsHaServer.java:182)
> ERROR [Selector-Thread-7] 2013-10-27 17:36:00,370 CustomTHsHaServer.java (line 187) Uncaught Exception: 
> java.lang.OutOfMemoryError: unable to create new native thread
>         at java.lang.Thread.start0(Native Method)
>         at java.lang.Thread.start(Thread.java:691)
>         at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
>         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1371)
>         at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.execute(DebuggableThreadPoolExecutor.java:145)
>         at org.apache.cassandra.thrift.CustomTHsHaServer.requestInvoke(CustomTHsHaServer.java:337)
>         at org.apache.cassandra.thrift.CustomTHsHaServer$SelectorThread.handleRead(CustomTHsHaServer.java:281)
>         at org.apache.cassandra.thrift.CustomTHsHaServer$SelectorThread.select(CustomTHsHaServer.java:224)
>         at org.apache.cassandra.thrift.CustomTHsHaServer$SelectorThread.run(CustomTHsHaServer.java:182)
> {noformat}
> There wasn't anything else overtly suspicious in the logs except for the occasional 
> {noformat}
> ERROR [Selector-Thread-0] 2013-10-27 17:35:58,662 TNonblockingServer.java (line 468) Read an invalid frame size of 0. Are you using TFramedTransport on the client side?
> {noformat}
> but that periodically comes up - I have looked into it before and it has never seemed to have any serious impact.
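> For reference, that message is Thrift complaining about an unframed (or garbage) request; a client talking to an hsha server is expected to wrap its socket in TFramedTransport. A minimal sketch using the standard Thrift Java classes (the host and keyspace here are just placeholders):
> {code}
> import org.apache.cassandra.thrift.Cassandra;
> import org.apache.thrift.protocol.TBinaryProtocol;
> import org.apache.thrift.transport.TFramedTransport;
> import org.apache.thrift.transport.TSocket;
> import org.apache.thrift.transport.TTransport;
>
> public class FramedClientSketch
> {
>     public static void main(String[] args) throws Exception
>     {
>         // Without the TFramedTransport wrapper, the server interprets the raw bytes as a
>         // frame-length header and logs "invalid frame size" errors like the one above.
>         TTransport transport = new TFramedTransport(new TSocket("cassandra-host", 9160));
>         transport.open();
>         Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
>         client.set_keyspace("my_keyspace"); // placeholder keyspace
>         transport.close();
>     }
> }
> {code}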
> This ticket is not about *why* the OutOfMemoryError occurred - that is bad, but I don't think I have enough information to reproduce it or speculate on a cause. This ticket is about the fact that an OutOfMemoryError occurred while nodetool info kept reporting Thrift active : true and Exceptions : 0.
> Our monitoring systems and investigation processes are both starting to rely on the exception count. The fact that it was not accurate here is disconcerting.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)