Posted to commits@cassandra.apache.org by "Joshua McKenzie (JIRA)" <ji...@apache.org> on 2016/05/17 17:51:13 UTC

[jira] [Commented] (CASSANDRA-11818) C* does neither recover nor trigger stability inspector on direct memory OOM

    [ https://issues.apache.org/jira/browse/CASSANDRA-11818?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15287103#comment-15287103 ] 

Joshua McKenzie commented on CASSANDRA-11818:
---------------------------------------------

A few observations:

1) We'd need to change our handling of the error [here|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/transport/Message.java#L396] for it to be inspected by the JVMStabilityInspector. If we're already OOM, wrapping the exception is itself going to fail, since wrapping allocates.
2) We need CASSANDRA-8092 both integrated into CI, so it catches new errors like this in the future, and revised so its logic checks whether a catch block immediately rethrows vs. attempts further operations and/or wraps the exception, since wrapping will fail under the OOM condition.
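To illustrate point 1, here's a minimal sketch of the inspect-before-wrap pattern (the class and method names below are hypothetical stand-ins, not Cassandra's actual API; the real JVMStabilityInspector applies the configured disk/commit failure policy rather than returning a string):

{code}
// Sketch: inspect the raw Throwable *before* any allocation.
// Wrapping (new RuntimeException(t)) allocates, which can itself throw
// OutOfMemoryError when direct/heap memory is exhausted, so the
// stability check must come first.
public class ErrorHandlingSketch {
    // Hypothetical stand-in for JVMStabilityInspector.inspectThrowable.
    static boolean isFatal(Throwable t) {
        return t instanceof OutOfMemoryError;
    }

    static String handle(Throwable t) {
        if (isFatal(t)) {
            // In Cassandra this would trigger the failure policy
            // (e.g. kill the daemon) instead of returning a message.
            return "fatal: " + t.getClass().getSimpleName();
        }
        // Only non-fatal errors are safe to wrap and rethrow.
        throw new RuntimeException(t);
    }

    public static void main(String[] args) {
        System.out.println(handle(new OutOfMemoryError("Direct buffer memory")));
    }
}
{code}

The ordering is the whole point: once the fatal check has run, it no longer matters whether the subsequent wrap fails.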

Looks like the state of the code-base has regressed with regard to this issue since I submitted that ticket:
{noformat}
Total caught and rethrown as something other than Runtime: 100
Total caught and rethrown as Runtime: 68
Total Swallowed: 81
Total delegated to JVMStabilityInspector: 69
Total 'catch (Throwable ...)' analyzed: 120
Total 'catch (Exception ...)' analyzed: 198
Total catch clauses analyzed: 318
{noformat}
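For context on what those counts measure, a hypothetical regex-based approximation of such a catch-clause tally might look like the following (the actual CASSANDRA-8092 tool's implementation may differ substantially; this only shows the kind of categorization involved):

{code}
// Sketch: tally catch-clause categories in Java source text.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CatchAudit {
    static int count(String src, String regex) {
        Matcher m = Pattern.compile(regex).matcher(src);
        int n = 0;
        while (m.find()) n++;
        return n;
    }

    public static void main(String[] args) {
        String src =
            "try { } catch (Throwable t) { JVMStabilityInspector.inspectThrowable(t); }\n" +
            "try { } catch (Exception e) { }\n";  // swallowed

        int throwables = count(src, "catch\\s*\\(\\s*Throwable");
        int exceptions = count(src, "catch\\s*\\(\\s*Exception");
        int delegated  = count(src, "JVMStabilityInspector");
        System.out.println(throwables + " " + exceptions + " " + delegated);
    }
}
{code}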

[~mshuler]: Any word on where CASSANDRA-8092 falls on the priority list?

> C* does neither recover nor trigger stability inspector on direct memory OOM
> ----------------------------------------------------------------------------
>
>                 Key: CASSANDRA-11818
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11818
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Robert Stupp
>         Attachments: oom-histo-live.txt, oom-stack.txt
>
>
> The following stack trace is not caught by {{JVMStabilityInspector}}.
> Situation was caused by a load test with a lot of parallel writes and reads against a single node.
> {code}
> ERROR [SharedPool-Worker-1] 2016-05-17 18:38:44,187 Message.java:611 - Unexpected exception during request; channel = [id: 0x1e02351b, L:/127.0.0.1:9042 - R:/127.0.0.1:51087]
> java.lang.OutOfMemoryError: Direct buffer memory
> 	at java.nio.Bits.reserveMemory(Bits.java:693) ~[na:1.8.0_92]
> 	at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[na:1.8.0_92]
> 	at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) ~[na:1.8.0_92]
> 	at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:672) ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
> 	at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:234) ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
> 	at io.netty.buffer.PoolArena.allocate(PoolArena.java:218) ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
> 	at io.netty.buffer.PoolArena.allocate(PoolArena.java:138) ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
> 	at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:270) ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
> 	at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:177) ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
> 	at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:168) ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
> 	at io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:105) ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
> 	at org.apache.cassandra.transport.Message$ProtocolEncoder.encode(Message.java:349) ~[main/:na]
> 	at org.apache.cassandra.transport.Message$ProtocolEncoder.encode(Message.java:314) ~[main/:na]
> 	at io.netty.handler.codec.MessageToMessageEncoder.write(MessageToMessageEncoder.java:89) ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
> 	at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:619) ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
> 	at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:676) ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
> 	at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:612) ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
> 	at org.apache.cassandra.transport.Message$Dispatcher$Flusher.run(Message.java:445) ~[main/:na]
> 	at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38) ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
> 	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120) ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
> 	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:358) ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
> 	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:374) ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
> 	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:112) ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
> 	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) ~[netty-all-4.0.36.Final.jar:4.0.36.Final]
> 	at java.lang.Thread.run(Thread.java:745) [na:1.8.0_92]
> {code}
> The situation does not get better when the load driver is stopped.
> I can reproduce this scenario at will and have captured a histogram, stack traces, and a heap dump. I had already increased {{-XX:MaxDirectMemorySize}} to {{2g}}.
> A {{nodetool flush}} causes the daemon to exit (as that direct-memory OOM is caught by {{JVMStabilityInspector}}).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)