Posted to commits@cassandra.apache.org by "Ryan McGuire (JIRA)" <ji...@apache.org> on 2014/08/11 17:34:12 UTC

[jira] [Commented] (CASSANDRA-7743) Possible C* OOM issue during long running test

    [ https://issues.apache.org/jira/browse/CASSANDRA-7743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092892#comment-14092892 ] 

Ryan McGuire commented on CASSANDRA-7743:
-----------------------------------------

I'd recommend running [MAT|http://www.eclipse.org/mat/] on one of the core files to examine what exactly is eating up the RAM. Although I'm not sure this helps with "Direct buffer memory", as I've only used it to debug things before we went off-heap.
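
As a lighter-weight check (a sketch only; DirectBufferStats is an illustrative name, not an existing tool), the JVM also exposes its buffer pools over JMX, so you can watch direct-buffer usage on a live node without taking a full dump:

{code}
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

// Prints usage of the JVM's buffer pools ("direct" and "mapped").
// The "Direct buffer memory" OOM fires when the "direct" pool
// cannot grow past the configured cap.
public class DirectBufferStats
{
    public static void main(String[] args)
    {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class))
        {
            System.out.printf("%s: count=%d, used=%d bytes, capacity=%d bytes%n",
                              pool.getName(), pool.getCount(),
                              pool.getMemoryUsed(), pool.getTotalCapacity());
        }
    }
}
{code}

The same data is visible in JConsole on a running node, under the java.nio:type=BufferPool MBeans.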

> Possible C* OOM issue during long running test
> ----------------------------------------------
>
>                 Key: CASSANDRA-7743
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7743
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: Google Compute Engine, n1-standard-1
>            Reporter: Pierre Laporte
>
> During a long-running test, we ended up with a lot of "java.lang.OutOfMemoryError: Direct buffer memory" errors on the Cassandra instances.
> Here is an example stacktrace from system.log (a short reproduction sketch follows the trace):
> {code}
> ERROR [SharedPool-Worker-1] 2014-08-11 11:09:34,610 ErrorMessage.java:218 - Unexpected exception during request
> java.lang.OutOfMemoryError: Direct buffer memory
>         at java.nio.Bits.reserveMemory(Bits.java:658) ~[na:1.7.0_25]
>         at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[na:1.7.0_25]
>         at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306) ~[na:1.7.0_25]
>         at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:434) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:179) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at io.netty.buffer.PoolArena.allocate(PoolArena.java:168) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at io.netty.buffer.PoolArena.allocate(PoolArena.java:98) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:251) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:112) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:507) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:464) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:378) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:350) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116) ~[netty-all-4.0.20.Final.jar:4.0.20.Final]
>         at java.lang.Thread.run(Thread.java:724) ~[na:1.7.0_25]
> {code}
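> For reference, this OutOfMemoryError is raised when the JVM's direct-memory cap (-XX:MaxDirectMemorySize, which defaults to roughly the maximum heap size when unset) is exhausted; it is independent of both the Java heap and free system RAM. A minimal standalone sketch (illustrative only, not Cassandra code) that reproduces the same error:
> {code}
> import java.nio.ByteBuffer;
> import java.util.ArrayList;
> import java.util.List;
>
> // Illustrative only: allocates direct buffers until the JVM's
> // direct-memory cap is exhausted.
> public class DirectOomDemo
> {
>     public static void main(String[] args)
>     {
>         List<ByteBuffer> buffers = new ArrayList<ByteBuffer>();
>         while (true)
>         {
>             // Charged against -XX:MaxDirectMemorySize, not the Java heap,
>             // so -Xmx headroom and free system RAM do not help.
>             buffers.add(ByteBuffer.allocateDirect(1024 * 1024));
>         }
>     }
> }
> {code}
> Running it with e.g. java -XX:MaxDirectMemorySize=64m DirectOomDemo hits the cap quickly and fails with the same "Direct buffer memory" message.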
> The test consisted of a 3-node cluster of n1-standard-1 GCE instances (1 vCPU, 3.75 GB RAM) running cassandra-2.1.0-rc5, and an n1-standard-2 instance running the test.
> After ~2.5 days, several requests started to fail and we saw the above stacktraces in the system.log file.
> The output from the Linux 'free' and '/proc/meminfo' commands suggests that there is still system memory available (see the note after the output):
> {code}
> $ free -m
>              total       used       free     shared    buffers     cached
> Mem:          3702       3532        169          0        161        854
> -/+ buffers/cache:       2516       1185
> Swap:            0          0          0
> $ head -n 4 /proc/meminfo
> MemTotal:        3791292 kB
> MemFree:          173568 kB
> Buffers:          165608 kB
> Cached:           874752 kB
> {code}
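> As noted above, the limit being hit here would be the JVM's per-process direct-memory cap rather than system RAM, which explains why 'free' still shows available memory while the JVM throws. On HotSpot the configured cap can be read at runtime; a minimal sketch (DirectLimitCheck is an illustrative name):
> {code}
> import com.sun.management.HotSpotDiagnosticMXBean;
> import java.lang.management.ManagementFactory;
>
> // HotSpot-specific: reads the -XX:MaxDirectMemorySize flag via JMX.
> public class DirectLimitCheck
> {
>     public static void main(String[] args)
>     {
>         HotSpotDiagnosticMXBean hotspot =
>             ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
>         // "0" means the flag was not set explicitly; the runtime then
>         // defaults to approximately Runtime.getRuntime().maxMemory().
>         System.out.println("MaxDirectMemorySize = "
>                            + hotspot.getVMOption("MaxDirectMemorySize").getValue());
>     }
> }
> {code}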
> These errors do not affect all the queries we run. The cluster is still responsive, but it is unable to display tracing information using cqlsh:
> {code}
> $ ./bin/nodetool --host 10.240.137.253 status duration_test
> Datacenter: DC1
> ===============
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address         Load       Tokens  Owns (effective)  Host ID                               Rack
> UN  10.240.98.27    925.17 KB  256     100.0%            41314169-eff5-465f-85ea-d501fd8f9c5e  RAC1
> UN  10.240.137.253  1.1 MB     256     100.0%            c706f5f9-c5f3-4d5e-95e9-a8903823827e  RAC1
> UN  10.240.72.183   896.57 KB  256     100.0%            15735c4d-98d4-4ea4-a305-7ab2d92f65fc  RAC1
> $ echo 'tracing on; select count(*) from duration_test.ints;' | ./bin/cqlsh 10.240.137.253
> Now tracing requests.
>  count
> -------
>   9486
> (1 rows)
> Statement trace did not complete within 10 seconds
> {code}


