You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "Zach Toews (Jira)" <ji...@apache.org> on 2021/10/07 17:03:00 UTC

[jira] [Commented] (CASSANDRA-17025) Direct memory OOM

    [ https://issues.apache.org/jira/browse/CASSANDRA-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17425693#comment-17425693 ] 

Zach Toews commented on CASSANDRA-17025:
----------------------------------------

[~brandon.williams] To clarify, what we're seeing is the node with the long gc pauses is marked down by the other nodes while `nodetool status` on the down nodes says it is still up. While it is in this state the entire cluster becomes unresponsive until the down node is restarted.

> Direct memory OOM 
> ------------------
>
>                 Key: CASSANDRA-17025
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17025
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Zach Toews
>            Priority: Normal
>
> Version: 3.11.11
> Cluster: 9 nodes (1 dc, 3 racks) on aws r5.4xlarge nodes (16 vCPU, 128GB mem)
> Heap size: 20G
> Direct memory buffer: 24G
>  
> We stood up this cluster a few months ago in order to migrate an old 2.0 cluster. Since then, after about 7-10 days a node will begin to experience long old gc cycles  before logging many  "java.lang.OutOfMemoryError: Direct buffer memory" exceptions. When the long gc cycles start, the entire cluster becomes unresponsive (our application is unable to make queries to any node).
> Restarting cassandra on the failing node resolves the problem, then we have to restart every other node in the cluster to prevent them from getting into the same state.
>  
> We have attempted to:
>  * Increase -XX:MaxDirectMemorySize
>  * Increase -Djdk.nio.maxCachedBufferSize
>  * Update cassandra from 3.11.10 to 3.11.11
> None of these have resolved the problem.
> Since the last failure, we have increased -XX:MaxDirectMemorySize again, and are waiting to see if that has any effect.
>  
> Old gc collections from system.log:
> {noformat}
> INFO  [Service Thread] 2021-10-04 15:24:04,973 GCInspector.java:285 - G1 Old Generation GC in 4447ms.  Compressed Class Space: 6683064 -> 6677952; G1 Eden Space: 16777216 -> 0; G1 Old Gen: 5431375384 -> 745163912; G1 Survivor Space: 419430400 -> 0; Metaspace: 54716768 ->
> ...
> INFO  [Service Thread] 2021-10-04 15:24:06,985 GCInspector.java:285 - G1 Old Generation GC in 1901ms.  G1 Old Gen: 745163912 -> 745168360;
> ...
> INFO  [Service Thread] 2021-10-04 15:24:09,306 GCInspector.java:285 - G1 Old Generation GC in 2046ms.  G1 Eden Space: 528482304 -> 0; G1 Old Gen: 759785184 -> 761229864;
> ...
> INFO  [Service Thread] 2021-10-04 15:24:14,749 GCInspector.java:285 - G1 Old Generation GC in 5403ms.  G1 Old Gen: 761229864 -> 762168640;
> ...
> INFO  [Service Thread] 2021-10-04 15:24:16,782 GCInspector.java:285 - G1 Old Generation GC in 1949ms.  G1 Old Gen: 762168640 -> 762167640;
> ...
> INFO  [Service Thread] 2021-10-04 15:25:09,406 GCInspector.java:285 - G1 Old Generation GC in 52302ms.  G1 Eden Space: 8388608 -> 0; G1 Old Gen: 762167640 -> 762168160;
> ...
> INFO  [Service Thread] 2021-10-04 15:25:15,011 GCInspector.java:285 - G1 Old Generation GC in 5522ms.  G1 Eden Space: 192937984 -> 0; G1 Old Gen: 762168160 -> 770098088;
> ...
> INFO  [Service Thread] 2021-10-04 15:25:31,453 GCInspector.java:285 - G1 Old Generation GC in 16310ms.  G1 Eden Space: 201326592 -> 0; G1 Old Gen: 770098088 -> 769228400;
> ...
> INFO  [Service Thread] 2021-10-04 15:25:33,597 GCInspector.java:285 - G1 Old Generation GC in 1984ms.  G1 Eden Space: 352321536 -> 0; G1 Old Gen: 750824952 -> 751118968;
> ...
> INFO  [Service Thread] 2021-10-04 15:25:50,152 GCInspector.java:285 - G1 Old Generation GC in 16411ms.  G1 Eden Space: 8388608 -> 0; G1 Old Gen: 751118968 -> 751645056;
> {noformat}
> Example of direct memory oom from system.log
> {noformat}
> INFO  [ScheduledTasks:1] 2021-10-04 15:25:31,484 MessagingService.java:1246 - READ messages were dropped in last 5000 ms: 2 internal and 7 cross node. Mean internal dropped latency: 47238 ms and Mean cross-node dropped latency: 45171 ms
> INFO  [ScheduledTasks:1] 2021-10-04 15:25:31,484 MessagingService.java:1246 - COUNTER_MUTATION messages were dropped in last 5000 ms: 1 internal and 60 cross node. Mean internal dropped latency: 7289 ms and Mean cross-node dropped latency: 9509 ms
> ERROR [CounterMutationStage-464] 2021-10-04 15:25:31,484 JVMStabilityInspector.java:94 - OutOfMemory error letting the JVM handle the error:
> java.lang.OutOfMemoryError: Direct buffer memory
>         at java.nio.Bits.reserveMemory(Bits.java:695) ~[na:1.8.0_292]
>         at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[na:1.8.0_292]
>         at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) ~[na:1.8.0_292]
>         at org.apache.cassandra.utils.memory.BufferPool.allocate(BufferPool.java:114) ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.utils.memory.BufferPool.access$1000(BufferPool.java:50) ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.utils.memory.BufferPool$LocalPool.allocate(BufferPool.java:408) ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.utils.memory.BufferPool$LocalPool.access$000(BufferPool.java:335) ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.utils.memory.BufferPool.takeFromPool(BufferPool.java:126) ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.utils.memory.BufferPool.get(BufferPool.java:98) ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.cache.ChunkCache.load(ChunkCache.java:156) ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.cache.ChunkCache.load(ChunkCache.java:39) ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at com.github.benmanes.caffeine.cache.BoundedLocalCache$BoundedLocalLoadingCache.lambda$new$0(BoundedLocalCache.java:2949) ~[caffeine-2.2.6.jar:na]
>         at com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$doComputeIfAbsent$15(BoundedLocalCache.java:1807) ~[caffeine-2.2.6.jar:na]
>         at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1892) ~[na:1.8.0_292]
>         at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:1805) ~[caffeine-2.2.6.jar:na]
>         at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:1788) ~[caffeine-2.2.6.jar:na]
>         at com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:97) ~[caffeine-2.2.6.jar:na]
>         at com.github.benmanes.caffeine.cache.LocalLoadingCache.get(LocalLoadingCache.java:66) ~[caffeine-2.2.6.jar:na]
>         at org.apache.cassandra.cache.ChunkCache$CachingRebufferer.rebuffer(ChunkCache.java:236) ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.cache.ChunkCache$CachingRebufferer.rebuffer(ChunkCache.java:214) ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.io.util.RandomAccessReader.reBufferAt(RandomAccessReader.java:65) ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.io.util.RandomAccessReader.seek(RandomAccessReader.java:207) ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.io.util.FileHandle.createReader(FileHandle.java:150) ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.io.sstable.format.SSTableReader.getFileDataInput(SSTableReader.java:1801) ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.db.columniterator.AbstractSSTableIterator.<init>(AbstractSSTableIterator.java:103) ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.db.columniterator.SSTableIterator.<init>(SSTableIterator.java:49) ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.io.sstable.format.big.BigTableReader.iterator(BigTableReader.java:72) ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.io.sstable.format.big.BigTableReader.iterator(BigTableReader.java:65) ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.db.rows.UnfilteredRowIteratorWithLowerBound.initializeIterator(UnfilteredRowIteratorWithLowerBound.java:108) ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.db.rows.LazilyInitializedUnfilteredRowIterator.maybeInit(LazilyInitializedUnfilteredRowIterator.java:48) ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.db.rows.LazilyInitializedUnfilteredRowIterator.partitionLevelDeletion(LazilyInitializedUnfilteredRowIterator.java:81) ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.db.rows.UnfilteredRowIteratorWithLowerBound.partitionLevelDeletion(UnfilteredRowIteratorWithLowerBound.java:170) ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.db.SinglePartitionReadCommand.queryMemtableAndDiskInternal(SinglePartitionReadCommand.java:762) ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.db.SinglePartitionReadCommand.queryMemtableAndDisk(SinglePartitionReadCommand.java:673) ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.db.CounterMutation.updateWithCurrentValuesFromCFS(CounterMutation.java:258) ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.db.CounterMutation.processModifications(CounterMutation.java:202) ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.db.CounterMutation.applyCounterMutation(CounterMutation.java:123) ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.service.StorageProxy$9.runMayThrow(StorageProxy.java:1650) ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2771) ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.service.StorageProxy$3.apply(StorageProxy.java:160) ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.service.StorageProxy.performWrite(StorageProxy.java:1201) ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.service.StorageProxy.applyCounterMutationOnLeader(StorageProxy.java:1627) ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.db.CounterMutationVerbHandler.doVerb(CounterMutationVerbHandler.java:48) ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66) ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_292]
>         at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162) ~[apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:134) [apache-cassandra-3.11.11.jar:3.11.11]
>         at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:113) [apache-cassandra-3.11.11.jar:3.11.11]
>         at java.lang.Thread.run(Thread.java:748) [na:1.8.0_292]
> {noformat}
> Example of full gc from from gc.log
> {noformat}
> 2021-10-04T15:25:46.467+0000: 494369.005: [Full GC (System.gc())  727M->726M(20480M), 1.8053704 secs]
>    [Eden: 8192.0K(12288.0M)->0.0B(12288.0M) Survivors: 0.0B->0.0B Heap: 727.3M(20480.0M)->726.2M(20480.0M)], [Metaspace: 53437K->53437K(1099776K)]
> Heap after GC invocations=14150 (full 57):
>  garbage-first heap   total 20971520K, used 743652K [0x00000002c0000000, 0x00000002c0805000, 0x00000007c0000000)
>   region size 8192K, 0 young (0K), 0 survivors (0K)
>  Metaspace       used 53437K, capacity 55970K, committed 56528K, reserved 1099776K
>   class space    used 6522K, capacity 6987K, committed 7120K, reserved 1048576K
> }
>  [Times: user=2.99 sys=0.00, real=1.81 secs]
> 2021-10-04T15:25:48.273+0000: 494370.811: Total time for which application threads were stopped: 1.8099675 seconds, Stopping threads took: 0.0003245 seconds
> {Heap before GC invocations=14150 (full 57):
>  garbage-first heap   total 20971520K, used 743652K [0x00000002c0000000, 0x00000002c0805000, 0x00000007c0000000)
>   region size 8192K, 1 young (8192K), 0 survivors (0K)
>  Metaspace       used 53437K, capacity 55970K, committed 56528K, reserved 1099776K
>   class space    used 6522K, capacity 6987K, committed 7120K, reserved 1048576K
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org