Posted to commits@cassandra.apache.org by "Joey Lynch (Jira)" <ji...@apache.org> on 2020/05/05 05:58:00 UTC

[jira] [Comment Edited] (CASSANDRA-15214) OOMs caught and not rethrown

    [ https://issues.apache.org/jira/browse/CASSANDRA-15214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099546#comment-17099546 ] 

Joey Lynch edited comment on CASSANDRA-15214 at 5/5/20, 5:57 AM:
-----------------------------------------------------------------

Quick update on this from the jvmquake side: we are now building [architecture specific artifacts|https://github.com/Netflix-Skunkworks/jvmquake/releases] that will work with any JVM newer than Java 8. They link only against the platform-specific libc, and we are now also testing on Java 8 and 11, on both Zulu and OpenJDK JVMs. I think this means it would be plausible to include {{libjvmquake-linux-x86_64.so}} in {{libs}} and then switch on {{uname -s -m}} to decide whether to pick it up. Right now we're only building for linux amd64, but if there is interest I can generate more architectures (linux arm probably makes sense, and I could do osx). I also still like the idea of having {{agents/available}} and {{agents/enabled}} folders, like apache does for modules: users can just symlink agents from one to the other to include them (and we can symlink jamm and jvmquake by default).

[~yifanc] I agree that OutOfMemory conditions that do not result in a "true" JVM OOM (meaning one that would cause a heap dump via {{HeapDumpOnOutOfMemoryError}}), such as direct buffer allocations, will not get caught by jvmquake; my testing confirms your findings. That said, the jvmquake GC instability algorithm will still trigger in various real-world scenarios I've run into.

I feel like the right move might be to walk back a small bit of CASSANDRA-13006, where we stopped forcibly killing the JVM ourselves and let the JVM do it. Specifically, if the OOM message contains "Direct buffer memory" we could do what jvmquake does and force the JVM into a "normal" OOM by [allocating large long arrays|https://github.com/Netflix-Skunkworks/jvmquake/blob/master/src/jvmquake.c#L103]. This would then trigger a proper OOM and get us a heap dump. It's relatively easy to ignore the "sacrificial" long arrays in a heap dump, and we could log clearly what is happening.
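
To make the proposal concrete, here is a rough Java sketch of the idea, not a patch. The class and method names ({{DirectOomEscalator}}, {{isDirectBufferOom}}, {{forceHeapOom}}) and the 128 MiB chunk size are made up for illustration; the only thing taken from the discussion above is the "Direct buffer memory" message check and the trick of allocating long arrays until the heap is exhausted.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: escalates a direct-buffer OOM into a regular heap OOM. */
public final class DirectOomEscalator {
    // Hypothetical chunk size: 16M longs, roughly 128 MiB per array.
    private static final int CHUNK_LONGS = 16 * 1024 * 1024;

    private DirectOomEscalator() {}

    /** True if the throwable is an OOM whose message mentions direct buffer memory. */
    public static boolean isDirectBufferOom(Throwable t) {
        return t instanceof OutOfMemoryError
            && t.getMessage() != null
            && t.getMessage().contains("Direct buffer memory");
    }

    /**
     * Keeps allocating (and retaining) long arrays until the JVM itself throws
     * a heap OutOfMemoryError. This method never returns normally.
     */
    public static void forceHeapOom() {
        List<long[]> sacrificial = new ArrayList<>();
        while (true) {
            sacrificial.add(new long[CHUNK_LONGS]);
        }
    }

    /** Call from wherever the OOM is observed; other throwables are ignored here. */
    public static void escalateIfDirectBufferOom(Throwable t) {
        if (isDirectBufferOom(t)) {
            System.err.println("Direct buffer OOM observed; filling the heap with "
                + "sacrificial long[] arrays to force a regular heap OOM");
            forceHeapOom();
        }
    }
}
```

The caller must not swallow the {{OutOfMemoryError}} that {{forceHeapOom}} eventually raises, otherwise we are back to the original problem.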


> OOMs caught and not rethrown
> ----------------------------
>
>                 Key: CASSANDRA-15214
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15214
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Messaging/Client, Messaging/Internode
>            Reporter: Benedict Elliott Smith
>            Priority: Normal
>             Fix For: 4.0, 4.0-rc
>
>         Attachments: oom-experiments.zip
>
>
> Netty (at least, and perhaps elsewhere in Executors) catches all exceptions, so presently there is no way to ensure that an OOM reaches the JVM handler to trigger a crash/heapdump.
> It may be that the simplest, most consistent way to do this would be to have a single thread spawned at startup that waits for any exceptions we must propagate to the Runtime.
> We could probably submit a patch upstream to Netty, but for a guaranteed future-proof approach, it may be worth paying the cost of a single thread.
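
For illustration, the single-thread approach from the description could look roughly like the Java sketch below. This is an assumption about the shape of such a thread, not code from Cassandra or Netty: the {{FatalErrorPropagator}} class, its {{propagate}} method, and the thread name are all hypothetical. Code that catches an {{Error}} hands it to the queue, and the dedicated thread rethrows it so that it ends up in that thread's uncaught exception handler.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Illustrative only: one thread that rethrows errors handed to it by other threads. */
public final class FatalErrorPropagator {
    private final BlockingQueue<Error> pending = new LinkedBlockingQueue<>();

    /** The handler is whatever should run when a propagated error goes uncaught. */
    public FatalErrorPropagator(Thread.UncaughtExceptionHandler handler) {
        Thread thread = new Thread(this::awaitAndRethrow, "fatal-error-propagator");
        thread.setDaemon(true);
        thread.setUncaughtExceptionHandler(handler);
        thread.start();
    }

    /** Called from any catch block that must not swallow the error. */
    public void propagate(Error e) {
        pending.offer(e);
    }

    private void awaitAndRethrow() {
        Error e;
        try {
            // Blocks until some other thread reports an error.
            e = pending.take();
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            return;
        }
        // Rethrow on this thread so the error is uncaught here.
        throw e;
    }
}
```

The thread exits after the first error it rethrows, which seems acceptable if the handler is expected to bring the process down anyway.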


