Posted to commits@cassandra.apache.org by "Joshua McKenzie (JIRA)" <ji...@apache.org> on 2014/08/19 19:55:18 UTC

[jira] [Updated] (CASSANDRA-7507) OOM creates unreliable state - die instantly better

     [ https://issues.apache.org/jira/browse/CASSANDRA-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joshua McKenzie updated CASSANDRA-7507:
---------------------------------------

    Attachment: 7507_v3_java.txt
                7507_v3_build.txt

v3 attached in 2 parts:

h5. 7507_v3_java.txt:
I've added a new JVMStabilityInspector class that centralizes handling of the Throwables and Exceptions we consider unrecoverable.  I started by simply rethrowing to our uncaught handler, but repeatedly ran into logic that expected to continue after encountering a Throwable (CommitLog, CommitLogSegment, SEPWorker, etc.), which made that approach an invasive change.  Instead I took a hint from CommitLog.handleCommitError() for this design, removed the exit thread from CassandraDaemon, and added a JVMStabilityInspector.inspectThrowable call to our default uncaught exception handler.
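
To make the intent concrete, the shape of the check is roughly the sketch below.  This is illustrative only, not the attached patch; the method name, exit code, and the exact set of throwables treated as fatal are assumptions for discussion.

{code:java}
// Illustrative sketch only, not the attached patch.  Names, the exit code,
// and the set of throwables treated as fatal are assumptions.
public final class JVMStabilityInspector
{
    private JVMStabilityInspector() {}

    // Called from catch (Throwable ...) blocks and from the default uncaught
    // exception handler.  If the throwable means the JVM can no longer be
    // trusted, halt immediately instead of attempting a "clean" shutdown.
    public static void inspectThrowable(Throwable t)
    {
        if (t instanceof OutOfMemoryError)
        {
            // Best-effort logging; the heap may be exhausted, so keep it minimal.
            t.printStackTrace(System.err);

            // Runtime.halt() skips shutdown hooks entirely, so no partial
            // "clean shutdown" work (hint flushing, etc.) is attempted.
            Runtime.getRuntime().halt(100);
        }
    }
}
{code}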

h5. 7507_v3_build.txt:
I've also tightened up findSwallowedExceptions.py and added it to tools/bin.  There are two new targets in build.xml related to it, one of which I integrated into the build pipeline.  With the current setup, if the parser finds any catch (Throwable ...) clauses in the codebase that are neither rethrown nor handed to the JVMStabilityInspector, it fails the build with the infraction count and prompts you to run 'ant swallow-print' for details.  I also added a -w flag to the script itself, which has proven useful for incrementally walking through violations in the codebase to remedy them.
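
For reference, the pattern the check accepts looks roughly like the following.  This is illustrative only; it assumes the JVMStabilityInspector sketch above, and riskyOperation() is a hypothetical stand-in rather than anything in the patch.

{code:java}
// Illustrative only: a catch (Throwable ...) clause the script would not flag,
// because the throwable is handed to JVMStabilityInspector before local handling.
public class SwallowedExceptionExample
{
    public static void main(String[] args)
    {
        try
        {
            riskyOperation();
        }
        catch (Throwable t)
        {
            JVMStabilityInspector.inspectThrowable(t); // may halt the JVM on fatal errors
            System.err.println("Operation failed, continuing: " + t);
        }
    }

    // Hypothetical stand-in for real work that may throw.
    private static void riskyOperation()
    {
        throw new RuntimeException("simulated failure");
    }
}
{code}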


I have a few reservations about v3, though none are strong enough to keep me from attaching it for discussion:
# Added complexity and time in the build process.  The ant target takes around 400ms on my dev laptop.  As to the hard dependency on the build target, I'm open to removing it and instead running the check separately on our CI server.  The codebase is currently violation-free, so this should only gate new development against adding blanket Throwable catches; still, the added build time is something I'm on the fence about.
# Potential performance impact of hitting JVMStabilityInspector's inspection method.  This code path should only be executed in exception cases, and my expectation is that we're not writing performance-critical code with thrown exceptions assumed in the mix.  It's something worth keeping in mind if/as we extend it with further checks (file handle exhaustion for CASSANDRA-7579, for instance).
# I'd prefer the shutdown hook unregistration be agnostic rather than calling StorageService specifically to remove it; however, that would require either having users call a wrapper on JVMStabilityInspector to register shutdown hooks or using reflection to iterate over package-private variables in ApplicationShutdownHooks, which I don't want to couple us to.  Given we only have one shutdown hook at the moment, leaving it coupled seems a reasonable compromise for now (a sketch of the wrapper alternative follows this list).
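
For point 3, the wrapper alternative would look something like the sketch below.  It is not part of the attached patch; the class and method names are hypothetical and only illustrate how hook unregistration could stay agnostic of StorageService.

{code:java}
// Hypothetical sketch of the "wrapper" alternative, not in the patch.  Hooks
// registered through this wrapper can be removed in bulk without referencing
// StorageService (or any other specific owner) directly.
import java.util.ArrayList;
import java.util.List;

public final class ShutdownHookWrapperSketch
{
    private static final List<Thread> registeredHooks = new ArrayList<>();

    private ShutdownHookWrapperSketch() {}

    public static synchronized void registerShutdownHook(Thread hook)
    {
        Runtime.getRuntime().addShutdownHook(hook);
        registeredHooks.add(hook);
    }

    // Called before halting on an unstable JVM so no "clean shutdown" hooks run.
    public static synchronized void removeAllShutdownHooks()
    {
        for (Thread hook : registeredHooks)
            Runtime.getRuntime().removeShutdownHook(hook);
        registeredHooks.clear();
    }
}
{code}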

> OOM creates unreliable state - die instantly better
> ---------------------------------------------------
>
>                 Key: CASSANDRA-7507
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7507
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Karl Mueller
>            Assignee: Joshua McKenzie
>            Priority: Minor
>             Fix For: 2.1.1
>
>         Attachments: 7507_v1.txt, 7507_v2.txt, 7507_v3_build.txt, 7507_v3_java.txt, exceptionHandlingResults.txt, findSwallowedExceptions.py
>
>
> I had a Cassandra node run OOM. My heap had enough headroom; there was just something, either a bug or an unfortunate amount of short-term memory utilization, that pushed it over. This resulted in the following error:
>  WARN [StorageServiceShutdownHook] 2014-06-30 09:38:38,251 StorageProxy.java (line 1713) Some hints were not written before shutdown.  This is not supposed to happen.  You should (a) run repair, and (b) file a bug report
> There are no other messages of relevance besides the OOM error about 90 minutes earlier.
> My (limited) understanding of the JVM and Cassandra says that when it goes OOM, it will attempt to signal Cassandra to shut down "cleanly". The problem, in my view, is that in an OOM situation, nothing is guaranteed anymore. I believe it's impossible to reliably "cleanly shut down" at this point, and therefore it's wrong to even try.
> Yes, ideally things could be written out, flushed to disk, memory messages written, other nodes notified, etc., but why is there any reason to believe any of those steps could happen? Would they? Couldn't bad data be written to disk at this point rather than good data? Some network messages delivered, but not others?
> I think Cassandra should have the option (and possibly default) to kill itself immediately and hard when the OOM condition happens, rather than relying on the Java-based clean shutdown process. Cassandra already handles recovery from unclean shutdown, and it's not a big deal. My node, for example, remained in a sort-of-alive state for 90 minutes, during which who knows what it was or wasn't doing.
> I don't know enough about the JVM and its options to know the best exact implementation of "die instantly on OOM", but it should be possible either with some flags or a C library (which doesn't rely on Java memory, which it may not be able to get, to do its work!)
> Short version: a kill -9 of all C* processes in that instance without needing more java memory, when OOM is raised



--
This message was sent by Atlassian JIRA
(v6.2#6252)