Posted to commits@cassandra.apache.org by "Karl Mueller (JIRA)" <ji...@apache.org> on 2014/07/08 04:24:35 UTC
[jira] [Comment Edited] (CASSANDRA-7507) OOM creates unreliable state - die instantly better

    [ https://issues.apache.org/jira/browse/CASSANDRA-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054433#comment-14054433 ] 

Karl Mueller edited comment on CASSANDRA-7507 at 7/8/14 2:23 AM:
-----------------------------------------------------------------

If a bug can keep the clean exit after OOM from working as expected, isn't that itself a problem?

I guess if I'm considering the value of a "clean exit" versus "possibly staying up, being in a weird state, or not writing the right data to disk", I would always prefer it to die without worrying about a clean exit. As I said, in my opinion, Cassandra already handles dying unexpectedly fine - there's no need to handle it cleanly when there's any risk. 

If there's no risk of something like 7133 happening (or a similar bug), then sure, clean exit is sensible, but that's clearly not guaranteed. Replaying some logs and then flushing is not a big deal compared to potentially bad data, zombie states, etc. - in my view, at least.



was (Author: kmueller):
If a bug can keep the clean exit after OOM from working as expected, isn't that itself a problem?

I guess if I'm considering the value of a "clean exit" versus "possibly staying up or being in a weird state", I would always prefer it to die without worrying about a clean exit. As I said, in my opinion, Cassandra already handles dying unexpectedly fine - there's no need to handle it cleanly when there's any risk. 

If there's no risk of something like 7133 happening (or a similar bug), then sure, clean exit is sensible, but that's clearly not guaranteed. Replaying some logs and then flushing is not a big deal compared to potentially bad data, zombie states, etc. - in my view, at least.


> OOM creates unreliable state - die instantly better
> ---------------------------------------------------
>
>                 Key: CASSANDRA-7507
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7507
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Karl Mueller
>            Priority: Minor
>
> I had a Cassandra node run out of memory (OOM). My heap had enough headroom; the cause was either a bug or an unfortunate spike in short-term memory utilization. This resulted in the following error:
>  WARN [StorageServiceShutdownHook] 2014-06-30 09:38:38,251 StorageProxy.java (line 1713) Some hints were not written before shutdown.  This is not supposed to happen.  You should (a) run repair, and (b) file a bug report
> There are no other messages of relevance besides the OOM error about 90 minutes earlier.
> My (limited) understanding of the JVM and Cassandra says that when it goes OOM, it will attempt to signal cassandra to shut down "cleanly". The problem, in my view, is that with an OOM situation, nothing is guaranteed anymore. I believe it's impossible to reliably "cleanly shut down" at this point, and therefore it's wrong to even try. 
> Yes, ideally things could be written out, flushed to disk, memory messages written, other nodes notified, etc. but why is there any reason to believe any of those steps could happen? Would happen? Couldn't bad data be written at this point to disk rather than good data? Some network messages delivered, but not others?
> I think Cassandra should have the option to (and possibly default to) kill itself immediately, in a hard way, as soon as the OOM condition occurs, rather than rely on the Java-based clean shutdown process. Cassandra already handles recovery from an unclean shutdown, and it's not a big deal. My node, for example, stayed in a sort-of-alive state for 90 minutes, during which who knows what it was or wasn't doing.
> I don't know enough about the JVM and options for it to know the best exact implementation of "die instantly on OOM", but it should be something that's possible either with some flags or a C library (which doesn't rely on java memory to do something which it may not be able to get!)
> Short version: a kill -9 of all C* processes in that instance without needing more java memory, when OOM is raised
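
One way the "die instantly" idea could be sketched in plain Java, without claiming this is how Cassandra does or should do it: install a default uncaught-exception handler that calls Runtime.halt() when a thread dies from an OutOfMemoryError. Unlike System.exit(), halt() does not run shutdown hooks, so the "clean" shutdown path is skipped entirely. The class name HaltOnOom, its methods, and the exit status 100 are hypothetical and chosen only for illustration. Note the limitation: the handler fires only for errors that escape a thread uncaught, and it still runs inside the JVM, so it is weaker than the external kill -9 asked for above.

```java
import java.util.function.IntConsumer;

/**
 * Sketch: halt the JVM immediately when a thread dies from OutOfMemoryError.
 * HaltOnOom and its members are hypothetical names, not Cassandra code.
 */
public final class HaltOnOom {
    /** Arbitrary exit status chosen for this sketch (an assumption). */
    public static final int OOM_EXIT_STATUS = 100;

    private HaltOnOom() {}

    /** Returns true if t, or anything in its cause chain, is an OutOfMemoryError. */
    public static boolean isOutOfMemory(Throwable t) {
        // Bound the walk so a cyclic cause chain cannot loop forever.
        for (int depth = 0; t != null && depth < 32; depth++, t = t.getCause()) {
            if (t instanceof OutOfMemoryError) {
                return true;
            }
        }
        return false;
    }

    /** Builds a handler that passes the exit status to onOom when it sees an OOM. */
    public static Thread.UncaughtExceptionHandler handler(IntConsumer onOom) {
        return (thread, error) -> {
            if (isOutOfMemory(error)) {
                onOom.accept(OOM_EXIT_STATUS);
            }
        };
    }

    /** Installs the handler process-wide; Runtime.halt skips shutdown hooks. */
    public static void install() {
        Thread.setDefaultUncaughtExceptionHandler(handler(Runtime.getRuntime()::halt));
    }

    public static void main(String[] args) {
        install();
        System.out.println("halt-on-OOM handler installed");
    }
}
```

The handler takes the halting action as a parameter only so it can be exercised without killing the JVM; install() wires in the real Runtime.halt(). HotSpot also has an -XX:OnOutOfMemoryError option that runs a user-supplied command when an OutOfMemoryError is first thrown, which is closer to the external kill described above; whether either approach suits Cassandra is left open here.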


