Posted to commits@samza.apache.org by "Yan Fang (JIRA)" <ji...@apache.org> on 2015/02/13 20:05:12 UTC

[jira] [Commented] (SAMZA-507) OOME causes container to wedge

    [ https://issues.apache.org/jira/browse/SAMZA-507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14320579#comment-14320579 ] 

Yan Fang commented on SAMZA-507:
--------------------------------

Patch looks good. +1. Thank you.

> OOME causes container to wedge
> ------------------------------
>
>                 Key: SAMZA-507
>                 URL: https://issues.apache.org/jira/browse/SAMZA-507
>             Project: Samza
>          Issue Type: Bug
>          Components: container
>    Affects Versions: 0.8.0
>            Reporter: Chris Riccomini
>            Assignee: Chris Riccomini
>             Fix For: 0.9.0
>
>         Attachments: SAMZA-507-0.patch
>
>
> One of our Samza jobs' containers wedged the other day due to an OOME. We had some downtime in an upstream cluster, which caused data to build up. The resulting burst in message throughput caused one of our containers to OOME (the container was buffering input messages in memory, and the buffer grew very large very quickly). The OOME occurred on the BrokerProxy thread. Once this happened, no new messages were consumed, and the rest of the container just sat idle. Only one of the 64 containers in the job failed this way.
> A good first step in fixing this problem would be to have the container recognize that an OOME has occurred, and kill itself with a non-zero exit code. This would cause the AM to restart it, which would in turn un-wedge the process.
> There are longer-term solutions that would help prevent this specific issue, but in general an OOME should never leave a process wedged.
> I poked around a bit. The problem is that an OOME on the BrokerProxy thread (or any thread) [just causes that thread to die|http://stackoverflow.com/questions/10327989/does-jvm-terminate-itself-after-outofmemoryerror]. The rest of the process continues running. There seem to be two solutions to this problem, if you wish your process to die when an OOME occurs:
> # Use [-XX:OnOutOfMemoryError|http://stackoverflow.com/questions/3871278/how-do-i-make-the-jvm-exit-on-any-outofmemoryexception-even-when-bad-people-try] to execute a kill -9 on the process.
> # Use [Thread.setDefaultUncaughtExceptionHandler|http://www.javamex.com/tutorials/exceptions/exceptions_uncaught_handler.shtml] to catch OOME from all threads, and execute a System.exit().
> (2) seems more appealing to me. Is anyone else aware of any other ways to handle this?
> The proposed behavior is that no thread should ever have an uncaught exception. If one occurs, the entire container should exit.
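
A minimal sketch of option (2) from the description above, for illustration only. The class name `ExitOnUncaught`, the exit code value, and the injectable exit action are all hypothetical and are not taken from SAMZA-507-0.patch; only `Thread.setDefaultUncaughtExceptionHandler` and `System.exit` come from the issue text.

```java
import java.util.function.IntConsumer;

// Hypothetical helper: not the code from the attached patch.
final class ExitOnUncaught {
    // Assumed exit code; any non-zero value signals failure to the parent process.
    static final int EXIT_CODE = 1;

    // Installs a JVM-wide handler that runs when any thread dies from an
    // uncaught Throwable (including OutOfMemoryError).
    // The exit action is a parameter so the behavior can be exercised
    // without actually terminating the JVM.
    static void install(IntConsumer exit) {
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            try {
                System.err.println("Uncaught " + error + " on thread " + thread.getName());
            } finally {
                // Exit even if logging itself fails (e.g. no memory left to build the message).
                exit.accept(EXIT_CODE);
            }
        });
    }

    // Production variant: terminate the process with a non-zero exit code.
    static void install() {
        install(System::exit);
    }
}
```

Calling `ExitOnUncaught.install()` once, early in the process's main method, would cover every thread that does not set its own handler. For comparison, option (1) is usually written as the HotSpot flag `-XX:OnOutOfMemoryError="kill -9 %p"`, which only reacts to OOME rather than to every uncaught exception.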
