Posted to yarn-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2013/11/21 16:21:35 UTC

[jira] [Commented] (YARN-1430) InvalidStateTransition exceptions are ignored in state machines

    [ https://issues.apache.org/jira/browse/YARN-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13829018#comment-13829018 ] 

Jason Lowe commented on YARN-1430:
----------------------------------

Before flipping the switch to change this, we need to carefully consider the consequences.  I'm all for making this a fatal error for unit tests, but I'm not convinced this is a good thing for production environments.

We have been running in production for quite some time now (0.23 instead of 2.x, but the code is very similar in many of these areas).  We've seen invalid state transitions logged on our production machines and have filed quite a few JIRAs related to those.  However, I was often thankful that the invalid state transition did not crash the daemon, because in the vast majority of these cases the system can continue to function in an "acceptable" manner.  Sure, we might leak some resources related to an application, fail to aggregate some logs, or something similar, but I'd rather take that pain with a potential workaround than the alternative of bringing down the entire cluster each and every time it occurs.

What I'm worried about here is a case where we don't see the error during testing, but when we deploy to production some critical, frequent job consistently triggers an unhandled transition.  If that's always fatal, we're stuck with a cluster that cannot stay up for long until we either scramble to develop and deploy a fix or roll back, and downtime is guaranteed each time it occurs.  In almost all of these cases the invalid transition is going to be localized to just one app, one container, or one node.  I'm not sure that kind of error is worth taking down an entire cluster outside of a testing setup.  I feel this is similar to how most software products handle asserts -- they are fatal during development but not during production.
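
To make the assert analogy concrete, here is a minimal sketch of what a switchable policy could look like.  Everything in it is hypothetical: the InvalidTransitionException class, the AppState enum, the AppStateMachine class and its failFast flag are stand-ins invented for illustration, not the actual YARN state machine classes or any existing configuration option.  The only point is that one code path can be fatal under test and log-and-continue in production.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the exception a state machine throws on an
// unhandled transition; this is not the real Hadoop class.
class InvalidTransitionException extends RuntimeException {
    InvalidTransitionException(String message) {
        super(message);
    }
}

// Illustrative states for a single application.
enum AppState { NEW, RUNNING, FINISHED }

class AppStateMachine {
    // Assumed policy switch: true for unit tests, false for production.
    private final boolean failFast;
    private final Map<AppState, Set<AppState>> allowed =
        new EnumMap<>(AppState.class);
    private AppState current = AppState.NEW;
    private int ignoredTransitions = 0;

    AppStateMachine(boolean failFast) {
        this.failFast = failFast;
        allowed.put(AppState.NEW, EnumSet.of(AppState.RUNNING));
        allowed.put(AppState.RUNNING, EnumSet.of(AppState.FINISHED));
        allowed.put(AppState.FINISHED, EnumSet.noneOf(AppState.class));
    }

    // Applies the transition if it is legal. An illegal one is either
    // fatal (failFast) or logged and skipped, leaving the state unchanged.
    void transitionTo(AppState next) {
        if (allowed.get(current).contains(next)) {
            current = next;
            return;
        }
        String message = "Invalid transition " + current + " -> " + next;
        if (failFast) {
            throw new InvalidTransitionException(message);
        }
        // Production path: the damage stays local to this one state machine.
        ignoredTransitions++;
        System.err.println("ERROR: " + message);
    }

    AppState getState() {
        return current;
    }

    int getIgnoredTransitions() {
        return ignoredTransitions;
    }
}

class TransitionPolicyDemo {
    public static void main(String[] args) {
        // Production-style machine: the bad event is logged and skipped.
        AppStateMachine lenient = new AppStateMachine(false);
        lenient.transitionTo(AppState.FINISHED);
        System.out.println("lenient state: " + lenient.getState());

        // Test-style machine: the same event is fatal.
        AppStateMachine strict = new AppStateMachine(true);
        try {
            strict.transitionTo(AppState.FINISHED);
        } catch (InvalidTransitionException e) {
            System.out.println("strict failed: " + e.getMessage());
        }
    }
}
```

In the lenient mode the machine stays in NEW and counts one ignored transition, so the problem is still visible in logs and counters; in the strict mode the same event throws, which is what we would want a unit test to trip over.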

> InvalidStateTransition exceptions are ignored in state machines
> ---------------------------------------------------------------
>
>                 Key: YARN-1430
>                 URL: https://issues.apache.org/jira/browse/YARN-1430
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Omkar Vinit Joshi
>            Assignee: Omkar Vinit Joshi
>
> We have all state machines ignoring InvalidStateTransitions. These exceptions will get logged but will not crash the RM / NM. We definitely should crash it as they move the system into some invalid / unacceptable state.
> * Places where we hide this exception :-
> ** JobImpl
> ** TaskAttemptImpl
> ** TaskImpl
> ** NMClientAsyncImpl
> ** ApplicationImpl
> ** ContainerImpl
> ** LocalizedResource
> ** RMAppAttemptImpl
> ** RMAppImpl
> ** RMContainerImpl
> ** RMNodeImpl
> thoughts?


