Posted to yarn-issues@hadoop.apache.org by "Daniel Templeton (JIRA)" <ji...@apache.org> on 2017/03/04 18:57:46 UTC

[jira] [Updated] (YARN-3742) YARN RM will shut down if ZKClient creation times out

     [ https://issues.apache.org/jira/browse/YARN-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Templeton updated YARN-3742:
-----------------------------------
    Attachment: YARN-3742.001.patch

Hmmm...  This is a little more complicated than I thought.  Looking at YARN-2759 and YARN-2814, it looks like the original plan was to remove the {{RMFatalEventDispatcher}} entirely, in favor of distributing the job of transitioning the RM or shutting it down to all the various places where that could be triggered.  I disagree, though.  I think it makes a lot more sense to have the RM be in control of its own destiny and force other services to trigger the RM's state change through the dispatcher.

The original issue that YARN-2759 was trying to solve was a deadlock caused by the {{RMFatalEventDispatcher}} trying to transition to standby.  The solution was to move the transition into a thread *and* move that thread into the state store.  Just moving the transition into a thread would have been enough to solve the problem.  I've attached a patch that restores the {{RMFatalEventDispatcher}} to its original role in a way that doesn't cause a deadlock.  All tests from YARN-2759 and subsequent patches pass, so I think we're safe.  As an added bonus, the RM will no longer die spontaneously just because something unexpected happened.
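
To illustrate why moving the transition into a thread is enough, here is a minimal, self-contained sketch. It is not the attached patch and does not use any Hadoop classes: {{FatalEventSketch}}, {{reportFatal}}, {{handleFatal}} and {{transitionToStandby}} are hypothetical names, and a single-threaded {{ExecutorService}} stands in for the RM's dispatcher. The assumption is that the transition has to stop the dispatcher and wait for it to drain, which can never complete if the wait happens on the dispatcher thread itself.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; none of these names come from Hadoop or the patch.
class FatalEventSketch {
    // Stand-in for the RM dispatcher: one thread delivers all events.
    private final ExecutorService dispatcher = Executors.newSingleThreadExecutor();
    private final CountDownLatch standbyReached = new CountDownLatch(1);
    private final AtomicBoolean active = new AtomicBoolean(true);

    // Other services call this; they only queue the event and never
    // change the RM's state themselves.
    public void reportFatal(String cause) {
        dispatcher.execute(() -> handleFatal(cause));
    }

    // Runs on the dispatcher thread. It hands the transition to a new
    // thread and returns, so the dispatcher thread is free to finish.
    private void handleFatal(String cause) {
        Thread t = new Thread(this::transitionToStandby, "standby-transition");
        t.setDaemon(true);
        t.start();
    }

    // Assumed shape of the transition: stop the dispatcher, wait for it
    // to drain, then mark the RM as standby. Called directly from
    // handleFatal, the wait below would be the dispatcher thread
    // waiting for itself, and it would time out.
    private void transitionToStandby() {
        try {
            dispatcher.shutdown();
            if (dispatcher.awaitTermination(5, TimeUnit.SECONDS)) {
                active.set(false);
                standbyReached.countDown();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Returns true once the standby transition has completed.
    public boolean awaitStandby(long timeoutMs) {
        try {
            return standbyReached.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public boolean isActive() {
        return active.get();
    }

    public static void main(String[] args) {
        FatalEventSketch rm = new FatalEventSketch();
        rm.reportFatal("Wait for ZKClient creation timed out");
        System.out.println("standby reached: " + rm.awaitStandby(10_000));
        System.out.println("still active: " + rm.isActive());
    }
}
```

In this sketch the fatal event still flows through a single handler, so the decision about what happens to the RM stays in one place; only the blocking part of the transition leaves the dispatcher thread.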

[~kasha], as the author of YARN-2814, I would love to hear your feedback.

> YARN RM  will shut down if ZKClient creation times out 
> -------------------------------------------------------
>
>                 Key: YARN-3742
>                 URL: https://issues.apache.org/jira/browse/YARN-3742
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.7.0
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Daniel Templeton
>         Attachments: YARN-3742.001.patch
>
>
> The RM goes down showing the following stacktrace if the ZK client connection fails to be created. We should not exit but transition to StandBy and stop doing things and let the other RM take over.
> {code}
> 2015-04-19 01:22:20,513  FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
> java.io.IOException: Wait for ZKClient creation timed out
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1066)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1090)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:996)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationStateInternal(ZKRMStateStore.java:643)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppTransition.transition(RMStateStore.java:162)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppTransition.transition(RMStateStore.java:147)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:806)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:879)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:874)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
> 	at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
