Posted to yarn-issues@hadoop.apache.org by "Tarun Parimi (JIRA)" <ji...@apache.org> on 2019/07/31 07:54:00 UTC
[jira] [Updated] (YARN-9712) ResourceManager goes into a deadlock while transitioning to standby
[ https://issues.apache.org/jira/browse/YARN-9712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tarun Parimi updated YARN-9712:
-------------------------------
Attachment: YARN-9712.001.patch
> ResourceManager goes into a deadlock while transitioning to standby
> -------------------------------------------------------------------
>
> Key: YARN-9712
> URL: https://issues.apache.org/jira/browse/YARN-9712
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager, RM
> Affects Versions: 2.9.0
> Reporter: Tarun Parimi
> Assignee: Tarun Parimi
> Priority: Major
> Attachments: YARN-9712.001.patch
>
>
> We have observed the RM go into a deadlock while transitioning to standby in a heavily loaded production cluster, one that occasionally loses its ZooKeeper session and also serves a large volume of RMDelegationToken requests from Oozie jobs.
> On analyzing the jstack and the logs, this seems to happen when the below sequence of events occurs.
> 1. The ZooKeeper session is lost, so the ActiveStandbyElector service invokes transitionToStandby. Since transitionToStandby is a synchronized method, it acquires the monitor of the ResourceManager instance.
> {code:java}
> 2019-07-25 14:31:24,497 INFO ha.ActiveStandbyElector (ActiveStandbyElector.java:processWatchEvent(621)) - Session expired. Entering neutral mode and rejoining...
> 2019-07-25 14:31:28,084 INFO resourcemanager.ResourceManager (ResourceManager.java:transitionToStandby(1134)) - Transitioning to standby state
> {code}
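The lock interaction in step 1 can be illustrated with a minimal, self-contained sketch (the class and method names below are hypothetical stand-ins, not the actual ResourceManager code): all synchronized instance methods of an object contend for that object's single monitor, which is why transitionToStandby and any other synchronized method on the ResourceManager cannot run concurrently on different threads.

```java
// Hypothetical sketch: two synchronized instance methods share one monitor.
// The same thread can re-enter the monitor, but any other thread calling
// either method would block until the monitor is released.
public class MonitorSketch {
    boolean monitorHeldInInner;

    synchronized void transitionToStandbyLike() {
        // The calling thread now owns the MonitorSketch monitor.
        getConfigLike();  // re-entrant acquisition by the same thread
    }

    synchronized void getConfigLike() {
        // Records that the caller indeed holds this object's monitor here.
        monitorHeldInInner = Thread.holdsLock(this);
    }

    public static void main(String[] args) {
        MonitorSketch s = new MonitorSketch();
        s.transitionToStandbyLike();
        System.out.println("monitor held inside inner method: " + s.monitorHeldInInner);
    }
}
```

A second thread calling getConfigLike() while the first is inside transitionToStandbyLike() would block, which is exactly the contention described in the following steps.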
> 2. While transitioning to standby, a java.lang.InterruptedException occurs in RMStateStore while removing/storing an RMDelegationToken. This happens because RMSecretManagerService is stopped during the transition to standby.
> {code:java}
> 2019-07-25 14:31:28,576 ERROR recovery.RMStateStore (RMStateStore.java:transition(373)) - Error While Removing RMDelegationToken and SequenceNumber
> java.lang.InterruptedException
> 2019-07-25 14:31:28,576 ERROR recovery.RMStateStore (RMStateStore.java:notifyStoreOperationFailedInternal(992)) - State store operation failed
> java.lang.InterruptedException
> {code}
> 3. When the state store error occurs, an RMFatalEvent of type STATE_STORE_FENCED is sent.
> {code:java}
> 2019-07-25 14:31:28,579 ERROR resourcemanager.ResourceManager (ResourceManager.java:handle(767)) - Received RMFatalEvent of type STATE_STORE_FENCED, caused by java.lang.InterruptedException
> {code}
> 4. The problem occurs when RMFatalEventDispatcher calls getConfig(). Since getConfig() is also a synchronized method (on AbstractService, whose monitor is the ResourceManager instance itself), it needs the same lock, so the rmDispatcher eventHandlingThread becomes blocked.
> {code:java}
> private class RMFatalEventDispatcher implements EventHandler<RMFatalEvent> {
>   @Override
>   public void handle(RMFatalEvent event) {
>     LOG.error("Received " + event);
>     if (HAUtil.isHAEnabled(getConfig())) {
>       // If we're in an HA config, the right answer is always to go into
>       // standby.
>       LOG.warn("Transitioning the resource manager to standby.");
>       handleTransitionToStandByInNewThread();
>     }
>   }
> }
> {code}
> 5. transitionToStandby then waits forever: it stops the rmDispatcher, and AsyncDispatcher.serviceStop() joins the eventHandlingThread, which is itself blocked waiting for the ResourceManager monitor that transitionToStandby holds. This causes a deadlock, and the RM will not become active until it is restarted.
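The cycle in steps 4 and 5 can be reproduced in miniature (a hypothetical sketch, not the RM code): one thread holds a monitor and joins a second thread, while that second thread is blocked trying to enter the same monitor. The sketch joins with a timeout so it terminates; the real RM joins with no timeout and therefore hangs forever at this point.

```java
// Hypothetical reproduction of the deadlock shape described above.
public class DeadlockSketch {
    // Stands in for the ResourceManager monitor that transitionToStandby holds.
    private final Object monitor = new Object();

    Thread.State run() throws InterruptedException {
        Thread dispatcher = new Thread(() -> {
            // Like RMFatalEventDispatcher.handle() calling getConfig():
            // it must take the same monitor before it can make progress.
            synchronized (monitor) { }
        }, "dispatcher");

        Thread.State observed;
        synchronized (monitor) {      // like the synchronized transitionToStandby()
            dispatcher.start();
            Thread.sleep(200);        // let the dispatcher reach the lock and block
            dispatcher.join(500);     // like AsyncDispatcher.serviceStop() joining the
                                      // event-handling thread; the RM joins with no
                                      // timeout here, which is the hang
            observed = dispatcher.getState();  // BLOCKED on the held monitor
        }
        dispatcher.join();            // monitor released; the dispatcher can finish
        return observed;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("dispatcher state during timed join: " + new DeadlockSketch().run());
    }
}
```

The timed join returns with the dispatcher still in state BLOCKED, mirroring the "waiting to lock" frame in the jstack below.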
> Below are the relevant threads in the jstack captured.
> The transitionToStandby thread, which waits forever:
> {code:java}
> "main-EventThread" #138239 daemon prio=5 os_prio=0 tid=0x00007fea473b2800 nid=0x2f411 in Object.wait() [0x00007fda5bef5000]
> java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Thread.join(Thread.java:1245)
> - locked <0x00007fdb6c5059a0> (a java.lang.Thread)
> at java.lang.Thread.join(Thread.java:1319)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.serviceStop(AsyncDispatcher.java:161)
> at org.apache.hadoop.service.AbstractService.stop(AbstractService.java:221)
> - locked <0x00007fdb6c538ca0> (a java.lang.Object)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.resetRMContext(ResourceManager.java:1323)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.reinitialize(ResourceManager.java:1091)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.transitionToStandby(ResourceManager.java:1139)
> - locked <0x00007fdb33e418f0> (a org.apache.hadoop.yarn.server.resourcemanager.ResourceManager)
> at org.apache.hadoop.yarn.server.resourcemanager.AdminService.transitionToStandby(AdminService.java:355)
> - locked <0x00007fdb33e41828> (a org.apache.hadoop.yarn.server.resourcemanager.AdminService)
> at org.apache.hadoop.yarn.server.resourcemanager.EmbeddedElectorService.becomeStandby(EmbeddedElectorService.java:147)
> at org.apache.hadoop.ha.ActiveStandbyElector.becomeStandby(ActiveStandbyElector.java:970)
> at org.apache.hadoop.ha.ActiveStandbyElector.processResult(ActiveStandbyElector.java:480)
> - locked <0x00007fdb33e7bb88> (a org.apache.hadoop.ha.ActiveStandbyElector)
> at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:617)
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
> Locked ownable synchronizers:
> - None
> {code}
> The blocked rmDispatcher event-handling thread:
> {code:java}
> "AsyncDispatcher event handler" #135565 daemon prio=5 os_prio=0 tid=0x00007fdb2107f000 nid=0x2484a waiting for monitor entry [0x00007fda597cc000]
> java.lang.Thread.State: BLOCKED (on object monitor)
> at org.apache.hadoop.service.AbstractService.getConfig(AbstractService.java:403)
> - waiting to lock <0x00007fdb33e418f0> (a org.apache.hadoop.yarn.server.resourcemanager.ResourceManager)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:769)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMFatalEventDispatcher.handle(ResourceManager.java:764)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:184)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:110)
> at java.lang.Thread.run(Thread.java:745)
> Locked ownable synchronizers:
> - None
> {code}
> This scenario occurs only with the changes introduced in YARN-3742, where RMFatalEventDispatcher handles ERROR scenarios such as STATE_STORE_FENCED and attempts a transitionToStandby.
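One conceivable direction for a fix (a sketch under assumptions, not necessarily what the attached YARN-9712.001.patch does) is to evaluate the HA flag outside the event-handling path, so handle() never needs to call the synchronized getConfig() and thus never contends for the ResourceManager monitor. The class below is hypothetical and only illustrates that idea:

```java
// Hypothetical sketch: capture the HA flag once when the dispatcher is
// created, instead of calling the synchronized getConfig() from the
// event-handling thread on every fatal event.
public class FatalEventHandlerSketch {
    private final boolean haEnabled;

    public FatalEventHandlerSketch(boolean haEnabled) {
        // Evaluated once, outside any event-handling path, so handle()
        // below never takes the service monitor.
        this.haEnabled = haEnabled;
    }

    public String handle(String event) {
        // Mirrors the branch in RMFatalEventDispatcher.handle(), minus
        // the lock acquisition.
        return haEnabled ? "transition-to-standby" : "exit";
    }

    public static void main(String[] args) {
        FatalEventHandlerSketch h = new FatalEventHandlerSketch(true);
        System.out.println("STATE_STORE_FENCED -> " + h.handle("STATE_STORE_FENCED"));
    }
}
```

With the flag cached, the event-handling thread can never block on the ResourceManager monitor, which breaks the cycle described in steps 4 and 5.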
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)