Posted to yarn-issues@hadoop.apache.org by "zhihai xu (JIRA)" <ji...@apache.org> on 2015/03/22 22:50:10 UTC

[jira] [Commented] (YARN-3385) Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion.

    [ https://issues.apache.org/jira/browse/YARN-3385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375185#comment-14375185 ] 

zhihai xu commented on YARN-3385:
---------------------------------

I uploaded a patch, YARN-3385.000.patch, for review. The patch handles NoNodeException for both Op.delete and zkClient.delete, and optimizes removeRMDelegationTokenState to skip the ZK delete operation if the node doesn't exist.

Without the patch, the test fails with the following message:
{code}
-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Running org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore
Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 7.853 sec <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore
testRMAppDeleteNoNodeException(org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore)  Time elapsed: 1.253 sec  <<< FAILURE!
java.lang.AssertionError: NoNodeException should not happen.
	at org.junit.Assert.fail(Assert.java:88)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore.testRMAppDeleteNoNodeException(TestZKRMStateStore.java:405)
Results :
Failed tests: 
  TestZKRMStateStore.testRMAppDeleteNoNodeException:405 NoNodeException should not happen.
Tests run: 5, Failures: 1, Errors: 0, Skipped: 0

org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
	at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:949)
	at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:915)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:920)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:916)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1080)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1101)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:916)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:928)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:697)
	at org.apache.hadoop.yarn.server.resourcemanager.recovery.TestZKRMStateStore.testRMAppDelete(TestZKRMStateStore.java:401)
{code}
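
For anyone reviewing, here is a minimal, self-contained sketch of the general idea of a delete that tolerates a missing node. It is not the code in YARN-3385.000.patch and does not use the real ZooKeeper client: FakeStore, its NoNodeException, deleteIgnoringNoNode, and the path /rmstore/app_1 are all hypothetical stand-ins made up for illustration. It only shows the two behaviors described above: skip the delete when the node is already absent, and treat a NoNodeException raised by a concurrent removal as success rather than a fatal error.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class TolerantDeleteSketch {

    /** Hypothetical stand-in for KeeperException.NoNodeException. */
    static class NoNodeException extends Exception {
        NoNodeException(String path) {
            super("KeeperErrorCode = NoNode for " + path);
        }
    }

    /** Hypothetical in-memory stand-in for a ZK client; not the real API. */
    static class FakeStore {
        private final Set<String> nodes = ConcurrentHashMap.newKeySet();

        void create(String path) {
            nodes.add(path);
        }

        boolean exists(String path) {
            return nodes.contains(path);
        }

        // Like a ZK delete, this fails if the node is already gone.
        void delete(String path) throws NoNodeException {
            if (!nodes.remove(path)) {
                throw new NoNodeException(path);
            }
        }
    }

    /**
     * Deletes the node at path, treating "already gone" as success.
     * Returns true if this call removed the node, false if it was absent.
     */
    static boolean deleteIgnoringNoNode(FakeStore store, String path) {
        if (!store.exists(path)) {
            return false; // nothing to do, so skip the delete entirely
        }
        try {
            store.delete(path);
            return true;
        } catch (NoNodeException e) {
            // The node vanished between exists() and delete().
            // The desired end state (no node) already holds, so do not fail.
            return false;
        }
    }

    public static void main(String[] args) {
        FakeStore store = new FakeStore();
        store.create("/rmstore/app_1"); // hypothetical path
        System.out.println(deleteIgnoringNoNode(store, "/rmstore/app_1")); // prints "true"
        System.out.println(deleteIgnoringNoNode(store, "/rmstore/app_1")); // prints "false"
    }
}
```

The exists() check alone is not sufficient, because another actor can remove the node between the check and the delete, which is why the catch block is still needed in this sketch.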

> Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion.
> -----------------------------------------------------------------------------------------------
>
>                 Key: YARN-3385
>                 URL: https://issues.apache.org/jira/browse/YARN-3385
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>            Priority: Critical
>         Attachments: YARN-3385.000.patch
>
>
> Race condition: KeeperException$NoNodeException will cause RM shutdown during ZK node deletion (Op.delete).
> The race condition is similar to YARN-2721 and YARN-3023.
> Since the race condition exists for ZK node creation, it should also exist for ZK node deletion.
> We see this issue with the following stack trace:
> {code}
> 2015-03-17 19:18:58,958 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:111)
> 	at org.apache.zookeeper.ZooKeeper.multiInternal(ZooKeeper.java:945)
> 	at org.apache.zookeeper.ZooKeeper.multi(ZooKeeper.java:911)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:857)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$4.run(ZKRMStateStore.java:854)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:973)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:992)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.doMultiWithRetries(ZKRMStateStore.java:854)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.removeApplicationStateInternal(ZKRMStateStore.java:647)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:691)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:766)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:761)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
> 	at java.lang.Thread.run(Thread.java:745)
> 2015-03-17 19:18:58,959 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
> {code}


