Posted to yarn-issues@hadoop.apache.org by "zhihai xu (JIRA)" <ji...@apache.org> on 2015/01/09 04:55:34 UTC

[jira] [Created] (YARN-3023) Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM crash

zhihai xu created YARN-3023:
-------------------------------

             Summary: Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM crash 
                 Key: YARN-3023
                 URL: https://issues.apache.org/jira/browse/YARN-3023
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 2.6.0
            Reporter: zhihai xu
            Assignee: zhihai xu


A race condition in ZKRMStateStore#createWithRetries against ZooKeeper causes the RM to crash.

The sequence for the race condition is the following:
1. RM stores the attempt state to ZK by calling createWithRetries.
{code}
2015-01-06 12:37:35,343 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Storing attempt: AppId: application_1418914202950_42363 AttemptId: appattempt_1418914202950_42363_000001 MasterContainer: Container: [ContainerId: container_1418914202950_42363_01_000001,
{code}

2. Unluckily, a ConnectionLoss for the ZK session happened at the same time as the RM stored the attempt state to ZK.
The ZooKeeper server created the node and stored the data successfully, but due to the ConnectionLoss, the RM didn't know that the operation (createWithRetries) had succeeded.
{code}
2015-01-06 12:37:36,102 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
{code}

3. RM retried storing the attempt state to ZK after one second.
{code}
2015-01-06 12:37:36,104 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying operation on ZK. Retry no. 1
{code}

4. During the one-second interval, the ZK session was reconnected.
{code}
2015-01-06 12:37:36,210 INFO org.apache.zookeeper.ClientCnxn: Socket connection established initiating session
2015-01-06 12:37:36,213 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server, sessionid = 0x44a9166eb2d12cb, negotiated timeout = 10000
{code}

5. Because the node was created successfully at ZooKeeper in the first try (runWithCheck), the second try fails with a NodeExists KeeperException.
{code}
2015-01-06 12:37:37,116 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation.
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
2015-01-06 12:37:37,118 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up!
{code}

6. This NodeExists KeeperException causes storing the AppAttempt to fail in RMStateStore.
{code}
2015-01-06 12:37:37,118 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error storing appAttempt: appattempt_1418914202950_42363_000001
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
{code}

7. RMStateStore sends an RMFatalEventType.STATE_STORE_OP_FAILED event to the ResourceManager.
{code}
  protected void notifyStoreOperationFailed(Exception failureCause) {
    RMFatalEventType type;
    if (failureCause instanceof StoreFencedException) {
      type = RMFatalEventType.STATE_STORE_FENCED;
    } else {
      type = RMFatalEventType.STATE_STORE_OP_FAILED;
    }
    rmDispatcher.getEventHandler().handle(new RMFatalEvent(type, failureCause));
  }
{code}

8. ResourceManager kills itself after receiving the STATE_STORE_OP_FAILED RMFatalEvent.
{code}
2015-01-06 12:37:37,128 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause:
org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists
2015-01-06 12:37:37,138 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
{code}
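
The sequence above can be reproduced in isolation with a small in-memory model. This is only a sketch: FakeZk, createWithNaiveRetries and the two exception classes are hypothetical stand-ins, not the real ZooKeeper client or ZKRMStateStore code. It assumes the server applies the first create but the reply is lost, and that the retry loop retries only on ConnectionLoss.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory model of the race; not the real ZooKeeper or YARN API. */
class ZkCreateRaceDemo {

    // Stand-ins for the ZooKeeper KeeperException subclasses seen in the logs.
    // Unchecked here for brevity; the real KeeperException is a checked exception.
    static class ConnectionLossException extends RuntimeException {}
    static class NodeExistsException extends RuntimeException {}

    /** Fake server: the first create is applied, but its reply is lost. */
    static class FakeZk {
        final Map<String, byte[]> nodes = new HashMap<>();
        boolean dropNextReply = true;

        void create(String path, byte[] data) {
            if (nodes.containsKey(path)) {
                throw new NodeExistsException();
            }
            nodes.put(path, data); // the server-side write succeeds
            if (dropNextReply) {
                dropNextReply = false;
                throw new ConnectionLossException(); // the client never sees the ack
            }
        }
    }

    /** Naive retry: retries on ConnectionLoss, treats NodeExists as a failure. */
    static String createWithNaiveRetries(FakeZk zk, String path, byte[] data, int maxRetries) {
        for (int attempt = 0; ; attempt++) {
            try {
                zk.create(path, data);
                return "OK";
            } catch (ConnectionLossException e) {
                if (attempt >= maxRetries) {
                    return "FAILED: ConnectionLoss";
                }
                // A real client would wait for the retry interval here.
            } catch (NodeExistsException e) {
                return "FAILED: NodeExists";
            }
        }
    }

    public static void main(String[] args) {
        FakeZk zk = new FakeZk();
        String result = createWithNaiveRetries(zk, "/attempt_000001", new byte[] {1}, 1);
        System.out.println(result);
        System.out.println("node stored on server: " + zk.nodes.containsKey("/attempt_000001"));
    }
}
```

In this model the caller is told the store failed even though the node and its data are present on the server, which matches steps 2 to 6.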

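One possible way for a client to tolerate this sequence, shown only as a sketch and not necessarily the change that should go into ZKRMStateStore: remember that a ConnectionLoss was seen during the operation, and if a later retry fails with NodeExists, compare the stored data with what was being written before deciding it is an error. FakeZk, createTolerant and the exception classes below are hypothetical names used for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a create-with-retries that tolerates a lost ack. */
class ZkTolerantCreateSketch {

    // Unchecked stand-ins for the ZooKeeper KeeperException subclasses.
    static class ConnectionLossException extends RuntimeException {}
    static class NodeExistsException extends RuntimeException {}

    /** Fake server that can optionally lose the reply to the first create. */
    static class FakeZk {
        final Map<String, byte[]> nodes = new HashMap<>();
        boolean dropNextReply;

        FakeZk(boolean dropNextReply) {
            this.dropNextReply = dropNextReply;
        }

        void create(String path, byte[] data) {
            if (nodes.containsKey(path)) {
                throw new NodeExistsException();
            }
            nodes.put(path, data.clone());
            if (dropNextReply) {
                dropNextReply = false;
                throw new ConnectionLossException();
            }
        }

        byte[] getData(String path) {
            return nodes.get(path);
        }
    }

    /** Returns true once the node holds our data; rethrows other failures. */
    static boolean createTolerant(FakeZk zk, String path, byte[] data, int maxRetries) {
        boolean sawConnectionLoss = false;
        for (int attempt = 0; ; attempt++) {
            try {
                zk.create(path, data);
                return true;
            } catch (ConnectionLossException e) {
                if (attempt >= maxRetries) {
                    throw e;
                }
                sawConnectionLoss = true; // an earlier attempt may have been applied
            } catch (NodeExistsException e) {
                // Only accept NodeExists if we may have created the node ourselves
                // and the stored bytes match what we tried to write.
                if (sawConnectionLoss && Arrays.equals(zk.getData(path), data)) {
                    return true;
                }
                throw e;
            }
        }
    }

    public static void main(String[] args) {
        byte[] state = {1, 2, 3};
        System.out.println(createTolerant(new FakeZk(true), "/attempt_000001", state, 1));

        FakeZk clean = new FakeZk(false);
        clean.create("/attempt_000001", state);
        try {
            createTolerant(clean, "/attempt_000001", state, 1);
        } catch (NodeExistsException e) {
            System.out.println("genuine NodeExists is still reported");
        }
    }
}
```

The data comparison is a design choice in this sketch: it keeps a genuine duplicate create (no ConnectionLoss involved, or different contents) visible as an error instead of silently swallowing every NodeExists.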