Posted to yarn-issues@hadoop.apache.org by "zhihai xu (JIRA)" <ji...@apache.org> on 2015/02/22 06:20:11 UTC

[jira] [Commented] (YARN-3242) Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session.

    [ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14332030#comment-14332030 ] 

zhihai xu commented on YARN-3242:
---------------------------------

The following ZooKeeper client logs from the RM show this error:
{code}
// old session closed
2015-02-16 06:01:12,985 INFO org.apache.zookeeper.ZooKeeper: Session: 0x24b8df4044005d4 closed


// new session created and connected
2015-02-16 06:01:12,991 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete sessionid = 0x24b8df4044005d8, negotiated timeout = 10000
2015-02-16 06:01:12,994 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher event type: None with state:SyncConnected for path:null for Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED


// old session disconnected and EventThread shutdown
2015-02-16 06:01:12,995 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Watcher event type: None with state:Disconnected for path:null for Service org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore in state org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: STARTED
2015-02-16 06:01:12,995 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down

// Error: Wait for ZKClient creation timed out and RM shutdown
2015-02-16 06:01:13,095 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Storing application with id application_1424095053378_0010
2015-02-16 06:01:33,100 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error storing app: application_1424095053378_0010
java.io.IOException: Wait for ZKClient creation timed out
2015-02-16 06:01:33,107 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1
{code}

The following ZooKeeper server logs show that the new session 0x24b8df4044005d8 stayed connected until the RM shut down at 2015-02-16 06:01:33.
{code}
2015-02-16 06:01:12,991 INFO org.apache.zookeeper.server.ZooKeeperServer: Established session 0x24b8df4044005d8 with negotiated timeout 10000 for client 

2015-02-16 06:01:33,886 WARN org.apache.zookeeper.server.NIOServerCnxn: caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x24b8df4044005d8, likely client has closed socket
	at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
	at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
	at java.lang.Thread.run(Thread.java:744)
2015-02-16 06:01:33,888 INFO org.apache.zookeeper.server.NIOServerCnxn: Closed socket connection for client which had sessionid 0x24b8df4044005d8
{code}

> Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session.
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-3242
>                 URL: https://issues.apache.org/jira/browse/YARN-3242
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>            Priority: Critical
>
> Old ZK client session watcher event messed up new ZK client session due to ZooKeeper asynchronously closing client session.
> A watcher event from the old ZK client session can still be delivered to ZKRMStateStore after the old ZK client session is closed.
> This causes a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session.
> We only have one ZKRMStateStore, but we can have multiple ZK client sessions.
> Currently, ZKRMStateStore#processWatchEvent doesn't check whether a watcher event comes from the current session, so a watcher event from an old ZK client session that has just been closed will still be processed.
> For example, if a Disconnected event is received from the old session after the new session is connected, zkClient will be set to null:
> {code}
>         case Disconnected:
>           LOG.info("ZKRMStateStore Session disconnected");
>           oldZkClient = zkClient;
>           zkClient = null;
>           break;
> {code}
> After that, ZKRMStateStore won't receive a SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send another SyncConnected event until it is disconnected and connected again.
> As a result, all ZKRMStateStore operations fail with IOException "Wait for ZKClient creation timed out" until the RM shuts down.
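
The sequence described above can be modeled without ZooKeeper. The following is a minimal sketch, not the YARN code and not necessarily the fix adopted for this issue. The names Session, Store, startNewSession, replay and guardEnabled are hypothetical, and treating SyncConnected as "the active session becomes usable" is a simplifying assumption. It replays the event order seen in the client logs and shows one possible guard: drop any event whose source is not the current session.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative model only; hypothetical stand-ins, not ZooKeeper or YARN classes. */
final class StaleSessionGuardSketch {

    /** The two connection states discussed in the issue. */
    enum State { SYNC_CONNECTED, DISCONNECTED }

    /** Stand-in for one ZK client session; only object identity matters here. */
    static final class Session {
        final String id;
        Session(String id) { this.id = id; }
    }

    /** Stand-in for the single state store that outlives individual sessions. */
    static final class Store {
        private final boolean guardEnabled;
        private Session activeSession; // session the store created most recently
        private Session zkClient;      // null means "no usable client"
        final List<String> ignored = new ArrayList<>();

        Store(boolean guardEnabled) { this.guardEnabled = guardEnabled; }

        /** Called when the store replaces its session with a fresh one. */
        synchronized void startNewSession(Session session) {
            activeSession = session;
        }

        /** Handles a watcher event; source is the session that emitted it. */
        synchronized void processWatchEvent(Session source, State state) {
            if (guardEnabled && source != activeSession) {
                // Late event from a session that has been replaced: drop it.
                ignored.add(source.id + ":" + state);
                return;
            }
            switch (state) {
                case SYNC_CONNECTED:
                    zkClient = activeSession; // assumption: active session is usable
                    break;
                case DISCONNECTED:
                    zkClient = null;          // mirrors the snippet quoted in the issue
                    break;
            }
        }

        synchronized boolean hasUsableClient() { return zkClient != null; }
    }

    /** Replays the order from the logs; returns whether a client is still usable. */
    static boolean replay(boolean guardEnabled) {
        Session oldSession = new Session("0x24b8df4044005d4");
        Session newSession = new Session("0x24b8df4044005d8");
        Store store = new Store(guardEnabled);

        store.startNewSession(oldSession);
        store.processWatchEvent(oldSession, State.SYNC_CONNECTED);

        store.startNewSession(newSession);                         // old session closed
        store.processWatchEvent(newSession, State.SYNC_CONNECTED); // new session connected
        store.processWatchEvent(oldSession, State.DISCONNECTED);   // late event from old session
        return store.hasUsableClient();
    }

    public static void main(String[] args) {
        System.out.println("without guard: usable=" + replay(false));
        System.out.println("with guard:    usable=" + replay(true));
    }
}
```

Without the guard the store ends up with no usable client even though the new session is connected, which matches the timeout seen in the logs. With the guard the late Disconnected event is ignored and the new session stays in use.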


