Posted to yarn-issues@hadoop.apache.org by "Vinod Kumar Vavilapalli (JIRA)" <ji...@apache.org> on 2015/09/01 23:22:47 UTC
[jira] [Updated] (YARN-3242) Asynchrony in ZK-close can lead to
ZKRMStateStore watcher receiving events for old client
[ https://issues.apache.org/jira/browse/YARN-3242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinod Kumar Vavilapalli updated YARN-3242:
------------------------------------------
Fix Version/s: 2.6.1
Pulled this into 2.6.1. Ran compilation and TestZKRMStateStoreZKClientConnections before the push. Patch applied cleanly.
> Asynchrony in ZK-close can lead to ZKRMStateStore watcher receiving events for old client
> -----------------------------------------------------------------------------------------
>
> Key: YARN-3242
> URL: https://issues.apache.org/jira/browse/YARN-3242
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.6.0
> Reporter: zhihai xu
> Assignee: zhihai xu
> Priority: Critical
> Labels: 2.6.1-candidate
> Fix For: 2.7.0, 2.6.1
>
> Attachments: YARN-3242.000.patch, YARN-3242.001.patch, YARN-3242.002.patch, YARN-3242.003.patch, YARN-3242.004.patch
>
>
> Because ZooKeeper closes a client session asynchronously, a watcher event from the old ZK client session can interfere with the new ZK client session.
> A watcher event from the old ZK client session can still be delivered to ZKRMStateStore after the old session has been closed.
> This causes a serious problem: ZKRMStateStore gets out of sync with the ZooKeeper session.
> There is only one ZKRMStateStore, but there can be multiple ZK client sessions.
> Currently ZKRMStateStore#processWatchEvent doesn't check whether a watcher event comes from the current session, so a watcher event from an old ZK client session that has just been closed is still processed.
> For example, if a Disconnected event is received from the old session after the new session is connected, zkClient is set to null:
> {code}
> case Disconnected:
>   LOG.info("ZKRMStateStore Session disconnected");
>   oldZkClient = zkClient;
>   zkClient = null;
>   break;
> {code}
> ZKRMStateStore then never receives a SyncConnected event from the new session, because the new session is already in the SyncConnected state and won't send another SyncConnected event until it is disconnected and connected again.
> After that, all ZKRMStateStore operations fail with the IOException "Wait for ZKClient creation timed out" until the RM shuts down.
> The following code from ZooKeeper (ClientCnxn#EventThread) shows that even after receiving eventOfDeath, EventThread still processes all remaining events until the waitingEvents queue is empty.
> {code}
> while (true) {
>     Object event = waitingEvents.take();
>     if (event == eventOfDeath) {
>         wasKilled = true;
>     } else {
>         processEvent(event);
>     }
>     if (wasKilled)
>         synchronized (waitingEvents) {
>             if (waitingEvents.isEmpty()) {
>                 isRunning = false;
>                 break;
>             }
>         }
> }
> private void processEvent(Object event) {
>     try {
>         if (event instanceof WatcherSetEventPair) {
>             // each watcher will process the event
>             WatcherSetEventPair pair = (WatcherSetEventPair) event;
>             for (Watcher watcher : pair.watchers) {
>                 try {
>                     watcher.process(pair.event);
>                 } catch (Throwable t) {
>                     LOG.error("Error while calling watcher ", t);
>                 }
>             }
>         } else {
> ...
> public void disconnect() {
>     if (LOG.isDebugEnabled()) {
>         LOG.debug("Disconnecting client for session: 0x"
>             + Long.toHexString(getSessionId()));
>     }
>     sendThread.close();
>     eventThread.queueEventOfDeath();
> }
> public void close() throws IOException {
>     if (LOG.isDebugEnabled()) {
>         LOG.debug("Closing client for session: 0x"
>             + Long.toHexString(getSessionId()));
>     }
>     try {
>         RequestHeader h = new RequestHeader();
>         h.setType(ZooDefs.OpCode.closeSession);
>         submitRequest(h, null, null, null);
>     } catch (InterruptedException e) {
>         // ignore, close the send/event threads
>     } finally {
>         disconnect();
>     }
> }
> {code}
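The race described above can be reproduced without ZooKeeper. The following is a minimal sketch, not the attached patch: FakeClient, Store, drain and storeSurvivesStaleEvent are hypothetical stand-ins invented for illustration, and the order of events in the old queue is an assumption chosen to make the late delivery visible. It mirrors the quoted EventThread loop and shows one possible guard, ignoring events whose source is not the currently active client.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class StaleWatcherSketch {
    /** Hypothetical stand-in for a ZK client handle; only object identity matters. */
    static final class FakeClient { }

    enum State { SyncConnected, Disconnected }

    /** Marker object playing the role of eventOfDeath in the quoted loop. */
    static final Object EVENT_OF_DEATH = new Object();

    /** Hypothetical stand-in for the store: one store, several sessions over time. */
    static final class Store {
        private final boolean checkSource;
        private FakeClient activeClient; // session the store most recently created
        private FakeClient zkClient;     // null while the store believes it is disconnected

        Store(boolean checkSource) { this.checkSource = checkSource; }

        synchronized void startNewSession(FakeClient client) {
            activeClient = client;
            zkClient = client;
        }

        synchronized void processWatchEvent(FakeClient source, State state) {
            if (checkSource && source != activeClient) {
                return; // assumed guard: drop events from a closed or replaced session
            }
            switch (state) {
                case SyncConnected: zkClient = activeClient; break;
                case Disconnected:  zkClient = null;         break;
            }
        }

        synchronized FakeClient currentClient() { return zkClient; }
    }

    /** Same shape as the quoted loop: events still queued are delivered after the death marker. */
    static void drain(BlockingQueue<Object> waitingEvents, Store store, FakeClient source) {
        boolean wasKilled = false;
        try {
            while (true) {
                Object event = waitingEvents.take();
                if (event == EVENT_OF_DEATH) {
                    wasKilled = true;
                } else {
                    store.processWatchEvent(source, (State) event);
                }
                if (wasKilled && waitingEvents.isEmpty()) {
                    break;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    /** Returns true if the store still holds the new client after the old queue drains. */
    static boolean storeSurvivesStaleEvent(boolean checkSource) {
        FakeClient oldClient = new FakeClient();
        FakeClient newClient = new FakeClient();
        Store store = new Store(checkSource);
        store.startNewSession(oldClient);

        BlockingQueue<Object> oldQueue = new LinkedBlockingQueue<>();
        oldQueue.add(EVENT_OF_DEATH);      // old client has been closed ...
        oldQueue.add(State.Disconnected);  // ... but an event is still pending (assumed order)

        store.startNewSession(newClient);  // new session is already connected
        drain(oldQueue, store, oldClient); // late delivery from the old session
        return store.currentClient() == newClient;
    }

    public static void main(String[] args) {
        System.out.println("without source check: " + storeSurvivesStaleEvent(false));
        System.out.println("with source check:    " + storeSurvivesStaleEvent(true));
    }
}
```

Without the source check, the late Disconnected event leaves the store with a null client even though the new session is healthy, which matches the failure mode in the description. With the check, the stale event is ignored and the store keeps the new client.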