Posted to commits@helix.apache.org by "Zhen Zhang (JIRA)" <ji...@apache.org> on 2013/10/02 20:23:44 UTC
[jira] [Created] (HELIX-264) fix zkclient#close() bug
Zhen Zhang created HELIX-264:
--------------------------------
Summary: fix zkclient#close() bug
Key: HELIX-264
URL: https://issues.apache.org/jira/browse/HELIX-264
Project: Apache Helix
Issue Type: Bug
Reporter: Zhen Zhang
Assignee: Zhen Zhang
Priority: Critical
When flapping is detected, we are running in the zkclient event thread, and we call zkclient.close() from that same thread. Here is ZkClient#close():
    public void close() throws ZkInterruptedException {
        if (_connection == null) {
            return;
        }
        LOG.debug("Closing ZkClient...");
        getEventLock().lock();
        try {
            setShutdownTrigger(true);
            _eventThread.interrupt();
            _eventThread.join(2000);
            _connection.close();
            _connection = null;
        } catch (InterruptedException e) {
            throw new ZkInterruptedException(e);
        } finally {
            getEventLock().unlock();
        }
        LOG.debug("Closing ZkClient...done");
    }
_eventThread.interrupt(); <-- sets the interrupt status of _eventThread, which is in fact the current thread.
_eventThread.join(2000); <-- throws InterruptedException immediately, because the current thread has already been interrupted.
_connection.close(); <-- SKIPPED!!!
So whenever flapping happens, we call ZkHelixManager#disconnectInternal(), which always interrupts ZkClient#_eventThread but never closes the zk connection. This is probably a zkclient bug, and it means we should never call zkclient.close() from its own event thread context.
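The failure can be reproduced without ZooKeeper at all. The following is a minimal sketch, not Helix or zkclient code: FakeConnection and the thread name are hypothetical stand-ins, and only the interrupt/join/close ordering is taken from ZkClient#close() above.

```java
import java.util.concurrent.atomic.AtomicBoolean;

final class SelfCloseDemo {
    /** Hypothetical stand-in for the zk connection; it only records whether close() ran. */
    static final class FakeConnection {
        final AtomicBoolean closed = new AtomicBoolean(false);
        void close() { closed.set(true); }
    }

    /**
     * Runs the interrupt/join/close sequence of ZkClient#close() on the
     * "event thread" itself and reports whether the connection got closed.
     */
    static boolean closeFromOwnThread() {
        FakeConnection connection = new FakeConnection();
        Thread eventThread = new Thread(() -> {
            Thread self = Thread.currentThread();
            try {
                self.interrupt();    // sets the interrupt status of the current thread
                self.join(2000);     // throws at once: the caller is already interrupted
                connection.close();  // never reached
            } catch (InterruptedException e) {
                // ZkClient#close() rethrows this as ZkInterruptedException
            }
        }, "fake-event-thread");
        eventThread.start();
        try {
            eventThread.join();      // wait from the main thread so the result is visible
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return connection.closed.get();
    }

    public static void main(String[] args) {
        System.out.println("connection closed: " + closeFromOwnThread());
    }
}
```

Running main prints "connection closed: false", which matches the SKIPPED step in the trace above.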
Fix steps:
1) work around this bug
2) add test cases for flapping detection
3) explore the possibility of having the controller detect flapping participants and disable them (maybe by querying zk-server JMX metrics)
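For step 1, one possible shape of a workaround is to make sure close() never runs on the event thread. The sketch below is only an illustration under that assumption, not the actual Helix fix: the Client interface, closeOffEventThread(), and the "closer-thread" name are all hypothetical, and a real ZkClient would also have to release the event lock before the helper thread could proceed.

```java
import java.util.concurrent.atomic.AtomicReference;

final class SafeCloseSketch {
    /** Hypothetical minimal view of a closable client; not the real ZkClient API. */
    interface Client { void close(); }

    /**
     * Calls client.close() directly unless the caller is the event thread,
     * in which case close() is handed off to a short-lived helper thread.
     * Returns the thread that runs close().
     */
    static Thread closeOffEventThread(Client client, Thread eventThread) {
        if (Thread.currentThread() != eventThread) {
            client.close();
            return Thread.currentThread();
        }
        Thread closer = new Thread(client::close, "closer-thread");
        closer.setDaemon(true);
        closer.start();
        return closer;
    }

    /** Invokes the helper from a fake event thread; returns the name of the thread that ran close(). */
    static String demo() {
        AtomicReference<String> closedOn = new AtomicReference<>();
        AtomicReference<Thread> closer = new AtomicReference<>();
        Client client = () -> closedOn.set(Thread.currentThread().getName());
        Thread eventThread = new Thread(
            () -> closer.set(closeOffEventThread(client, Thread.currentThread())),
            "fake-event-thread");
        eventThread.start();
        try {
            eventThread.join();   // the helper thread has been started by now
            closer.get().join();  // wait for close() to finish
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return closedOn.get();
    }

    public static void main(String[] args) {
        System.out.println("close() ran on: " + demo());
    }
}
```

With the hand-off in place, interrupt() and join() target a thread other than the caller, so the join can wait normally and the connection close is reached.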
--
This message was sent by Atlassian JIRA
(v6.1#6144)