Posted to commits@helix.apache.org by "Hudson (JIRA)" <ji...@apache.org> on 2013/10/04 21:02:42 UTC
[jira] [Commented] (HELIX-264) fix zkclient#close() bug
[ https://issues.apache.org/jira/browse/HELIX-264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13786503#comment-13786503 ]
Hudson commented on HELIX-264:
------------------------------
FAILURE: Integrated in helix #1188 (See [https://builds.apache.org/job/helix/1188/])
[HELIX-264] fix zkclient#close() bug, rb=14483 (zzhang: rev b9fe738797cd5228e8ecaa284c8874bfa19f1ff2)
* helix-core/src/test/java/org/apache/helix/manager/zk/TestZkFlapping.java
* helix-core/src/test/java/org/apache/helix/ZkTestHelper.java
* helix-core/src/main/java/org/apache/helix/manager/zk/ZkClient.java
> fix zkclient#close() bug
> ------------------------
>
> Key: HELIX-264
> URL: https://issues.apache.org/jira/browse/HELIX-264
> Project: Apache Helix
> Issue Type: Bug
> Reporter: Zhen Zhang
> Assignee: Zhen Zhang
> Priority: Critical
>
> When flapping is detected, we are in the zkclient event thread context, and we call zkclient.close() from that same event thread. Here is ZkClient#close():
> public void close() throws ZkInterruptedException {
>   if (_connection == null) {
>     return;
>   }
>   LOG.debug("Closing ZkClient...");
>   getEventLock().lock();
>   try {
>     setShutdownTrigger(true);
>     _eventThread.interrupt();
>     _eventThread.join(2000);
>     _connection.close();
>     _connection = null;
>   } catch (InterruptedException e) {
>     throw new ZkInterruptedException(e);
>   } finally {
>     getEventLock().unlock();
>   }
>   LOG.debug("Closing ZkClient...done");
> }
> _eventThread.interrupt(); <-- sets the interrupt status of _eventThread, which is in fact the current thread.
> _eventThread.join(2000); <-- throws InterruptedException because the current thread has been interrupted.
> _connection.close(); <-- SKIPPED!!!
> So if flapping happens, we call ZkHelixManager#disconnectInternal(), which always interrupts ZkClient#_eventThread but never closes the zk connection. This is probably a zkclient bug, and it means we should never call zkclient.close() from its own event thread context.
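The interrupt-then-join sequence described above can be reproduced outside of Helix. The following is a minimal sketch using only java.lang.Thread; the class name SelfInterruptJoinDemo, the method closeFromOwnThread(), and the boolean flag standing in for _connection.close() are hypothetical and are not part of ZkClient.

```java
class SelfInterruptJoinDemo {

    /**
     * Runs the interrupt -> join -> "close" sequence on a thread that targets
     * itself, and reports whether the "close" step was reached.
     */
    static boolean closeFromOwnThread() {
        final boolean[] connectionClosed = {false};
        final Thread[] self = new Thread[1];

        Thread eventThread = new Thread(() -> {
            try {
                // Stands in for _eventThread.interrupt(): the target is the current thread.
                self[0].interrupt();
                // Stands in for _eventThread.join(2000): the interrupt status is already
                // set, so this throws InterruptedException instead of waiting.
                self[0].join(2000);
                // Stands in for _connection.close(); not reached when join() throws.
                connectionClosed[0] = true;
            } catch (InterruptedException e) {
                // ZkClient#close() wraps this in ZkInterruptedException; here we just return.
            }
        }, "demo-event-thread");

        self[0] = eventThread;
        eventThread.start();
        try {
            // Wait for the demo thread so its write to connectionClosed is visible here.
            eventThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for demo thread", e);
        }
        return connectionClosed[0];
    }

    public static void main(String[] args) {
        System.out.println("connection closed: " + closeFromOwnThread());
    }
}
```

Running main prints "connection closed: false", which matches the SKIPPED step in the annotated lines.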
> Fix steps:
> 1) work around this bug
> 2) add test cases for flapping detection
> 3) explore the possibility of having the controller detect flapping participants and disable them (maybe by querying zk-server JMX metrics)
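As an illustration of what a workaround for step 1 could look like, here is a minimal sketch that skips the interrupt and join when close() is invoked on the event thread itself. This is an assumption about one possible approach, not necessarily what the committed change does. The class GuardedCloseSketch, its event queue, and the AtomicBoolean standing in for the zk connection are all hypothetical.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicBoolean;

class GuardedCloseSketch {
    // Stands in for the zk connection; true means "still open".
    private final AtomicBoolean connectionOpen = new AtomicBoolean(true);
    private final BlockingQueue<Runnable> events = new LinkedBlockingQueue<>();
    private volatile boolean shutdown = false;
    private final Thread eventThread = new Thread(this::runEvents, "sketch-event-thread");

    GuardedCloseSketch() {
        eventThread.start();
    }

    private void runEvents() {
        try {
            while (!shutdown) {
                events.take().run();
            }
        } catch (InterruptedException e) {
            // close() was called from another thread; leave the loop.
        }
    }

    void submit(Runnable event) {
        events.add(event);
    }

    void close() {
        if (!connectionOpen.get()) {
            return;
        }
        shutdown = true;
        // Only interrupt and join when the caller is NOT the event thread;
        // a thread that interrupts itself makes its own join() throw.
        if (Thread.currentThread() != eventThread) {
            eventThread.interrupt();
            try {
                eventThread.join(2000);
            } catch (InterruptedException e) {
                // Sketch choice: keep the caller's interrupt status and still close below.
                Thread.currentThread().interrupt();
            }
        }
        connectionOpen.set(false); // stands in for _connection.close()
    }

    boolean isConnectionOpen() {
        return connectionOpen.get();
    }

    /** Calls close() from the event thread and reports whether the connection got closed. */
    static boolean closeFromEventThreadClosesConnection() {
        GuardedCloseSketch client = new GuardedCloseSketch();
        client.submit(client::close);
        try {
            client.eventThread.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !client.isConnectionOpen();
    }
}
```

With the guard in place, the event loop exits on its own after the shutdown flag is set, so no join is needed on that path. Unlike ZkClient#close(), this sketch does not rethrow when the caller is interrupted during join.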
--
This message was sent by Atlassian JIRA
(v6.1#6144)