Posted to dev@curator.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2015/04/28 02:28:06 UTC

[jira] [Commented] (CURATOR-3) LeaderLatch race condition causing extra nodes to be added in Zookeeper

    [ https://issues.apache.org/jira/browse/CURATOR-3?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516035#comment-14516035 ] 

ASF GitHub Bot commented on CURATOR-3:
--------------------------------------

GitHub user wrmsr opened a pull request:

    https://github.com/apache/curator/pull/74

    LeaderSelector mutex resurrection

    I am not alone in seeing LeaderSelectors end up stuck indefinitely with no leader (relevant JIRA issues are listed below). This had been happening nearly daily in our AWS environments, which regularly experience transient network issues. I believe I have solved it. This may not be the most elegant approach but, at the very least, our deployments have behaved correctly since we enabled it.
    
    This branch allows an InterProcessMutex to optionally reuse an existing acquisition. This of course breaks the re-entrance contract stated by the InterProcessLock interface, but it is off by default and is used only by the LeaderSelector (which is the only thing I am interested in using it for). I have a test that reliably (though hackily) reproduces the issue, but it is written against an internal project and, as I am unfamiliar with your test code, I haven't ported it yet. All existing tests pass. The term "resurrect" probably isn't the best but hey, it works :p
    
    Issues possibly fixed by this branch:
    https://issues.apache.org/jira/browse/CURATOR-3
    https://issues.apache.org/jira/browse/CURATOR-171
    https://issues.apache.org/jira/browse/CURATOR-188
    https://issues.apache.org/jira/browse/CURATOR-202
    https://issues.apache.org/jira/browse/CURATOR-205

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wrmsr/curator mutex_resurrection

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/curator/pull/74.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #74
    
----
commit b06ded7b0c76da9e2e14b593599ba2eb9c0b8b72
Author: William Timoney <wt...@140-sfoengwifi55-160.clients.corp.yelpcorp.com>
Date:   2015-04-27T23:24:29Z

    mutex resurrection

commit 298982f6b2a2175ef568da20c7dd48f733ed4025
Author: William Timoney <wt...@140-sfoengwifi55-160.clients.corp.yelpcorp.com>
Date:   2015-04-28T00:06:41Z

    test fix

commit b21543b0788591ddbe5c64f47caa14dd8b9583a4
Author: William Timoney <wt...@140-sfoengwifi55-160.clients.corp.yelpcorp.com>
Date:   2015-04-28T00:08:49Z

    indent

----


> LeaderLatch race condition causing extra nodes to be added in Zookeeper
> -----------------------------------------------------------------------
>
>                 Key: CURATOR-3
>                 URL: https://issues.apache.org/jira/browse/CURATOR-3
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Recipes
>    Affects Versions: 2.0.0-incubating
>            Reporter: Jordan Zimmerman
>             Fix For: TBD
>
>
> From https://github.com/Netflix/curator/issues/265
> Looks like there's a race condition in LeaderLatch. If LeaderLatch.close() is called at the right time while the latch's watch handler is running, the latch will place another node in Zookeeper after the latch is closed.
> Basically how it happens is this:
> 1) I have two processes contesting a LeaderLatch, ProcessA and ProcessB. ProcessA is leader.
> 2) ProcessA loses leadership somehow (it releases, its connection goes down, etc.)
> 3) This triggers ProcessB's watch handler, which checks that the state is still STARTED and, if so, re-evaluates whether ProcessB's LeaderLatch is now leader.
> 4) While the watch handler is running, close() is called on ProcessB's LeaderLatch. This sets the LeaderLatch state to CLOSED, removes its znode from ZK, and shuts the LeaderLatch down.
> 5) The watch handler has already checked that the state is STARTED, so it does a getChildren() on the latch path and finds that the latch's znode is missing. It goes ahead and calls reset(), which places a new znode in Zookeeper.
> Result: The LeaderLatch is closed, but there is still a node in Zookeeper that isn't associated with any LeaderLatch and won't go away until the session goes down. Subsequent LeaderLatches at this path can never get leadership while that session is up.
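
The interleaving in steps 3 to 5 is a check-then-act race, and it can be sketched without any Curator or ZooKeeper dependency. The class below is a hypothetical stand-in, not Curator's LeaderLatch: an in-memory set plays the role of the latch path, and the names LatchRaceSketch, unsafeCheck, unsafeAct, guardedHandler and leakedNodes are assumptions made up for illustration. The guarded variant shows one possible mitigation (checking the state and acting under the same lock that close() takes); it is not claimed to be the fix in this pull request or in Curator.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a latch; NOT Curator's LeaderLatch.
final class LatchRaceSketch {
    enum State { STARTED, CLOSED }

    // In-memory stand-in for the children of the latch path in ZK.
    private final Set<String> znodes = new HashSet<>();
    private State state = State.STARTED;
    private int sequence = 0;
    private String ourNode;

    LatchRaceSketch() {
        reset();
    }

    // Creates a fresh participant node, like the reset() call in step 5.
    private void reset() {
        ourNode = "latch-" + (sequence++);
        znodes.add(ourNode);
    }

    // Step 4: mark the latch CLOSED and delete its node.
    synchronized void close() {
        state = State.CLOSED;
        znodes.remove(ourNode);
    }

    // Unguarded handler, split into its two halves so that close()
    // can be placed between them deterministically.
    boolean unsafeCheck() {
        return state == State.STARTED;
    }

    void unsafeAct() {
        if (!znodes.contains(ourNode)) {
            reset(); // our node is missing, so a new one is created
        }
    }

    // Guarded handler: the state check and reset() happen under the
    // same lock as close(), so close() cannot run in between.
    synchronized void guardedHandler() {
        if (state != State.STARTED) {
            return;
        }
        if (!znodes.contains(ourNode)) {
            reset();
        }
    }

    int nodeCount() {
        return znodes.size();
    }

    // Replays the problematic ordering and reports how many nodes
    // remain after the latch has been closed.
    static int leakedNodes(boolean guarded) {
        LatchRaceSketch latch = new LatchRaceSketch();
        if (guarded) {
            latch.close();           // close() wins the lock first
            latch.guardedHandler();  // handler then sees CLOSED and does nothing
        } else {
            boolean started = latch.unsafeCheck(); // step 3: state is STARTED
            latch.close();                         // step 4: close() runs mid-handler
            if (started) {
                latch.unsafeAct();                 // step 5: reset() adds a new node
            }
        }
        return latch.nodeCount();
    }

    public static void main(String[] args) {
        System.out.println("unguarded leaked nodes: " + leakedNodes(false));
        System.out.println("guarded leaked nodes: " + leakedNodes(true));
    }
}
```

In the unguarded ordering one node is left behind after close(), which corresponds to the orphaned node described in the result above. In the guarded ordering none is left, because the handler can only observe the state before or after close() as a whole.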



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)