Posted to dev@curator.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2019/01/28 19:33:00 UTC

[jira] [Commented] (CURATOR-498) Protected Mode creation can mistake closing session's node causing problems for many recipes such as LeaderLatch

    [ https://issues.apache.org/jira/browse/CURATOR-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754289#comment-16754289 ] 

ASF GitHub Bot commented on CURATOR-498:
----------------------------------------

GitHub user Randgalt opened a pull request:

    https://github.com/apache/curator/pull/303

    [CURATOR-498] Protected Mode creation can mistake closing session's node causing problems for many recipes such as LeaderLatch

    Kudos to user Shay Shimony for his tireless and excellent work tracking this down. There are two problems addressed here: 1) Protected create mode can potentially find a ZNode that is about to be deleted due to an expired session. CreateBuilderImpl now keeps track of the session ID when the create is initiated. If after a connection loss the session ID has changed, any found protected node is ignored as it will soon be deleted. 2) For ZooKeeper 3.4.x the simulated (via reflection) InjectSessionExpiration was incorrectly setting the connection state to closed which caused the session expiration to be ignored.
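
The session ID check described in point 1 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in CreateBuilderImpl: the class name ProtectedCreateSketch, the method findUsableProtectedNode, and the protected-ID strings are all hypothetical, and the child list and session IDs are passed in as plain values rather than read from a live ZooKeeper client.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch only; the names below are NOT taken from Curator's source.
final class ProtectedCreateSketch {
    // Session ID recorded at the moment the create was first issued.
    private final long sessionIdAtCreate;
    // Unique marker embedded in the node name by protected mode (format is illustrative).
    private final String protectedId;

    ProtectedCreateSketch(long sessionIdAtCreate, String protectedId) {
        this.sessionIdAtCreate = sessionIdAtCreate;
        this.protectedId = protectedId;
    }

    /**
     * After a connection loss, look for a node left behind by our earlier create.
     * If the session changed in the meantime, any matching node belongs to the
     * expired session and will soon be deleted, so it is ignored.
     */
    Optional<String> findUsableProtectedNode(List<String> children, long currentSessionId) {
        if (currentSessionId != sessionIdAtCreate) {
            return Optional.empty(); // node, if present, is owned by the old session
        }
        return children.stream()
                .filter(name -> name.contains(protectedId))
                .findFirst();
    }
}
```

With this shape, a caller that gets an empty result simply issues a fresh create under the new session instead of adopting the leftover node.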

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/curator CURATOR-498

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/curator/pull/303.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #303
    
----
commit ea505f54291dc548aca947503630960cd10225d0
Author: randgalt <ra...@...>
Date:   2019-01-28T19:23:15Z

    CURATOR-498
    
    Kudos to user Shay Shimony for his tireless and excellent work tracking this down. There are two problems addressed here: 1) Protected create mode can potentially find a ZNode that is about to be deleted due to an expired session. CreateBuilderImpl now keeps track of the session ID when the create is initiated. If after a connection loss the session ID has changed, any found protected node is ignored as it will soon be deleted. 2) For ZooKeeper 3.4.x the simulated (via reflection) InjectSessionExpiration was incorrectly setting the connection state to closed which caused the session expiration to be ignored.

----


> Protected Mode creation can mistake closing session's node causing problems for many recipes such as LeaderLatch
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: CURATOR-498
>                 URL: https://issues.apache.org/jira/browse/CURATOR-498
>             Project: Apache Curator
>          Issue Type: Bug
>    Affects Versions: 4.0.1, 4.1.0
>         Environment: ZooKeeper 3.4.13, Curator 4.1.0 (selecting explicitly 3.4.13), Linux
>            Reporter: Shay Shimony
>            Assignee: Jordan Zimmerman
>            Priority: Blocker
>         Attachments: CURATOR-498.png, HaWatcher.log, LeaderLatch0.java, ha.tar.gz, logs.tar.gz, reproduction.tar.gz, reproduction2.tar.gz
>
>
> The Curator app I am working on uses the LeaderLatch to select a leader out of 6 clients.
> While testing my app, I noticed that when I make ZK lose its quorum for a while and then restore it, after Curator in my app restores its connection to ZK, sometimes not all 6 clients are found in the latch path (checked using zkCli.sh). That is, I see 5 instead of 6.
> After investigating a little, I suspect that LeaderLatch deleted the leader's node in the setNode method.
> To investigate, I copied the LeaderLatch code and added some log messages. From them it seems that a very old create() background callback was unexpectedly scheduled and overwrote the current leader's path with its stale path name. That is, the old callback called setNode with its stale name, set itself in place of the leader, and deleted the leader's node. This leaves the client running, thinking it is the leader, while another leader is selected.
> If my analysis is correct, then this obsolete create callback needs to be cancelled (I think its session was suspended at 22:38:54 and then lost at 22:39:04, so ongoing callbacks should be cancelled on SUSPENDED).
> Please see the attached log file and the modified LeaderLatch0.
>  
> In the log, note that at 22:39:26 it shows that 0000000485 is replaced by 0000000480 and then probably deleted.
> Note also that at 22:38:52, 34 seconds earlier, we can see that it was in the reset() method ("RESET OUR PATH"), which possibly triggered the creation of 0000000480 at that point.
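
The reporter's suggestion of cancelling obsolete create callbacks could be sketched as a generation counter, shown below. This illustrates the idea raised in the report only; it is not how LeaderLatch is implemented and not the fix applied in the pull request, which tracks the session ID instead. The class StaleCallbackGuard and its methods are hypothetical names.

```java
// Hypothetical sketch of discarding stale background-create callbacks.
final class StaleCallbackGuard {
    private long generation = 0;
    private String ourPath = null;

    /**
     * Call when a new create is issued (for example from reset()) or when the
     * connection is suspended or lost. Callbacks issued under earlier
     * generations become stale.
     */
    synchronized long nextGeneration() {
        return ++generation;
    }

    /**
     * Background create callback. The path is applied only if the callback was
     * issued under the current generation; otherwise it is dropped so that it
     * cannot overwrite a newer node's path.
     */
    synchronized boolean onCreateCompleted(long issuedGeneration, String createdPath) {
        if (issuedGeneration != generation) {
            return false; // stale callback from an earlier attempt
        }
        ourPath = createdPath;
        return true;
    }

    synchronized String ourPath() {
        return ourPath;
    }
}
```

In the scenario from the log, the create that produced 0000000480 would carry an older generation than the one that produced 0000000485, so its late callback would be rejected rather than replacing the current path.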



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)