Posted to dev@curator.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/07/13 02:23:00 UTC

[jira] [Work logged] (CURATOR-644) CLONE - Race conditions in LeaderLatch after reconnecting to ensemble

     [ https://issues.apache.org/jira/browse/CURATOR-644?focusedWorklogId=790231&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-790231 ]

ASF GitHub Bot logged work on CURATOR-644:
------------------------------------------

                Author: ASF GitHub Bot
            Created on: 13/Jul/22 02:22
            Start Date: 13/Jul/22 02:22
    Worklog Time Spent: 10m 
      Work Description: tisonkun opened a new pull request, #430:
URL: https://github.com/apache/curator/pull/430

   See also:
   
   * https://issues.apache.org/jira/browse/CURATOR-644
   * https://issues.apache.org/jira/browse/CURATOR-645
   
   ## Livelock in detail
   
   Here we have two race conditions that can cause a livelock:
   
   Case 1. Suppose there are two participants, p0 and p1:
   
   T0. p1 is about to watch the preceding node, which belongs to p0.
   T1. p0 gets reconnected, and thus resets its node and creates a new node, preparing to watch p1's node.
   T2. p1 finds the preceding node has gone, and resets itself.
   
   At this point, p0 and p1 can end up in a livelock where they never see each other's node and reset themselves indefinitely. This is the case reported by CURATOR-645.
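   
   To make the mechanism concrete, here is a heavily simplified, illustrative sketch (not the actual LeaderLatch source; the class and helper names are placeholders) of the "watch my predecessor" step each participant runs. The asymmetry between the background callback (which resets) and the watcher (which only re-reads the children) is what turns T2 into a full reset:
   
   ```java
   import org.apache.curator.framework.CuratorFramework;
   import org.apache.curator.framework.api.BackgroundCallback;
   import org.apache.zookeeper.KeeperException;
   import org.apache.zookeeper.Watcher;
   
   // Illustrative sketch only: keeps just the parts needed to see why a missing
   // predecessor makes the callback path reset while the watcher path does not.
   class PredecessorWatchSketch
   {
       private final CuratorFramework client;
   
       PredecessorWatchSketch(CuratorFramework client)
       {
           this.client = client;
       }
   
       void watchPredecessor(String predecessorPath) throws Exception
       {
           BackgroundCallback callback = (c, event) -> {
               if ( event.getResultCode() == KeeperException.Code.NONODE.intValue() )
               {
                   reset();        // callback path: delete and recreate our own node
               }
           };
           Watcher watcher = watchedEvent -> {
               if ( watchedEvent.getType() == Watcher.Event.EventType.NodeDeleted )
               {
                   getChildren();  // watcher path: only re-read the latch children
               }
           };
           client.getData().usingWatcher(watcher).inBackground(callback).forPath(predecessorPath);
       }
   
       private void reset()       { /* placeholder: delete our node, create a new one */ }
       private void getChildren() { /* placeholder: re-list children, re-check leadership */ }
   }
   ```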
   
   Case 2. A similar livelock can happen even if there is only one participant p:
   
   T0. In thread 0 (th0), p enters `checkLeadership` but has not yet read `ourPath.get()`.
   T1. In thread 1 (th1), p gets reconnected and calls `reset`; now `ourPath.get() == null`.
   T2. th0 reads `ourPath.get() == null` and is about to `reset`.
   T3. th1 creates its new node and prepares to read `ourPath.get()`.
   T4. th0 calls `reset()`.
   
   At this point, two threads inside the same participant are competing with each other, resulting in a livelock. This is the case reported by CURATOR-644.
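   
   A toy sketch of this interleaving, assuming the node path lives in an `AtomicReference` (names and structure are illustrative, not the real implementation):
   
   ```java
   import java.util.concurrent.atomic.AtomicReference;
   
   // Toy model of Case 2: two methods standing in for the checkLeadership path (th0)
   // and the reconnect-handling path (th1). All names are illustrative.
   class OurPathRaceSketch
   {
       private final AtomicReference<String> ourPath = new AtomicReference<>();
   
       // th0: runs checkLeadership-like logic
       void checkLeadershipSketch()
       {
           String localOurPath = ourPath.get();  // T2: observes null because th1 reset at T1
           if ( localOurPath == null )
           {
               resetSketch();                    // T4: throws away the node th1 created at T3
           }
           // ... otherwise locate localOurPath among the sorted children ...
       }
   
       // th1: runs on reconnection
       void onReconnectedSketch()
       {
           resetSketch();                        // T1: ourPath becomes null, old node deleted
           // T3: the create callback eventually sets ourPath to the new node,
           // but th0 has already seen the null and will reset again.
       }
   
       private void resetSketch()
       {
           ourPath.set(null);
           // placeholder: delete the old node, create a new one, and set ourPath
           // from the async create callback once it completes
       }
   }
   ```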
   
   ## Solution
   
   I make two significant changes to resolve these livelock cases:
   
   1. Call `getChildren` instead of `reset` when the preceding node is not found in the callback. This was previously reported in https://github.com/apache/curator/commit/ff4ec29f5958cc9162f0302c02f4ec50c0e796cd#r31770630. I don't see a reason to behave differently between the callback and the watcher for the same condition, and concurrent `reset`s are what trigger these livelocks (see the sketch after this list).
   2. Call `getChildren` instead of `reset` when recovering from a connection loss. The reasoning is similar to 1: if a connection loss or session expiry really caused our node to be deleted, `checkLeadership` will see that condition and call `reset` then.
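   
   In terms of the earlier sketches (same illustrative names, not the verbatim patch in this PR), the two changes take roughly this shape:
   
   ```java
   import org.apache.curator.framework.state.ConnectionState;
   
   // Illustrative shape of the two changes; see the actual diff in this PR for details.
   class ProposedChangeSketch
   {
       // Change 1: the background callback that sees NONODE for the predecessor
       // re-reads the children, mirroring the watcher, instead of resetting.
       void onPredecessorMissingInCallback()
       {
           getChildren();  // previously reset(), which fed the mutual-reset loop
       }
   
       // Change 2: recovering from a connection loss also re-reads the children first;
       // if our node was really lost, checkLeadership will notice and reset then.
       void onConnectionStateChanged(ConnectionState newState)
       {
           if ( newState == ConnectionState.RECONNECTED )
           {
               getChildren();  // previously reset()
           }
       }
   
       private void getChildren() { /* placeholder: re-list children, re-check leadership */ }
   }
   ```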
   
   These changes should fix CURATOR-645 and mitigate the case in CURATOR-644. However, as long as it is possible to generate concurrent `checkLeadership` calls, a participant can still race with itself. I once considered adding a `checkLeadershipLock` here, but since all client requests are handled in callbacks, such a lock would protect little.
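   
   To illustrate why such a lock would help little (again with made-up names): the node is created in an async callback that runs after the synchronized block is released, so the window in which `ourPath` is null is not actually covered by the lock:
   
   ```java
   import java.util.concurrent.atomic.AtomicReference;
   
   import org.apache.curator.framework.CuratorFramework;
   import org.apache.zookeeper.CreateMode;
   
   // Illustrative only: even with a lock around the reset, the node is created in a
   // background callback that runs after the lock is released, so another thread can
   // still observe ourPath == null in between.
   class CheckLeadershipLockSketch
   {
       private final Object checkLeadershipLock = new Object();
       private final AtomicReference<String> ourPath = new AtomicReference<>();
       private final CuratorFramework client;
       private final String latchPath;
   
       CheckLeadershipLockSketch(CuratorFramework client, String latchPath)
       {
           this.client = client;
           this.latchPath = latchPath;
       }
   
       void resetUnderLock() throws Exception
       {
           synchronized ( checkLeadershipLock )
           {
               ourPath.set(null);
               client.create()
                     .withMode(CreateMode.EPHEMERAL_SEQUENTIAL)
                     .inBackground((c, event) -> ourPath.set(event.getName()))  // fires later, outside the lock
                     .forPath(latchPath + "/latch-");
           }
           // Between releasing the lock and the callback setting ourPath again, a
           // concurrent checkLeadership can see null and decide to reset once more.
       }
   }
   ```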
   
   I'm trying to add test cases, and such changes need more eyes. Also, if you have an idea to fix the race condition between multiple threads within one participant, please comment.




Issue Time Tracking
-------------------

            Worklog Id:     (was: 790231)
    Remaining Estimate: 0h
            Time Spent: 10m

> CLONE - Race conditions in LeaderLatch after reconnecting to ensemble
> ---------------------------------------------------------------------
>
>                 Key: CURATOR-644
>                 URL: https://issues.apache.org/jira/browse/CURATOR-644
>             Project: Apache Curator
>          Issue Type: Bug
>    Affects Versions: 4.2.0
>            Reporter: Ken Huang
>            Assignee: Jordan Zimmerman
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Clone from CURATOR-504.
> We use LeaderLatch in a lot of places in our system, and when the ZooKeeper ensemble is unstable and clients are reconnecting, logs are full of messages like the following:
> {{[2017-08-31 19:18:34,562][ERROR][org.apache.curator.framework.recipes.leader.LeaderLatch] Can't find our node. Resetting. Index: -1}}
> According to the [implementation|https://github.com/apache/curator/blob/4251fe328908e5fca37af034fabc190aa452c73f/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L529-L536], this can happen in two cases:
>  * When internal state `ourPath` is null
>  * When the list of latches does not have the expected one.
> I believe we hit the first condition because of races that occur after the client reconnects to ZooKeeper.
>  * The client reconnects to ZooKeeper, LeaderLatch gets the event and calls the reset method, which sets the internal state (`ourPath`) to null, removes the old latch node, and creates a new one. This happens in the thread "Curator-ConnectionStateManager-0".
>  * Almost simultaneously, LeaderLatch gets another event, NodeDeleted ([here|https://github.com/apache/curator/blob/4251fe328908e5fca37af034fabc190aa452c73f/curator-recipes/src/main/java/org/apache/curator/framework/recipes/leader/LeaderLatch.java#L543-L554]), and tries to re-read the list of latches and check leadership. This happens in the thread "main-EventThread".
> Therefore, sometimes the method `checkLeadership` is called while `ourPath` is null.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)