Posted to dev@curator.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2020/04/01 01:26:00 UTC

[jira] [Work logged] (CURATOR-525) There is a race condition in Curator which might lead to fake SUSPENDED event and ruin CuratorFrameworkImpl inner state

     [ https://issues.apache.org/jira/browse/CURATOR-525?focusedWorklogId=413659&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-413659 ]

ASF GitHub Bot logged work on CURATOR-525:
------------------------------------------

                Author: ASF GitHub Bot
            Created on: 01/Apr/20 01:25
            Start Date: 01/Apr/20 01:25
    Worklog Time Spent: 10m 
      Work Description: Randgalt commented on pull request #357: [CURATOR-525] Fix for Zombie LOST Curator
URL: https://github.com/apache/curator/pull/357
 
 
   There is a race whereby the ZooKeeper connection can be healed before Curator is finished processing the new connection state. When this happens
   the Curator instance becomes a Zombie stuck in the LOST state. This fix is a "hack". ConnectionStateManager will notice that the connection state is
   LOST but that the Curator instance reports that it is connected. When this happens, it is logged and the connection is reset.
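
   The check described above can be sketched roughly as follows. This is only an illustrative model, not the code from the pull request: the class ZombieLostCheck, its State enum, and the two callbacks (one standing in for "does the client report that it is connected", one for "reset the connection") are hypothetical names introduced for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

// Hypothetical stand-in for the check described above; NOT Curator's actual code.
final class ZombieLostCheck {
    enum State { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    private final BooleanSupplier clientReportsConnected; // assumed hook: asks the client whether it is connected
    private final Runnable resetConnection;               // assumed hook: resets the ZooKeeper connection
    final List<String> log = new ArrayList<>();           // stands in for a real logger

    ZombieLostCheck(BooleanSupplier clientReportsConnected, Runnable resetConnection) {
        this.clientReportsConnected = clientReportsConnected;
        this.resetConnection = resetConnection;
    }

    // Returns true when the zombie condition was detected and a reset was triggered.
    boolean check(State currentState) {
        if (currentState == State.LOST && clientReportsConnected.getAsBoolean()) {
            log.add("State is LOST but the client reports connected; resetting connection");
            resetConnection.run();
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        int[] resets = {0};
        ZombieLostCheck check = new ZombieLostCheck(() -> true, () -> resets[0]++);
        System.out.println(check.check(State.LOST));      // zombie detected, reset triggered
        System.out.println(check.check(State.CONNECTED)); // healthy state, nothing to do
        System.out.println(resets[0]);                    // number of resets performed
    }
}
```

   In this model the reset happens only when both conditions hold, so a genuinely lost connection (state LOST, client not connected) is left to the normal retry path.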
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

            Worklog Id:     (was: 413659)
    Remaining Estimate: 0h
            Time Spent: 10m

> There is a race condition in Curator which might lead to fake SUSPENDED event and ruin CuratorFrameworkImpl inner state 
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CURATOR-525
>                 URL: https://issues.apache.org/jira/browse/CURATOR-525
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Framework
>    Affects Versions: 4.2.0
>            Reporter: Mikhail Valiev
>            Assignee: Jordan Zimmerman
>            Priority: Critical
>             Fix For: 5.0.0
>
>         Attachments: CuratorFrameworkTest.java, background-thread-infinite-loop.png, curator-race-condition.png, event-watcher-thread.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> This was originally found in the 2.11.1 version of Curator, but I tested the latest release as well, and the issue is still there.
> The issue is tied to guaranteed deletes, which loop infinitely if called when there is no connection:
> client.delete().guaranteed().forPath(ourPath); 
> [https://curator.apache.org/apidocs/org/apache/curator/framework/api/GuaranteeableDeletable.html]
> This schedules a background operation that attempts to remove the node in an infinite loop. Each time a background operation fails due to connection loss, it performs a check (the validateConnection() function) to see whether the main thread is already aware of the connection loss and, if it is not, raises the connection loss event. The problem is that this piece of code is also executed by the event watcher thread when connection events occur, which leads to a race condition. When the connection is restored, the main thread can easily raise a RECONNECTED event and the background thread can then raise a SUSPENDED event after it.
> We might get unlucky and get a "phantom" SUSPENDED event. It breaks Curator's internal connection state and leads to Curator behaving unpredictably.
> Attached are some illustrations and a unit test to reproduce the issue. (Put a breakpoint in validateConnection().)
> *Possible solution*: in the CuratorFrameworkImpl class, adjust the processEvent() function and add the following:
>     if (event.getType() == CuratorEventType.SYNC) {
>         connectionStateManager.addStateChange(ConnectionState.RECONNECTED);
>     }
> If this is the same state as before, it will be ignored. If a background operation succeeded but we are in the SUSPENDED state, this would repair the Curator state and raise a RECONNECTED event.
>  
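
The event ordering and the proposed repair from the issue description can be illustrated with a small standalone model. This is a sketch under simplifying assumptions, not Curator's implementation: PhantomSuspendedSketch, its StateManager, and onBackgroundSyncSucceeded() are hypothetical names, and the only behavior modeled is "a state change equal to the current state is ignored".

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, hypothetical model of the scenario in the issue; NOT Curator's actual classes.
final class PhantomSuspendedSketch {
    enum State { CONNECTED, SUSPENDED, RECONNECTED, LOST }

    static final class StateManager {
        private State current = State.CONNECTED;
        final List<State> raised = new ArrayList<>(); // events that listeners would see

        // Returns false (and raises nothing) when the new state equals the current one.
        synchronized boolean addStateChange(State next) {
            if (next == current) {
                return false;
            }
            current = next;
            raised.add(next);
            return true;
        }

        synchronized State current() {
            return current;
        }
    }

    // Models the proposed hook: a successful SYNC background event reports RECONNECTED.
    static boolean onBackgroundSyncSucceeded(StateManager manager) {
        return manager.addStateChange(State.RECONNECTED);
    }

    public static void main(String[] args) {
        StateManager manager = new StateManager();
        manager.addStateChange(State.SUSPENDED);   // real connection loss
        manager.addStateChange(State.RECONNECTED); // connection restored
        manager.addStateChange(State.SUSPENDED);   // late "phantom" event from the background thread
        onBackgroundSyncSucceeded(manager);        // repair: state returns to RECONNECTED
        onBackgroundSyncSucceeded(manager);        // same state as before, ignored
        System.out.println(manager.raised);
        System.out.println(manager.current());
    }
}
```

Because the model treats CONNECTED and RECONNECTED as distinct states, a SYNC arriving while the state is CONNECTED would raise RECONNECTED here; whether that matters in the real client is outside what this sketch shows.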



--
This message was sent by Atlassian Jira
(v8.3.4#803005)