You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@curator.apache.org by "Jordan Zimmerman (JIRA)" <ji...@apache.org> on 2019/06/06 13:31:00 UTC

[jira] [Commented] (CURATOR-525) There is a race condition in Curator which might lead to fake SUSPENDED event and ruin CuratorFrameworkImpl inner state

    [ https://issues.apache.org/jira/browse/CURATOR-525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16857647#comment-16857647 ] 

Jordan Zimmerman commented on CURATOR-525:
------------------------------------------

I created a test to simulate the problem. Here are the changes I made:

# Added a few latches to checkBackgroundRetry() in CuratorFrameworkImpl.java. E.g. https://gist.github.com/Randgalt/a5ab016c5dcab8bac9cee223932548c2
# Wrote a test that takes advantage of these: https://gist.github.com/Randgalt/0e070e073b0aff07e9b1b19a5a025bf4
** The test creates the condition described in this issue. It stops the server and then creates a node via async/background
** Because the server is stopped, and the debug latches are set, the code will end up blocking just before validateConnection() is called
** The server is then restarted and the test waits for the connection state to return to "RECONNECTED"
** The debug latch is counted down so that validateConnection() gets called

As described in this issue, the connection state goes back to SUSPENDED. HOWEVER - the connection will repair itself after a period of time. Note that I commented out the final assert in the test so that it just keeps outputting the connection state received. If you run the test you will see this. 

I believe there is no bug here. Curator is doing the right thing. Curator version 2.x probably did the wrong thing but since 3.x has the enhanced handling of forcing session expirations, etc. Of course, this is very noisy when this happens but there's not much more that Curator can do - ZooKeeper _is_ very noisy when there are network partitions. Note: the "Circuit Breaking ConnectionStateListener" (see http://curator.apache.org/utilities.html) was recently added that might help in this.

Let me know what you think but I believe this issue can be closed as "As Designed/Works Properly".

> There is a race condition in Curator which might lead to fake SUSPENDED event and ruin CuratorFrameworkImpl inner state 
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CURATOR-525
>                 URL: https://issues.apache.org/jira/browse/CURATOR-525
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Framework
>    Affects Versions: 4.2.0
>            Reporter: Mikhail Valiev
>            Assignee: Cameron McKenzie
>            Priority: Critical
>         Attachments: CuratorFrameworkTest.java, background-thread-infinite-loop.png, curator-race-condition.png, event-watcher-thread.png
>
>
> This was originally found in the 2.11.1 version of Curator, but I tested the latest release as well, and the issue is still there.
> The issue is tied to guaranteed deletes and how it loops infinitely, if called when there is no connection:
> client.delete().guaranteed().forPath(ourPath); 
> [https://curator.apache.org/apidocs/org/apache/curator/framework/api/GuaranteeableDeletable.html]
> This schedules a background operation which attempts to remove the node in infinite loop. Each time a background operation fails due to connection loss it performs a check (validateConnection() function) to see if the main thread is already aware of connection loss, and if it's not - raises the connection loss event. The problem is that this peace of code is also executed by the event watcher thread when connection events are happening - this leads to race condition. So when connection is restored it's easily possible for the main thread to raise RECONNECTED event and after that for background thread to raise SUSPENDED event.
> We might get unlucky and get a "phantom" SUSPENDED event. It breaks Curator inner Connection state and leads to curator behaving unpredictably
> Attached some illustrations and Unit test to reproduce the issue. (Put debug point in validateConnection() )
> *Possible solution*: in CuratorFrameworkImpl class adjust the processEvent() function and add the following:
> if(event.getType() == CuratorEventType.SYNC) {
> connectionStateManager.addStateChange(ConnectionState.RECONNECTED);
> }
> If this is a same state as before - it will be ignored, if background operation succeeded, but we are in SUSPENDED state - this would repair the Curator state and raise RECONNECTED event.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)