You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@curator.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/12/14 01:39:00 UTC
[jira] [Commented] (CURATOR-495) Race and possible dead locks with RetryPolicies and several Curator Recipes

    [ https://issues.apache.org/jira/browse/CURATOR-495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16720824#comment-16720824 ] 

ASF GitHub Bot commented on CURATOR-495:
----------------------------------------

GitHub user Randgalt opened a pull request:

    https://github.com/apache/curator/pull/297

    [CURATOR-495] Fixes race in many Curator recipes which could block ZK's event thread

    Fixes race in many Curator recipes whereby a pattern was used that called "notifyAll" in a synchronized block in response to a ZooKeeper watcher callback. This created a race and possible deadlock if the recipe instance was already in a synchronized block. This would result in the ZK event thread getting blocked which would prevent ZK connections from getting repaired. This change adds a new executor (available from CuratorFramework) that can be used to do the sync/notify so that ZK's event thread is not blocked.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/curator CURATOR-495

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/curator/pull/297.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #297
    
----
commit f91adb22e83d8e47b99ad98f8c13c86251bc4cb3
Author: randgalt <ra...@...>
Date:   2018-12-14T01:34:40Z

    CURATOR-495
    
    Fixes race in many Curator recipes whereby a pattern was used that called "notifyAll" in a synchronized block in response to a ZooKeeper watcher callback. This created a race and possible deadlock if the recipe instance was already in a synchronized block. This would result in the ZK event thread getting blocked which would prevent ZK connections from getting repaired. This change adds a new executor (available from CuratorFramework) that can be used to do the sync/notify so that ZK's event thread is not blocked.

----


> Race and possible dead locks with RetryPolicies and several Curator Recipes
> ---------------------------------------------------------------------------
>
>                 Key: CURATOR-495
>                 URL: https://issues.apache.org/jira/browse/CURATOR-495
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Recipes
>    Affects Versions: 4.0.1
>            Reporter: Jordan Zimmerman
>            Assignee: Jordan Zimmerman
>            Priority: Blocker
>             Fix For: 4.1.0
>
>
> In trying to figure out why {{TestInterProcessSemaphoreMutex}} is so flakey I've come across a fairly serious edge case in how several of our recipes work. You can see the issue in {{InterProcessSemaphoreV2}} (which is what {{InterProcessSemaphoreMutex}} uses internally). Look here:
> [InterProcessSemaphoreV2.java|https://github.com/apache/curator/blob/master/curator-recipes/src/main/java/org/apache/curator/framework/recipes/locks/InterProcessSemaphoreV2.java#L373]
> The code synchronizes and then does {{client.getChildren()...}}. This is where the problem is. If there are connection problems inside of getChildren() the retry policy will do configured sleeping, retries, etc. Importantly, this is all done while the thread doing the retries holds InterProcessSemaphoreV2's monitor. If the ZK connection is repaired past the session timeout, ZK will eventually call InterProcessSemaphoreV2's watcher with an Expired message. InterProcessSemaphoreV2's watcher calls this method:
> {code}
> private synchronized void notifyFromWatcher()
> {
>     notifyAll();
> }
> {code}
> You can see that this is a race. The thread doing "getChildren" is holding the monitor and is in a retry loop waiting for the connection to be repaired. However, ZK's event loop is trying to obtain that same monitor as a result of trying to call the synchronized notifyFromWatcher(). This means that the retry policy will always fail because ZK's event loop is tied up until that thread exists. Worse still, if someone were to use a retry policy of "RetryForever" they'd have a deadlock.
> This pattern is in about 10 files or so. I'm trying to think of a workaround. One possibility is to use a separate thread for this type of notification. i.e. notifyFromWatcher() would just signal another thread that the notifyAll() needs to be called. This would unblock ZK's event thread so that things can progress. I'll play around with this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)