
Posted to dev@curator.apache.org by "Stephen Ingram (JIRA)" <ji...@apache.org> on 2015/04/08 20:29:12 UTC

[jira] [Created] (CURATOR-205) Repeated InterruptedExceptions during mutex acquire leads to LeaderSelector deadlock

Stephen Ingram created CURATOR-205:
--------------------------------------

             Summary: Repeated InterruptedExceptions during mutex acquire leads to LeaderSelector deadlock
                 Key: CURATOR-205
                 URL: https://issues.apache.org/jira/browse/CURATOR-205
             Project: Apache Curator
          Issue Type: Bug
          Components: Recipes
    Affects Versions: 2.7.2
            Reporter: Stephen Ingram


When an InterruptedException is thrown inside internalLockLoop, which is called from mutex.acquire, internalLockLoop sets a flag, "doDelete". That flag tells the finally clause to delete the lock path that we were trying to create.

However, in the pathInForeground function of DeleteBuilderImpl, a _second_ InterruptedException may occur before ZooKeeper can delete the specified path. The RetryLoop machinery in that function retries only on retryable exceptions, a class that does not include InterruptedException.

The second InterruptedException then causes pathInForeground to exit without deleting the path, leading to a deadlock in which no one can acquire the mutex.
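
The sequence above can be reproduced in miniature without ZooKeeper. The following is only a sketch in plain Java: the class, the in-memory "nodes" set, and the deleteInForeground and acquire helpers are hypothetical stand-ins, not Curator code. It assumes the cleanup delete gives up as soon as it sees the caller's interrupt status, which is the behaviour described for pathInForeground.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative simulation only: none of these names are Curator APIs.
final class LeakedLockNodeDemo {
    // Stand-in for the ZooKeeper node tree.
    static final Set<String> nodes = ConcurrentHashMap.newKeySet();

    // Stand-in for a foreground delete that gives up when the caller is interrupted.
    static void deleteInForeground(String path) throws InterruptedException {
        if (Thread.interrupted()) {
            throw new InterruptedException("interrupted before deleting " + path);
        }
        nodes.remove(path);
    }

    // Same shape as the lock loop described above: flag set on failure, delete in finally.
    static void acquire(String path, boolean interruptDuringCleanup) throws InterruptedException {
        nodes.add(path);
        boolean doDelete = false;
        try {
            // First interrupt: simulated while waiting for the lock.
            throw new InterruptedException("interrupted while waiting for the lock");
        } catch (InterruptedException e) {
            doDelete = true;
            throw e;
        } finally {
            if (doDelete) {
                if (interruptDuringCleanup) {
                    Thread.currentThread().interrupt(); // second interrupt arrives
                }
                deleteInForeground(path); // may throw and leave the node behind
            }
        }
    }

    // Returns true if the lock node is still present after a failed acquire.
    static boolean leaks(boolean interruptDuringCleanup) {
        String path = "/leader/lock-0000000001";
        nodes.clear();
        try {
            acquire(path, interruptDuringCleanup);
        } catch (InterruptedException expected) {
            // Both scenarios end with an InterruptedException reaching the caller.
        }
        Thread.interrupted(); // clear any leftover interrupt status
        return nodes.contains(path);
    }

    public static void main(String[] args) {
        System.out.println("single interrupt leaks node: " + leaks(false));
        System.out.println("double interrupt leaks node: " + leaks(true));
    }
}
```

With a single interrupt the finally clause removes the node. With a second interrupt during cleanup the node survives, which in the real recipe would block every later contender.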

In my test, I am certain that both of these InterruptedExceptions are caused by repeated fluctuation of the ConnectionStateManager's connection state. Once the state stops fluctuating, no leader can be selected because the node we failed to delete still exists.

I was able to address this bug with a solution similar to CURATOR-45: if pathInForeground is interrupted by an InterruptedException, I schedule a BackgroundCallback to attempt pathInForeground again. Once the connection is stable, this task is able to delete the path, and the mutex is then acquired by the new leader.
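
The general shape of that workaround can be sketched as follows. This is not the actual patch and does not use Curator's BackgroundCallback. It is a plain-Java approximation in which a ScheduledExecutorService plays the role of the background retry, and every name in it (the class, deleteInForeground, deleteOrRetryInBackground, the 10 ms delay) is a hypothetical choice for illustration.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative simulation only: none of these names are Curator APIs.
final class BackgroundDeleteRetryDemo {
    // Stand-in for the ZooKeeper node tree.
    static final Set<String> nodes = ConcurrentHashMap.newKeySet();

    // Stand-in for a foreground delete that gives up when the caller is interrupted.
    static void deleteInForeground(String path) throws InterruptedException {
        if (Thread.interrupted()) {
            throw new InterruptedException("interrupted before deleting " + path);
        }
        nodes.remove(path);
    }

    // Try the delete now; if interrupted, hand it to a background thread
    // that does not carry the caller's interrupt status.
    static void deleteOrRetryInBackground(String path, ScheduledExecutorService retries) {
        try {
            deleteInForeground(path);
        } catch (InterruptedException e) {
            retries.schedule(() -> deleteOrRetryInBackground(path, retries),
                    10, TimeUnit.MILLISECONDS);
        }
    }

    // Returns true if the node is gone once the background retry has run.
    static boolean deletedEventually() {
        String path = "/leader/lock-0000000001";
        nodes.clear();
        nodes.add(path);
        ScheduledExecutorService retries = Executors.newSingleThreadScheduledExecutor();

        Thread.currentThread().interrupt(); // simulate the second interrupt
        deleteOrRetryInBackground(path, retries);

        // Delayed tasks that are already scheduled still run after shutdown() by default.
        retries.shutdown();
        try {
            retries.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !nodes.contains(path);
    }

    public static void main(String[] args) {
        System.out.println("node deleted by background retry: " + deletedEventually());
    }
}
```

The point of the design is that the retry runs on a thread whose interrupt status is independent of the thread that was interrupted, so the delete gets another chance after the connection settles. A real fix would also need to decide whether to restore the caller's interrupt status and how long to keep retrying.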


