You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@zookeeper.apache.org by hanm <gi...@git.apache.org> on 2016/10/20 23:44:31 UTC

[GitHub] zookeeper pull request #92: ZOOKEEPER-2080: Fix deadlock in dynamic reconfig...

GitHub user hanm opened a pull request:

    https://github.com/apache/zookeeper/pull/92

    ZOOKEEPER-2080: Fix deadlock in dynamic reconfiguration.

    Use explicit fine grained locks for synchronizing access to QuorumVerifier states in QuorumPeer.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hanm/zookeeper ZOOKEEPER-2080

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/zookeeper/pull/92.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #92
    
----
commit cfb4d73a94f23c6e047f75d400372caa495f197f
Author: Michael Han <ha...@cloudera.com>
Date:   2016-10-20T22:37:22Z

    Use explicit fine grained lock for maintaining a consistent global view of QuorumVerifier in QuorumPeer.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] zookeeper issue #92: ZOOKEEPER-2080: Fix deadlock in dynamic reconfiguration...

Posted by hanm <gi...@git.apache.org>.
Github user hanm commented on the issue:

    https://github.com/apache/zookeeper/pull/92
  
    Hey @shralex, want to close the loop here. Does the latest pull request look good to you? My stress tests on this branch has been green for past two weeks.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] zookeeper pull request #92: ZOOKEEPER-2080: Fix deadlock in dynamic reconfig...

Posted by eribeiro <gi...@git.apache.org>.
Github user eribeiro commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/92#discussion_r96929565
  
    --- Diff: src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java ---
    @@ -434,6 +434,10 @@ public int getQuorumSize(){
         //last proposed quorum verifier
         public QuorumVerifier lastSeenQuorumVerifier = null;
     
    +    // Lock object that guard access to quorumVerifier and lastSeenQuorumVerifier.
    +    private byte[] qvLock = new byte[0];
    --- End diff --
    
    It should be *final* if you want to use it as a lock object. Also, as constant it would be renamed as *QV_LOCK*.
    
    nit: Also, we usually use a `Object`:
    ```
    private final Object LOCK = new Object();
    ```
    or a `ReentrantReadWriteLock` (more verbose tough, with try-finally, etc).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] zookeeper pull request #92: ZOOKEEPER-2080: Fix deadlock in dynamic reconfig...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/zookeeper/pull/92


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] zookeeper pull request #92: ZOOKEEPER-2080: Fix deadlock in dynamic reconfig...

Posted by hanm <gi...@git.apache.org>.
Github user hanm commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/92#discussion_r97451281
  
    --- Diff: src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java ---
    @@ -468,31 +469,33 @@ synchronized private boolean connectOne(long sid, InetSocketAddress electionAddr
          */
         
         synchronized void connectOne(long sid){
    +        connectOne(sid, self.getLastSeenQuorumVerifier());
    +    }
    +
    +    synchronized void connectOne(long sid, QuorumVerifier lastSeenQV){
             if (senderWorkerMap.get(sid) != null) {
    -             LOG.debug("There is a connection already for server " + sid);
    -             return;
    +            LOG.debug("There is a connection already for server " + sid);
    +            return;
             }
    -        synchronized(self) {
    -           boolean knownId = false;
    -            // Resolve hostname for the remote server before attempting to
    -            // connect in case the underlying ip address has changed.
    -            self.recreateSocketAddresses(sid);
    -            if (self.getView().containsKey(sid)) {
    -               knownId = true;
    -                if (connectOne(sid, self.getView().get(sid).electionAddr))
    -                   return;
    -            } 
    -            if (self.getLastSeenQuorumVerifier()!=null && self.getLastSeenQuorumVerifier().getAllMembers().containsKey(sid)
    -                   && (!knownId || (self.getLastSeenQuorumVerifier().getAllMembers().get(sid).electionAddr !=
    -                   self.getView().get(sid).electionAddr))) {
    -               knownId = true;
    -                if (connectOne(sid, self.getLastSeenQuorumVerifier().getAllMembers().get(sid).electionAddr))
    -                   return;
    -            } 
    -            if (!knownId) {
    -                LOG.warn("Invalid server id: " + sid);
    +        boolean knownId = false;
    +        // Resolve hostname for the remote server before attempting to
    +        // connect in case the underlying ip address has changed.
    +        self.recreateSocketAddresses(sid);
    +        if (self.getView().containsKey(sid)) {
    --- End diff --
    
    @shralex Thanks for review comments! Made two changes:
    
    * Refactored the code to reuse getView results. This view is not passed in as I thought that's simplified caller site.
    * This code block inside connectOne is now synchronized with the same lock that protecting other view / quorum verifiers of the same QuorumPeer. I think this makes the code block semantically equivalent to the previous code block before this change, where the code block was synchronizing on the whole QuorumPeer 'self' with the intention that during the entire execution of connectOne, accesses to configs are protected. I did not add any comments as with the explicit synchronizing block, the semantic should be self explanatory. 
    
    My stress tests look good so far with latest changes.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] zookeeper pull request #92: ZOOKEEPER-2080: Fix deadlock in dynamic reconfig...

Posted by lvfangmin <gi...@git.apache.org>.
Github user lvfangmin commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/92#discussion_r90764675
  
    --- Diff: src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java ---
    @@ -1390,24 +1406,29 @@ public QuorumVerifier configFromString(String s) throws IOException, ConfigExcep
         }
         
         /**
    -     * Return QuorumVerifier object for the last committed configuration
    +     * Return QuorumVerifier object for the last committed configuration.
          */
    -
    -    public synchronized QuorumVerifier getQuorumVerifier(){
    -        return quorumVerifier;
    -
    +    public QuorumVerifier getQuorumVerifier(){
    +        synchronized (qvLock) {
    +            return quorumVerifier;
    +        }
         }
     
    -    public synchronized QuorumVerifier getLastSeenQuorumVerifier(){
    -        return lastSeenQuorumVerifier;        
    +    /**
    +     * Return QuorumVerifier object for the last proposed configuration.
    +     */
    +    public QuorumVerifier getLastSeenQuorumVerifier(){
    +        synchronized (qvLock) {
    +            return lastSeenQuorumVerifier;
    +        }
         }
         
    -    public synchronized void connectNewPeers(){
    -       if (qcm!=null && getQuorumVerifier()!=null && getLastSeenQuorumVerifier()!=null) {
    -           Map<Long, QuorumServer> committedView = getQuorumVerifier().getAllMembers();
    -           for (Entry<Long, QuorumServer> e: getLastSeenQuorumVerifier().getAllMembers().entrySet()){
    +    private void connectNewPeers(){
    --- End diff --
    
    Hi @hanm, I've followed the long discussion on the Jira, thanks for digging deeper into the problem. LGTM, only a small suggestion: it's error-prone to assume that in the future this method won't be called by other non-synchronized methods, we'd better to add the synchronize(qvLock) here too.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] zookeeper pull request #92: ZOOKEEPER-2080: Fix deadlock in dynamic reconfig...

Posted by hanm <gi...@git.apache.org>.
Github user hanm commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/92#discussion_r97451441
  
    --- Diff: src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java ---
    @@ -434,6 +434,10 @@ public int getQuorumSize(){
         //last proposed quorum verifier
         public QuorumVerifier lastSeenQuorumVerifier = null;
     
    +    // Lock object that guard access to quorumVerifier and lastSeenQuorumVerifier.
    +    private byte[] qvLock = new byte[0];
    --- End diff --
    
    Thanks for the comments, Edward. Code updated.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] zookeeper pull request #92: ZOOKEEPER-2080: Fix deadlock in dynamic reconfig...

Posted by shralex <gi...@git.apache.org>.
Github user shralex commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/92#discussion_r97266698
  
    --- Diff: src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java ---
    @@ -468,31 +469,33 @@ synchronized private boolean connectOne(long sid, InetSocketAddress electionAddr
          */
         
         synchronized void connectOne(long sid){
    +        connectOne(sid, self.getLastSeenQuorumVerifier());
    +    }
    +
    +    synchronized void connectOne(long sid, QuorumVerifier lastSeenQV){
             if (senderWorkerMap.get(sid) != null) {
    -             LOG.debug("There is a connection already for server " + sid);
    -             return;
    +            LOG.debug("There is a connection already for server " + sid);
    +            return;
             }
    -        synchronized(self) {
    -           boolean knownId = false;
    -            // Resolve hostname for the remote server before attempting to
    -            // connect in case the underlying ip address has changed.
    -            self.recreateSocketAddresses(sid);
    -            if (self.getView().containsKey(sid)) {
    -               knownId = true;
    -                if (connectOne(sid, self.getView().get(sid).electionAddr))
    -                   return;
    -            } 
    -            if (self.getLastSeenQuorumVerifier()!=null && self.getLastSeenQuorumVerifier().getAllMembers().containsKey(sid)
    -                   && (!knownId || (self.getLastSeenQuorumVerifier().getAllMembers().get(sid).electionAddr !=
    -                   self.getView().get(sid).electionAddr))) {
    -               knownId = true;
    -                if (connectOne(sid, self.getLastSeenQuorumVerifier().getAllMembers().get(sid).electionAddr))
    -                   return;
    -            } 
    -            if (!knownId) {
    -                LOG.warn("Invalid server id: " + sid);
    +        boolean knownId = false;
    +        // Resolve hostname for the remote server before attempting to
    +        // connect in case the underlying ip address has changed.
    +        self.recreateSocketAddresses(sid);
    +        if (self.getView().containsKey(sid)) {
    --- End diff --
    
    How about passing also the last committed view so you don't need to call getView() multiple times ?
    I know you're protecting this with the lock in QuorumPeer, but before I read the other file I thought there may be a race because of multiple accesses to the config. A comment would help here.
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] zookeeper pull request #92: ZOOKEEPER-2080: Fix deadlock in dynamic reconfig...

Posted by hanm <gi...@git.apache.org>.
Github user hanm commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/92#discussion_r90916712
  
    --- Diff: src/java/main/org/apache/zookeeper/server/quorum/QuorumPeer.java ---
    @@ -1390,24 +1406,29 @@ public QuorumVerifier configFromString(String s) throws IOException, ConfigExcep
         }
         
         /**
    -     * Return QuorumVerifier object for the last committed configuration
    +     * Return QuorumVerifier object for the last committed configuration.
          */
    -
    -    public synchronized QuorumVerifier getQuorumVerifier(){
    -        return quorumVerifier;
    -
    +    public QuorumVerifier getQuorumVerifier(){
    +        synchronized (qvLock) {
    +            return quorumVerifier;
    +        }
         }
     
    -    public synchronized QuorumVerifier getLastSeenQuorumVerifier(){
    -        return lastSeenQuorumVerifier;        
    +    /**
    +     * Return QuorumVerifier object for the last proposed configuration.
    +     */
    +    public QuorumVerifier getLastSeenQuorumVerifier(){
    +        synchronized (qvLock) {
    +            return lastSeenQuorumVerifier;
    +        }
         }
         
    -    public synchronized void connectNewPeers(){
    -       if (qcm!=null && getQuorumVerifier()!=null && getLastSeenQuorumVerifier()!=null) {
    -           Map<Long, QuorumServer> committedView = getQuorumVerifier().getAllMembers();
    -           for (Entry<Long, QuorumServer> e: getLastSeenQuorumVerifier().getAllMembers().entrySet()){
    +    private void connectNewPeers(){
    --- End diff --
    
    @lvfangmin Good suggestion, PR updated. Thanks for double checking. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---