Posted to dev@zookeeper.apache.org by "Flavio Paiva Junqueira (JIRA)" <ji...@apache.org> on 2008/09/13 17:44:44 UTC

[jira] Commented: (ZOOKEEPER-140) Deadlock in QuorumCnxManager

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630783#action_12630783 ] 

Flavio Paiva Junqueira commented on ZOOKEEPER-140:
--------------------------------------------------

It seems to me that there are two unnecessary synchronized blocks: one in sendTo(), around the call to initiateConnection, and a second around the call to receiveConnection when a new connection is accepted. Both methods synchronize again on senderWorkerMap when it is time to update the bookkeeping information on the connections. Removing these two outer blocks prevents the problem pointed out in this jira. I have tested the change and it seems to work, and the logic also looks correct to me.
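
To illustrate the idea, here is a minimal sketch, not the actual QuorumCnxManager code. The class PeerSketch, the latch standing in for the challenge exchange, and the runRing() helper are all hypothetical; only the names initiateConnection, receiveConnection and senderWorkerMap come from the discussion above. The blocking handshake runs with no lock held, and senderWorkerMap is locked only for the bookkeeping update, so a ring of three peers that all initiate at once can still run receiveConnection and make progress. The sketch does not try to handle duplicate connections between the same pair of servers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a server's connection manager. */
class PeerSketch {
    final long id;

    // Bookkeeping of connections, keyed by the remote server id.
    private final Map<Long, String> senderWorkerMap = new HashMap<>();

    // Released once this peer has answered the challenge of its initiator
    // (assumption: stands in for the reply that s.read(msgBuffer) waits for).
    private final CountDownLatch challengeAnswered = new CountDownLatch(1);

    PeerSketch(long id) {
        this.id = id;
    }

    /** Outgoing side: the blocking wait happens with no lock held. */
    boolean initiateConnection(PeerSketch remote) throws InterruptedException {
        if (!remote.challengeAnswered.await(2, TimeUnit.SECONDS)) {
            return false; // the remote peer never answered the challenge
        }
        synchronized (senderWorkerMap) { // lock only for the bookkeeping
            senderWorkerMap.put(remote.id, "outgoing");
        }
        return true;
    }

    /** Incoming side: can run while initiateConnection is still blocked. */
    void receiveConnection(long fromSid) {
        challengeAnswered.countDown(); // answer the challenge
        synchronized (senderWorkerMap) { // lock only for the bookkeeping
            senderWorkerMap.put(fromSid, "incoming");
        }
    }

    /** Runs the ring A -> B -> C -> A; returns true if every initiate completes. */
    static boolean runRing() {
        PeerSketch[] peers = { new PeerSketch(0), new PeerSketch(1), new PeerSketch(2) };
        CountDownLatch initiated = new CountDownLatch(peers.length);
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < peers.length; i++) {
            PeerSketch self = peers[i];
            PeerSketch next = peers[(i + 1) % peers.length];
            PeerSketch prev = peers[(i + peers.length - 1) % peers.length];
            // One thread initiates towards the next peer in the ring ...
            threads.add(new Thread(() -> {
                try {
                    if (self.initiateConnection(next)) {
                        initiated.countDown();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }));
            // ... while another accepts the connection from the previous peer.
            threads.add(new Thread(() -> self.receiveConnection(prev.id)));
        }
        for (Thread t : threads) {
            t.start();
        }
        try {
            boolean ok = initiated.await(5, TimeUnit.SECONDS);
            for (Thread t : threads) {
                t.join();
            }
            return ok;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

If both methods were instead wrapped in one lock per server, each initiator thread would hold that lock while waiting, the receiver thread on the same server could never answer, and runRing() would be expected to time out.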

I will postpone submitting a patch because I'd like to have the patch for ZOOKEEPER-127 reviewed and committed first.

> Deadlock in QuorumCnxManager
> ----------------------------
>
>                 Key: ZOOKEEPER-140
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-140
>             Project: Zookeeper
>          Issue Type: Bug
>            Reporter: Flavio Paiva Junqueira
>
> Frequently the servers deadlock in QuorumCnxManager:initiateConnection on
> s.read(msgBuffer) when reading the challenge from the peer.
> Calls to initiateConnection and receiveConnection are synchronized, so only one or the other can be executing at a time. This prevents two connections from opening between the same pair of servers.
> However, it seems that this leads to deadlock, as in this scenario:
> {noformat}
> A (initiate --> B)
> B (initiate --> C)
> C (initiate --> A)
> {noformat}
> initiateConnection can only complete when receiveConnection runs on the remote peer and answers the challenge. If all servers are blocked in initiateConnection, receiveConnection never runs and leader election halts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.