Posted to dev@zookeeper.apache.org by "Flavio Junqueira (JIRA)" <ji...@apache.org> on 2013/07/09 11:43:50 UTC

[jira] [Commented] (ZOOKEEPER-1678) Server fails to join quorum when a peer is unreachable (5 ZK server setup)

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13703093#comment-13703093 ] 

Flavio Junqueira commented on ZOOKEEPER-1678:
---------------------------------------------

[~juliolopez] FLE pushes notifications to other servers, but a server may start when no one else is around. Instead of sending an overwhelming number of messages in that case, the server backs off, and the backoff caps at 60 seconds, as you say, if I remember correctly. I don't mind making that cap configurable if it helps with your case.
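
To make the backoff concrete, here is a minimal sketch of a doubling timeout with a configurable cap. It is not the actual FLE code: the class and method names are hypothetical, the 200 ms starting value is an assumption, and only the 60-second cap comes from the discussion above.

```java
import java.util.ArrayList;
import java.util.List;

public class NotificationBackoffSketch {
    // Assumed starting timeout, for illustration only.
    static final int INITIAL_TIMEOUT_MS = 200;
    // The 60 s cap mentioned in the comment; here it is just a default
    // that callers can override.
    static final int DEFAULT_MAX_INTERVAL_MS = 60_000;

    /** Doubles the current timeout without exceeding the cap. */
    static int nextTimeout(int currentMs, int maxIntervalMs) {
        long doubled = 2L * currentMs; // long avoids int overflow
        return (int) Math.min(doubled, (long) maxIntervalMs);
    }

    /** Returns the timeouts used over the given number of rounds with no reply. */
    static List<Integer> schedule(int initialMs, int maxIntervalMs, int rounds) {
        List<Integer> timeouts = new ArrayList<>();
        int current = Math.min(initialMs, maxIntervalMs);
        for (int i = 0; i < rounds; i++) {
            timeouts.add(current);
            current = nextTimeout(current, maxIntervalMs);
        }
        return timeouts;
    }

    public static void main(String[] args) {
        // Optional first argument overrides the cap (hypothetical knob).
        int cap = args.length > 0 ? Integer.parseInt(args[0]) : DEFAULT_MAX_INTERVAL_MS;
        System.out.println(schedule(INITIAL_TIMEOUT_MS, cap, 11));
    }
}
```

With a lower cap, a restarted server that finds no peers at first would retry more often, at the cost of more notification traffic.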

I'm not convinced that the randomization of notifications will make any difference, but perhaps I'm not understanding your proposal correctly.

I think you're referring to SendWorker in QCM, is that right? We do have one per server, no?

As [~abranzyck] pointed out, it sounds like the problem reported here could be solved by the fix of ZOOKEEPER-900. Could you guys make sure that solution works here and possibly provide an updated patch?
                
> Server fails to join quorum when a peer is unreachable (5 ZK server setup)
> --------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1678
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1678
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection
>    Affects Versions: 3.4.5
>         Environment: java version "1.6.0_32"
> Java(TM) SE Runtime Environment (build 1.6.0_32-b05)
> Java HotSpot(TM) 64-Bit Server VM (build 20.7-b02, mixed mode)
> Distributor ID:	Ubuntu
> Description:	Ubuntu 12.04.1 LTS
> Release:	12.04
> Codename:	precise
> uname -a Linux ha-vani3-0 3.2.0-23-virtual #36-Ubuntu SMP Tue Apr 10 22:29:03 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
>            Reporter: Julio Lopez
>
> In a 5-node ZK cluster setup, in the following state:
> * 1 host is down / not reachable.
> * 4 hosts are up.
> * 3 ZK servers are in quorum.
> * a 4th ZK server was restarted and is trying to re-join the quorum.
> The 4th server is not able to rejoin the quorum because the connection to the unreachable host is not established and apparently takes too long to time out.
> Stack traces and additional information coming.
