You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@zookeeper.apache.org by "JL (JIRA)" <ji...@apache.org> on 2013/03/29 01:31:17 UTC

[jira] [Commented] (ZOOKEEPER-1678) Server fails to join quorum when a peer is unreachable (5 ZK server setup)

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616907#comment-13616907 ] 

JL commented on ZOOKEEPER-1678:
-------------------------------

Apparently, the ZK server that is trying to re-join the quorum takes too long to realize that it cannot reach the node that is down.  It seems to take about 60+ seconds to timeout a connection attempt when looking for the quorum.
Every 60 - 70 seconds, an exception of the following form shows up:

{{
2013-03-29 00:14:16,181 - WARN  [WorkerSender[myid=1]:QuorumCnxManager@368] - Cannot open channel to 3 at election address 10.51.21.140:3888
java.net.NoRouteToHostException: No route to host
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.PlainSocketImpl.doConnect(Unknown Source)
        at java.net.PlainSocketImpl.connectToAddress(Unknown Source)
        at java.net.PlainSocketImpl.connect(Unknown Source)
        at java.net.SocksSocketImpl.connect(Unknown Source)
        at java.net.Socket.connect(Unknown Source)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:354)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:327)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:393)
        at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:365)
        at java.lang.Thread.run(Unknown Source)
}}
                
> Server fails to join quorum when a peer is unreachable (5 ZK server setup)
> --------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1678
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1678
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection
>    Affects Versions: 3.4.5
>         Environment: java version "1.6.0_32"
> Java(TM) SE Runtime Environment (build 1.6.0_32-b05)
> Java HotSpot(TM) 64-Bit Server VM (build 20.7-b02, mixed mode)
> Distributor ID:	Ubuntu
> Description:	Ubuntu 12.04.1 LTS
> Release:	12.04
> Codename:	precise
> uname -a Linux ha-vani3-0 3.2.0-23-virtual #36-Ubuntu SMP Tue Apr 10 22:29:03 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
>            Reporter: JL
>
> In a 5-node ZK cluster setup, in the following state:
> * 1 host is down / not reachable.
> * 4 hosts are up.
> * 3 ZK servers are in quorum.
> * a 4th ZK server was restarted and is trying to re-join the quorum.
> The 4th server is not able to rejoin the quorum because the connection to the host that is not established, and apparently takes to long to timeout.
> Stack traces and additional information coming.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira