Posted to user@zookeeper.apache.org by Brian Tarbox <br...@gmail.com> on 2012/11/19 18:13:38 UTC

What to do when a node will not join the cluster?

I have a four-node cluster (I know, it should be odd) that generally runs
fine, but this morning I needed to restart the whole cluster and one of the
nodes will not sync.  The node asks the leader for a snapshot, waits
for several minutes(!), and then fails.
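
As background to the "should be odd" remark, here is a minimal sketch of the majority arithmetic. It assumes the default majority quorum with every server voting (no observers, weights, or groups); the function names are illustrative and are not part of ZooKeeper.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of the voting servers."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """Servers that can be down while a majority can still be formed."""
    return ensemble_size - quorum_size(ensemble_size)


# Compare a few ensemble sizes, including the four-node case from this thread.
for n in (3, 4, 5):
    print(f"servers={n} quorum={quorum_size(n)} "
          f"tolerated_failures={tolerated_failures(n)}")
```

Under that assumption, a fourth server raises the quorum size from 2 to 3 without increasing the number of failures the ensemble can survive, which is why odd sizes are usually recommended.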

11:46:55,130 [myid:] - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Learner@294] - Getting a snapshot from leader
11:47:01,535 [myid:] - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Learner@325] - Setting leader epoch e
11:47:21,707 [myid:] - WARN  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Learner@341] - Got zxid 0xe0000000a expected 0x1
11:55:01,515 [myid:] - WARN  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@82] - Exception when following the leader
java.io.EOFException

On the leader side, it appears to send the snapshot and then fail.
I have no idea how to proceed; any suggestions are appreciated.

11:46:55,129 [myid:5] - INFO  [LearnerHandler-/172.16.10.200:46021:LearnerHandler@318] - Synchronizing with Follower sid: 4 maxCommittedLog=0xe00000009 minCommittedLog=0xe00000001 peerLastZxid=0x900323414
11:46:55,129 [myid:5] - WARN  [LearnerHandler-/172.16.10.200:46021:LearnerHandler@379] - Unhandled proposal scenario
11:46:55,129 [myid:5] - INFO  [LearnerHandler-/172.16.10.200:46021:LearnerHandler@395] - Sending SNAP
11:46:55,129 [myid:5] - INFO  [LearnerHandler-/172.16.10.200:46021:LearnerHandler@419] - Sending snapshot last zxid of peer is 0x900323414 zxid of leader is 0xe00000009sent zxid of db as 0xe00000009
11:55:01,513 [myid:5] - ERROR [LearnerHandler-/172.16.10.200:46021:LearnerHandler@562] - Unexpected exception causing shutdown while sock still open
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(Unknown Source)
        at java.net.SocketInputStream.read(Unknown Source)
        at java.io.BufferedInputStream.fill(Unknown Source)
        at java.io.BufferedInputStream.read(Unknown Source)
        at java.io.DataInputStream.readInt(Unknown Source)
        at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
        at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
        at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
        at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:450)
11:55:01,513 [myid:5] - WARN  [LearnerHandler-/172.16.10.200:46021:LearnerHandler@575] - ******* GOODBYE /172.16.10.200:46021 ********
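
To help read the zxid values in these logs: a zxid is a 64-bit number whose high 32 bits are the leader epoch and whose low 32 bits are a counter within that epoch. The sketch below splits the values quoted above; `split_zxid` is an illustrative helper, not a ZooKeeper API.

```python
from typing import Tuple


def split_zxid(zxid: int) -> Tuple[int, int]:
    """Return (epoch, counter): the high and low 32 bits of a zxid."""
    return zxid >> 32, zxid & 0xFFFFFFFF


# Values copied from the log lines in this message.
observed = [
    ("peerLastZxid", 0x900323414),
    ("minCommittedLog", 0xE00000001),
    ("maxCommittedLog", 0xE00000009),
    ("zxid seen by follower", 0xE0000000A),
]

for label, zxid in observed:
    epoch, counter = split_zxid(zxid)
    print(f"{label}: epoch={epoch:#x} counter={counter:#x}")
```

Read this way, the follower's last zxid is in epoch 0x9 while the leader's committed log covers only epoch 0xe, which is consistent with the leader logging "Unhandled proposal scenario" and falling back to a full snapshot (SNAP).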

Re: What to do when a node will not join the cluster?

Posted by Diego Oliveira <lo...@gmail.com>.
Brian,

    Take a look at the configuration options initLimit and
syncLimit <http://zookeeper.apache.org/doc/r3.1.2/zookeeperAdmin.html>;
they may help. I had a similar problem in a 3-node cluster because of the
size and quantity of the data: the initial sync, and even the role system,
were messed up after it had been running for some time. In my case I raised
those values and used some tricks to reduce/compact the data in zk.
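
A rough sketch of what raising those values means in wall-clock terms: per the administrator's guide, initLimit and syncLimit are expressed in ticks, and tickTime is the length of one tick in milliseconds. The numbers below are hypothetical and are not taken from either cluster in this thread; `limit_seconds` is an illustrative helper.

```python
def limit_seconds(tick_time_ms: int, limit_ticks: int) -> float:
    """Convert a limit expressed in ticks into seconds."""
    return tick_time_ms * limit_ticks / 1000.0


# Hypothetical settings (assumptions for illustration only).
tick_time_ms = 2000
init_limit = 10   # ticks a follower has to connect and sync to the leader
sync_limit = 5    # ticks a follower may lag behind before being dropped

print("initial sync window (s):", limit_seconds(tick_time_ms, init_limit))
print("sync window (s):", limit_seconds(tick_time_ms, sync_limit))

# Raising initLimit widens the window for transferring a large snapshot.
raised_init_limit = 60
print("raised initial sync window (s):",
      limit_seconds(tick_time_ms, raised_init_limit))
```

If the snapshot takes longer to transfer and load than the initial sync window, raising initLimit (or shrinking the data) is the kind of change described above.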

-- 
Diego de Oliveira
diego@diegooliveira.com
www.diegooliveira.com
Never argue with a fool -- people might not be able to tell the difference