Posted to dev@zookeeper.apache.org by "Patrick Hunt (JIRA)" <ji...@apache.org> on 2012/12/18 21:18:15 UTC

[jira] [Comment Edited] (ZOOKEEPER-1599) 3.3 server cannot join 3.4 quorum

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13535265#comment-13535265 ] 

Patrick Hunt edited comment on ZOOKEEPER-1599 at 12/18/12 8:17 PM:
-------------------------------------------------------------------

[~breed] I thought I was pretty clear on the scenario (my recent comment above, requoted here):

bq. In the common rolling upgrade scenario, ops typically upgrades the ensemble first, then the clients. A case such as multi commands being run against 3.4 before the ensemble upgrade completes, while it can certainly happen, is up to the user and unlikely if they follow our instructions. If they do so, they are outside the parameters of what we support. However, in the common case all they want to do is upgrade the servers without service downtime. We should support this. We always have in the past, and we tell people that's a guarantee.

Rolling upgrade is important because it lets us upgrade the ensemble while keeping it highly available. The goal is to complete the upgrade in a short period of time, not to run clients that use new features unsupported by the older servers.

I agree with Alex: if we make a significant change that is not backward compatible, we need to increment the major version number (and the protocol version). That's always been one of our reasons for having a major version number. That said, I don't see why we'd need to do so in this case.
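
To make the versioning rule concrete, here is a minimal sketch. It is not ZooKeeper code: the class and method names (UpgradePolicy, supportsRollingUpgrade) are hypothetical, and "4.0.0" is a made-up version used only for contrast. It assumes the rule as stated above, namely that servers sharing a major version are expected to interoperate during a rolling upgrade, and that a backward-incompatible change would be signalled by a major version increment.

```java
public class UpgradePolicy {

    /** Returns the major component of a dotted version string such as "3.4.5". */
    static int major(String version) {
        return Integer.parseInt(version.split("\\.")[0]);
    }

    /**
     * Hypothetical policy check (an assumption for illustration, not a ZooKeeper API):
     * a rolling upgrade is expected to work only when both versions share a major number.
     */
    static boolean supportsRollingUpgrade(String from, String to) {
        return major(from) == major(to);
    }

    public static void main(String[] args) {
        // The versions from this issue: same major version, so rolling upgrade should be supported.
        System.out.println(supportsRollingUpgrade("3.3.6", "3.4.5"));
        // A made-up major bump: incompatibility would be allowed here.
        System.out.println(supportsRollingUpgrade("3.4.5", "4.0.0"));
    }
}
```

Under this reading, a 3.3 server failing to join a 3.4 quorum breaks the guarantee, since both sides share major version 3.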

                
                  
> 3.3 server cannot join 3.4 quorum
> ---------------------------------
>
>                 Key: ZOOKEEPER-1599
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1599
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum
>    Affects Versions: 3.3.6, 3.4.5
>            Reporter: Skye Wanderman-Milne
>            Assignee: Skye Wanderman-Milne
>            Priority: Blocker
>             Fix For: 3.4.6
>
>         Attachments: ZOOKEEPER-1599.patch
>
>
> When a 3.3 server attempts to join an existing quorum led by a 3.4 server, the 3.3 server is disconnected while trying to download the leader's snapshot. The 3.3 server restarts and starts the process over again, but is never able to join the quorum.
> 3.3 server log:
> {code}
> 2012-12-07 10:44:34,582 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2183:Learner@294] - Getting a snapshot from leader
> 2012-12-07 10:44:34,582 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2183:Learner@325] - Setting leader epoch 12
> 2012-12-07 10:44:54,604 - WARN  [QuorumPeer:/0:0:0:0:0:0:0:0:2183:Follower@82] - Exception when following the leader
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>         at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:84)
>         at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
>         at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:148)
>         at org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:332)
>         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:75)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:645)
> 2012-12-07 10:44:54,605 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2183:Follower@165] - shutdown called
> java.lang.Exception: shutdown Follower
>         at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:165)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:649)
> {code}
> 3.4 leader log:
> {code}
> 2012-12-07 10:51:35,178 [myid:2] - INFO  [WorkerReceiver[myid=2]:FastLeaderElection$Messenger$WorkerReceiver@273] - Backward compatibility mode, server id=3
> 2012-12-07 10:51:35,178 [myid:2] - INFO  [WorkerReceiver[myid=2]:FastLeaderElection@542] - Notification: 3 (n.leader), 0x1100000000 (n.zxid), 0x2 (n.round), LOOKING (n.state), 3 (n.sid), 0x11 (n.peerEPoch), LEADING (my state)
> 2012-12-07 10:51:35,182 [myid:2] - INFO  [LearnerHandler-/127.0.0.1:37654:LearnerHandler@263] - Follower sid: 3 : info : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@262f4873
> 2012-12-07 10:51:35,182 [myid:2] - INFO  [LearnerHandler-/127.0.0.1:37654:LearnerHandler@318] - Synchronizing with Follower sid: 3 maxCommittedLog=0x0 minCommittedLog=0x0 peerLastZxid=0x1100000000
> 2012-12-07 10:51:35,182 [myid:2] - INFO  [LearnerHandler-/127.0.0.1:37654:LearnerHandler@395] - Sending SNAP
> 2012-12-07 10:51:35,183 [myid:2] - INFO  [LearnerHandler-/127.0.0.1:37654:LearnerHandler@419] - Sending snapshot last zxid of peer is 0x1100000000  zxid of leader is 0x1200000000sent zxid of db as 0x1200000000
> 2012-12-07 10:51:55,204 [myid:2] - ERROR [LearnerHandler-/127.0.0.1:37654:LearnerHandler@562] - Unexpected exception causing shutdown while sock still open
> java.net.SocketTimeoutException: Read timed out
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.read(SocketInputStream.java:150)
>         at java.net.SocketInputStream.read(SocketInputStream.java:121)
>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
>         at java.io.DataInputStream.readInt(DataInputStream.java:387)
>         at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>         at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>         at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
>         at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:450)
> 2012-12-07 10:51:55,205 [myid:2] - WARN  [LearnerHandler-/127.0.0.1:37654:LearnerHandler@575] - ******* GOODBYE /127.0.0.1:37654 ********
> {code}
