Posted to dev@zookeeper.apache.org by "Dave Latham (JIRA)" <ji...@apache.org> on 2013/04/16 23:49:17 UTC

[jira] [Commented] (ZOOKEEPER-1277) servers stop serving when lower 32bits of zxid roll over

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13633420#comment-13633420 ] 

Dave Latham commented on ZOOKEEPER-1277:
----------------------------------------

We recently experienced an HBase outage that I believe was caused by this issue.  We were running ZK 3.4.4, and the leader's log shows this:

{noformat}
2013-04-12 17:46:25,894 INFO org.apache.zookeeper.server.quorum.Leader: Have quorum of supporters; starting up and setting last processed zxid: 0x1a00000004
2013-04-12 17:46:25,895 WARN org.apache.zookeeper.server.FinalRequestProcessor: Zxid outstanding 111669149696 is less than current 111669149697
2013-04-12 17:46:25,895 WARN org.apache.zookeeper.server.quorum.LearnerHandler: ******* GOODBYE /10.0.1.100:34796 ********
2013-04-12 17:46:25,896 ERROR org.apache.zookeeper.server.NIOServerCnxnFactory: Thread LearnerHandler Socket[addr=/10.0.1.100,port=34796,localport=2888] tickOfLastAck:897811 synced?:true queuedPacketLength:0 died
java.lang.IllegalThreadStateException
	at java.lang.Thread.start(Thread.java:638)
	at org.apache.zookeeper.server.quorum.LeaderZooKeeperServer.startSessionTracker(LeaderZooKeeperServer.java:87)
	at org.apache.zookeeper.server.ZooKeeperServer.startup(ZooKeeperServer.java:394)
	at org.apache.zookeeper.server.quorum.Leader.processAck(Leader.java:531)
	at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:497)
{noformat}

Immediately after this, one of the followers went through a new election and became a follower again.  The heap on the leader also climbed immediately until the process was stuck spending most of its time in GC.  At that point HBase region servers started dropping like flies, and then the ZK node was killed.

I'm adding this comment for two reasons.  First, so that other people who see the same symptoms in their logs can find this issue faster.  Second, I'd love to hear from anyone more familiar with ZooKeeper whether this issue does indeed explain the observations described above.
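
For anyone comparing their own logs: the zxids above appear in both hex and decimal, so it may help to split them into epoch and counter. The following is only a rough sketch based on the layout given in the issue description (upper 32 bits epoch, lower 32 bits counter); the class and method names are made up for illustration and are not taken from the ZooKeeper code base.

```java
public class ZxidDecode {
    // Assumption from the issue description: the upper 32 bits are the epoch.
    static long epoch(long zxid) {
        return zxid >>> 32;
    }

    // Assumption from the issue description: the lower 32 bits are the counter.
    static long counter(long zxid) {
        return zxid & 0xffffffffL;
    }

    // Hypothetical helper that formats a zxid for side-by-side comparison.
    static String describe(long zxid) {
        return String.format("0x%x -> epoch=0x%x counter=%d",
                zxid, epoch(zxid), counter(zxid));
    }

    public static void main(String[] args) {
        // The three zxid values that appear in the leader log above.
        System.out.println(describe(0x1a00000004L));  // "last processed zxid"
        System.out.println(describe(111669149696L));  // "Zxid outstanding"
        System.out.println(describe(111669149697L));  // "current"
    }
}
```

Decoded this way, all three values fall in epoch 0x1a with counters 4, 0, and 1, which would be consistent with the counter having just restarted from zero.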
                
> servers stop serving when lower 32bits of zxid roll over
> --------------------------------------------------------
>
>                 Key: ZOOKEEPER-1277
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1277
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.3.3
>            Reporter: Patrick Hunt
>            Assignee: Patrick Hunt
>            Priority: Critical
>             Fix For: 3.3.5, 3.4.4, 3.5.0
>
>         Attachments: ZOOKEEPER-1277_br33.patch, ZOOKEEPER-1277_br33.patch, ZOOKEEPER-1277_br33.patch, ZOOKEEPER-1277_br33.patch, ZOOKEEPER-1277_br34.patch, ZOOKEEPER-1277_br34.patch, ZOOKEEPER-1277_trunk.patch, ZOOKEEPER-1277_trunk.patch
>
>
> When the lower 32 bits of a zxid "roll over" (a zxid is a 64-bit number, but the upper 32 bits are treated as the epoch number), the epoch number (upper 32 bits) is incremented and the lower 32 bits start at 0 again.
> This should work fine, however in the current 3.3 branch the followers see this as a NEWLEADER message, which it's not, and effectively stop serving clients. Attached clients seem to eventually time out given that heartbeats (or any operation) are no longer processed. The follower doesn't recover from this.
> I've tested this out on 3.3 branch and confirmed this problem, however I haven't tried it on 3.4/3.5. It may not happen on the newer branches due to ZOOKEEPER-335, however there is certainly an issue with updating the "acceptedEpoch" files contained in the datadir. (I'll enter a separate jira for that)
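
The rollover arithmetic described in the quoted issue can be illustrated with plain 64-bit addition. This is a minimal sketch, not ZooKeeper's actual code: `makeZxid` is a hypothetical helper, and the starting epoch 0x19 is an assumed value chosen only so that the result lines up with the 0x1a epoch seen in the log.

```java
public class ZxidRollover {
    // Hypothetical helper: pack an epoch and a counter into one 64-bit zxid,
    // following the layout in the issue description (epoch in the upper 32 bits).
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    public static void main(String[] args) {
        // Assumed starting point: epoch 0x19 with the counter at its maximum.
        long last = makeZxid(0x19L, 0xffffffffL);

        // A plain increment carries out of the lower 32 bits into the epoch.
        long next = last + 1;

        System.out.printf("last=0x%x next=0x%x%n", last, next);
        System.out.printf("next epoch=0x%x counter=%d%n",
                next >>> 32, next & 0xffffffffL);
    }
}
```

Under these assumptions the increment alone moves the zxid into the next epoch with a counter of zero, with no leader election involved, which matches the situation the issue describes followers as misreading.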

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira