Posted to dev@zookeeper.apache.org by "Camille Fournier (Commented) (JIRA)" <ji...@apache.org> on 2011/11/01 16:25:32 UTC

[jira] [Commented] (ZOOKEEPER-1264) FollowerResyncConcurrencyTest failing intermittently

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141235#comment-13141235 ] 

Camille Fournier commented on ZOOKEEPER-1264:
---------------------------------------------

From a comment I added to the tracker that this change was attached to:
ZOOKEEPER-1136 causes a concurrency bug. Specifically:
1. Follower rejoins, gets snap from leader
2. Follower gets NEWLEADER message and takes a snapshot
3. Follower gets some additional transactions forwarded from the leader and applies these directly to the data tree
4. Follower gets an UPTODATE message, does not take a snapshot
5. Follower starts following, writes some new transactions to its log, and is killed before it takes another snapshot
6. Follower restarts and gets a DIFF from the leader

The transactions that came in between NEWLEADER and UPTODATE never go anywhere but the internal data tree. If that tree isn't snapshotted before the follower is killed, and the follower then restarts with only a DIFF, those transactions are lost.

I think the proper thing to do is snapshot after UPTODATE, but I'm not sure why we changed this to snapshot after NEWLEADER instead. The wiki doesn't seem to explain that clearly. 
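
To make the sequence above concrete, here is a minimal toy model of it, not ZooKeeper code. Everything in it (the ResyncSketch class, the Follower fields, the use of bare zxid numbers as "transactions", and the assumption that the DIFF carries only zxids newer than the last one in the follower's log) is invented for illustration. It only tries to show why moving the snapshot from NEWLEADER to UPTODATE would matter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

public class ResyncSketch {
    /** Toy follower (hypothetical): an in-memory tree plus what survives a crash. */
    static class Follower {
        TreeSet<Long> dataTree = new TreeSet<>(); // in-memory state, lost on crash
        TreeSet<Long> snapshot = new TreeSet<>(); // last snapshot on disk
        List<Long> txnLog = new ArrayList<>();    // transactions written to the log

        void takeSnapshot() { snapshot = new TreeSet<>(dataTree); }

        // Sync-phase transaction: goes to the data tree only, never to the log.
        void applyDirect(long zxid) { dataTree.add(zxid); }

        // Normal following: transaction is logged and applied.
        void applyLogged(long zxid) { txnLog.add(zxid); dataTree.add(zxid); }

        // Crash and restart: rebuild the tree from the snapshot plus the log.
        void restart() {
            dataTree = new TreeSet<>(snapshot);
            dataTree.addAll(txnLog);
        }
    }

    /** Replays steps 1-6; the leader is assumed to hold zxids 1..8. */
    static TreeSet<Long> run(boolean snapshotAfterUpToDate) {
        Follower f = new Follower();
        for (long z = 1; z <= 3; z++) f.applyDirect(z); // 1. snap from leader
        if (!snapshotAfterUpToDate) f.takeSnapshot();   // 2. NEWLEADER
        f.applyDirect(4);                               // 3. forwarded transactions,
        f.applyDirect(5);                               //    data tree only
        if (snapshotAfterUpToDate) f.takeSnapshot();    // 4. UPTODATE
        f.applyLogged(6);                               // 5. following, then killed
        f.applyLogged(7);                               //    before another snapshot
        f.restart();                                    // 6. restart, then DIFF
        long lastLogged = f.txnLog.get(f.txnLog.size() - 1);
        for (long z = lastLogged + 1; z <= 8; z++) f.applyLogged(z);
        return f.dataTree;
    }

    public static void main(String[] args) {
        System.out.println("snapshot at NEWLEADER:   " + run(false));
        System.out.println("snapshot after UPTODATE: " + run(true));
    }
}
```

In this model the first run ends up without zxids 4 and 5, while the second run holds all of 1 through 8. The real recovery path is more involved, so treat this as an aid for reasoning about the ordering only.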
                
> FollowerResyncConcurrencyTest failing intermittently
> ----------------------------------------------------
>
>                 Key: ZOOKEEPER-1264
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1264
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: tests
>    Affects Versions: 3.3.3, 3.4.0, 3.5.0
>            Reporter: Patrick Hunt
>            Assignee: Camille Fournier
>            Priority: Blocker
>             Fix For: 3.3.4, 3.4.0, 3.5.0
>
>         Attachments: ZOOKEEPER-1264.patch, ZOOKEEPER-1264_branch33.patch, ZOOKEEPER-1264_branch34.patch, followerresyncfailure_log.txt.gz, logs.zip, tmp.zip
>
>
> The FollowerResyncConcurrencyTest test is failing intermittently. 
> Saw the following on 3.4:
> {noformat}
> junit.framework.AssertionFailedError: Should have same number of
> ephemerals in both followers expected:<11741> but was:<14001>
>        at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.verifyState(FollowerResyncConcurrencyTest.java:400)
>        at org.apache.zookeeper.test.FollowerResyncConcurrencyTest.testResyncBySnapThenDiffAfterFollowerCrashes(FollowerResyncConcurrencyTest.java:196)
>        at org.apache.zookeeper.JUnit4ZKTestRunner$LoggedInvokeMethod.evaluate(JUnit4ZKTestRunner.java:52)
> {noformat}
