Posted to dev@zookeeper.apache.org by "Andor Molnar (JIRA)" <ji...@apache.org> on 2018/10/05 15:18:00 UTC

[jira] [Commented] (ZOOKEEPER-3157) Improve FuzzySnapshotRelatedTest to avoid flaky due to issues like connection loss

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-3157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16639943#comment-16639943 ] 

Andor Molnar commented on ZOOKEEPER-3157:
-----------------------------------------

[~lvfangmin] [~hanm]

I think I have a better approach for this. What do you think?

The problem is here:

{code:java}
        LOG.info("Restarting follower A to load snapshot");
        mt[followerA].shutdown();
        mt[followerA].start();
        QuorumPeerMainTest.waitForOne(zk[followerA], States.CONNECTED);
{code}

I believe the problem is that when the check validates the CONNECTED state, the client hasn't realised yet that the server went down, so it still reports CONNECTED. The check passes immediately, and the rest is just luck and good timing. I would add an additional check like this:

{code:java}
        LOG.info("Restarting follower A to load snapshot");
        mt[followerA].shutdown();
        QuorumPeerMainTest.waitForOne(zk[followerA], States.CONNECTING);
        mt[followerA].start();
        QuorumPeerMainTest.waitForOne(zk[followerA], States.CONNECTED);
{code}

Just to make sure that the client gets fully disconnected before restarting the follower.
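
To illustrate the race without a real ensemble, here is a minimal, self-contained sketch. It is not the actual test code: `LaggingClient`, `serverChanged` and `waitForState` are hypothetical names standing in for the client handle and for `QuorumPeerMainTest.waitForOne`, and the client's delayed reaction is simulated with a fixed lag (an assumption made only for this example).

```java
import java.util.function.Supplier;

// Hypothetical model of the race; none of these names come from ZooKeeper.
class RestartWaitSketch {

    enum State { CONNECTING, CONNECTED }

    /** Stand-in for a client handle that notices server changes only after a delay. */
    static final class LaggingClient {
        private final long lagNanos;
        private State visible = State.CONNECTED;
        private State pending = State.CONNECTED;
        private long switchAtNanos = System.nanoTime();

        LaggingClient(long lagMs) {
            this.lagNanos = lagMs * 1_000_000L;
        }

        /** Records that the server went down (CONNECTING) or came back (CONNECTED). */
        synchronized void serverChanged(State next) {
            visible = getState();          // settle any change that is already due
            pending = next;
            switchAtNanos = System.nanoTime() + lagNanos;
        }

        /** Returns the state as the client currently sees it, which may be stale. */
        synchronized State getState() {
            if (System.nanoTime() - switchAtNanos >= 0) {
                visible = pending;
            }
            return visible;
        }
    }

    /** Polls until the expected state is observed; returns false on timeout or interrupt. */
    static boolean waitForState(Supplier<State> current, State expected, long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (current.get() != expected) {
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        LaggingClient client = new LaggingClient(200);

        client.serverChanged(State.CONNECTING);                  // "shutdown"
        // Without the extra wait, a CONNECTED check here would pass on stale state.
        System.out.println("right after shutdown: " + client.getState());

        waitForState(client::getState, State.CONNECTING, 5000);  // the added check
        client.serverChanged(State.CONNECTED);                   // "start"
        boolean ok = waitForState(client::getState, State.CONNECTED, 5000);
        System.out.println("reconnected: " + ok);
    }
}
```

In this model, the first wait is what prevents the final CONNECTED check from being satisfied by the connection that existed before the shutdown.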

> Improve FuzzySnapshotRelatedTest to avoid flaky due to issues like connection loss
> ----------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3157
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3157
>             Project: ZooKeeper
>          Issue Type: Test
>          Components: tests
>    Affects Versions: 3.6.0
>            Reporter: Fangmin Lv
>            Assignee: Fangmin Lv
>            Priority: Minor
>             Fix For: 3.6.0
>
>
> [~hanm] noticed that the test might fail because of ConnectionLoss when trying to getData ([here is an example|https://builds.apache.org/job/ZooKeepertrunk/198/testReport/junit/org.apache.zookeeper.server.quorum/FuzzySnapshotRelatedTest/testPZxidUpdatedWhenLoadingSnapshot]). We should catch this and retry to avoid flakiness.
> Internally, we 'fixed' flaky tests by adding junit.RetryRule to ZKTestCase, which is the base class for most of the tests. I'm not sure whether this is the right way to go, since it's actually 'hiding' the flaky tests, but it will reduce the number of flaky test failures a lot if we're not going to tackle them in the near term, and we can check the testing history to find out which tests are flaky and deal with them separately. So let me know if this seems to provide any benefit in the short term; if it does, I'll provide a patch to do that.


