You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@zookeeper.apache.org by afine <gi...@git.apache.org> on 2017/04/17 23:36:56 UTC

[GitHub] zookeeper pull request #229: ZOOKEEPER-2759: Flaky test: org.apache.zookeepe...

GitHub user afine opened a pull request:

    https://github.com/apache/zookeeper/pull/229

    ZOOKEEPER-2759: Flaky test: org.apache.zookeeper.server.quorum.QuorumCnxManagerTest.testNoAuthLearnerConnectToAuthRequiredServerWithHigherSid

    I noticed some inconsistent test results from `testNoAuthLearnerConnectToAuthRequiredServerWithHigherSid`. It can be seen failing on Apache infra here: http://mail-archives.apache.org/mod_mbox/zookeeper-dev/201701.mbox/%3c374175863.2852.1485127554310.JavaMail.jenkins@crius%3e
    
    The problem can be "reproduced" by adding a `Thread.sleep(4000)` as the first line of the finally block   for `RecvWorker.run` in `QuorumCnxManager.java`.
    
    The issue is caused by a race condition between the removal of the relevant `sid` from `senderWorkerMap` for `peer0` and the 3 second delay from `assertEventuallyNotConnected`. 
    
    Other tests in this class do not experience the same problem because `testNoAuthLearnerConnectToAuthRequiredServerWithHigherSid` is unique in that it is the only test using `assertEventuallyNotConnected` where the peer with the lower `sid` is not using SASL. This means that the peer with the lower `sid` will always create send and receive workers in `handleConnection` (as it will not throw an exception on `authServer.authenticate(sock, din)`) and may not destroy them in time for the test assertion.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/afine/zookeeper ZOOKEEPER-2759

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/zookeeper/pull/229.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #229
    
----
commit c567c494d116b6893e24e5b1b812a6ff24791614
Author: Abraham Fine <af...@apache.org>
Date:   2017-04-17T22:51:27Z

    ZOOKEEPER-2759: Flaky test:
    org.apache.zookeeper.server.quorum.QuorumCnxManagerTest.testNoAuthLearnerConnectToAuthRequiredServerWithHigherSid

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] zookeeper issue #229: ZOOKEEPER-2759: Flaky test: org.apache.zookeeper.serve...

Posted by afine <gi...@git.apache.org>.
Github user afine commented on the issue:

    https://github.com/apache/zookeeper/pull/229
  
    thanks @hanm for the review


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] zookeeper issue #229: ZOOKEEPER-2759: Flaky test: org.apache.zookeeper.serve...

Posted by hanm <gi...@git.apache.org>.
Github user hanm commented on the issue:

    https://github.com/apache/zookeeper/pull/229
  
    lgtm. One minor comment - a short comment in the test case would help for future readers here since this test case is the only case use a different connection status verification. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] zookeeper issue #229: ZOOKEEPER-2759: Flaky test: org.apache.zookeeper.serve...

Posted by afine <gi...@git.apache.org>.
Github user afine commented on the issue:

    https://github.com/apache/zookeeper/pull/229
  
    @hanm please see https://github.com/apache/zookeeper/pull/242


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] zookeeper issue #229: ZOOKEEPER-2759: Flaky test: org.apache.zookeeper.serve...

Posted by afine <gi...@git.apache.org>.
Github user afine commented on the issue:

    https://github.com/apache/zookeeper/pull/229
  
    @hanm thank you for the through review. you are right the original test is bogus.
    
    I wanted to write the test in a way that verifies behavior for both peers so I used mockito to spy on the sender worker maps. Let me know what you think.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] zookeeper pull request #229: ZOOKEEPER-2759: Flaky test: org.apache.zookeepe...

Posted by afine <gi...@git.apache.org>.
GitHub user afine reopened a pull request:

    https://github.com/apache/zookeeper/pull/229

    ZOOKEEPER-2759: Flaky test: org.apache.zookeeper.server.quorum.QuorumCnxManagerTest.testNoAuthLearnerConnectToAuthRequiredServerWithHigherSid

    I noticed some inconsistent test results from `testNoAuthLearnerConnectToAuthRequiredServerWithHigherSid`. It can be seen failing on Apache infra here: http://mail-archives.apache.org/mod_mbox/zookeeper-dev/201701.mbox/%3c374175863.2852.1485127554310.JavaMail.jenkins@crius%3e
    
    The problem can be "reproduced" by adding a `Thread.sleep(4000)` as the first line of the finally block   for `RecvWorker.run` in `QuorumCnxManager.java`.
    
    The issue is caused by a race condition between the removal of the relevant `sid` from `senderWorkerMap` for `peer0` and the 3 second delay from `assertEventuallyNotConnected`. 
    
    Other tests in this class do not experience the same problem because `testNoAuthLearnerConnectToAuthRequiredServerWithHigherSid` is unique in that it is the only test using `assertEventuallyNotConnected` where the peer with the lower `sid` is not using SASL. This means that the peer with the lower `sid` will always create send and receive workers in `handleConnection` (as it will not throw an exception on `authServer.authenticate(sock, din)`) and may not destroy them in time for the test assertion.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/afine/zookeeper ZOOKEEPER-2759

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/zookeeper/pull/229.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #229
    
----
commit c567c494d116b6893e24e5b1b812a6ff24791614
Author: Abraham Fine <af...@apache.org>
Date:   2017-04-17T22:51:27Z

    ZOOKEEPER-2759: Flaky test:
    org.apache.zookeeper.server.quorum.QuorumCnxManagerTest.testNoAuthLearnerConnectToAuthRequiredServerWithHigherSid

commit 112a70ba14f5cbb94e301702b328acfff54b0b02
Author: Abraham Fine <af...@apache.org>
Date:   2017-04-27T03:56:01Z

    fix tests with mockito

commit 55869fac9d0f99eb52156c889e1b5a76424047f6
Author: Abraham Fine <af...@apache.org>
Date:   2017-04-27T20:05:00Z

    add some comments

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] zookeeper issue #229: ZOOKEEPER-2759: Flaky test: org.apache.zookeeper.serve...

Posted by hanm <gi...@git.apache.org>.
Github user hanm commented on the issue:

    https://github.com/apache/zookeeper/pull/229
  
    @afine Can you please rebase your working branch with apache branch-3.4 and then open a new PR (just one line fix) for your recent fix? The new PR can reuse same JIRA issue.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] zookeeper pull request #229: ZOOKEEPER-2759: Flaky test: org.apache.zookeepe...

Posted by afine <gi...@git.apache.org>.
Github user afine closed the pull request at:

    https://github.com/apache/zookeeper/pull/229


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] zookeeper pull request #229: ZOOKEEPER-2759: Flaky test: org.apache.zookeepe...

Posted by afine <gi...@git.apache.org>.
Github user afine closed the pull request at:

    https://github.com/apache/zookeeper/pull/229


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] zookeeper issue #229: ZOOKEEPER-2759: Flaky test: org.apache.zookeeper.serve...

Posted by hanm <gi...@git.apache.org>.
Github user hanm commented on the issue:

    https://github.com/apache/zookeeper/pull/229
  
    Thanks for work on this! I hope we have more contributors focus on improving the test.
    
    The proposed fix masks the real problem - it will always success no matter what - even if the connection was succeeded, since the halt will clean up the sender worker map of the peer 0.
    
    >>  a race condition between the removal of the relevant sid from senderWorkerMap for peer0 and the 3 second delay from assertEventuallyNotConnected.
    I think there is a race, but my observation on it is slightly different. The race is between when peer 0 received (and finished) establish connection with peer 1, and the assert inside assertEventuallyNotConnected.
    
    >> may not destroy them in time for the test assertion.
    Because of the race, once peer 0 has received the connection, its sender worker map will has sid 1 entry forever until the death of the peer 0 itself. 
    
    I think instead of checking on peer 0, we can check peer 1. Its sender worker map should always be clean. We can't check peer 0 here because the current API does not encode the true state of the connection - it only encodes what happened (the connection from 1 to 0 was established on 0's side), but it does not encode the status of the connection (is the connection still alive or not - in this case peer 1 will close the socket later when it found out peer 0 has no auth. but on peer 0 side the sender work will have sid 1 inside until its death, what's done has been done.). The state of peer 1 wrt its sender map should also reflect state of peer 0  since there is only a single connection between 0 and 1.
    
    Something like this should serve the purpose:
    assertEventuallyNotConnected(peer1, 0);
    
    Maybe add a sleep before it to make sure all the connection attempts were made (from 0  to 1 and from 1 to 0 two ways), but it's a best effort. 5 sec sounds enough.
    



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---