You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@zookeeper.apache.org by afine <gi...@git.apache.org> on 2017/05/12 23:30:59 UTC

[GitHub] zookeeper pull request #251: ZOOKEEPER-2781 Flaky test: testClientAuthAgains...

GitHub user afine opened a pull request:

    https://github.com/apache/zookeeper/pull/251

    ZOOKEEPER-2781 Flaky test: testClientAuthAgainstNoAuthServerWithLowerSid

    The flaky test appears to be caused by a race condition in QuorumCnxManager that could potentially prevent two servers from connecting to each other. I was able to reproduce the issue with a debugger and a little bit of patience. It would be great if someone can share a less contrived way to reproduce the same issue. Here is the basic order of execution required to reproduce the issue between two peers (using lines of code from before this patch).  Point of clarification, reaching a line means hitting but not yet executing that line (equivalent to setting a breakpoint on that line).
    
    1. peer1 enters `startConnection` and reaches QuorumCnxManager.java:365
    2. peer0's Listener enters `handleConnection` reaches  QuorumCnxManager.java:506
    2. peer0 enters `startConnection` and reaches QuorumCnxManager.java:353
    3. peer1's Listener enters `handleConnection` and reaches QuorumCnxManager.java:483
    3. peer1 executes QuorumCnxManager.java:365 and reaches QuorumCnxManager.java:374
    4. peer0's Listener executes QuorumCnxManager.java:506 and starts a RecvWorker which stops at QuorumCnxManager.java:1027. The Listener reaches QuorumCnxManager.java:516.
    5. peer1's Listener continues executing from QuorumCnxManager.java:483, which removes the SendWorker and RecvWorker for its connection to peer0, and reaches QuorumCnxManager.java:493
    6. peer0's RecvWorker executes  QuorumCnxManager.java:1027, the socket had since been closed on peer1 and we throw an exception
    ```
        [junit] 2017-05-07 14:48:11,055 [myid:] - WARN  [RecvWorker:1:QuorumCnxManager$RecvWorker@1042] - Connection broken for id 1, my id = 0, error =
        [junit] java.io.EOFException
        [junit]     at java.io.DataInputStream.readInt(DataInputStream.java:392)
        [junit]     at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1027)
    ```
    9. peer0 executes  QuorumCnxManager.java:353 reaching  QuorumCnxManager.java:377
    10. peer1 executes connectOne at QuorumCnxManager.java:493 and continues until it reaches QuorumCnxManager.java:290 which completes the thread
    11. peer0's listener continues executing from QuorumCnxManager.java:516. Calls `handleConnection ` again but returns at QuorumCnxManager.java:469 since peer1 never wrote its sid to the socket.
    12. peer0 continues executing from QuorumCnxManager.java:377
    13. peer1 continues executing from QuorumCnxManager.java:374
    
    
    Both threads finish and no quorum has been formed. `testClientAuthAgainstNoAuthServerWithLowerSid` times out
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/afine/zookeeper ZOOKEEPER-2781

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/zookeeper/pull/251.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #251
    
----
commit 5f231641a9496aaf84a0b58d87f5fc365fa9b7e6
Author: Abraham Fine <af...@apache.org>
Date:   2017-05-12T20:23:55Z

    ZOOKEEPER-2781: Flaky test: testClientAuthAgainstNoAuthServerWithLowerSid

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---