You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@zookeeper.apache.org by afine <gi...@git.apache.org> on 2017/06/30 22:55:12 UTC

[GitHub] zookeeper pull request #300: ZOOKEEPER-2807: Flaky test: org.apache.zookeepe...

GitHub user afine opened a pull request:

    https://github.com/apache/zookeeper/pull/300

    ZOOKEEPER-2807: Flaky test: org.apache.zookeeper.test.WatchEventWhenAutoResetTest.testNodeDataChanged

    This failure appears to be related to ZOOKEEPER-2024. When a follower is syncing with the leader and logging "the stuff that came in between the snapshot and the uptodate" we do not wait for those commits to hit the db before starting the follower.
    
    In 3.6, we now have two separate queues in the CommitProcessor, it is possible that the follower can serve requests before all of the commits from the leader have been applied. In the case of this test failure we can set a watch in between a create and delete call. This patch makes sure that the "delete" call will be committed before attempting to handle a client's watcher request.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/afine/zookeeper ZOOKEEPER-2807

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/zookeeper/pull/300.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #300
    
----
commit ca1774dca59d7a1c9e1d9e3c8391d21cf8c2c733
Author: Abraham Fine <af...@apache.org>
Date:   2017-06-30T22:44:26Z

    ZOOKEEPER-2807: Flaky test: org.apache.zookeeper.test.WatchEventWhenAutoResetTest.testNodeDataChanged

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] zookeeper issue #300: ZOOKEEPER-2807: Flaky test: org.apache.zookeeper.test....

Posted by anmolnar <gi...@git.apache.org>.

Github user anmolnar commented on the issue:

    https://github.com/apache/zookeeper/pull/300
  
    @afine do you think this is a good one to commit?


---

[GitHub] zookeeper issue #300: ZOOKEEPER-2807: Flaky test: org.apache.zookeeper.test....

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit commented on the issue:

    https://github.com/apache/zookeeper/pull/300
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2431/



---

[GitHub] zookeeper issue #300: ZOOKEEPER-2807: Flaky test: org.apache.zookeeper.test....

Posted by anmolnar <gi...@git.apache.org>.

Github user anmolnar commented on the issue:

    https://github.com/apache/zookeeper/pull/300
  
    @afine @phunt shall we commit this?


---

[GitHub] zookeeper pull request #300: ZOOKEEPER-2807: Flaky test: org.apache.zookeepe...

Posted by anmolnar <gi...@git.apache.org>.

Github user anmolnar commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/300#discussion_r154363054
  
    --- Diff: src/java/main/org/apache/zookeeper/server/quorum/CommitProcessor.java ---
    @@ -327,6 +257,95 @@ public void run() {
             LOG.info("CommitProcessor exited loop!");
         }
     
    +    private void processCommittedRequest() throws IOException, InterruptedException {
    +        // In case of a spurious wakeup in waitForCommittedRequests we should not
    +        // remove the request from the queue until it has been processed
    +        Request request = committedRequests.peek();
    +
    +        if (request == null) {
    +            committedRequests.poll();
    --- End diff --
    
    If request==null then the committedRequests queue is already empty. Why do need to poll() here?
    Could it be a small mistake and originally you wanted to poll() after the peek() was successful...?


---

[GitHub] zookeeper pull request #300: ZOOKEEPER-2807: Flaky test: org.apache.zookeepe...

Posted by afine <gi...@git.apache.org>.

Github user afine commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/300#discussion_r154786233
  
    --- Diff: src/java/main/org/apache/zookeeper/server/quorum/CommitProcessor.java ---
    @@ -327,6 +257,95 @@ public void run() {
             LOG.info("CommitProcessor exited loop!");
         }
     
    +    private void processCommittedRequest() throws IOException, InterruptedException {
    +        // In case of a spurious wakeup in waitForCommittedRequests we should not
    +        // remove the request from the queue until it has been processed
    +        Request request = committedRequests.peek();
    +
    +        if (request == null) {
    +            committedRequests.poll();
    --- End diff --
    
    we don't


---

[GitHub] zookeeper issue #300: ZOOKEEPER-2807: Flaky test: org.apache.zookeeper.test....

Posted by anmolnar <gi...@git.apache.org>.

Github user anmolnar commented on the issue:

    https://github.com/apache/zookeeper/pull/300
  
    @afine If I'm not mistaken there's a thread running which receives messages from the Leader in QuorumPeer.java:997:
    Call stack is:
    QuorumPeer.run() -> Follower.followLeader() -> Follower.processPacket() -> FollowerZookeeperServer.commit()
    This code path could potentially add more commit messages to the queue which are not related to the sync.
    
    Even if I'm wrong with the above, @shralex 's concern is still valid: "Even if you drain the committedRequests, I'm not sure that guarantees that there are no more that will arrive."
    It's possible to drain the commit queue faster than leader sends messages.
    
    In other words: draining the commit queue doesn't seem to guarantee that the sync is complete.


---

[GitHub] zookeeper issue #300: ZOOKEEPER-2807: Flaky test: org.apache.zookeeper.test....

Posted by anmolnar <gi...@git.apache.org>.

Github user anmolnar commented on the issue:

    https://github.com/apache/zookeeper/pull/300
  
    @phunt @rakeshadr @hanm This test is quite frequently flaking, I think it'd be useful to commit the patch if we're done with the review.


---

[GitHub] zookeeper issue #300: ZOOKEEPER-2807: Flaky test: org.apache.zookeeper.test....

Posted by anmolnar <gi...@git.apache.org>.

Github user anmolnar commented on the issue:

    https://github.com/apache/zookeeper/pull/300
  
    @afine You're right. I missed that syncWithLeader() call is on the same path and in the same thread as adding commits to the queue.
    
    In which case this must be right:
    - syncWithLeader() blocks the follower, 
    - Leader sends commits for the sync process, 
    - Leader sends UPTODATE at the very end, 
    - Follower drains the commit queue
    - Follower starts following.
    



---

[GitHub] zookeeper issue #300: ZOOKEEPER-2807: Flaky test: org.apache.zookeeper.test....

Posted by anmolnar <gi...@git.apache.org>.

Github user anmolnar commented on the issue:

    https://github.com/apache/zookeeper/pull/300
  
    +1 looks good to me.
    @afine Would you please trigger another build to get a green one?


---

[GitHub] zookeeper pull request #300: ZOOKEEPER-2807: Flaky test: org.apache.zookeepe...

Posted by anmolnar <gi...@git.apache.org>.

Github user anmolnar commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/300#discussion_r155103949
  
    --- Diff: src/java/main/org/apache/zookeeper/server/quorum/CommitProcessor.java ---
    @@ -240,84 +240,14 @@ public void run() {
                         }
     
                         // Process committed head
    -                    if ((request = committedRequests.poll()) == null) {
    -                        throw new IOException("Error: committed head is null");
    -                    }
    -
    -                    /*
    -                     * Check if request is pending, if so, update it with the committed info
    -                     */
    -                    LinkedList<Request> sessionQueue = pendingRequests
    -                            .get(request.sessionId);
    -                    if (sessionQueue != null) {
    -                        // If session queue != null, then it is also not empty.
    -                        Request topPending = sessionQueue.poll();
    -                        if (request.cxid != topPending.cxid) {
    -                            /*
    -                             * TL;DR - we should not encounter this scenario often under normal load.
    -                             * We pass the commit to the next processor and put the pending back with a warning.
    -                             *
    -                             * Generally, we can get commit requests that are not at the queue head after
    -                             * a session moved (see ZOOKEEPER-2684). Let's denote the previous server of the session
    -                             * with A, and the server that the session moved to with B (keep in mind that it is
    -                             * possible that the session already moved from B to a new server C, and maybe C=A).
    -                             * 1. If request.cxid < topPending.cxid : this means that the session requested this update
    -                             * from A, then moved to B (i.e., which is us), and now B receives the commit
    -                             * for the update after the session already performed several operations in B
    -                             * (and therefore its cxid is higher than that old request).
    -                             * 2. If request.cxid > topPending.cxid : this means that the session requested an updated
    -                             * from B with cxid that is bigger than the one we know therefore in this case we
    -                             * are A, and we lost the connection to the session. Given that we are waiting for a commit
    -                             * for that update, it means that we already sent the request to the leader and it will
    -                             * be committed at some point (in this case the order of cxid won't follow zxid, since zxid
    -                             * is an increasing order). It is not safe for us to delete the session's queue at this
    -                             * point, since it is possible that the session has newer requests in it after it moved
    -                             * back to us. We just leave the queue as it is, and once the commit arrives (for the old
    -                             * request), the finalRequestProcessor will see a closed cnxn handle, and just won't send a
    -                             * response.
    -                             * Also note that we don't have a local session, therefore we treat the request
    -                             * like any other commit for a remote request, i.e., we perform the update without sending
    -                             * a response.
    -                             */
    -                            LOG.warn("Got request " + request +
    -                                    " but we are expecting request " + topPending);
    -                            sessionQueue.addFirst(topPending);
    -                        } else {
    -                            /*
    -                             * Generally, we want to send to the next processor our version of the request,
    -                             * since it contains the session information that is needed for post update processing.
    -                             * In more details, when a request is in the local queue, there is (or could be) a client
    -                             * attached to this server waiting for a response, and there is other bookkeeping of
    -                             * requests that are outstanding and have originated from this server
    -                             * (e.g., for setting the max outstanding requests) - we need to update this info when an
    -                             * outstanding request completes. Note that in the other case (above), the operation
    -                             * originated from a different server and there is no local bookkeeping or a local client
    -                             * session that needs to be notified.
    -                             */
    -                            topPending.setHdr(request.getHdr());
    -                            topPending.setTxn(request.getTxn());
    -                            topPending.zxid = request.zxid;
    -                            request = topPending;
    -                        }
    -                    }
    -
    -                    sendToNextProcessor(request);
    -
    -                    waitForEmptyPool();
    -
    -                    /*
    -                     * Process following reads if any, remove session queue if
    -                     * empty.
    -                     */
    -                    if (sessionQueue != null) {
    -                        while (!stopped && !sessionQueue.isEmpty()
    -                                && !needCommit(sessionQueue.peek())) {
    -                            sendToNextProcessor(sessionQueue.poll());
    -                        }
    -                        // Remove empty queues
    -                        if (sessionQueue.isEmpty()) {
    -                            pendingRequests.remove(request.sessionId);
    +                    // We only need to perform synchronization if we are on the last request in the queue
    +                    if (committedRequests.size() == 1) {
    +                        synchronized (committedRequests) {
    --- End diff --
    
    Ok



---

[GitHub] zookeeper issue #300: ZOOKEEPER-2807: Flaky test: org.apache.zookeeper.test....

Posted by anmolnar <gi...@git.apache.org>.

Github user anmolnar commented on the issue:

    https://github.com/apache/zookeeper/pull/300
  
    @lavacat Issue happened here too in the latest build.


---

[GitHub] zookeeper issue #300: ZOOKEEPER-2807: Flaky test: org.apache.zookeeper.test....

Posted by anmolnar <gi...@git.apache.org>.

Github user anmolnar commented on the issue:

    https://github.com/apache/zookeeper/pull/300
  
    @lavacat @lvfangmin
    This is a potential flaky test fix you might want to take a quick look.


---

[GitHub] zookeeper pull request #300: ZOOKEEPER-2807: Flaky test: org.apache.zookeepe...

Posted by anmolnar <gi...@git.apache.org>.

Github user anmolnar commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/300#discussion_r154851878
  
    --- Diff: src/java/main/org/apache/zookeeper/server/quorum/CommitProcessor.java ---
    @@ -240,84 +240,14 @@ public void run() {
                         }
     
                         // Process committed head
    -                    if ((request = committedRequests.poll()) == null) {
    -                        throw new IOException("Error: committed head is null");
    -                    }
    -
    -                    /*
    -                     * Check if request is pending, if so, update it with the committed info
    -                     */
    -                    LinkedList<Request> sessionQueue = pendingRequests
    -                            .get(request.sessionId);
    -                    if (sessionQueue != null) {
    -                        // If session queue != null, then it is also not empty.
    -                        Request topPending = sessionQueue.poll();
    -                        if (request.cxid != topPending.cxid) {
    -                            /*
    -                             * TL;DR - we should not encounter this scenario often under normal load.
    -                             * We pass the commit to the next processor and put the pending back with a warning.
    -                             *
    -                             * Generally, we can get commit requests that are not at the queue head after
    -                             * a session moved (see ZOOKEEPER-2684). Let's denote the previous server of the session
    -                             * with A, and the server that the session moved to with B (keep in mind that it is
    -                             * possible that the session already moved from B to a new server C, and maybe C=A).
    -                             * 1. If request.cxid < topPending.cxid : this means that the session requested this update
    -                             * from A, then moved to B (i.e., which is us), and now B receives the commit
    -                             * for the update after the session already performed several operations in B
    -                             * (and therefore its cxid is higher than that old request).
    -                             * 2. If request.cxid > topPending.cxid : this means that the session requested an updated
    -                             * from B with cxid that is bigger than the one we know therefore in this case we
    -                             * are A, and we lost the connection to the session. Given that we are waiting for a commit
    -                             * for that update, it means that we already sent the request to the leader and it will
    -                             * be committed at some point (in this case the order of cxid won't follow zxid, since zxid
    -                             * is an increasing order). It is not safe for us to delete the session's queue at this
    -                             * point, since it is possible that the session has newer requests in it after it moved
    -                             * back to us. We just leave the queue as it is, and once the commit arrives (for the old
    -                             * request), the finalRequestProcessor will see a closed cnxn handle, and just won't send a
    -                             * response.
    -                             * Also note that we don't have a local session, therefore we treat the request
    -                             * like any other commit for a remote request, i.e., we perform the update without sending
    -                             * a response.
    -                             */
    -                            LOG.warn("Got request " + request +
    -                                    " but we are expecting request " + topPending);
    -                            sessionQueue.addFirst(topPending);
    -                        } else {
    -                            /*
    -                             * Generally, we want to send to the next processor our version of the request,
    -                             * since it contains the session information that is needed for post update processing.
    -                             * In more details, when a request is in the local queue, there is (or could be) a client
    -                             * attached to this server waiting for a response, and there is other bookkeeping of
    -                             * requests that are outstanding and have originated from this server
    -                             * (e.g., for setting the max outstanding requests) - we need to update this info when an
    -                             * outstanding request completes. Note that in the other case (above), the operation
    -                             * originated from a different server and there is no local bookkeeping or a local client
    -                             * session that needs to be notified.
    -                             */
    -                            topPending.setHdr(request.getHdr());
    -                            topPending.setTxn(request.getTxn());
    -                            topPending.zxid = request.zxid;
    -                            request = topPending;
    -                        }
    -                    }
    -
    -                    sendToNextProcessor(request);
    -
    -                    waitForEmptyPool();
    -
    -                    /*
    -                     * Process following reads if any, remove session queue if
    -                     * empty.
    -                     */
    -                    if (sessionQueue != null) {
    -                        while (!stopped && !sessionQueue.isEmpty()
    -                                && !needCommit(sessionQueue.peek())) {
    -                            sendToNextProcessor(sessionQueue.poll());
    -                        }
    -                        // Remove empty queues
    -                        if (sessionQueue.isEmpty()) {
    -                            pendingRequests.remove(request.sessionId);
    +                    // We only need to perform synchronization if we are on the last request in the queue
    +                    if (committedRequests.size() == 1) {
    +                        synchronized (committedRequests) {
    --- End diff --
    
    Does synchronized - notifyAll() have any performance hit in normal operation of the follower?


---

[GitHub] zookeeper issue #300: ZOOKEEPER-2807: Flaky test: org.apache.zookeeper.test....

Posted by anmolnar <gi...@git.apache.org>.

Github user anmolnar commented on the issue:

https://github.com/apache/zookeeper/pull/300

@afine Generally speaking, I like the idea of using LinkedBlockingQueue's intrinsic lock to wait for becoming empty, but in this particular case I think it's possible that committedRequests will never be empty if the leader is constantly sending commit requests.

Correct me if I'm wrong please (there's a very good chance that I completely misunderstand something), but my feeling is that the following situation is possible:
1. Learner starts syncing with leader in syncWithLeader() method,
2. Learner blocks and wait for all commits to be processed before finishing the sync,
3. FollowerZookeeperServer is already running and keep receiving commits from the Leader including non-syncing ones,
4. Learner will never be notified or only at some point in the future much more later then sync complete or way before that.

To address this, if we could get the number of commits that we must wait before proceeding, we would be able to implement a CountDownLatch in CommitProcessor and wait for the number of commits which are expected in the sync process. However that does not guarantee that we received all sync-related commits either.

Otherwise I could also agree with @shralex in the Jira saying: "Intuitively this may not be the right place for such a fix - this probably should be higher level - **making sure that follower does not even accept local ops before properly completing the sync.** Even if you drain the committedRequests, I'm not sure that guarantees that there are no more that will arrive."

That would be the best solution here in my opinion.

---

[GitHub] zookeeper issue #300: ZOOKEEPER-2807: Flaky test: org.apache.zookeeper.test....

Posted by afine <gi...@git.apache.org>.

Github user afine commented on the issue:

    https://github.com/apache/zookeeper/pull/300
  
    Thanks for taking a look @anmolnar!
    
    I'm confused by what you mean in item 3. Unless I am missing something, I don't understand how `FollowerZooKeeperServer` is "already running and keep receiving commits from the Leader including non-syncing ones" before `syncWithLeader` is completed.



---

[GitHub] zookeeper issue #300: ZOOKEEPER-2807: Flaky test: org.apache.zookeeper.test....

Posted by anmolnar <gi...@git.apache.org>.

Github user anmolnar commented on the issue:

    https://github.com/apache/zookeeper/pull/300
  
    Rebased on master.


---

[GitHub] zookeeper pull request #300: ZOOKEEPER-2807: Flaky test: org.apache.zookeepe...

Posted by anmolnar <gi...@git.apache.org>.

Github user anmolnar commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/300#discussion_r154367742
  
    --- Diff: src/java/main/org/apache/zookeeper/server/quorum/CommitProcessor.java ---
    @@ -240,84 +240,14 @@ public void run() {
                         }
     
                         // Process committed head
    -                    if ((request = committedRequests.poll()) == null) {
    -                        throw new IOException("Error: committed head is null");
    -                    }
    -
    -                    /*
    -                     * Check if request is pending, if so, update it with the committed info
    -                     */
    -                    LinkedList<Request> sessionQueue = pendingRequests
    -                            .get(request.sessionId);
    -                    if (sessionQueue != null) {
    -                        // If session queue != null, then it is also not empty.
    -                        Request topPending = sessionQueue.poll();
    -                        if (request.cxid != topPending.cxid) {
    -                            /*
    -                             * TL;DR - we should not encounter this scenario often under normal load.
    -                             * We pass the commit to the next processor and put the pending back with a warning.
    -                             *
    -                             * Generally, we can get commit requests that are not at the queue head after
    -                             * a session moved (see ZOOKEEPER-2684). Let's denote the previous server of the session
    -                             * with A, and the server that the session moved to with B (keep in mind that it is
    -                             * possible that the session already moved from B to a new server C, and maybe C=A).
    -                             * 1. If request.cxid < topPending.cxid : this means that the session requested this update
    -                             * from A, then moved to B (i.e., which is us), and now B receives the commit
    -                             * for the update after the session already performed several operations in B
    -                             * (and therefore its cxid is higher than that old request).
    -                             * 2. If request.cxid > topPending.cxid : this means that the session requested an updated
    -                             * from B with cxid that is bigger than the one we know therefore in this case we
    -                             * are A, and we lost the connection to the session. Given that we are waiting for a commit
    -                             * for that update, it means that we already sent the request to the leader and it will
    -                             * be committed at some point (in this case the order of cxid won't follow zxid, since zxid
    -                             * is an increasing order). It is not safe for us to delete the session's queue at this
    -                             * point, since it is possible that the session has newer requests in it after it moved
    -                             * back to us. We just leave the queue as it is, and once the commit arrives (for the old
    -                             * request), the finalRequestProcessor will see a closed cnxn handle, and just won't send a
    -                             * response.
    -                             * Also note that we don't have a local session, therefore we treat the request
    -                             * like any other commit for a remote request, i.e., we perform the update without sending
    -                             * a response.
    -                             */
    -                            LOG.warn("Got request " + request +
    -                                    " but we are expecting request " + topPending);
    -                            sessionQueue.addFirst(topPending);
    -                        } else {
    -                            /*
    -                             * Generally, we want to send to the next processor our version of the request,
    -                             * since it contains the session information that is needed for post update processing.
    -                             * In more details, when a request is in the local queue, there is (or could be) a client
    -                             * attached to this server waiting for a response, and there is other bookkeeping of
    -                             * requests that are outstanding and have originated from this server
    -                             * (e.g., for setting the max outstanding requests) - we need to update this info when an
    -                             * outstanding request completes. Note that in the other case (above), the operation
    -                             * originated from a different server and there is no local bookkeeping or a local client
    -                             * session that needs to be notified.
    -                             */
    -                            topPending.setHdr(request.getHdr());
    -                            topPending.setTxn(request.getTxn());
    -                            topPending.zxid = request.zxid;
    -                            request = topPending;
    -                        }
    -                    }
    -
    -                    sendToNextProcessor(request);
    -
    -                    waitForEmptyPool();
    -
    -                    /*
    -                     * Process following reads if any, remove session queue if
    -                     * empty.
    -                     */
    -                    if (sessionQueue != null) {
    -                        while (!stopped && !sessionQueue.isEmpty()
    -                                && !needCommit(sessionQueue.peek())) {
    -                            sendToNextProcessor(sessionQueue.poll());
    -                        }
    -                        // Remove empty queues
    -                        if (sessionQueue.isEmpty()) {
    -                            pendingRequests.remove(request.sessionId);
    +                    // We only need to perform synchronization if we are on the last request in the queue
    +                    if (committedRequests.size() == 1) {
    +                        synchronized (committedRequests) {
    --- End diff --
    
    Unfortunately I don't get why do you need to synchronise here. Would you please elaborate a little bit?


---

[GitHub] zookeeper issue #300: ZOOKEEPER-2807: Flaky test: org.apache.zookeeper.test....

Posted by afine <gi...@git.apache.org>.

Github user afine commented on the issue:

    https://github.com/apache/zookeeper/pull/300
  
    @anmolnar With respect to the code path above, shouldn't that be blocked on `syncWithLeader`?
    
    > Even if you drain the committedRequests, I'm not sure that guarantees that there are no more that will arrive.
    
    I'm not sure I understand how we don't have this guarantee. My understanding is that `syncWithLeader` loops until an `UPTODATE` message is received by the follower. Incoming packets from the leader are read by:
    ```java
                    syncWithLeader(newEpochZxid);                
                    QuorumPacket qp = new QuorumPacket();
                    while (this.isRunning()) {
                        readPacket(qp);
                        processPacket(qp);
                    }
    ```
    
    In addition, my understanding is that requests are only added to `CommitProcessor`'s `committedRequests` in `processPacket`. What am I missing?


---

[GitHub] zookeeper issue #300: ZOOKEEPER-2807: Flaky test: org.apache.zookeeper.test....

Posted by anmolnar <gi...@git.apache.org>.

Github user anmolnar commented on the issue:

    https://github.com/apache/zookeeper/pull/300
  
    @lavacat It's pretty much clear to me that there's a deadlock in the follower probably at the change where it's waiting for commits to be processed:
    ```
    zk.waitForCommits();
    ```
    Follower is unable to process the response for REVALIDATE from the leader, because it's blocked at this wait. Most likely.
    Trying to find the root cause in #677 , but cannot repro the failure with trace logging enabled. 


---

[GitHub] zookeeper issue #300: ZOOKEEPER-2807: Flaky test: org.apache.zookeeper.test....

Posted by lavacat <gi...@git.apache.org>.

Github user lavacat commented on the issue:

    https://github.com/apache/zookeeper/pull/300
  
    @anmolnar applied the patch to latest master and run tests 10 times with 8 threads. Original error in testNodeDataChanged is gone, but it failed 4 times with
    2018-10-23 09:37:31,566 [myid:] - INFO  [main:JUnit4ZKTestRunner$LoggedInvokeMethod@98] - TEST METHOD FAILED testNodeDataChanged
    java.util.concurrent.TimeoutException: Failed to connect to ZooKeeper server.
            at org.apache.zookeeper.test.ClientBase$CountdownWatcher.waitForConnected(ClientBase.java:151)
            at org.apache.zookeeper.test.WatchEventWhenAutoResetTest.testNodeDataChanged(WatchEventWhenAutoResetTest.java:116)
    
    I'll investigate more


---

[GitHub] zookeeper pull request #300: ZOOKEEPER-2807: Flaky test: org.apache.zookeepe...

Posted by afine <gi...@git.apache.org>.

Github user afine commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/300#discussion_r155102382
  
    --- Diff: src/java/main/org/apache/zookeeper/server/quorum/CommitProcessor.java ---
    @@ -240,84 +240,14 @@ public void run() {
                         }
     
                         // Process committed head
    -                    if ((request = committedRequests.poll()) == null) {
    -                        throw new IOException("Error: committed head is null");
    -                    }
    -
    -                    /*
    -                     * Check if request is pending, if so, update it with the committed info
    -                     */
    -                    LinkedList<Request> sessionQueue = pendingRequests
    -                            .get(request.sessionId);
    -                    if (sessionQueue != null) {
    -                        // If session queue != null, then it is also not empty.
    -                        Request topPending = sessionQueue.poll();
    -                        if (request.cxid != topPending.cxid) {
    -                            /*
    -                             * TL;DR - we should not encounter this scenario often under normal load.
    -                             * We pass the commit to the next processor and put the pending back with a warning.
    -                             *
    -                             * Generally, we can get commit requests that are not at the queue head after
    -                             * a session moved (see ZOOKEEPER-2684). Let's denote the previous server of the session
    -                             * with A, and the server that the session moved to with B (keep in mind that it is
    -                             * possible that the session already moved from B to a new server C, and maybe C=A).
    -                             * 1. If request.cxid < topPending.cxid : this means that the session requested this update
    -                             * from A, then moved to B (i.e., which is us), and now B receives the commit
    -                             * for the update after the session already performed several operations in B
    -                             * (and therefore its cxid is higher than that old request).
    -                             * 2. If request.cxid > topPending.cxid : this means that the session requested an updated
    -                             * from B with cxid that is bigger than the one we know therefore in this case we
    -                             * are A, and we lost the connection to the session. Given that we are waiting for a commit
    -                             * for that update, it means that we already sent the request to the leader and it will
    -                             * be committed at some point (in this case the order of cxid won't follow zxid, since zxid
    -                             * is an increasing order). It is not safe for us to delete the session's queue at this
    -                             * point, since it is possible that the session has newer requests in it after it moved
    -                             * back to us. We just leave the queue as it is, and once the commit arrives (for the old
    -                             * request), the finalRequestProcessor will see a closed cnxn handle, and just won't send a
    -                             * response.
    -                             * Also note that we don't have a local session, therefore we treat the request
    -                             * like any other commit for a remote request, i.e., we perform the update without sending
    -                             * a response.
    -                             */
    -                            LOG.warn("Got request " + request +
    -                                    " but we are expecting request " + topPending);
    -                            sessionQueue.addFirst(topPending);
    -                        } else {
    -                            /*
    -                             * Generally, we want to send to the next processor our version of the request,
    -                             * since it contains the session information that is needed for post update processing.
    -                             * In more details, when a request is in the local queue, there is (or could be) a client
    -                             * attached to this server waiting for a response, and there is other bookkeeping of
    -                             * requests that are outstanding and have originated from this server
    -                             * (e.g., for setting the max outstanding requests) - we need to update this info when an
    -                             * outstanding request completes. Note that in the other case (above), the operation
    -                             * originated from a different server and there is no local bookkeeping or a local client
    -                             * session that needs to be notified.
    -                             */
    -                            topPending.setHdr(request.getHdr());
    -                            topPending.setTxn(request.getTxn());
    -                            topPending.zxid = request.zxid;
    -                            request = topPending;
    -                        }
    -                    }
    -
    -                    sendToNextProcessor(request);
    -
    -                    waitForEmptyPool();
    -
    -                    /*
    -                     * Process following reads if any, remove session queue if
    -                     * empty.
    -                     */
    -                    if (sessionQueue != null) {
    -                        while (!stopped && !sessionQueue.isEmpty()
    -                                && !needCommit(sessionQueue.peek())) {
    -                            sendToNextProcessor(sessionQueue.poll());
    -                        }
    -                        // Remove empty queues
    -                        if (sessionQueue.isEmpty()) {
    -                            pendingRequests.remove(request.sessionId);
    +                    // We only need to perform synchronization if we are on the last request in the queue
    +                    if (committedRequests.size() == 1) {
    +                        synchronized (committedRequests) {
    --- End diff --
    
    I believe I was able to mitigate any potential performance impact by only entering a synchronized block when there is only a single committed request in the queue.


---

[GitHub] zookeeper issue #300: ZOOKEEPER-2807: Flaky test: org.apache.zookeeper.test....

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit commented on the issue:

    https://github.com/apache/zookeeper/pull/300
  
    
    Refer to this link for build results (access rights to CI server needed): 
    https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/2490/



---

[GitHub] zookeeper issue #300: ZOOKEEPER-2807: Flaky test: org.apache.zookeeper.test....

Posted by anmolnar <gi...@git.apache.org>.

Github user anmolnar commented on the issue:

    https://github.com/apache/zookeeper/pull/300
  
    Looks like follower that the client is trying to reconnect at line 116 is stuck with revalidating the session:
    
    ```
    2018-10-24 12:20:31,215 [myid:] - INFO  [NIOWorkerThread-1:Learner@118] - Revalidating client: 0x100efaa0cd30000
    ```
    
    It doesn't get a response from the leader and client will eventually timeouts. Looks like it's a big concern regarding this patch.


---