You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@zookeeper.apache.org by bothejjms <gi...@git.apache.org> on 2018/07/06 14:52:18 UTC

[GitHub] zookeeper pull request #563: Fix for ZOOKEEPER-3072

GitHub user bothejjms opened a pull request:

    https://github.com/apache/zookeeper/pull/563

    Fix for ZOOKEEPER-3072

    Making the throttle check before passing over the request to the next thread will prevent the possibility of throttling code running after unthrottle
    
    Added an additional async hammer thread which is pretty reliably reproduces the race condition. The globalOutstandingLimit is decreased so throttling code is executed.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/bothejjms/zookeeper ZOOKEEPER-3072

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/zookeeper/pull/563.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #563
    
----
commit 0f19de8f8b95992d4e95e3ffa46208c1b12b8cb0
Author: Botond Hejj <bo...@...>
Date:   2018-07-06T14:43:18Z

    Fix for ZOOKEEPER-3072
    Making the throttle check before passing over the request to the next thread will prevent the possibility of throttling code running after unthrottle

----


---

[GitHub] zookeeper pull request #563: Fix for ZOOKEEPER-3072

Posted by nkalmar <gi...@git.apache.org>.
Github user nkalmar commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/563#discussion_r201258688
  
    --- Diff: src/java/test/org/apache/zookeeper/test/ThrottleRaceTest.java ---
    @@ -0,0 +1,156 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.zookeeper.test;
    +
    +import static org.apache.zookeeper.test.ClientBase.CONNECTION_TIMEOUT;
    +import static org.apache.zookeeper.test.ClientBase.verifyThreadTerminated;
    +
    +import org.apache.zookeeper.AsyncCallback.StatCallback;
    +import org.apache.zookeeper.KeeperException;
    +import org.apache.zookeeper.ZKTestCase;
    +import org.apache.zookeeper.TestableZooKeeper;
    +import org.apache.zookeeper.WatchedEvent;
    +import org.apache.zookeeper.data.Stat;
    +import org.apache.zookeeper.test.ClientBase.CountdownWatcher;
    +import org.junit.After;
    +import org.junit.Assert;
    +import org.junit.Before;
    +import org.junit.Test;
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +
    +public class ThrottleRaceTest extends ZKTestCase {
    +    private static final Logger LOG = LoggerFactory.getLogger(ThrottleRaceTest.class);
    +
    +    private QuorumBase qb = new QuorumBase();
    +
    +    private volatile boolean bang;
    +
    +    public void setUp() throws Exception {
    +        qb.setUp();
    +    }
    +
    +    public void tearDown() throws Exception {
    +        LOG.info("Test clients shutting down");
    +        qb.tearDown();
    +    }
    +
    +    /**
    +     * Send exists /zookeeper requests asynchronously, max 30 outstanding
    +     */
    +    class HammerThreadExists extends Thread implements StatCallback {
    +        private static final int MAX_OUTSTANDING = 30;
    +
    +        private TestableZooKeeper zk;
    +        private int outstanding;
    +
    +        private volatile boolean failed = false;
    +
    +        public HammerThreadExists(String name) {
    +            super(name);
    +        }
    +
    +        public void run() {
    +            try {
    +                CountdownWatcher watcher = new CountdownWatcher();
    +                zk = new TestableZooKeeper(qb.hostPort, CONNECTION_TIMEOUT,
    +                        watcher);
    +                watcher.waitForConnected(CONNECTION_TIMEOUT);
    +                while(bang) {
    +                    incOutstanding(); // before create otw race
    +                    zk.exists("/zookeeper", false, this, null);
    +                }
    +            } catch (InterruptedException e) {
    +                if (bang) {
    +                    LOG.error("sanity check Assert.failed!!!"); // sanity check
    +                    return;
    +                }
    +            } catch (Exception e) {
    +                LOG.error("Client create operation Assert.failed", e);
    +                return;
    +            } finally {
    +                if (zk != null) {
    +                    try {
    +                        if (!zk.close(CONNECTION_TIMEOUT)) {
    +                            failed = true;
    +                            LOG.error("Client did not shutdown");
    +                        }
    +                    } catch (InterruptedException e) {
    +                        LOG.info("Interrupted", e);
    +                    }
    +                }
    +            }
    +        }
    +
    +        private synchronized void incOutstanding() throws InterruptedException {
    +            outstanding++;
    +            while(outstanding > MAX_OUTSTANDING) {
    +                wait();
    +            }
    +        }
    +
    +        private synchronized void decOutstanding() {
    +            outstanding--;
    +            Assert.assertTrue("outstanding >= 0", outstanding >= 0);
    +            notifyAll();
    +        }
    +
    +        public void process(WatchedEvent event) {
    +            // ignore for purposes of this test
    +        }
    +
    +        public void processResult(int rc, String path, Object ctx, Stat stat) {
    +            if (rc != KeeperException.Code.OK.intValue()) {
    +                if (bang) {
    +                    failed = true;
    +                    LOG.error("Exists Assert.failed for 0x"
    +                            + Long.toHexString(zk.getSessionId())
    +                            + "with rc:" + rc + " path:" + path);
    +                }
    +                decOutstanding();
    +                return;
    +            }
    +            decOutstanding();
    +        }
    +    }
    +
    +    @Test
    +    public void testExistsHammer() throws Exception {
    +        System.setProperty("zookeeper.globalOutstandingLimit", "1");
    +        setUp();
    +        bang = true;
    +        LOG.info("Starting hammers");
    +        HammerThreadExists[] hammers = new HammerThreadExists[100];
    +        for (int i = 0; i < hammers.length; i++) {
    +            hammers[i] = new HammerThreadExists("HammerThread-" + i);
    +            hammers[i].start();
    +        }
    +        LOG.info("Started hammers");
    +        Thread.sleep(30000); // allow the clients to run for max 5sec
    --- End diff --
    
    nit: This is 3 seconds not five as in the comment


---

[GitHub] zookeeper issue #563: ZOOKEEPER-3072: Throttle race condition fix

Posted by breed <gi...@git.apache.org>.
Github user breed commented on the issue:

    https://github.com/apache/zookeeper/pull/563
  
    thank you @bothejjms !


---

[GitHub] zookeeper issue #563: Fix for ZOOKEEPER-3072

Posted by bothejjms <gi...@git.apache.org>.
Github user bothejjms commented on the issue:

    https://github.com/apache/zookeeper/pull/563
  
    On "pretty reliably" I mean the test has failed for me like 90% of the time with the original code but the result can differ on different machines since it is a race condition. 
    Reproducing race condition in a test is not simple. I am open to suggestions how to do it reliably. Do you recall any other tests for race conditions in the test suite? 
    
    After the fix the test passed on my machine always. I am not sure yet why it fails on jenkins.
    
    For me the test takes 40 sec on my VM which is not particularly strong. I am also not satisfied with this test. I just wanted to prove that the race condition is there. Instead of the test I could add a description on how to reproduce and skip permanent testing for it.



---

[GitHub] zookeeper issue #563: Fix for ZOOKEEPER-3072

Posted by anmolnar <gi...@git.apache.org>.
Github user anmolnar commented on the issue:

    https://github.com/apache/zookeeper/pull/563
  
    @bothejjms What do you mean by "pretty reliably" exactly? 
    I see that the test has failed on Jenkins and also I've tested it and it basically killed my machine for 45 mins which seems to me a little bit overkill. 
    - Could we run the 100 threads in a ThreadPoolExecutor for example and see how it behaves? 
    - Do we need that many threads while setting `globalOutstandingLimit` to 1?
    - Also, I don't want to stick with covering this issue with test, if we could only do it pretty reliably and by introducing a new flaky test. :(


---

[GitHub] zookeeper issue #563: Fix for ZOOKEEPER-3072

Posted by anmolnar <gi...@git.apache.org>.
Github user anmolnar commented on the issue:

    https://github.com/apache/zookeeper/pull/563
  
    @bothejjms Sorry, there's a typo in my previous comment: it was 45 seconds on my machine and it literally killed the entire machine which I think isn't acceptable on Jenkins slaves.
    
    I'd give it a try with ThreadPoolExecutor at the first place and dig into why it's not 100% reliable currently. If there's no success with a few days work, just skip adding test here.


---

[GitHub] zookeeper pull request #563: ZOOKEEPER-3072: Throttle race condition fix

Posted by bothejjms <gi...@git.apache.org>.
Github user bothejjms commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/563#discussion_r205778339
  
    --- Diff: src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java ---
    @@ -1128,9 +1128,9 @@ public void processPacket(ServerCnxn cnxn, ByteBuffer incomingBuffer) throws IOE
                     Record rsp = processSasl(incomingBuffer,cnxn);
                     ReplyHeader rh = new ReplyHeader(h.getXid(), 0, KeeperException.Code.OK.intValue());
                     cnxn.sendResponse(rh,rsp, "response"); // not sure about 3rd arg..what is it?
    -                return;
    --- End diff --
    
    I have refactored like that.
    Returns are actually unnecessary but I have consistently added them now.


---

[GitHub] zookeeper pull request #563: Fix for ZOOKEEPER-3072

Posted by bothejjms <gi...@git.apache.org>.
Github user bothejjms commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/563#discussion_r200961680
  
    --- Diff: src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java ---
    @@ -1124,6 +1124,7 @@ public void processPacket(ServerCnxn cnxn, ByteBuffer incomingBuffer) throws IOE
                 }
                 return;
             } else {
    +            cnxn.incrOutstandingRequests(h);
    --- End diff --
    
    hmm right. That return was not there in 3.5.3 where I have spotted the issue. I have missed it when I have moved my change to master.
    I see ZOOKEEPER-2785 introduced it. I will update my pr and move incr to the else branch to avoid sasl throttling.


---

[GitHub] zookeeper pull request #563: ZOOKEEPER-3072: Throttle race condition fix

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/zookeeper/pull/563


---

[GitHub] zookeeper pull request #563: Fix for ZOOKEEPER-3072

Posted by anmolnar <gi...@git.apache.org>.
Github user anmolnar commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/563#discussion_r200956044
  
    --- Diff: src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java ---
    @@ -1124,6 +1124,7 @@ public void processPacket(ServerCnxn cnxn, ByteBuffer incomingBuffer) throws IOE
                 }
                 return;
             } else {
    +            cnxn.incrOutstandingRequests(h);
    --- End diff --
    
    I have 2 observations here which probably don't make a big difference but might worse to consider.
    - First, the return statements in the if branches are not required anymore, because there's no more statement at the end of the method anymore,
    - Second, moving `cnxn.incrOutstandingRequests(h)` here means that from now on you'll trigger throttling for `sasl` requests too, which was not the case previously. Same for `auth` packets which I believe was done intentionally.


---

[GitHub] zookeeper issue #563: Fix for ZOOKEEPER-3072

Posted by bothejjms <gi...@git.apache.org>.
Github user bothejjms commented on the issue:

    https://github.com/apache/zookeeper/pull/563
  
    I have tweaked the test to use significantly less threads and be faster. Unfortunately it still fails on jenkins. :(
    
    I am not sure how ThreadPoolExecutor would help with this. It will spin up the same amount of threads in background, isn't it?


---

[GitHub] zookeeper issue #563: Fix for ZOOKEEPER-3072

Posted by anmolnar <gi...@git.apache.org>.
Github user anmolnar commented on the issue:

    https://github.com/apache/zookeeper/pull/563
  
    @bothejjms It would spin up only a limited amount of threads, but that wouldn't help either as you said. You literally want 100 clients simultaneously sending request until test stops. `AsyncHammerTest` does pretty much the same, looks like you copied most of the logic implemented in there.
    
    I don't how to do this properly. I suspect `AsyncHammerTest` is also flaky which is another reason why not create another similar test.


---

[GitHub] zookeeper issue #563: ZOOKEEPER-3072: Throttle race condition fix

Posted by bothejjms <gi...@git.apache.org>.
Github user bothejjms commented on the issue:

    https://github.com/apache/zookeeper/pull/563
  
    I have removed the test for now as I don't have a good way to test this race condition. I can be reproduced easily by starting a server where the globalOutstandingLimit is 1 and sending a lot exists requests. There is a good chance that one session will stuck in a throttled state despite it has no active requests.


---

[GitHub] zookeeper issue #563: ZOOKEEPER-3072: Throttle race condition fix

Posted by anmolnar <gi...@git.apache.org>.
Github user anmolnar commented on the issue:

    https://github.com/apache/zookeeper/pull/563
  
    Thanks @bothejjms . I think the patch can be accepted now without the test.
    We need at least one more committer to approve. @hanm @phunt  ?


---

[GitHub] zookeeper issue #563: ZOOKEEPER-3072: Throttle race condition fix

Posted by bothejjms <gi...@git.apache.org>.
Github user bothejjms commented on the issue:

    https://github.com/apache/zookeeper/pull/563
  
    I have refactored the branches as suggested.


---

[GitHub] zookeeper pull request #563: ZOOKEEPER-3072: Throttle race condition fix

Posted by breed <gi...@git.apache.org>.
Github user breed commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/563#discussion_r205668951
  
    --- Diff: src/java/main/org/apache/zookeeper/server/ZooKeeperServer.java ---
    @@ -1128,9 +1128,9 @@ public void processPacket(ServerCnxn cnxn, ByteBuffer incomingBuffer) throws IOE
                     Record rsp = processSasl(incomingBuffer,cnxn);
                     ReplyHeader rh = new ReplyHeader(h.getXid(), 0, KeeperException.Code.OK.intValue());
                     cnxn.sendResponse(rh,rsp, "response"); // not sure about 3rd arg..what is it?
    -                return;
    --- End diff --
    
    it would be nice to keep this return since it matches the handling of the other auth logic above.
    
    it would also be nice if this was an
    
    `} else if (h.getType() == OpCode.sasl) {`
    
    clause and the
    `}
        else {`
    
    was done outside of the if since all the other blocks will have returned. i think it makes the logic easier to follow.


---