Posted to dev@zookeeper.apache.org by DanBenediktson <gi...@git.apache.org> on 2017/08/09 17:05:07 UTC

[GitHub] zookeeper pull request #330: ZOOKEEPER-2471: ZK Java client should not count...

GitHub user DanBenediktson opened a pull request:

    https://github.com/apache/zookeeper/pull/330

    ZOOKEEPER-2471: ZK Java client should not count sleep time as connect time

    ClientCnxnSocket uses a member variable "now" to track the current time, but does not update it at every potentially blocking point. In particular, it does not update it after the random sleep introduced when an initial connect attempt fails. As a result, the random sleep time is counted towards connect time, so the connection timeout is applied incorrectly today. If ZOOKEEPER-2869 is taken, there is also a very real possibility (we have seen it in production) of wedging the ZooKeeper client so that it can never successfully reconnect, because its sleep time may grow beyond its connection timeout, especially when there is a big gap between the negotiated session timeout and the client-requested session timeout.
    
    Rather than fix the bug by adding another "updateNow()" call, which would keep the brittle "updateNow()" implementation that led to the bug in the first place, I have deleted updateNow() and replaced uses of that member variable with reads of the current system timestamp wherever the implementation needs to know the current time.
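
For illustration, here is a minimal sketch of the difference between timing a connect attempt from a cached timestamp and from a fresh clock read. It is not the patch itself: the class, the manually advanced clock, and the method names are all hypothetical, and the "sleep" only advances the fake clock so the example is deterministic.

```java
import java.util.concurrent.atomic.AtomicLong;

public class StaleNowSketch {
    // Hypothetical stand-in for a monotonic clock, advanced by hand.
    static final AtomicLong clockMillis = new AtomicLong(0);

    static long currentElapsedTime() {
        return clockMillis.get();
    }

    // Simulated blocking call: only moves the fake clock forward.
    static void sleep(long ms) {
        clockMillis.addAndGet(ms);
    }

    /** Cached "now": captured before the backoff sleep and never refreshed. */
    public static long connectTimeWithCachedNow(long backoffMs, long connectMs) {
        long now = currentElapsedTime(); // cached timestamp
        sleep(backoffMs);                // random backoff after a failed connect
        long connectStart = now;         // stale: predates the backoff sleep
        sleep(connectMs);                // the actual connect attempt
        return currentElapsedTime() - connectStart;
    }

    /** Fresh read: the clock is consulted at the point the value is needed. */
    public static long connectTimeWithFreshNow(long backoffMs, long connectMs) {
        sleep(backoffMs);
        long connectStart = currentElapsedTime(); // read after the sleep
        sleep(connectMs);
        return currentElapsedTime() - connectStart;
    }

    public static void main(String[] args) {
        long connectTimeoutMs = 1000;
        long cached = connectTimeWithCachedNow(900, 200);
        long fresh = connectTimeWithFreshNow(900, 200);
        // The cached variant charges the 900 ms backoff to the connect budget.
        System.out.println("cached: " + cached + " ms, timed out: " + (cached >= connectTimeoutMs));
        System.out.println("fresh:  " + fresh + " ms, timed out: " + (fresh >= connectTimeoutMs));
    }
}
```

With a 900 ms backoff and a 200 ms connect, the cached variant measures 1100 ms and would trip a 1000 ms connect timeout, while the fresh variant measures only the 200 ms actually spent connecting.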
    
    Regarding unit testing: this is, IMO, too difficult to test without introducing a lot of invasive changes to ClientCnxn.java, since the only effective change is that, on connection retry, the random sleep time is no longer counted towards a time budget. I can throw a lot of mocks at this, like ClientReconnectTest, but I would still be stuck depending on the behavior of that randomly generated sleep time, which is inherently unreliable. If a fix is taken for ZOOKEEPER-2869, this should become much easier to test, since I will then be able to inject a different backoff sleep behavior. I'm planning to submit a pull request for that ticket as well, so maybe as a compromise I can submit a test for this bug fix at that time?

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/DanBenediktson/zookeeper ZOOKEEPER-2471

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/zookeeper/pull/330.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #330
    
----
commit 60f38726e7f07b4bb970cc8fb089363ff48eb3df
Author: Dan Benediktson <db...@twitter.com>
Date:   2017-08-09T16:41:42Z

    ZOOKEEPER-2471: Zookeeper Java client should not count time spent sleeping as time spent connecting
    
    Rather than keep the brittle "updateNow()" implementation which led to the bug and fixing the bug by
    adding another "updateNow()" call, I have deleted updateNow() and replaced usage of that member variable
    with actually getting the current system timestamp.
    
    This is, IMO, too difficult to test without introducing a lot of invasive changes to ClientCnxn.java,
    seeing as the only effective change is that, on connection retry, a random sleep time is no longer
    counted towards a time budget. If a fix is taken for ZOOKEEPER-2869, this should become much easier to
    test, and since I'm planning to submit a pull request for that ticket as well, maybe as a compromise
    I can submit a test for this patch at that time?

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] zookeeper issue #330: ZOOKEEPER-2471: ZK Java client should not count sleep ...

Posted by nicktrav <gi...@git.apache.org>.
Github user nicktrav commented on the issue:

    https://github.com/apache/zookeeper/pull/330
  
    @DanBenediktson - I've been looking into writing a test for this patch, but I can't seem to replicate the case you speak about on the original ticket.
    
    Specifically:
    
    > The exact code path it goes through in this case is complicated, because there has to be a previously-closed socket still waiting in the selector (otherwise, the first timeout evaluation will not fail because "now" still hasn't been updated, and then the actual connect timeout will be applied in ClientCnxnSocket.doTransport()) so that select() will harvest the IO from the previous socket and updateNow(), resulting in the next loop through ClientCnxnSocket.SendThread.run() observing the spurious timeout and failing.
    
    Are you able to provide some more details on how the client can get into this state? Walking through the code, I'm having difficulty understanding how the client can end up in a reconnect loop.
    
    We are keen to see this patch land as it would make a fix for ZOOKEEPER-2869 inherently safer.



[GitHub] zookeeper issue #330: ZOOKEEPER-2471: ZK Java client should not count sleep ...

Posted by nicktrav <gi...@git.apache.org>.
Github user nicktrav commented on the issue:

    https://github.com/apache/zookeeper/pull/330
  
    Feels a little awkward to have a non-trivial change without any tests :/
    
    > I'm still going to be stuck depending on the behavior of that randomly-generated sleep time, which is going to be inherently unreliable
    
    Thinking aloud here ... given that your second commit switches to the `Time` utility, could you now inject that class (or perhaps a faked implementation, given the methods are static) into `ClientCnxnSocket` and its subclasses? This would make it more amenable to testing, since you'd have more control over the times that are returned.
    
    You'd still need to plumb that through into `ClientCnxn`, but it looks like the constructor takes a `ClientCnxnSocket`, which could work here?
    
    Maybe we can get that scaffolding wired up in another PR, get that merged in, and this change could leverage that?
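
A minimal sketch of the kind of injection suggested above, assuming the clock is passed in as a `java.util.function.LongSupplier`. The `ConnectTimer` and `FakeClock` types are hypothetical and are not part of ZooKeeper; production code could pass a reference to a monotonic time source instead of the fake.

```java
import java.util.function.LongSupplier;

public class InjectableClockSketch {

    /** Hypothetical helper: checks a connect deadline against an injected clock. */
    public static final class ConnectTimer {
        private final LongSupplier clock;
        private final long timeoutMs;
        private long startMs;

        public ConnectTimer(LongSupplier clock, long timeoutMs) {
            this.clock = clock;
            this.timeoutMs = timeoutMs;
        }

        /** Marks the beginning of a connect attempt using a fresh clock read. */
        public void start() {
            startMs = clock.getAsLong();
        }

        /** True once the time elapsed since start() reaches the timeout. */
        public boolean timedOut() {
            return clock.getAsLong() - startMs >= timeoutMs;
        }
    }

    /** Test double: a clock that only moves when the test advances it. */
    public static final class FakeClock implements LongSupplier {
        private long nowMs;

        @Override
        public long getAsLong() {
            return nowMs;
        }

        public void advance(long ms) {
            nowMs += ms;
        }
    }

    public static void main(String[] args) {
        // Production-style wiring: a monotonic source derived from System.nanoTime().
        LongSupplier monotonic = () -> System.nanoTime() / 1_000_000;
        ConnectTimer real = new ConnectTimer(monotonic, 1000);
        real.start();
        System.out.println("timed out immediately? " + real.timedOut());
    }
}
```

In a test, the fake clock can stand in for the random backoff sleep: advance it before `start()` to simulate the sleep, then advance it again to simulate time spent connecting, with no dependence on real elapsed time.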



[GitHub] zookeeper issue #330: ZOOKEEPER-2471: ZK Java client should not count sleep ...

Posted by hanm <gi...@git.apache.org>.
Github user hanm commented on the issue:

    https://github.com/apache/zookeeper/pull/330
  
    The `now` variable has been there since this project started, and `updateNow` was introduced in ZOOKEEPER-909, which was just a refactoring, so it's irrelevant to the core logic. I had a chat with @phunt, who wrote the original code, and `now` was not introduced by accident. According to Pat, it was used to solve two problems:
    * Performance: the time was cached because on some platforms getting the current time is not cheap.
    * Correctness: on some platforms the current time can go backwards, so a cached time was used from the very start to prevent that.
    Now, after 10 years and many changes (like using a monotonic clock), the original design might have become irrelevant, and it might make sense to instead compute the current time when needed. That said, I'd like to take another look at this issue before drawing a conclusion.
    
    One idea is to have a test case that shows the bug (the client stuck in an infinite connect loop, etc.) and then verify that the patch fixes it. That would be more convincing. I understand that writing such a test takes more effort than the patch itself, but eventually it's a must-have for a change like this.
    
    On a side note - this patch is doing fine with my stress test and does not cause any regressions.
    




[GitHub] zookeeper issue #330: ZOOKEEPER-2471: ZK Java client should not count sleep ...

Posted by nicktrav <gi...@git.apache.org>.
Github user nicktrav commented on the issue:

    https://github.com/apache/zookeeper/pull/330
  
    @DanBenediktson - bumping this. Any thoughts?

