You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@zookeeper.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/08/09 17:06:00 UTC
[jira] [Commented] (ZOOKEEPER-2471) Java Zookeeper Client incorrectly considers time spent sleeping as time spent connecting, potentially resulting in infinite reconnect loop

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16120259#comment-16120259 ] 

ASF GitHub Bot commented on ZOOKEEPER-2471:
-------------------------------------------

GitHub user DanBenediktson opened a pull request:

    https://github.com/apache/zookeeper/pull/330

    ZOOKEEPER-2471: ZK Java client should not count sleep time as connect time

    ClientCnxnSocket uses a member variable "now" to track the current time, but does not update it at all potentially-blocking times: in particular, it does not update it after the random sleep introduced if an initial connect attempt fails. This results in the random sleep time being counted towards connect time, resulting in incorrect application of connection timeout currently, and if ZOOKEEPER-2869 is taken, a very real possibility (we have seen it in production) of wedging the Zookeeper client so that it can never successfully reconnect, because its sleep time may grow beyond its connection timeout, especially in scenarios where there is a big gap between negotiated session timeout and client-requested session timeout.
    
    Rather than fixing the bug by adding another "updateNow()" call, keeping the brittle "updateNow()" implementation which led to the bug in the first place, I have deleted updateNow() and replaced usage of that member variable with actually getting the current system timestamp whenever the implementation needs to know the current time.
    
    Regarding unit testing, this is, IMO, too difficult to test without introducing a lot of invasive changes to ClientCnxn.java, seeing as the only effective change is that, on connection retry, the random sleep time is no longer counted towards a time budget. I can throw a lot of mocks at this, like ClientReconnectTest, but I'm still going to be stuck depending on the behavior of that randomly-generated sleep time, which is going to be inherently unreliable. If a fix is taken for ZOOKEEPER-2869, this should become much easier to test, since I will then be able to inject a different backoff sleep behavior, and since I'm planning to submit a pull request for that ticket as well, so maybe as a compromise I can submit a test for this bug fix at that time?

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/DanBenediktson/zookeeper ZOOKEEPER-2471

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/zookeeper/pull/330.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #330
    
----
commit 60f38726e7f07b4bb970cc8fb089363ff48eb3df
Author: Dan Benediktson <db...@twitter.com>
Date:   2017-08-09T16:41:42Z

    ZOOKEEPER-2471: Zookeeper Java client should not count time spent sleeping as time spent connecting
    
    Rather than keep the brittle "updateNow()" implementation which led to the bug and fixing the bug by
    adding another "updateNow()" call, I have deleted updateNow() and replaced usage of that member variable
    with actually getting the current system timestamp.
    
    This is, IMO, too difficult to test without introducing a lot of invasive changes to ClientCnxn.java,
    seeing as the only effective change is that, on connection retry, a random sleep time is no longer
    counted towards a time budget. If a fix is taken for ZOOKEEPER-2869, this should become much easier to
    test, and since I'm planning to submit a pull request for that ticket as well, maybe as a compromise
    I can submit a test for this patch at that time?

----


> Java Zookeeper Client incorrectly considers time spent sleeping as time spent connecting, potentially resulting in infinite reconnect loop
> ------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2471
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2471
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: java client
>    Affects Versions: 3.5.1
>         Environment: all
>            Reporter: Dan Benediktson
>            Assignee: Dan Benediktson
>         Attachments: ZOOKEEPER-2471.patch
>
>
> ClientCnxnSocket uses a member variable "now" to track the current time, and lastSend / lastHeard variables to track socket liveness. Implementations, and even ClientCnxn itself, are expected to call both updateNow() to reset "now" to System.currentTimeMillis, and then call updateLastSend()/updateLastHeard() on IO completions.
> This is a fragile contract, so it's not surprising that there's a bug resulting from it: ClientCnxn.SendThread.run() calls updateLastSendAndHeard() as soon as startConnect() returns, but it does not call updateNow() first. I expect when this was written, either the expectation was that startConnect() was an asynchronous operation and that updateNow() would have been called very recently, or simply the requirement to call updateNow() was forgotten at this point. As far as I can see, this bug has been present since the "updateNow" method was first introduced in the distant past. As it turns out, since startConnect() calls HostProvider.next(), which can sleep, quite a lot of time can pass, leaving a big gap between "now" and now.
> If you are using very short session timeouts (one of our ZK ensembles has many clients using a 1-second timeout), this is potentially disastrous, because the sleep time may exceed the connection timeout itself, which can potentially result in the Java client being stuck in a perpetual reconnect loop. The exact code path it goes through in this case is complicated, because there has to be a previously-closed socket still waiting in the selector (otherwise, the first timeout evaluation will not fail because "now" still hasn't been updated, and then the actual connect timeout will be applied in ClientCnxnSocket.doTransport()) so that select() will harvest the IO from the previous socket and updateNow(), resulting in the next loop through ClientCnxnSocket.SendThread.run() observing the spurious timeout and failing. In practice it does happen to us fairly frequently; we only got to the bottom of the bug yesterday. Worse, when it does happen, the Zookeeper client object is rendered unusable: it's stuck in a perpetual reconnect loop where it keeps sleeping, opening a socket, and immediately closing it.
> I have a patch. Rather than calling updateNow() right after startConnect(), my fix is to remove the "now" member variable and the updateNow() method entirely, and to instead just call System.currentTimeMillis() whenever time needs to be evaluated. I realize there is a benefit (aside from a trivial micro-optimization not worth worrying about) to having the time be "fixed", particularly for truth in the logging: if time is fixed by an updateNow() call, then the log for a timeout will still show exactly the same value the code reasoned about. However, this benefit is in my opinion not enough to merit the fragility of the contract which led to this (for us) highly impactful and difficult-to-find bug in the first place.
> I'm currently running ant tests locally against my patch on trunk, and then I'll upload it here.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)