Posted to dev@zookeeper.apache.org by "Flavio Junqueira (JIRA)" <ji...@apache.org> on 2016/08/03 17:03:20 UTC

[jira] [Commented] (ZOOKEEPER-2447) Zookeeper adds good delay when one of the quorum host is not reachable

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15406224#comment-15406224 ] 

Flavio Junqueira commented on ZOOKEEPER-2447:
---------------------------------------------

[~dbenediktson] [~vishk] [~hanm] [~eribeiro] thank you all for working on a patch and reviewing. I'm a bit confused about the state of this issue. The initial patch by Vishal proposed to check if the host is reachable in {{StaticHostProvider.resolveAndShuffle}}. That is supposed to avoid servers that are down at the time we execute {{resolveAndShuffle}}.

Unless I'm missing something, the latest patch does not seem to be addressing the issue described here. It doesn't reduce the delay to connect to some server. Instead, it sets a lower bound for the connection timeout.

I don't think the approach of testing whether a server is reachable is bad, because we don't have to do it often: only at the beginning and as we hit unavailable servers. We also need to be careful not to exclude unavailable servers permanently from the list of servers; they might, and probably will, come back eventually.
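
To make the last point concrete, here is a minimal sketch of one way to keep unreachable servers in the list while trying them last. It is not taken from either attached patch: the class {{ReachabilityOrdering}} and its methods are hypothetical names, and the probe is passed in as a parameter so the ordering logic can be exercised without touching the network. Note also that {{InetAddress.isReachable}} is best effort and can return false for a live host that blocks ICMP, which is another reason to demote such hosts rather than drop them.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical helper for illustration only; not part of ZooKeeper. */
final class ReachabilityOrdering {

    private ReachabilityOrdering() {
    }

    /** Probe backed by InetAddress.isReachable; any I/O error counts as unreachable. */
    static Predicate<InetSocketAddress> isReachableProbe(int timeoutMs) {
        return addr -> {
            try {
                return !addr.isUnresolved() && addr.getAddress().isReachable(timeoutMs);
            } catch (IOException e) {
                return false;
            }
        };
    }

    /**
     * Returns every server from the input, with those that pass the probe first
     * and those that fail it last. Nothing is dropped, so a server that comes
     * back later is still a candidate on the next pass through the list.
     */
    static List<InetSocketAddress> reachableFirst(List<InetSocketAddress> servers,
                                                  Predicate<InetSocketAddress> probe) {
        List<InetSocketAddress> up = new ArrayList<>();
        List<InetSocketAddress> down = new ArrayList<>();
        for (InetSocketAddress server : servers) {
            if (probe.test(server)) {
                up.add(server);
            } else {
                down.add(server);
            }
        }
        up.addAll(down);
        return up;
    }
}
```

A caller could shuffle the resolved addresses as today and then pass them through {{reachableFirst(addresses, isReachableProbe(4000))}}, where the 4000 ms timeout is only the example value from the issue description. The relative order within each group is preserved, so the shuffle still spreads clients across the reachable servers.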

> Zookeeper adds  good delay when one of the quorum host is not reachable
> -----------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2447
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2447
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.4.6, 3.5.0
>            Reporter: Vishal Khandelwal
>            Assignee: Dan Benediktson
>             Fix For: 3.5.3, 3.6.0
>
>         Attachments: ZOOKEEPER-2447-MinConnectTimeoutOnly.patch, ZOOKEEPER-2447.3.5.patch, withfix.txt, withoutFix.txt
>
>
> The StaticHostProvider.resolveAndShuffle method adds all of the valid quorum addresses to a list, shuffles them, and returns them to the client connection class. If, after shuffling, the first node happens to be one that is not reachable, ClientCnxn.SendThread.run keeps trying to connect to the failed host until a timeout and only then moves to a different node. This adds a random delay to the ZooKeeper connection when a host is down. Instead, we could check whether the host is reachable in StaticHostProvider and skip it if isReachable is false, the same as we do for UnknownHostException.
> This can be tested with the following test code by providing a valid host that is not reachable. For a quick test, comment out Collections.shuffle(tmpList, sourceOfRandomness); in StaticHostProvider.resolveAndShuffle
> {code}
>  @Test
>   public void test() throws Exception {
>     EventsWatcher watcher = new EventsWatcher();
>     QuorumUtil qu = new QuorumUtil(1);
>     qu.startAll();
>     
>     ZooKeeper zk =
>         new ZooKeeper("<hostname>:2181," + qu.getConnString(), 180 * 1000, watcher);
>     
>     watcher.waitForConnected(CONNECTION_TIMEOUT * 5);
>     Assert.assertTrue("connection Established", watcher.isConnected());
>     zk.close();    
>   }
> {code}
> The following fix can be added to StaticHostProvider.resolveAndShuffle
> {code}
>  if (taddr.isReachable(4000)) { // timeout in milliseconds; can be some other value
>      tmpList.add(new InetSocketAddress(taddr, address.getPort()));
>  }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)