Posted to issues@zookeeper.apache.org by "Enrico Olivelli (Jira)" <ji...@apache.org> on 2020/01/18 15:15:00 UTC
[jira] [Commented] (ZOOKEEPER-3698) NoRouteToHostException when starting large ZooKeeper cluster on localhost

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-3698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17018611#comment-17018611 ] 

Enrico Olivelli commented on ZOOKEEPER-3698:
--------------------------------------------

In my opinion, we could address this issue by working on the following items:
1) do not use a "parallelStream" to perform the reachability test (https://github.com/apache/zookeeper/blob/9053f7c431bb17ed79c2be129b6ba4ba18d15ab1/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/MultipleAddresses.java#L124)
2) make the timeout configurable, and maybe default to 1 second (https://github.com/apache/zookeeper/blob/9053f7c431bb17ed79c2be129b6ba4ba18d15ab1/zookeeper-server/src/main/java/org/apache/zookeeper/server/quorum/MultipleAddresses.java#L43)
3) make the reachability test configurable, in such a way that it is not done by default
4) do not perform the reachability test when there is only ONE address (this makes the behaviour similar to 3.5 when there is a single address per server, which is the most common configuration)

I would go for 3 and/or 4: this new reachability test adds a new burden to the system, and if you have only a single route to the other peer it is not needed. Our code already handles the case of connection failure.
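
A rough sketch of how items 1, 3 and 4 could fit together, with the timeout of item 2 passed in as a parameter. This is not the actual MultipleAddresses code: the class name, method names and the system property name are hypothetical, and the probe is passed in as a Predicate only so the selection logic can be exercised without touching the network.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.NoRouteToHostException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical helper for illustration; not part of ZooKeeper. */
final class ReachabilitySketch {

    /** Hypothetical property name (an assumption, not an existing ZooKeeper setting). */
    static final String TIMEOUT_PROP = "sketch.reachabilityTimeoutMs";

    /** Item 2: timeout read from configuration, defaulting to 1 second. */
    static int configuredTimeoutMs() {
        return Integer.getInteger(TIMEOUT_PROP, 1000);
    }

    /**
     * Returns the addresses worth trying to connect to.
     * checkEnabled == false corresponds to item 3 (test not done by default).
     */
    static List<InetSocketAddress> selectCandidates(
            List<InetSocketAddress> addresses,
            boolean checkEnabled,
            Predicate<InetSocketAddress> isReachable) throws NoRouteToHostException {
        // Items 3 and 4: no probe when disabled or when there is a single address;
        // a failed connection attempt is handled by the caller anyway.
        if (!checkEnabled || addresses.size() <= 1) {
            return new ArrayList<>(addresses);
        }
        // Item 1: plain sequential loop instead of a parallelStream.
        List<InetSocketAddress> reachable = new ArrayList<>();
        for (InetSocketAddress address : addresses) {
            if (isReachable.test(address)) {
                reachable.add(address);
            }
        }
        if (reachable.isEmpty()) {
            throw new NoRouteToHostException("No valid address among " + addresses);
        }
        return reachable;
    }

    /** Probe based on InetAddress.isReachable with the given timeout. */
    static Predicate<InetSocketAddress> defaultProbe(int timeoutMs) {
        return address -> {
            InetAddress ip = address.getAddress();
            if (ip == null) {
                return false; // unresolved host name
            }
            try {
                return ip.isReachable(timeoutMs);
            } catch (IOException e) {
                return false; // treat a network error as "not reachable"
            }
        };
    }
}
```

A caller would use something like selectCandidates(addresses, enabled, defaultProbe(configuredTimeoutMs())), where "enabled" comes from whatever configuration switch is chosen for item 3.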




> NoRouteToHostException when starting large ZooKeeper cluster on localhost
> -------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3698
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3698
>             Project: ZooKeeper
>          Issue Type: Bug
>            Reporter: Mate Szalay-Beko
>            Assignee: Mate Szalay-Beko
>            Priority: Major
>             Fix For: 3.6.0
>
>
> During testing of the RC for 3.6.0, we found that a ZooKeeper cluster with a large number of ensemble members (e.g. 23) cannot start properly. We see a lot of warnings in the log:
> {code:java}
> 2020-01-15 20:02:13,431 [myid:13] - WARN
>  [ListenerHandler-phunt-MBP13.local/192.168.1.91:4193:QuorumCnxManager@691]
> - None of the addresses (/192.168.1.91:4190) are reachable for sid 10
> java.net.NoRouteToHostException: No valid address among [/192.168.1.91:4190]
> {code}
>  
> The exception happens when the new MultiAddress feature tries to filter the unreachable hosts from the address list. This involves calling the InetAddress.isReachable method with a default timeout of 500ms, which goes down to a native call in Java and basically tries to ping the host (an ICMP echo request). Naturally, localhost should always be reachable. For some reason, this call times out on Mac if we have many ensemble members. I tested with 9 members and the cluster started properly. With 11, 13 and 15 members it took more and more time to get the cluster to start, and the "NoRouteToHostException" started to appear in the logs. After around 1 minute the 15-member cluster started, but obviously this is not acceptable. (I also tried with JDK 11 and found the same behaviour.)
>  
> On Linux, I haven't been able to reproduce the problem. I tried with 5, 9, 15 and 23 ensemble members and the quorum always seems to start properly in a few seconds. (I used OpenJDK 1.8.232 on Ubuntu 18.04)
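
For anyone who wants to poke at the behaviour described above outside of ZooKeeper, here is a minimal sketch that fires a number of concurrent InetAddress.isReachable calls at the loopback address and counts how many report it as unreachable. The class and method names are hypothetical, and this is only an assumption about how the symptom could be isolated; it is not a confirmed reproduction and results will depend on the OS and JDK.

```java
import java.net.InetAddress;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical stand-alone probe; not part of ZooKeeper. */
final class LoopbackProbeSketch {

    /** Returns {reachableCount, unreachableCount} for concurrent loopback probes. */
    static int[] probeConcurrently(int probes, int timeoutMs) throws InterruptedException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        ExecutorService pool = Executors.newFixedThreadPool(probes);
        try {
            List<Future<Boolean>> results = new ArrayList<>();
            for (int i = 0; i < probes; i++) {
                // isReachable may throw IOException, so this lambda is a Callable<Boolean>
                results.add(pool.submit(() -> loopback.isReachable(timeoutMs)));
            }
            int reachable = 0;
            int unreachable = 0;
            for (Future<Boolean> result : results) {
                try {
                    if (result.get()) {
                        reachable++;
                    } else {
                        unreachable++; // timed out or host not answering
                    }
                } catch (ExecutionException e) {
                    unreachable++; // network error counted as unreachable
                }
            }
            return new int[] {reachable, unreachable};
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Calling probeConcurrently(23, 500) would roughly mimic 23 members probing localhost at once with the 500ms timeout mentioned in the description.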



--
This message was sent by Atlassian Jira
(v8.3.4#803005)