Posted to dev@zookeeper.apache.org by Szalay-Bekő Máté <sz...@gmail.com> on 2020/01/17 13:17:08 UTC

RC failure root cause: ICMP throttling settings on mac

TLDR:
While testing the 3.6.0 RC, we found that a ZooKeeper cluster with a large
number of ensemble members (e.g. 23) cannot start properly. This issue
seems to happen only on mac, and a workaround is to disable the ICMP
throttling. The question is whether this workaround is enough for the RC, or
whether we should change the code in ZooKeeper to limit the number of ICMP requests.


The problem:

On Linux, I haven't been able to reproduce the problem. I tried with 5, 9,
15 and 23 ensemble members, and the quorum always seemed to start properly
within a few seconds. (I used OpenJDK 1.8.232 on Ubuntu 18.04.)

On mac, the problem happens consistently for large ensembles. The
servers are very slow to start and we see a lot of warnings in the logs like
these:

2020-01-15 20:02:13,431 [myid:13] - WARN
 [ListenerHandler-phunt-MBP13.local/192.168.1.91:4193:QuorumCnxManager@691]
- None of the addresses (/192.168.1.91:4190) are reachable for sid 10
java.net.NoRouteToHostException: No valid address among [/192.168.1.91:4190]

2020-01-17 11:02:26,177 [myid:4] - WARN
 [Thread-2531:QuorumCnxManager$SendWorker@1269] - destination address /
127.0.0.1 not reachable anymore, shutting down the SendWorker for sid 6

The exception happens when the new MultiAddress feature tries to
filter the unreachable hosts from the address list while deciding
which election address to connect to. This involves calling the
InetAddress.isReachable method with a default timeout of 500 ms, which goes
down to a native call in Java and basically performs a ping (an ICMP echo
request) to the host. Naturally, localhost should always be reachable.
This call times out on mac if we have many ensemble members. I tested
with 9 members and the cluster started properly. With 11, 13 and 15 members it
took more and more time for the cluster to start, and
"NoRouteToHostException" started to appear in the logs. After around 1
minute the 15-member cluster started, but obviously this is way
too long.
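
To illustrate where the ICMP traffic comes from, here is a minimal sketch of
the pattern (not the actual QuorumCnxManager code, just an illustration of
filtering a list of election addresses with InetAddress.isReachable):

    import java.io.IOException;
    import java.net.InetAddress;
    import java.net.InetSocketAddress;
    import java.util.List;
    import java.util.stream.Collectors;

    public class ReachabilityFilterSketch {

        // Keep only the addresses whose host answers InetAddress.isReachable()
        // within the timeout. Under the hood this typically sends an ICMP echo
        // request (or falls back to a TCP probe on port 7 when raw sockets are
        // not allowed), which is the traffic that hits the ICMP rate limit.
        static List<InetSocketAddress> filterReachable(List<InetSocketAddress> addresses,
                                                       int timeoutMs) {
            return addresses.stream()
                    .filter(addr -> {
                        try {
                            InetAddress host = addr.getAddress();
                            return host != null && host.isReachable(timeoutMs);
                        } catch (IOException e) {
                            return false; // treat a failed probe as unreachable
                        }
                    })
                    .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            List<InetSocketAddress> election = List.of(
                    new InetSocketAddress("127.0.0.1", 4190),
                    new InetSocketAddress("192.168.1.91", 4190));
            // 500 ms timeout, matching the default mentioned above
            System.out.println(filterReachable(election, 500));
        }
    }

Each server runs this kind of check against the election addresses of every
other ensemble member, so starting many servers on one machine multiplies the
number of probes.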

On mac, the ICMP rate limit is set to 250 by default. You can turn
this off using the following command: sudo sysctl -w
net.inet.icmp.icmplim=0
(see https://krypted.com/mac-os-x/disable-icmp-rate-limiting-os-x/)

Running the above command before starting the 23-member cluster
locally seems to solve the problem for me (can someone verify?). The
question is whether this workaround is enough or not.

As far as I can tell, the current code will generate 2*A*(M-1) ICMP calls
in each ZooKeeper server during startup, where 'M' is the number of ensemble
members and 'A' is the number of election addresses provided for each
member. For example, with M=23 and A=1 each server makes 44 ICMP calls, so 23
servers on one machine issue around 1000 echo requests at startup. This is not
that high if each ZooKeeper server is started on a different machine, but if we
start a lot of ZooKeeper servers on a single machine, it can quickly go beyond
the predefined limit of 250 on mac.

OPTION 1: we keep the code as it is. We might update the zkconf documentation
to mention this mac-specific issue and how to disable the ICMP rate limit.

OPTION 2: we change the code not to filter the list of election addresses
if the list has only a single element (see the sketch after these options).
This seems to be a logical way to decrease the number of ICMP requests.
However, if we ran a large number of ZooKeeper servers on a single machine
using multiple election addresses for each server, we would get the same
problem (most probably even quicker).

OPTION 3: make the address filtering configurable and change zkconf to
disable it by default. (But disabling it would make the quorum potentially
unable to recover during network failures, so it is not recommended for
production.)

OPTION 4: refactor the MultiAddress feature and remove the ICMP calls from
the ZooKeeper code entirely. However, they clearly help with quick recovery
during network failures... at the moment I can't think of any good way to
avoid them.
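
To make OPTION 2 concrete, a rough sketch (reusing the filterReachable helper
from the sketch above; this is not the actual patch) could look like this:

    // OPTION 2 sketch: with a single election address there is nothing to
    // choose between, so skip the ICMP probing and behave like ZooKeeper 3.5.
    static List<InetSocketAddress> candidateAddresses(List<InetSocketAddress> all,
                                                      int timeoutMs) {
        if (all.size() <= 1) {
            return all; // no probe, no ICMP traffic
        }
        return filterReachable(all, timeoutMs); // multiple addresses: probe as today
    }

This keeps the single-address case (which covers everyone upgrading from 3.5
without config changes) free of ICMP traffic, while multi-address
configurations still get the filtering.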

Kind regards,
Mate

Re: RC failure root cause: ICMP throttling settings on mac

Posted by Patrick Hunt <ph...@apache.org>.
Option 1 sounds good to me. However, I'd recommend documenting it (setting
net.inet.icmp.icmplim if the error is hit) within ZK itself. Also happy to
update zkconf with the detail.

https://github.com/phunt/zkconf/commit/281ad019e1d497a94f7168aa1c74053687667225

Thanks for digging into this!

Regards,

Patrick


Re: RC failure root cause: ICMP throttling settings on mac

Posted by Ted Dunning <te...@gmail.com>.
I was not advocating an avoidance of the issue. I was suggesting that it
isn't a stop-ship issue.

Re: RC failure root cause: ICMP throttling settings on mac

Posted by Szalay-Bekő Máté <sz...@gmail.com>.
Thanks for all the comments!

I agree that the 23-node ZK cluster is not a production-like setup, but
it is still valuable for testing.
I also agree with Enrico. I can actually imagine a production environment
where ICMP is completely disabled for security reasons. Although it is
not very likely, as ZooKeeper usually runs deep in the backend. But who
knows... I think there is value in making this 'new behaviour'
configurable. We have already done it with a small patch:
https://github.com/apache/zookeeper/pull/1228.
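
Roughly, the idea is to gate the reachability filtering behind a flag. A
sketch of the shape of it, reusing the filterReachable helper from the first
mail (the property name and default here are only an assumption; the real
ones are defined in the PR):

    // Property name and default are assumptions for this sketch; check
    // PR 1228 / the admin docs for the actual flag.
    private static final boolean REACHABILITY_CHECK_ENABLED = Boolean.parseBoolean(
            System.getProperty("zookeeper.multiAddress.reachabilityCheckEnabled", "true"));

    static List<InetSocketAddress> electionCandidates(List<InetSocketAddress> all,
                                                      int timeoutMs) {
        if (!REACHABILITY_CHECK_ENABLED) {
            return all; // filtering disabled: no ICMP probes, like ZooKeeper 3.5
        }
        return filterReachable(all, timeoutMs); // sketch from the first mail
    }

With the flag turned off, the behaviour falls back to plain connection
attempts with no ICMP probing.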

I was also thinking about how we could remove ICMP completely from the
multi-address feature in the long term, so I created a follow-up ticket
here: https://issues.apache.org/jira/browse/ZOOKEEPER-3705
(hopefully I can work on it during February)

Kind regards,
Mate


Re: RC failure root cause: ICMP throttling settings on mac

Posted by Enrico Olivelli <eo...@gmail.com>.
I feel that Option 2 is more conservative: the multi-address feature is
new in 3.6 and in my opinion it won't be used by current users of 3.4 and
3.5, at least not immediately after an upgrade, because it needs a different
network architecture.

If you do not use the multi-address property, then with Option 1 you will be
seeing extra traffic (ICMP) between your hosts, and maybe this fact
won't be well received.
With Option 2 the behaviour is the same as ZK 3.5.
This is why I prefer Option 2.


Enrico


Re: RC failure root cause: ICMP throttling settings on mac

Posted by Patrick Hunt <ph...@apache.org>.
Agree with both folks (Ted/Michael) - I view this as a "chaos monkey" of
sorts. If it runs with 5, shouldn't it run with 7 and so on.... I don't
remember why I chose 23; it's been 10 years or so that I've been running
this test. Don't do this at home folks. ;-) Also, I don't just try starting
the cluster, I also kill servers, restart them and so on; it's a very good
stress test for the quorum protocol, etc... Option 1 sounds fine to me, but
I wanted to make sure the community reviewed this before signing off on
letting the code stand, or whatever, as long as it's reviewed/understood,
given it was/is new behavior in 3.6 especially. Conscious decision at the
end of the day.

Regards,

Patrick


Re: RC failure root cause: ICMP throttling settings on mac

Posted by "Michael K. Edwards" <m....@gmail.com>.
While I agree that this is not a very production-like configuration, I
think it's good to recognize that there are plenty of clusters out there
where more than five ZooKeeper nodes are called for.  I run systems
routinely with seven voting members plus three or more observers, for
reasons having to do with system behavior during network split scenarios in
AWS EC2.  Mac OS-specific issues aside, it would be unfortunate if there
were an artificial cap on the number of nodes in a machine-local test
cluster, especially if it were related to an ICMP storm scenario.


Re: RC failure root cause: ICMP throttling settings on mac

Posted by Ted Dunning <te...@gmail.com>.
I think that this is far outside the normal operation bounds and has an
easy work-around.

First, it is very uncommon to run more than 5 ZK nodes. Running 23 on a
single host is bizarre (viewed from an operational lens).

Second, there is a simple configuration change that makes the strange
configuration work anyway.

A third point, unrelated to operational considerations, is that there is risk
in making last-minute changes to code. That risk is borne by normal
configurations as well as these unusual ones.

In sum,

- this might look like a P1 (system down) issue, but there is a workaround
so it is certainly no more than P2

- even P2 is unwarranted because this is a non-production configuration

- a P3 issue isn't a stop-ship issue.


