Posted to reviews@mesos.apache.org by Neil Conway <ne...@gmail.com> on 2016/01/30 00:48:18 UTC
Review Request 42988: Changed ZooKeeper reconnection logic to retry
more aggressively.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42988/
-----------------------------------------------------------
Review request for mesos and Joris Van Remoortere.
Bugs: MESOS-4546
https://issues.apache.org/jira/browse/MESOS-4546
Repository: mesos
Description
-------
The previous implementation of `GroupProcess` tried to establish a single
ZooKeeper connection on startup, but didn't attempt to retry. ZooKeeper will
retry internally, but it only retries by attempting to reconnect to a list of
previously resolved IPs; it doesn't attempt to re-resolve those IPs to pick up
updates to DNS configuration. Because DNS configuration can be quite dynamic,
arrange to close the current Zk handle and open a new one if we've seen a
successful `zookeeper_init` but haven't been connected within the ZooKeeper
session timeout.
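
To illustrate the stale-DNS problem described above, here is a minimal, self-contained sketch. It is not the Mesos or ZooKeeper code: `FakeDns`, `FakeHandle`, and `connectWithRetry` are hypothetical names invented for this example, and "ticks" stand in for real time. It only models the idea that a handle resolves its host once, so reopening the handle after a timeout is what picks up a DNS change.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for DNS: maps a hostname to its current IP.
using FakeDns = std::map<std::string, std::string>;

// Hypothetical stand-in for a ZooKeeper handle. It resolves the hostname
// once, when it is created, and afterwards only retries that cached IP.
class FakeHandle {
public:
  FakeHandle(const FakeDns& dns, const std::string& host)
    : ip_(dns.at(host)) {}

  // A connection attempt succeeds only if the cached IP is the live server.
  bool tryConnect(const std::string& liveIp) const { return ip_ == liveIp; }

private:
  std::string ip_;  // Resolved at construction; never refreshed.
};

// Retries for up to `maxAttempts` ticks. If `reopenOnTimeout` is set, the
// handle is discarded and re-created (re-resolving DNS) whenever
// `timeoutTicks` consecutive attempts have failed.
bool connectWithRetry(
    FakeHandle handle,           // Created earlier, possibly with stale DNS.
    const FakeDns& dns,          // Current DNS contents.
    const std::string& host,
    const std::string& liveIp,
    bool reopenOnTimeout,
    int timeoutTicks,
    int maxAttempts)
{
  int failures = 0;
  for (int attempt = 0; attempt < maxAttempts; ++attempt) {
    if (handle.tryConnect(liveIp)) {
      return true;
    }
    if (reopenOnTimeout && ++failures >= timeoutTicks) {
      handle = FakeHandle(dns, host);  // Close and reopen: re-resolves.
      failures = 0;
    }
  }
  return false;
}
```

In this model, a handle created before a DNS update never connects when `reopenOnTimeout` is false, and connects shortly after the first timeout when it is true.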
Diffs
-----
src/zookeeper/group.hpp cf82fec290a2fa9bec122539c2eb0f12b45c2fb2
src/zookeeper/group.cpp 2ae3193e0e138c90b205d45400d80e80853e1b99
src/zookeeper/zookeeper.cpp 3c4fdad972dcd1728c52a05970646c713dcf98c8
Diff: https://reviews.apache.org/r/42988/diff/
Testing
-------
make check, on both OSX and Arch Linux. Manually configured a situation in which the Mesos agent uses stale DNS information in a loop: validated that without the patch, we don't pick up DNS changes, whereas with the patch, we do.
Thanks,
Neil Conway
Re: Review Request 42988: Changed ZooKeeper reconnection logic to
retry more aggressively.
Posted by Neil Conway <ne...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42988/#review117063
-----------------------------------------------------------
src/zookeeper/group.cpp (line 366)
<https://reviews.apache.org/r/42988/#comment178122>
I _believe_ that the connect timer will always be set when we reach here, but I want to double-check this.
- Neil Conway
Re: Review Request 42988: Changed ZooKeeper reconnection logic to
retry more aggressively.
Posted by Mesos ReviewBot <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42988/#review117093
-----------------------------------------------------------
Patch looks great!
Reviews applied: [42987, 42988]
Passed command: export OS='ubuntu:14.04' CONFIGURATION='--verbose' COMPILER='gcc' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/docker_build.sh
- Mesos ReviewBot
Re: Review Request 42988: Changed ZooKeeper reconnection logic to
retry more aggressively.
Posted by Neil Conway <ne...@gmail.com>.
> On Jan. 30, 2016, 9:53 p.m., Joris Van Remoortere wrote:
> > src/tests/group_tests.cpp, lines 451-452
> > <https://reviews.apache.org/r/42988/diff/4/?file=1226926#file1226926line451>
> >
> > Maybe a comment explaining that we're triggering the timeout? Or is this too self-explanatory?
Done.
- Neil
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42988/#review117118
-----------------------------------------------------------
Re: Review Request 42988: Changed ZooKeeper reconnection logic to
retry more aggressively.
Posted by Joris Van Remoortere <jo...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42988/#review117118
-----------------------------------------------------------
Fix it, then Ship it!
Thanks for taking this on Neil!
As we found out, this code is not the easiest to reason through.
I left some issues for places we may be able to make it easier to read through the state assertions for the next set of readers.
src/tests/group_tests.cpp (lines 445 - 446)
<https://reviews.apache.org/r/42988/#comment178203>
Can we add a short comment as to the state we're trying to achieve here?
I think it will help readers of the test.
src/tests/group_tests.cpp (lines 451 - 452)
<https://reviews.apache.org/r/42988/#comment178204>
Maybe a comment explaining that we're triggering the timeout? Or is this too self-explanatory?
src/zookeeper/group.cpp (lines 128 - 137)
<https://reviews.apache.org/r/42988/#comment178213>
Not yours:
Can we add a comment that we don't need to clean up the `delay` `Timer`s because they won't be invoked if libprocess can no longer get a `ProcessReference` to this Actor?
src/zookeeper/group.cpp (line 154)
<https://reviews.apache.org/r/42988/#comment178209>
Should we s/promptly/within the sessionTimeout/ to be more clear?
src/zookeeper/group.cpp (lines 154 - 159)
<https://reviews.apache.org/r/42988/#comment178210>
Some places we refer to `ZK` as in Zookeeper. Other places we refer to the handle `zk` as in the variable.
This introduces a third `Zk`. Can we keep the code consistent with just the 2 names above?
We could say either the `ZK handle` or the ``zk` handle`?
Here and elsewhere in your patch.
src/zookeeper/group.cpp (lines 365 - 366)
<https://reviews.apache.org/r/42988/#comment178214>
Can we explain that a timer always exists during a fresh connection, and a reconnect?
Maybe we can point to a top level comment where you explain the DNS stale-ness problem.
src/zookeeper/group.cpp (lines 367 - 368)
<https://reviews.apache.org/r/42988/#comment178215>
Comment along these lines:
Once we are connected, we will be notified of a disconnect through the `reconnecting` callback, at which point we will re-establish a timer (per the DNS stale-ness issue).
src/zookeeper/group.cpp (line 464)
<https://reviews.apache.org/r/42988/#comment178216>
Comment along the lines of:
This assertion tests that we only receive a single `reconnecting` callback for the `connected -> disconnected` state transition in the zookeeper client.
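
The timer invariants raised in the comments above can be sketched as a small state machine. This is a hypothetical model, not the code in `src/zookeeper/group.cpp`: the class `ConnectionState` and its method names are assumptions made for illustration, a plain `bool` stands in for the libprocess `Timer`, and `assert` stands in for the Mesos `CHECK` macros.

```cpp
#include <cassert>

// Hypothetical model of the connect-timer bookkeeping discussed in the review.
class ConnectionState {
public:
  // Fresh connection, or a reopen after a timeout. No timer may be pending
  // on entry; one is always pending on exit.
  void startConnection() {
    assert(!timerPending_);
    timerPending_ = true;
    connected_ = false;
    ++handlesOpened_;
  }

  // Connected callback: a timer always exists here, so cancel it.
  void connected() {
    assert(timerPending_);
    timerPending_ = false;
    connected_ = true;
  }

  // Reconnecting callback: expected once per connected -> disconnected
  // transition, so we must be connected and no timer may be pending.
  void reconnecting() {
    assert(connected_);
    assert(!timerPending_);
    connected_ = false;
    timerPending_ = true;  // Re-establish the timer (stale DNS issue).
  }

  // The timer fired before we (re)connected: drop the handle and start over.
  void timedOut() {
    assert(timerPending_);
    timerPending_ = false;
    startConnection();
  }

  bool timerPending() const { return timerPending_; }
  int handlesOpened() const { return handlesOpened_; }

private:
  bool timerPending_ = false;
  bool connected_ = false;
  int handlesOpened_ = 0;
};
```

Under this model a second `reconnecting()` call without an intervening `connected()` trips an assertion, which is the property the last comment asks to have documented.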
- Joris Van Remoortere
Re: Review Request 42988: Changed ZooKeeper reconnection logic to
retry more aggressively.
Posted by Mesos ReviewBot <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42988/#review117130
-----------------------------------------------------------
Patch looks great!
Reviews applied: [42987, 42988]
Passed command: export OS='ubuntu:14.04' CONFIGURATION='--verbose' COMPILER='gcc' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/docker_build.sh
- Mesos ReviewBot
Re: Review Request 42988: Changed ZooKeeper reconnection logic to
retry more aggressively.
Posted by Neil Conway <ne...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42988/
-----------------------------------------------------------
(Updated Jan. 30, 2016, 10:20 p.m.)
Review request for mesos and Joris Van Remoortere.
Changes
-------
Revise comments, per code review. Add a `CHECK_NONE` assert to `startConnection`.
Bugs: MESOS-4546
https://issues.apache.org/jira/browse/MESOS-4546
Repository: mesos
Description
-------
The previous implementation of `GroupProcess` tried to establish a single
ZooKeeper connection on startup, but didn't attempt to retry. ZooKeeper will
retry internally, but it only retries by attempting to reconnect to a list of
previously resolved IPs; it doesn't attempt to re-resolve those IPs to pick up
updates to DNS configuration. Because DNS configuration can be quite dynamic,
we now close the current Zk handle and open a new one if we've seen a
successful `zookeeper_init` but haven't been connected within the ZooKeeper
session timeout.
Diffs (updated)
-----
src/tests/group_tests.cpp 77349465e0163c8aa6bed6deefe3f98efb442f3d
src/zookeeper/group.hpp cf82fec290a2fa9bec122539c2eb0f12b45c2fb2
src/zookeeper/group.cpp 2ae3193e0e138c90b205d45400d80e80853e1b99
src/zookeeper/zookeeper.cpp 3c4fdad972dcd1728c52a05970646c713dcf98c8
Diff: https://reviews.apache.org/r/42988/diff/
Testing
-------
make check, on both OSX and Arch Linux. Manually configured a situation in which the Mesos agent uses stale DNS information in a loop: validated that without the patch, we don't pick up DNS changes, whereas with the patch, we do.
Also added a new unit test. Verified that the test fails w/o this patch applied and passes deterministically (`gtest_repeat=100`) with the patch applied.
Thanks,
Neil Conway
Re: Review Request 42988: Changed ZooKeeper reconnection logic to
retry more aggressively.
Posted by Neil Conway <ne...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42988/
-----------------------------------------------------------
(Updated Jan. 30, 2016, 1:16 a.m.)
Review request for mesos and Joris Van Remoortere.
Changes
-------
Added unit test.
Bugs: MESOS-4546
https://issues.apache.org/jira/browse/MESOS-4546
Repository: mesos
Description
-------
The previous implementation of `GroupProcess` tried to establish a single
ZooKeeper connection on startup, but didn't attempt to retry. ZooKeeper will
retry internally, but it only retries by attempting to reconnect to a list of
previously resolved IPs; it doesn't attempt to re-resolve those IPs to pick up
updates to DNS configuration. Because DNS configuration can be quite dynamic,
we now close the current Zk handle and open a new one if we've seen a
successful `zookeeper_init` but haven't been connected within the ZooKeeper
session timeout.
Diffs (updated)
-----
src/tests/group_tests.cpp 77349465e0163c8aa6bed6deefe3f98efb442f3d
src/zookeeper/group.hpp cf82fec290a2fa9bec122539c2eb0f12b45c2fb2
src/zookeeper/group.cpp 2ae3193e0e138c90b205d45400d80e80853e1b99
src/zookeeper/zookeeper.cpp 3c4fdad972dcd1728c52a05970646c713dcf98c8
Diff: https://reviews.apache.org/r/42988/diff/
Testing (updated)
-------
make check, on both OSX and Arch Linux. Manually configured a situation in which the Mesos agent uses stale DNS information in a loop: validated that without the patch, we don't pick up DNS changes, whereas with the patch, we do.
Also added a new unit test. Verified that the test fails w/o this patch applied and passes deterministically (`gtest_repeat=100`) with the patch applied.
Thanks,
Neil Conway
Re: Review Request 42988: Changed ZooKeeper reconnection logic to
retry more aggressively.
Posted by Neil Conway <ne...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42988/
-----------------------------------------------------------
(Updated Jan. 30, 2016, midnight)
Review request for mesos and Joris Van Remoortere.
Changes
-------
Tweak comment.
Bugs: MESOS-4546
https://issues.apache.org/jira/browse/MESOS-4546
Repository: mesos
Description
-------
The previous implementation of `GroupProcess` tried to establish a single
ZooKeeper connection on startup, but didn't attempt to retry. ZooKeeper will
retry internally, but it only retries by attempting to reconnect to a list of
previously resolved IPs; it doesn't attempt to re-resolve those IPs to pick up
updates to DNS configuration. Because DNS configuration can be quite dynamic,
we now close the current Zk handle and open a new one if we've seen a
successful `zookeeper_init` but haven't been connected within the ZooKeeper
session timeout.
Diffs (updated)
-----
src/zookeeper/group.hpp cf82fec290a2fa9bec122539c2eb0f12b45c2fb2
src/zookeeper/group.cpp 2ae3193e0e138c90b205d45400d80e80853e1b99
src/zookeeper/zookeeper.cpp 3c4fdad972dcd1728c52a05970646c713dcf98c8
Diff: https://reviews.apache.org/r/42988/diff/
Testing
-------
make check, on both OSX and Arch Linux. Manually configured a situation in which the Mesos agent uses stale DNS information in a loop: validated that without the patch, we don't pick up DNS changes, whereas with the patch, we do.
Thanks,
Neil Conway
Re: Review Request 42988: Changed ZooKeeper reconnection logic to
retry more aggressively.
Posted by Neil Conway <ne...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/42988/
-----------------------------------------------------------
(Updated Jan. 29, 2016, 11:57 p.m.)
Review request for mesos and Joris Van Remoortere.
Changes
-------
Tweak commit message.
Bugs: MESOS-4546
https://issues.apache.org/jira/browse/MESOS-4546
Repository: mesos
Description (updated)
-------
The previous implementation of `GroupProcess` tried to establish a single
ZooKeeper connection on startup, but didn't attempt to retry. ZooKeeper will
retry internally, but it only retries by attempting to reconnect to a list of
previously resolved IPs; it doesn't attempt to re-resolve those IPs to pick up
updates to DNS configuration. Because DNS configuration can be quite dynamic,
we now close the current Zk handle and open a new one if we've seen a
successful `zookeeper_init` but haven't been connected within the ZooKeeper
session timeout.
Diffs (updated)
-----
src/zookeeper/group.hpp cf82fec290a2fa9bec122539c2eb0f12b45c2fb2
src/zookeeper/group.cpp 2ae3193e0e138c90b205d45400d80e80853e1b99
src/zookeeper/zookeeper.cpp 3c4fdad972dcd1728c52a05970646c713dcf98c8
Diff: https://reviews.apache.org/r/42988/diff/
Testing
-------
make check, on both OSX and Arch Linux. Manually configured a situation in which the Mesos agent uses stale DNS information in a loop: validated that without the patch, we don't pick up DNS changes, whereas with the patch, we do.
Thanks,
Neil Conway