Posted to user@mesos.apache.org by Thomas Langé <t....@criteo.com> on 2022/03/10 10:15:05 UTC

Re: [BULK]Re: Mesos and Zookeeper Dynamic Reconfiguration?

Hi,

We don't run Mesos 1.11, but we use ZooKeeper's dynamic reconfiguration capability without any issue with Mesos 1.9. The only thing that needs careful handling is the addition/removal of ZooKeeper members when using the dynamic reconfiguration feature.

What do you mean by "mesos-master can handle a dynamic reconfiguration of the zk ensemble"? To my understanding, Mesos only connects to ZK to elect a leader through ZK primitives; I don't think this is related to how ZK members are set in the cluster.

How do you remove/add members to the ZK member list? The issue you encounter might come from inconsistencies in the ZK cluster.

Regards,

Thomas
________________________________
From: Charles-François Natali <cf...@gmail.com>
Sent: Wednesday, 9 March 2022 23:44
To: user <us...@mesos.apache.org>
Subject: [BULK]Re: Mesos and Zookeeper Dynamic Reconfiguration?

Hi Dan,

I don't think anyone has been looking at this, and I doubt we will, since we are quite low on resources.


Cheers,




On Tue, Mar 8, 2022, 19:01 Dan Leary <dl...@touchplan.io> wrote:
Been doing some testing with mesos 1.11.0 and zookeeper's "dynamic reconfiguration" capability.
https://zookeeper.apache.org/doc/r3.6.3/zookeeperReconfig.html

Seems prima facie like mesos-master can handle a dynamic reconfig of the zk ensemble up to the point
where a new set of participants has been added to the ensemble and the old participants
have been demoted to non-voting followers.  But when the non-voting follower processes are
terminated the master logs seem to indicate that the masters keep trying and failing to reconnect
to the old zk leader, even though they've apparently received updates with the new ensemble participants.

Anybody have any insight into this?
Any plans to support zk dynamic reconfiguration in the future?
Seems like it could make for easier O/S maintenance of one's master/zk cluster hosts.

-Dan


Re: [BULK]Re: [BULK]Re: Mesos and Zookeeper Dynamic Reconfiguration?

Posted by Dan Leary <dl...@touchplan.io>.
Thanks, I'll give that a try.
Perhaps I'll file a feature request to auto-update the zk endpoints on a
reconfig event so the mesos masters don't have to be restarted.
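
As a rough illustration of what such an auto-update would have to compute, the Python sketch below turns ZooKeeper dynamic configuration text into a zk:// URL of the kind passed to --zk. It is only a sketch under stated assumptions: the function name, the /mesos znode path, and the quorum/election ports in the usage note are hypothetical; the server.<id>=... line format is the one described in the ZooKeeper reconfig documentation linked in this thread; IPv6 literals are not handled; and nothing here is part of Mesos.

```python
import re

# One line of ZooKeeper's dynamic configuration, for example:
#   server.4=localhost:2891:3891:participant;localhost:2184
# i.e. server.<id>=<host>:<quorum port>:<election port>[:<role>];[<client host>:]<client port>
SERVER_LINE = re.compile(
    r"^server\.(?P<id>\d+)=(?P<host>[^:;]+):\d+:\d+(?::(?P<role>\w+))?"
    r";(?:(?P<client_host>[^:;]+):)?(?P<client_port>\d+)$"
)


def zk_url_from_dynamic_config(config_text, znode="/mesos", roles=("participant",)):
    """Build a zk:// URL from the client endpoints of the selected servers.

    Hypothetical helper: the name and the default znode are assumptions.
    """
    endpoints = []
    for line in config_text.splitlines():
        match = SERVER_LINE.match(line.strip())
        if match is None:
            continue  # skips blank lines and the trailing "version=..." line
        role = match.group("role") or "participant"  # role defaults to participant
        if role not in roles:
            continue
        host = match.group("client_host") or match.group("host")
        # 0.0.0.0 means "listen on all interfaces"; fall back to the server host.
        if host == "0.0.0.0":
            host = match.group("host")
        endpoints.append(f"{host}:{match.group('client_port')}")
    if not endpoints:
        raise ValueError("no matching servers found in the configuration")
    return "zk://" + ",".join(endpoints) + znode
```

Fed the configuration of the new ensemble (for instance the output of the ZooKeeper CLI "config" command), this would return something like zk://localhost:2184,localhost:2185,localhost:2186/mesos. The masters would still have to be restarted with that value, since Mesos does not re-read it at runtime.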



Re: [BULK]Re: [BULK]Re: Mesos and Zookeeper Dynamic Reconfiguration?

Posted by Thomas Langé <t....@criteo.com>.
Hi,

Your answer is in your last message:
>Perhaps a mesos-master needs to be terminated and then restarted with an updated zk:// list as
> each zk participant gets reconfig'ed?

From what I understand, when your issue happens, your ZK cluster is healthy but the Mesos masters fail to connect.
It seems to be because the Mesos masters are still configured to contact the 3 "legacy nodes". As long as those nodes are in the ZK cluster, they forward your requests to the ZK leader, so the whole setup works. Once you remove them, mesos-master has no way to reach a valid ZK member to access the cluster.
So, you need to update the --zk parameter so that it always contains members of the cluster (Mesos won't read the ZK configuration to fetch new members and auto-update its --zk endpoints).
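
The failure mode described above can be illustrated with a toy model in Python. This is not a ZooKeeper client; it only compares the static endpoint list a master was started with against the set of servers that are still running, using the port numbers from Dan's single-host test. The function name and the stage labels are made up for the illustration.

```python
def reachable(connect_string, live_endpoints):
    """Return the endpoints of a static connect string that are still alive."""
    configured = [e.strip() for e in connect_string.split(",") if e.strip()]
    return [e for e in configured if e in live_endpoints]


# Endpoints taken from the test setup described in this thread.
old = {"localhost:2181", "localhost:2182", "localhost:2183"}
new = {"localhost:2184", "localhost:2185", "localhost:2186"}

# The list the masters were started with; it never changes at runtime.
static_zk = "localhost:2181,localhost:2182,localhost:2183"

stages = [
    ("new servers added, old ones still running", old | new),
    ("old servers demoted but still running", old | new),
    ("old servers terminated", new),
]

for label, live in stages:
    usable = reachable(static_zk, live)
    status = "ok" if usable else "no reachable ZooKeeper server"
    print(f"{label}: {status} {usable}")
```

Only the last stage leaves the static list with no live endpoint, which matches the point at which the masters start logging connection failures.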

To summarize, dynamic reconfiguration is purely a ZK feature, and Mesos is not aware of those changes.

Bw,

Thomas

Re: [BULK]Re: Mesos and Zookeeper Dynamic Reconfiguration?

Posted by Dan Leary <dl...@touchplan.io>.
Thomas-

Encouraging news.  Appreciate the response.

I've tried both non-incremental and incremental reconfigs with the same
result.
With 3 zk participants (quorum 2) we first add 3 observers.
Non-incrementally we then remove a participant then add an observer as
participant.
Repeat twice, last time the current leading participant is the one removed.
At this point the 3 mesos-masters all seem fine.
My bespoke framework is fine too, it sees CONNECT, RECONNECT, and RECONFIG
events and gets the updated list of zk participants just fine.
But when we terminate the original zk servers that are now running as
non-voting followers, the mesos-masters all seem to keep trying to
reconnect to the now-dead former zk participants.
Eventually heartbeats fail and the whole cluster shuts down.
The masters log messages like:

2022-03-08 13:26:45,964:30032(0x7f25a3048700):ZOO_INFO@zookeeper_init@827: Initiating client connection, host=localhost:2181,localhost:2182,localhost:2183 sessionTimeout=10000 watcher=0x7f25ba3af67e sessionId=0 sessionPasswd=<null> context=0x7f255c000bf8 flags=0
2022-03-08 13:26:45,964:30032(0x7f25565ea700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [::1:2183] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
2022-03-08 13:26:45,964:30032(0x7f25565ea700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [127.0.0.1:2182] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
2022-03-08 13:26:45,964:30032(0x7f25565ea700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [::1:2181] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
2022-03-08 13:26:45,965:30032(0x7f25565ea700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [127.0.0.1:2181] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
2022-03-08 13:26:45,965:30032(0x7f25565ea700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [127.0.0.1:2183] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
2022-03-08 13:26:45,965:30032(0x7f2557ded700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [127.0.0.1:2182] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
2022-03-08 13:26:45,965:30032(0x7f2557ded700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [::1:2182] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
2022-03-08 13:26:45,965:30032(0x7f2557ded700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [127.0.0.1:2181] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
2022-03-08 13:26:45,965:30032(0x7f2557ded700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [::1:2181] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
2022-03-08 13:26:45,965:30032(0x7f2557ded700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [::1:2183] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
2022-03-08 13:26:45,965:30032(0x7f2557ded700):ZOO_ERROR@handle_socket_error_msg@1758: Socket [127.0.0.1:2183] zk retcode=-4, errno=111(Connection refused): server refused to accept the client

Ports 2181, 2182, and 2183 are the old participants; the new participants
are on ports 2184, 2185, and 2186 (single-host test environment).
Perhaps a mesos-master needs to be terminated and then restarted with an
updated zk:// list as each zk participant gets reconfig'ed?
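
To check which endpoints a master is still trying, log lines like the ones above can be summarised per port. The following Python sketch assumes the ZooKeeper C client log format shown in this message; the function name is hypothetical and the regular expression has only been written against these sample lines.

```python
import re
from collections import Counter

# Matches the endpoint, return code and errno in lines such as:
#   ...ZOO_ERROR@handle_socket_error_msg@1758: Socket [127.0.0.1:2182]
#   zk retcode=-4, errno=111(Connection refused): server refused ...
# \s+ tolerates log lines that a mail client has wrapped.
SOCKET_ERROR = re.compile(
    r"ZOO_ERROR@handle_socket_error_msg@\d+:\s+Socket\s+\[(?P<endpoint>[^\]]+)\]"
    r"\s+zk\s+retcode=(?P<retcode>-?\d+),\s+errno=(?P<errno>\d+)\((?P<reason>[^)]*)\)"
)


def refused_ports(log_text):
    """Count socket errors per port in a ZooKeeper C client log (hypothetical helper)."""
    counts = Counter()
    for match in SOCKET_ERROR.finditer(log_text):
        # The port follows the last ':' for both 127.0.0.1:2181 and ::1:2181.
        port = int(match.group("endpoint").rsplit(":", 1)[1])
        counts[port] += 1
    return counts
```

If every port reported this way belongs to the old participants and none to the new ones, the master is still working from its original endpoint list.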

-Dan

