Posted to user@mesos.apache.org by Donald Laidlaw <do...@me.com> on 2015/11/09 15:01:34 UTC

Zookeeper cluster changes

How do mesos masters and slaves react to zookeeper cluster changes? When the masters and slaves start they are given a set of addresses to connect to zookeeper. But over time, one of those zookeepers fails, and is replaced by a new server at a new address. How should this be handled in the mesos servers?

I am guessing that mesos does not automatically detect and react to that change. But obviously we should do something to keep the mesos servers happy as well. What should we do?

The obvious thing is to stop the mesos servers, one at a time, and restart them with the new configuration. But it would be really nice to be able to do this dynamically without restarting the server. After all, coordinating a rolling restart is a fairly hard job.
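
For example, if the daemons were started against an ensemble such as

    zk://10.1.1.10:2181,10.1.2.10:2181,10.1.3.10:2181/mesos

and one member were later replaced by a new node at, say, 10.1.4.10 (a hypothetical address), the rolling restart would just point them at the updated string:

    zk://10.1.1.10:2181,10.1.2.10:2181,10.1.4.10:2181/mesos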

Any suggestions or pointers?

Best regards,
Don Laidlaw



Re: Zookeeper cluster changes

Posted by Donald Laidlaw <do...@me.com>.
That is great stuff Joseph!

What I am trying to understand at the moment is how the mesos servers (master and slave) use the list of zookeepers passed to them at startup. For example:

	zk://10.1.1.10:2181,10.1.2.10:2181,10.1.3.10:2181/mesos

At startup, does mesos attempt to connect to these servers in order and then use the first one that works? Later on, if the connection is lost, does mesos continue trying to reconnect with the servers in the list? Or does it fail fast as was mentioned earlier by Erik Weathers?
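
For comparison, here is a minimal sketch of how a typical ZooKeeper client library treats that host string (this is Python with the kazoo client, not the C library that mesos actually links against, so the behaviour may well differ):

    from kazoo.client import KazooClient

    # Same host string as above; kazoo shuffles the host list by default and will
    # try the other members if the server it is currently connected to goes away.
    zk = KazooClient(hosts="10.1.1.10:2181,10.1.2.10:2181,10.1.3.10:2181", timeout=10.0)
    zk.start()
    print(zk.get_children("/mesos"))  # znodes the masters use for leader election
    zk.stop()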

I ask because that affects how I would like to try to recover the servers. If it is failing fast, then I can just check for changes to the ensemble at mesos startup. If it is not failing fast, I need more complex code to recognize the ensemble change and then do a rolling restart.

Thanks!

Don Laidlaw
866 Cobequid Rd.
Lower Sackville, NS
B4C 4E1
Phone: 1-902-576-5185
Cell: 1-902-401-6771

> On Nov 11, 2015, at 4:29 PM, Joseph Smith <ya...@gmail.com> wrote:
> 
> We have live-migrated an entire cluster of 10s of thousands of Mesos Agents to point at a new ZK ensemble without killing our cluster <https://youtu.be/nNrh-gdu9m4?t=21m50s>, or the tasks we were running. (twice)
> 
> We started by shutting off all of the Mesos Masters. I’ve heard rumors that some people have found their Mesos Agents will kill themselves without a master, but this has never been my experience. If you find this to be the case, please reach out as I’d love to avoid that fate at all costs.
> 
> Once the masters were down, we submitted a change to modify the configuration for the agents (we set up an automatic restart of the slave for some configuration values such as this one to make it easier to roll out). It took our current configuration management system the better part of an hour to get the change propagated across the cluster, but while that was happening, the agents were happily running, and user tasks were serving traffic. Once we saw zk_watch_count (check under the mntr command) <https://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html#sc_zkCommands> increase to the expected number of agents on the new ensemble, we turned on the masters (also pointing at the new ensemble now) and the agents sent status updates back to the masters. If you haven’t taken a look at zktraffic <https://github.com/twitter/zktraffic>, I’d recommend it for improved visibility into your ensemble as well.
> 
> Please note- there’s a bug in the C ZooKeeper library <https://issues.apache.org/jira/browse/ZOOKEEPER-1998> where the members of an ensemble will only be resolved once. There isn’t conclusive proof that it affects the agents <https://issues.apache.org/jira/browse/MESOS-2681>. We were fine, but you may want to validate.
> 
>> On Nov 10, 2015, at 11:23 AM, Donald Laidlaw <donlaidlaw@me.com <ma...@me.com>> wrote:
>> 
>> I agree, you want to apply the changes gradually so as not to lose a quorum. The problem is automating this so that it happens in a lights-out environment, in the cloud, without some poor slob's pager going off in the middle of the night :)
>> 
>> While health checks can detect and replace a dead server reliably on any number of clouds, the new server comes up with a new IP address. This server can reliably join the zookeeper ensemble. However, it is tough to automate the rolling restart of the other mesos servers, both masters and slaves, that needs to occur to keep them happy.
>> 
>> One thing I have not tried is to just ignore the change, and use something to detect the masters just prior to starting mesos. If they truly fail fast when they lose a zookeeper connection, then maybe they don’t care that they have been started with an out-of-date list of zookeeper servers.
>> 
>> What does mesos-master and mesos-slave do with a list of zookeeper servers to connect to? Just try them in order until one works, then use that one until it fails? If so, and it fails fast, then letting it continue to run with a stale list will have no ill effects. Or does it keep trying the servers in the list when a connection fails? 
>> 
>> Don Laidlaw
>> 
>> 
>>> On Nov 10, 2015, at 4:42 AM, Erik Weathers <eweathers@groupon.com <ma...@groupon.com>> wrote:
>>> 
>>> Keep in mind that mesos is designed to "fail fast".  So when there are problems (such as losing connectivity to the resolved ZooKeeper IP) the daemon(s) (master & slave) die.
>>> 
>>> Due to this design, we are all supposed to run the mesos daemons under "supervision", which means auto-restart after they crash.  This can be done with monit/god/runit/etc.
>>> 
>>> So, to perform maintenance on ZooKeeper, I would firstly ensure the mesos-master processes are running under "supervision" so that they restart quickly after a ZK connectivity failure occurs.  Then proceed with standard ZooKeeper maintenance (exhibitor-based or manual), pausing between downing of ZK servers to ensure you have "enough" mesos-master processes running.  (I *would* say "pause until you have a quorum of mesos-masters up", but if you only have 2 of 3 up and then take down the ZK that the leader is connected to, that would be temporarily bad.  So I'd make sure they're all up.)
>>> 
>>> - Erik
>>> 
>>> On Mon, Nov 9, 2015 at 11:07 PM, Marco Massenzio <marco@mesosphere.io <ma...@mesosphere.io>> wrote:
>>> The way I would do it in a production cluster would be *not* to use IP addresses directly for the ZK ensemble, but instead rely on some form of internal DNS and use internally-resolvable hostnames (eg, {zk1, zk2, ...}.prod.example.com etc) and have the provisioning tooling (Chef, Puppet, Ansible, what have you) handle the setting of the hostname when restarting/replacing a failing/crashed ZK server.
>>> 
>>> This way your list of zk's to Mesos never changes, even though the FQDNs will map to different IPs / VMs.
>>> 
>>> Obviously, this may not always be desirable / feasible (eg, if your prod environment does not support DNS resolution).
>>> 
>>> You are correct in that Mesos does not currently support dynamically changing the ZK's addresses, but I don't know whether that's a limitation of Mesos code or of the ZK C++ client driver.
>>> I'll look into it and let you know what I find (if anything).
>>> 
>>> --
>>> Marco Massenzio
>>> Distributed Systems Engineer
>>> http://codetrips.com
>>> 
>>> On Mon, Nov 9, 2015 at 6:01 AM, Donald Laidlaw <donlaidlaw@me.com <ma...@me.com>> wrote:
>>> How do mesos masters and slaves react to zookeeper cluster changes? When the masters and slaves start they are given a set of addresses to connect to zookeeper. But over time, one of those zookeepers fails, and is replaced by a new server at a new address. How should this be handled in the mesos servers?
>>> 
>>> I am guessing that mesos does not automatically detect and react to that change. But obviously we should do something to keep the mesos servers happy as well. What should we do?
>>> 
>>> The obvious thing is to stop the mesos servers, one at a time, and restart them with the new configuration. But it would be really nice to be able to do this dynamically without restarting the server. After all, coordinating a rolling restart is a fairly hard job.
>>> 
>>> Any suggestions or pointers?
>>> 
>>> Best regards,
>>> Don Laidlaw
>>> 
>>> 
>>> 
>>> 
>> 
> 


Re: Zookeeper cluster changes

Posted by Joseph Smith <ya...@gmail.com>.
We have live-migrated an entire cluster of 10s of thousands of Mesos Agents to point at a new ZK ensemble without killing our cluster <https://youtu.be/nNrh-gdu9m4?t=21m50s>, or the tasks we were running. (twice)

We started by shutting off all of the Mesos Masters. I’ve heard rumors that some people have found their Mesos Agents will kill themselves without a master, but this has never been my experience. If you find this to be the case, please reach out as I’d love to avoid that fate at all costs.

Once the masters were down, we submitted a change to modify the configuration for the agents (we set up an automatic restart of the slave for some configuration values such as this one to make it easier to roll out). It took our current configuration management system the better part of an hour to get the change propagated across the cluster, but while that was happening, the agents were happily running, and user tasks were serving traffic. Once we saw zk_watch_count (check under the mntr command) <https://zookeeper.apache.org/doc/trunk/zookeeperAdmin.html#sc_zkCommands> increase to the expected number of agents on the new ensemble, we turned on the masters (also pointing at the new ensemble now) and the agents sent status updates back to the masters. If you haven’t taken a look at zktraffic <https://github.com/twitter/zktraffic>, I’d recommend it for improved visibility into your ensemble as well.
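
If it helps, here is a rough sketch of that zk_watch_count check (Python, talking straight to the client port with the "mntr" four-letter command; note the count is per ZK server, so check each member of the new ensemble, and the address below is just a placeholder):

    import socket

    def mntr(host, port=2181):
        # Four-letter admin commands are sent raw on the client port;
        # the server writes its reply and then closes the connection.
        with socket.create_connection((host, port), timeout=5) as s:
            s.sendall(b"mntr")
            data = b""
            while chunk := s.recv(4096):
                data += chunk
        # The reply is tab-separated "key<TAB>value" lines.
        return dict(line.split("\t", 1) for line in data.decode().splitlines() if "\t" in line)

    print(mntr("10.1.1.10").get("zk_watch_count"))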

Please note- there’s a bug in the C ZooKeeper library <https://issues.apache.org/jira/browse/ZOOKEEPER-1998> where the members of an ensemble will only be resolved once. There isn’t conclusive proof that it affects the agents <https://issues.apache.org/jira/browse/MESOS-2681>. We were fine, but you may want to validate.

> On Nov 10, 2015, at 11:23 AM, Donald Laidlaw <do...@me.com> wrote:
> 
> I agree, you want to apply the changes gradually so as not to lose a quorum. The problem is automating this so that it happens in a lights-out environment, in the cloud, without some poor slob's pager going off in the middle of the night :)
> 
> While health checks can detect and replace a dead server reliably on any number of clouds, the new server comes up with a new IP address. This server can reliably join the zookeeper ensemble. However, it is tough to automate the rolling restart of the other mesos servers, both masters and slaves, that needs to occur to keep them happy.
> 
> One thing I have not tried is to just ignore the change, and use something to detect the masters just prior to starting mesos. If they truly fail fast when they lose a zookeeper connection, then maybe they don’t care that they have been started with an out-of-date list of zookeeper servers.
> 
> What does mesos-master and mesos-slave do with a list of zookeeper servers to connect to? Just try them in order until one works, then use that one until it fails? If so, and it fails fast, then letting it continue to run with a stale list will have no ill effects. Or does it keep trying the servers in the list when a connection fails? 
> 
> Don Laidlaw
> 
> 
>> On Nov 10, 2015, at 4:42 AM, Erik Weathers <eweathers@groupon.com <ma...@groupon.com>> wrote:
>> 
>> Keep in mind that mesos is designed to "fail fast".  So when there are problems (such as losing connectivity to the resolved ZooKeeper IP) the daemon(s) (master & slave) die.
>> 
>> Due to this design, we are all supposed to run the mesos daemons under "supervision", which means auto-restart after they crash.  This can be done with monit/god/runit/etc.
>> 
>> So, to perform maintenance on ZooKeeper, I would firstly ensure the mesos-master processes are running under "supervision" so that they restart quickly after a ZK connectivity failure occurs.  Then proceed with standard ZooKeeper maintenance (exhibitor-based or manual), pausing between downing of ZK servers to ensure you have "enough" mesos-master processes running.  (I *would* say "pause until you have a quorum of mesos-masters up", but if you only have 2 of 3 up and then take down the ZK that the leader is connected to, that would be temporarily bad.  So I'd make sure they're all up.)
>> 
>> - Erik
>> 
>> On Mon, Nov 9, 2015 at 11:07 PM, Marco Massenzio <marco@mesosphere.io <ma...@mesosphere.io>> wrote:
>> The way I would do it in a production cluster would be *not* to use IP addresses directly for the ZK ensemble, but instead rely on some form of internal DNS and use internally-resolvable hostnames (eg, {zk1, zk2, ...}.prod.example.com etc) and have the provisioning tooling (Chef, Puppet, Ansible, what have you) handle the setting of the hostname when restarting/replacing a failing/crashed ZK server.
>> 
>> This way your list of zk's to Mesos never changes, even though the FQDNs will map to different IPs / VMs.
>> 
>> Obviously, this may not always be desirable / feasible (eg, if your prod environment does not support DNS resolution).
>> 
>> You are correct in that Mesos does not currently support dynamically changing the ZK's addresses, but I don't know whether that's a limitation of Mesos code or of the ZK C++ client driver.
>> I'll look into it and let you know what I find (if anything).
>> 
>> --
>> Marco Massenzio
>> Distributed Systems Engineer
>> http://codetrips.com
>> 
>> On Mon, Nov 9, 2015 at 6:01 AM, Donald Laidlaw <donlaidlaw@me.com <ma...@me.com>> wrote:
>> How do mesos masters and slaves react to zookeeper cluster changes? When the masters and slaves start they are given a set of addresses to connect to zookeeper. But over time, one of those zookeepers fails, and is replaced by a new server at a new address. How should this be handled in the mesos servers?
>> 
>> I am guessing that mesos does not automatically detect and react to that change. But obviously we should do something to keep the mesos servers happy as well. What should we do?
>> 
>> The obvious thing is to stop the mesos servers, one at a time, and restart them with the new configuration. But it would be really nice to be able to do this dynamically without restarting the server. After all, coordinating a rolling restart is a fairly hard job.
>> 
>> Any suggestions or pointers?
>> 
>> Best regards,
>> Don Laidlaw
>> 
>> 
>> 
>> 
> 


Re: Zookeeper cluster changes

Posted by Donald Laidlaw <do...@me.com>.
I agree, you want to apply the changes gradually so as not to lose a quorum. The problem is automating this so that it happens in a lights-out environment, in the cloud, without some poor slob's pager going off in the middle of the night :)

While health checks can detect and replace a dead server reliably on any number of clouds, the new server comes up with a new IP address. This server can reliably join the zookeeper ensemble. However, it is tough to automate the rolling restart of the other mesos servers, both masters and slaves, that needs to occur to keep them happy.

One thing I have not tried is to just ignore the change, and use something to detect the masters just prior to starting mesos. If they truly fail fast when they lose a zookeeper connection, then maybe they don’t care that they have been started with an out-of-date list of zookeeper servers.

What does mesos-master and mesos-slave do with a list of zookeeper servers to connect to? Just try them in order until one works, then use that one until it fails? If so, and it fails fast, then letting it continue to run with a stale list will have no ill effects. Or does it keep trying the servers in the list when a connection fails? 

Don Laidlaw


> On Nov 10, 2015, at 4:42 AM, Erik Weathers <ew...@groupon.com> wrote:
> 
> Keep in mind that mesos is designed to "fail fast".  So when there are problems (such as losing connectivity to the resolved ZooKeeper IP) the daemon(s) (master & slave) die.
> 
> Due to this design, we are all supposed to run the mesos daemons under "supervision", which means auto-restart after they crash.  This can be done with monit/god/runit/etc.
> 
> So, to perform maintenance on ZooKeeper, I would firstly ensure the mesos-master processes are running under "supervision" so that they restart quickly after a ZK connectivity failure occurs.  Then proceed with standard ZooKeeper maintenance (exhibitor-based or manual), pausing between downing of ZK servers to ensure you have "enough" mesos-master processes running.  (I *would* say "pause until you have a quorum of mesos-masters up", but if you only have 2 of 3 up and then take down the ZK that the leader is connected to, that would be temporarily bad.  So I'd make sure they're all up.)
> 
> - Erik
> 
> On Mon, Nov 9, 2015 at 11:07 PM, Marco Massenzio <marco@mesosphere.io <ma...@mesosphere.io>> wrote:
> The way I would do it in a production cluster would be *not* to use IP addresses directly for the ZK ensemble, but instead rely on some form of internal DNS and use internally-resolvable hostnames (eg, {zk1, zk2, ...}.prod.example.com etc) and have the provisioning tooling (Chef, Puppet, Ansible, what have you) handle the setting of the hostname when restarting/replacing a failing/crashed ZK server.
> 
> This way your list of zk's to Mesos never changes, even though the FQDNs will map to different IPs / VMs.
> 
> Obviously, this may not always be desirable / feasible (eg, if your prod environment does not support DNS resolution).
> 
> You are correct in that Mesos does not currently support dynamically changing the ZK's addresses, but I don't know whether that's a limitation of Mesos code or of the ZK C++ client driver.
> I'll look into it and let you know what I find (if anything).
> 
> --
> Marco Massenzio
> Distributed Systems Engineer
> http://codetrips.com
> 
> On Mon, Nov 9, 2015 at 6:01 AM, Donald Laidlaw <donlaidlaw@me.com <ma...@me.com>> wrote:
> How do mesos masters and slaves react to zookeeper cluster changes? When the masters and slaves start they are given a set of addresses to connect to zookeeper. But over time, one of those zookeepers fails, and is replaced by a new server at a new address. How should this be handled in the mesos servers?
> 
> I am guessing that mesos does not automatically detect and react to that change. But obviously we should do something to keep the mesos servers happy as well. What should we do?
> 
> The obvious thing is to stop the mesos servers, one at a time, and restart them with the new configuration. But it would be really nice to be able to do this dynamically without restarting the server. After all, coordinating a rolling restart is a fairly hard job.
> 
> Any suggestions or pointers?
> 
> Best regards,
> Don Laidlaw
> 
> 
> 
> 


Re: Zookeeper cluster changes

Posted by Erik Weathers <ew...@groupon.com>.
Keep in mind that mesos is designed to "fail fast".  So when there are
problems (such as losing connectivity to the resolved ZooKeeper IP) the
daemon(s) (master & slave) die.

Due to this design, we are all supposed to run the mesos daemons under
"supervision", which means auto-restart after they crash.  This can be done
with monit/god/runit/etc.
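
Purely as an illustration of what "supervision" means here (in practice you would use monit/runit/systemd or similar rather than hand-rolling it, and the flag values below are made up), the idea is just an auto-restart loop around the daemon:

    import subprocess, time

    while True:
        # Run the master in the foreground; when it dies (e.g. after losing its
        # ZooKeeper session), restart it after a short back-off.
        proc = subprocess.run(["mesos-master",
                               "--zk=zk://10.1.1.10:2181,10.1.2.10:2181,10.1.3.10:2181/mesos",
                               "--quorum=2",
                               "--work_dir=/var/lib/mesos"])
        print("mesos-master exited with code", proc.returncode, "- restarting")
        time.sleep(2)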

So, to perform maintenance on ZooKeeper, I would firstly ensure the
mesos-master processes are running under "supervision" so that they restart
quickly after a ZK connectivity failure occurs.  Then proceed with standard
ZooKeeper maintenance (exhibitor-based or manual), pausing between downing
of ZK servers to ensure you have "enough" mesos-master processes running.
 (I *would* say "pause until you have a quorum of mesos-masters up", but if
you only have 2 of 3 up and then take down the ZK that the leader is
connected to, that would be temporarily bad.  So I'd make sure they're all
up.)
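
A rough sketch of that "make sure they're all up" check between ZK restarts, polling each master's /health endpoint (the master addresses here are hypothetical):

    import urllib.request

    masters = ["10.1.1.20:5050", "10.1.2.20:5050", "10.1.3.20:5050"]

    def all_masters_up():
        for m in masters:
            try:
                with urllib.request.urlopen("http://%s/health" % m, timeout=5) as resp:
                    if resp.status != 200:
                        return False
            except OSError:
                return False
        return True

    # Only take down the next ZooKeeper server once all_masters_up() returns True.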

- Erik

On Mon, Nov 9, 2015 at 11:07 PM, Marco Massenzio <ma...@mesosphere.io>
wrote:

> The way I would do it in a production cluster would be *not* to use IP
> addresses directly for the ZK ensemble, but instead rely on some form of
> internal DNS and use internally-resolvable hostnames (eg, {zk1, zk2, ...}.
> prod.example.com etc) and have the provisioning tooling (Chef, Puppet,
> Ansible, what have you) handle the setting of the hostname when
> restarting/replacing a failing/crashed ZK server.
>
> This way your list of zk's to Mesos never changes, even though the FQDNs
> will map to different IPs / VMs.
>
> Obviously, this may not always be desirable / feasible (eg, if your prod
> environment does not support DNS resolution).
>
> You are correct in that Mesos does not currently support dynamically
> changing the ZK's addresses, but I don't know whether that's a limitation
> of Mesos code or of the ZK C++ client driver.
> I'll look into it and let you know what I find (if anything).
>
> --
> *Marco Massenzio*
> Distributed Systems Engineer
> http://codetrips.com
>
> On Mon, Nov 9, 2015 at 6:01 AM, Donald Laidlaw <do...@me.com> wrote:
>
>> How do mesos masters and slaves react to zookeeper cluster changes? When
>> the masters and slaves start they are given a set of addresses to connect
>> to zookeeper. But over time, one of those zookeepers fails, and is replaced
>> by a new server at a new address. How should this be handled in the mesos
>> servers?
>>
>> I am guessing that mesos does not automatically detect and react to that
>> change. But obviously we should do something to keep the mesos servers
>> happy as well. What should we do?
>>
>> The obvious thing is to stop the mesos servers, one at a time, and
>> restart them with the new configuration. But it would be really nice to be
>> able to do this dynamically without restarting the server. After all,
>> coordinating a rolling restart is a fairly hard job.
>>
>> Any suggestions or pointers?
>>
>> Best regards,
>> Don Laidlaw
>>
>>
>>
>

Re: Zookeeper cluster changes

Posted by Marco Massenzio <ma...@mesosphere.io>.
The way I would do it in a production cluster would be *not* to use IP
addresses directly for the ZK ensemble, but instead rely on some form of
internal DNS and use internally-resolvable hostnames (eg, {zk1, zk2, ...}.
prod.example.com etc) and have the provisioning tooling (Chef, Puppet,
Ansible, what have you) handle the setting of the hostname when
restarting/replacing a failing/crashed ZK server.

This way your list of zk's to Mesos never changes, even though the FQDNs
will map to different IPs / VMs.
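
For example, the connection string handed to Mesos could stay fixed as something like

    zk://zk1.prod.example.com:2181,zk2.prod.example.com:2181,zk3.prod.example.com:2181/mesos

while the machines behind those names come and go.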

Obviously, this may not always be desirable / feasible (eg, if your prod
environment does not support DNS resolution).

You are correct in that Mesos does not currently support dynamically
changing the ZK's addresses, but I don't know whether that's a limitation
of Mesos code or of the ZK C++ client driver.
I'll look into it and let you know what I find (if anything).

--
*Marco Massenzio*
Distributed Systems Engineer
http://codetrips.com

On Mon, Nov 9, 2015 at 6:01 AM, Donald Laidlaw <do...@me.com> wrote:

> How do mesos masters and slaves react to zookeeper cluster changes? When
> the masters and slaves start they are given a set of addresses to connect
> to zookeeper. But over time, one of those zookeepers fails, and is replaced
> by a new server at a new address. How should this be handled in the mesos
> servers?
>
> I am guessing that mesos does not automatically detect and react to that
> change. But obviously we should do something to keep the mesos servers
> happy as well. What should we do?
>
> The obvious thing is to stop the mesos servers, one at a time, and restart
> them with the new configuration. But it would be really nice to be able to
> do this dynamically without restarting the server. After all, coordinating
> a rolling restart is a fairly hard job.
>
> Any suggestions or pointers?
>
> Best regards,
> Don Laidlaw
>
>
>

Re: Zookeeper cluster changes

Posted by Donald Laidlaw <do...@me.com>.
Yeah, I know about Exhibitor and how it handles zookeeper ensemble changes.

My question was about how to handle the Mesos servers.

What do you have to do with Mesos, when the zookeeper ensemble changes, to keep the mesos servers happy and healthy? 

Don Laidlaw
866 Cobequid Rd.
Lower Sackville, NS
B4C 4E1
Phone: 1-902-576-5185
Cell: 1-902-401-6771

> On Nov 9, 2015, at 12:28 PM, tommy xiao <xi...@gmail.com> wrote:
> 
> Good news: Netflix released a tool that can do it:
> https://github.com/Netflix/exhibitor/wiki/Rolling-Ensemble-Change
> 
> Give it a try.
> 
> 2015-11-09 22:01 GMT+08:00 Donald Laidlaw <donlaidlaw@me.com <ma...@me.com>>:
> How do mesos masters and slaves react to zookeeper cluster changes? When the masters and slaves start they are given a set of addresses to connect to zookeeper. But over time, one of those zookeepers fails, and is replaced by a new server at a new address. How should this be handled in the mesos servers?
> 
> I am guessing that mesos does not automatically detect and react to that change. But obviously we should do something to keep the mesos servers happy as well. What should we do?
> 
> The obvious thing is to stop the mesos servers, one at a time, and restart them with the new configuration. But it would be really nice to be able to do this dynamically without restarting the server. After all, coordinating a rolling restart is a fairly hard job.
> 
> Any suggestions or pointers?
> 
> Best regards,
> Don Laidlaw
> 
> 
> 
> 
> 
> -- 
> Deshi Xiao
> Twitter: xds2000
> E-mail: xiaods(AT)gmail.com

Re: Zookeeper cluster changes

Posted by tommy xiao <xi...@gmail.com>.
Good news: Netflix released a tool that can do it:
https://github.com/Netflix/exhibitor/wiki/Rolling-Ensemble-Change

Give it a try.

2015-11-09 22:01 GMT+08:00 Donald Laidlaw <do...@me.com>:

> How do mesos masters and slaves react to zookeeper cluster changes? When
> the masters and slaves start they are given a set of addresses to connect
> to zookeeper. But over time, one of those zookeepers fails, and is replaced
> by a new server at a new address. How should this be handled in the mesos
> servers?
>
> I am guessing that mesos does not automatically detect and react to that
> change. But obviously we should do something to keep the mesos servers
> happy as well. What should we do?
>
> The obvious thing is to stop the mesos servers, one at a time, and restart
> them with the new configuration. But it would be really nice to be able to
> do this dynamically without restarting the server. After all, coordinating
> a rolling restart is a fairly hard job.
>
> Any suggestions or pointers?
>
> Best regards,
> Don Laidlaw
>
>
>


-- 
Deshi Xiao
Twitter: xds2000
E-mail: xiaods(AT)gmail.com