Posted to dev@mesos.apache.org by An an Zhao <fl...@gmail.com> on 2015/06/30 12:27:31 UTC

It's more user friendly that masters don't shut down the slave when re-registering timeout

Hi,
Currently, according to the documentation, the master shuts down the slave
when re-registration times out:

> If the slave takes longer than this timeout to re-register, the master
> shuts down the slave, which in turn shuts down any live executors/tasks.

*1.* I think it would be friendlier and more direct for the slave to kill
only the executors without exiting, and then register again. On the other
hand, supporting this would take some effort, so it may not be worth it.
What's your opinion?

*2.* The slave has a flag, recovery_timeout, which defaults to 15 min.
The slave will also fail to re-register, and the executors will be killed,
when re-registration takes longer than the health check timeout (75s). So
the executors are useless after 75s.
*I'm wondering why recovery_timeout defaults to 15 min. I think 75s is
enough.* Is this a good idea?


   Thanks for your time.

Best regards.

Re: It's more user friendly that masters don't shut down the slave when re-registering timeout

Posted by Adam Bordelon <ad...@mesosphere.io>.

Many setups use something like systemd to ensure that if the slave is
shut down or killed, it starts up again and registers under a new
slaveId. This should address your first point, An.

On Tue, Jun 30, 2015 at 8:20 AM, Roger Ignazio <me...@rogerignazio.com> wrote:

> I recently posted a similar question to the user list to better understand
> how slave recovery works. You can read the thread at
> http://mail-archives.apache.org/mod_mbox/mesos-user/201506.mbox/browser
>
> Quoting Vinod from that thread:
>
> > 'recovery_timeout' was added to make sure that if a slave
> > is down for a long time (>10 mins), the executors commit suicide. It is
> > better for the executor/task to die than keep running because the
> > framework might have already launched another replica of that instance.
> > This was not tied to the 75s timeout (hard coded) because it is possible
> > for a slave to successfully re-register with a master after 75s (e.g.,
> > both master and slave are down for 5 min).
>
> Adam also replied with a ticket that will allow the 75s ping timeout to be
> configurable in future releases (appears to be 0.23.0 and onward):
> https://issues.apache.org/jira/browse/MESOS-2110
>
> As for shutting down the mesos-slave daemon, I (personally) don't think
> that it's really a problem. There are various tools (Puppet, Monit, etc)
> that allow you to define a service's desired state.
>
> -- Roger

Re: It's more user friendly that masters don't shut down the slave when re-registering timeout

Posted by Roger Ignazio <me...@rogerignazio.com>.

I recently posted a similar question to the user list to better understand
how slave recovery works. You can read the thread at
http://mail-archives.apache.org/mod_mbox/mesos-user/201506.mbox/browser

Quoting Vinod from that thread:

> 'recovery_timeout' was added to make sure that if a slave
> is down for a long time (>10 mins), the executors commit suicide. It is
> better for the executor/task to die than keep running because the
> framework might have already launched another replica of that instance.
> This was not tied to the 75s timeout (hard coded) because it is possible
> for a slave to successfully re-register with a master after 75s (e.g.,
> both master and slave are down for 5 min).
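
To make the distinction between the two timeouts concrete, here is a rough
Python sketch. It is a toy model, not Mesos code: the names Outcome and
outcome are made up for illustration, and it assumes the master outage
overlaps the start of the slave outage, so the master can only count the
slave as missing while the master itself is up. The 75s and 15 min values
are the ones quoted in this thread.

```python
from dataclasses import dataclass
from datetime import timedelta

# Values quoted in the thread; neither constant is read from Mesos itself.
HEALTH_CHECK_TIMEOUT = timedelta(seconds=75)  # master side, hard coded at the time
RECOVERY_TIMEOUT = timedelta(minutes=15)      # slave flag recovery_timeout default


@dataclass(frozen=True)
class Outcome:
    executors_self_terminate: bool
    master_shuts_down_slave: bool


def outcome(slave_down: timedelta, master_down: timedelta = timedelta(0)) -> Outcome:
    """Toy model of the two independent timers (hypothetical helper).

    Assumption: the master is down for the first `master_down` of the
    slave's outage and cannot observe the slave during that time.
    """
    unobserved = min(slave_down, master_down)
    observed = slave_down - unobserved
    return Outcome(
        # Executors give up once the slave has been gone longer than recovery_timeout.
        executors_self_terminate=slave_down > RECOVERY_TIMEOUT,
        # The master acts only on the time it actually saw the slave missing.
        master_shuts_down_slave=observed > HEALTH_CHECK_TIMEOUT,
    )


# Vinod's example: master and slave both down for 5 minutes.
print(outcome(timedelta(minutes=5), timedelta(minutes=5)))
```

Under these assumptions, the 5 minute case ends with neither timer firing,
even though 5 minutes is well past 75s. That is the situation in which
executors kept alive beyond 75s are still useful.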

Adam also replied with a ticket that will allow the 75s ping timeout to be
configurable in future releases (appears to be 0.23.0 and onward):
https://issues.apache.org/jira/browse/MESOS-2110
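
For context, a small sketch of where a 75s figure can come from. This
assumes the timeout is the product of a ping timeout and a maximum number
of missed pings, with 15s and 5 as the values; those two numbers and the
function name are assumptions for illustration and are not stated in this
thread, so check the master flags of your release before relying on them.

```python
from datetime import timedelta


def health_check_timeout(ping_timeout: timedelta, max_ping_timeouts: int) -> timedelta:
    """Total time before the master gives up on a slave that never answers.

    Hypothetical helper: assumes the master waits `ping_timeout` for each
    ping and tolerates `max_ping_timeouts` consecutive misses.
    """
    return ping_timeout * max_ping_timeouts


# Assumed values: 15s per ping, 5 missed pings.
default_timeout = health_check_timeout(timedelta(seconds=15), 5)
print(default_timeout.total_seconds())  # 75.0
```

If both values become configurable, raising either one lengthens the
window in which a slow slave can still re-register.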

As for shutting down the mesos-slave daemon, I (personally) don't think
that it's really a problem. There are various tools (Puppet, Monit, etc)
that allow you to define a service's desired state.

-- Roger