Posted to dev@mesos.apache.org by Megha Sharma <ms...@apple.com> on 2016/11/15 15:37:21 UTC

MESOS-6233 Allow agents to re-register post a host reboot

Hi All,

We have been working on the design for restartable tasks (MESOS-3545), and allowing agents to recover and re-register after a reboot is a prerequisite for that.
Today the agent does not recover its state, which includes its SlaveID, after a host reboot: it short-circuits recovery upon discovering the reboot and registers with the master as a new agent. With Partition Awareness, the Mesos master already allows agents that have failed the master's health-check pings (unreachable agents) to re-register with it and reconcile their tasks/executors. The executors on a rebooted host are terminated anyway, so there is no harm in letting such an agent recover and re-register with the master using its old SlaveID.
We would like to hear from the folks here whether you see any operational concerns with letting agents recover after a host reboot.
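
To make the proposed change concrete, here is a minimal sketch of the decision described above. It is not the agent's actual code: the function names and the `recover_after_reboot` switch are hypothetical, and the only external fact assumed is that Linux exposes a per-boot identifier at /proc/sys/kernel/random/boot_id.

```python
from pathlib import Path
from typing import Optional

# Linux exposes a per-boot UUID here; its value changes on every reboot.
KERNEL_BOOT_ID = Path("/proc/sys/kernel/random/boot_id")


def read_boot_id(path: Path) -> Optional[str]:
    """Return the stripped contents of a boot_id file, or None if it is missing."""
    try:
        return path.read_text().strip()
    except FileNotFoundError:
        return None


def host_rebooted(checkpointed: Optional[str], current: Optional[str]) -> bool:
    """Assume a reboot only when both IDs are known and they differ."""
    return checkpointed is not None and current is not None and checkpointed != current


def choose_agent_id(checkpointed_agent_id: Optional[str],
                    rebooted: bool,
                    recover_after_reboot: bool) -> Optional[str]:
    """Return the ID to re-register with, or None to register as a new agent.

    recover_after_reboot is a hypothetical switch: False models the behavior
    described as current (skip recovery after a reboot), True models the proposal.
    """
    if checkpointed_agent_id is None:
        return None  # nothing checkpointed, so this is a new agent
    if rebooted and not recover_after_reboot:
        return None  # current behavior: short-circuit recovery
    return checkpointed_agent_id  # proposed behavior: keep the old SlaveID
```

With the switch off, a rebooted host always yields a new agent; with it on, the checkpointed SlaveID is reused and the master can reconcile the (already terminated) executors.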

MESOS JIRA: https://issues.apache.org/jira/browse/MESOS-6223

Many Thanks
Megha Sharma

Re: MESOS-6233 Allow agents to re-register post a host reboot

Posted by Joris Van Remoortere <jo...@mesosphere.io>.

> So one thing that was brought up during offline conversations was that if
> the host reboot is associated with hardware change (e.g., a new memory
> stick):
>
>    - With the change: the agent could run into incompatible agent info
>    due to resource change and flap
>    <https://github.com/apache/mesos/blob/58f63747f185995d7f9cbfca9d240e2d60053184/src/slave/slave.cpp#L5280>
>    indefinitely until the operator intervenes.

Can you elaborate on this?

Would you run into this because you don't explicitly specify the memory
resource in the agent configuration? I think we highly recommend that you
do this in production to prevent accidental incompatibility of resources
even without an actual hardware change. Historically there were some issues
reported where the kernel reported a slightly different amount of memory
after reboot.
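
As an illustration of why auto-detected memory can trip the compatibility check while an explicitly configured value does not, here is a small sketch. The helper names are hypothetical, the string format is modeled on the agent's `--resources` flag (scalar values only), and strict equality is an assumption made for the example rather than a statement of what the agent actually checks.

```python
def parse_resources(spec: str) -> dict:
    """Parse a 'name:value;name:value' string of scalar resources into a dict."""
    resources = {}
    for item in spec.split(";"):
        if not item.strip():
            continue  # tolerate a trailing ';'
        name, value = item.split(":", 1)
        resources[name.strip()] = float(value)
    return resources


def changed_resources(checkpointed: dict, current: dict) -> dict:
    """Return {name: (old, new)} for every resource whose value differs."""
    names = sorted(set(checkpointed) | set(current))
    return {
        name: (checkpointed.get(name), current.get(name))
        for name in names
        if checkpointed.get(name) != current.get(name)
    }


# Auto-detected values: the kernel reports slightly less memory after reboot.
before_reboot = parse_resources("cpus:8;mem:15951")
after_reboot = parse_resources("cpus:8;mem:15949")

# Explicitly configured values stay identical across reboots.
pinned = parse_resources("cpus:8;mem:15000")
```

Here `changed_resources(before_reboot, after_reboot)` reports a difference in `mem`, whereas comparing `pinned` with itself reports nothing, which is the point of specifying the resource explicitly.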

—
*Joris Van Remoortere*
Mesosphere

On Mon, Nov 28, 2016 at 6:09 PM, Yan Xu <xu...@apple.com> wrote:

> So one thing that was brought up during offline conversations was that if
> the host reboot is associated with hardware change (e.g., a new memory
> stick):
>
>
>    - Currently: the agent would skip the recovery (and the chance of
>    running into incompatible agent info) and register as a new agent.
>    - With the change: the agent could run into incompatible agent info
>    due to resource change and flap
>    <https://github.com/apache/mesos/blob/58f63747f185995d7f9cbfca9d240e2d60053184/src/slave/slave.cpp#L5280>
>    indefinitely until the operator intervenes.
>
>
> To mitigate this and maintain the current behavior, we can have the agent
> remove `rm -f <work_dir>/meta/slaves/latest` automatically upon recovery
> failure but only after the host has rebooted. This way the agent can
> restart as a new agent without operator intervention.
>
> Any thoughts?
>
> BTW this speaks to the need for MESOS-1739.
>
> Yan
>
> On Tue, Nov 15, 2016 at 7:37 AM, Megha Sharma <ms...@apple.com> wrote:
>
>> Hi All,
>>
>> We have been working on the design for Restartable tasks (
>> MESOS-3545) and allowing agents to recover and re-register post reboot is a
>> pre-requisite for that.
>> Agent today doesn’t recover its state that includes its SlaveID post a
>> host reboot, it short-circuits the recovery upon discovering the reboot and
>> registers with the master as a new agent. With Partition Awareness, the
>> mesos master even allows agents which have failed master’s health check
>> pings (unreachable agents) to re-register with it and reconcile the
>> tasks/executors. The executors on a rebooted host are anyway terminated so
>> there is no harm in letting such an agent recover and re-register with the
>> master using its old SlaveID.
>> Would like to hear from the folks here if you see any operational
>> concerns with letting the agents recover post a host reboot.
>>
>> MESOS JIRA: https://issues.apache.org/jira/browse/MESOS-6223
>>
>> Many Thanks
>> Megha Sharma
>>
>>
>>
>

Re: MESOS-6233 Allow agents to re-register post a host reboot

Posted by haosdent <ha...@gmail.com>.

> we can have the agent remove `rm -f <work_dir>/meta/slaves/latest`
> automatically upon recovery failure but only after the host has rebooted.

This sounds dangerous. When the difference in AgentInfo is caused by an
operator's typo, I think the operator would prefer to correct it and try
to start the agent again, rather than have the state removed automatically.

But if we decide to do that, please make sure to announce this behavior
change to the mailing lists in a separate email. Thank you!

On Wed, Nov 30, 2016 at 6:24 AM, tommy xiao <xi...@gmail.com> wrote:

> agree with james's options.
>
> 2016-11-30 0:48 GMT+08:00 James Peach <jo...@gmail.com>:
>
> >
> > > On Nov 28, 2016, at 6:09 PM, Yan Xu <xu...@apple.com> wrote:
> > >
> > > So one thing that was brought up during offline conversations was that
> > if the host reboot is associated with hardware change (e.g., a new memory
> > stick):
> > >
> > >       • Currently: the agent would skip the recovery (and the chance of
> > running into incompatible agent info) and register as a new agent.
> > >       • With the change: the agent could run into incompatible agent
> > info due to resource change and flap indefinitely until the operator
> > intervenes.
> > >
> > > To mitigate this and maintain the current behavior, we can have the
> > agent remove `rm -f <work_dir>/meta/slaves/latest` automatically upon
> > recovery failure but only after the host has rebooted. This way the agent
> > can restart as a new agent without operator intervention.
> > >
> > > Any thoughts?
> >
> > I still think you need a mechanism for the master/agent to tell you
> > whether it will honor the restart policy. Without this, you have to lock
> > the framework to a Mesos version.
> >
> > An empty RestartPolicy is also problematic since it precludes using
> > RestartPolicy in pods. If you later want to restart a task inside a pod
> but
> > not across agent restarts you would have no way to express that.
> >
> > J
>
>
>
>
> --
> Deshi Xiao
> Twitter: xds2000
> E-mail: xiaods(AT)gmail.com
>



-- 
Best Regards,
Haosdent Huang

Re: MESOS-6233 Allow agents to re-register post a host reboot

Posted by tommy xiao <xi...@gmail.com>.

I agree with James's opinions.

2016-11-30 0:48 GMT+08:00 James Peach <jo...@gmail.com>:

>
> > On Nov 28, 2016, at 6:09 PM, Yan Xu <xu...@apple.com> wrote:
> >
> > So one thing that was brought up during offline conversations was that
> if the host reboot is associated with hardware change (e.g., a new memory
> stick):
> >
> >       • Currently: the agent would skip the recovery (and the chance of
> running into incompatible agent info) and register as a new agent.
> >       • With the change: the agent could run into incompatible agent
> info due to resource change and flap indefinitely until the operator
> intervenes.
> >
> > To mitigate this and maintain the current behavior, we can have the
> agent remove `rm -f <work_dir>/meta/slaves/latest` automatically upon
> recovery failure but only after the host has rebooted. This way the agent
> can restart as a new agent without operator intervention.
> >
> > Any thoughts?
>
> I still think you need a mechanism for the master/agent to tell you
> whether it will honor the restart policy. Without this, you have to lock
> the framework to a Mesos version.
>
> An empty RestartPolicy is also problematic since it precludes using
> RestartPolicy in pods. If you later want to restart a task inside a pod but
> not across agent restarts you would have no way to express that.
>
> J




-- 
Deshi Xiao
Twitter: xds2000
E-mail: xiaods(AT)gmail.com

Re: MESOS-6233 Allow agents to re-register post a host reboot

Posted by James Peach <jo...@gmail.com>.

> On Nov 28, 2016, at 6:09 PM, Yan Xu <xu...@apple.com> wrote:
> 
> So one thing that was brought up during offline conversations was that if the host reboot is associated with hardware change (e.g., a new memory stick):
> 
> 	• Currently: the agent would skip the recovery (and the chance of running into incompatible agent info) and register as a new agent.
> 	• With the change: the agent could run into incompatible agent info due to resource change and flap indefinitely until the operator intervenes.
> 
> To mitigate this and maintain the current behavior, we can have the agent remove `rm -f <work_dir>/meta/slaves/latest` automatically upon recovery failure but only after the host has rebooted. This way the agent can restart as a new agent without operator intervention. 
> 
> Any thoughts?

I still think you need a mechanism for the master/agent to tell you whether it will honor the restart policy. Without this, you have to lock the framework to a Mesos version.

An empty RestartPolicy is also problematic since it precludes using RestartPolicy in pods. If you later want to restart a task inside a pod but not across agent restarts you would have no way to express that.

J

Re: MESOS-6233 Allow agents to re-register post a host reboot

Posted by Yan Xu <xu...@apple.com>.

So one thing that was brought up during offline conversations was that if
the host reboot is associated with hardware change (e.g., a new memory
stick):


   - Currently: the agent would skip the recovery (and the chance of
   running into incompatible agent info) and register as a new agent.
   - With the change: the agent could run into incompatible agent info due
   to resource change and flap
   <https://github.com/apache/mesos/blob/58f63747f185995d7f9cbfca9d240e2d60053184/src/slave/slave.cpp#L5280>
   indefinitely until the operator intervenes.


To mitigate this and maintain the current behavior, we can have the agent
automatically run `rm -f <work_dir>/meta/slaves/latest` upon recovery
failure, but only after the host has rebooted. This way the agent can
restart as a new agent without operator intervention.
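
A minimal sketch of that guard, assuming the agent already knows whether the host rebooted and whether recovery failed. The function name and both boolean inputs are hypothetical; only the path layout comes from the command above.

```python
from pathlib import Path


def reset_latest_after_failed_recovery(work_dir: str,
                                       rebooted: bool,
                                       recovery_failed: bool) -> bool:
    """Remove <work_dir>/meta/slaves/latest only when recovery failed after a reboot.

    Returns True if the entry was removed, False otherwise.
    """
    if not (rebooted and recovery_failed):
        return False  # leave the state alone so the operator can investigate
    latest = Path(work_dir) / "meta" / "slaves" / "latest"
    try:
        # unlink() removes a file or a symlink; for a symlink the target is untouched.
        latest.unlink()
    except FileNotFoundError:
        return False  # nothing to remove, same outcome as `rm -f`
    return True
```

Keeping both conditions in the guard means a recovery failure without a reboot still stops the agent and waits for the operator, which preserves today's behavior in that case.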

Any thoughts?

BTW this speaks to the need for MESOS-1739.

Yan

On Tue, Nov 15, 2016 at 7:37 AM, Megha Sharma <ms...@apple.com> wrote:

> Hi All,
>
> We have been working on the design for Restartable tasks (
> MESOS-3545) and allowing agents to recover and re-register post reboot is a
> pre-requisite for that.
> Agent today doesn’t recover its state that includes its SlaveID post a
> host reboot, it short-circuits the recovery upon discovering the reboot and
> registers with the master as a new agent. With Partition Awareness, the
> mesos master even allows agents which have failed master’s health check
> pings (unreachable agents) to re-register with it and reconcile the
> tasks/executors. The executors on a rebooted host are anyway terminated so
> there is no harm in letting such an agent recover and re-register with the
> master using its old SlaveID.
> Would like to hear from the folks here if you see any operational concerns
> with letting the agents recover post a host reboot.
>
> MESOS JIRA: https://issues.apache.org/jira/browse/MESOS-6223
>
> Many Thanks
> Megha Sharma
>
>
>

Re: MESOS-6233 Allow agents to re-register post a host reboot

Posted by X Brick <ng...@gmail.com>.

Here is a hacky way to work around it in the current version: back up the
boot_id file (it should exist at $work_dir/meta/boot_id) when the Mesos
agent (or slave) starts, and restore it from the backup when the agent/slave
restarts, so that the slave ID does not change. It works fine for our cluster.
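
The two steps of this workaround could be scripted roughly as follows. This is only a sketch of the description above: the function names and the backup location are hypothetical, and whether the agent keeps its ID afterwards depends on the agent version and is not verified here.

```python
import shutil
from pathlib import Path


def backup_boot_id(work_dir: str, backup_path: str) -> bool:
    """Copy <work_dir>/meta/boot_id to backup_path; return False if it is missing."""
    src = Path(work_dir) / "meta" / "boot_id"
    if not src.is_file():
        return False
    shutil.copy2(src, backup_path)
    return True


def restore_boot_id(work_dir: str, backup_path: str) -> bool:
    """Copy the backup over <work_dir>/meta/boot_id; return False if no backup exists."""
    backup = Path(backup_path)
    if not backup.is_file():
        return False
    dst = Path(work_dir) / "meta" / "boot_id"
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(backup, dst)
    return True
```

One would call `backup_boot_id` after the agent has started and `restore_boot_id` before the agent is started again; the backup file should live outside the agent's work directory so that it survives cleanup.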

I hope this helps.

2016-11-15 23:37 GMT+08:00 Megha Sharma <ms...@apple.com>:

> Hi All,
>
> We have been working on the design for Restartable tasks (
> MESOS-3545) and allowing agents to recover and re-register post reboot is a
> pre-requisite for that.
> Agent today doesn’t recover its state that includes its SlaveID post a
> host reboot, it short-circuits the recovery upon discovering the reboot and
> registers with the master as a new agent. With Partition Awareness, the
> mesos master even allows agents which have failed master’s health check
> pings (unreachable agents) to re-register with it and reconcile the
> tasks/executors. The executors on a rebooted host are anyway terminated so
> there is no harm in letting such an agent recover and re-register with the
> master using its old SlaveID.
> Would like to hear from the folks here if you see any operational concerns
> with letting the agents recover post a host reboot.
>
> MESOS JIRA: https://issues.apache.org/jira/browse/MESOS-6223
>
> Many Thanks
> Megha Sharma
>
>
>