Posted to issues@mesos.apache.org by "Yan Xu (JIRA)" <ji...@apache.org> on 2016/10/06 22:48:20 UTC

[jira] [Comment Edited] (MESOS-6223) Allow agents to re-register post a host reboot

    [ https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15553432#comment-15553432 ] 

Yan Xu edited comment on MESOS-6223 at 10/6/16 10:48 PM:
---------------------------------------------------------

[~neilc] [~vinodkone] I can think of ways to implement restarting tasks post-reboot (MESOS-3545; a design doc will be out soon) via either the approach in this ticket or the one in MESOS-5368, but this one feels simpler. Treating reboot as a special case sounds to me like an optimization that will no longer hold once tasks can be restarted. So the questions are:

1) Should the agent ID *always* change after a reboot?
2) Does the agent ID *ever have to* change when its {{work_dir}} hasn't changed?

1) Sounds like no.

For 2), on the master the only error case where we disallow an agent from reregistering but do allow it to register is [when the agent's IP or hostname has changed|https://github.com/apache/mesos/blob/3902b051f2cff59c55535dae08ebd4223833b0a0/src/master/master.cpp#L5228] (a hostname change already prevents the agent from restarting). I can imagine we'd want to force the agent to get rid of its {{work_dir/<slaves>/slave_id}} but keep the checkpointed resources, etc.?
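
As a rough illustration of the master-side check described above, here is a minimal sketch. It is not the code in master.cpp; the {{AgentInfo}} struct, the {{Decision}} enum and {{onReregister}} are hypothetical names made up for this example, and the real logic handles more state than this.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified view of what the master remembers about an agent.
struct AgentInfo {
  std::string id;        // Agent (slave) ID.
  std::string ip;        // IP the agent registered with.
  std::string hostname;  // Hostname the agent registered with.
};

enum class Decision {
  REREGISTER,  // Accept the agent under its existing ID.
  REFUSE       // Reject; the agent would have to register as a new agent.
};

// Assumption: the only refusal case modelled is an ip or hostname change,
// which is the case the comment above points at.
Decision onReregister(const AgentInfo& known, const AgentInfo& incoming) {
  if (known.ip != incoming.ip || known.hostname != incoming.hostname) {
    return Decision::REFUSE;
  }
  return Decision::REREGISTER;
}
```

Under these assumptions, an agent that comes back with the same ip and hostname is accepted under its old ID, and anything else falls through to a fresh registration.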

To summarize, it seems like we can keep both this ticket and MESOS-5368, but change MESOS-5368 to not change the session ID in the reboot case?

Thoughts?


> Allow agents to re-register post a host reboot
> ----------------------------------------------
>
>                 Key: MESOS-6223
>                 URL: https://issues.apache.org/jira/browse/MESOS-6223
>             Project: Mesos
>          Issue Type: Improvement
>          Components: slave
>            Reporter: Megha
>
> The agent doesn't recover its state after a host reboot; instead it registers with the master and gets a new SlaveID. With partition awareness, agents are now allowed to re-register after they have been marked Unreachable. The executors are terminated on the agent when it reboots anyway, so there is no harm in letting the agent keep its SlaveID, re-register with the master, and reconcile the lost executors. This is a prerequisite for supporting persistent/restartable tasks in Mesos (MESOS-3545).
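
The difference between the current and the proposed agent behaviour in the description above can be sketched roughly as follows. This is only an illustration under stated assumptions: {{CheckpointedState}}, {{StartupPlan}}, {{planStartup}} and the {{keepIdAcrossReboot}} switch are hypothetical names, not part of the Mesos agent.

```cpp
#include <cassert>
#include <string>

// Hypothetical summary of what the agent finds in its work_dir on startup.
struct CheckpointedState {
  std::string slaveId;  // SlaveID from the previous run; empty if none.
  bool rebooted;        // True if the host rebooted since the checkpoint.
};

// Hypothetical description of what the agent does next.
struct StartupPlan {
  bool reregister;              // Re-register with the old ID, or register anew.
  std::string slaveId;          // ID to present; empty when registering anew.
  bool reconcileLostExecutors;  // Report executors killed by the reboot.
};

// keepIdAcrossReboot == false models the behaviour the ticket describes today;
// true models the behaviour the ticket proposes.
StartupPlan planStartup(const CheckpointedState& state, bool keepIdAcrossReboot) {
  if (state.slaveId.empty()) {
    return {false, "", false};  // Nothing checkpointed: plain registration.
  }
  if (state.rebooted && !keepIdAcrossReboot) {
    return {false, "", false};  // Today: state is dropped, new SlaveID issued.
  }
  // Keep the SlaveID; after a reboot the executors are gone and need reconciling.
  return {true, state.slaveId, state.rebooted};
}
```

With the switch on, a rebooted agent keeps its ID and only has to reconcile the executors it lost, which is the outcome the ticket asks for.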


