Posted to issues@mesos.apache.org by "Vinod Kone (JIRA)" <ji...@apache.org> on 2015/12/02 21:55:11 UTC

[jira] [Commented] (MESOS-4049) Allow user to control behavior of partitioned agents/tasks

    [ https://issues.apache.org/jira/browse/MESOS-4049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036599#comment-15036599 ] 

Vinod Kone commented on MESOS-4049:
-----------------------------------

+100

> Allow user to control behavior of partitioned agents/tasks
> ----------------------------------------------------------
>
>                 Key: MESOS-4049
>                 URL: https://issues.apache.org/jira/browse/MESOS-4049
>             Project: Mesos
>          Issue Type: Improvement
>          Components: master, slave
>            Reporter: Neil Conway
>              Labels: mesosphere
>
> At present, if an agent is partitioned away from the master, the master waits for a period of time (see MESOS-4048) before deciding that the agent is dead. Then it marks the agent as lost, sends {{TASK_LOST}} messages for all the tasks running on the agent, and instructs the agent to shut down.
> Although this behavior is desirable for some/many users, it is not ideal for everyone. For example:
> * Some users might want to aggressively start a new replacement task (e.g., after one or two ping timeouts are missed); then when the old copy of the task comes back, they might want to make an intelligent decision about how to reconcile this situation (e.g., kill old, kill new, allow both to continue running).
> * Some frameworks might want different behavior from other frameworks, or to treat some tasks differently from other tasks. For example, if a task has a huge amount of state that would need to be regenerated to spin up another instance, the user might want to wait longer before starting a new task to increase the chance that the old task will reappear.
> To do this, we'd need to change task state so that a task can go from {{RUNNING}} to a new state (say {{UNKNOWN}} or {{WANDERING}}), and then from that state back to {{RUNNING}} (or perhaps we could keep the current "mark-lost-after-timeout" behavior as an option, in which case {{UNKNOWN}} could also transition to {{LOST}}). The agent would also keep its old {{slaveId}} when it reconnects.
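
The state transitions proposed above could be sketched roughly as follows. This is only an illustration of the idea in the issue description, not Mesos code: the names TaskState, ALLOWED, transition, on_agent_unreachable, on_agent_reconnected and the lost_timeout parameter are all hypothetical, and UNKNOWN is just one of the candidate names the issue floats for the new state.

```python
from enum import Enum


class TaskState(Enum):
    RUNNING = "RUNNING"
    UNKNOWN = "UNKNOWN"  # hypothetical new state (the issue also suggests WANDERING)
    LOST = "LOST"


# Transitions allowed under the proposal (an assumption drawn from the issue text):
# RUNNING -> UNKNOWN, UNKNOWN -> RUNNING, and optionally UNKNOWN -> LOST.
ALLOWED = {
    TaskState.RUNNING: {TaskState.UNKNOWN},
    TaskState.UNKNOWN: {TaskState.RUNNING, TaskState.LOST},
    TaskState.LOST: set(),
}


def transition(current, target):
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


def on_agent_unreachable(state, seconds_unreachable, lost_timeout=None):
    """Update a task's state while its agent is partitioned from the master.

    lost_timeout=None models "never mark lost"; a number keeps the current
    mark-lost-after-timeout behavior as an opt-in policy.
    """
    if state is TaskState.RUNNING:
        return transition(state, TaskState.UNKNOWN)
    if (
        state is TaskState.UNKNOWN
        and lost_timeout is not None
        and seconds_unreachable >= lost_timeout
    ):
        return transition(state, TaskState.LOST)
    return state


def on_agent_reconnected(state):
    """A task in UNKNOWN goes back to RUNNING when its agent reappears."""
    if state is TaskState.UNKNOWN:
        return transition(state, TaskState.RUNNING)
    return state
```

A framework that wants to wait for a task with a large amount of state would leave lost_timeout unset (or large), while one that prefers the existing behavior would pass a short timeout. A terminal LOST state stays terminal in this sketch.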



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)