Posted to issues@mesos.apache.org by "Benjamin Mahler (JIRA)" <ji...@apache.org> on 2016/10/17 22:58:59 UTC

[jira] [Updated] (MESOS-5396) After failover, master does not remove agents with same UPID

     [ https://issues.apache.org/jira/browse/MESOS-5396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler updated MESOS-5396:
-----------------------------------
    Priority: Critical  (was: Major)

Bumping the priority. Note that this situation can last far longer than the {{--\[agent|slave]_reregister_timeout}} because agent removals are rate limited. In one case, a cluster suffered a large-scale power loss and a large number of agents had to be removed; with the rate limit applied, removing all of them would have taken O(days). If the framework does not provide the SlaveID during explicit reconciliation, no progress can be made until all of the removals complete.
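
To illustrate how rate-limited removal can stretch to days, here is a rough back-of-the-envelope sketch. The agent count (500) and the limit (1 removal per 10 minutes) are hypothetical numbers chosen for illustration, not values from the incident, and {{time_to_remove_all}} is an illustrative helper, not part of Mesos.

```python
from datetime import timedelta


def time_to_remove_all(agents: int, removals: int, per: timedelta) -> timedelta:
    """Rough upper-bound estimate of how long a rate limiter that allows
    `removals` removals per `per` window needs to remove `agents` agents."""
    windows = -(-agents // removals)  # ceiling division
    return windows * per


# Hypothetical inputs: 500 unreachable agents, 1 removal per 10 minutes.
total = time_to_remove_all(agents=500, removals=1, per=timedelta(minutes=10))
print(total)
```

Under these assumed numbers the estimate comes to more than three days, well beyond a 10 minute reregistration timeout.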

If the patch is straightforward enough, backports would be great.

[~neilc] should this be assigned to you still?

> After failover, master does not remove agents with same UPID
> ------------------------------------------------------------
>
>                 Key: MESOS-5396
>                 URL: https://issues.apache.org/jira/browse/MESOS-5396
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>            Reporter: Neil Conway
>            Assignee: Neil Conway
>            Priority: Critical
>              Labels: mesosphere
>
> Scenario:
> * master fails over
> * an agent host is restarted; the agent attempts to register with Mesos using the same UPID as the previous agent instance; this means it will get a new agent ID
> * framework isn't notified about the status of the tasks on the *old* slaveID until the slave_reregister_timeout expires (10 mins)
> This isn't necessarily wrong, but it is suboptimal: when the slave attempts to register with the same UPID that was used by the previous slave instance, we know that a *reregistration* attempt for the old <UPID, slaveID> pair will never be seen. Hence we can declare the old slaveID to be gone-forever and notify frameworks appropriately, without waiting for the full slave_reregister_timeout to expire.
> Note that we already implement the proposed behavior for the case when the master does *not* fail over (https://github.com/apache/mesos/blob/0.28.1/src/master/master.cpp#L4162-L4172).
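
The proposed behavior in the scenario above can be sketched as a toy model. This is a minimal illustration, not the Mesos implementation: the class {{MasterSketch}}, its fields, and the generated agent IDs are all hypothetical, and the real master tracks far more state and also notifies frameworks about the affected tasks.

```python
from dataclasses import dataclass, field


@dataclass
class MasterSketch:
    """Toy model of a master after failover; hypothetical, not Mesos code."""

    # Agents recovered from the registry that have not reregistered yet,
    # keyed by UPID, mapping to the old agent ID.
    recovered: dict = field(default_factory=dict)
    # Agent IDs the master has declared gone.
    removed: list = field(default_factory=list)
    next_id: int = 0

    def register(self, upid: str) -> str:
        """Handle a fresh registration (no agent ID supplied by the agent)."""
        old_id = self.recovered.pop(upid, None)
        if old_id is not None:
            # A new instance now owns this UPID, so the old
            # <UPID, agent ID> pair can never reregister. Remove the old
            # ID right away instead of waiting for the timeout.
            self.removed.append(old_id)
        self.next_id += 1
        return f"agent-{self.next_id}"


# Hypothetical UPID and IDs for illustration.
master = MasterSketch(recovered={"slave(1)@10.0.0.5:5051": "agent-old"})
new_id = master.register("slave(1)@10.0.0.5:5051")
```

In this sketch the old ID lands in {{removed}} as soon as the restarted agent registers, which is the point at which frameworks could be told that its tasks are lost.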



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)