Posted to issues@mesos.apache.org by "Neil Conway (JIRA)" <ji...@apache.org> on 2016/12/14 21:18:58 UTC

[jira] [Commented] (MESOS-6785) CHECK failure on duplicate task IDs

    [ https://issues.apache.org/jira/browse/MESOS-6785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15749508#comment-15749508 ] 

Neil Conway commented on MESOS-6785:
------------------------------------

Notes:

* We can't prevent task IDs from being reused unless we do something drastic, such as constraining how frameworks are allowed to pick task IDs or having the master assign task IDs.
* Hence, we must either tolerate multiple tasks with the same ID (on different agents) or terminate all but one of them. (Note that there might be many duplicates of a given task ID on different partitioned agents -- we'd want the master to eventually terminate all but one of them, assuming they all eventually re-register.)
* Allowing multiple tasks with the same ID on different agents seems like a breaking semantic change -- frameworks probably use task IDs as unique identifiers, for good reason (see the sketch after this list).
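
To make the last point concrete, here is a minimal C++ sketch (illustrative only; the names and data structures are mine, not the master's) contrasting the current bookkeeping, keyed by task ID alone, with a hypothetical model keyed by (agent ID, task ID):

{noformat}
// Minimal, self-contained sketch (not the master's actual data structures)
// of the two bookkeeping models. In the current model, a framework's live
// tasks are keyed by task ID alone, so a duplicate ID from a second agent
// trips a check; tolerating duplicates would mean keying by
// (agent ID, task ID), i.e. giving up globally unique task IDs.

#include <cassert>
#include <map>
#include <string>
#include <utility>

struct Task { std::string taskId; std::string agentId; };

struct FrameworkState {
  // Current model: at most one live task per task ID.
  std::map<std::string, Task> tasksById;

  // Hypothetical relaxed model: keyed by (agent ID, task ID).
  std::map<std::pair<std::string, std::string>, Task> tasksByAgentAndId;

  void addTaskCurrentModel(const Task& task) {
    // Mirrors the spirit of the CHECK in master.hpp: a second task with
    // the same ID aborts the process.
    assert(tasksById.count(task.taskId) == 0 && "Duplicate task ID");
    tasksById[task.taskId] = task;
  }

  void addTaskRelaxedModel(const Task& task) {
    // Duplicates on different agents coexist, but task IDs are no longer
    // unique within the framework.
    tasksByAgentAndId[{task.agentId, task.taskId}] = task;
  }
};

int main() {
  FrameworkState framework;
  framework.addTaskCurrentModel({"X", "A1"});
  // framework.addTaskCurrentModel({"X", "A2"});  // would assert, like the master CHECK

  framework.addTaskRelaxedModel({"X", "A1"});
  framework.addTaskRelaxedModel({"X", "A2"});     // allowed, but breaks uniqueness
  return 0;
}
{noformat}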

Assuming we want to terminate all but one of the copies of the task:

* When an agent re-registers and we discover that it is running a task whose ID is already in use by another task on a different agent, we want to kill one of the tasks (it is hard to guarantee we always kill the "oldest" or "newest" copy, since the agents might re-register in arbitrary order). It is also unclear how to signal this situation to the framework: if we report "task X has been killed", the framework won't be able to tell which instance of the task "X" refers to.
* The task we want to kill may have generated one or more status updates at the agent while it was partitioned. We don't want to propagate those status updates to the framework (to avoid confusing it).
* To deal with the status update problem, we could:
 ## Send a special "kill" signal to the agent (likely as part of the {{SlaveReregisteredMessage}}); this would notify the agent to terminate the task without generating any status updates for it, and to drop any pending status updates without waiting for ACKs (a rough sketch of this option follows the list).
*** In this scheme, the master would never add the duplicate task on the re-registering agent to its in-memory state; this avoids the {{CHECK}} failure.
*** Because the kill signal would be delivered as part of the re-registration message, I think we could be sure that the master wouldn't receive any status updates for the task in the meantime (but if it did, we could arrange for the master to drop them).
 ## Or, we could have the master ACK and drop the resulting status updates from the agent, without passing them along to the framework.
*** This might be challenging, because the master might "forget" (due to master failover) that the copy of the task on the agent is "bad" and should be terminated in this special manner. So we could end up in a situation in which _some_ of the status updates for one copy of the task are dropped; then the master fails over, and after re-registration a _different_ copy of the task is picked to be killed, so we'd effectively have silently dropped some of the status updates from the "legitimate" copy of the task.
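
To make option 1 above concrete, here is a rough C++ sketch. The names ({{handleReregistration}}, {{ReportedTask}}, and so on) are hypothetical and this is not the actual Mesos re-registration code path; it just shows the master partitioning a re-registering agent's reported tasks into those it admits and those it asks the agent to kill silently (no status updates generated, pending updates dropped):

{noformat}
// Hypothetical sketch of option 1: duplicates are never added to the
// master's in-memory state, so the CHECK cannot fire, and the "kill
// silently" set could be sent back to the agent (e.g. as part of the
// SlaveReregisteredMessage).

#include <set>
#include <string>
#include <vector>

struct ReportedTask { std::string taskId; };

struct ReregistrationDecision {
  std::vector<ReportedTask> admit;         // add to master state as usual
  std::vector<ReportedTask> killSilently;  // terminate without status updates
};

// 'knownTaskIds' stands in for the task IDs the master already tracks for
// this framework on other (non-partitioned) agents.
ReregistrationDecision handleReregistration(
    const std::set<std::string>& knownTaskIds,
    const std::vector<ReportedTask>& reportedTasks) {
  ReregistrationDecision decision;
  for (const ReportedTask& task : reportedTasks) {
    if (knownTaskIds.count(task.taskId) > 0) {
      // Duplicate of a task already running elsewhere: instruct the agent
      // to terminate it and to drop any pending status updates for it.
      decision.killSilently.push_back(task);
    } else {
      decision.admit.push_back(task);
    }
  }
  return decision;
}

int main() {
  // Agent re-registers reporting tasks "X" and "Y"; the master already
  // knows a live "X" on another agent.
  ReregistrationDecision d = handleReregistration({"X"}, {{"X"}, {"Y"}});
  // d.killSilently holds "X", d.admit holds "Y".
  return 0;
}
{noformat}

Since the decision is made while processing the re-registration itself, the duplicate never reaches the master's task map, which is what avoids the {{CHECK}} failure.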

Note that even if we fix the master crash, this situation is likely to be problematic for frameworks. For example, suppose the framework launches task X on agent A1, then task X on agent A2, and then the framework itself fails. When it reconnects, it finds a single copy of X running -- it could be _either_ the X on A1 or the X on A2. Unless the framework also remembers the agent ID where it launched each task, it can't determine which "X" is currently running. (And if we require frameworks to identify tasks via the pair <task ID, agent ID>, we might as well just declare that task IDs are no longer globally unique and be done with it.)
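
For illustration, a tiny sketch (the names are illustrative, not a Mesos API) of the bookkeeping a framework would be forced into if task IDs were not unique -- tracking launched tasks by the pair <task ID, agent ID>:

{noformat}
// If task IDs are not unique, the framework itself must remember the agent
// ID it launched each task on; a bare "task X is running" can then match
// more than one entry and cannot be resolved without the agent ID.

#include <map>
#include <string>
#include <utility>

struct LaunchedTask { std::string command; };

int main() {
  // Keyed by (taskId, agentId).
  std::map<std::pair<std::string, std::string>, LaunchedTask> launched;
  launched[{"X", "A1"}] = {"sleep 100"};
  launched[{"X", "A2"}] = {"sleep 100"};

  // A reconciliation result of just "X is RUNNING" matches both entries;
  // only <"X", "A1"> or <"X", "A2"> is unambiguous.
  return 0;
}
{noformat}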

> CHECK failure on duplicate task IDs
> -----------------------------------
>
>                 Key: MESOS-6785
>                 URL: https://issues.apache.org/jira/browse/MESOS-6785
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>            Reporter: Neil Conway
>            Assignee: Neil Conway
>              Labels: mesosphere
>
> The master crashes with a CHECK failure in the following scenario:
> # Framework launches task X on agent A1. The framework may or may not be partition-aware; let's assume it is not partition-aware.
> # A1 becomes partitioned from the master.
> # Framework launches task X on agent A2.
> # Master fails over.
> # Agents A1 and A2 both re-register with the master. Because the master has failed over, the task on A1 is _not_ terminated ("non-strict registry semantics").
> This results in two running tasks with the same ID, which causes a master {{CHECK}} failure among other badness:
> {noformat}
> master.hpp:2299] Check failed: !tasks.contains(task->task_id()) Duplicate task b88153a2-571a-41e7-9e9b-c297fef4f3cd of framework eaef1879-8cc9-412f-928d-86c9925a7abb-0000
> {noformat}


