Posted to issues@mesos.apache.org by "Neil Conway (JIRA)" <ji...@apache.org> on 2016/11/29 22:38:59 UTC

[jira] [Updated] (MESOS-6619) Duplicate elements in "completed_tasks"

     [ https://issues.apache.org/jira/browse/MESOS-6619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neil Conway updated MESOS-6619:
-------------------------------
    Description: 
Scenario:

# A framework starts a non-partition-aware task T on agent A.
# Agent A is partitioned. Task T is marked as a "completed task" in the {{Framework}} struct of the master, as part of {{Framework::removeTask}}.
# Agent A re-registers with the master. The tasks running on A are re-added to their respective frameworks on the master as running tasks.
# In {{Master::\_reregisterSlave}}, the master sends a {{ShutdownFrameworkMessage}} for all non-partition-aware frameworks running on the agent. The master then calls {{removeTask}} for each task managed by one of these frameworks; this in turn calls {{Framework::removeTask}}, which adds _another_ copy of the task to {{completed_tasks}}. Because {{completed_tasks}} does not attempt to detect or suppress duplicates, the collection ends up with two elements for the same task.

Similar problems occur when a partition-aware task is running on a partitioned agent that re-registers: the result is a task in the {{tasks}} list _and_ a task in the {{completed_tasks}} list.
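
The scenario above can be reproduced with a toy model. This is only a sketch under stated assumptions: the {{Framework}} struct below is a hypothetical, heavily simplified stand-in (string task IDs, no agents or status updates) and is not the actual Mesos master code; {{replayScenario}} is an invented helper that replays the four steps.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the master's Framework struct.
// Assumption: tasks are identified by a plain string ID.
struct Framework {
  std::map<std::string, std::string> tasks;   // task ID -> state
  std::vector<std::string> completedTasks;    // models "completed_tasks"

  void addTask(const std::string& id) { tasks[id] = "TASK_RUNNING"; }

  // Models the behaviour described in the scenario: every removal appends
  // to the completed list, with no check for an existing entry.
  void removeTask(const std::string& id) {
    tasks.erase(id);
    completedTasks.push_back(id);
  }
};

// Replays the scenario and returns how many entries for task T
// end up in the completed list.
std::size_t replayScenario() {
  Framework framework;
  framework.addTask("T");     // Step 1: task T starts on agent A.
  framework.removeTask("T");  // Step 2: A is partitioned; T marked completed.
  framework.addTask("T");     // Step 3: A re-registers; T re-added as running.
  framework.removeTask("T");  // Step 4: framework shut down; T removed again.
  return framework.completedTasks.size();
}
```

In this model {{replayScenario()}} returns 2, matching the two elements described above. Stopping after step 3 instead leaves T in both {{tasks}} and {{completedTasks}}, which corresponds to the partition-aware case.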

Possible fixes/changes:

* Adding a task to the {{completed_tasks}} list when an agent becomes partitioned is debatable; certainly for partition-aware tasks, the task is not "completed". We might consider adding an "{{unreachable_tasks}}" list to the HTTP endpoints.
* Regardless of whether we continue to use {{completed_tasks}} or add a new collection, we should ensure the consistency of that data structure after agent re-registration.
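
One way to picture the consistency requirement in the second bullet is sketched below. It is an illustration only, not a proposed patch: {{ConsistentFramework}} and {{replayWithConsistencyCheck}} are hypothetical names, and the choice to drop the stale completed entry when a task is re-added is an assumption made for this sketch rather than the approach taken in Mesos.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical variant of a framework record that keeps the running and
// completed collections mutually consistent.
struct ConsistentFramework {
  std::set<std::string> tasks;              // IDs of running tasks
  std::vector<std::string> completedTasks;  // models "completed_tasks"

  // Assumption for this sketch: when a task is (re-)added, for example
  // after agent re-registration, any stale completed entry is dropped so
  // the task never appears in both collections.
  void addTask(const std::string& id) {
    completedTasks.erase(
        std::remove(completedTasks.begin(), completedTasks.end(), id),
        completedTasks.end());
    tasks.insert(id);
  }

  // Only records a completion for a task that is currently tracked.
  void removeTask(const std::string& id) {
    if (tasks.erase(id) == 0) {
      return;  // Unknown task: nothing to record.
    }
    completedTasks.push_back(id);
  }
};

// Replays the partition / re-registration sequence against the variant
// and returns the number of completed entries for task T.
std::size_t replayWithConsistencyCheck() {
  ConsistentFramework framework;
  framework.addTask("T");     // Task starts.
  framework.removeTask("T");  // Agent partitioned.
  framework.addTask("T");     // Agent re-registers.
  framework.removeTask("T");  // Framework shut down on the agent.
  return framework.completedTasks.size();
}
```

With this variant the same sequence leaves a single completed entry, and a re-registered task that keeps running appears only in {{tasks}}. A separate "{{unreachable_tasks}}" collection, as suggested in the first bullet, would need the same kind of cleanup on re-registration.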


> Duplicate elements in "completed_tasks"
> ---------------------------------------
>
>                 Key: MESOS-6619
>                 URL: https://issues.apache.org/jira/browse/MESOS-6619
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>            Reporter: Neil Conway
>            Assignee: Neil Conway
>              Labels: mesosphere
