Posted to issues@mesos.apache.org by "Neil Conway (JIRA)" <ji...@apache.org> on 2016/11/21 17:47:58 UTC

[jira] [Created] (MESOS-6619) Duplicate elements in "completed_tasks"

Neil Conway created MESOS-6619:
----------------------------------

             Summary: Duplicate elements in "completed_tasks"
                 Key: MESOS-6619
                 URL: https://issues.apache.org/jira/browse/MESOS-6619
             Project: Mesos
          Issue Type: Bug
          Components: master
            Reporter: Neil Conway
            Assignee: Neil Conway


Scenario:

# Framework starts non-partition-aware task T on agent A
# Agent A is partitioned. Task T is marked as a "completed task" in the {{Framework}} struct of the master, as part of {{Framework::removeTask}}.
# Agent A re-registers with the master. The tasks running on A are re-added to their respective frameworks on the master as running tasks.
# In {{Master::_reregisterSlave}}, the master sends a {{ShutdownFrameworkMessage}} for each non-partition-aware framework running on the agent. The master then calls {{removeTask}} for each task managed by one of these frameworks, which in turn calls {{Framework::removeTask}} and adds _another_ entry for the same task to {{completed_tasks}}. Because {{completed_tasks}} does not attempt to detect or suppress duplicates, the collection ends up with two elements for task T.
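
The four steps above can be replayed against a toy model. This is only a sketch: the {{Framework}} struct below is a hypothetical, heavily simplified stand-in (task ID strings instead of real task objects, an unbounded vector for the completed list), not the master's actual data structure, and {{replayScenario}} is an illustrative helper name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the master's per-framework bookkeeping.
// The real Mesos struct has many more fields and different signatures.
struct Framework {
  std::map<std::string, std::string> tasks;  // task ID -> agent ID
  std::vector<std::string> completedTasks;   // task IDs, no duplicate check

  void addTask(const std::string& taskId, const std::string& agentId) {
    tasks[taskId] = agentId;
  }

  // Models the behavior described in the scenario: removing a task always
  // appends it to the completed list, whether or not it is already there.
  void removeTask(const std::string& taskId) {
    tasks.erase(taskId);
    completedTasks.push_back(taskId);
  }
};

// Replays steps 1-4 and returns how often task "T" appears in completedTasks.
std::size_t replayScenario() {
  Framework framework;

  framework.addTask("T", "A");     // 1. Framework starts task T on agent A.
  framework.removeTask("T");       // 2. Agent A is partitioned.
  assert(framework.tasks.empty());

  framework.addTask("T", "A");     // 3. Agent A re-registers; T is re-added.
  framework.removeTask("T");       // 4. Framework is shut down on the agent.

  return static_cast<std::size_t>(std::count(
      framework.completedTasks.begin(),
      framework.completedTasks.end(),
      std::string("T")));
}
```

In this model {{replayScenario()}} reports two entries for T, matching the duplicate described in step 4.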

Similar problems occur when a partition-aware task is running on a partitioned agent that re-registers: the same task then appears in both the {{tasks}} list _and_ the {{completed_tasks}} list.

Possible fixes/changes:

* Adding a task to the {{completed_tasks}} list when an agent becomes partitioned is debatable; certainly for partition-aware tasks, the task is not "completed". We might consider adding an "{{unreachable_tasks}}" list to the HTTP endpoints.
* Regardless of whether we continue to use {{completed_tasks}} or add a new collection, we should ensure the consistency of that data structure after agent re-registration.
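
One way to picture the consistency requirement in the second bullet is a toy model in which re-adding a task drops any stale completed entry, plus a check that no task ID is duplicated or present in both collections. This is a sketch under assumptions, not a proposed patch: the simplified {{Framework}} struct, the {{consistent}} check, and the {{replayStaysConsistent}} helper are all hypothetical names, and dropping the stale entry on re-add is just one possible policy.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical simplified model; not the actual Mesos Framework struct.
struct Framework {
  std::map<std::string, std::string> tasks;  // task ID -> agent ID
  std::vector<std::string> completedTasks;   // task IDs

  // Assumed policy: when a task is (re-)added as running, remove any
  // stale entry for the same ID from the completed list.
  void addTask(const std::string& taskId, const std::string& agentId) {
    completedTasks.erase(
        std::remove(completedTasks.begin(), completedTasks.end(), taskId),
        completedTasks.end());
    tasks[taskId] = agentId;
  }

  // Only tasks that are currently tracked can become completed.
  void removeTask(const std::string& taskId) {
    if (tasks.erase(taskId) == 0) {
      return;
    }
    completedTasks.push_back(taskId);
  }

  // True if completedTasks has no duplicates and shares no ID with tasks.
  bool consistent() const {
    std::set<std::string> seen;
    for (const std::string& id : completedTasks) {
      if (!seen.insert(id).second) {
        return false;  // Duplicate completed entry.
      }
      if (tasks.count(id) > 0) {
        return false;  // Task is both running and completed.
      }
    }
    return true;
  }
};

// Replays partition, re-registration, and framework shutdown for task "T",
// checking the invariant after each step.
bool replayStaysConsistent() {
  Framework framework;
  framework.addTask("T", "A");
  framework.removeTask("T");    // Agent A is partitioned.
  assert(framework.consistent());
  framework.addTask("T", "A");  // Agent A re-registers.
  assert(framework.consistent());
  framework.removeTask("T");    // Task removed again after re-registration.
  return framework.consistent() && framework.completedTasks.size() == 1;
}
```

The same check would apply unchanged if the partitioned tasks were kept in a separate collection such as the "{{unreachable_tasks}}" list suggested above; only the container being validated would differ.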


