
Posted to issues@mesos.apache.org by "Joseph Wu (JIRA)" <ji...@apache.org> on 2016/05/13 02:46:12 UTC

[jira] [Created] (MESOS-5378) Terminating a framework during master failover leads to orphaned tasks

Joseph Wu created MESOS-5378:
--------------------------------

             Summary: Terminating a framework during master failover leads to orphaned tasks
                 Key: MESOS-5378
                 URL: https://issues.apache.org/jira/browse/MESOS-5378
             Project: Mesos
          Issue Type: Bug
          Components: framework, master
    Affects Versions: 0.28.1, 0.27.2
            Reporter: Joseph Wu


Repro steps:

1) Setup:
{code}
bin/mesos-master.sh --work_dir=/tmp/master
bin/mesos-slave.sh --work_dir=/tmp/slave --master=localhost:5050
src/mesos-execute --checkpoint --command="sleep 1000" --master=localhost:5050 --name="test"
{code}

2) Kill all three processes from (1), in the order they were started (master, then agent, then framework).

3) Restart the master and agent.  Do not restart the framework.

Result)
* The agent will reconnect to an orphaned task.
* The Web UI will report no memory usage.
* {{curl localhost:5050/metrics/snapshot}} will say:  {{"master/mem_used": 128,}}
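
To check the metric without reading the whole snapshot by eye, something like the following could be used. This is only a sketch: the helper name {{mem_used_from_snapshot}} is hypothetical, and the sample body is trimmed to the single key quoted above rather than copied from a real master.

```python
import json


def mem_used_from_snapshot(snapshot_text):
    """Return the master/mem_used value from a /metrics/snapshot JSON body.

    Hypothetical helper; returns None if the key is absent.
    """
    metrics = json.loads(snapshot_text)
    return metrics.get("master/mem_used")


# Sample body containing only the key quoted in this report (not real output).
sample = '{"master/mem_used": 128}'
print(mem_used_from_snapshot(sample))  # prints 128
```

A non-zero value here, with no registered framework to account for it, is the symptom of the orphaned task.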

Cause) 
When a framework registers with the master, it provides a {{failover_timeout}}, in case the framework disconnects.  If the framework disconnects and does not reconnect within this {{failover_timeout}}, the master will kill all tasks belonging to the framework.

However, the master does not persist this {{failover_timeout}} across master failover.  The master will "forget" about a framework if both of the following happen:
1) The master dies before {{failover_timeout}} passes.
2) The framework dies while the master is dead.

When the master comes back up, the agent will re-register.  The agent will report the orphaned task(s).  Because the master failed over, it does not know these tasks are orphans (i.e. it thinks the frameworks might re-register).
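
The sequence above can be modelled roughly as follows. This is a toy model, not Mesos code: the class {{ForgetfulMaster}} and all of its methods are hypothetical names, and time is a plain number of seconds.

```python
class ForgetfulMaster:
    """Toy master that keeps failover timers in memory only (hypothetical)."""

    def __init__(self):
        self.timeouts = {}   # framework_id -> failover_timeout in seconds
        self.deadlines = {}  # framework_id -> time at which its tasks are killed
        self.tasks = {}      # framework_id -> task names reported by agents

    def register_framework(self, framework_id, failover_timeout):
        self.timeouts[framework_id] = failover_timeout

    def framework_disconnected(self, framework_id, now):
        # The timer only starts if this master instance sees the disconnect.
        self.deadlines[framework_id] = now + self.timeouts[framework_id]

    def agent_reregistered(self, framework_id, task_names):
        self.tasks.setdefault(framework_id, []).extend(task_names)

    def expire(self, now):
        """Kill and return tasks of frameworks whose deadline has passed."""
        killed = []
        for framework_id, deadline in list(self.deadlines.items()):
            if now >= deadline:
                killed.extend(self.tasks.pop(framework_id, []))
                del self.deadlines[framework_id]
        return killed


# The framework registers with the first master, which then dies.
old_master = ForgetfulMaster()
old_master.register_framework("fw", failover_timeout=10)

# Failover: the new master starts with empty in-memory state. The framework
# died while no master was running, so no disconnect is ever observed.
new_master = ForgetfulMaster()
new_master.agent_reregistered("fw", ["test"])

print(new_master.expire(now=1_000_000))  # prints []
print(new_master.tasks)                  # prints {'fw': ['test']}
```

No matter how much time passes, the new master has no deadline for the framework, so the task is never cleaned up.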

Proposed solution)
The master should save the {{FrameworkID}} and {{failover_timeout}} in the registry.  Upon recovery, the master should resume the {{failover_timeout}} timers.
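
A minimal sketch of that idea, again as a toy model rather than Mesos code. {{Registry}} and {{RecoveringMaster}} are hypothetical names, and the sketch assumes that "resume" means restarting the full {{failover_timeout}} from the moment of recovery; continuing a partially elapsed timer would need extra persisted state.

```python
class Registry:
    """Stand-in for persistent state that survives master failover."""

    def __init__(self):
        self.framework_timeouts = {}  # FrameworkID -> failover_timeout (seconds)


class RecoveringMaster:
    """Toy master that persists timeouts and restarts timers on recovery."""

    def __init__(self, registry, now):
        self.registry = registry
        self.tasks = {}
        # Assumption: on recovery, every recorded framework gets a fresh,
        # full-length timer starting at the recovery time.
        self.deadlines = {
            framework_id: now + timeout
            for framework_id, timeout in registry.framework_timeouts.items()
        }

    def register_framework(self, framework_id, failover_timeout):
        self.registry.framework_timeouts[framework_id] = failover_timeout
        # A connected framework has no pending failover timer.
        self.deadlines.pop(framework_id, None)

    def agent_reregistered(self, framework_id, task_names):
        self.tasks.setdefault(framework_id, []).extend(task_names)

    def expire(self, now):
        """Kill and return tasks of frameworks that did not come back in time."""
        killed = []
        for framework_id, deadline in list(self.deadlines.items()):
            if now >= deadline:
                killed.extend(self.tasks.pop(framework_id, []))
                del self.deadlines[framework_id]
                self.registry.framework_timeouts.pop(framework_id, None)
        return killed


registry = Registry()
old_master = RecoveringMaster(registry, now=0)
old_master.register_framework("fw", failover_timeout=10)

# Failover at t=100: the registry survives, the framework never returns.
new_master = RecoveringMaster(registry, now=100)
new_master.agent_reregistered("fw", ["test"])

early = new_master.expire(now=105)  # timer still running
late = new_master.expire(now=110)   # timer expired, orphan is cleaned up
print(early, late)                  # prints [] ['test']
```

If the framework does re-register before the deadline, {{register_framework}} cancels the timer and its tasks are left alone.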



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)