You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@mesos.apache.org by "Niklas Quarfot Nielsen (JIRA)" <ji...@apache.org> on 2014/09/19 02:17:33 UTC
[jira] [Created] (MESOS-1817) Completed tasks remains in TASK_RUNNING when framework is disconnected

Niklas Quarfot Nielsen created MESOS-1817:
---------------------------------------------

             Summary: Completed tasks remains in TASK_RUNNING when framework is disconnected
                 Key: MESOS-1817
                 URL: https://issues.apache.org/jira/browse/MESOS-1817
             Project: Mesos
          Issue Type: Bug
            Reporter: Niklas Quarfot Nielsen


We have run into a problem that cause tasks which completes, when a framework is disconnected and has a fail-over time, to remain in a running state even though the tasks actually finishes. This hogs the cluster and gives users a inconsistent view of the cluster state. Going to the slave, the task is finished. Going to the master, the task is still in a non-terminal state. When the scheduler reattaches or the failover timeout expires, the tasks finishes correctly. The current workflow of this scheduler has a long fail-over timeout, but may on the other hand never reattach.

Here is a test framework we have been able to reproduce the issue with: https://gist.github.com/nqn/9b9b1de9123a6e836f54
It launches many short-lived tasks (1 second sleep) and when killing the framework instance, the master reports the tasks as running even after several minutes: http://cl.ly/image/2R3719461e0t/Screen%20Shot%202014-09-10%20at%203.19.39%20PM.png

When clicking on one of the slaves where, for example, task 49 runs; the slave knows that it completed: http://cl.ly/image/2P410L3m1O1N/Screen%20Shot%202014-09-10%20at%203.21.29%20PM.png

Here is the log of a mesos-local instance where I reproduced it: https://gist.github.com/nqn/f7ee20601199d70787c0 (Here task 10 to 19 are stuck in running state).
There is a lot of output, so here is a filtered log for task 10: https://gist.github.com/nqn/a53e5ea05c5e41cd5a7d

The problem turn out to be an issue with the ack-cycle of status updates:
If the framework disconnects (with a failover timeout set), the status update manage on the slaves will keep trying to send the front of status update stream to the master (which in turn forwards it to the framework). If the first status update after the disconnect is terminal, things work out fine; the master pick the terminal state up, removes the task and release the resources.
If, on the other hand, one non-terminal status is in the stream. The master will never know that the task finished (or failed) before the framework reconnects.

During a discussion on the dev mailing list (http://mail-archives.apache.org/mod_mbox/mesos-dev/201409.mbox/%3cCADKthhAVR5mrq1s9HXw1BB_XFALXWWxjutp7MV4y3wP-Bh=aWg@mail.gmail.com%3e) we enumerated a couple of options to solve this problem.

First off, having two ack-cycles: one between masters and slaves and one between masters and frameworks, would be ideal. We would be able to replay the statuses in order while keeping the master state current. However, this requires us to persist the master state in a replicated storage.

As a first pass, we can make sure that the tasks caught in a running state doesn't hog the cluster when completed and the framework being disconnected.

Here is a proof-of-concept to work out of: https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/

A new (optional) field have been added to the internal status update message:
https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/messages/messages.proto#L68

Which makes it possible for the status update manager to set the field, if the latest status was terminal: https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/slave/status_update_manager.cpp#L501

I added a test which should high-light the issue as well:
https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/tests/fault_tolerance_tests.cpp#L2478

I would love some input on the approach before moving on.
There are rough edges in the PoC which (of course) should be addressed before bringing it for up review.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)