Posted to issues@mesos.apache.org by "Will Rouesnel (JIRA)" <ji...@apache.org> on 2016/10/11 04:25:20 UTC

[jira] [Created] (MESOS-6358) Add watchdog timeout/action for mesos tasks which do not exit

Will Rouesnel created MESOS-6358:
------------------------------------

             Summary: Add watchdog timeout/action for mesos tasks which do not exit
                 Key: MESOS-6358
                 URL: https://issues.apache.org/jira/browse/MESOS-6358
             Project: Mesos
          Issue Type: Improvement
          Components: slave
            Reporter: Will Rouesnel
            Priority: Minor


When running with the docker containerizer, we've observed a scenario where a subprocess of the docker container becomes a zombie due to a kernel bug (i.e. it is completely unkillable).

The effect was that Mesos kept reporting the task as running via its API, but not as existing in response to calls to delete it (made by Marathon), because the actual docker-runc/docker-container process never exited (none of the child processes exited, as they were waiting on the misbehaving subprocess).

Mesos should include a parameter to deal with this situation. I would propose --task_kill_watchdog_timeout and --task_kill_watchdog_binary.

The idea would be that if a task still exists beyond the length of the timeout *after* the hard kill signal has been sent, then Mesos executes the watchdog binary action.
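
To illustrate the intended behaviour, here is a rough sketch in Python. It is not Mesos code and only models the proposal under assumptions: run_kill_watchdog, task_exists, and poll_interval_secs are hypothetical names, timeout_secs stands in for the proposed --task_kill_watchdog_timeout, and watchdog_argv stands in for the proposed --task_kill_watchdog_binary. It assumes the caller has already sent the hard kill signal.

```python
import subprocess
import time
from typing import Callable, Optional, Sequence


def run_kill_watchdog(
    task_exists: Callable[[], bool],
    timeout_secs: float,
    watchdog_argv: Sequence[str],
    poll_interval_secs: float = 0.5,
) -> Optional[int]:
    """Model of the proposed watchdog; call it after the hard kill was sent.

    task_exists is a hypothetical callback reporting whether the task's
    container process is still present. Returns None if the task went away
    within the timeout; otherwise runs the watchdog command and returns
    its exit code.
    """
    deadline = time.monotonic() + timeout_secs
    while True:
        if not task_exists():
            # The kill worked within the timeout; nothing to do.
            return None
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        time.sleep(min(poll_interval_secs, remaining))

    # The task outlived the timeout: hand over to the operator-supplied
    # watchdog command (e.g. raise an alert, or crash the node).
    return subprocess.run(list(watchdog_argv), check=False).returncode
```

Polling for existence rather than waiting on the process is deliberate in this sketch, since in the scenario described the process never exits and a blocking wait would hang as well.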

The purpose of such a mechanism would be to allow alerting on an exceptional situation (if possible) and to take action so that worker nodes stay up on average (i.e. if the situation is encountered, crash the node and let a BMC watchdog reboot it).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)