Posted to issues@mesos.apache.org by "Will Rouesnel (JIRA)" <ji...@apache.org> on 2016/10/11 04:25:20 UTC
[jira] [Created] (MESOS-6358) Add watchdog timeout/action for mesos tasks which do not exit
Will Rouesnel created MESOS-6358:
------------------------------------
Summary: Add watchdog timeout/action for mesos tasks which do not exit
Key: MESOS-6358
URL: https://issues.apache.org/jira/browse/MESOS-6358
Project: Mesos
Issue Type: Improvement
Components: slave
Reporter: Will Rouesnel
Priority: Minor
When running with the Docker containerizer, we've observed a scenario where a subprocess of the Docker container becomes a zombie due to a kernel bug (i.e. it is completely unkillable).
The effect was that Mesos kept reporting the task as running via its API, but did not report it as existing when Marathon made calls to delete it, because the actual docker-runc/docker-container process never exited (none of the child processes exited, since they were waiting on the misbehaving subprocess).
Mesos should include parameters to deal with this situation. I would propose --task_kill_watchdog_timeout and --task_kill_watchdog_binary.
The idea is that if a task still exists once the timeout has elapsed *after* the hard kill signal has been sent, Mesos executes the watchdog binary.
Such a watchdog would allow alerting on an exceptional situation (if possible) and taking action to ensure worker nodes stay up on average (e.g. if the condition is encountered, crash the node and let a BMC watchdog reboot it).
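
To make the proposed behaviour concrete, here is a rough sketch of the watchdog logic in Python. It is illustrative only and is not Mesos code: the function names, the polling approach and the injectable `exists`/`runner` hooks are all assumptions made for this example, and a real implementation would live in the agent and track the container rather than a bare PID.

```python
import os
import subprocess
import time


def process_exists(pid):
    """Return True if a process with this PID is still in the process table."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def watchdog_after_kill(pid, timeout_secs, watchdog_cmd,
                        exists=process_exists, poll_secs=0.5,
                        runner=subprocess.run):
    """Call this after the hard kill signal has been sent to `pid`.

    Polls until `timeout_secs` has elapsed. If the process disappears in
    that time, nothing happens and False is returned. If it is still
    present, the (hypothetical) watchdog command is run and True is returned.
    """
    deadline = time.monotonic() + timeout_secs
    while time.monotonic() < deadline:
        if not exists(pid):
            return False
        time.sleep(poll_secs)
    if not exists(pid):
        return False
    # The task outlived the timeout: hand over to the operator's action.
    runner(watchdog_cmd, check=False)
    return True
```

In this sketch, `timeout_secs` stands in for the proposed --task_kill_watchdog_timeout and `watchdog_cmd` for the proposed --task_kill_watchdog_binary. What the binary does (alert, crash the node, and so on) is left entirely to the operator.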
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)