Posted to issues@mesos.apache.org by "Benjamin Mahler (JIRA)" <ji...@apache.org> on 2015/12/10 02:51:11 UTC

[jira] [Assigned] (MESOS-4106) The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures.

     [ https://issues.apache.org/jira/browse/MESOS-4106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler reassigned MESOS-4106:
--------------------------------------

    Assignee: Benjamin Mahler

[~haosdent@gmail.com]: From my testing so far, yes. I will send a fix and re-enable the test from MESOS-1613.

> The health checker may fail to inform the executor to kill an unhealthy task after max_consecutive_failures.
> ------------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-4106
>                 URL: https://issues.apache.org/jira/browse/MESOS-4106
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.20.0, 0.20.1, 0.21.1, 0.21.2, 0.22.1, 0.22.2, 0.23.0, 0.23.1, 0.24.0, 0.24.1, 0.25.0
>            Reporter: Benjamin Mahler
>            Assignee: Benjamin Mahler
>            Priority: Blocker
>
> This was reported by [~tan] experimenting with health checks. Many tasks were launched with the following health check, taken from the container stdout/stderr:
> {code}
> Launching health check process: /usr/local/libexec/mesos/mesos-health-check --executor=(1)@127.0.0.1:39629 --health_check_json={"command":{"shell":true,"value":"false"},"consecutive_failures":1,"delay_seconds":0.0,"grace_period_seconds":1.0,"interval_seconds":1.0,"timeout_seconds":1.0} --task_id=sleepy-2
> {code}
> This should have led to all tasks getting killed due to {{\-\-consecutive_failures}} being set. However, only some tasks get killed, while others remain running.
> It turns out that the health check binary does a {{send}} and promptly exits. Unfortunately, this may lead to a message drop since libprocess may not have sent this message over the socket by the time the process exits.
> We work around this in the command executor with a manual sleep, which has been around since the svn days. See [here|https://github.com/apache/mesos/blob/0.14.0/src/launcher/executor.cpp#L288-L290].
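
The race described above can be illustrated with a small toy model. This is only a sketch under stated assumptions and is not libprocess or Mesos code: AsyncSender, report_unhealthy, and the message fields are hypothetical names made up for illustration. It assumes an asynchronous send that merely enqueues the message for a background writer, so a process that exits right after send() may drop it, whereas waiting for the queue to drain before exiting does not. The actual fix in Mesos may take a different approach.

```python
import queue
import threading


class AsyncSender:
    """Toy model of an asynchronous send (hypothetical, not libprocess)."""

    def __init__(self, delivered):
        self._outbox = queue.Queue()
        self._delivered = delivered  # stands in for the remote executor
        # Daemon thread: it is abandoned if the process exits, which is
        # how a queued message can be lost in this model.
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def _drain(self):
        while True:
            message = self._outbox.get()
            self._delivered.append(message)  # models the socket write
            self._outbox.task_done()

    def send(self, message):
        # Returns immediately; nothing has been written yet.
        self._outbox.put(message)

    def flush(self):
        # Block until every queued message has been written.
        self._outbox.join()


def report_unhealthy(task_id):
    """Send a kill request and wait for it to be written before returning."""
    delivered = []
    sender = AsyncSender(delivered)
    sender.send({"task_id": task_id, "healthy": False, "kill_task": True})
    sender.flush()  # without this, exiting here could drop the message
    return delivered
```

In this model, waiting on an explicit flush plays the role that the manual sleep plays in the command executor, except that it waits for the write itself rather than for a fixed amount of time.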



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)