Posted to issues@mesos.apache.org by "Vinod Kone (JIRA)" <ji...@apache.org> on 2018/01/12 00:22:14 UTC

[jira] [Comment Edited] (MESOS-8125) Agent should properly handle recovering an executor when its pid is reused

    [ https://issues.apache.org/jira/browse/MESOS-8125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16323279#comment-16323279 ] 

Vinod Kone edited comment on MESOS-8125 at 1/12/18 12:22 AM:
-------------------------------------------------------------

Did some investigation with [~xujyan] and [~megha.sharma]

Findings:

This issue doesn't seem to affect the Mesos containerizer, because the launcher's destroy short-circuits when the cgroup doesn't exist.

In the Docker containerizer, we don't realize that a pid is invalid when the corresponding Docker container has exited. We should fix `_recover` to do `container->status.set(None())` when `container->pid` is `None()`.

It also looks like the Docker containerizer doesn't recover the executor pid!?
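
A minimal sketch of the `_recover` change suggested above, not the actual containerizer code. The `Container` struct and `recoverStatus` function are hypothetical stand-ins, and `std::optional` is used in place of Mesos's `Option`/`None()`:

```cpp
#include <optional>
#include <sys/types.h>

// Hypothetical stand-in for the Docker containerizer's per-container state.
struct Container {
  std::optional<pid_t> pid;    // Empty when the Docker container has exited.
  bool statusSet = false;      // True once the status has been completed.
  std::optional<int> status;   // Exit status; empty means "unknown" (None()).
};

// Sketch of the proposed recovery step. Returns true when the container
// still has a pid that the caller should go on to reap.
bool recoverStatus(Container& container) {
  if (!container.pid.has_value()) {
    // No pid: the container has exited, so complete the status with an
    // unknown value instead of waiting on a pid that may have been reused.
    container.status = std::nullopt;
    container.statusSet = true;
    return false;
  }

  // A pid is known; leave the status pending until the process is reaped.
  return true;
}
```

The point of the early return is that recovery never waits on a pid it cannot vouch for; the exited container is completed immediately with an unknown status.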



> Agent should properly handle recovering an executor when its pid is reused
> --------------------------------------------------------------------------
>
>                 Key: MESOS-8125
>                 URL: https://issues.apache.org/jira/browse/MESOS-8125
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Gastón Kleiman
>            Assignee: Megha Sharma
>            Priority: Critical
>
> We know that all executors will be gone once the host on which an agent is running is rebooted, so there's no need to try to recover these executors.
> Trying to recover stopped executors can lead to problems if another process is assigned the same pid that the executor had before the reboot. In this case the agent will unsuccessfully try to reregister with the executor, and then transition it to a {{TERMINATING}} state. The executor will sadly get stuck in that state, and the tasks that it started will get stuck in whatever state they were in at the time of the reboot.
> One way of getting rid of stuck executors is to remove the {{latest}} symlink under {{work_dir/meta/slaves/latest/frameworks/<framework id>/executors/<executor id>/runs}}.
> Here's how to reproduce this issue:
> # Start a task using the Docker containerizer (the same will probably happen with the command executor).
> # Stop the corresponding Mesos agent while the task is running.
> # Change the executor's checkpointed forked pid, which is located in the meta directory, e.g., {{/var/lib/mesos/slave/meta/slaves/latest/frameworks/19faf6e0-3917-48ab-8b8e-97ec4f9ed41e-0001/executors/foo.13faee90-b5f0-11e7-8032-e607d2b4348c/runs/latest/pids/forked.pid}}. I used pid 2, which is normally used by {{kthreadd}}.
> # Reboot the host.
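
The description's observation that no executor survives a host reboot could be turned into a guard along these lines. This is only a sketch under assumptions: it supposes the agent has a boot id saved from before it stopped and compares it with the current one, which on Linux can be read from `/proc/sys/kernel/random/boot_id`. The names `trim`, `readBootId` and `shouldRecoverExecutors` are hypothetical and are not Mesos APIs.

```cpp
#include <algorithm>
#include <cctype>
#include <fstream>
#include <optional>
#include <string>

// Strips leading and trailing whitespace (boot id files end with a newline).
std::string trim(const std::string& s) {
  auto notSpace = [](unsigned char c) { return !std::isspace(c); };
  auto begin = std::find_if(s.begin(), s.end(), notSpace);
  auto end = std::find_if(s.rbegin(), s.rend(), notSpace).base();
  return begin < end ? std::string(begin, end) : std::string();
}

// Reads a boot id from `path`; returns nullopt if the file cannot be read.
std::optional<std::string> readBootId(
    const std::string& path = "/proc/sys/kernel/random/boot_id") {
  std::ifstream file(path);
  std::string line;
  if (!file || !std::getline(file, line)) {
    return std::nullopt;
  }
  return trim(line);
}

// Returns false only when both ids are known and differ, i.e. the host has
// rebooted and every checkpointed executor pid is stale.
bool shouldRecoverExecutors(
    const std::optional<std::string>& checkpointed,
    const std::optional<std::string>& current) {
  if (!checkpointed || !current) {
    // Without both ids a reboot cannot be proven; keep the existing behavior.
    return true;
  }
  return trim(*checkpointed) == trim(*current);
}
```

With such a guard, the reproduction above would skip executor recovery after step 4 instead of trying to reregister with whatever process now owns pid 2.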



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)