Posted to issues@mesos.apache.org by "Qian Zhang (JIRA)" <ji...@apache.org> on 2019/04/01 12:32:02 UTC

[jira] [Comment Edited] (MESOS-9501) Mesos executor fails to terminate and gets stuck after agent host reboot.

    [ https://issues.apache.org/jira/browse/MESOS-9501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16806707#comment-16806707 ] 

Qian Zhang edited comment on MESOS-9501 at 4/1/19 12:31 PM:
------------------------------------------------------------

This issue can actually happen even without an agent reboot (highly unlikely but technically possible):
 # Use `mesos-execute` to launch a command task (e.g., `sleep 60`) with checkpointing enabled.
 # Stop the agent process.
 # After the task finishes, wait for a new process to reuse its pid. This can be simulated by manually changing the task's checkpointed pid in the meta dir to the pid of an existing process.
 # Start the agent process.

`mesos-execute` then receives a TASK_RUNNING status update (see below) for the command task, even though the task has already finished. This is clearly a bug: Mesos considers a finished task to be running.
{code:java}
Received status update TASK_RUNNING for task 'test'
  message: 'Unreachable agent re-reregistered'
  source: SOURCE_MASTER
  reason: REASON_AGENT_REREGISTERED{code}
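
The reproduction steps above can be approximated outside Mesos with a minimal Python sketch. It assumes a POSIX system. The `task.pid` file and the `pid_exists()`, `recover()`, and `simulate()` helpers are hypothetical stand-ins for the agent's checkpoint and recovery logic, not Mesos code. The sketch only illustrates why a bare pid cannot distinguish the original task from an unrelated process that holds the same pid.

```python
import os
import subprocess
import sys
import tempfile


def pid_exists(pid: int) -> bool:
    """Return True if some process currently holds this pid (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only probes; nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the pid exists but belongs to another user
    return True


def recover(pid_file: str) -> str:
    """Hypothetical recovery step that trusts the checkpointed pid as-is."""
    with open(pid_file) as f:
        pid = int(f.read())
    return "TASK_RUNNING" if pid_exists(pid) else "TASK_FINISHED"


def simulate() -> str:
    with tempfile.TemporaryDirectory() as meta_dir:
        pid_file = os.path.join(meta_dir, "task.pid")  # hypothetical name

        # Launch a short task and checkpoint its pid.
        task = subprocess.Popen([sys.executable, "-c", "pass"])
        with open(pid_file, "w") as f:
            f.write(str(task.pid))

        # The task finishes while the "agent" is stopped.
        task.wait()

        # Simulate pid reuse by overwriting the checkpoint with the pid
        # of an unrelated live process (here, this script itself).
        with open(pid_file, "w") as f:
            f.write(str(os.getpid()))

        # The "agent" starts again and recovers from the checkpoint.
        return recover(pid_file)


if __name__ == "__main__":
    print(simulate())
```

In this sketch `simulate()` returns `TASK_RUNNING` although the task has already exited, because the recovery step only sees that the checkpointed pid is in use.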


> Mesos executor fails to terminate and gets stuck after agent host reboot.
> -------------------------------------------------------------------------
>
>                 Key: MESOS-9501
>                 URL: https://issues.apache.org/jira/browse/MESOS-9501
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>    Affects Versions: 1.5.1, 1.6.1, 1.7.0
>            Reporter: Meng Zhu
>            Assignee: Qian Zhang
>            Priority: Critical
>             Fix For: 1.4.3, 1.5.2, 1.6.2, 1.7.1, 1.8.0
>
>
> When an agent host reboots, all of its containers are gone but the agent will still try to recover from its checkpointed state after reboot.
> The agent will soon discover that all the cgroup hierarchies are gone and assume (correctly) that the containers are destroyed.
> However, when trying to terminate the executor, the agent will first try to wait for the exit status of its container:
> https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L2631
> The agent does so by calling `waitpid` on the checkpointed child process pid. If a new process with the same pid is spawned after the agent host reboots, the agent will wait for the wrong process. This wait can remain stuck until that unrelated process exits; see `ReaperProcess::wait()`: https://github.com/apache/mesos/blob/master/3rdparty/libprocess/src/reap.cpp#L88-L114
> This blocks the executor termination as well as future task status updates (e.g., the master might still think the task is running).
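
The waiting behavior described above can be illustrated with a short Python sketch, assuming a POSIX system. This is not the libprocess implementation; `wait_for_exit()` and its `timeout` and `interval` parameters are hypothetical. It shows one plausible shape of such a wait: `waitpid` fails with `ECHILD` for a pid that is not a child of the caller, so the waiter can only fall back to probing whether the pid still exists, and that probe keeps succeeding for as long as any process holds the pid.

```python
import os
import time


def wait_for_exit(pid: int, timeout: float, interval: float = 0.05) -> bool:
    """Poll until `pid` is gone or `timeout` seconds elapse.

    Returns True if the process was observed to be gone, False if the
    wait gave up. A real reaper would typically have no timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            reaped, _ = os.waitpid(pid, os.WNOHANG)
            if reaped == pid:
                return True  # our own child exited and was reaped
        except ChildProcessError:
            # ECHILD: `pid` is not our child (for example, it was
            # checkpointed by an earlier incarnation of this process).
            # Fall back to checking whether the pid is still in use.
            try:
                os.kill(pid, 0)  # signal 0 only probes
            except ProcessLookupError:
                return True  # nothing holds the pid any more
            except PermissionError:
                pass  # the pid is in use by another user's process
        time.sleep(interval)
    return False


if __name__ == "__main__":
    # Stand-in for a recycled pid: a live process that is not our child.
    print(wait_for_exit(os.getpid(), timeout=0.2))
```

Here the script's own pid stands in for a recycled pid: it is alive and is not a child of the caller, so `wait_for_exit(os.getpid(), timeout=0.2)` returns `False`. Without the timeout, the wait would continue until that unrelated process exits.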


