Posted to issues@mesos.apache.org by "Andrei Budnik (Jira)" <ji...@apache.org> on 2020/02/03 15:55:00 UTC
[jira] [Comment Edited] (MESOS-9847) Docker executor doesn't wait for status updates to be ack'd before shutting down.

    [ https://issues.apache.org/jira/browse/MESOS-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17029047#comment-17029047 ] 

Andrei Budnik edited comment on MESOS-9847 at 2/3/20 3:54 PM:
--------------------------------------------------------------

{code:java}
commit 457c38967bf9a53c1c5cd2743385937a26f413f6
Author: Andrei Budnik <ab...@apache.org>
Date:   Wed Jan 29 13:35:02 2020 +0100

    Changed termination logic of the Docker executor.

    Previously, the Docker executor terminated itself after a task's
    container had terminated. This could lead to termination of the
    executor before processing of a terminal status update by the agent.
    In order to mitigate this issue, the executor slept for one second to
    give a chance to send all status updates and receive all status update
    acknowledgments before terminating itself. This might have led to
    various race conditions in some circumstances (e.g., on a slow host).
    This patch terminates the Docker executor after receiving a terminal
    status update acknowledgment. Also, this patch increases the timeout
    from one second to one minute for fail-safety.

    Review: https://reviews.apache.org/r/72055
{code}


was (Author: abudnik):

{code:java}
commit 683dfc1ffb0b1ca758a07d19ab3badd8cac62dc7
Author: Andrei Budnik <ab...@apache.org>
Date: Wed Jan 29 19:07:50 2020 +0100

Changed termination logic of the default executor.

Previously, the default executor terminated itself after all containers
 had terminated. This could lead to termination of the executor before
 processing of a terminal status update by the agent. In order
 to mitigate this issue, the executor slept for one second to give a
 chance to send all status updates and receive all status update
 acknowledgements before terminating itself. This might have led to
 various race conditions in some circumstances (e.g., on a slow host).
 This patch terminates the default executor if all status updates have
 been acknowledged by the agent and no running containers left.
 Also, this patch increases the timeout from one second to one minute
 for fail-safety.

Review: https://reviews.apache.org/r/72029

commit 457c38967bf9a53c1c5cd2743385937a26f413f6
Author: Andrei Budnik <ab...@apache.org>
Date: Wed Jan 29 13:35:02 2020 +0100

Changed termination logic of the Docker executor.

Previously, the Docker executor terminated itself after a task's
 container had terminated. This could lead to termination of the
 executor before processing of a terminal status update by the agent.
 In order to mitigate this issue, the executor slept for one second to
 give a chance to send all status updates and receive all status update
 acknowledgments before terminating itself. This might have led to
 various race conditions in some circumstances (e.g., on a slow host).
 This patch terminates the Docker executor after receiving a terminal
 status update acknowledgment. Also, this patch increases the timeout
 from one second to one minute for fail-safety.

Review: https://reviews.apache.org/r/72055
{code}

> Docker executor doesn't wait for status updates to be ack'd before shutting down.
> ---------------------------------------------------------------------------------
>
>                 Key: MESOS-9847
>                 URL: https://issues.apache.org/jira/browse/MESOS-9847
>             Project: Mesos
>          Issue Type: Bug
>          Components: executor
>            Reporter: Meng Zhu
>            Assignee: Andrei Budnik
>            Priority: Major
>              Labels: containerization
>             Fix For: 1.5.4, 1.6.3, 1.7.3, 1.8.2, 1.10, 1.9.1
>
>
> The Docker executor doesn't wait for pending status updates to be acknowledged before shutting down; instead, it sleeps for one second and then terminates:
> {noformat}
>   void _stop()
>   {
>     // A hack for now ... but we need to wait until the status update
>     // is sent to the slave before we shut ourselves down.
>     // TODO(tnachen): Remove this hack and also the same hack in the
>     // command executor when we have the new HTTP APIs to wait until
>     // an ack.
>     os::sleep(Seconds(1));
>     driver.get()->stop();
>   }
> {noformat}
> This results in a race between the task status update (e.g. TASK_FINISHED) and the executor exit. If the executor exit is processed first, the agent generates a `TASK_FAILED` status update on its own, leading to the confusing case where the agent handles two different terminal status updates.
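
For illustration only, here is a minimal sketch of the pattern the commit message above describes: block shutdown until the terminal status update is acknowledged, with a timeout as a fail-safe. It is written in standard C++ and does not use the Mesos or libprocess APIs. `TerminalAckGate`, `acknowledged()`, and `waitForAck()` are hypothetical names chosen for this example, not identifiers from the patch.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical helper, not part of Mesos: holds back shutdown until the
// terminal status update has been acknowledged or a timeout expires.
class TerminalAckGate
{
public:
  // Call this when the acknowledgment for the terminal update arrives.
  void acknowledged()
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      acked_ = true;
    }
    cv_.notify_all();
  }

  // Returns true if the acknowledgment arrived before the timeout,
  // false if the fail-safe timeout fired first.
  bool waitForAck(std::chrono::milliseconds timeout)
  {
    std::unique_lock<std::mutex> lock(mutex_);
    return cv_.wait_for(lock, timeout, [this] { return acked_; });
  }

private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool acked_ = false;
};
```

Under these assumptions, a shutdown path would call `waitForAck(std::chrono::minutes(1))` in place of the unconditional one-second sleep and stop the driver once it returns. Unlike a fixed sleep, the wait ends as soon as the acknowledgment arrives, and the timeout only matters when the acknowledgment never comes.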



--
This message was sent by Atlassian Jira
(v8.3.4#803005)