Posted to issues@mesos.apache.org by "Andrei Budnik (Jira)" <ji...@apache.org> on 2020/02/03 15:54:00 UTC
[jira] [Commented] (MESOS-9847) Docker executor doesn't wait for status updates to be ack'd before shutting down.

    [ https://issues.apache.org/jira/browse/MESOS-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17029051#comment-17029051 ] 

Andrei Budnik commented on MESOS-9847:
--------------------------------------

1.5.x
{code:java}
commit ff98f12a50a56c13688b87068a116d1d08142f49
Author: Andrei Budnik <ab...@apache.org>
Date:   Wed Jan 29 13:35:02 2020 +0100

    Changed termination logic of the Docker executor.

    Previously, the Docker executor terminated itself after a task's
    container had terminated. This could lead to termination of the
    executor before processing of a terminal status update by the agent.
    In order to mitigate this issue, the executor slept for one second to
    give a chance to send all status updates and receive all status update
    acknowledgments before terminating itself. This might have led to
    various race conditions in some circumstances (e.g., on a slow host).
    This patch terminates the Docker executor after receiving a terminal
    status update acknowledgment. Also, this patch increases the timeout
    from one second to one minute for fail-safety.

    Review: https://reviews.apache.org/r/72055
{code}

1.6.x
{code:java}
commit f511f25be9d850ee9b65fc3ec5f54d149beb2f19
Author: Andrei Budnik <ab...@apache.org>
Date:   Wed Jan 29 13:35:02 2020 +0100

    Changed termination logic of the Docker executor.

    Previously, the Docker executor terminated itself after a task's
    container had terminated. This could lead to termination of the
    executor before processing of a terminal status update by the agent.
    In order to mitigate this issue, the executor slept for one second to
    give a chance to send all status updates and receive all status update
    acknowledgments before terminating itself. This might have led to
    various race conditions in some circumstances (e.g., on a slow host).
    This patch terminates the Docker executor after receiving a terminal
    status update acknowledgment. Also, this patch increases the timeout
    from one second to one minute for fail-safety.

    Review: https://reviews.apache.org/r/72055
{code}

1.7.x
{code:java}
commit 6a7da284d1b89f8a144ed2f896f005a5ee9d4aea
Author: Andrei Budnik <ab...@apache.org>
Date:   Wed Jan 29 13:35:02 2020 +0100

    Changed termination logic of the Docker executor.

    Previously, the Docker executor terminated itself after a task's
    container had terminated. This could lead to termination of the
    executor before processing of a terminal status update by the agent.
    In order to mitigate this issue, the executor slept for one second to
    give a chance to send all status updates and receive all status update
    acknowledgments before terminating itself. This might have led to
    various race conditions in some circumstances (e.g., on a slow host).
    This patch terminates the Docker executor after receiving a terminal
    status update acknowledgment. Also, this patch increases the timeout
    from one second to one minute for fail-safety.

    Review: https://reviews.apache.org/r/72055
{code}

1.8.x
{code:java}
commit 1bd0b37a7e522d63319db426dae7068b901eaea6
Author: Andrei Budnik <ab...@apache.org>
Date:   Wed Jan 29 13:35:02 2020 +0100

    Changed termination logic of the Docker executor.

    Previously, the Docker executor terminated itself after a task's
    container had terminated. This could lead to termination of the
    executor before processing of a terminal status update by the agent.
    In order to mitigate this issue, the executor slept for one second to
    give a chance to send all status updates and receive all status update
    acknowledgments before terminating itself. This might have led to
    various race conditions in some circumstances (e.g., on a slow host).
    This patch terminates the Docker executor after receiving a terminal
    status update acknowledgment. Also, this patch increases the timeout
    from one second to one minute for fail-safety.

    Review: https://reviews.apache.org/r/72055
{code}

1.9.x
{code:java}
commit 3d60cba39d0377a7dc19b4c47f3bb0807418fe50
Author: Andrei Budnik <ab...@apache.org>
Date:   Wed Jan 29 13:35:02 2020 +0100

    Changed termination logic of the Docker executor.

    Previously, the Docker executor terminated itself after a task's
    container had terminated. This could lead to termination of the
    executor before processing of a terminal status update by the agent.
    In order to mitigate this issue, the executor slept for one second to
    give a chance to send all status updates and receive all status update
    acknowledgments before terminating itself. This might have led to
    various race conditions in some circumstances (e.g., on a slow host).
    This patch terminates the Docker executor after receiving a terminal
    status update acknowledgment. Also, this patch increases the timeout
    from one second to one minute for fail-safety.

    Review: https://reviews.apache.org/r/72055
{code}
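
For readers who want to see the shape of the new termination logic described in the commit message, here is a minimal, self-contained sketch. It is not the Mesos patch itself (see the review link for that): `TerminalAckWaiter`, `acknowledged()`, and `wait()` are hypothetical names, and a plain condition variable stands in for the executor's actual event handling. The idea is only that the executor blocks until the terminal status update is acknowledged, with a timeout as a fail-safe, instead of sleeping for a fixed second.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical helper, not part of Mesos: records whether the terminal
// status update has been acknowledged and lets a caller wait for that.
class TerminalAckWaiter
{
public:
  // To be called when the acknowledgment for the terminal status
  // update arrives.
  void acknowledged()
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      acked_ = true;
    }
    cv_.notify_all();
  }

  // Blocks until the acknowledgment has arrived or `timeout` elapses.
  // Returns true if acknowledged, false if the fail-safe timeout fired.
  template <typename Rep, typename Period>
  bool wait(const std::chrono::duration<Rep, Period>& timeout)
  {
    std::unique_lock<std::mutex> lock(mutex_);
    return cv_.wait_for(lock, timeout, [this] { return acked_; });
  }

private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool acked_ = false;
};
```

In this sketch, the stop path would call `wait(std::chrono::minutes(1))` where the old code called `os::sleep(Seconds(1))`, and the acknowledgment handler would call `acknowledged()`. The call returns as soon as the acknowledgment is seen, so the one-minute value only matters when the acknowledgment never arrives.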

> Docker executor doesn't wait for status updates to be ack'd before shutting down.
> ---------------------------------------------------------------------------------
>
>                 Key: MESOS-9847
>                 URL: https://issues.apache.org/jira/browse/MESOS-9847
>             Project: Mesos
>          Issue Type: Bug
>          Components: executor
>            Reporter: Meng Zhu
>            Assignee: Andrei Budnik
>            Priority: Major
>              Labels: containerization
>             Fix For: 1.5.4, 1.6.3, 1.7.3, 1.8.2, 1.10, 1.9.1
>
>
> The Docker executor doesn't wait for pending status updates to be acknowledged before shutting down; instead, it sleeps for one second and then terminates:
> {noformat}
>   void _stop()
>   {
>     // A hack for now ... but we need to wait until the status update
>     // is sent to the slave before we shut ourselves down.
>     // TODO(tnachen): Remove this hack and also the same hack in the
>     // command executor when we have the new HTTP APIs to wait until
>     // an ack.
>     os::sleep(Seconds(1));
>     driver.get()->stop();
>   }
> {noformat}
> This results in a race between the task status update (e.g. TASK_FINISHED) and the executor's exit. If the executor exits first, the agent generates a `TASK_FAILED` status update by itself, leading to the confusing case where the agent handles two different terminal status updates.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)