Posted to issues@mesos.apache.org by "Andrei Budnik (Jira)" <ji...@apache.org> on 2020/02/03 15:55:00 UTC
[jira] [Comment Edited] (MESOS-9847) Docker executor doesn't wait
for status updates to be ack'd before shutting down.
[ https://issues.apache.org/jira/browse/MESOS-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17029047#comment-17029047 ]
Andrei Budnik edited comment on MESOS-9847 at 2/3/20 3:54 PM:
--------------------------------------------------------------
{code:java}
commit 457c38967bf9a53c1c5cd2743385937a26f413f6
Author: Andrei Budnik <ab...@apache.org>
Date: Wed Jan 29 13:35:02 2020 +0100
Changed termination logic of the Docker executor.
Previously, the Docker executor terminated itself after a task's
container had terminated. This could lead to termination of the
executor before processing of a terminal status update by the agent.
In order to mitigate this issue, the executor slept for one second to
give a chance to send all status updates and receive all status update
acknowledgments before terminating itself. This might have led to
various race conditions in some circumstances (e.g., on a slow host).
This patch terminates the Docker executor after receiving a terminal
status update acknowledgment. Also, this patch increases the timeout
from one second to one minute for fail-safety.
Review: https://reviews.apache.org/r/72055
{code}
was (Author: abudnik):
{code:java}
commit 683dfc1ffb0b1ca758a07d19ab3badd8cac62dc7
Author: Andrei Budnik <ab...@apache.org>
Date: Wed Jan 29 19:07:50 2020 +0100
Changed termination logic of the default executor.
Previously, the default executor terminated itself after all containers
had terminated. This could lead to termination of the executor before
processing of a terminal status update by the agent. In order
to mitigate this issue, the executor slept for one second to give a
chance to send all status updates and receive all status update
acknowledgements before terminating itself. This might have led to
various race conditions in some circumstances (e.g., on a slow host).
This patch terminates the default executor if all status updates have
been acknowledged by the agent and no running containers are left.
Also, this patch increases the timeout from one second to one minute
for fail-safety.
Review: https://reviews.apache.org/r/72029
commit 457c38967bf9a53c1c5cd2743385937a26f413f6
Author: Andrei Budnik <ab...@apache.org>
Date: Wed Jan 29 13:35:02 2020 +0100
Changed termination logic of the Docker executor.
Previously, the Docker executor terminated itself after a task's
container had terminated. This could lead to termination of the
executor before processing of a terminal status update by the agent.
In order to mitigate this issue, the executor slept for one second to
give a chance to send all status updates and receive all status update
acknowledgments before terminating itself. This might have led to
various race conditions in some circumstances (e.g., on a slow host).
This patch terminates the Docker executor after receiving a terminal
status update acknowledgment. Also, this patch increases the timeout
from one second to one minute for fail-safety.
Review: https://reviews.apache.org/r/72055
{code}
> Docker executor doesn't wait for status updates to be ack'd before shutting down.
> ---------------------------------------------------------------------------------
>
> Key: MESOS-9847
> URL: https://issues.apache.org/jira/browse/MESOS-9847
> Project: Mesos
> Issue Type: Bug
> Components: executor
> Reporter: Meng Zhu
> Assignee: Andrei Budnik
> Priority: Major
> Labels: containerization
> Fix For: 1.5.4, 1.6.3, 1.7.3, 1.8.2, 1.10, 1.9.1
>
>
> The Docker executor doesn't wait for pending status updates to be acknowledged before shutting down; instead, it sleeps for one second and then terminates:
> {noformat}
> void _stop()
> {
>   // A hack for now ... but we need to wait until the status update
>   // is sent to the slave before we shut ourselves down.
>   // TODO(tnachen): Remove this hack and also the same hack in the
>   // command executor when we have the new HTTP APIs to wait until
>   // an ack.
>   os::sleep(Seconds(1));
>   driver.get()->stop();
> }
> {noformat}
> This would result in a race between the task status update (e.g. TASK_FINISHED) and the executor exit. If the executor exits first, the agent would generate a `TASK_FAILED` status update by itself, leading to the confusing case where the agent handles two different terminal status updates.
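
For illustration only, the "wait for the acknowledgment, with a fail-safe timeout" idea described in the commit message can be sketched in standard C++ as follows. This is not the Mesos implementation: `TerminalAckWaiter`, `acknowledged()`, `waitFor()` and `stopAfterAck()` are hypothetical names, and the real executor uses the libprocess and executor driver APIs rather than raw threads. Only the one-minute timeout value is taken from the commit message.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical helper (not part of Mesos): lets a shutdown path block until
// the terminal status update has been acknowledged, or until a fail-safe
// timeout expires, instead of sleeping for a fixed interval.
class TerminalAckWaiter
{
public:
  // Called from whichever thread receives the acknowledgment.
  void acknowledged()
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      acked_ = true;
    }
    cv_.notify_all();
  }

  // Returns true if the acknowledgment arrived before `timeout` elapsed,
  // false if the fail-safe timeout fired first.
  template <typename Rep, typename Period>
  bool waitFor(const std::chrono::duration<Rep, Period>& timeout)
  {
    std::unique_lock<std::mutex> lock(mutex_);
    return cv_.wait_for(lock, timeout, [this] { return acked_; });
  }

private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool acked_ = false;
};

// Usage sketch: a stand-in "agent" thread acknowledges after a short delay,
// while the "executor" waits for up to one minute before giving up.
bool stopAfterAck()
{
  TerminalAckWaiter waiter;

  std::thread agent([&waiter] {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    waiter.acknowledged();
  });

  const bool acked = waiter.waitFor(std::chrono::minutes(1));
  agent.join();
  return acked;
}
```

In this sketch the shutdown path returns as soon as the acknowledgment is observed, so the timeout only bounds the worst case (for example, an agent that never responds). A fixed sleep, by contrast, always costs its full duration and can still be too short on a slow host.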
--
This message was sent by Atlassian Jira
(v8.3.4#803005)