Posted to issues@mesos.apache.org by "Vinod Kone (JIRA)" <ji...@apache.org> on 2016/05/18 18:59:13 UTC

[jira] [Commented] (MESOS-5195) Docker executor: task logs lost on shutdown

    [ https://issues.apache.org/jira/browse/MESOS-5195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15289598#comment-15289598 ] 

Vinod Kone commented on MESOS-5195:
-----------------------------------

[~timchen] Can you shepherd this?

> Docker executor: task logs lost on shutdown
> -------------------------------------------
>
>                 Key: MESOS-5195
>                 URL: https://issues.apache.org/jira/browse/MESOS-5195
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization, docker
>    Affects Versions: 0.27.2
>         Environment: Linux 4.4.2 "Ubuntu 14.04.2 LTS"
>            Reporter: Steven Schlansker
>
> When you try to kill a task running in the Docker executor (in our case via Singularity), the task shuts down cleanly but the last logs to standard out / standard error are lost in teardown.
> For example, we run dumb-init.  With debugging on, you can see it should write:
> {noformat}
> DEBUG("Forwarded signal %d to children.\n", signum);
> {noformat}
> If you attach strace to the process, you can see it clearly writes the text to stderr.  But that message is lost and is never written to the sandbox 'stderr' file.
> We believe the issue starts here, in the Docker executor's executor.cpp:
> {code}
>   void shutdown(ExecutorDriver* driver)
>   {
>     cout << "Shutting down" << endl;
>     if (run.isSome() && !killed) {
>       // The docker daemon might still be in progress starting the
>       // container, therefore we kill both the docker run process
>       // and also ask the daemon to stop the container.
>       // Making a mutable copy of the future so we can call discard.
>       Future<Nothing>(run.get()).discard();
>       stop = docker->stop(containerName, stopTimeout);
>       killed = true;
>     }
>   }
> {code}
> Notice how the "run" future is discarded *before* the Docker daemon is told to stop -- now what will discarding it do?
> {code}
> void commandDiscarded(const Subprocess& s, const string& cmd)
> {
>   VLOG(1) << "'" << cmd << "' is being discarded";
>   os::killtree(s.pid(), SIGKILL);
> }
> {code}
> Oops, just sent SIGKILL to the entire process tree...
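> To illustrate the mechanism we think is at play, here is a standalone, untested sketch (not Mesos code -- the "forwarder" below stands in for the attached 'docker run' client, which as far as we can tell is the only thing copying the container's output into the sandbox files):
> {code}
> #include <fcntl.h>
> #include <signal.h>
> #include <string.h>
> #include <sys/wait.h>
> #include <unistd.h>
>
> int main()
> {
>   int fds[2];
>   if (pipe(fds) != 0) {
>     return 1;
>   }
>
>   pid_t forwarder = fork();
>   if (forwarder == 0) {
>     // "Forwarder": copies the task's stderr from the pipe into the
>     // sandbox 'stderr' file, like the attached 'docker run' client.
>     close(fds[1]);
>     int out = open("stderr", O_CREAT | O_WRONLY | O_TRUNC, 0644);
>     sleep(2);  // Pretend we have not gotten around to copying yet.
>
>     char buf[256];
>     ssize_t n;
>     while ((n = read(fds[0], buf, sizeof(buf))) > 0) {
>       write(out, buf, n);
>     }
>     close(out);
>     return 0;
>   }
>
>   // "Task" (dumb-init): the write() itself succeeds, which is exactly
>   // what strace shows...
>   close(fds[0]);
>   const char* msg = "Forwarded signal 15 to children.\n";
>   write(fds[1], msg, strlen(msg));
>
>   // ...but SIGKILLing the forwarder before it copies the bytes leaves
>   // the sandbox 'stderr' file empty, which is what we observe.
>   sleep(1);
>   kill(forwarder, SIGKILL);
>   waitpid(forwarder, nullptr, 0);
>   return 0;
> }
> {code}
> After it runs, the sandbox 'stderr' file exists but is empty, even though the task's write() returned successfully.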
> You can see another (harmless?) side effect in the Docker daemon logs: the daemon never gets a chance to kill the task:
> {noformat}
> ERROR Handler for DELETE /v1.22/containers/mesos-f3bb39fe-8fd9-43d2-80a6-93df6a76807e-S2.0c509380-c326-4ff7-bb68-86a37b54f233 returned error: No such container: mesos-f3bb39fe-8fd9-43d2-80a6-93df6a76807e-S2.0c509380-c326-4ff7-bb68-86a37b54f233
> {noformat}
> I suspect that the fix is to wait for 'docker->stop()' to complete before discarding the 'run' future.
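> Something along those lines, perhaps (untested sketch -- in real code the continuation would presumably need to be defer()'d back onto the executor process, which is omitted here):
> {code}
>   void shutdown(ExecutorDriver* driver)
>   {
>     cout << "Shutting down" << endl;
>
>     if (run.isSome() && !killed) {
>       // Ask the daemon to stop the container first, and only discard
>       // the 'docker run' future once the stop has completed, so the
>       // task's final stdout/stderr output has a chance to reach the
>       // sandbox files before the client is torn down.
>       stop = docker->stop(containerName, stopTimeout);
>       stop.onAny([=](const Future<Nothing>&) {
>         Future<Nothing>(run.get()).discard();
>       });
>       killed = true;
>     }
>   }
> {code}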
> Happy to provide more information if necessary.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)