You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mesos.apache.org by "Till Toenshoff (JIRA)" <ji...@apache.org> on 2014/05/29 02:22:02 UTC
[jira] [Updated] (MESOS-1431) io::splice usage needs special care -
especially in connection with process::subprocess
[ https://issues.apache.org/jira/browse/MESOS-1431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Till Toenshoff updated MESOS-1431:
----------------------------------
Assignee: Till Toenshoff
> io::splice usage needs special care - especially in connection with process::subprocess
> ---------------------------------------------------------------------------------------
>
> Key: MESOS-1431
> URL: https://issues.apache.org/jira/browse/MESOS-1431
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 0.19.0
> Reporter: Till Toenshoff
> Assignee: Till Toenshoff
> Labels: bug, splice, subprocess
>
> When using {{process::subproces}} in connection with {{io::splice}}, make sure you work with extra care.
> Consider this piece of code (ripped from EC), which looks rather unspectacular - very straightforward use of subprocess and splice.
> {noformat}
> // Fork exec of external process.
> Try<Subprocess> external = process::subprocess(
> execute,
> environment);
> if (external.isError()) {
> return Error("Failed to execute external containerizer: " +
> external.error());
> }
> // Set stderr into non-blocking mode.
> Try<Nothing> nonblock = os::nonblock(external.get().err());
> if (nonblock.isError()) {
> return Error("Failed to accept nonblock: " + nonblock.error());
> }
> // Redirect output (stderr) from the subprocess to a log
> // file.
> Try<int> err = os::open(
> sandbox.isSome() ? path::join(sandbox.get().directory, "stderr")
> : "/dev/null",
> O_WRONLY | O_CREAT | O_APPEND | O_NONBLOCK,
> S_IRUSR | S_IWUSR | S_IRGRP | S_IRWXO);
> if (err.isError()) {
> return Error(
> "Failed to redirect stderr: Failed to open: " +
> err.error());
> }
> Try<Nothing> cloexec = os::cloexec(err.get());
> if (cloexec.isError()) {
> os::close(err.get());
> return Error(
> "Failed to redirect stderr: Failed to set close-on-exec: " +
> cloexec.error());
> }
> io::splice(external.get().err(), err.get())
> .onAny(&os::close, err.get()));
> {noformat}
> The above code is buggy as subprocess will close {{external.get().err()}} once the child got reaped. So the FD we are reading (splicing) from potentially gets closed before the splicer was able to get the {{EOF}}. That in turn will cause libev to continue polling that FD (remember, it never reached a final state) and that is where havoc breaks lose as we will now see data getting lost that is sent from reused FDs.
>
> If you do not {{dup}} the subprocess returned FDs before using an {{io::splice}} on it, thus not taking full ownership of the FD, you will end up with fancy problems.
> So a fix would be replacing the last two lines by this:
> {noformat}
> // Duplicate 'from' so that we're in control of it's lifetime,
> // exceeding the lifetime of the reaped subprocess. It is needed
> // to make sure the splicer had a chance to read and process the
> // EOF.
> int fd = dup(external.get().err());
> if (fd == -1) {
> os::close(err.get());
> return ErrnoError("Failed to duplicate stderr file descriptor");
> }
> io::splice(fd, err.get())
> .onAny(bind(&_invoke, fd, err.get()));
> {noformat}
> And also adding an onAny triggered function (_invoke in this case) that is closing both, the dup'ed FD and the log-file FD.
> The bad news are that the EC is not the only implementation that contains such buggy code. There is at least one more such bug found within mesos_containerizer.cpp:680 and we should now all check for more such use-cases and fix them before releasing.
--
This message was sent by Atlassian JIRA
(v6.2#6252)