Posted to dev@mesos.apache.org by "Erich Nachbar (JIRA)" <ji...@apache.org> on 2012/11/01 18:07:12 UTC

[jira] [Updated] (MESOS-303) mesos slave crashes during framework termination

     [ https://issues.apache.org/jira/browse/MESOS-303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erich Nachbar updated MESOS-303:
--------------------------------

    Description: 
Hi,

I'm running Spark 0.6.0 on Mesos trunk (5230fea125b0b) and see my mesos slaves terminating when a Spark job is aborted (CTRL-C).

The logs only show a Segfault message, but I obtained a backtrace through gdb to give a little more context.
Mesos passes all checks (make check) except for the linux container.
Mesos was built using: ./configure.ubuntu-natty-64 --with-zookeeper --with-webui
Mesos slave command: mesos-slave --master=zk://szk0:2181/mesos

Here are the last few lines of output, captured while running under gdb, leading up to the segfault:

2012-10-31 22:15:35,698:29485(0x7fffe0ac5700):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
2012-10-31 22:15:39,047:29485(0x7fffe0ac5700):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 13 ms
2012-10-31 22:15:42,385:29485(0x7fffe0ac5700):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 15 ms
I1031 22:15:45.434877 29511 slave.cpp:652] Asked to shut down framework 201210312057-1560611338-5050-24091-0009
I1031 22:15:45.435017 29511 slave.cpp:656] Shutting down framework 201210312057-1560611338-5050-24091-0009
I1031 22:15:45.435387 29511 slave.cpp:1102] Shutting down executor 'default' of framework 201210312057-1560611338-5050-24091-0009
2012-10-31 22:15:45,707:29485(0x7fffe0ac5700):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
2012-10-31 22:15:49,044:29485(0x7fffe0ac5700):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
I1031 22:15:50.437018 29495 slave.cpp:1131] Killing executor 'default' of framework 201210312057-1560611338-5050-24091-0009
I1031 22:15:50.439749 29502 gc.cpp:97] Scheduling /tmp/mesos/slaves/201210312057-1560611338-5050-24091-22/frameworks/201210312057-1560611338-5050-24091-0009/executors/default/runs/74aa6767-e45c-40db-8bfd-5aaf9960fabe for removal
/usr/local/libexec/mesos/killtree.sh: line 229: echo: write error: Broken pipe
/usr/local/libexec/mesos/killtree.sh: line 135: echo: write error: Broken pipe
root@shd0:~/mesos_git# /usr/local/libexec/mesos/killtree.sh: line 124: printf: write error: Broken pipe
/usr/local/libexec/mesos/killtree.sh: line 124: printf: write error: Broken pipe
/usr/local/libexec/mesos/killtree.sh: line 229: echo: write error: Broken pipe


-------------------------------------------------------------------------------------------
Here is the backtrace from gdb:

#0  0x0000000000000000 in ?? ()
#1  0x00007ffff74dbaf6 in mesos::internal::slave::Executor::~Executor() ()
   from /usr/local/lib/libmesos-0.9.0.so
#2  0x00007ffff74ec00c in __gnu_cxx::new_allocator<mesos::internal::slave::Executor>::destroy(mesos::internal::slave::Executor*) () from /usr/local/lib/libmesos-0.9.0.so
#3  0x00007ffff74e3bd5 in std::_List_base<mesos::internal::slave::Executor, std::allocator<mesos::internal::slave::Executor> >::_M_clear() ()
   from /usr/local/lib/libmesos-0.9.0.so
#4  0x00007ffff74de3df in std::_List_base<mesos::internal::slave::Executor, std::allocator<mesos::internal::slave::Executor> >::~_List_base() ()
   from /usr/local/lib/libmesos-0.9.0.so
#5  0x00007ffff74dc670 in std::list<mesos::internal::slave::Executor, std::allocator<mesos::internal::slave::Executor> >::~list() () from /usr/local/lib/libmesos-0.9.0.so
#6  0x00007ffff74dc7fb in mesos::internal::slave::Framework::~Framework() ()
   from /usr/local/lib/libmesos-0.9.0.so
#7  0x00007ffff74d87d5 in mesos::internal::slave::Slave::shutdownExecutorTimeout(mesos::FrameworkID const&, mesos::ExecutorID const&, UUID const&) ()
   from /usr/local/lib/libmesos-0.9.0.so
#8  0x00007ffff7501313 in std::tr1::_Mem_fn<void (mesos::internal::slave::Slave::*)(mesos::FrameworkID const&, mesos::ExecutorID const&, UUID const&)>::operator()(mesos::internal::slave::Slave*, mesos::FrameworkID const&, mesos::ExecutorID const&, UUID const&) const
    () from /usr/local/lib/libmesos-0.9.0.so
#9  0x00007ffff74fd404 in std::tr1::result_of<std::tr1::_Mem_fn<void (mesos::internal::slave::Slave::*)(mesos::FrameworkID const&, mesos::ExecutorID const&, UUID const&)> ()(std::tr1::result_of<std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<mesos::internal::slave::Slave*&>)>::type, std::tr1::result_of<std::tr1::_Mu<mesos::FrameworkID, false, false> ()(mesos::FrameworkID, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<mesos::internal::slave::Slave*&>))>::type, std::tr1::result_of<std::tr1::_Mu<mesos::ExecutorID, false, false> ()(mesos::ExecutorID, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<mesos::internal::slave::Slave*&>))>::type, std::tr1::result_of<std::tr1::_Mu<UUID, false, false> ()(UUID, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<mesos::internal::slave::Slave*&>))>::type)>::type std::tr1::_Bind<std::tr1::_Mem_fn<void (mesos::internal::slave::Slave::*)(mesos::FrameworkID const&, mesos::ExecutorID const&, UUID const&)> ()(std::tr1::_Placeholder<1>, mesos::FrameworkID, mesos::ExecutorID, UUID)>::__call<mesos::internal::slave::Slave*&, 0, 1, 2, 3>(std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ( const&)(std::tr1::_Placeholder<1>, std::tr1::tuple<mesos::internal::slave::Slave*&>), std::tr1::_Index_tuple<0, 1, 2, 3>) () from /usr/local/lib/libmesos-0.9.0.so
#10 0x00007ffff74f7956 in std::tr1::result_of<std::tr1::_Mem_fn<void (mesos::internal::slave::Slave::*)(mesos::FrameworkID const&, mesos::ExecutorID const&, UUID const&)> ()(std::tr1::result_of<std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<mesos::internal::slave::Slave*>)>::type, std::tr1::result_of<std::tr1::_Mu<mesos::FrameworkID, false, false> ()(mesos::FrameworkID, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<mesos::internal::slave::Slave*>))>::type, std::tr1::result_of<std::tr1::_Mu<mesos::ExecutorID, false, false> ()(mesos::ExecutorID, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<mesos::internal::slave::Slave*>))>::type, std::tr1::result_of<std::tr1::_Mu<UUID, false, false> ()(UUID, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<mesos::internal::slave::Slave*>))>::type)>::type std::tr1::_Bind<std::tr1::_Mem_fn<void (mesos::internal::slave::Slave::*)(mesos::FrameworkID const&, mesos::ExecutorID const&, UUID const&)> ()(std::tr1::_Placeholder<1>, mesos::FrameworkID, mesos::ExecutorID, UUID)>::operator()<mesos::internal::slave::Slave*>(mesos::internal::slave::Slave*&) ()
   from /usr/local/lib/libmesos-0.9.0.so
#11 0x00007ffff74f12dc in std::tr1::_Function_handler<void ()(mesos::internal::slave::Slave*), std::tr1::_Bind<std::tr1::_Mem_fn<void (mesos::internal::slave::Slave::*)(mesos::FrameworkID const&, mesos::ExecutorID const&, UUID const&)> ()(std::tr1::_Placeholder<1>, mesos::FrameworkID, mesos::ExecutorID, UUID)> >::_M_invoke(std::tr1::_Any_data const&, mesos::internal::slave::Slave*) () from /usr/local/lib/libmesos-0.9.0.so
#12 0x00007ffff74ed58a in std::tr1::function<void ()(mesos::internal::slave::Slave*)>::operator()(mesos::internal::slave::Slave*) const () from /usr/local/lib/libmesos-0.9.0.so
#13 0x00007ffff74e508d in void process::internal::vdispatcher<mesos::internal::slave::Slave>(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >) () from /usr/local/lib/libmesos-0.9.0.so
#14 0x00007ffff74f9be9 in std::tr1::result_of<void (*()(std::tr1::result_of<std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<process::ProcessBase*&>)>::type, std::tr1::result_of<std::tr1::_Mu<std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >, false, false> ()(std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<process::ProcessBase*&>))>::type))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >)>::type std::tr1::_Bind<void (*()(std::tr1::_Placeholder<1>, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >)>::__call<process::ProcessBase*&, 0, 1>(std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ( const&)(std::tr1::_Placeholder<1>, std::tr1::tuple<process::ProcessBase*&>), std::tr1::_Index_tuple<0, 1>) ()
   from /usr/local/lib/libmesos-0.9.0.so
#15 0x00007ffff74f3ce4 in std::tr1::result_of<void (*()(std::tr1::result_of<std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<process::ProcessBase*>)>::type, std::tr1::result_of<std::tr1::_Mu<std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >, false, false> ()(std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<process::ProcessBase*>))>::type))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >)>::type std::tr1::_Bind<void (*()(std::tr1::_Placeholder<1>, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >)>::operator()<process::ProcessBase*>(process::ProcessBase*&) () from /usr/local/lib/libmesos-0.9.0.so
#16 0x00007ffff74ed676 in std::tr1::_Function_handler<void ()(process::ProcessBase*), std::tr1::_Bind<void (*()(std::tr1::_Placeholder<1>, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >)> >::_M_invoke(std::tr1::_Any_data const&, process::ProcessBase*) () from /usr/local/lib/libmesos-0.9.0.so
#17 0x00007ffff76eecd0 in std::tr1::function<void ()(process::ProcessBase*)>::operator()(process::ProcessBase*) const () from /usr/local/lib/libmesos-0.9.0.so
#18 0x00007ffff76da56b in process::ProcessBase::visit(process::DispatchEvent const&) ()
   from /usr/local/lib/libmesos-0.9.0.so
#19 0x00007ffff76df1a4 in process::DispatchEvent::visit(process::EventVisitor*) const ()
   from /usr/local/lib/libmesos-0.9.0.so
#20 0x00007ffff738a85e in process::ProcessBase::serve(process::Event const&) ()
   from /usr/local/lib/libmesos-0.9.0.so
#21 0x00007ffff76d7ccb in process::ProcessManager::resume(process::ProcessBase*) ()
   from /usr/local/lib/libmesos-0.9.0.so
#22 0x00007ffff76cf6f7 in process::schedule(void*) ()
   from /usr/local/lib/libmesos-0.9.0.so
#23 0x00007ffff51fbd8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#24 0x00007ffff4f45fdd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#25 0x0000000000000000 in ?? ()
A debugging session is active.

I discussed the issue with Florian and investigated the code. It seems that the problematic section of the code has received a fairly major patch:

diff --git a/src/slave/process_based_isolation_module.cpp b/src/slave/process_based_isolation_module.cpp
index 7448326..b0b6a81 100644
--- a/src/slave/process_based_isolation_module.cpp
+++ b/src/slave/process_based_isolation_module.cpp
@@ -18,6 +18,7 @@

#include <errno.h>
#include <signal.h>
+#include <stdio.h> // For perror.
#include <string.h>

#include <map>
@@ -150,29 +151,33 @@ void ProcessBasedIsolationModule::launchExecutor(
    dispatch(slave, &Slave::executorStarted,
             frameworkId, executorId, pid);
  } else {
-    // In child process, make cleanup easier.
+    // In child process, we make cleanup easier by putting process
+    // into it's own session. DO NOT USE GLOG!
+    close(pipes[0]);
+
    // NOTE: We setsid() in a loop because setsid() might fail if another
    // process has the same process group id as the calling process.
-    close(pipes[0]);
    while ((pid = setsid()) == -1) {
-      PLOG(ERROR) << "Could not put executor in own session, "
-                  << "forking another process and retrying";
+      perror("Could not put executor in own session");
+
+      std::cerr << "Forking another process and retrying ..." << std::endl;

      if ((pid = fork()) == -1) {
-        LOG(ERROR) << "Failed to fork to launch executor";
-        exit(-1);
+        perror("Failed to fork to launch executor");
+        abort();
      }

      if (pid) {
        // In parent process.
        // It is ok to suicide here, though process reaper signals the exit,
        // because the process isolation module ignores unknown processes.
-        exit(-1);
+        exit(0);
      }
    }

    if (write(pipes[1], &pid, sizeof(pid)) != sizeof(pid)) {
-      PLOG(FATAL) << "Failed to write PID on pipe";
+      perror("Failed to write PID on pipe");
+      abort();
    }

    close(pipes[1]);
@@ -182,7 +187,8 @@ void ProcessBasedIsolationModule::launchExecutor(
                             executorInfo, directory);

    if

-----------------------------------------

Our backs are somewhat against the wall here: the released Mesos 0.9 requires restarting the whole cluster after a master failure (we have had a few), which loses all running jobs.







    
> mesos slave crashes during framework termination
> ------------------------------------------------
>
>                 Key: MESOS-303
>                 URL: https://issues.apache.org/jira/browse/MESOS-303
>             Project: Mesos
>          Issue Type: Bug
>         Environment: Ubuntu 11.04
>            Reporter: Erich Nachbar
>            Priority: Critical
>
> Hi,
> I'm running Spark 0.6.0 on Mesos trunk (5230fea125b0b) and see my mesos slaves terminating when a Spark job is aborted (CTRL-C).
> The logs only show a Segfault message, but I obtained a backtrace through gdb to give a little more context.
> Mesos passes all checks (make check) except for the linux container.
> Mesos was built using: ./configure.ubuntu-natty-64 --with-zookeeper --with-webui
> Mesos slave command: mesos-slave --master=zk://szk0:2181/mesos
> Here are the last few lines leading up to the segfault using gdb:
> 2012-10-31 22:15:35,698:29485(0x7fffe0ac5700):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
> 2012-10-31 22:15:39,047:29485(0x7fffe0ac5700):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 13 ms
> 2012-10-31 22:15:42,385:29485(0x7fffe0ac5700):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 15 ms
> I1031 22:15:45.434877 29511 slave.cpp:652] Asked to shut down framework 201210312057-1560611338-5050-24091-0009
> I1031 22:15:45.435017 29511 slave.cpp:656] Shutting down framework 201210312057-1560611338-5050-24091-0009
> I1031 22:15:45.435387 29511 slave.cpp:1102] Shutting down executor 'default' of framework 201210312057-1560611338-5050-24091-0009
> 2012-10-31 22:15:45,707:29485(0x7fffe0ac5700):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
> 2012-10-31 22:15:49,044:29485(0x7fffe0ac5700):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
> I1031 22:15:50.437018 29495 slave.cpp:1131] Killing executor 'default' of framework 201210312057-1560611338-5050-24091-0009
> I1031 22:15:50.439749 29502 gc.cpp:97] Scheduling /tmp/mesos/slaves/201210312057-1560611338-5050-24091-22/frameworks/201210312057-1560611338-5050-24091-0009/executors/default/runs/74aa6767-e45c-40db-8bfd-5aaf9960fabe for removal
> /usr/local/libexec/mesos/killtree.sh: line 229: echo: write error: Broken pipe
> /usr/local/libexec/mesos/killtree.sh: line 135: echo: write error: Broken pipe
> root@shd0:~/mesos_git# /usr/local/libexec/mesos/killtree.sh: line 124: printf: write error: Broken pipe
> /usr/local/libexec/mesos/killtree.sh: line 124: printf: write error: Broken pipe
> /usr/local/libexec/mesos/killtree.sh: line 229: echo: write error: Broken pipe
> -------------------------------------------------------------------------------------------
> Here is the backtrace from gdb:
> #0  0x0000000000000000 in ?? ()
> #1  0x00007ffff74dbaf6 in mesos::internal::slave::Executor::~Executor() ()
>    from /usr/local/lib/libmesos-0.9.0.so
> #2  0x00007ffff74ec00c in __gnu_cxx::new_allocator<mesos::internal::slave::Executor>::destroy(mesos::internal::slave::Executor*) () from /usr/local/lib/libmesos-0.9.0.so
> #3  0x00007ffff74e3bd5 in std::_List_base<mesos::internal::slave::Executor, std::allocator<mesos::internal::slave::Executor> >::_M_clear() ()
>    from /usr/local/lib/libmesos-0.9.0.so
> #4  0x00007ffff74de3df in std::_List_base<mesos::internal::slave::Executor, std::allocator<mesos::internal::slave::Executor> >::~_List_base() ()
>    from /usr/local/lib/libmesos-0.9.0.so
> #5  0x00007ffff74dc670 in std::list<mesos::internal::slave::Executor, std::allocator<mesos::internal::slave::Executor> >::~list() () from /usr/local/lib/libmesos-0.9.0.so
> #6  0x00007ffff74dc7fb in mesos::internal::slave::Framework::~Framework() ()
>    from /usr/local/lib/libmesos-0.9.0.so
> #7  0x00007ffff74d87d5 in mesos::internal::slave::Slave::shutdownExecutorTimeout(mesos::FrameworkID const&, mesos::ExecutorID const&, UUID const&) ()
>    from /usr/local/lib/libmesos-0.9.0.so
> #8  0x00007ffff7501313 in std::tr1::_Mem_fn<void (mesos::internal::slave::Slave::*)(mesos::FrameworkID const&, mesos::ExecutorID const&, UUID const&)>::operator()(mesos::internal::slave::Slave*, mesos::FrameworkID const&, mesos::ExecutorID const&, UUID const&) const
>     () from /usr/local/lib/libmesos-0.9.0.so
> #9  0x00007ffff74fd404 in std::tr1::result_of<std::tr1::_Mem_fn<void (mesos::internal::slave::Slave::*)(mesos::FrameworkID const&, mesos::ExecutorID const&, UUID const&)> ()(std::tr1::result_of<std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<mesos::internal::slave::Slave*&>)>::type, std::tr1::result_of<std::tr1::_Mu<mesos::FrameworkID, false, false> ()(mesos::FrameworkID, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<mesos::internal::slave::Slave*&>))>::type, std::tr1::result_of<std::tr1::_Mu<mesos::ExecutorID, false, false> ()(mesos::ExecutorID, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<mesos::internal::slave::Slave*&>))>::type, std::tr1::result_of<std::tr1::_Mu<UUID, false, false> ()(UUID, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<mesos::internal::slave::Slave*&>))>::type)>::type std::tr1::_Bind<std::tr1::_Mem_fn<void (mesos::internal::slave::Slave::*)(mesos::FrameworkID const&, mesos::ExecutorID const&, UUID const&)> ()(std::tr1::_Placeholder<1>, mesos::FrameworkID, mesos::ExecutorID, UUID)>::__call<mesos::internal::slave::Slave*&, 0, 1, 2, 3>(std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ( const&)(std::tr1::_Placeholder<1>, std::tr1::tuple<mesos::internal::slave::Slave*&>), std::tr1::_Index_tuple<0, 1, 2, 3>) () from /usr/local/lib/libmesos-0.9.0.so
> #10 0x00007ffff74f7956 in std::tr1::result_of<std::tr1::_Mem_fn<void (mesos::internal::slave::Slave::*)(mesos::FrameworkID const&, mesos::ExecutorID const&, UUID const&)> ()(std::tr1::result_of<std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<mesos::internal::slave::Slave*>)>::type, std::tr1::result_of<std::tr1::_Mu<mesos::FrameworkID, false, false> ()(mesos::FrameworkID, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<mesos::internal::slave::Slave*>))>::type, std::tr1::result_of<std::tr1::_Mu<mesos::ExecutorID, false, false> ()(mesos::ExecutorID, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<mesos::internal::slave::Slave*>))>::type, std::tr1::result_of<std::tr1::_Mu<UUID, false, false> ()(UUID, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<mesos::internal::slave::Slave*>))>::type)>::type std::tr1::_Bind<std::tr1::_Mem_fn<void (mesos::internal::slave::Slave::*)(mesos::FrameworkID const&, mesos::ExecutorID const&, UUID const&)> ()(std::tr1::_Placeholder<1>, mesos::FrameworkID, mesos::ExecutorID, UUID)>::operator()<mesos::internal::slave::Slave*>(mesos::internal::slave::Slave*&) ()
>    from /usr/local/lib/libmesos-0.9.0.so
> #11 0x00007ffff74f12dc in std::tr1::_Function_handler<void ()(mesos::internal::slave::Slave*), std::tr1::_Bind<std::tr1::_Mem_fn<void (mesos::internal::slave::Slave::*)(mesos::FrameworkID const&, mesos::ExecutorID const&, UUID const&)> ()(std::tr1::_Placeholder<1>, mesos::FrameworkID, mesos::ExecutorID, UUID)> >::_M_invoke(std::tr1::_Any_data const&, mesos::internal::slave::Slave*) () from /usr/local/lib/libmesos-0.9.0.so
> #12 0x00007ffff74ed58a in std::tr1::function<void ()(mesos::internal::slave::Slave*)>::operator()(mesos::internal::slave::Slave*) const () from /usr/local/lib/libmesos-0.9.0.so
> #13 0x00007ffff74e508d in void process::internal::vdispatcher<mesos::internal::slave::Slave>(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >) () from /usr/local/lib/libmesos-0.9.0.so
> #14 0x00007ffff74f9be9 in std::tr1::result_of<void (*()(std::tr1::result_of<std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<process::ProcessBase*&>)>::type, std::tr1::result_of<std::tr1::_Mu<std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >, false, false> ()(std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<process::ProcessBase*&>))>::type))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >)>::type std::tr1::_Bind<void (*()(std::tr1::_Placeholder<1>, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >)>::__call<process::ProcessBase*&, 0, 1>(std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ( const&)(std::tr1::_Placeholder<1>, std::tr1::tuple<process::ProcessBase*&>), std::tr1::_Index_tuple<0, 1>) ()
>    from /usr/local/lib/libmesos-0.9.0.so
> #15 0x00007ffff74f3ce4 in std::tr1::result_of<void (*()(std::tr1::result_of<std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<process::ProcessBase*>)>::type, std::tr1::result_of<std::tr1::_Mu<std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >, false, false> ()(std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<process::ProcessBase*>))>::type))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >)>::type std::tr1::_Bind<void (*()(std::tr1::_Placeholder<1>, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >)>::operator()<process::ProcessBase*>(process::ProcessBase*&) () from /usr/local/lib/libmesos-0.9.0.so
> #16 0x00007ffff74ed676 in std::tr1::_Function_handler<void ()(process::ProcessBase*), std::tr1::_Bind<void (*()(std::tr1::_Placeholder<1>, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >)> >::_M_invoke(std::tr1::_Any_data const&, process::ProcessBase*) () from /usr/local/lib/libmesos-0.9.0.so
> #17 0x00007ffff76eecd0 in std::tr1::function<void ()(process::ProcessBase*)>::operator()(process::ProcessBase*) const () from /usr/local/lib/libmesos-0.9.0.so
> #18 0x00007ffff76da56b in process::ProcessBase::visit(process::DispatchEvent const&) ()
>    from /usr/local/lib/libmesos-0.9.0.so
> #19 0x00007ffff76df1a4 in process::DispatchEvent::visit(process::EventVisitor*) const ()
>    from /usr/local/lib/libmesos-0.9.0.so
> #20 0x00007ffff738a85e in process::ProcessBase::serve(process::Event const&) ()
>    from /usr/local/lib/libmesos-0.9.0.so
> #21 0x00007ffff76d7ccb in process::ProcessManager::resume(process::ProcessBase*) ()
>    from /usr/local/lib/libmesos-0.9.0.so
> #22 0x00007ffff76cf6f7 in process::schedule(void*) ()
>    from /usr/local/lib/libmesos-0.9.0.so
> #23 0x00007ffff51fbd8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
> #24 0x00007ffff4f45fdd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #25 0x0000000000000000 in ?? ()
> A debugging session is active.
> I discussed the issue with Florian and investigated the code. It seems that the problematic section of the code has received a fairly major patch:
> diff --git a/src/slave/process_based_isolation_module.cpp b/src/slave/process_based_isolation_module.cpp
> index 7448326..b0b6a81 100644
> --- a/src/slave/process_based_isolation_module.cpp
> +++ b/src/slave/process_based_isolation_module.cpp
> @@ -18,6 +18,7 @@
> #include <errno.h>
> #include <signal.h>
> +#include <stdio.h> // For perror.
> #include <string.h>
> #include <map>
> @@ -150,29 +151,33 @@ void ProcessBasedIsolationModule::launchExecutor(
>     dispatch(slave, &Slave::executorStarted,
>              frameworkId, executorId, pid);
>   } else {
> -    // In child process, make cleanup easier.
> +    // In child process, we make cleanup easier by putting process
> +    // into it's own session. DO NOT USE GLOG!
> +    close(pipes[0]);
> +
>     // NOTE: We setsid() in a loop because setsid() might fail if another
>     // process has the same process group id as the calling process.
> -    close(pipes[0]);
>     while ((pid = setsid()) == -1) {
> -      PLOG(ERROR) << "Could not put executor in own session, "
> -                  << "forking another process and retrying";
> +      perror("Could not put executor in own session");
> +
> +      std::cerr << "Forking another process and retrying ..." << std::endl;
>       if ((pid = fork()) == -1) {
> -        LOG(ERROR) << "Failed to fork to launch executor";
> -        exit(-1);
> +        perror("Failed to fork to launch executor");
> +        abort();
>       }
>       if (pid) {
>         // In parent process.
>         // It is ok to suicide here, though process reaper signals the exit,
>         // because the process isolation module ignores unknown processes.
> -        exit(-1);
> +        exit(0);
>       }
>     }
>     if (write(pipes[1], &pid, sizeof(pid)) != sizeof(pid)) {
> -      PLOG(FATAL) << "Failed to write PID on pipe";
> +      perror("Failed to write PID on pipe");
> +      abort();
>     }
>     close(pipes[1]);
> @@ -182,7 +187,8 @@ void ProcessBasedIsolationModule::launchExecutor(
>                              executorInfo, directory);
>     if
> -----------------------------------------
> Our backs are somewhat against the wall here: the released Mesos 0.9 requires restarting the whole cluster after a master failure (we have had a few), which loses all running jobs.
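
For readers following the patch quoted above, here is a minimal standalone sketch of the pattern it touches: fork a child, move it into its own session with setsid(), retry through another fork if setsid() fails, and report the resulting id to the parent over a pipe. The function name spawnInOwnSession is hypothetical and not part of Mesos, and the child simply exits where a real launcher would exec the executor. Error reporting in the child uses only perror(), mirroring the patch's "DO NOT USE GLOG" note; presumably that is because a logging library's locks may be in an inconsistent state after fork() in a multithreaded process.

```cpp
#include <cstdio>       // perror
#include <cstdlib>      // abort
#include <sys/types.h>  // pid_t, ssize_t
#include <sys/wait.h>   // waitpid
#include <unistd.h>     // pipe, fork, setsid, read, write, close, _exit

// Hypothetical helper (not Mesos code): forks a child, has it call setsid(),
// and returns the session id the child reports over a pipe, or -1 on failure.
pid_t spawnInOwnSession()
{
  int pipes[2];
  if (pipe(pipes) == -1) {
    perror("Failed to create pipe");
    return -1;
  }

  pid_t pid = fork();
  if (pid == -1) {
    perror("Failed to fork");
    close(pipes[0]);
    close(pipes[1]);
    return -1;
  }

  if (pid == 0) {
    // Child: stick to perror/abort/_exit, no logging library.
    close(pipes[0]);

    // setsid() fails with EPERM if the caller is a process group leader,
    // so fork again and retry from the new child.
    while ((pid = setsid()) == -1) {
      perror("Could not put child in own session");
      if ((pid = fork()) == -1) {
        perror("Failed to fork again");
        abort();
      }
      if (pid) {
        _exit(0);  // Intermediate parent: let the new child retry.
      }
    }

    if (write(pipes[1], &pid, sizeof(pid)) != sizeof(pid)) {
      perror("Failed to write PID on pipe");
      abort();
    }
    close(pipes[1]);
    _exit(0);  // A real launcher would exec the executor here.
  }

  // Parent: read the session id reported by the child, then reap it.
  close(pipes[1]);
  pid_t session = -1;
  ssize_t n = read(pipes[0], &session, sizeof(session));
  close(pipes[0]);
  waitpid(pid, NULL, 0);
  return n == static_cast<ssize_t>(sizeof(session)) ? session : -1;
}
```

Calling spawnInOwnSession() from a small test program should return a positive id on Linux. This only illustrates the fork/setsid/pipe handshake; it does not reproduce or explain the segfault in the backtrace.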

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira