Posted to dev@mesos.apache.org by "Benjamin Hindman (JIRA)" <ji...@apache.org> on 2012/11/01 23:00:12 UTC

[jira] [Comment Edited] (MESOS-303) mesos slave crashes during framework termination

    [ https://issues.apache.org/jira/browse/MESOS-303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13489067#comment-13489067 ] 

Benjamin Hindman edited comment on MESOS-303 at 11/1/12 9:59 PM:
-----------------------------------------------------------------

Thanks for reporting this, Erich. I'm currently trying to reproduce it locally. It would be incredibly helpful if you could attach the backtraces of the other threads as well. You can do that from gdb with the command 'thread apply all bt'.
                
      was (Author: benjaminhindman):
    Thanks for reporting this Erich. I'm currently trying to reproduce locally. It would be incredibly helpful if you can attach the other threads backtraces as well. You can do that from gdb with the command 'thread all apply bt'.
                  
> mesos slave crashes during framework termination
> ------------------------------------------------
>
>                 Key: MESOS-303
>                 URL: https://issues.apache.org/jira/browse/MESOS-303
>             Project: Mesos
>          Issue Type: Bug
>         Environment: Ubuntu 11.04
>            Reporter: Erich Nachbar
>            Priority: Critical
>
> Hi,
> I'm running Spark 0.6.0 on Mesos trunk (5230fea125b0b) and see my mesos slaves terminating when a Spark job is aborted (CTRL-C).
> The logs only show a Segfault message, but I obtained a backtrace through gdb to give a little more context.
> Mesos passes all checks (make check) except for the linux container.
> Mesos was built using: ./configure.ubuntu-natty-64 --with-zookeeper --with-webui
> Mesos slave command: mesos-slave --master=zk://szk0:2181/mesos
> Here are the last few lines leading up to the segfault using gdb:
> 2012-10-31 22:15:35,698:29485(0x7fffe0ac5700):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
> 2012-10-31 22:15:39,047:29485(0x7fffe0ac5700):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 13 ms
> 2012-10-31 22:15:42,385:29485(0x7fffe0ac5700):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 15 ms
> I1031 22:15:45.434877 29511 slave.cpp:652] Asked to shut down framework 201210312057-1560611338-5050-24091-0009
> I1031 22:15:45.435017 29511 slave.cpp:656] Shutting down framework 201210312057-1560611338-5050-24091-0009
> I1031 22:15:45.435387 29511 slave.cpp:1102] Shutting down executor 'default' of framework 201210312057-1560611338-5050-24091-0009
> 2012-10-31 22:15:45,707:29485(0x7fffe0ac5700):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
> 2012-10-31 22:15:49,044:29485(0x7fffe0ac5700):ZOO_DEBUG@zookeeper_process@1983: Got ping response in 0 ms
> I1031 22:15:50.437018 29495 slave.cpp:1131] Killing executor 'default' of framework 201210312057-1560611338-5050-24091-0009
> I1031 22:15:50.439749 29502 gc.cpp:97] Scheduling /tmp/mesos/slaves/201210312057-1560611338-5050-24091-22/frameworks/201210312057-1560611338-5050-24091-0009/executors/default/runs/74aa6767-e45c-40db-8bfd-5aaf9960fabe for removal
> /usr/local/libexec/mesos/killtree.sh: line 229: echo: write error: Broken pipe
> /usr/local/libexec/mesos/killtree.sh: line 135: echo: write error: Broken pipe
> root@shd0:~/mesos_git# /usr/local/libexec/mesos/killtree.sh: line 124: printf: write error: Broken pipe
> /usr/local/libexec/mesos/killtree.sh: line 124: printf: write error: Broken pipe
> /usr/local/libexec/mesos/killtree.sh: line 229: echo: write error: Broken pipe
> -------------------------------------------------------------------------------------------
> Here is the backtrace from gdb:
> #0  0x0000000000000000 in ?? ()
> #1  0x00007ffff74dbaf6 in mesos::internal::slave::Executor::~Executor() ()
>    from /usr/local/lib/libmesos-0.9.0.so
> #2  0x00007ffff74ec00c in __gnu_cxx::new_allocator<mesos::internal::slave::Executor>::destroy(mesos::internal::slave::Executor*) () from /usr/local/lib/libmesos-0.9.0.so
> #3  0x00007ffff74e3bd5 in std::_List_base<mesos::internal::slave::Executor, std::allocator<mesos::internal::slave::Executor> >::_M_clear() ()
>    from /usr/local/lib/libmesos-0.9.0.so
> #4  0x00007ffff74de3df in std::_List_base<mesos::internal::slave::Executor, std::allocator<mesos::internal::slave::Executor> >::~_List_base() ()
>    from /usr/local/lib/libmesos-0.9.0.so
> #5  0x00007ffff74dc670 in std::list<mesos::internal::slave::Executor, std::allocator<mesos::internal::slave::Executor> >::~list() () from /usr/local/lib/libmesos-0.9.0.so
> #6  0x00007ffff74dc7fb in mesos::internal::slave::Framework::~Framework() ()
>    from /usr/local/lib/libmesos-0.9.0.so
> #7  0x00007ffff74d87d5 in mesos::internal::slave::Slave::shutdownExecutorTimeout(mesos::FrameworkID const&, mesos::ExecutorID const&, UUID const&) ()
>    from /usr/local/lib/libmesos-0.9.0.so
> #8  0x00007ffff7501313 in std::tr1::_Mem_fn<void (mesos::internal::slave::Slave::*)(mesos::FrameworkID const&, mesos::ExecutorID const&, UUID const&)>::operator()(mesos::internal::slave::Slave*, mesos::FrameworkID const&, mesos::ExecutorID const&, UUID const&) const
>     () from /usr/local/lib/libmesos-0.9.0.so
> #9  0x00007ffff74fd404 in std::tr1::result_of<std::tr1::_Mem_fn<void (mesos::internal::slave::Slave::*)(mesos::FrameworkID const&, mesos::ExecutorID const&, UUID const&)> ()(std::tr1::result_of<std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<mesos::internal::slave::Slave*&>)>::type, std::tr1::result_of<std::tr1::_Mu<mesos::FrameworkID, false, false> ()(mesos::FrameworkID, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<mesos::internal::slave::Slave*&>))>::type, std::tr1::result_of<std::tr1::_Mu<mesos::ExecutorID, false, false> ()(mesos::ExecutorID, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<mesos::internal::slave::Slave*&>))>::type, std::tr1::result_of<std::tr1::_Mu<UUID, false, false> ()(UUID, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<mesos::internal::slave::Slave*&>))>::type)>::type std::tr1::_Bind<std::tr1::_Mem_fn<void (mesos::internal::slave::Slave::*)(mesos::FrameworkID const&, mesos::ExecutorID const&, UUID const&)> ()(std::tr1::_Placeholder<1>, mesos::FrameworkID, mesos::ExecutorID, UUID)>::__call<mesos::internal::slave::Slave*&, 0, 1, 2, 3>(std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ( const&)(std::tr1::_Placeholder<1>, std::tr1::tuple<mesos::internal::slave::Slave*&>), std::tr1::_Index_tuple<0, 1, 2, 3>) () from /usr/local/lib/libmesos-0.9.0.so
> #10 0x00007ffff74f7956 in std::tr1::result_of<std::tr1::_Mem_fn<void (mesos::internal::slave::Slave::*)(mesos::FrameworkID const&, mesos::ExecutorID const&, UUID const&)> ()(std::tr1::result_of<std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<mesos::internal::slave::Slave*>)>::type, std::tr1::result_of<std::tr1::_Mu<mesos::FrameworkID, false, false> ()(mesos::FrameworkID, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<mesos::internal::slave::Slave*>))>::type, std::tr1::result_of<std::tr1::_Mu<mesos::ExecutorID, false, false> ()(mesos::ExecutorID, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<mesos::internal::slave::Slave*>))>::type, std::tr1::result_of<std::tr1::_Mu<UUID, false, false> ()(UUID, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<mesos::internal::slave::Slave*>))>::type)>::type std::tr1::_Bind<std::tr1::_Mem_fn<void (mesos::internal::slave::Slave::*)(mesos::FrameworkID const&, mesos::ExecutorID const&, UUID const&)> ()(std::tr1::_Placeholder<1>, mesos::FrameworkID, mesos::ExecutorID, UUID)>::operator()<mesos::internal::slave::Slave*>(mesos::internal::slave::Slave*&) ()
>    from /usr/local/lib/libmesos-0.9.0.so
> #11 0x00007ffff74f12dc in std::tr1::_Function_handler<void ()(mesos::internal::slave::Slave*), std::tr1::_Bind<std::tr1::_Mem_fn<void (mesos::internal::slave::Slave::*)(mesos::FrameworkID const&, mesos::ExecutorID const&, UUID const&)> ()(std::tr1::_Placeholder<1>, mesos::FrameworkID, mesos::ExecutorID, UUID)> >::_M_invoke(std::tr1::_Any_data const&, mesos::internal::slave::Slave*) () from /usr/local/lib/libmesos-0.9.0.so
> #12 0x00007ffff74ed58a in std::tr1::function<void ()(mesos::internal::slave::Slave*)>::operator()(mesos::internal::slave::Slave*) const () from /usr/local/lib/libmesos-0.9.0.so
> #13 0x00007ffff74e508d in void process::internal::vdispatcher<mesos::internal::slave::Slave>(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >) () from /usr/local/lib/libmesos-0.9.0.so
> #14 0x00007ffff74f9be9 in std::tr1::result_of<void (*()(std::tr1::result_of<std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<process::ProcessBase*&>)>::type, std::tr1::result_of<std::tr1::_Mu<std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >, false, false> ()(std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<process::ProcessBase*&>))>::type))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >)>::type std::tr1::_Bind<void (*()(std::tr1::_Placeholder<1>, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >)>::__call<process::ProcessBase*&, 0, 1>(std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ( const&)(std::tr1::_Placeholder<1>, std::tr1::tuple<process::ProcessBase*&>), std::tr1::_Index_tuple<0, 1>) ()
>    from /usr/local/lib/libmesos-0.9.0.so
> #15 0x00007ffff74f3ce4 in std::tr1::result_of<void (*()(std::tr1::result_of<std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<process::ProcessBase*>)>::type, std::tr1::result_of<std::tr1::_Mu<std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >, false, false> ()(std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >, std::tr1::_Mu<std::tr1::_Placeholder<1>, false, true> ()(std::tr1::_Placeholder<1>, std::tr1::tuple<process::ProcessBase*>))>::type))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >)>::type std::tr1::_Bind<void (*()(std::tr1::_Placeholder<1>, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >)>::operator()<process::ProcessBase*>(process::ProcessBase*&) () from /usr/local/lib/libmesos-0.9.0.so
> #16 0x00007ffff74ed676 in std::tr1::_Function_handler<void ()(process::ProcessBase*), std::tr1::_Bind<void (*()(std::tr1::_Placeholder<1>, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::Slave*)> >)> >::_M_invoke(std::tr1::_Any_data const&, process::ProcessBase*) () from /usr/local/lib/libmesos-0.9.0.so
> #17 0x00007ffff76eecd0 in std::tr1::function<void ()(process::ProcessBase*)>::operator()(process::ProcessBase*) const () from /usr/local/lib/libmesos-0.9.0.so
> #18 0x00007ffff76da56b in process::ProcessBase::visit(process::DispatchEvent const&) ()
>    from /usr/local/lib/libmesos-0.9.0.so
> #19 0x00007ffff76df1a4 in process::DispatchEvent::visit(process::EventVisitor*) const ()
>    from /usr/local/lib/libmesos-0.9.0.so
> #20 0x00007ffff738a85e in process::ProcessBase::serve(process::Event const&) ()
>    from /usr/local/lib/libmesos-0.9.0.so
> #21 0x00007ffff76d7ccb in process::ProcessManager::resume(process::ProcessBase*) ()
>    from /usr/local/lib/libmesos-0.9.0.so
> #22 0x00007ffff76cf6f7 in process::schedule(void*) ()
>    from /usr/local/lib/libmesos-0.9.0.so
> #23 0x00007ffff51fbd8c in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
> #24 0x00007ffff4f45fdd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #25 0x0000000000000000 in ?? ()
> A debugging session is active.
> I discussed the issue with Florian and did some investigation into the code. It seems that the problematic section of the code has received a fairly major patch:
> diff --git a/src/slave/process_based_isolation_module.cpp b/src/slave/process_based_isolation_module.cpp
> index 7448326..b0b6a81 100644
> --- a/src/slave/process_based_isolation_module.cpp
> +++ b/src/slave/process_based_isolation_module.cpp
> @@ -18,6 +18,7 @@
> #include <errno.h>
> #include <signal.h>
> +#include <stdio.h> // For perror.
> #include <string.h>
> #include <map>
> @@ -150,29 +151,33 @@ void ProcessBasedIsolationModule::launchExecutor(
>     dispatch(slave, &Slave::executorStarted,
>              frameworkId, executorId, pid);
>   } else {
> -    // In child process, make cleanup easier.
> +    // In child process, we make cleanup easier by putting process
> +    // into it's own session. DO NOT USE GLOG!
> +    close(pipes[0]);
> +
>     // NOTE: We setsid() in a loop because setsid() might fail if another
>     // process has the same process group id as the calling process.
> -    close(pipes[0]);
>     while ((pid = setsid()) == -1) {
> -      PLOG(ERROR) << "Could not put executor in own session, "
> -                  << "forking another process and retrying";
> +      perror("Could not put executor in own session");
> +
> +      std::cerr << "Forking another process and retrying ..." << std::endl;
>       if ((pid = fork()) == -1) {
> -        LOG(ERROR) << "Failed to fork to launch executor";
> -        exit(-1);
> +        perror("Failed to fork to launch executor");
> +        abort();
>       }
>       if (pid) {
>         // In parent process.
>         // It is ok to suicide here, though process reaper signals the exit,
>         // because the process isolation module ignores unknown processes.
> -        exit(-1);
> +        exit(0);
>       }
>     }
>     if (write(pipes[1], &pid, sizeof(pid)) != sizeof(pid)) {
> -      PLOG(FATAL) << "Failed to write PID on pipe";
> +      perror("Failed to write PID on pipe");
> +      abort();
>     }
>     close(pipes[1]);
> @@ -182,7 +187,8 @@ void ProcessBasedIsolationModule::launchExecutor(
>                              executorInfo, directory);
>     if
> -----------------------------------------
> Our backs are somewhat against the wall here, because the released Mesos 0.9 requires restarting the whole cluster after a master failure (we have had a few), losing all running jobs.
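
For readers following the quoted patch, here is a minimal standalone sketch of the pattern it touches: fork a child, have the child call setsid() in a retry loop, and report the resulting session id to the parent over a pipe, using only perror()/abort() in the child instead of glog. This is an illustration, not the Mesos code. The function name launchInOwnSession is hypothetical, the child simply exits where a real isolation module would go on to launch the executor, and _exit() is used in place of the patch's exit() so that inherited stdio buffers are not flushed twice.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cstdio>
#include <cstdlib>

// Hypothetical helper (not part of Mesos): forks a child that puts itself
// into its own session and reports the session id through a pipe.
// Returns the session id as read by the parent, or -1 on failure.
pid_t launchInOwnSession()
{
  int pipes[2];
  if (pipe(pipes) == -1) {
    perror("Failed to create pipe");
    return -1;
  }

  pid_t pid = fork();
  if (pid == -1) {
    perror("Failed to fork");
    close(pipes[0]);
    close(pipes[1]);
    return -1;
  }

  if (pid > 0) {
    // In parent process: read the session id written by the child.
    close(pipes[1]);
    pid_t session = -1;
    ssize_t n = read(pipes[0], &session, sizeof(session));
    close(pipes[0]);

    int status = 0;
    waitpid(pid, &status, 0); // Reap the direct child.

    return n == static_cast<ssize_t>(sizeof(session)) ? session : -1;
  }

  // In child process: plain stdio and syscalls only, no logging library.
  close(pipes[0]);

  // setsid() fails if the caller is already a process group leader,
  // so on failure fork again and retry in the new child.
  while ((pid = setsid()) == -1) {
    perror("Could not put child in own session");
    if ((pid = fork()) == -1) {
      perror("Failed to fork");
      abort();
    }
    if (pid > 0) {
      _exit(0); // The intermediate process exits; its child retries.
    }
  }

  if (write(pipes[1], &pid, sizeof(pid)) !=
      static_cast<ssize_t>(sizeof(pid))) {
    perror("Failed to write PID on pipe");
    abort();
  }
  close(pipes[1]);

  _exit(0); // A real executor launch would continue from here.
}
```

Calling launchInOwnSession() from a test program should return a positive id that differs from the caller's own getpid(), since the session leader is the forked child (or one of its descendants if a retry was needed).
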

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira