Posted to dev@mesos.apache.org by "Vinod Kone (JIRA)" <ji...@apache.org> on 2013/03/15 04:44:13 UTC

[jira] [Commented] (MESOS-394) Don't do ExecutorLauncher in forked process but exec first instead.

    [ https://issues.apache.org/jira/browse/MESOS-394?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13603081#comment-13603081 ] 

Vinod Kone commented on MESOS-394:
----------------------------------

Looks like an instance of the above. The executor is getting hung when it is forked by the slave.

This happened in one of the slave recovery test runs (when run repeatedly in a loop). Strangely, the deadlock happens while we are checkpointing in the forked process, and ostringstream is trying to acquire a pthread lock! (Or I'm not reading these traces right.)

Run 1:
--------

Thread 1 (process 77074):
#0  0x00007fff93b6e122 in __psynch_mutexwait ()
#1  0x00007fff8d2cbd9d in pthread_mutex_lock ()
#2  0x00007fff8d023442 in std::locale::locale ()
#3  0x00007fff8d04589f in std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >::basic_ostringstream ()
#4  0x000000010cc5a949 in stringify<int> () at gtest.h:459
#5  0x000000010e243154 in __gnu_cxx::new_allocator<std::pair<std::string const, Option<std::string> > >::destroy () at /usr/include/c++/4.2.1/ext/new_allocator.h:860
#6  0x000000010e243154 in mesos::internal::slave::ProcessBasedIsolationModule::launchExecutor () at functional_iterate.h:402
#7  0x000000010e1cf3b0 in std::tr1::_Mem_fn<void (mesos::internal::slave::IsolationModule::*)(mesos::SlaveID const&, mesos::FrameworkID const&, mesos::FrameworkInfo const&, mesos::ExecutorInfo const&, UUID const&, std::string const&, mesos::internal::Resources const&, Option<std::string> const&)>::operator() () at functional_iterate.h:860
#8  0x000000010e1d71f3 in std::tr1::_Bind<std::tr1::_Mem_fn<void (mesos::internal::slave::IsolationModule::*)(mesos::SlaveID const&, mesos::FrameworkID const&, mesos::FrameworkInfo const&, mesos::ExecutorInfo const&, UUID const&, std::string const&, mesos::internal::Resources const&, Option<std::string> const&)> ()(std::tr1::_Placeholder<1>, mesos::SlaveID, mesos::FrameworkID, mesos::FrameworkInfo, mesos::ExecutorInfo, UUID, std::string, mesos::internal::Resources, Option<std::string>)>::operator()<mesos::internal::slave::IsolationModule*> () at functional_iterate.h:860
#9  0x000000010e1d722b in std::tr1::_Function_handler<void ()(mesos::internal::slave::IsolationModule*), std::tr1::_Bind<std::tr1::_Mem_fn<void (mesos::internal::slave::IsolationModule::*)(mesos::SlaveID const&, mesos::FrameworkID const&, mesos::FrameworkInfo const&, mesos::ExecutorInfo const&, UUID const&, std::string const&, mesos::internal::Resources const&, Option<std::string> const&)> ()(std::tr1::_Placeholder<1>, mesos::SlaveID, mesos::FrameworkID, mesos::FrameworkInfo, mesos::ExecutorInfo, UUID, std::string, mesos::internal::Resources, Option<std::string>)> >::_M_invoke () at functional_iterate.h:860
#10 0x000000010e1eaf96 in __gnu_cxx::new_allocator<std::pair<std::string const, Option<std::string> > >::destroy () at /usr/include/c++/4.2.1/ext/new_allocator.h:441
#11 0x000000010e1eaf96 in std::tr1::function<void ()(mesos::internal::slave::IsolationModule*)>::operator() () at stl_deque.h:402
#12 0x000000010e20159e in __gnu_cxx::new_allocator<std::pair<std::string const, Option<std::string> > >::destroy () at /usr/include/c++/4.2.1/ext/new_allocator.h:441
#13 0x000000010e20159e in process::internal::vdispatcher<mesos::internal::slave::IsolationModule> () at stl_deque.h:402
#14 0x000000010e1e5e30 in __gnu_cxx::new_allocator<std::pair<std::string const, Option<std::string> > >::destroy () at /usr/include/c++/4.2.1/ext/new_allocator.h:69
#15 0x000000010e1e5e30 in std::tr1::_Bind<void (*()(std::tr1::_Placeholder<1>, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::IsolationModule*)> >))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::IsolationModule*)> >)>::operator()<process::ProcessBase*> () at stl_pair.h:402
#16 0x000000010e1e5bbb in __gnu_cxx::new_allocator<std::pair<std::string const, Option<std::string> > >::destroy () at /usr/include/c++/4.2.1/ext/new_allocator.h:69
#17 0x000000010e1e5bbb in std::tr1::_Function_handler<void ()(process::ProcessBase*), std::tr1::_Bind<void (*()(std::tr1::_Placeholder<1>, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::IsolationModule*)> >))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::IsolationModule*)> >)> >::_M_invoke () at stl_pair.h:402
#18 0x000000010e41ce8a in std::tr1::function<void ()(process::ProcessBase*)>::operator() () at zookeeper.cpp:526
#19 0x000000010e3ab17c in process::ProcessBase::visit () at boost_shared_ptr.h:152
#20 0x000000010e3d315f in process::DispatchEvent::visit () at boost_shared_ptr.h:152
#21 0x000000010d12caab in process::ProcessBase::serve () at gtest-internal.h:529
#22 0x000000010e3bf370 in process::ProcessManager::resume () at boost_shared_ptr.h:152
#23 0x000000010e3c03b6 in process::schedule () at boost_shared_ptr.h:152
#24 0x00007fff8d2c6742 in _pthread_start ()
#25 0x00007fff8d2b3181 in thread_start ()


Run 2:
---------
Thread 1 (process 46451):
#0  0x00007fff93b6e122 in __psynch_mutexwait ()
#1  0x00007fff8d2cbd9d in pthread_mutex_lock ()
#2  0x00007fff8d023442 in std::locale::locale ()
#3  0x00007fff8d04589f in std::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >::basic_ostringstream ()
#4  0x000000010efa9a99 in __gnu_cxx::new_allocator<std::pair<std::string const, Option<std::string> > >::destroy () at /usr/include/c++/4.2.1/ext/new_allocator.h:17
#5  0x000000010efa9a99 in stringify<int> (t=46451) at stringify.hpp:402
#6  0x000000010fc48fbd in mesos::internal::slave::ProcessBasedIsolationModule::launchExecutor (this=0x110e4fbc0, slaveId=@0x110e4fbc0, frameworkId=@0x110e4fbc0, frameworkInfo=@0x110e4fbc0, executorInfo=@0x110e4fbc0, _=@0x5, directory=@0x7f7f5991c9f8, resources=@0x7f7f5991ca00, path=@0x7f7f5991ca38) at process_based_isolation_module.cpp:209
#7  0x000000010fc1c98c in std::tr1::_Function_handler<void ()(mesos::internal::slave::IsolationModule*), std::tr1::_Bind<std::tr1::_Mem_fn<void (mesos::internal::slave::IsolationModule::*)(mesos::SlaveID const&, mesos::FrameworkID const&, mesos::FrameworkInfo const&, mesos::ExecutorInfo const&, UUID const&, std::string const&, mesos::internal::Resources const&, Option<std::string> const&)> ()(std::tr1::_Placeholder<1>, mesos::SlaveID, mesos::FrameworkID, mesos::FrameworkInfo, mesos::ExecutorInfo, UUID, std::string, mesos::internal::Resources, Option<std::string>)> >::_M_invoke (__functor=@0x7fff7cca2c10, __a1=0x7f7f5991c8e8) at functional_iterate.h:214
#8  0x000000010fc22476 in std::tr1::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count () at boost_shared_ptr.h:133
#9  0x000000010fc22476 in std::tr1::__shared_ptr<std::tr1::function<void ()(mesos::internal::slave::IsolationModule*)>, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr () at /usr/include/c++/4.2.1/tr1/boost_shared_ptr.h:504
#10 0x000000010fc22476 in std::tr1::_Bind<void (*()(std::tr1::_Placeholder<1>, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::IsolationModule*)> >))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::IsolationModule*)> >)>::operator()<process::ProcessBase*> (this=0x7fff7cca2c10, __u1=@0x7f7f5991c8e8) at boost_shared_ptr.h:974
#11 0x000000010fc20f38 in std::tr1::_Function_handler<void ()(process::ProcessBase*), std::tr1::_Bind<void (*()(std::tr1::_Placeholder<1>, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::IsolationModule*)> >))(process::ProcessBase*, std::tr1::shared_ptr<std::tr1::function<void ()(mesos::internal::slave::IsolationModule*)> >)> >::_M_invoke (__a1=0x7f7f59912330, __functor=@0x7fff7cca2c10) at functional_iterate.h:502
#12 0x000000010fdd9c8a in std::tr1::function<void ()(process::ProcessBase*)>::operator() () at stl_deque.h:441
#13 0x000000010fd57c0c in process::ProcessBase::visit () at boost_shared_ptr.h:152
#14 0x000000010fd8051f in process::DispatchEvent::visit () at boost_shared_ptr.h:152
#15 0x000000010fd6be00 in process::ProcessManager::resume () at boost_shared_ptr.h:152
#16 0x000000010fd6ce46 in process::schedule () at boost_shared_ptr.h:152
#17 0x00007fff8d2c6742 in _pthread_start ()
#18 0x00007fff8d2b3181 in thread_start ()

                
> Don't do ExecutorLauncher in forked process but exec first instead.
> -------------------------------------------------------------------
>
>                 Key: MESOS-394
>                 URL: https://issues.apache.org/jira/browse/MESOS-394
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Hindman
>
> We've run into numerous issues where code executed in forked processes has deadlocked because resources (i.e., locks) from the parent process were not cleaned up (i.e., unlocked) in the forked process. Rather than continue this trend, we should always attempt to minimize the code executed in a forked process, and if we're doing anything fancy, do an exec right away. In particular, we should only be calling async-signal-safe functions in forked code.
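
To make the proposal in the description concrete, here is a minimal sketch of the "exec first" shape. It is not the actual ExecutorLauncher change; forkAndExec is a hypothetical name, and the sketch assumes that everything the child needs (paths, arguments) has already been built in the parent. Between fork() and exec it calls only execv, write and _exit, which POSIX lists as async-signal-safe:

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cerrno>

// Hypothetical helper (an assumption for illustration, not Mesos code).
// Runs `path` with `argv` in a child process and returns the child's
// exit status, or -1 on error or abnormal termination.
int forkAndExec(const char* path, char* const argv[])
{
  // All allocation, string formatting and logging must happen in the
  // parent, before this point. Another thread may hold a lock (malloc,
  // locale, logging) at the moment of fork(), and that lock would
  // never be released in the child.
  pid_t pid = fork();

  if (pid == -1) {
    return -1;
  }

  if (pid == 0) {
    // Child: async-signal-safe calls only from here on.
    execv(path, argv);

    // Only reached if execv failed. Report with write(), not iostreams.
    static const char message[] = "exec failed\n";
    ssize_t ignored = write(STDERR_FILENO, message, sizeof(message) - 1);
    (void) ignored;
    _exit(127);
  }

  // Parent: wait for the child, retrying if interrupted by a signal.
  int status = 0;
  while (waitpid(pid, &status, 0) == -1) {
    if (errno != EINTR) {
      return -1;
    }
  }

  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Any "fancy" setup (checkpointing, environment construction, and so on) would then live in the exec'ed program, which starts with a fresh address space and no inherited lock state.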
