Posted to issues@mesos.apache.org by "Andreas Peters (Jira)" <ji...@apache.org> on 2021/10/04 10:08:00 UTC

[jira] [Commented] (MESOS-10231) Mesos master crashes during framework teardown

    [ https://issues.apache.org/jira/browse/MESOS-10231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17423868#comment-17423868 ] 

Andreas Peters commented on MESOS-10231:
----------------------------------------

Can you show us the configuration you use to start Spark? It would be helpful to try it out ourselves.
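
For context: the fatal line in the log comes from a glog CHECK inside mesos::internal::master::Framework::untrackUnderRole(), and glog CHECK macros abort the whole process when their condition is false, which is why the master goes down during the teardown instead of just logging an error. A minimal sketch of that pattern (not the actual Mesos source; the variable name is only a placeholder):

{code:java}
// Minimal sketch of a glog-style fatal assertion, not the real Mesos code.
#include <glog/logging.h>

int main(int argc, char** argv) {
  google::InitGoogleLogging(argv[0]);

  // Hypothetical stand-in for the condition seen in the log:
  // totalOfferedResources.filter(allocatedToRole).empty()
  bool offeredResourcesEmpty = false;

  // If the condition is false, glog logs "Check failed: ..." plus a stack
  // trace and terminates the process with a fatal log message.
  CHECK(offeredResourcesEmpty)
      << "framework still has offered resources allocated to the role";

  return 0;
}
{code}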

> Mesos master crashes during framework teardown
> ----------------------------------------------
>
>                 Key: MESOS-10231
>                 URL: https://issues.apache.org/jira/browse/MESOS-10231
>             Project: Mesos
>          Issue Type: Bug
>          Components: framework, master
>    Affects Versions: 1.9.0
>         Environment: CentOS Linux release 7.9.2009
> Mesos version - 1.9.0
>            Reporter: Divyansh Jamuaar
>            Priority: Major
>
> I have set up a Mesos cluster with a single Mesos master, and I submit Spark jobs to it in "cluster" mode.
> After running a few Spark jobs successfully, the Mesos master crashes while trying to shut down one of the Spark frameworks, with the following error -
>  
> {code:java}
> F0928 14:34:57.678421 2093314 framework.cpp:671] Check failed: totalOfferedResources.filter(allocatedToRole).empty() 
> *** Check failure stack trace: ***
>     @     0x7f1e024ded2e  google::LogMessage::Fail()
>     @     0x7f1e024dec8d  google::LogMessage::SendToLog()
>     @     0x7f1e024de637  google::LogMessage::Flush()
>     @     0x7f1e024e191c  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7f1dff93978d  mesos::internal::master::Framework::untrackUnderRole()
>     @     0x7f1dffad004b  mesos::internal::master::Master::removeFramework()
>     @     0x7f1dfface859  mesos::internal::master::Master::teardown()
>     @     0x7f1dffa8ba25  mesos::internal::master::Master::receive()
>     @     0x7f1dffb2f1cf  ProtobufProcess<>::handlerMutM<>()
>     @     0x7f1dffbe6809  std::__invoke_impl<>()
>     @     0x7f1dffbdae22  std::__invoke<>()
>     @     0x7f1dffbc8079  _ZNSt5_BindIFPFvPN5mesos8internal6master6MasterEMS3_FvRKN7process4UPIDEONS0_9scheduler4CallEES8_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEES4_SD_St12_PlaceholderILi1EESO_ILi2EEEE6__callIvJS8_SL_EJLm0ELm1ELm2ELm3EEEET_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
>     @     0x7f1dffbaaae5  std::_Bind<>::operator()<>()
>     @     0x7f1dffb833c9  std::_Function_handler<>::_M_invoke()
>     @     0x7f1dff330281  std::function<>::operator()()
>     @     0x7f1dffb13329  ProtobufProcess<>::consume()
>     @     0x7f1dffa85436  mesos::internal::master::Master::_consume()
>     @     0x7f1dffa84ad5  mesos::internal::master::Master::consume()
>     @     0x7f1dffafb9ae  _ZNO7process12MessageEvent7consumeEPNS_13EventConsumerE
>     @     0x564c359f7002  process::ProcessBase::serve()
>     @     0x7f1e023a7bbd  process::ProcessManager::resume()
>     @     0x7f1e023a407c  _ZZN7process14ProcessManager12init_threadsEvENKUlvE_clEv
>     @     0x7f1e023cf1ba  _ZSt13__invoke_implIvZN7process14ProcessManager12init_threadsEvEUlvE_JEET_St14__invoke_otherOT0_DpOT1_
>     @     0x7f1e023cd9c9  _ZSt8__invokeIZN7process14ProcessManager12init_threadsEvEUlvE_JEENSt15__invoke_resultIT_JDpT0_EE4typeEOS4_DpOS5_
>     @     0x7f1e023cc482  _ZNSt6thread8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvEUlvE_EEE9_M_invokeIJLm0EEEEvSt12_Index_tupleIJXspT_EEE
>     @     0x7f1e023cb53b  _ZNSt6thread8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvEUlvE_EEEclEv
>     @     0x7f1e023ca3c4  _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvEUlvE_EEEEE6_M_runEv
>     @     0x7f1e051f419d  execute_native_thread_routine
>     @     0x7f1df4200ea5  start_thread
>     @     0x7f1df3f2996d  __clone
> {code}
>  
>  
> It seems that an assertion check is failing, which is treated as fatal, but I am not able to figure out the root cause of this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)