Posted to issues@mesos.apache.org by "Charles Natali (Jira)" <ji...@apache.org> on 2021/10/16 11:51:00 UTC
[jira] [Assigned] (MESOS-9657) Launching a command task twice can crash the agent

     [ https://issues.apache.org/jira/browse/MESOS-9657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Charles Natali reassigned MESOS-9657:
-------------------------------------

    Fix Version/s: 1.12.0
         Assignee: Charles Natali
       Resolution: Fixed

> Launching a command task twice can crash the agent
> --------------------------------------------------
>
>                 Key: MESOS-9657
>                 URL: https://issues.apache.org/jira/browse/MESOS-9657
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benno Evers
>            Assignee: Charles Natali
>            Priority: Major
>             Fix For: 1.12.0
>
>
> When launching a command task, the agent asserts (via CHECK) that the framework has no existing executor for that task:
> {noformat}
>       // We are dealing with command task; a new command executor will be
>       // launched.
>       CHECK(executor == nullptr);
> {noformat}
> and afterwards an executor is created whose executor id is the same as the task id:
> {noformat}
>   // (slave.cpp)
>   // Either the master explicitly requests launching a new executor
>   // or we are in the legacy case of launching one if there wasn't
>   // one already. Either way, let's launch executor now.
>   if (executor == nullptr) {
>     Try<Executor*> added = framework->addExecutor(executorInfo);
>   [...]
> {noformat}
> This means that relaunching a task with the same task id before its executor has been removed finds the old executor still registered, fails the CHECK, and crashes the agent:
> {noformat}
> F0315 16:39:32.822818 38112 slave.cpp:2865] Check failed: executor == nullptr 
> *** Check failure stack trace: ***
>     @     0x7feb29a407af  google::LogMessage::Flush()
>     @     0x7feb29a43c3f  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7feb28a5a886  mesos::internal::slave::Slave::__run()
>     @     0x7feb28af4f0e  _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal5slave5SlaveERKNSA_13FrameworkInfoERKNSA_12ExecutorInfoERK6OptionINSA_8TaskInfoEERKSK_INSA_13TaskGroupInfoEERKSt6vectorINSB_19ResourceVersionUUIDESaISU_EERKSK_IbESG_SJ_SO_SS_SY_S11_EEvRKNS1_3PIDIT_EEMS13_FvT0_T1_T2_T3_T4_T5_EOT6_OT7_OT8_OT9_OT10_OT11_EUlOSE_OSH_OSM_OSQ_OSW_OSZ_S3_E_JSE_SH_SM_SQ_SW_SZ_St12_PlaceholderILi1EEEEEEclEOS3_
>     @     0x7feb2998a620  process::ProcessBase::consume()
>     @     0x7feb29987675  process::ProcessManager::resume()
>     @     0x7feb299a2d2b  _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvE3$_8EEEEE6_M_runEv
>     @     0x7feb2632f523  (unknown)
>     @     0x7feb25e40594  start_thread
>     @     0x7feb25b73e6f  __GI___clone
> Aborted (core dumped)
> {noformat}
> Instead of crashing, the agent should just drop the task with an appropriate error in this case.
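
The behaviour requested in the last sentence can be illustrated with a minimal, self-contained sketch. It is not the actual Mesos patch: `Executor`, `Framework`, `LaunchResult` and `launchCommandTask` are hypothetical, heavily simplified stand-ins invented for this example, and the real agent code in slave.cpp is considerably more involved. The sketch only shows the idea of replacing the fatal check with an error that lets the caller drop the task.

```cpp
#include <map>
#include <string>

// Hypothetical stand-ins for the agent's bookkeeping. These are NOT the
// real Mesos classes; they exist only to make the example compile.
struct Executor {
  std::string id;
};

struct Framework {
  std::map<std::string, Executor> executors;

  // Returns the executor with the given id, or nullptr if there is none.
  Executor* getExecutor(const std::string& id) {
    std::map<std::string, Executor>::iterator it = executors.find(id);
    return it == executors.end() ? nullptr : &it->second;
  }
};

// Outcome of a launch attempt: either success, or a reason for dropping.
struct LaunchResult {
  bool ok;
  std::string error;
};

// Launches a command task. A command executor reuses the task id as its
// executor id, so a leftover executor with that id means a conflict.
LaunchResult launchCommandTask(Framework& framework, const std::string& taskId) {
  if (framework.getExecutor(taskId) != nullptr) {
    // The crashing variant does `CHECK(executor == nullptr)` here, which
    // aborts the whole process. Reporting an error lets the caller drop
    // the task instead.
    return LaunchResult{
        false, "Executor '" + taskId + "' already exists; dropping task"};
  }

  framework.executors.emplace(taskId, Executor{taskId});
  return LaunchResult{true, ""};
}
```

In this sketch a second `launchCommandTask(framework, "task-1")` issued before the first executor is erased returns `ok == false` with a non-empty error, and the process keeps running. How the real agent reports the dropped task to the framework is outside the scope of the example.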



--
This message was sent by Atlassian Jira
(v8.3.4#803005)