Posted to issues@mesos.apache.org by "Aram (Jira)" <ji...@apache.org> on 2020/12/07 12:57:00 UTC

[jira] [Commented] (MESOS-9657) Launching a command task twice can crash the agent

    [ https://issues.apache.org/jira/browse/MESOS-9657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17245188#comment-17245188 ] 

Aram commented on MESOS-9657:
------------------------------

Guys, we are also hitting this issue. Is there an ETA, or at least a target release milestone, for a fix?
We are running version 1.9.0, but the code is the same in master.


{code:java}
Dec 07 11:16:05 ip-10-103-6-133 mesos-slave[29863]: I1207 11:16:05.194779 29877 slave.cpp:2130] Got assigned task 'ct:1607336160000:0:TAKS_NAME:' for framework 7c85369d-7ed2-4634-afe1-f59ba555e427-0274
Dec 07 11:16:05 ip-10-103-6-133 mesos-slave[29863]: I1207 11:16:05.196017 29877 slave.cpp:2504] Authorizing task 'ct:1607336160000:0:TAKS_NAME:' for framework 7c85369d-7ed2-4634-afe1-f59ba555e427-0274
Dec 07 11:16:05 ip-10-103-6-133 mesos-slave[29863]: I1207 11:16:05.198606 29877 slave.cpp:2977] Launching task 'ct:1607336160000:0:TAKS_NAME:' for framework 7c85369d-7ed2-4634-afe1-f59ba555e427-0274
Dec 07 11:16:05 ip-10-103-6-133 mesos-slave[29863]: F1207 11:16:05.198640 29877 slave.cpp:2990] Check failed: executor == nullptr
Dec 07 11:16:05 ip-10-103-6-133 mesos-slave[29863]: *** Check failure stack trace: ***
Dec 07 11:16:05 ip-10-103-6-133 mesos-slave[29863]: @ 0x7fac89eb7b1d google::LogMessage::Fail()
Dec 07 11:16:05 ip-10-103-6-133 mesos-slave[29863]: @ 0x7fac89eb9dfd google::LogMessage::SendToLog()
Dec 07 11:16:05 ip-10-103-6-133 mesos-slave[29863]: @ 0x7fac89eb76ab google::LogMessage::Flush()
Dec 07 11:16:05 ip-10-103-6-133 mesos-slave[29863]: @ 0x7fac89eba859 google::LogMessageFatal::~LogMessageFatal()
Dec 07 11:16:05 ip-10-103-6-133 mesos-slave[29863]: @ 0x7fac8900a27d mesos::internal::slave::Slave::__run()
Dec 07 11:16:05 ip-10-103-6-133 mesos-slave[29863]: @ 0x7fac8902db71 _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal5slave5SlaveERKNSA_13FrameworkInfoERKNSA_12ExecutorInfoERK6OptionINSA_8TaskInfoEERKSK_INSA_13TaskGroupInfoEERKSt6vectorINSB_19ResourceVersionUUIDESaISU_EERKSK_IbESG_SJ_SO_SS_SY_S11_EEvRKNS1_3PIDIT_EEMS13_FvT0_T1_T2_T3_T4_T5_EOT6_OT7_OT8_OT9_OT10_OT11_EUlOSE_OSH_OSM_OSQ_OSW_OSZ_S3_E_JSE_SH_SM_SQ_SW_SZ_St12_PlaceholderILi1EEEEEEclEOS3_
Dec 07 11:16:05 ip-10-103-6-133 mesos-slave[29863]: @ 0x7fac89df5b91 process::ProcessBase::consume()
Dec 07 11:16:05 ip-10-103-6-133 mesos-slave[29863]: @ 0x7fac89e1af77 process::ProcessManager::resume()
Dec 07 11:16:05 ip-10-103-6-133 mesos-slave[29863]: @ 0x7fac89e1eb36 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
Dec 07 11:16:05 ip-10-103-6-133 mesos-slave[29863]: @ 0x7fac8a0d54d0 execute_native_thread_routine
Dec 07 11:16:05 ip-10-103-6-133 mesos-slave[29863]: @ 0x7fac8621240b start_thread
Dec 07 11:16:05 ip-10-103-6-133 mesos-slave[29863]: @ 0x7fac859aee7f __GI___clone
Dec 07 11:16:05 ip-10-103-6-133 systemd[1]: mesos-slave.service: main process exited, code=killed, status=6/ABRT{code}

> Launching a command task twice can crash the agent
> --------------------------------------------------
>
>                 Key: MESOS-9657
>                 URL: https://issues.apache.org/jira/browse/MESOS-9657
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benno Evers
>            Priority: Major
>
> When launching a command task, we verify that the framework has no existing executor for that task:
> {noformat}
>       // We are dealing with command task; a new command executor will be
>       // launched.
>       CHECK(executor == nullptr);
> {noformat}
> and afterwards an executor is created with the same executor id as the task id:
> {noformat}
>   // (slave.cpp)
>   // Either the master explicitly requests launching a new executor
>   // or we are in the legacy case of launching one if there wasn't
>   // one already. Either way, let's launch executor now.
>   if (executor == nullptr) {
>     Try<Executor*> added = framework->addExecutor(executorInfo);
>   [...]
> {noformat}
> This means that if we relaunch the task with the same task id before the executor is removed, it will crash the agent:
> {noformat}
> F0315 16:39:32.822818 38112 slave.cpp:2865] Check failed: executor == nullptr 
> *** Check failure stack trace: ***
>     @     0x7feb29a407af  google::LogMessage::Flush()
>     @     0x7feb29a43c3f  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7feb28a5a886  mesos::internal::slave::Slave::__run()
>     @     0x7feb28af4f0e  _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal5slave5SlaveERKNSA_13FrameworkInfoERKNSA_12ExecutorInfoERK6OptionINSA_8TaskInfoEERKSK_INSA_13TaskGroupInfoEERKSt6vectorINSB_19ResourceVersionUUIDESaISU_EERKSK_IbESG_SJ_SO_SS_SY_S11_EEvRKNS1_3PIDIT_EEMS13_FvT0_T1_T2_T3_T4_T5_EOT6_OT7_OT8_OT9_OT10_OT11_EUlOSE_OSH_OSM_OSQ_OSW_OSZ_S3_E_JSE_SH_SM_SQ_SW_SZ_St12_PlaceholderILi1EEEEEEclEOS3_
>     @     0x7feb2998a620  process::ProcessBase::consume()
>     @     0x7feb29987675  process::ProcessManager::resume()
>     @     0x7feb299a2d2b  _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvE3$_8EEEEE6_M_runEv
>     @     0x7feb2632f523  (unknown)
>     @     0x7feb25e40594  start_thread
>     @     0x7feb25b73e6f  __GI___clone
> Aborted (core dumped)
> {noformat}
> Instead of crashing, the agent should just drop the task with an appropriate error in this case.
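
The behaviour requested in the description can be illustrated with a minimal, self-contained C++17 sketch. It is not the Mesos code: `Framework`, `Executor`, `getExecutor` and `launchCommandTask` are hypothetical, simplified stand-ins, and the only assumption carried over from the issue is that a command task's executor id equals its task id. Instead of aborting on a failed check, the launch path returns an error so the caller can drop the task.

```cpp
#include <map>
#include <memory>
#include <optional>
#include <string>

// Hypothetical, simplified stand-ins; these are NOT the real Mesos types.
struct Executor {
  std::string id;
};

struct Framework {
  // Executors that have been added and not yet removed, keyed by executor id.
  std::map<std::string, std::unique_ptr<Executor>> executors;

  // Returns the executor with the given id, or nullptr if there is none.
  Executor* getExecutor(const std::string& id) const {
    auto it = executors.find(id);
    return it == executors.end() ? nullptr : it->second.get();
  }
};

// Tries to launch a command task. Returns std::nullopt on success, or an
// error message if the task has to be dropped.
std::optional<std::string> launchCommandTask(
    Framework& framework, const std::string& taskId) {
  // Assumption from the issue: the command executor reuses the task id as
  // its executor id, so a leftover executor means the id is still in use.
  if (framework.getExecutor(taskId) != nullptr) {
    // Report the conflict to the caller instead of aborting the process.
    return "Dropping task '" + taskId +
           "': an executor with the same id has not been removed yet";
  }

  framework.executors[taskId] = std::make_unique<Executor>(Executor{taskId});
  return std::nullopt;
}
```

In this sketch, calling `launchCommandTask` twice with the same task id succeeds the first time and returns an error message the second time, while the process keeps running. How the real agent would report the dropped task back to the framework is outside the scope of the sketch.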


