You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@mesos.apache.org by GitBox <gi...@apache.org> on 2021/09/11 08:51:46 UTC

[GitHub] [mesos] cf-natali opened a new pull request #408: Fixed an agent crash in case of duplicate task ID.

cf-natali opened a new pull request #408:
URL: https://github.com/apache/mesos/pull/408


   When using the command executor, if a task is started with the same ID
   as a previous task for which the agent still has the executor, reject
   the task instead of crashing.
   
   Closes MESOS-9657.
   
   Test before:
   
   ```
   [==========] Running 1 test from 1 test case.                                                                                                                                                                      
   [----------] Global test environment set-up.                                                                                                                                                                       
   [----------] 1 test from SlaveTest                                                                                                                                                                                 
   [ RUN      ] SlaveTest.DuplicateTaskIdCommandExecutor                                                                                                                                                              
   I0808 20:06:01.514927 1825472 exec.cpp:164] Version: 1.12.0                                                                                                                                                        
   I0808 20:06:01.529322 1825471 exec.cpp:237] Executor registered on agent 10a82bd5-feff-49fe-b666-967f6751915b-S0
   I0808 20:06:01.532748 1825465 executor.cpp:190] Received SUBSCRIBED event
   I0808 20:06:01.533928 1825465 executor.cpp:194] Subscribed executor on thinkpad
   I0808 20:06:01.534307 1825465 executor.cpp:190] Received LAUNCH event
   I0808 20:06:01.535918 1825465 executor.cpp:722] Starting task duplicate-task-id
   I0808 20:06:01.556455 1825465 executor.cpp:740] Forked command at 1825474
   --author
   I0808 20:06:01.715481 1825465 executor.cpp:1041] Command exited with status 0 (pid: 1825474)
   F0808 20:06:02.030985 1825394 slave.cpp:3194] Check failed: executor == nullptr 
   *** Check failure stack trace: ***
       @     0x7fc121cde750  google::LogMessage::Fail() 
       @     0x7fc121cde6af  google::LogMessage::SendToLog()
       @     0x7fc121cde05a  google::LogMessage::Flush()
       @     0x7fc121ce1334  google::LogMessageFatal::~LogMessageFatal()
       @     0x7fc11fdecee3  mesos::internal::slave::Slave::__run()
       @     0x7fc11fe8dc42  _ZZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS1_13FrameworkInfoERKNS1_12ExecutorInfoERK6OptionINS1_8TaskInfoEERKSB_INS1_13TaskGroupInfoEERKSt6vectorINS2_19ResourceVersionUUIDES
   aISL_EERKSB_IbEbS7_SA_SF_SJ_SP_SS_bEEvRKNS_3PIDIT_EEMSU_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_ENKUlOS5_OS8_OSD_OSH_OSN_OSQ_ObPNS_11ProcessBaseEE_clES1L_S1M_S1N_S1O_S1P_S1Q_S1R_S1T_
       @     0x7fc11ff84803  _ZN5cpp176invokeIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS3_13FrameworkInfoERKNS3_12ExecutorInfoERK6OptionINS3_8TaskInfoEERKSD_INS3_13TaskGroupInfoEERKSt6vectorINS4_19Resour
   ceVersionUUIDESaISN_EERKSD_IbEbS9_SC_SH_SL_SR_SU_bEEvRKNS1_3PIDIT_EEMSW_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOS7_OSA_OSF_OSJ_OSP_OSS_ObPNS1_11ProcessBaseEE_JS7_SA_SF_SJ_SP_SS_bS1V_EEEDTclcl
   7forwardISW_Efp_Espcl7forwardIT0_Efp0_EEEOSW_DpOS1X_ 
       @     0x7fc11ff734a9  _ZN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS4_13FrameworkInfoERKNS4_12ExecutorInfoERK6OptionINS4_8TaskInfoEERKSE_INS4_13TaskGroupInfoEERKSt6vectorIN
   S5_19ResourceVersionUUIDESaISO_EERKSE_IbEbSA_SD_SI_SM_SS_SV_bEEvRKNS2_3PIDIT_EEMSX_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOS8_OSB_OSG_OSK_OSQ_OST_ObPNS2_11ProcessBaseEE_JS8_SB_SG_SK_SQ_ST_bSt
   12_PlaceholderILi1EEEE13invoke_expandIS1X_St5tupleIJS8_SB_SG_SK_SQ_ST_bS1Z_EES22_IJOS1W_EEJLm0ELm1ELm2ELm3ELm4ELm5ELm6ELm7EEEEDTcl6invokecl7forwardISX_Efp_Espcl6expandcl3getIXT2_EEcl7forwardIS11_Efp0_EEcl7forwar
   dIS12_Efp2_EEEEOSX_OS11_N5cpp1416integer_sequenceImJXspT2_EEEEOS12_
       @     0x7fc11ff6b32c  _ZNO6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS4_13FrameworkInfoERKNS4_12ExecutorInfoERK6OptionINS4_8TaskInfoEERKSE_INS4_13TaskGroupInfoEERKSt6vectorI
   NS5_19ResourceVersionUUIDESaISO_EERKSE_IbEbSA_SD_SI_SM_SS_SV_bEEvRKNS2_3PIDIT_EEMSX_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOS8_OSB_OSG_OSK_OSQ_OST_ObPNS2_11ProcessBaseEE_JS8_SB_SG_SK_SQ_ST_bS
   t12_PlaceholderILi1EEEEclIJS1W_EEEDTcl13invoke_expandcl4movedtdefpT1fEcl4movedtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0ELm1ELm2ELm3ELm4ELm5ELm6ELm7EEEE_Ecl16forward_as_tuplespcl7forwardIT_Efp_EEEEDpOS
   25_
       @     0x7fc11ff65f71  _ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS6_13FrameworkInfoERKNS6_12ExecutorInfoERK6OptionINS6_8TaskInfoEERKSG_INS6_13TaskGroupInfo
   EERKSt6vectorINS7_19ResourceVersionUUIDESaISQ_EERKSG_IbEbSC_SF_SK_SO_SU_SX_bEEvRKNS4_3PIDIT_EEMSZ_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOSA_OSD_OSI_OSM_OSS_OSV_ObPNS4_11ProcessBaseEE_JSA_SD_
   SI_SM_SS_SV_bSt12_PlaceholderILi1EEEEEJS1Y_EEEDTclcl7forwardISZ_Efp_Espcl7forwardIT0_Efp0_EEEOSZ_DpOS23_
       @     0x7fc11ff632f3  _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS7_13FrameworkInfoERKNS7_12ExecutorInfoERK6OptionINS7_8TaskInfoEERKSH_INS7_13TaskGroupIn
   foEERKSt6vectorINS8_19ResourceVersionUUIDESaISR_EERKSH_IbEbSD_SG_SL_SP_SV_SY_bEEvRKNS5_3PIDIT_EEMS10_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOSB_OSE_OSJ_OSN_OST_OSW_ObPNS5_11ProcessBaseEE_JSB_
   SE_SJ_SN_ST_SW_bSt12_PlaceholderILi1EEEEEJS1Z_EEEvOS10_DpOT0_
       @     0x7fc11ff5ffce  _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal5slave5SlaveERKNSA_13FrameworkInfoERKNSA_12ExecutorInfoERK6OptionI
   NSA_8TaskInfoEERKSK_INSA_13TaskGroupInfoEERKSt6vectorINSB_19ResourceVersionUUIDESaISU_EERKSK_IbEbSG_SJ_SO_SS_SY_S11_bEEvRKNS1_3PIDIT_EEMS13_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOSE_OSH_OSM_
   OSQ_OSW_OSZ_ObS3_E_JSE_SH_SM_SQ_SW_SZ_bSt12_PlaceholderILi1EEEEEEclEOS3_
       @     0x7fc121c15b03  _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_
       @     0x7fc121bda5c8  process::ProcessBase::consume()
       @     0x7fc121c02f30  _ZNO7process13DispatchEvent7consumeEPNS_13EventConsumerE
       @     0x5600da474f66  process::ProcessBase::serve()
       @     0x7fc121bd6f7b  process::ProcessManager::resume()
       @     0x7fc121bd33b0  _ZZN7process14ProcessManager12init_threadsEvENKUlvE_clEv
       @     0x7fc121bfe7b6  _ZSt13__invoke_implIvZN7process14ProcessManager12init_threadsEvEUlvE_JEET_St14__invoke_otherOT0_DpOT1_
       @     0x7fc121bfcfe7  _ZSt8__invokeIZN7process14ProcessManager12init_threadsEvEUlvE_JEENSt15__invoke_resultIT_JDpT0_EE4typeEOS4_DpOS5_
       @     0x7fc121bfbace  _ZNSt6thread8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvEUlvE_EEE9_M_invokeIJLm0EEEEvSt12_Index_tupleIJXspT_EEE
       @     0x7fc121bfabc0  _ZNSt6thread8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvEUlvE_EEEclEv
       @     0x7fc121bf9aa0  _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvEUlvE_EEEEE6_M_runEv
       @     0x7fc113126ed0  (unknown)
       @     0x7fc112ef8ea7  start_thread
       @     0x7fc112e28def  clone
   I0808 20:06:02.718780 1825473 process.cpp:935] Stopped the socket accept loop
   Aborted (core dumped)
   ```
   
   @asekretenko @qianzhangxa 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@mesos.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [mesos] cf-natali commented on pull request #408: Fixed an agent crash in case of duplicate task ID.

Posted by GitBox <gi...@apache.org>.
cf-natali commented on pull request #408:
URL: https://github.com/apache/mesos/pull/408#issuecomment-933005385


   Unless anyone opposes I'll merge this tomorrow.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@mesos.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [mesos] cf-natali merged pull request #408: Fixed an agent crash in case of duplicate task ID.

Posted by GitBox <gi...@apache.org>.
cf-natali merged pull request #408:
URL: https://github.com/apache/mesos/pull/408


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@mesos.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [mesos] cf-natali commented on pull request #408: Fixed an agent crash in case of duplicate task ID.

Posted by GitBox <gi...@apache.org>.
cf-natali commented on pull request #408:
URL: https://github.com/apache/mesos/pull/408#issuecomment-927349879


   @asekretenko @qianzhangxa So what do you think, OK for me to merge?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@mesos.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [mesos] cf-natali commented on pull request #408: Fixed an agent crash in case of duplicate task ID.

Posted by GitBox <gi...@apache.org>.
cf-natali commented on pull request #408:
URL: https://github.com/apache/mesos/pull/408#issuecomment-923319528


   @asekretenko Are you happy with the MR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@mesos.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [mesos] asekretenko commented on a change in pull request #408: Fixed an agent crash in case of duplicate task ID.

Posted by GitBox <gi...@apache.org>.
asekretenko commented on a change in pull request #408:
URL: https://github.com/apache/mesos/pull/408#discussion_r706588131



##########
File path: src/slave/slave.cpp
##########
@@ -3191,7 +3191,17 @@ void Slave::__run(
     if (taskGroup.isNone() && task->has_command()) {
       // We are dealing with command task; a new command executor will be
       // launched.
-      CHECK(executor == nullptr);
+      // It is possible for an executor with this ID to already exist, if the
+      // TaskID was re-used - see MESOS-9657. If this happens, we have no
+      // choice but to drop the task.
+      if (executor != nullptr) {
+        sendTaskDroppedUpdate(
+            TaskStatus::REASON_TASK_INVALID,
+            "Master wants to launch executor, but one already exists "

Review comment:
       Looks like in some/many cases it is the framework which is responsible for creating this situation? Prehaps something not attributing the error explicitly to the master could be better, like "Cannot reuse an already existing executor for a command task" ?
   
   Or do we have a similar check in master (unreliable, as master is not the source of truth about executors), and this only happens when the master is not aware that an executor with this ID already exists?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@mesos.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [mesos] cf-natali commented on a change in pull request #408: Fixed an agent crash in case of duplicate task ID.

Posted by GitBox <gi...@apache.org>.
cf-natali commented on a change in pull request #408:
URL: https://github.com/apache/mesos/pull/408#discussion_r706805644



##########
File path: src/slave/slave.cpp
##########
@@ -3191,7 +3191,17 @@ void Slave::__run(
     if (taskGroup.isNone() && task->has_command()) {
       // We are dealing with command task; a new command executor will be
       // launched.
-      CHECK(executor == nullptr);
+      // It is possible for an executor with this ID to already exist, if the
+      // TaskID was re-used - see MESOS-9657. If this happens, we have no
+      // choice but to drop the task.
+      if (executor != nullptr) {
+        sendTaskDroppedUpdate(
+            TaskStatus::REASON_TASK_INVALID,
+            "Master wants to launch executor, but one already exists "

Review comment:
       > Or do we have a similar check in master (unreliable, as master is not the source of truth about executors), and this only happens when the master is not aware that an executor with this ID already exists?
   
   Yes, exactly that - there is a check for task uniqueness in the [master validation](https://github.com/apache/mesos/blob/af6922288f0dd9f530305641a4f88c3bc9b48dab/src/master/validation.cpp#L1415).
   It only checks for task ID though.
   
   "Cannot reuse an already existing executor for a command task" sounds good though, let me change to that.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@mesos.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [mesos] cf-natali commented on a change in pull request #408: Fixed an agent crash in case of duplicate task ID.

Posted by GitBox <gi...@apache.org>.
cf-natali commented on a change in pull request #408:
URL: https://github.com/apache/mesos/pull/408#discussion_r706805644



##########
File path: src/slave/slave.cpp
##########
@@ -3191,7 +3191,17 @@ void Slave::__run(
     if (taskGroup.isNone() && task->has_command()) {
       // We are dealing with command task; a new command executor will be
       // launched.
-      CHECK(executor == nullptr);
+      // It is possible for an executor with this ID to already exist, if the
+      // TaskID was re-used - see MESOS-9657. If this happens, we have no
+      // choice but to drop the task.
+      if (executor != nullptr) {
+        sendTaskDroppedUpdate(
+            TaskStatus::REASON_TASK_INVALID,
+            "Master wants to launch executor, but one already exists "

Review comment:
       > Or do we have a similar check in master (unreliable, as master is not the source of truth about executors), and this only happens when the master is not aware that an executor with this ID already exists?
   
   Yes, exactly that - there is a check for task uniqueness in the [master validation](https://github.com/apache/mesos/blob/af6922288f0dd9f530305641a4f88c3bc9b48dab/src/master/validation.cpp#L1415).
   
   "Cannot reuse an already existing executor for a command task" sounds good though, if you prefer I can change it to that?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@mesos.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org