Posted to issues@mesos.apache.org by "Armand Grillet (JIRA)" <ji...@apache.org> on 2017/10/23 16:24:00 UTC

[jira] [Commented] (MESOS-7991) fatal, check failed !framework->recovered()

    [ https://issues.apache.org/jira/browse/MESOS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16215386#comment-16215386 ] 

Armand Grillet commented on MESOS-7991:
---------------------------------------

This could happen if there is a master failover and the agent then re-registers twice (https://github.com/apache/mesos/blob/b13c4c3683fd6bad702a7fb9e24cfc3414b921da/src/slave/slave.cpp#L1629). The statement in https://github.com/apache/mesos/blob/b13c4c3683fd6bad702a7fb9e24cfc3414b921da/src/master/master.cpp#L8070 thus does not seem correct, and the change https://github.com/apache/mesos/blob/b13c4c3683fd6bad702a7fb9e24cfc3414b921da/src/master/master.cpp#L8073 from the review request https://reviews.apache.org/r/53897/, which happened to follow this comment, should be removed.

The strange thing is that, according to the logs (master.cpp:7568), the tasks are known to the master but not to the agent. It seems unlikely that the agent kept its ID but not its tasks. Could you give more context about the agent and the registration attempt, and also share the master logs since the failover and the agent logs around that timeframe?

We should write a test reproducing the issue -(having a master + agent, launching a task, restarting the master, blocking framework re-registration, letting the agent re-register twice by spoofing the second re-registration)- and then remove line 8073.
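
To make the suspected sequence concrete, here is a toy model of it. This is only a sketch and not Mesos code: ToyMaster, Framework, reregisterAgent and reregisterFramework are hypothetical names, and the model simply assumes that the second re-registration reports fewer tasks than the master knows about, as the logs suggest. It returns false at the point where a check like !framework->recovered() would abort.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy model only: none of these types or functions are Mesos APIs.
struct Framework {
  bool recovered;  // true until the framework itself re-registers
};

// Task ID -> framework ID.
using Tasks = std::map<std::string, std::string>;

struct ToyMaster {
  std::map<std::string, Framework> frameworks;  // framework ID -> state
  std::map<std::string, Tasks> agentTasks;      // agent ID -> known tasks

  // Returns false where a `!framework->recovered()` style check
  // would have failed; returns true otherwise.
  bool reregisterAgent(const std::string& agentId, const Tasks& reported) {
    auto known = agentTasks.find(agentId);

    if (known == agentTasks.end()) {
      // First re-registration after failover: the master learns the
      // agent's tasks and marks their frameworks as recovered.
      for (const auto& entry : reported) {
        frameworks.emplace(entry.second, Framework{true});
      }
      agentTasks[agentId] = reported;
      return true;
    }

    // The agent is already known: reconcile the master's view of its
    // tasks against what the agent reports this time.
    for (const auto& entry : known->second) {
      if (reported.count(entry.first) == 0) {
        // Task known to the master but unknown to the agent.
        assert(frameworks.count(entry.second) == 1);
        if (frameworks.at(entry.second).recovered) {
          return false;  // the framework has not re-registered yet
        }
      }
    }
    return true;
  }

  // The framework re-registers, so it is no longer merely recovered.
  void reregisterFramework(const std::string& frameworkId) {
    frameworks[frameworkId].recovered = false;
  }
};
```

In this model, calling reregisterAgent twice for the same agent, with the second call reporting fewer tasks and no reregisterFramework call in between, yields false on the second call. Calling reregisterFramework in between makes the second call return true. A real test would of course have to go through the actual master and agent code paths.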

> fatal, check failed !framework->recovered()
> -------------------------------------------
>
>                 Key: MESOS-7991
>                 URL: https://issues.apache.org/jira/browse/MESOS-7991
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Jack Crawford
>            Assignee: Armand Grillet
>            Priority: Blocker
>              Labels: reliability
>
> mesos master crashed on what appears to be framework recovery
> mesos master version: 1.3.1
> mesos agent version: 1.3.1
> {code}
> W0920 14:58:54.756364 25452 master.cpp:7568] Task 862181ec-dffb-4c03-8807-5fb4c4e9a907 of framework 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with the agent
> W0920 14:58:54.756369 25452 master.cpp:7568] Task 9c21c48a-63ad-4d58-9e22-f720af19a644 of framework 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with the agent
> W0920 14:58:54.756376 25452 master.cpp:7568] Task 05c451f8-c48a-47bd-a235-0ceb9b3f8d0c of framework 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with the agent
> W0920 14:58:54.756381 25452 master.cpp:7568] Task e8641b1f-f67f-42fe-821c-09e5a290fc60 of framework 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with the agent
> W0920 14:58:54.756386 25452 master.cpp:7568] Task f838a03c-5cd4-47eb-8606-69b004d89808 of framework 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with the agent
> W0920 14:58:54.756392 25452 master.cpp:7568] Task 685ca5da-fa24-494d-a806-06e03bbf00bd of framework 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with the agent
> W0920 14:58:54.756397 25452 master.cpp:7568] Task 65ccf39b-5c46-4121-9fdd-21570e8068e6 of framework 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with the agent
> F0920 14:58:54.756404 25452 master.cpp:7601] Check failed: !framework->recovered()
> *** Check failure stack trace: ***
>     @     0x7f7bf80087ed  google::LogMessage::Fail()
>     @     0x7f7bf800a5a0  google::LogMessage::SendToLog()
>     @     0x7f7bf80083d3  google::LogMessage::Flush()
>     @     0x7f7bf800afc9  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7f7bf736fe7e  mesos::internal::master::Master::reconcileKnownSlave()
>     @     0x7f7bf739e612  mesos::internal::master::Master::_reregisterSlave()
>     @     0x7f7bf73a580e  _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master6MasterERKNS5_9SlaveInfoERKNS0_4UPIDERK6OptionINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIc
> EEEERKSt6vectorINS5_8ResourceESaISQ_EERKSP_INS5_12ExecutorInfoESaISV_EERKSP_INS5_4TaskESaIS10_EERKSP_INS5_13FrameworkInfoESaIS15_EERKSP_INS6_17Archive_FrameworkESaIS1A_EERKSL_RKSP_INS5_20SlaveInfo_CapabilityESaIS
> 1H_EERKNS0_6FutureIbEES9_SC_SM_SS_SX_S12_S17_S1C_SL_S1J_S1N_EEvRKNS0_3PIDIT_EEMS1R_FvT0_T1_T2_T3_T4_T5_T6_T7_T8_T9_T10_ET11_T12_T13_T14_T15_T16_T17_T18_T19_T20_T21_EUlS2_E_E9_M_invokeERKSt9_Any_dataOS2_
>     @     0x7f7bf7f5e69c  process::ProcessBase::visit()
>     @     0x7f7bf7f71403  process::ProcessManager::resume()
>     @     0x7f7bf7f7c127  _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
>     @     0x7f7bf60b5c80  (unknown)
>     @     0x7f7bf58c86ba  start_thread
>     @     0x7f7bf55fe3dd  (unknown)
> mesos-master.service: Main process exited, code=killed, status=6/ABRT
> mesos-master.service: Unit entered failed state.
> mesos-master.service: Failed with result 'signal'.
> {code}


