Posted to issues@mesos.apache.org by "Yan Xu (Jira)" <ji...@apache.org> on 2019/10/11 02:35:00 UTC
[jira] [Created] (MESOS-10011) Operation feedback with stale agent ID crashes the master
Yan Xu created MESOS-10011:
------------------------------
Summary: Operation feedback with stale agent ID crashes the master
Key: MESOS-10011
URL: https://issues.apache.org/jira/browse/MESOS-10011
Project: Mesos
Issue Type: Bug
Components: agent, master
Affects Versions: 1.9.0
Reporter: Yan Xu
We have observed the following in our environment.
{noformat}
F1003 17:35:30.742681 58334 master.cpp:12528] Check failed: slave != nullptr f664c4a9-d1ca-4cd0-88e4-0a6acf20e629-S218
*** Check failure stack trace: ***
@ 0x7fd36ca9cf4d google::LogMessage::Fail()
@ 0x7fd36ca9f13d google::LogMessage::SendToLog()
@ 0x7fd36ca9ca87 google::LogMessage::Flush()
@ 0x7fd36ca9fbc9 google::LogMessageFatal::~LogMessageFatal()
@ 0x7fd36b5ae3bc mesos::internal::master::Master::removeOperation()
@ 0x7fd36b5b3446 mesos::internal::master::Master::updateOperationStatus()
{noformat}
This follows registration of an agent that has changed its agent ID due to losing its local state.
The check failure code is in [Master::removeOperation|https://github.com/apache/mesos/blob/558829eb24f4ad636348497075bbc0428a4794a4/src/master/master.cpp#L12451].
The masters would enter a crash loop unless the operation checkpoint state (i.e., {{resources_and_operations.state}}) on the offending agent is deleted.
Even though we try to minimize the cases where an agent would lose its state, it can still happen when the {{latest}} symlink is removed, either by an operator or automatically [in certain cases|https://github.com/apache/mesos/blob/558829eb24f4ad636348497075bbc0428a4794a4/src/slave/slave.cpp#L7719-L7725].
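To illustrate the failure mode behind {{Check failed: slave != nullptr}}, here is a minimal standalone sketch. It is not Mesos code: {{Agent}}, {{Registry}} and this {{removeOperation}} are hypothetical stand-ins, and the agent IDs in the usage note are made up. It only shows how a lookup by a stale agent ID returns null, and one possible way to tolerate that instead of aborting. Whether the master should drop such feedback is a design question this sketch does not answer.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the master's per-agent bookkeeping.
struct Agent {
  // Operation UUID -> last known operation state.
  std::map<std::string, std::string> operations;
};

// Hypothetical stand-in for the master's table of registered agents.
struct Registry {
  std::map<std::string, Agent> agents;  // Keyed by agent ID.

  // Returns nullptr when the ID is unknown, for example a stale ID that
  // the agent used before it lost its local state and re-registered.
  Agent* find(const std::string& agentId) {
    auto it = agents.find(agentId);
    return it == agents.end() ? nullptr : &it->second;
  }
};

// Removes an operation if both the agent and the operation are known.
// Returns false instead of aborting when the agent ID is not registered.
bool removeOperation(
    Registry& registry,
    const std::string& agentId,
    const std::string& operationUuid) {
  Agent* agent = registry.find(agentId);
  if (agent == nullptr) {
    // A fatal check at this point would terminate the whole process.
    return false;
  }
  return agent->operations.erase(operationUuid) > 0;
}
```

With an operation tracked under the new ID {{S219}}, calling {{removeOperation(registry, "S218", "op-1")}} with the old ID returns false and leaves the tracked operation untouched, while the same call with {{S219}} removes it.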
--
This message was sent by Atlassian Jira
(v8.3.4#803005)