Posted to issues@mesos.apache.org by "Yan Xu (Jira)" <ji...@apache.org> on 2019/10/11 02:42:00 UTC
[jira] [Commented] (MESOS-10011) Operation feedback with stale agent ID crashes the master

    [ https://issues.apache.org/jira/browse/MESOS-10011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16949069#comment-16949069 ] 

Yan Xu commented on MESOS-10011:
--------------------------------

In our environment we only use old-style RESERVE/CREATE persistent volumes. A plausible case is that the scheduler fails to acknowledge the operation feedback, so the checkpointed update still exists with the original agent ID in it.

After the agent loses its state, because [resources_and_operations.state|https://github.com/apache/mesos/blob/558829eb24f4ad636348497075bbc0428a4794a4/src/slave/paths.hpp#L74] lives outside the [slaves|https://github.com/apache/mesos/blob/558829eb24f4ad636348497075bbc0428a4794a4/src/slave/paths.hpp#L59] state, the unacknowledged operations don't get cleaned up, and they still carry the stale agent ID.
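
To make the failure mode concrete, here is a minimal, self-contained sketch of a lookup by a stale agent ID. It is not the Mesos code: {{Agent}}, {{Operation}}, {{Registry}}, {{Outcome}} and {{handleTerminalUpdate}} are hypothetical names invented for illustration. It only shows that an update checkpointed under the old ID finds no registered agent, and it shows one possible way to tolerate that (dropping the update) instead of failing a check. Whether dropping is the right behavior for the master is an assumption, not something established in this ticket.

```cpp
#include <map>
#include <string>

// Hypothetical, simplified stand-ins; these are NOT the real Mesos types.
struct Agent {
  std::string id;
};

struct Operation {
  std::string uuid;
  std::string agentId;  // agent ID checkpointed together with the operation
};

struct Registry {
  // Agents currently registered, keyed by their current ID.
  std::map<std::string, Agent> agents;

  // Returns nullptr when no agent is registered under `id`.
  const Agent* find(const std::string& id) const {
    std::map<std::string, Agent>::const_iterator it = agents.find(id);
    return it == agents.end() ? nullptr : &it->second;
  }
};

enum class Outcome { REMOVED, DROPPED_STALE_AGENT };

// Handles a terminal operation update. If the operation refers to an agent
// ID that is no longer registered (for example, because the agent lost its
// state and re-registered under a new ID), the update is dropped rather
// than treated as a fatal invariant violation.
Outcome handleTerminalUpdate(const Registry& registry, const Operation& op) {
  if (registry.find(op.agentId) == nullptr) {
    return Outcome::DROPPED_STALE_AGENT;
  }
  return Outcome::REMOVED;
}
```

In this sketch, an operation checkpointed under the old ID while the registry only knows the new ID yields {{Outcome::DROPPED_STALE_AGENT}}; an operation whose ID matches a registered agent yields {{Outcome::REMOVED}}.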

> Operation feedback with stale agent ID crashes the master
> ---------------------------------------------------------
>
>                 Key: MESOS-10011
>                 URL: https://issues.apache.org/jira/browse/MESOS-10011
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, master
>    Affects Versions: 1.9.0
>            Reporter: Yan Xu
>            Priority: Critical
>
> We have observed the following in our environment.
> {noformat}
> F1003 17:35:30.742681 58334 master.cpp:12528] Check failed: slave != nullptr f664c4a9-d1ca-4cd0-88e4-0a6acf20e629-S218
> *** Check failure stack trace: ***
>     @     0x7fd36ca9cf4d  google::LogMessage::Fail()
>     @     0x7fd36ca9f13d  google::LogMessage::SendToLog()
>     @     0x7fd36ca9ca87  google::LogMessage::Flush()
>     @     0x7fd36ca9fbc9  google::LogMessageFatal::~LogMessageFatal()
>     @     0x7fd36b5ae3bc  mesos::internal::master::Master::removeOperation()
>     @     0x7fd36b5b3446  mesos::internal::master::Master::updateOperationStatus()
> {noformat}
> This follows registration of an agent that has changed its agent ID due to losing its local state.
> The check failure code is in [Master::removeOperation|https://github.com/apache/mesos/blob/558829eb24f4ad636348497075bbc0428a4794a4/src/master/master.cpp#L12451].
> The masters would enter a crash loop unless the operation checkpoint state (i.e., {{resources_and_operations.state}}) on the offending agent is deleted.
> Even though we try to minimize the cases where an agent would lose its state, it can still happen when the {{latest}} symlink is removed either by an operator or automatically [in certain cases|https://github.com/apache/mesos/blob/558829eb24f4ad636348497075bbc0428a4794a4/src/slave/slave.cpp#L7719-L7725].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)