Posted to issues@mesos.apache.org by "Greg Mann (JIRA)" <ji...@apache.org> on 2019/04/18 18:16:00 UTC

[jira] [Comment Edited] (MESOS-9609) Master check failure when marking agent unreachable

    [ https://issues.apache.org/jira/browse/MESOS-9609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821377#comment-16821377 ] 

Greg Mann edited comment on MESOS-9609 at 4/18/19 6:15 PM:
-----------------------------------------------------------

It looks like {{Master::\_markUnreachable()}} is executed in a continuation after we perform a registry operation, but we assume that the framework still exists when we synchronously invoke {{\_\_removeSlave()}}. The framework could be removed between {{markUnreachable()}} and {{\_markUnreachable()}}, in which case the non-NULL check on the framework in {{\_\_removeSlave()}} fails, as seen in the log below.
https://github.com/apache/mesos/blob/e2b95cdf26844c6c32aa4e44c3582b33570bd330/src/master/master.cpp#L9147-L9156
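
The interleaving described above can be sketched in isolation. This is a toy model, not the Mesos code and not the fix that was applied: {{ToyMaster}}, its {{mailbox}} queue, the {{skipped}} counter, and {{replayRace()}} are hypothetical names invented for illustration, and skipping a missing framework is only one possible way to handle it. The point is that the continuation has to re-validate the framework lookup, because other events may have been processed since the first step ran.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins; none of these are the real Mesos types.
struct Framework { std::string id; };

struct Agent {
  std::string id;
  std::vector<std::string> frameworkIds;  // frameworks with tasks on this agent
};

class ToyMaster {
public:
  std::map<std::string, Framework> frameworks;
  std::map<std::string, Agent> agents;

  // Stand-in for an actor's event queue: continuations run later, in order.
  std::queue<std::function<void()>> mailbox;

  // Frameworks that were already gone when the continuation ran.
  int skipped = 0;

  // Step 1: pretend to start the registry update and defer the rest.
  void markUnreachable(const std::string& agentId) {
    mailbox.push([this, agentId]() { _markUnreachable(agentId); });
  }

  // Another event that may be handled between step 1 and step 2.
  void removeFramework(const std::string& frameworkId) {
    frameworks.erase(frameworkId);
  }

  // Step 2: the continuation. State may have changed since step 1.
  void _markUnreachable(const std::string& agentId) {
    auto it = agents.find(agentId);
    if (it == agents.end()) {
      return;  // the agent itself may also be gone by now
    }

    for (const std::string& frameworkId : it->second.frameworkIds) {
      // Re-validate instead of assuming the framework still exists.
      if (frameworks.count(frameworkId) == 0) {
        ++skipped;
        continue;
      }
      // ... per-framework cleanup would go here ...
    }

    agents.erase(it);
  }

  // Run every queued continuation.
  void drain() {
    while (!mailbox.empty()) {
      std::function<void()> f = std::move(mailbox.front());
      mailbox.pop();
      f();
    }
  }
};

// Replays the problematic ordering and returns how many frameworks
// were missing by the time the continuation ran.
int replayRace() {
  ToyMaster master;
  master.frameworks["F1"] = Framework{"F1"};
  master.agents["S49"] = Agent{"S49", {"F1"}};

  master.markUnreachable("S49");  // continuation is queued
  master.removeFramework("F1");   // framework disappears in between
  master.drain();                 // continuation runs against changed state

  assert(master.agents.empty());
  return master.skipped;
}
```

In this model, replacing the {{frameworks.count()}} guard with an unconditional non-NULL check would abort on exactly the ordering that {{replayRace()}} replays.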



> Master check failure when marking agent unreachable
> ---------------------------------------------------
>
>                 Key: MESOS-9609
>                 URL: https://issues.apache.org/jira/browse/MESOS-9609
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.5.0
>            Reporter: Greg Mann
>            Assignee: Greg Mann
>            Priority: Critical
>              Labels: foundations, mesosphere
>
> {code}
> Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.815433    13 http.cpp:1185] HTTP POST for /master/api/v1/scheduler from 10.142.0.5:55133
> Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.815588    13 master.cpp:5467] Processing DECLINE call for offers: [ 5e57f633-a69c-4009-b773-990b4b8984ad-O58323 ] for framework 5e57f633-a69c-4009-b7
> Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.815693    13 master.cpp:10703] Removing offer 5e57f633-a69c-4009-b773-990b4b8984ad-O58323
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820142    10 master.cpp:8227] Marking agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49 at slave(1)@10.142.0.10:5051 (tf-mesos-agent-t7c8.c.bitcoin-engi
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820367    10 registrar.cpp:495] Applied 1 operations in 86528ns; attempting to update the registry
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820572    10 registrar.cpp:552] Successfully updated the registry in 175872ns
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820642    11 master.cpp:8275] Marked agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49 at slave(1)@10.142.0.10:5051 (tf-mesos-agent-t7c8.c.bitcoin-engin
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820957     9 hierarchical.cpp:609] Removed agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49
> Mar 11 10:04:35 research docker[4503]: F0311 10:04:35.851961    11 master.cpp:10018] Check failed: 'framework' Must be non NULL
> Mar 11 10:04:35 research docker[4503]: *** Check failure stack trace: ***
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c6044a7d  google::LogMessage::Fail()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c6046830  google::LogMessage::SendToLog()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c6044663  google::LogMessage::Flush()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c6047259  google::LogMessageFatal::~LogMessageFatal()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c5258e14  google::CheckNotNull<>()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c521dfc8  mesos::internal::master::Master::__removeSlave()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c521f1a2  mesos::internal::master::Master::_markUnreachable()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c5f98f11  process::ProcessBase::consume()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c5fb2a4a  process::ProcessManager::resume()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c5fb65d6  _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c35d4c80  (unknown)
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c2de76ba  start_thread
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c2b1d41d  (unknown)
> Mar 11 10:04:36 research docker[4503]: *** Aborted at 1520762676 (unix time) try "date -d @1520762676" if you are using GNU date ***
> Mar 11 10:04:36 research docker[4503]: PC: @     0x7f96c2a4d196 (unknown)
> Mar 11 10:04:36 research docker[4503]: *** SIGSEGV (@0x0) received by PID 1 (TID 0x7f96b986d700) from PID 0; stack trace: ***
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c2df1390 (unknown)
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c2a4d196 (unknown)
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c604ce2c google::DumpStackTraceAndExit()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c6044a7d google::LogMessage::Fail()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c6046830 google::LogMessage::SendToLog()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c6044663 google::LogMessage::Flush()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c6047259 google::LogMessageFatal::~LogMessageFatal()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c5258e14 google::CheckNotNull<>()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c521dfc8 mesos::internal::master::Master::__removeSlave()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c521f1a2 mesos::internal::master::Master::_markUnreachable()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c5f98f11 process::ProcessBase::consume()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c5fb2a4a process::ProcessManager::resume()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c5fb65d6 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c35d4c80 (unknown)
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c2de76ba start_thread
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c2b1d41d (unknown)
> Mar 11 10:04:38 research systemd[1]: mesos-master2.service: main process exited, code=exited, status=139/n/a
> Mar 11 10:04:38 research docker[18886]: mesos-master
> Mar 11 10:04:38 research systemd[1]: Unit mesos-master2.service entered failed state.
> {code}


