Posted to issues@mesos.apache.org by "Asha Rostamianfar (Jira)" <ji...@apache.org> on 2019/10/28 20:11:00 UTC

[jira] [Comment Edited] (MESOS-9609) Master check failure when marking agent unreachable

    [ https://issues.apache.org/jira/browse/MESOS-9609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961259#comment-16961259 ] 

Asha Rostamianfar edited comment on MESOS-9609 at 10/28/19 8:10 PM:
--------------------------------------------------------------------

[~greggomann] We have been seeing this issue frequently in the Toil workflow engine after updating Mesos from v1.0.1[*]. I have attached logs and details on [https://github.com/DataBiosphere/toil/issues/2740], which can hopefully help with debugging the issue.  Please let me know if you need any additional info and/or any help with reproducing/debugging the issue! Thanks!

Link to most recent log: [https://github.com/DataBiosphere/toil/files/3779869/mesos-1.9.0-crashlog.txt] from v1.9.0

[*] Updating to v1.8.* or v1.9.0 was the first Mesos update on the Toil pipeline in a long time, so unfortunately we cannot narrow down which intermediate version introduced the problem. However, we are certain that the issue did not occur with the old version. Also note that the new pipeline runs on Ubuntu 18.04, while the old pipeline used Ubuntu 16.04.


was (Author: arostami):
[~greggomann] We have been seeing this issue frequently in the Toil workflow engine after updating Mesos from v1.0.1[*]. I have attached logs and details on https://github.com/DataBiosphere/toil/issues/2740, which can hopefully help with debugging the issue.  Please let me know if you need any additional info and/or any help with reproducing/debugging the issue! Thanks!

[*] Updating to v1.8.* or v1.9.0 was the first Mesos update on the Toil pipeline in a long time, so unfortunately we cannot narrow down which intermediate version introduced the problem. However, we are certain that the issue did not occur with the old version. Also note that the new pipeline runs on Ubuntu 18.04, while the old pipeline used Ubuntu 16.04.

> Master check failure when marking agent unreachable
> ---------------------------------------------------
>
>                 Key: MESOS-9609
>                 URL: https://issues.apache.org/jira/browse/MESOS-9609
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 1.5.0
>            Reporter: Greg Mann
>            Assignee: Greg Mann
>            Priority: Critical
>              Labels: foundations, mesosphere
>             Fix For: 1.9.0
>
>
> {code}
> Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.815433    13 http.cpp:1185] HTTP POST for /master/api/v1/scheduler from 10.142.0.5:55133
> Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.815588    13 master.cpp:5467] Processing DECLINE call for offers: [ 5e57f633-a69c-4009-b773-990b4b8984ad-O58323 ] for framework 5e57f633-a69c-4009-b7
> Mar 11 10:04:33 research docker[4503]: I0311 10:04:33.815693    13 master.cpp:10703] Removing offer 5e57f633-a69c-4009-b773-990b4b8984ad-O58323
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820142    10 master.cpp:8227] Marking agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49 at slave(1)@10.142.0.10:5051 (tf-mesos-agent-t7c8.c.bitcoin-engi
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820367    10 registrar.cpp:495] Applied 1 operations in 86528ns; attempting to update the registry
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820572    10 registrar.cpp:552] Successfully updated the registry in 175872ns
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820642    11 master.cpp:8275] Marked agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49 at slave(1)@10.142.0.10:5051 (tf-mesos-agent-t7c8.c.bitcoin-engin
> Mar 11 10:04:35 research docker[4503]: I0311 10:04:35.820957     9 hierarchical.cpp:609] Removed agent 5e57f633-a69c-4009-b773-990b4b8984ad-S49
> Mar 11 10:04:35 research docker[4503]: F0311 10:04:35.851961    11 master.cpp:10018] Check failed: 'framework' Must be non NULL
> Mar 11 10:04:35 research docker[4503]: *** Check failure stack trace: ***
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c6044a7d  google::LogMessage::Fail()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c6046830  google::LogMessage::SendToLog()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c6044663  google::LogMessage::Flush()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c6047259  google::LogMessageFatal::~LogMessageFatal()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c5258e14  google::CheckNotNull<>()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c521dfc8  mesos::internal::master::Master::__removeSlave()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c521f1a2  mesos::internal::master::Master::_markUnreachable()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c5f98f11  process::ProcessBase::consume()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c5fb2a4a  process::ProcessManager::resume()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c5fb65d6  _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c35d4c80  (unknown)
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c2de76ba  start_thread
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c2b1d41d  (unknown)
> Mar 11 10:04:36 research docker[4503]: *** Aborted at 1520762676 (unix time) try "date -d @1520762676" if you are using GNU date ***
> Mar 11 10:04:36 research docker[4503]: PC: @     0x7f96c2a4d196 (unknown)
> Mar 11 10:04:36 research docker[4503]: *** SIGSEGV (@0x0) received by PID 1 (TID 0x7f96b986d700) from PID 0; stack trace: ***
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c2df1390 (unknown)
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c2a4d196 (unknown)
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c604ce2c google::DumpStackTraceAndExit()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c6044a7d google::LogMessage::Fail()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c6046830 google::LogMessage::SendToLog()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c6044663 google::LogMessage::Flush()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c6047259 google::LogMessageFatal::~LogMessageFatal()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c5258e14 google::CheckNotNull<>()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c521dfc8 mesos::internal::master::Master::__removeSlave()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c521f1a2 mesos::internal::master::Master::_markUnreachable()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c5f98f11 process::ProcessBase::consume()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c5fb2a4a process::ProcessManager::resume()
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c5fb65d6 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c35d4c80 (unknown)
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c2de76ba start_thread
> Mar 11 10:04:36 research docker[4503]: @     0x7f96c2b1d41d (unknown)
> Mar 11 10:04:38 research systemd[1]: mesos-master2.service: main process exited, code=exited, status=139/n/a
> Mar 11 10:04:38 research docker[18886]: mesos-master
> Mar 11 10:04:38 research systemd[1]: Unit mesos-master2.service entered failed state.
> {code}
> Additional case:
> {noformat}
>  I0715 02:56:40.071446    13 master.cpp:1295] Agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051 (10.0.138.150) disconnected
>  I0715 02:56:40.071503    13 master.cpp:3333] Disconnecting agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051 (10.0.138.150)
>  I0715 02:56:40.071527    13 master.cpp:3352] Deactivating agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051 (10.0.138.150)
>  I0715 02:56:40.071563    13 master.cpp:1319] Removing framework 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-0000 (toil) from disconnected agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051 (10.0.138.150) because the framework is not checkpointing
>  I0715 02:56:40.071579    13 master.cpp:11006] Removing framework 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-0000 (toil) from agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051 (10.0.138.150)
>  I0715 02:56:40.071583    12 hierarchical.cpp:829] Agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 deactivated
>  I0715 02:56:40.071619    13 master.cpp:11766] Removing executor 'toil-41' with resources {} of framework 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-0000 on agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 at slave(1)@10.0.138.150:5051 (10.0.138.150)
>  I0715 02:58:08.642220    12 master.cpp:9130] Marking agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 (10.0.138.150) unreachable: health check timed out
>  I0715 02:58:08.642675    11 registrar.cpp:487] Applied 1 operations in 305592ns; attempting to update the registry
>  I0715 02:58:08.642922    13 registrar.cpp:544] Successfully updated the registry in 187904ns
>  I0715 02:58:08.643081    17 master.cpp:9173] Marked agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0 (10.0.138.150) unreachable: health check timed out
>  F0715 02:58:08.643210    17 master.cpp:11402] Check failed: 'framework' Must be non NULL
>  *** Check failure stack trace: ***
>  I0715 02:58:08.643254    12 hierarchical.cpp:680] Removed agent 9d8dd16c-13f4-4f15-bac8-e5138b2862ee-S0
>      @     0x7ffbcffd090d  google::LogMessage::Fail()
>      @     0x7ffbcffd2748  google::LogMessage::SendToLog()
>      @     0x7ffbcffd04f3  google::LogMessage::Flush()
>      @     0x7ffbcffd31d9  google::LogMessageFatal::~LogMessageFatal()
>      @     0x7ffbcec65024  google::CheckNotNull<>()
>      @     0x7ffbcec32658  mesos::internal::master::Master::__removeSlave()
>      @     0x7ffbcec33b13  mesos::internal::master::Master::_markUnreachable()
>      @     0x7ffbcec33e55  _ZNO6lambda12CallableOnceIFN7process6FutureIbEEvEE10CallableFnINS_8internal7PartialIZN5mesos8internal6master6Master15markUnreachableERKNS9_9SlaveInfoEbRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEUlbE_JbEEEEclEv
>      @     0x7ffbce93d5d8  _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureIbEEEclINS0_IFSC_vEEEEESC_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseIbEESt14default_deleteISO_EEOSG_S3_E_ISR_SG_St12_PlaceholderILi1EEEEEEclEOS3_
>      @     0x7ffbcff18371  process::ProcessBase::consume()
>      @     0x7ffbcff3a97a  process::ProcessManager::resume()
>      @     0x7ffbcff3e6a6  _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
>      @     0x7ffbcc2cd9e0  (unknown)
>      @     0x7ffbcbde06db  start_thread
>      @     0x7ffbcbb0988f  (unknown)
> {noformat}
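
Both traces above show the same signature: a fatal glog line (severity `F`, "Check failed: 'framework' Must be non NULL") followed by a stack in which `Master::_markUnreachable()` calls `Master::__removeSlave()`. When comparing further crash logs against this ticket, a small script can help pull out that signature. The following is only a minimal sketch, assuming the usual glog line prefix (severity letter, MMDD, time, thread id, `file:line]`) and the `@ 0x... symbol` frame layout seen above; `fatal_checks` and `stack_symbols` are hypothetical helper names, not part of Mesos or glog.

```python
import re

# Assumed glog prefix: <severity><MMDD> <HH:MM:SS.micros> <thread> <file>:<line>] <message>
# We search instead of matching from the start, because the lines above may
# carry a syslog/docker prefix or a "> " quote marker before the glog prefix.
GLOG_RE = re.compile(
    r"\b(?P<sev>[IWEF])(?P<mmdd>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<src>[\w.]+):(?P<line>\d+)\] (?P<msg>.*)"
)

# Stack frame layout as printed in the traces above: "@   0x<address>  <symbol>".
FRAME_RE = re.compile(r"@\s+0x[0-9a-f]+\s+(?P<sym>\S.*)$")


def fatal_checks(lines):
    """Return (source file, line number, message) for every fatal (F) log line."""
    found = []
    for text in lines:
        m = GLOG_RE.search(text)
        if m and m.group("sev") == "F":
            found.append((m.group("src"), int(m.group("line")), m.group("msg")))
    return found


def stack_symbols(lines):
    """Return the symbol of every stack frame, skipping unresolved '(unknown)' frames."""
    symbols = []
    for text in lines:
        m = FRAME_RE.search(text)
        if m:
            sym = m.group("sym").strip()
            if sym != "(unknown)":
                symbols.append(sym)
    return symbols


if __name__ == "__main__":
    import sys

    log = sys.stdin.read().splitlines()
    for src, line, msg in fatal_checks(log):
        print(f"{src}:{line}: {msg}")
    for sym in stack_symbols(log):
        print("  at", sym)
```

Saved under any file name, the script reads a master log on standard input and prints each fatal check together with the resolved frames, which makes it easier to tell whether a new crash log shows the same `__removeSlave()` check failure as this ticket.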


