Posted to issues@mesos.apache.org by "Naveen (Jira)" <ji...@apache.org> on 2020/07/03 23:40:00 UTC

[jira] [Created] (MESOS-10146) Removing task from slave when framework is disconnected causes master to crash

Naveen created MESOS-10146:
------------------------------

             Summary: Removing task from slave when framework is disconnected causes master to crash
                 Key: MESOS-10146
                 URL: https://issues.apache.org/jira/browse/MESOS-10146
             Project: Mesos
          Issue Type: Bug
          Components: c++ api, framework
    Affects Versions: 1.9.0
         Environment: Mesos master with three master nodes
            Reporter: Naveen


Hello, 

    We want to report an issue we observed when tasks are removed from a slave. There is a check for a valid framework before the tasks can be removed. A framework can be disconnected for several reasons. When that happens, this check fails and crashes the Mesos master node.

[https://github.com/apache/mesos/blob/1.9.0/src/master/master.cpp#L11842]

There is also unguarded access to the internal framework state on line 11853.
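
To illustrate the kind of guard we have in mind, here is a minimal, self-contained sketch. It is not the Mesos code: `Framework`, `Registry`, `getFramework` and `removeAgentTasks` are hypothetical stand-ins, and skipping a missing framework is only one possible recovery. Whether that is the right behavior for the master is for the maintainers to decide.

```cpp
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the master's per-framework state.
struct Framework {
  std::string id;
  int activeTasks;
};

// Hypothetical stand-in for the master's framework bookkeeping.
struct Registry {
  std::unordered_map<std::string, Framework> frameworks;

  // Returns nullptr when the framework is unknown.
  Framework* getFramework(const std::string& id) {
    auto it = frameworks.find(id);
    return it == frameworks.end() ? nullptr : &it->second;
  }
};

// Walks the agent's tasks grouped by framework ID. A framework that cannot
// be found is logged and skipped instead of aborting the whole process.
// Returns the number of frameworks that were skipped.
int removeAgentTasks(
    Registry& registry,
    const std::unordered_map<std::string, std::vector<std::string>>& agentTasks) {
  int skipped = 0;
  for (const auto& entry : agentTasks) {
    Framework* framework = registry.getFramework(entry.first);
    if (framework == nullptr) {
      std::cerr << "Framework " << entry.first
                << " not found; skipping its tasks" << std::endl;
      ++skipped;
      continue;
    }
    // Only touch framework state after the null check has passed.
    framework->activeTasks -= static_cast<int>(entry.second.size());
  }
  return skipped;
}
```

The point of the sketch is that every access to the framework's state happens after the null check, so a missing framework becomes a logged warning rather than a fatal check failure.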

Error logs - 
{noformat}
mesos-master[5483]: I0618 14:05:20.859189 5491 master.cpp:9512] Marked agent 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303 (10.160.73.79) unreachable: health check timed out
mesos-master[5483]: F0618 14:05:20.859347 5491 master.cpp:11842] Check failed: framework != nullptr Framework 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-0067 not found while removing agent 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303 at slave(1)@10.160.73.79:5051 (10.160.73.79); agent tasks: { 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-0067: { } }
mesos-master[5483]: *** Check failure stack trace: ***
mesos-master[5483]: I0618 14:05:20.859781 5490 hierarchical.cpp:1013] Removed all filters for agent 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303
mesos-master[5483]: I0618 14:05:20.872217 5490 hierarchical.cpp:890] Removed agent 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303
mesos-master[5483]: I0618 14:05:20.859922 5487 replica.cpp:695] Replica received learned notice for position 42070 from log-network(1)@10.160.73.212:5050
mesos-master[5483]: @ 0x7f2fdf6a5b1d google::LogMessage::Fail()
mesos-master[5483]: @ 0x7f2fdf6a7dfd google::LogMessage::SendToLog()
mesos-master[5483]: @ 0x7f2fdf6a56ab google::LogMessage::Flush()
mesos-master[5483]: @ 0x7f2fdf6a8859 google::LogMessageFatal::~LogMessageFatal()
mesos-master[5483]: @ 0x7f2fde2677f2 mesos::internal::master::Master::__removeSlave()
mesos-master[5483]: @ 0x7f2fde267ebe mesos::internal::master::Master::_markUnreachable()
mesos-master[5483]: @ 0x7f2fde268215 _ZNO6lambda12CallableOnceIFN7process6FutureIbEEvEE10CallableFnINS_8internal7PartialIZN5mesos8internal6master6Master15markUnreachableERKNS9_9SlaveInfoEbRKSsEUlbE_JbEEEEclEv
mesos-master[5483]: @ 0x7f2fddf30688 _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureIbEEEclINS0_IFSC_vEEEEESC_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseIbEESt14default_deleteISO_EEOSG_S3_E_ISR_SG_St12_PlaceholderILi1EEEEEEclEOS3_
mesos-master[5483]: @ 0x7f2fdf5e3b91 process::ProcessBase::consume()
mesos-master[5483]: @ 0x7f2fdf608f77 process::ProcessManager::resume()
mesos-master[5483]: @ 0x7f2fdf60cb36 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
mesos-master[5483]: @ 0x7f2fdf8c34d0 execute_native_thread_routine
mesos-master[5483]: @ 0x7f2fdba02ea5 start_thread
mesos-master[5483]: @ 0x7f2fdb20e8dd __clone
systemd[1]: mesos-master.service: main process exited, code=killed, status=6/ABRT
systemd[1]: Unit mesos-master.service entered failed state.
systemd[1]: mesos-master.service failed.
systemd[1]: mesos-master.service holdoff time over, scheduling restart.
systemd[1]: Stopped Mesos Master.
systemd[1]: Started Mesos Master.
mesos-master[28757]: I0618 14:05:41.461403 28748 logging.cpp:201] INFO level logging started!
mesos-master[28757]: I0618 14:05:41.461712 28748 main.cpp:243] Build: 2020-05-09 10:42:00 by centos
mesos-master[28757]: I0618 14:05:41.461721 28748 main.cpp:244] Version: 1.9.0
mesos-master[28757]: I0618 14:05:41.461726 28748 main.cpp:247] Git tag: 1.9.0
mesos-master[28757]: I0618 14:05:41.461730 28748 main.cpp:251] Git SHA: 5e79a584e6ec3e9e2f96e8bf418411df9dafac2e
{noformat}
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)