Posted to issues@mesos.apache.org by "Jie Yu (JIRA)" <ji...@apache.org> on 2016/09/22 19:15:20 UTC

[jira] [Updated] (MESOS-6226) Master crashes while transitioning tasks to 'TASK_UNREACHABLE'

     [ https://issues.apache.org/jira/browse/MESOS-6226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jie Yu updated MESOS-6226:
--------------------------
    Fix Version/s: 1.1.0

> Master crashes while transitioning tasks to 'TASK_UNREACHABLE'
> --------------------------------------------------------------
>
>                 Key: MESOS-6226
>                 URL: https://issues.apache.org/jira/browse/MESOS-6226
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.1.0
>            Reporter: Jan Schlicht
>            Assignee: Neil Conway
>            Priority: Critical
>             Fix For: 1.1.0
>
>
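> While marking an agent as unreachable, the Mesos master crashes when it transitions that agent's tasks to unreachable.
> The fatal check at master.cpp:5918 ('framework' must be non-NULL) fails inside Master::_markUnreachable(). A possible cause is that the task's framework has not re-registered with the master yet.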
> {noformat}
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.207105  1043 master.cpp:236] Scheduling transition of agent 5f0106ed-1033-4624-ba1b-99f81c7e348a-S1 to UNREACHABLE because of health check timeout
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.217079  1043 master.cpp:5849] Marking agent 5f0106ed-1033-4624-ba1b-99f81c7e348a-S1 at slave(1)@10.10.0.205:5051 (10.10.0.205) unreachable: health check timed out
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.217375  1043 registrar.cpp:461] Applied 1 operations in 97324ns; attempting to update the registry
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.217798  1041 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 2681
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.218132  1041 replica.cpp:537] Replica received write request for position 2681 from __req_res__(35)@10.10.0.212:5050
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: E0922 06:16:16.218492  1049 process.cpp:2113] Failed to shutdown socket with fd 74: Transport endpoint is not connected
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.220237  1041 replica.cpp:691] Replica received learned notice for position 2681 from @0.0.0.0:0
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222190  1041 registrar.cpp:506] Successfully updated the registry in 4.75008ms
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222304  1043 coordinator.cpp:348] Coordinator attempting to write TRUNCATE action at position 2682
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222390  1043 master.cpp:5896] Marked agent 5f0106ed-1033-4624-ba1b-99f81c7e348a-S1 (10.10.0.205) unreachable: health check timed out
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222456  1043 master.cpp:7537] Updating the state of task 8 of framework 2505868e-c750-4aed-8413-99810a59ffd8-0001 (latest state: TASK_LOST, status update state: TASK_LOST)
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222558  1043 master.cpp:7633] Removing task 8 with resources cpus(*):0.2; mem(*):32 of framework 2505868e-c750-4aed-8413-99810a59ffd8-0001 on agent 5f0106ed-1033-4624-ba1b-99f81c7e348a-S1 at slave(1)@10.10.0.205:5051 (10.10.0.205)
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222631  1043 master.cpp:5695] Sending status update TASK_LOST for task 8 of framework 2505868e-c750-4aed-8413-99810a59ffd8-0001 'Slave 10.10.0.205 is unreachable'
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222651  1041 hierarchical.cpp:514] Removed agent 5f0106ed-1033-4624-ba1b-99f81c7e348a-S1
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: F0922 06:16:16.222733  1043 master.cpp:5918] Check failed: 'framework' Must be non NULL
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: *** Check failure stack trace: ***
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.222829  1041 replica.cpp:537] Replica received write request for position 2682 from __req_res__(38)@10.10.0.212:5050
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc2ba979d  google::LogMessage::Fail()
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc2bab5cd  google::LogMessage::SendToLog()
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: I0922 06:16:16.248742  1041 replica.cpp:691] Replica received learned notice for position 2682 from @0.0.0.0:0
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc2ba938c  google::LogMessage::Flush()
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc2babec9  google::LogMessageFatal::~LogMessageFatal()
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc21dd52d  google::CheckNotNull<>()
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc21c632c  mesos::internal::master::Master::_markUnreachable()
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc220c88b  _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master6MasterEPNS7_5SlaveENS5_8TimeInfoERKNS0_6FutureIbEESA_SB_SD_EEvRKNS0_3PIDIT_EEMSH_FvT0_T1_T2_ET3_T4_T5_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc2b305d1  process::ProcessManager::resume()
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc2b308d7  _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc11e1220  (unknown)
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc09ffdc5  start_thread
> Sep 22 06:16:16 ip-10-10-0-212 mesos-master[1036]: @     0x7f5fc072cced  __clone
> {noformat}
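
For illustration only, the suspected failure mode can be sketched as below. This is a minimal sketch, not the Mesos source and not the actual fix: the Framework, Task and Master types, getFramework() and markTasksUnreachable() are hypothetical stand-ins. It assumes the master looks up each task's framework by ID and that the lookup can come back empty while a framework has not re-registered yet. The sketch skips such tasks where a fatal non-null check would abort the process.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins; none of these are the real Mesos classes.
struct Framework {
  std::vector<std::string> updates;  // status updates forwarded to the framework
};

struct Task {
  std::string id;
  std::string frameworkId;
};

struct Master {
  // Frameworks currently registered with this master. A framework may be
  // missing from this map until it re-registers.
  std::map<std::string, Framework> frameworks;

  // Returns nullptr when the framework is not (yet) registered.
  Framework* getFramework(const std::string& id) {
    auto it = frameworks.find(id);
    return it == frameworks.end() ? nullptr : &it->second;
  }

  // Sends a TASK_LOST update for each task of an unreachable agent.
  // Returns the number of tasks whose framework was unknown, so that no
  // update could be forwarded for them.
  int markTasksUnreachable(const std::vector<Task>& tasks) {
    int dropped = 0;
    for (const Task& task : tasks) {
      Framework* framework = getFramework(task.frameworkId);
      if (framework == nullptr) {
        // A fatal non-null check at this point would abort the process,
        // which matches the "Check failed" line in the log above.
        ++dropped;
        continue;
      }
      framework->updates.push_back("TASK_LOST for task " + task.id);
    }
    return dropped;
  }
};
```

With one registered framework and one task belonging to an unregistered framework, markTasksUnreachable() forwards a single update and reports one dropped task instead of terminating. Whether dropping the update is the right behavior for the master is a separate design question that this sketch does not settle.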



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)