Posted to issues@mesos.apache.org by "Alexander Rukletsov (JIRA)" <ji...@apache.org> on 2017/11/07 18:07:00 UTC

[jira] [Updated] (MESOS-8179) Scheduler library has incorrect assumptions about connections.

     [ https://issues.apache.org/jira/browse/MESOS-8179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Rukletsov updated MESOS-8179:
---------------------------------------
    Priority: Critical  (was: Major)

> Scheduler library has incorrect assumptions about connections.
> --------------------------------------------------------------
>
>                 Key: MESOS-8179
>                 URL: https://issues.apache.org/jira/browse/MESOS-8179
>             Project: Mesos
>          Issue Type: Bug
>          Components: scheduler driver
>            Reporter: Alexander Rukletsov
>            Priority: Critical
>
> The scheduler library assumes that a connection cannot be interrupted between continuations, for example between {{send()}} and {{_send()}}: [https://github.com/apache/mesos/blob/509a1ab3226bbec7c369f431656f4ec692da00ba/src/scheduler/scheduler.cpp#L553]. This assumption does not hold: {{detected()}} can fire in between and disconnect the scheduler before {{_send()}} runs, so the {{CHECK_SOME(connections)}} in {{_send()}} fails:
> {noformat}
> I1107 18:50:57.154796 2138112 scheduler.cpp:496] New master detected at master@192.168.9.40:59063
> ...
> I1107 18:50:57.160935 2138112 scheduler.cpp:505] Waiting for 0ns before initiating a re-(connection) attempt with the master
> I1107 18:50:57.161245 1064960 clock.cpp:435] Clock of __collect__(7)@192.168.9.40:59063 updated to 2017-11-07 17:50:57.159954176+00:00
> I1107 18:50:57.161285 1898086400 clock.cpp:361] Clock resumed at 2017-11-07 17:50:57.159954176+00:00
> I1107 18:50:57.161602 1064960 scheduler.cpp:387] Connected with the master at http://192.168.9.40:59063/master/api/v1/scheduler
> I1107 18:50:57.161779 2138112 scheduler.cpp:249] Sending SUBSCRIBE call to http://192.168.9.40:59063/master/api/v1/scheduler
> I1107 18:50:57.162037 2138112 scheduler.cpp:496] New master detected at master@192.168.9.40:59063
> I1107 18:50:57.162055 2138112 scheduler.cpp:505] Waiting for 0ns before initiating a re-(connection) attempt with the master
> I1107 18:50:57.162164 4820992 process.cpp:3167] Dropping event for process __http_connection__(14)@192.168.9.40:59063
> F1107 18:50:57.162214 2138112 scheduler.cpp:553] CHECK_SOME(connections): is NONE 
> *** Check failure stack trace: ***
> E1107 18:50:57.162240 4820992 process.cpp:2576] Failed to shutdown socket with fd 9, address 192.168.9.40:59063: Socket is not connected
>     @        0x10ed262b4  google::LogMessage::Flush()
>     @        0x10ed2a21f  google::LogMessageFatal::~LogMessageFatal()
>     @        0x10ed26ef9  google::LogMessageFatal::~LogMessageFatal()
> E1107 18:50:57.162304 4820992 process.cpp:2576] Failed to shutdown socket with fd 10, address 192.168.9.40:59063: Socket is not connected
>     @        0x1078efaea  _CheckFatal::~_CheckFatal()
>     @        0x1078ea675  _CheckFatal::~_CheckFatal()
>     @        0x109dfcabf  mesos::v1::scheduler::MesosProcess::_send()
>     @        0x109e07438  _ZZN7process8dispatchIN5mesos2v19scheduler12MesosProcessERKNS3_4CallERKNS_6FutureINS_4http7RequestEEES7_SD_EEvRKNS_3PIDIT_EEMSF_FvT0_T1_EOT2_OT3_ENKUlRS5_RSB_PNS_11ProcessBaseEE_clESR_SS_SU_
>     @        0x109e072b7  _ZNSt3__128__invoke_void_return_wrapperIvE6__callIJRNS_6__bindIZN7process8dispatchIN5mesos2v19scheduler12MesosProcessERKNS8_4CallERKNS4_6FutureINS4_4http7RequestEEESC_SI_EEvRKNS4_3PIDIT_EEMSK_FvT0_T1_EOT2_OT3_EUlRSA_RSG_PNS4_11ProcessBaseEE_JSC_SI_RNS_12placeholders4__phILi1EEEEEESZ_EEEvDpOT_
>     @        0x109e06ba9  _ZNSt3__110__function6__funcINS_6__bindIZN7process8dispatchIN5mesos2v19scheduler12MesosProcessERKNS7_4CallERKNS3_6FutureINS3_4http7RequestEEESB_SH_EEvRKNS3_3PIDIT_EEMSJ_FvT0_T1_EOT2_OT3_EUlRS9_RSF_PNS3_11ProcessBaseEE_JSB_SH_RNS_12placeholders4__phILi1EEEEEENS_9allocatorIS14_EEFvSY_EEclEOSY_
>     @        0x10de77d3a  std::__1::function<>::operator()()
>     @        0x10e307abc  process::ProcessBase::visit()
>     @        0x10e3b804e  process::DispatchEvent::visit()
>     @        0x107a4b991  process::ProcessBase::serve()
>     @        0x10e300191  process::ProcessManager::resume()
>     @        0x10e42d27d  process::ProcessManager::init_threads()::$_2::operator()()
>     @        0x10e42ce12  _ZNSt3__114__thread_proxyINS_5tupleIJZN7process14ProcessManager12init_threadsEvE3$_2EEEEEPvS6_
>     @     0x7fff8591499d  _pthread_body
>     @     0x7fff8591491a  _pthread_start
>     @     0x7fff85912351  thread_start
> zsh: abort      GLOG_v=2 GTEST_FILTER="*SchedulerTest.MasterFailover*" ./bin/mesos-tests.sh  
> {noformat}
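The interleaving in the log above can be reproduced in miniature. The following is only a sketch under stated assumptions: it does not use libprocess or any Mesos code, and every name in it ({{ToyScheduler}}, {{Connections}}, {{Outcome}}, {{interleave()}}) is hypothetical. It models {{send()}} as queuing its continuation {{_send()}}, lets {{detected()}} run before the queue is drained, and has {{_send()}} drop the call when the connection is gone instead of asserting on it. Guarding like this is one possible mitigation, not necessarily the fix chosen for this ticket.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for the library's connection state.
struct Connections { int id; };

// What happened to each call, recorded for inspection.
struct Outcome {
  std::vector<std::string> sent;
  std::vector<std::string> dropped;
};

// Toy single-threaded actor: continuations are queued and executed later,
// so other events can run between the two halves of a send.
class ToyScheduler {
public:
  void connected(int id) { connections.reset(new Connections{id}); }

  // A newly detected master invalidates the current connections.
  void detected() { connections.reset(); }

  // First half: queue the continuation rather than calling it directly.
  void send(const std::string& call) {
    queue.push_back([this, call]() { _send(call); });
  }

  // Drain the queue, running each pending continuation in order.
  void run() {
    while (!queue.empty()) {
      std::function<void()> f = std::move(queue.front());
      queue.pop_front();
      f();
    }
  }

  Outcome outcome;

private:
  // Second half: connections may have been reset since send() ran,
  // so test for them instead of assuming they are still present.
  void _send(const std::string& call) {
    if (!connections) {
      outcome.dropped.push_back(call);
      return;
    }
    outcome.sent.push_back(call);
  }

  std::unique_ptr<Connections> connections;
  std::deque<std::function<void()>> queue;
};

// Replays the order seen in the log: connect, send, detect, continuation.
Outcome interleave() {
  ToyScheduler s;
  s.connected(1);
  s.send("SUBSCRIBE");  // queues _send()
  s.detected();         // fires before the continuation runs
  s.run();              // _send() now finds no connections
  return s.outcome;
}
```

Calling {{interleave()}} yields an {{Outcome}} in which the SUBSCRIBE call is recorded as dropped rather than sent. If {{_send()}} asserted on {{connections}} instead, the same sequence would abort the process, which is the failure shown in the trace.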


