You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@mesos.apache.org by "Jeremy Lingmann (JIRA)" <ji...@apache.org> on 2014/11/20 23:21:34 UTC

[jira] [Created] (MESOS-2148) Cluster formation often fails when starting mesos-masters with --registry=replicated_log

Jeremy Lingmann created MESOS-2148:
--------------------------------------

             Summary: Cluster formation often fails when starting mesos-masters with --registry=replicated_log
                 Key: MESOS-2148
                 URL: https://issues.apache.org/jira/browse/MESOS-2148
             Project: Mesos
          Issue Type: Bug
    Affects Versions: 0.20.1
         Environment: Mesos 0.20.1
Zookeeper 3.3.5
Debian Wheezy
            Reporter: Jeremy Lingmann


When starting mesos-masters in close proximity (within 1min of each other) to form a new cluster, we often see the cluster fail to form with 1 master committing suicide and the other 2 exiting with a stack trace. Here are the logs for a cluster with 3 mesos masters showing this behavior.

https://gist.github.com/lingmann/78d1d65f02b28a419b12#file-gistfile1-txt-L447
{code}
Nov 18 22:30:01 highly-available-2162-c9 mesos-master[2154]: I1118 22:30:01.858300  2321 recover.cpp:188] Received a recover response from a replica in EMPTY status
Nov 18 22:30:01 highly-available-2162-c9 mesos-master[2154]: I1118 22:30:01.859736  2321 recover.cpp:188] Received a recover response from a replica in EMPTY status
Nov 18 22:30:01 highly-available-2162-c9 mesos-master[2154]: I1118 22:30:01.927644  2321 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: F1118 22:30:02.621289  2328 master.cpp:1016] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: *** Check failure stack trace: ***
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: I1118 22:30:02.654814  2321 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: I1118 22:30:02.655992  2327 recover.cpp:188] Received a recover response from a replica in EMPTY status
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: I1118 22:30:02.658712  2327 recover.cpp:188] Received a recover response from a replica in EMPTY status
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: I1118 22:30:02.708247  2327 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383caf3acd  google::LogMessage::Fail()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383caf5883  google::LogMessage::SendToLog()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383caf36e8  google::LogMessage::Flush()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383caf60de  google::LogMessageFatal::~LogMessageFatal()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c5c7946  mesos::internal::master::fail()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c622c31  _ZNSt3tr15_BindIFPFvRKSsS2_EPKcNS_12_PlaceholderILi1EEEEE6__callIIS2_EILi0ELi1EEEENS_9result_ofIFS4_NSC_IFNS_3_MuIS6_Lb0ELb0EEES6_NS_5tupleIIDpT_EEEEE4typeENSC_IFNSD_IS8_Lb0ELb1EEES8_SI_EE4typeEEE4typeERKSI_NS_12_Index_tupleIIXspT0_EEEE
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c622caa  std::tr1::_Function_handler<>::_M_invoke()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c521aa9  process::Future<>::fail()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c661b7a  process::internal::thenf<>()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c67b1e8  process::Future<>::fail()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c67b179  process::Future<>::fail()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c66a629  mesos::internal::master::RegistrarProcess::_recover()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c67f1cb  _ZNSt3tr15_BindIFPFvPN7process11ProcessBaseENS_10shared_ptrINS_8functionIFvPN5mesos8internal6master16RegistrarProcessEEEEEEENS_12_PlaceholderILi1EEESD_EE6__callIJRS3_EJLi0ELi1EEEENS_9result_ofIFSF_NSM_IFNS_3_MuISH_Lb0ELb1EEESH_NS_5tupleIJDpT_EEEEE4typeENSM_IFNSN_ISD_Lb0ELb0EEESD_SS_EE4typeEEE4typeERKSS_NS_12_Index_tupleIJXspT0_EEEE
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c67f224  std::tr1::_Function_handler<>::_M_invoke()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383ca9e57c  process::ProcessManager::resume()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383ca9ee1c  process::schedule()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383a7bdb50  start_thread
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383a5077bd  (unknown)
{code}

https://gist.github.com/lingmann/fcf36fcce8ecc7eb7838#file-gistfile1-txt-L468
{code}
Nov 18 22:30:14 highly-available-2162-ac2 mesos-master[2157]: I1118 22:30:14.013592  2329 master.cpp:1196] The newly elected leader is master@10.6.163.44:5050 with id 20141118-222859-748881418-5050-2157
Nov 18 22:30:14 highly-available-2162-ac2 mesos-master[2157]: I1118 22:30:14.013720  2329 master.cpp:1209] Elected as the leading master!
Nov 18 22:30:14 highly-available-2162-ac2 mesos-master[2157]: I1118 22:30:14.013757  2329 master.cpp:1027] Recovering from registrar
Nov 18 22:30:14 highly-available-2162-ac2 mesos-master[2157]: I1118 22:30:14.013967  2327 registrar.cpp:313] Recovering registrar
Nov 18 22:30:23 highly-available-2162-ac2 mesos-master[2157]: I1118 22:30:23.576931  2330 recover.cpp:111] Unable to finish the recover protocol in 10secs, retrying
Nov 18 22:30:33 highly-available-2162-ac2 mesos-master[2157]: I1118 22:30:33.578037  2330 recover.cpp:111] Unable to finish the recover protocol in 10secs, retrying
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: F1118 22:31:14.015543  2329 master.cpp:1016] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: *** Check failure stack trace: ***
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491c36bacd  google::LogMessage::Fail()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491c36d883  google::LogMessage::SendToLog()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491c36b6e8  google::LogMessage::Flush()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491c36e0de  google::LogMessageFatal::~LogMessageFatal()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491be3f946  mesos::internal::master::fail()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491be9ac31  _ZNSt3tr15_BindIFPFvRKSsS2_EPKcNS_12_PlaceholderILi1EEEEE6__callIIS2_EILi0ELi1EEEENS_9result_ofIFS4_NSC_IFNS_3_MuIS6_Lb0ELb0EEES6_NS_5tupleIIDpT_EEEEE4typeENSC_IFNSD_IS8_Lb0ELb1EEES8_SI_EE4typeEEE4typeERKSI_NS_12_Index_tupleIIXspT0_EEEE
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491be9acaa  std::tr1::_Function_handler<>::_M_invoke()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491bd99aa9  process::Future<>::fail()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491bed9b7a  process::internal::thenf<>()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491bef31e8  process::Future<>::fail()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491bef3179  process::Future<>::fail()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491bee2629  mesos::internal::master::RegistrarProcess::_recover()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491bef71cb  _ZNSt3tr15_BindIFPFvPN7process11ProcessBaseENS_10shared_ptrINS_8functionIFvPN5mesos8internal6master16RegistrarProcessEEEEEEENS_12_PlaceholderILi1EEESD_EE6__callIJRS3_EJLi0ELi1EEEENS_9result_ofIFSF_NSM_IFNS_3_MuISH_Lb0ELb1EEESH_NS_5tupleIJDpT_EEEEE4typeENSM_IFNSN_ISD_Lb0ELb0EEESD_SS_EE4typeEEE4typeERKSS_NS_12_Index_tupleIJXspT0_EEEE
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491bef7224  std::tr1::_Function_handler<>::_M_invoke()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491c31657c  process::ProcessManager::resume()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491c316e1c  process::schedule()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491a035b50  start_thread
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f4919d7f7bd  (unknown)
{code}

https://gist.github.com/lingmann/5691795a4f9d54632e5a#file-gistfile1-txt-L168
{code}
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: 2014-11-18 22:28:35,644:2124(0x7f2e51d04700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [10.6.163.44:2181] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: 2014-11-18 22:28:35,644:2124(0x7f2e51d04700):ZOO_INFO@check_events@1703: initiated connection to server [10.67.129.108:2181]
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: 2014-11-18 22:28:35,645:2124(0x7f2e51d04700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [10.67.129.108:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: W1118 22:28:35.643823  2180 group.cpp:456] Timed out waiting to reconnect to ZooKeeper. Forcing ZooKeeper session (sessionId=149c5026bf20001) expiration
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 22:28:35.645727  2182 group.cpp:472] ZooKeeper session expired
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 22:28:35.645884  2179 detector.cpp:138] Detected a new leader: None
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 22:28:35.645786  2181 group.cpp:472] ZooKeeper session expired
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 22:28:35.645982  2179 contender.cpp:196] Membership cancelled: 1
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 22:28:35.646026  2179 master.cpp:1196] The newly elected leader is None
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: Lost leadership... committing suicide!
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 22:28:35.646095  2180 group.cpp:472] ZooKeeper session expired
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: 2014-11-18 22:28:35,646:2124(0x7f2e56543700):ZOO_INFO@zookeeper_close@2522: Freeing zookeeper resources for sessionId=0x149c5026bf20002
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)