You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@mesos.apache.org by "Jeremy Lingmann (JIRA)" <ji...@apache.org> on 2014/11/20 23:28:33 UTC

[jira] [Updated] (MESOS-2148) Cluster formation often fails when starting mesos-masters with --registry=replicated_log

     [ https://issues.apache.org/jira/browse/MESOS-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeremy Lingmann updated MESOS-2148:
-----------------------------------
    Description: 
When starting mesos-masters in close proximity (within 1min of each other) to form a new cluster, we often see the cluster fail to form with 1 master committing suicide and the other 2 exiting with a stack trace. Here are the logs for a cluster with 3 mesos masters showing this behavior.

https://gist.github.com/lingmann/78d1d65f02b28a419b12#file-gistfile1-txt-L447
{code}
Nov 18 22:30:01 highly-available-2162-c9 mesos-master[2154]: I1118 22:30:01.858300  2321 recover.cpp:188] Received a recover response from a replica in EMPTY status
Nov 18 22:30:01 highly-available-2162-c9 mesos-master[2154]: I1118 22:30:01.859736  2321 recover.cpp:188] Received a recover response from a replica in EMPTY status
Nov 18 22:30:01 highly-available-2162-c9 mesos-master[2154]: I1118 22:30:01.927644  2321 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: F1118 22:30:02.621289  2328 master.cpp:1016] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: *** Check failure stack trace: ***
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: I1118 22:30:02.654814  2321 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: I1118 22:30:02.655992  2327 recover.cpp:188] Received a recover response from a replica in EMPTY status
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: I1118 22:30:02.658712  2327 recover.cpp:188] Received a recover response from a replica in EMPTY status
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: I1118 22:30:02.708247  2327 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383caf3acd  google::LogMessage::Fail()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383caf5883  google::LogMessage::SendToLog()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383caf36e8  google::LogMessage::Flush()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383caf60de  google::LogMessageFatal::~LogMessageFatal()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c5c7946  mesos::internal::master::fail()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c622c31  _ZNSt3tr15_BindIFPFvRKSsS2_EPKcNS_12_PlaceholderILi1EEEEE6__callIIS2_EILi0ELi1EEEENS_9result_ofIFS4_NSC_IFNS_3_MuIS6_Lb0ELb0EEES6_NS_5tupleIIDpT_EEEEE4typeENSC_IFNSD_IS8_Lb0ELb1EEES8_SI_EE4typeEEE4typeERKSI_NS_12_Index_tupleIIXspT0_EEEE
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c622caa  std::tr1::_Function_handler<>::_M_invoke()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c521aa9  process::Future<>::fail()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c661b7a  process::internal::thenf<>()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c67b1e8  process::Future<>::fail()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c67b179  process::Future<>::fail()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c66a629  mesos::internal::master::RegistrarProcess::_recover()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c67f1cb  _ZNSt3tr15_BindIFPFvPN7process11ProcessBaseENS_10shared_ptrINS_8functionIFvPN5mesos8internal6master16RegistrarProcessEEEEEEENS_12_PlaceholderILi1EEESD_EE6__callIJRS3_EJLi0ELi1EEEENS_9result_ofIFSF_NSM_IFNS_3_MuISH_Lb0ELb1EEESH_NS_5tupleIJDpT_EEEEE4typeENSM_IFNSN_ISD_Lb0ELb0EEESD_SS_EE4typeEEE4typeERKSS_NS_12_Index_tupleIJXspT0_EEEE
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c67f224  std::tr1::_Function_handler<>::_M_invoke()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383ca9e57c  process::ProcessManager::resume()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383ca9ee1c  process::schedule()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383a7bdb50  start_thread
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383a5077bd  (unknown)
{code}

https://gist.github.com/lingmann/fcf36fcce8ecc7eb7838#file-gistfile1-txt-L468
{code}
Nov 18 22:30:14 highly-available-2162-ac2 mesos-master[2157]: I1118 22:30:14.013592  2329 master.cpp:1196] The newly elected leader is master@10.6.163.44:5050 with id 20141118-222859-748881418-5050-2157
Nov 18 22:30:14 highly-available-2162-ac2 mesos-master[2157]: I1118 22:30:14.013720  2329 master.cpp:1209] Elected as the leading master!
Nov 18 22:30:14 highly-available-2162-ac2 mesos-master[2157]: I1118 22:30:14.013757  2329 master.cpp:1027] Recovering from registrar
Nov 18 22:30:14 highly-available-2162-ac2 mesos-master[2157]: I1118 22:30:14.013967  2327 registrar.cpp:313] Recovering registrar
Nov 18 22:30:23 highly-available-2162-ac2 mesos-master[2157]: I1118 22:30:23.576931  2330 recover.cpp:111] Unable to finish the recover protocol in 10secs, retrying
Nov 18 22:30:33 highly-available-2162-ac2 mesos-master[2157]: I1118 22:30:33.578037  2330 recover.cpp:111] Unable to finish the recover protocol in 10secs, retrying
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: F1118 22:31:14.015543  2329 master.cpp:1016] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: *** Check failure stack trace: ***
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491c36bacd  google::LogMessage::Fail()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491c36d883  google::LogMessage::SendToLog()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491c36b6e8  google::LogMessage::Flush()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491c36e0de  google::LogMessageFatal::~LogMessageFatal()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491be3f946  mesos::internal::master::fail()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491be9ac31  _ZNSt3tr15_BindIFPFvRKSsS2_EPKcNS_12_PlaceholderILi1EEEEE6__callIIS2_EILi0ELi1EEEENS_9result_ofIFS4_NSC_IFNS_3_MuIS6_Lb0ELb0EEES6_NS_5tupleIIDpT_EEEEE4typeENSC_IFNSD_IS8_Lb0ELb1EEES8_SI_EE4typeEEE4typeERKSI_NS_12_Index_tupleIIXspT0_EEEE
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491be9acaa  std::tr1::_Function_handler<>::_M_invoke()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491bd99aa9  process::Future<>::fail()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491bed9b7a  process::internal::thenf<>()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491bef31e8  process::Future<>::fail()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491bef3179  process::Future<>::fail()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491bee2629  mesos::internal::master::RegistrarProcess::_recover()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491bef71cb  _ZNSt3tr15_BindIFPFvPN7process11ProcessBaseENS_10shared_ptrINS_8functionIFvPN5mesos8internal6master16RegistrarProcessEEEEEEENS_12_PlaceholderILi1EEESD_EE6__callIJRS3_EJLi0ELi1EEEENS_9result_ofIFSF_NSM_IFNS_3_MuISH_Lb0ELb1EEESH_NS_5tupleIJDpT_EEEEE4typeENSM_IFNSN_ISD_Lb0ELb0EEESD_SS_EE4typeEEE4typeERKSS_NS_12_Index_tupleIJXspT0_EEEE
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491bef7224  std::tr1::_Function_handler<>::_M_invoke()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491c31657c  process::ProcessManager::resume()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491c316e1c  process::schedule()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491a035b50  start_thread
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f4919d7f7bd  (unknown)
{code}

https://gist.github.com/lingmann/5691795a4f9d54632e5a#file-gistfile1-txt-L168
{code}
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: 2014-11-18 22:28:35,644:2124(0x7f2e51d04700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [10.6.163.44:2181] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: 2014-11-18 22:28:35,644:2124(0x7f2e51d04700):ZOO_INFO@check_events@1703: initiated connection to server [10.67.129.108:2181]
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: 2014-11-18 22:28:35,645:2124(0x7f2e51d04700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [10.67.129.108:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: W1118 22:28:35.643823  2180 group.cpp:456] Timed out waiting to reconnect to ZooKeeper. Forcing ZooKeeper session (sessionId=149c5026bf20001) expiration
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 22:28:35.645727  2182 group.cpp:472] ZooKeeper session expired
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 22:28:35.645884  2179 detector.cpp:138] Detected a new leader: None
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 22:28:35.645786  2181 group.cpp:472] ZooKeeper session expired
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 22:28:35.645982  2179 contender.cpp:196] Membership cancelled: 1
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 22:28:35.646026  2179 master.cpp:1196] The newly elected leader is None
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: Lost leadership... committing suicide!
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 22:28:35.646095  2180 group.cpp:472] ZooKeeper session expired
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: 2014-11-18 22:28:35,646:2124(0x7f2e56543700):ZOO_INFO@zookeeper_close@2522: Freeing zookeeper resources for sessionId=0x149c5026bf20002
{code}

Here are the mesos-master startup options being used:
{code}
/usr/local/sbin/mesos-master \
  --zk=zk://10.132.193.161:2181,10.132.193.153:2181,10.132.193.160:2181/mesos \
  --port=5050 \
  --log_dir=/var/log/mesos \
  --cluster=test \
  --hostname=10.132.193.161 \
  --ip=10.132.193.161 \
  --quorum=2 \
  --work_dir=/var/lib/mesos
{code}

  was:
When starting mesos-masters in close proximity (within 1min of each other) to form a new cluster, we often see the cluster fail to form with 1 master committing suicide and the other 2 exiting with a stack trace. Here are the logs for a cluster with 3 mesos masters showing this behavior.

https://gist.github.com/lingmann/78d1d65f02b28a419b12#file-gistfile1-txt-L447
{code}
Nov 18 22:30:01 highly-available-2162-c9 mesos-master[2154]: I1118 22:30:01.858300  2321 recover.cpp:188] Received a recover response from a replica in EMPTY status
Nov 18 22:30:01 highly-available-2162-c9 mesos-master[2154]: I1118 22:30:01.859736  2321 recover.cpp:188] Received a recover response from a replica in EMPTY status
Nov 18 22:30:01 highly-available-2162-c9 mesos-master[2154]: I1118 22:30:01.927644  2321 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: F1118 22:30:02.621289  2328 master.cpp:1016] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: *** Check failure stack trace: ***
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: I1118 22:30:02.654814  2321 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: I1118 22:30:02.655992  2327 recover.cpp:188] Received a recover response from a replica in EMPTY status
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: I1118 22:30:02.658712  2327 recover.cpp:188] Received a recover response from a replica in EMPTY status
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: I1118 22:30:02.708247  2327 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383caf3acd  google::LogMessage::Fail()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383caf5883  google::LogMessage::SendToLog()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383caf36e8  google::LogMessage::Flush()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383caf60de  google::LogMessageFatal::~LogMessageFatal()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c5c7946  mesos::internal::master::fail()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c622c31  _ZNSt3tr15_BindIFPFvRKSsS2_EPKcNS_12_PlaceholderILi1EEEEE6__callIIS2_EILi0ELi1EEEENS_9result_ofIFS4_NSC_IFNS_3_MuIS6_Lb0ELb0EEES6_NS_5tupleIIDpT_EEEEE4typeENSC_IFNSD_IS8_Lb0ELb1EEES8_SI_EE4typeEEE4typeERKSI_NS_12_Index_tupleIIXspT0_EEEE
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c622caa  std::tr1::_Function_handler<>::_M_invoke()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c521aa9  process::Future<>::fail()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c661b7a  process::internal::thenf<>()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c67b1e8  process::Future<>::fail()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c67b179  process::Future<>::fail()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c66a629  mesos::internal::master::RegistrarProcess::_recover()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c67f1cb  _ZNSt3tr15_BindIFPFvPN7process11ProcessBaseENS_10shared_ptrINS_8functionIFvPN5mesos8internal6master16RegistrarProcessEEEEEEENS_12_PlaceholderILi1EEESD_EE6__callIJRS3_EJLi0ELi1EEEENS_9result_ofIFSF_NSM_IFNS_3_MuISH_Lb0ELb1EEESH_NS_5tupleIJDpT_EEEEE4typeENSM_IFNSN_ISD_Lb0ELb0EEESD_SS_EE4typeEEE4typeERKSS_NS_12_Index_tupleIJXspT0_EEEE
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c67f224  std::tr1::_Function_handler<>::_M_invoke()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383ca9e57c  process::ProcessManager::resume()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383ca9ee1c  process::schedule()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383a7bdb50  start_thread
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383a5077bd  (unknown)
{code}

https://gist.github.com/lingmann/fcf36fcce8ecc7eb7838#file-gistfile1-txt-L468
{code}
Nov 18 22:30:14 highly-available-2162-ac2 mesos-master[2157]: I1118 22:30:14.013592  2329 master.cpp:1196] The newly elected leader is master@10.6.163.44:5050 with id 20141118-222859-748881418-5050-2157
Nov 18 22:30:14 highly-available-2162-ac2 mesos-master[2157]: I1118 22:30:14.013720  2329 master.cpp:1209] Elected as the leading master!
Nov 18 22:30:14 highly-available-2162-ac2 mesos-master[2157]: I1118 22:30:14.013757  2329 master.cpp:1027] Recovering from registrar
Nov 18 22:30:14 highly-available-2162-ac2 mesos-master[2157]: I1118 22:30:14.013967  2327 registrar.cpp:313] Recovering registrar
Nov 18 22:30:23 highly-available-2162-ac2 mesos-master[2157]: I1118 22:30:23.576931  2330 recover.cpp:111] Unable to finish the recover protocol in 10secs, retrying
Nov 18 22:30:33 highly-available-2162-ac2 mesos-master[2157]: I1118 22:30:33.578037  2330 recover.cpp:111] Unable to finish the recover protocol in 10secs, retrying
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: F1118 22:31:14.015543  2329 master.cpp:1016] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: *** Check failure stack trace: ***
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491c36bacd  google::LogMessage::Fail()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491c36d883  google::LogMessage::SendToLog()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491c36b6e8  google::LogMessage::Flush()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491c36e0de  google::LogMessageFatal::~LogMessageFatal()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491be3f946  mesos::internal::master::fail()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491be9ac31  _ZNSt3tr15_BindIFPFvRKSsS2_EPKcNS_12_PlaceholderILi1EEEEE6__callIIS2_EILi0ELi1EEEENS_9result_ofIFS4_NSC_IFNS_3_MuIS6_Lb0ELb0EEES6_NS_5tupleIIDpT_EEEEE4typeENSC_IFNSD_IS8_Lb0ELb1EEES8_SI_EE4typeEEE4typeERKSI_NS_12_Index_tupleIIXspT0_EEEE
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491be9acaa  std::tr1::_Function_handler<>::_M_invoke()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491bd99aa9  process::Future<>::fail()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491bed9b7a  process::internal::thenf<>()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491bef31e8  process::Future<>::fail()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491bef3179  process::Future<>::fail()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491bee2629  mesos::internal::master::RegistrarProcess::_recover()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491bef71cb  _ZNSt3tr15_BindIFPFvPN7process11ProcessBaseENS_10shared_ptrINS_8functionIFvPN5mesos8internal6master16RegistrarProcessEEEEEEENS_12_PlaceholderILi1EEESD_EE6__callIJRS3_EJLi0ELi1EEEENS_9result_ofIFSF_NSM_IFNS_3_MuISH_Lb0ELb1EEESH_NS_5tupleIJDpT_EEEEE4typeENSM_IFNSN_ISD_Lb0ELb0EEESD_SS_EE4typeEEE4typeERKSS_NS_12_Index_tupleIJXspT0_EEEE
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491bef7224  std::tr1::_Function_handler<>::_M_invoke()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491c31657c  process::ProcessManager::resume()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491c316e1c  process::schedule()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491a035b50  start_thread
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f4919d7f7bd  (unknown)
{code}

https://gist.github.com/lingmann/5691795a4f9d54632e5a#file-gistfile1-txt-L168
{code}
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: 2014-11-18 22:28:35,644:2124(0x7f2e51d04700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [10.6.163.44:2181] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: 2014-11-18 22:28:35,644:2124(0x7f2e51d04700):ZOO_INFO@check_events@1703: initiated connection to server [10.67.129.108:2181]
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: 2014-11-18 22:28:35,645:2124(0x7f2e51d04700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [10.67.129.108:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: W1118 22:28:35.643823  2180 group.cpp:456] Timed out waiting to reconnect to ZooKeeper. Forcing ZooKeeper session (sessionId=149c5026bf20001) expiration
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 22:28:35.645727  2182 group.cpp:472] ZooKeeper session expired
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 22:28:35.645884  2179 detector.cpp:138] Detected a new leader: None
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 22:28:35.645786  2181 group.cpp:472] ZooKeeper session expired
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 22:28:35.645982  2179 contender.cpp:196] Membership cancelled: 1
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 22:28:35.646026  2179 master.cpp:1196] The newly elected leader is None
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: Lost leadership... committing suicide!
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 22:28:35.646095  2180 group.cpp:472] ZooKeeper session expired
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: 2014-11-18 22:28:35,646:2124(0x7f2e56543700):ZOO_INFO@zookeeper_close@2522: Freeing zookeeper resources for sessionId=0x149c5026bf20002
{code}


> Cluster formation often fails when starting mesos-masters with --registry=replicated_log
> ----------------------------------------------------------------------------------------
>
>                 Key: MESOS-2148
>                 URL: https://issues.apache.org/jira/browse/MESOS-2148
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.20.1
>         Environment: Mesos 0.20.1
> Zookeeper 3.3.5
> Debian Wheezy
>            Reporter: Jeremy Lingmann
>
> When starting mesos-masters in close proximity (within 1min of each other) to form a new cluster, we often see the cluster fail to form with 1 master committing suicide and the other 2 exiting with a stack trace. Here are the logs for a cluster with 3 mesos masters showing this behavior.
> https://gist.github.com/lingmann/78d1d65f02b28a419b12#file-gistfile1-txt-L447
> {code}
> Nov 18 22:30:01 highly-available-2162-c9 mesos-master[2154]: I1118 22:30:01.858300  2321 recover.cpp:188] Received a recover response from a replica in EMPTY status
> Nov 18 22:30:01 highly-available-2162-c9 mesos-master[2154]: I1118 22:30:01.859736  2321 recover.cpp:188] Received a recover response from a replica in EMPTY status
> Nov 18 22:30:01 highly-available-2162-c9 mesos-master[2154]: I1118 22:30:01.927644  2321 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
> Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: F1118 22:30:02.621289  2328 master.cpp:1016] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
> Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: *** Check failure stack trace: ***
> Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: I1118 22:30:02.654814  2321 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
> Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: I1118 22:30:02.655992  2327 recover.cpp:188] Received a recover response from a replica in EMPTY status
> Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: I1118 22:30:02.658712  2327 recover.cpp:188] Received a recover response from a replica in EMPTY status
> Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: I1118 22:30:02.708247  2327 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
> Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383caf3acd  google::LogMessage::Fail()
> Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383caf5883  google::LogMessage::SendToLog()
> Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383caf36e8  google::LogMessage::Flush()
> Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383caf60de  google::LogMessageFatal::~LogMessageFatal()
> Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c5c7946  mesos::internal::master::fail()
> Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c622c31  _ZNSt3tr15_BindIFPFvRKSsS2_EPKcNS_12_PlaceholderILi1EEEEE6__callIIS2_EILi0ELi1EEEENS_9result_ofIFS4_NSC_IFNS_3_MuIS6_Lb0ELb0EEES6_NS_5tupleIIDpT_EEEEE4typeENSC_IFNSD_IS8_Lb0ELb1EEES8_SI_EE4typeEEE4typeERKSI_NS_12_Index_tupleIIXspT0_EEEE
> Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c622caa  std::tr1::_Function_handler<>::_M_invoke()
> Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c521aa9  process::Future<>::fail()
> Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c661b7a  process::internal::thenf<>()
> Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c67b1e8  process::Future<>::fail()
> Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c67b179  process::Future<>::fail()
> Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c66a629  mesos::internal::master::RegistrarProcess::_recover()
> Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c67f1cb  _ZNSt3tr15_BindIFPFvPN7process11ProcessBaseENS_10shared_ptrINS_8functionIFvPN5mesos8internal6master16RegistrarProcessEEEEEEENS_12_PlaceholderILi1EEESD_EE6__callIJRS3_EJLi0ELi1EEEENS_9result_ofIFSF_NSM_IFNS_3_MuISH_Lb0ELb1EEESH_NS_5tupleIJDpT_EEEEE4typeENSM_IFNSN_ISD_Lb0ELb0EEESD_SS_EE4typeEEE4typeERKSS_NS_12_Index_tupleIJXspT0_EEEE
> Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383c67f224  std::tr1::_Function_handler<>::_M_invoke()
> Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383ca9e57c  process::ProcessManager::resume()
> Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383ca9ee1c  process::schedule()
> Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383a7bdb50  start_thread
> Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]:     @     0x7f383a5077bd  (unknown)
> {code}
> https://gist.github.com/lingmann/fcf36fcce8ecc7eb7838#file-gistfile1-txt-L468
> {code}
> Nov 18 22:30:14 highly-available-2162-ac2 mesos-master[2157]: I1118 22:30:14.013592  2329 master.cpp:1196] The newly elected leader is master@10.6.163.44:5050 with id 20141118-222859-748881418-5050-2157
> Nov 18 22:30:14 highly-available-2162-ac2 mesos-master[2157]: I1118 22:30:14.013720  2329 master.cpp:1209] Elected as the leading master!
> Nov 18 22:30:14 highly-available-2162-ac2 mesos-master[2157]: I1118 22:30:14.013757  2329 master.cpp:1027] Recovering from registrar
> Nov 18 22:30:14 highly-available-2162-ac2 mesos-master[2157]: I1118 22:30:14.013967  2327 registrar.cpp:313] Recovering registrar
> Nov 18 22:30:23 highly-available-2162-ac2 mesos-master[2157]: I1118 22:30:23.576931  2330 recover.cpp:111] Unable to finish the recover protocol in 10secs, retrying
> Nov 18 22:30:33 highly-available-2162-ac2 mesos-master[2157]: I1118 22:30:33.578037  2330 recover.cpp:111] Unable to finish the recover protocol in 10secs, retrying
> Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: F1118 22:31:14.015543  2329 master.cpp:1016] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
> Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: *** Check failure stack trace: ***
> Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491c36bacd  google::LogMessage::Fail()
> Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491c36d883  google::LogMessage::SendToLog()
> Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491c36b6e8  google::LogMessage::Flush()
> Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491c36e0de  google::LogMessageFatal::~LogMessageFatal()
> Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491be3f946  mesos::internal::master::fail()
> Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491be9ac31  _ZNSt3tr15_BindIFPFvRKSsS2_EPKcNS_12_PlaceholderILi1EEEEE6__callIIS2_EILi0ELi1EEEENS_9result_ofIFS4_NSC_IFNS_3_MuIS6_Lb0ELb0EEES6_NS_5tupleIIDpT_EEEEE4typeENSC_IFNSD_IS8_Lb0ELb1EEES8_SI_EE4typeEEE4typeERKSI_NS_12_Index_tupleIIXspT0_EEEE
> Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491be9acaa  std::tr1::_Function_handler<>::_M_invoke()
> Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491bd99aa9  process::Future<>::fail()
> Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491bed9b7a  process::internal::thenf<>()
> Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491bef31e8  process::Future<>::fail()
> Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491bef3179  process::Future<>::fail()
> Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491bee2629  mesos::internal::master::RegistrarProcess::_recover()
> Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491bef71cb  _ZNSt3tr15_BindIFPFvPN7process11ProcessBaseENS_10shared_ptrINS_8functionIFvPN5mesos8internal6master16RegistrarProcessEEEEEEENS_12_PlaceholderILi1EEESD_EE6__callIJRS3_EJLi0ELi1EEEENS_9result_ofIFSF_NSM_IFNS_3_MuISH_Lb0ELb1EEESH_NS_5tupleIJDpT_EEEEE4typeENSM_IFNSN_ISD_Lb0ELb0EEESD_SS_EE4typeEEE4typeERKSS_NS_12_Index_tupleIJXspT0_EEEE
> Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491bef7224  std::tr1::_Function_handler<>::_M_invoke()
> Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491c31657c  process::ProcessManager::resume()
> Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491c316e1c  process::schedule()
> Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f491a035b50  start_thread
> Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]:     @     0x7f4919d7f7bd  (unknown)
> {code}
> https://gist.github.com/lingmann/5691795a4f9d54632e5a#file-gistfile1-txt-L168
> {code}
> Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: 2014-11-18 22:28:35,644:2124(0x7f2e51d04700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [10.6.163.44:2181] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
> Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: 2014-11-18 22:28:35,644:2124(0x7f2e51d04700):ZOO_INFO@check_events@1703: initiated connection to server [10.67.129.108:2181]
> Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: 2014-11-18 22:28:35,645:2124(0x7f2e51d04700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [10.67.129.108:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
> Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: W1118 22:28:35.643823  2180 group.cpp:456] Timed out waiting to reconnect to ZooKeeper. Forcing ZooKeeper session (sessionId=149c5026bf20001) expiration
> Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 22:28:35.645727  2182 group.cpp:472] ZooKeeper session expired
> Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 22:28:35.645884  2179 detector.cpp:138] Detected a new leader: None
> Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 22:28:35.645786  2181 group.cpp:472] ZooKeeper session expired
> Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 22:28:35.645982  2179 contender.cpp:196] Membership cancelled: 1
> Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 22:28:35.646026  2179 master.cpp:1196] The newly elected leader is None
> Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: Lost leadership... committing suicide!
> Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 22:28:35.646095  2180 group.cpp:472] ZooKeeper session expired
> Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: 2014-11-18 22:28:35,646:2124(0x7f2e56543700):ZOO_INFO@zookeeper_close@2522: Freeing zookeeper resources for sessionId=0x149c5026bf20002
> {code}
> Here are the mesos-master startup options being used:
> {code}
> /usr/local/sbin/mesos-master \
>   --zk=zk://10.132.193.161:2181,10.132.193.153:2181,10.132.193.160:2181/mesos \
>   --port=5050 \
>   --log_dir=/var/log/mesos \
>   --cluster=test \
>   --hostname=10.132.193.161 \
>   --ip=10.132.193.161 \
>   --quorum=2 \
>   --work_dir=/var/lib/mesos
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)