Posted to issues@mesos.apache.org by "Jeremy Lingmann (JIRA)" <ji...@apache.org> on 2014/11/20 23:28:33 UTC
[jira] [Updated] (MESOS-2148) Cluster formation often fails when starting mesos-masters with --registry=replicated_log
[ https://issues.apache.org/jira/browse/MESOS-2148?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jeremy Lingmann updated MESOS-2148:
-----------------------------------
Description:
When starting mesos-masters within about one minute of each other to form a new cluster, we often see the cluster fail to form: one master commits suicide and the other two exit with a stack trace. Here are the logs from a cluster of three mesos-masters showing this behavior.
https://gist.github.com/lingmann/78d1d65f02b28a419b12#file-gistfile1-txt-L447
{code}
Nov 18 22:30:01 highly-available-2162-c9 mesos-master[2154]: I1118 22:30:01.858300 2321 recover.cpp:188] Received a recover response from a replica in EMPTY status
Nov 18 22:30:01 highly-available-2162-c9 mesos-master[2154]: I1118 22:30:01.859736 2321 recover.cpp:188] Received a recover response from a replica in EMPTY status
Nov 18 22:30:01 highly-available-2162-c9 mesos-master[2154]: I1118 22:30:01.927644 2321 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: F1118 22:30:02.621289 2328 master.cpp:1016] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: *** Check failure stack trace: ***
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: I1118 22:30:02.654814 2321 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: I1118 22:30:02.655992 2327 recover.cpp:188] Received a recover response from a replica in EMPTY status
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: I1118 22:30:02.658712 2327 recover.cpp:188] Received a recover response from a replica in EMPTY status
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: I1118 22:30:02.708247 2327 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ 0x7f383caf3acd google::LogMessage::Fail()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ 0x7f383caf5883 google::LogMessage::SendToLog()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ 0x7f383caf36e8 google::LogMessage::Flush()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ 0x7f383caf60de google::LogMessageFatal::~LogMessageFatal()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ 0x7f383c5c7946 mesos::internal::master::fail()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ 0x7f383c622c31 _ZNSt3tr15_BindIFPFvRKSsS2_EPKcNS_12_PlaceholderILi1EEEEE6__callIIS2_EILi0ELi1EEEENS_9result_ofIFS4_NSC_IFNS_3_MuIS6_Lb0ELb0EEES6_NS_5tupleIIDpT_EEEEE4typeENSC_IFNSD_IS8_Lb0ELb1EEES8_SI_EE4typeEEE4typeERKSI_NS_12_Index_tupleIIXspT0_EEEE
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ 0x7f383c622caa std::tr1::_Function_handler<>::_M_invoke()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ 0x7f383c521aa9 process::Future<>::fail()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ 0x7f383c661b7a process::internal::thenf<>()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ 0x7f383c67b1e8 process::Future<>::fail()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ 0x7f383c67b179 process::Future<>::fail()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ 0x7f383c66a629 mesos::internal::master::RegistrarProcess::_recover()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ 0x7f383c67f1cb _ZNSt3tr15_BindIFPFvPN7process11ProcessBaseENS_10shared_ptrINS_8functionIFvPN5mesos8internal6master16RegistrarProcessEEEEEEENS_12_PlaceholderILi1EEESD_EE6__callIJRS3_EJLi0ELi1EEEENS_9result_ofIFSF_NSM_IFNS_3_MuISH_Lb0ELb1EEESH_NS_5tupleIJDpT_EEEEE4typeENSM_IFNSN_ISD_Lb0ELb0EEESD_SS_EE4typeEEE4typeERKSS_NS_12_Index_tupleIJXspT0_EEEE
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ 0x7f383c67f224 std::tr1::_Function_handler<>::_M_invoke()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ 0x7f383ca9e57c process::ProcessManager::resume()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ 0x7f383ca9ee1c process::schedule()
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ 0x7f383a7bdb50 start_thread
Nov 18 22:30:02 highly-available-2162-c9 mesos-master[2154]: @ 0x7f383a5077bd (unknown)
{code}
https://gist.github.com/lingmann/fcf36fcce8ecc7eb7838#file-gistfile1-txt-L468
{code}
Nov 18 22:30:14 highly-available-2162-ac2 mesos-master[2157]: I1118 22:30:14.013592 2329 master.cpp:1196] The newly elected leader is master@10.6.163.44:5050 with id 20141118-222859-748881418-5050-2157
Nov 18 22:30:14 highly-available-2162-ac2 mesos-master[2157]: I1118 22:30:14.013720 2329 master.cpp:1209] Elected as the leading master!
Nov 18 22:30:14 highly-available-2162-ac2 mesos-master[2157]: I1118 22:30:14.013757 2329 master.cpp:1027] Recovering from registrar
Nov 18 22:30:14 highly-available-2162-ac2 mesos-master[2157]: I1118 22:30:14.013967 2327 registrar.cpp:313] Recovering registrar
Nov 18 22:30:23 highly-available-2162-ac2 mesos-master[2157]: I1118 22:30:23.576931 2330 recover.cpp:111] Unable to finish the recover protocol in 10secs, retrying
Nov 18 22:30:33 highly-available-2162-ac2 mesos-master[2157]: I1118 22:30:33.578037 2330 recover.cpp:111] Unable to finish the recover protocol in 10secs, retrying
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: F1118 22:31:14.015543 2329 master.cpp:1016] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: *** Check failure stack trace: ***
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ 0x7f491c36bacd google::LogMessage::Fail()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ 0x7f491c36d883 google::LogMessage::SendToLog()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ 0x7f491c36b6e8 google::LogMessage::Flush()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ 0x7f491c36e0de google::LogMessageFatal::~LogMessageFatal()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ 0x7f491be3f946 mesos::internal::master::fail()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ 0x7f491be9ac31 _ZNSt3tr15_BindIFPFvRKSsS2_EPKcNS_12_PlaceholderILi1EEEEE6__callIIS2_EILi0ELi1EEEENS_9result_ofIFS4_NSC_IFNS_3_MuIS6_Lb0ELb0EEES6_NS_5tupleIIDpT_EEEEE4typeENSC_IFNSD_IS8_Lb0ELb1EEES8_SI_EE4typeEEE4typeERKSI_NS_12_Index_tupleIIXspT0_EEEE
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ 0x7f491be9acaa std::tr1::_Function_handler<>::_M_invoke()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ 0x7f491bd99aa9 process::Future<>::fail()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ 0x7f491bed9b7a process::internal::thenf<>()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ 0x7f491bef31e8 process::Future<>::fail()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ 0x7f491bef3179 process::Future<>::fail()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ 0x7f491bee2629 mesos::internal::master::RegistrarProcess::_recover()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ 0x7f491bef71cb _ZNSt3tr15_BindIFPFvPN7process11ProcessBaseENS_10shared_ptrINS_8functionIFvPN5mesos8internal6master16RegistrarProcessEEEEEEENS_12_PlaceholderILi1EEESD_EE6__callIJRS3_EJLi0ELi1EEEENS_9result_ofIFSF_NSM_IFNS_3_MuISH_Lb0ELb1EEESH_NS_5tupleIJDpT_EEEEE4typeENSM_IFNSN_ISD_Lb0ELb0EEESD_SS_EE4typeEEE4typeERKSS_NS_12_Index_tupleIJXspT0_EEEE
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ 0x7f491bef7224 std::tr1::_Function_handler<>::_M_invoke()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ 0x7f491c31657c process::ProcessManager::resume()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ 0x7f491c316e1c process::schedule()
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ 0x7f491a035b50 start_thread
Nov 18 22:31:14 highly-available-2162-ac2 mesos-master[2157]: @ 0x7f4919d7f7bd (unknown)
{code}
https://gist.github.com/lingmann/5691795a4f9d54632e5a#file-gistfile1-txt-L168
{code}
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: 2014-11-18 22:28:35,644:2124(0x7f2e51d04700):ZOO_ERROR@handle_socket_error_msg@1697: Socket [10.6.163.44:2181] zk retcode=-4, errno=111(Connection refused): server refused to accept the client
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: 2014-11-18 22:28:35,644:2124(0x7f2e51d04700):ZOO_INFO@check_events@1703: initiated connection to server [10.67.129.108:2181]
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: 2014-11-18 22:28:35,645:2124(0x7f2e51d04700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [10.67.129.108:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: W1118 22:28:35.643823 2180 group.cpp:456] Timed out waiting to reconnect to ZooKeeper. Forcing ZooKeeper session (sessionId=149c5026bf20001) expiration
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 22:28:35.645727 2182 group.cpp:472] ZooKeeper session expired
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 22:28:35.645884 2179 detector.cpp:138] Detected a new leader: None
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 22:28:35.645786 2181 group.cpp:472] ZooKeeper session expired
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 22:28:35.645982 2179 contender.cpp:196] Membership cancelled: 1
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 22:28:35.646026 2179 master.cpp:1196] The newly elected leader is None
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: Lost leadership... committing suicide!
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: I1118 22:28:35.646095 2180 group.cpp:472] ZooKeeper session expired
Nov 18 22:28:35 highly-available-2162-7e4 mesos-master[2124]: 2014-11-18 22:28:35,646:2124(0x7f2e56543700):ZOO_INFO@zookeeper_close@2522: Freeing zookeeper resources for sessionId=0x149c5026bf20002
{code}
Here are the mesos-master startup options being used:
{code}
/usr/local/sbin/mesos-master \
--zk=zk://10.132.193.161:2181,10.132.193.153:2181,10.132.193.160:2181/mesos \
--port=5050 \
--log_dir=/var/log/mesos \
--cluster=test \
--hostname=10.132.193.161 \
--ip=10.132.193.161 \
--quorum=2 \
--work_dir=/var/lib/mesos
{code}
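The "Failed to perform fetch within 1mins" fatal line matches the master's registrar fetch timeout. As a possible workaround (an untested assumption, not a confirmed fix), that window could be widened via the {{--registrar_fetch_timeout}} mesos-master flag so a slowly forming replicated-log quorum has more time to complete the initial recover protocol:

```shell
# Sketch of a possible workaround: raise the registrar fetch timeout above
# the 1-minute default seen in the fatal log line. --registrar_fetch_timeout
# is an existing mesos-master flag; the 3mins value is an illustrative guess,
# not a tested recommendation.
/usr/local/sbin/mesos-master \
  --zk=zk://10.132.193.161:2181,10.132.193.153:2181,10.132.193.160:2181/mesos \
  --port=5050 \
  --log_dir=/var/log/mesos \
  --cluster=test \
  --hostname=10.132.193.161 \
  --ip=10.132.193.161 \
  --quorum=2 \
  --work_dir=/var/lib/mesos \
  --registrar_fetch_timeout=3mins
```

This only papers over the race between quorum formation and the recovery deadline; it does not address why the recover protocol stalls while the other replicas are still starting up.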
was: (previous description; identical to the above except it did not include the mesos-master startup options)
> Cluster formation often fails when starting mesos-masters with --registry=replicated_log
> ----------------------------------------------------------------------------------------
>
> Key: MESOS-2148
> URL: https://issues.apache.org/jira/browse/MESOS-2148
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 0.20.1
> Environment: Mesos 0.20.1
> Zookeeper 3.3.5
> Debian Wheezy
> Reporter: Jeremy Lingmann
>
> (full issue description as above)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)