Posted to issues@mesos.apache.org by "Steven Schlansker (JIRA)" <ji...@apache.org> on 2015/10/21 23:48:27 UTC
[jira] [Comment Edited] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.
[ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968022#comment-14968022 ]
Steven Schlansker edited comment on MESOS-2186 at 10/21/15 9:47 PM:
--------------------------------------------------------------------
I am still able to easily reproduce this, even with master built from today:
{code}
$ ./bin/mesos-master.sh --registry=in_memory --zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,zk3.mycorp.com:2181/wat
I1021 21:48:00.308338 32707 group.cpp:674] Trying to get '/wat/json.info_0000000000' in ZooKeeper
I1021 21:48:00.310456 32708 detector.cpp:482] A new leading master (UPID=master@127.0.1.1:5050) is detected
I1021 21:48:00.310746 32707 master.cpp:1609] The newly elected leader is master@127.0.1.1:5050 with id 950ec119-b0ab-4c55-9143-c6c21b9f187e
I1021 21:48:00.310899 32707 master.cpp:1622] Elected as the leading master!
{code}
Three configured ZK members, all is OK.
Change one to be an unresolvable hostname -- two are still alive and correct though, so this should be recoverable:
{code}
$ ./bin/mesos-master.sh --registry=in_memory --zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,badhost.mycorp.com:2181/wat
I1021 21:48:08.466562 32729 contender.cpp:149] Joining the ZK group
2015-10-21 21:48:08,549:32715(0x7f9bdda41700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory
F1021 21:48:08.549351 32736 zookeeper.cpp:111] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
*** Check failure stack trace: ***
2015-10-21 21:48:08,549:32715(0x7f9be0a47700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory
F1021 21:48:08.549708 32730 zookeeper.cpp:111] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
*** Check failure stack trace: ***
@ 0x7f9bec6044c2 google::LogMessage::Fail()
@ 0x7f9bec6044c2 google::LogMessage::Fail()
@ 0x7f9bec60440e google::LogMessage::SendToLog()
@ 0x7f9bec60440e google::LogMessage::SendToLog()
@ 0x7f9bec603e10 google::LogMessage::Flush()
@ 0x7f9bec603e10 google::LogMessage::Flush()
@ 0x7f9bec603c25 google::LogMessage::~LogMessage()
@ 0x7f9bec604b85 google::ErrnoLogMessage::~ErrnoLogMessage()
@ 0x7f9bec603c25 google::LogMessage::~LogMessage()
@ 0x7f9bec604b85 google::ErrnoLogMessage::~ErrnoLogMessage()
@ 0x7f9bec00b825 ZooKeeperProcess::initialize()
@ 0x7f9bec00b825 ZooKeeperProcess::initialize()
@ 0x7f9bec57053d process::ProcessManager::resume()
@ 0x7f9bec56d9ae _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
@ 0x7f9bec577b54 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
@ 0x7f9bec57053d process::ProcessManager::resume()
@ 0x7f9bec577b04 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
@ 0x7f9bec56d9ae _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
@ 0x7f9bec577a96 _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
@ 0x7f9bec577b54 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
@ 0x7f9bec5779ed _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv
@ 0x7f9bec577b04 _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
@ 0x7f9bec577986 _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv
@ 0x7f9be828ea40 (unknown)
@ 0x7f9bec577a96 _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
@ 0x7f9be7aab182 start_thread
@ 0x7f9be77d847d (unknown)
Aborted (core dumped)
{code}
[~rgs] I am very sorry if this does not end up being a ZK problem at all; I am no C++ expert. But Mesos is still trivial to crash if one of the ZK members is not valid (even if the other two are).
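To make the expected behaviour concrete, here is a minimal sketch of tolerant startup: drop the entries of the server list that do not resolve and treat startup as fatal only if nothing is left. This is not Mesos or ZooKeeper client code. The helper names `splitServers`, `resolvesViaDns`, and `filterResolvable` are hypothetical, and the sketch assumes plain `host:port` entries (no IPv6 literals).

```cpp
#include <netdb.h>
#include <sys/socket.h>

#include <functional>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split "host1:2181,host2:2181" into its entries,
// skipping empty ones.
std::vector<std::string> splitServers(const std::string& servers) {
  std::vector<std::string> result;
  std::stringstream stream(servers);
  std::string entry;
  while (std::getline(stream, entry, ',')) {
    if (!entry.empty()) {
      result.push_back(entry);
    }
  }
  return result;
}

// Hypothetical helper: true if the host part of "host:port" resolves.
// Assumes the last ':' separates host and port.
bool resolvesViaDns(const std::string& hostPort) {
  const std::string host = hostPort.substr(0, hostPort.rfind(':'));

  addrinfo hints = {};
  hints.ai_family = AF_UNSPEC;
  hints.ai_socktype = SOCK_STREAM;

  addrinfo* info = nullptr;
  if (getaddrinfo(host.c_str(), nullptr, &hints, &info) != 0) {
    return false;  // Unresolvable: the caller skips this entry.
  }
  freeaddrinfo(info);
  return true;
}

// Hypothetical helper: keep only the entries accepted by `resolves`.
// The resolver is injectable so the filtering can be tested without DNS.
std::string filterResolvable(
    const std::string& servers,
    const std::function<bool(const std::string&)>& resolves = resolvesViaDns) {
  std::string kept;
  for (const std::string& entry : splitServers(servers)) {
    if (!resolves(entry)) {
      continue;
    }
    if (!kept.empty()) {
      kept += ',';
    }
    kept += entry;
  }
  return kept;
}
```

A caller could pass the filtered list to `zookeeper_init` and abort only when the result is empty. The trade-off is that a host that fails to resolve at startup would stay excluded until the list is filtered again, so this is a sketch of the idea rather than a proposed patch.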
> Mesos crashes if any configured zookeeper does not resolve.
> -----------------------------------------------------------
>
> Key: MESOS-2186
> URL: https://issues.apache.org/jira/browse/MESOS-2186
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 0.21.0
> Environment: Zookeeper: 3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
> Reporter: Daniel Hall
> Priority: Critical
> Labels: mesosphere
>
> When starting Mesos, if one of the configured ZooKeeper servers does not resolve in DNS, Mesos will crash and refuse to start. We noticed this issue while we were rebuilding one of our ZooKeeper hosts in Google Compute Engine (which bases its DNS records on the machines that are running).
> Here is a log from a failed startup (hostnames and IP addresses have been sanitised).
> {noformat}
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.088835 28627 main.cpp:292] Starting Mesos master
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 22:54:54,095:28627(0x7fa9f042f700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]:
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.095239 28642 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack trace: ***
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 22:54:54,097:28627(0x7fa9ed22a700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]:
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack trace: ***
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160 google::LogMessage::Fail()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160 google::LogMessage::Fail()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9 google::LogMessage::SendToLog()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 22:54:54,108:28627(0x7fa9ef02d700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]:
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack trace: ***
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160 google::LogMessage::Fail()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 22:54:54,109:28627(0x7fa9f0e30700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]:
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 22:54:54.109864 28641 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack trace: ***
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a0160 google::LogMessage::Fail()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9 google::LogMessage::SendToLog()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9 google::LogMessage::SendToLog()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123208 28640 master.cpp:318] Master 20141209-225454-4155764746-5050-28627 (mesosmaster-2.internal) started on 10.x.x.x:5050
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123306 28640 master.cpp:366] Master allowing unauthenticated frameworks to register
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123327 28640 master.cpp:371] Master allowing unauthenticated slaves to register
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a00b9 google::LogMessage::SendToLog()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f569fa97 google::LogMessage::Flush()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f569fa97 google::LogMessage::Flush()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f569fa97 google::LogMessage::Flush()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f569f8af google::LogMessage::~LogMessage()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a086f google::ErrnoLogMessage::~ErrnoLogMessage()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f569fa97 google::LogMessage::Flush()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.159488 28643 contender.cpp:131] Joining the ZK group
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.160753 28640 master.cpp:1202] Successfully attached file '/var/log/mesos/mesos-master.INFO'
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f569f8af google::LogMessage::~LogMessage()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a086f google::ErrnoLogMessage::~ErrnoLogMessage()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f569f8af google::LogMessage::~LogMessage()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a086f google::ErrnoLogMessage::~ErrnoLogMessage()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f569f8af google::LogMessage::~LogMessage()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f56a086f google::ErrnoLogMessage::~ErrnoLogMessage()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f5201abf ZooKeeperProcess::initialize()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f5604367 process::ProcessManager::resume()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f5201abf ZooKeeperProcess::initialize()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f5201abf ZooKeeperProcess::initialize()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f5201abf ZooKeeperProcess::initialize()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f5604367 process::ProcessManager::resume()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f5604367 process::ProcessManager::resume()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f5604367 process::ProcessManager::resume()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x7fa9f55fa21f process::schedule()
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x3e498079d1 (unknown)
> Dec 9 22:54:54 mesosmaster-2 mesos-master[28627]: @ 0x3e494e89dd (unknown)
> Dec 9 22:54:54 mesosmaster-2 abrt[28650]: Not saving repeating crash in '/usr/local/sbin/mesos-master'
> Dec 9 22:54:54 mesosmaster-2 init: mesos-master main process (28627) killed by ABRT signal
> {noformat}