Posted to issues@mesos.apache.org by "Steven Schlansker (JIRA)" <ji...@apache.org> on 2015/10/21 23:48:27 UTC

[jira] [Comment Edited] (MESOS-2186) Mesos crashes if any configured zookeeper does not resolve.

    [ https://issues.apache.org/jira/browse/MESOS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14968022#comment-14968022 ] 

Steven Schlansker edited comment on MESOS-2186 at 10/21/15 9:47 PM:
--------------------------------------------------------------------

I am still able to easily reproduce this, even with master built from today:

{code}
$ ./bin/mesos-master.sh --registry=in_memory --zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,zk3.mycorp.com:2181/wat

I1021 21:48:00.308338 32707 group.cpp:674] Trying to get '/wat/json.info_0000000000' in ZooKeeper
I1021 21:48:00.310456 32708 detector.cpp:482] A new leading master (UPID=master@127.0.1.1:5050) is detected
I1021 21:48:00.310746 32707 master.cpp:1609] The newly elected leader is master@127.0.1.1:5050 with id 950ec119-b0ab-4c55-9143-c6c21b9f187e
I1021 21:48:00.310899 32707 master.cpp:1622] Elected as the leading master!
{code}

With three resolvable ZK members configured, everything works.
Now change one member to an unresolvable hostname. The other two are still alive and correct, so this should be recoverable:
{code}
$ ./bin/mesos-master.sh --registry=in_memory --zk=zk://zk1.mycorp.com:2181,zk2.mycorp.com:2181,badhost.mycorp.com:2181/wat

I1021 21:48:08.466562 32729 contender.cpp:149] Joining the ZK group
2015-10-21 21:48:08,549:32715(0x7f9bdda41700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory

F1021 21:48:08.549351 32736 zookeeper.cpp:111] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
*** Check failure stack trace: ***
2015-10-21 21:48:08,549:32715(0x7f9be0a47700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory

F1021 21:48:08.549708 32730 zookeeper.cpp:111] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
*** Check failure stack trace: ***
    @     0x7f9bec6044c2  google::LogMessage::Fail()
    @     0x7f9bec6044c2  google::LogMessage::Fail()
    @     0x7f9bec60440e  google::LogMessage::SendToLog()
    @     0x7f9bec60440e  google::LogMessage::SendToLog()
    @     0x7f9bec603e10  google::LogMessage::Flush()
    @     0x7f9bec603e10  google::LogMessage::Flush()
    @     0x7f9bec603c25  google::LogMessage::~LogMessage()
    @     0x7f9bec604b85  google::ErrnoLogMessage::~ErrnoLogMessage()
    @     0x7f9bec603c25  google::LogMessage::~LogMessage()
    @     0x7f9bec604b85  google::ErrnoLogMessage::~ErrnoLogMessage()
    @     0x7f9bec00b825  ZooKeeperProcess::initialize()
    @     0x7f9bec00b825  ZooKeeperProcess::initialize()
    @     0x7f9bec57053d  process::ProcessManager::resume()
    @     0x7f9bec56d9ae  _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
    @     0x7f9bec577b54  _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
    @     0x7f9bec57053d  process::ProcessManager::resume()
    @     0x7f9bec577b04  _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
    @     0x7f9bec56d9ae  _ZZN7process14ProcessManager12init_threadsEvENKUlRKSt11atomic_boolE_clES3_
    @     0x7f9bec577a96  _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
    @     0x7f9bec577b54  _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEE6__callIvIEILm0EEEET_OSt5tupleIIDpT0_EESt12_Index_tupleIIXspT1_EEE
    @     0x7f9bec5779ed  _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEEclEv
    @     0x7f9bec577b04  _ZNSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS3_EEEclIIEvEET0_DpOT_
    @     0x7f9bec577986  _ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS6_EEEvEEE6_M_runEv
    @     0x7f9be828ea40  (unknown)
    @     0x7f9bec577a96  _ZNSt12_Bind_simpleIFSt5_BindIFZN7process14ProcessManager12init_threadsEvEUlRKSt11atomic_boolE_St17reference_wrapperIS4_EEEvEE9_M_invokeIIEEEvSt12_Index_tupleIIXspT_EEE
    @     0x7f9be7aab182  start_thread
    @     0x7f9be77d847d  (unknown)
Aborted (core dumped)
{code}

[~rgs] I am very sorry if this does not end up being a ZK problem at all; I am no C++ expert. But Mesos is still trivial to crash if one of the ZK members is not valid (even if the other two are).
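
To make the expected behaviour concrete, here is a minimal sketch of tolerating one bad member: filter the configured server list down to the entries that resolve, and fail only if none do. This is an illustration only, not how Mesos or the ZooKeeper C client actually handle the list. The helper names `resolves` and `usableServers` are hypothetical, the default port of 2181 is an assumption, and bracketed IPv6 literals are not handled.

```cpp
#include <netdb.h>
#include <sys/socket.h>
#include <sys/types.h>

#include <functional>
#include <iostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of Mesos or ZooKeeper): returns true if
// `hostport` ("host:port") can be resolved with getaddrinfo().
bool resolves(const std::string& hostport)
{
  const std::string::size_type colon = hostport.rfind(':');
  const std::string host = hostport.substr(0, colon);
  // Assumption: fall back to the conventional ZooKeeper port if none given.
  const std::string port =
    colon == std::string::npos ? "2181" : hostport.substr(colon + 1);

  addrinfo hints = {};
  hints.ai_family = AF_UNSPEC;
  hints.ai_socktype = SOCK_STREAM;

  addrinfo* result = nullptr;
  if (getaddrinfo(host.c_str(), port.c_str(), &hints, &result) != 0) {
    return false;
  }

  freeaddrinfo(result);
  return true;
}

// Hypothetical helper: splits a comma-separated server list and keeps only
// the entries accepted by `resolver`. Throws std::runtime_error if no entry
// is accepted, since there is then nothing to connect to.
std::vector<std::string> usableServers(
    const std::string& servers,
    const std::function<bool(const std::string&)>& resolver = resolves)
{
  std::vector<std::string> usable;
  std::istringstream in(servers);
  std::string entry;

  while (std::getline(in, entry, ',')) {
    if (entry.empty()) {
      continue;
    }

    if (resolver(entry)) {
      usable.push_back(entry);
    } else {
      std::cerr << "Skipping unresolvable ZooKeeper server: " << entry
                << std::endl;
    }
  }

  if (usable.empty()) {
    throw std::runtime_error("No ZooKeeper server could be resolved");
  }

  return usable;
}
```

With the list from the failing command above, `usableServers("zk1.mycorp.com:2181,zk2.mycorp.com:2181,badhost.mycorp.com:2181")` would be expected to log a warning for `badhost.mycorp.com:2181` and return the remaining two entries, provided they resolve. The resolver is passed in as a parameter so the filtering logic can be exercised without real DNS.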



> Mesos crashes if any configured zookeeper does not resolve.
> -----------------------------------------------------------
>
>                 Key: MESOS-2186
>                 URL: https://issues.apache.org/jira/browse/MESOS-2186
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.21.0
>         Environment: Zookeeper:  3.4.5+28-1.cdh4.7.1.p0.13.el6
> Mesos: 0.21.0-1.0.centos65
> CentOS: CentOS release 6.6 (Final)
>            Reporter: Daniel Hall
>            Priority: Critical
>              Labels: mesosphere
>
> When starting Mesos, if one of the configured ZooKeeper servers does not resolve in DNS, Mesos will crash and refuse to start. We noticed this issue while we were rebuilding one of our ZooKeeper hosts in Google Compute Engine (which bases DNS records on the machines that are running).
> Here is a log from a failed startup (hostnames and IP addresses have been sanitised).
> {noformat}
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.088835 28627 main.cpp:292] Starting Mesos master
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 22:54:54,095:28627(0x7fa9f042f700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.095239 28642 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 22:54:54,097:28627(0x7fa9ed22a700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a0160  google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a0160  google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a00b9  google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 22:54:54,108:28627(0x7fa9ef02d700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a0160  google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 2014-12-09 22:54:54,109:28627(0x7fa9f0e30700):ZOO_ERROR@getaddrs@599: getaddrinfo: No such file or directory
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: 
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: F1209 22:54:54.097718 28647 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 22:54:54.108422 28644 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]F1209 22:54:54.109864 28641 zookeeper.cpp:113] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: *** Check failure stack trace: ***
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a0160  google::LogMessage::Fail()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a00b9  google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a00b9  google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123208 28640 master.cpp:318] Master 20141209-225454-4155764746-5050-28627 (mesosmaster-2.internal) started on 10.x.x.x:5050
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123306 28640 master.cpp:366] Master allowing unauthenticated frameworks to register
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.123327 28640 master.cpp:371] Master allowing unauthenticated slaves to register
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a00b9  google::LogMessage::SendToLog()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569fa97  google::LogMessage::Flush()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569fa97  google::LogMessage::Flush()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569fa97  google::LogMessage::Flush()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569f8af  google::LogMessage::~LogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a086f  google::ErrnoLogMessage::~ErrnoLogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569fa97  google::LogMessage::Flush()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.159488 28643 contender.cpp:131] Joining the ZK group
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]: I1209 22:54:54.160753 28640 master.cpp:1202] Successfully attached file '/var/log/mesos/mesos-master.INFO'
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569f8af  google::LogMessage::~LogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a086f  google::ErrnoLogMessage::~ErrnoLogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569f8af  google::LogMessage::~LogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a086f  google::ErrnoLogMessage::~ErrnoLogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f569f8af  google::LogMessage::~LogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f56a086f  google::ErrnoLogMessage::~ErrnoLogMessage()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5201abf  ZooKeeperProcess::initialize()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5604367  process::ProcessManager::resume()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5201abf  ZooKeeperProcess::initialize()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5201abf  ZooKeeperProcess::initialize()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5201abf  ZooKeeperProcess::initialize()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5604367  process::ProcessManager::resume()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5604367  process::ProcessManager::resume()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f5604367  process::ProcessManager::resume()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @     0x7fa9f55fa21f  process::schedule()
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @       0x3e498079d1  (unknown)
> Dec  9 22:54:54 mesosmaster-2 mesos-master[28627]:     @       0x3e494e89dd  (unknown)
> Dec  9 22:54:54 mesosmaster-2 abrt[28650]: Not saving repeating crash in '/usr/local/sbin/mesos-master'
> Dec  9 22:54:54 mesosmaster-2 init: mesos-master main process (28627) killed by ABRT signal
> {noformat}


