You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mesos.apache.org by Davies Liu <da...@gmail.com> on 2011/12/16 11:03:27 UTC

Mesos master suicide after ZooKeeper session expired

Hi devs:

Our Mesos master suicide after ZooKeeper session expired several times
in a day with the following msg:


2011-12-16 17:06:35,594:7696(0x7f824874f700):ZOO_DEBUG@zoo_awget_children_@2612:
Sending request xid=0x4eeb07ce for path [/mesos_master] to
192.168.1.32:2181
2011-12-16 17:06:35,594:7696(0x7f8246f4c700):ZOO_ERROR@handle_socket_error_msg@1621:
Socket [192.168.1.32:2181] zk retcode=-112, errno=116(Stale NFS file
handle): sessionId=0x333cf926ee322ef has expired.
2011-12-16 17:06:35,594:7696(0x7f8246f4c700):ZOO_DEBUG@handle_error@1138:
Calling a watcher for a ZOO_SESSION_EVENT and the
state=ZOO_EXPIRED_SESSION_STATE
2011-12-16 17:06:35,594:7696(0x7f8246f4c700):ZOO_DEBUG@do_io@317: IO
thread terminated
2011-12-16 17:06:35,594:7696(0x7f824674b700):ZOO_DEBUG@process_completions@1765:
Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
2011-12-16 17:06:35,594:7696(0x7f824674b700):ZOO_DEBUG@process_completions@1795:
Calling COMPLETION_STRINGLIST for xid=0x4eeb07ce rc=-112
E1216 17:06:35.594744  7699 detector.cpp:559] Master detector failed
to get masters: session expired
I1216 17:06:48.371737  7699 detector.cpp:457] Master detector lost
connection to ZooKeeper, attempting to reconnect ...
W1216 17:06:48.371817  7699 detector.cpp:500] Master detector
ZooKeeper session expired!
2011-12-16 17:06:48,371:7696(0x7f824674b700):ZOO_DEBUG@do_completion@336:
completion thread terminated
2011-12-16 17:06:48,371:7696(0x7f824874f700):ZOO_INFO@zookeeper_close@2307:
Freeing zookeeper resources for sessionId=0x333cf926ee322ef
F1216 17:06:48.374893  7699 master.cpp:475] No longer elected master
... committing suicide!
*** Check failure stack trace: ***
    @           0x5aa61d  google::LogMessage::Fail()
    @           0x5ace2f  google::LogMessage::SendToLog()
    @           0x5aa206  google::LogMessage::Flush()
    @           0x5ad6ad  google::LogMessageFatal::~LogMessageFatal()
    @           0x439951  mesos::internal::master::Master::noMasterDetected()
    @           0x440158  mesos::internal::master::Master::operator()()
    @           0x5bfc39  process::ProcessManager::run()
    @           0x5bfd8a  process::trampoline()
    @     0x7f8248791160  (unknown)





At some times, it can reconnect to ZooKeeper without suicide:


2011-12-16 17:01:28,580:7696(0x7f824674b700):ZOO_WARN@zookeeper_interest@1461:
Exceeded deadline by 13ms
2011-12-16 17:01:28,581:7696(0x7f824674b700):ZOO_DEBUG@zookeeper_process@1974:
Got ping response in 0 ms
2011-12-16 17:01:31,921:7696(0x7f824674b700):ZOO_DEBUG@zookeeper_process@1974:
Got ping response in 0 ms
2011-12-16 17:01:54,649:7696(0x7f824674b700):ZOO_WARN@zookeeper_interest@1461:
Exceeded deadline by 19401ms
2011-12-16 17:01:54,649:7696(0x7f824674b700):ZOO_ERROR@handle_socket_error_msg@1528:
Socket [192.168.1.33:2181] zk retcode=-7, errno=110(Connection timed
out): connection timed out (exceeded timeout by 16062ms)
2011-12-16 17:01:54,649:7696(0x7f824674b700):ZOO_DEBUG@handle_error@1141:
Calling a watcher for a ZOO_SESSION_EVENT and the
state=CONNECTING_STATE
2011-12-16 17:01:54,649:7696(0x7f824674b700):ZOO_WARN@zookeeper_interest@1461:
Exceeded deadline by 19401ms
2011-12-16 17:01:54,649:7696(0x7f8246f4c700):ZOO_DEBUG@process_completions@1765:
Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
I1216 17:01:54.657625  7699 detector.cpp:457] Master detector lost
connection to ZooKeeper, attempting to reconnect ...
2011-12-16 17:01:54,660:7696(0x7f824674b700):ZOO_INFO@check_events@1585:
initiated connection to server [192.168.1.34:2181]
2011-12-16 17:01:54,662:7696(0x7f824674b700):ZOO_ERROR@handle_socket_error_msg@1621:
Socket [192.168.1.34:2181] zk retcode=-112, errno=116(Stale NFS file
handle): sessionId=0x333cf926ee322ee has expired.
2011-12-16 17:01:54,662:7696(0x7f824674b700):ZOO_DEBUG@handle_error@1138:
Calling a watcher for a ZOO_SESSION_EVENT and the
state=ZOO_EXPIRED_SESSION_STATE
2011-12-16 17:01:54,662:7696(0x7f824674b700):ZOO_DEBUG@do_io@317: IO
thread terminated
2011-12-16 17:01:54,662:7696(0x7f8246f4c700):ZOO_DEBUG@process_completions@1765:
Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
W1216 17:01:54.662583  7699 detector.cpp:500] Master detector
ZooKeeper session expired!
2011-12-16 17:01:54,662:7696(0x7f8246f4c700):ZOO_DEBUG@do_completion@336:
completion thread terminated
2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@zookeeper_close@2307:
Freeing zookeeper resources for sessionId=0x333cf926ee322ee

2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@log_env@658:
Client environment:zookeeper.version=zookeeper C client 3.3.1
2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@log_env@662:
Client environment:host.name=boromir
2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@log_env@669:
Client environment:os.name=Linux
2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@log_env@670:
Client environment:os.arch=2.6.39-gentoo-r3
2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@log_env@671:
Client environment:os.version=#4 SMP Mon Sep 19 17:12:59 CST 2011
2011-12-16 17:01:54,663:7696(0x7f824874f700):ZOO_INFO@log_env@679:
Client environment:user.name=(null)
2011-12-16 17:01:54,663:7696(0x7f824874f700):ZOO_INFO@log_env@687:
Client environment:user.home=/root
2011-12-16 17:01:54,663:7696(0x7f824874f700):ZOO_INFO@log_env@699:
Client environment:user.dir=/opt/mesos/bin
2011-12-16 17:01:54,663:7696(0x7f824874f700):ZOO_INFO@zookeeper_init@727:
Initiating client connection,
host=zk1:2181,zk2:2181,zk3:2181,zk4:2181,zk5:2181 sessionTimeout=10000
watcher=0x4cb750 sessionId=0 sessionPasswd=<null>
context=0x7f823803c380 flags=0
2011-12-16 17:01:54,663:7696(0x7f824874f700):ZOO_DEBUG@start_threads@152:
starting threads...
2011-12-16 17:01:54,680:7696(0x7f824674b700):ZOO_DEBUG@do_completion@326:
started completion thread
2011-12-16 17:01:54,680:7696(0x7f8246f4c700):ZOO_DEBUG@do_io@279:
started IO thread
2011-12-16 17:01:54,680:7696(0x7f8246f4c700):ZOO_INFO@check_events@1585:
initiated connection to server [192.168.1.33:2181]
2011-12-16 17:01:54,719:7696(0x7f8246f4c700):ZOO_INFO@check_events@1632:
session establishment complete on server [192.168.1.33:2181],
sessionId=0x333cf926ee322ef, negotiated timeout=10000
2011-12-16 17:01:54,719:7696(0x7f8246f4c700):ZOO_DEBUG@check_events@1638:
Calling a watcher for a ZOO_SESSION_EVENT and the
state=ZOO_CONNECTED_STATE
2011-12-16 17:01:54,719:7696(0x7f824674b700):ZOO_DEBUG@process_completions@1765:
Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
I1216 17:01:54.719355  7699 detector.cpp:375] Master detector
connected to ZooKeeper ...
I1216 17:01:54.719421  7699 detector.cpp:404] Trying to create znode
'/mesos_master' in ZooKeeper
2011-12-16 17:01:54,719:7696(0x7f824874f700):ZOO_DEBUG@zoo_acreate@2489:
Sending request xid=0x4eeb07c9 for path [/mesos_master] to
192.168.1.33:2181
2011-12-16 17:01:54,723:7696(0x7f8246f4c700):ZOO_DEBUG@zookeeper_process@1980:
Queueing asynchronous response
2011-12-16 17:01:54,723:7696(0x7f824674b700):ZOO_DEBUG@process_completions@1817:
Calling COMPLETION_STRING for xid=0x4eeb07c9 rc=-110
2011-12-16 17:01:54,724:7696(0x7f824874f700):ZOO_DEBUG@zoo_awget@2400:
Sending request xid=0x4eeb07ca for path [/mesos_master] to
192.168.1.33:2181
2011-12-16 17:01:54,724:7696(0x7f8246f4c700):ZOO_DEBUG@zookeeper_process@1980:
Queueing asynchronous response
2011-12-16 17:01:54,724:7696(0x7f824674b700):ZOO_DEBUG@process_completions@1772:
Calling COMPLETION_DATA for xid=0x4eeb07ca rc=0
2011-12-16 17:01:54,724:7696(0x7f824874f700):ZOO_DEBUG@zoo_acreate@2489:
Sending request xid=0x4eeb07cb for path [/mesos_master/] to
192.168.1.33:2181
2011-12-16 17:01:54,726:7696(0x7f8246f4c700):ZOO_DEBUG@zookeeper_process@1980:
Queueing asynchronous response
2011-12-16 17:01:54,726:7696(0x7f824674b700):ZOO_DEBUG@process_completions@1817:
Calling COMPLETION_STRING for xid=0x4eeb07cb rc=0
I1216 17:01:54.726915  7699 detector.cpp:443] Created
ephemeral/sequence:0000000344
2011-12-16 17:01:54,727:7696(0x7f824874f700):ZOO_DEBUG@zoo_awget_children_@2612:
Sending request xid=0x4eeb07cc for path [/mesos_master] to
192.168.1.33:2181
W1216 17:01:54.727063  7699 master.cpp:299] Dropping unknown message
'mesos.internal.GotMasterTokenMessage' from: @0.0.0.0:0
2011-12-16 17:01:54,727:7696(0x7f8246f4c700):ZOO_DEBUG@zookeeper_process@1980:
Queueing asynchronous response
2011-12-16 17:01:54,727:7696(0x7f824674b700):ZOO_DEBUG@process_completions@1795:
Calling COMPLETION_STRINGLIST for xid=0x4eeb07cc rc=0
I1216 17:01:54.727489  7699 detector.cpp:562] Master detector found 1
registered masters
2011-12-16 17:01:54,727:7696(0x7f824874f700):ZOO_DEBUG@zoo_awget@2400:
Sending request xid=0x4eeb07cd for path [/mesos_master/0000000344] to
192.168.1.33:2181
2011-12-16 17:01:54,728:7696(0x7f8246f4c700):ZOO_DEBUG@zookeeper_process@1980:
Queueing asynchronous response
2011-12-16 17:01:54,728:7696(0x7f824674b700):ZOO_DEBUG@process_completions@1772:
Calling COMPLETION_DATA for xid=0x4eeb07cd rc=0
I1216 17:01:54.728397  7699 detector.cpp:592] Master detector got new
master pid: master@192.168.1.13:5050
I1216 17:01:54.728489  7699 master.cpp:467] Still acting as master!

Our Mesos is lastest master branch at github, ZooKeeper is 3.3.3, but
the client is the included one, which is 3.3.1,
is it the reason ?

One week before, we use Mesos 0.3-alpha, it's more stable with
ZooKeeper than now.

Any suggestion are welcomed, thank you.

-- 
 - Davies

Re: Mesos master suicide after ZooKeeper session expired

Posted by Davies Liu <da...@gmail.com>.
I have found the reason causing zookeeper session expiration,
the logs/ located at a very slow disk, logging operation block the
master process to send ping to zookeeper.
then we move the logs/ into /dev/shm to solve the problem.

On Sun, Dec 18, 2011 at 10:01 AM, Davies Liu <da...@gmail.com> wrote:
> Thank you for the reply.
>
> We run master under deamon-tools, so it can restart after shutdown.
> When it restart, some msg from master or slaves may be lost,
> then some not finished application will hang, waiting for tasks to
> finish for ever.
>
> If the master do not restart so many times, the apps will run more stable.
>
> Why the session expiration occurs in ZooKeeper ? It was expected or not ?
>
> Davies
>
> On Sun, Dec 18, 2011 at 2:01 AM, Benjamin Hindman
> <be...@eecs.berkeley.edu> wrote:
>> Currently, the master is programmed to shutdown if a session expiration
>> occurs. This could be considered a bug, because when only using one master
>> once ZooKeeper has sorted itself out that master will obviously get
>> re-elected (translation: I'd be happy if you filed a bug). In the mean
>> time, most HA deployments of Mesos have used multiple masters (either on
>> separate machines or not). When one master kills itself another takes over
>> and the cluster continues to operate as normal.
>>
>> Hope that helps!
>>
>>
>> On Fri, Dec 16, 2011 at 2:03 AM, Davies Liu <da...@gmail.com> wrote:
>>
>>> Hi devs:
>>>
>>> Our Mesos master suicide after ZooKeeper session expired several times
>>> in a day with the following msg:
>>>
>>>
>>> 2011-12-16 17:06:35,594:7696(0x7f824874f700):ZOO_DEBUG@zoo_awget_children_
>>> @2612:
>>> Sending request xid=0x4eeb07ce for path [/mesos_master] to
>>> 192.168.1.32:2181
>>> 2011-12-16
>>> 17:06:35,594:7696(0x7f8246f4c700):ZOO_ERROR@handle_socket_error_msg@1621:
>>> Socket [192.168.1.32:2181] zk retcode=-112, errno=116(Stale NFS file
>>> handle): sessionId=0x333cf926ee322ef has expired.
>>> 2011-12-16 17:06:35,594:7696(0x7f8246f4c700):ZOO_DEBUG@handle_error@1138:
>>> Calling a watcher for a ZOO_SESSION_EVENT and the
>>> state=ZOO_EXPIRED_SESSION_STATE
>>> 2011-12-16 17:06:35,594:7696(0x7f8246f4c700):ZOO_DEBUG@do_io@317: IO
>>> thread terminated
>>> 2011-12-16 17:06:35,594:7696(0x7f824674b700):ZOO_DEBUG@process_completions
>>> @1765:
>>> Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
>>> 2011-12-16 17:06:35,594:7696(0x7f824674b700):ZOO_DEBUG@process_completions
>>> @1795:
>>> Calling COMPLETION_STRINGLIST for xid=0x4eeb07ce rc=-112
>>> E1216 17:06:35.594744  7699 detector.cpp:559] Master detector failed
>>> to get masters: session expired
>>> I1216 17:06:48.371737  7699 detector.cpp:457] Master detector lost
>>> connection to ZooKeeper, attempting to reconnect ...
>>> W1216 17:06:48.371817  7699 detector.cpp:500] Master detector
>>> ZooKeeper session expired!
>>> 2011-12-16 17:06:48,371:7696(0x7f824674b700):ZOO_DEBUG@do_completion@336:
>>> completion thread terminated
>>> 2011-12-16 17:06:48,371:7696(0x7f824874f700):ZOO_INFO@zookeeper_close
>>> @2307:
>>> Freeing zookeeper resources for sessionId=0x333cf926ee322ef
>>> F1216 17:06:48.374893  7699 master.cpp:475] No longer elected master
>>> ... committing suicide!
>>> *** Check failure stack trace: ***
>>>    @           0x5aa61d  google::LogMessage::Fail()
>>>    @           0x5ace2f  google::LogMessage::SendToLog()
>>>    @           0x5aa206  google::LogMessage::Flush()
>>>    @           0x5ad6ad  google::LogMessageFatal::~LogMessageFatal()
>>>    @           0x439951
>>>  mesos::internal::master::Master::noMasterDetected()
>>>    @           0x440158  mesos::internal::master::Master::operator()()
>>>    @           0x5bfc39  process::ProcessManager::run()
>>>    @           0x5bfd8a  process::trampoline()
>>>    @     0x7f8248791160  (unknown)
>>>
>>>
>>>
>>>
>>>
>>> At some times, it can reconnect to ZooKeeper without suicide:
>>>
>>>
>>> 2011-12-16 17:01:28,580:7696(0x7f824674b700):ZOO_WARN@zookeeper_interest
>>> @1461:
>>> Exceeded deadline by 13ms
>>> 2011-12-16 17:01:28,581:7696(0x7f824674b700):ZOO_DEBUG@zookeeper_process
>>> @1974:
>>> Got ping response in 0 ms
>>> 2011-12-16 17:01:31,921:7696(0x7f824674b700):ZOO_DEBUG@zookeeper_process
>>> @1974:
>>> Got ping response in 0 ms
>>> 2011-12-16 17:01:54,649:7696(0x7f824674b700):ZOO_WARN@zookeeper_interest
>>> @1461:
>>> Exceeded deadline by 19401ms
>>> 2011-12-16
>>> 17:01:54,649:7696(0x7f824674b700):ZOO_ERROR@handle_socket_error_msg@1528:
>>> Socket [192.168.1.33:2181] zk retcode=-7, errno=110(Connection timed
>>> out): connection timed out (exceeded timeout by 16062ms)
>>> 2011-12-16 17:01:54,649:7696(0x7f824674b700):ZOO_DEBUG@handle_error@1141:
>>> Calling a watcher for a ZOO_SESSION_EVENT and the
>>> state=CONNECTING_STATE
>>> 2011-12-16 17:01:54,649:7696(0x7f824674b700):ZOO_WARN@zookeeper_interest
>>> @1461:
>>> Exceeded deadline by 19401ms
>>> 2011-12-16 17:01:54,649:7696(0x7f8246f4c700):ZOO_DEBUG@process_completions
>>> @1765:
>>> Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
>>> I1216 17:01:54.657625  7699 detector.cpp:457] Master detector lost
>>> connection to ZooKeeper, attempting to reconnect ...
>>> 2011-12-16 17:01:54,660:7696(0x7f824674b700):ZOO_INFO@check_events@1585:
>>> initiated connection to server [192.168.1.34:2181]
>>> 2011-12-16
>>> 17:01:54,662:7696(0x7f824674b700):ZOO_ERROR@handle_socket_error_msg@1621:
>>> Socket [192.168.1.34:2181] zk retcode=-112, errno=116(Stale NFS file
>>> handle): sessionId=0x333cf926ee322ee has expired.
>>> 2011-12-16 17:01:54,662:7696(0x7f824674b700):ZOO_DEBUG@handle_error@1138:
>>> Calling a watcher for a ZOO_SESSION_EVENT and the
>>> state=ZOO_EXPIRED_SESSION_STATE
>>> 2011-12-16 17:01:54,662:7696(0x7f824674b700):ZOO_DEBUG@do_io@317: IO
>>> thread terminated
>>> 2011-12-16 17:01:54,662:7696(0x7f8246f4c700):ZOO_DEBUG@process_completions
>>> @1765:
>>> Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
>>> W1216 17:01:54.662583  7699 detector.cpp:500] Master detector
>>> ZooKeeper session expired!
>>> 2011-12-16 17:01:54,662:7696(0x7f8246f4c700):ZOO_DEBUG@do_completion@336:
>>> completion thread terminated
>>> 2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@zookeeper_close
>>> @2307:
>>> Freeing zookeeper resources for sessionId=0x333cf926ee322ee
>>>
>>> 2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@log_env@658:
>>> Client environment:zookeeper.version=zookeeper C client 3.3.1
>>> 2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@log_env@662:
>>> Client environment:host.name=boromir
>>> 2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@log_env@669:
>>> Client environment:os.name=Linux
>>> 2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@log_env@670:
>>> Client environment:os.arch=2.6.39-gentoo-r3
>>> 2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@log_env@671:
>>> Client environment:os.version=#4 SMP Mon Sep 19 17:12:59 CST 2011
>>> 2011-12-16 17:01:54,663:7696(0x7f824874f700):ZOO_INFO@log_env@679:
>>> Client environment:user.name=(null)
>>> 2011-12-16 17:01:54,663:7696(0x7f824874f700):ZOO_INFO@log_env@687:
>>> Client environment:user.home=/root
>>> 2011-12-16 17:01:54,663:7696(0x7f824874f700):ZOO_INFO@log_env@699:
>>> Client environment:user.dir=/opt/mesos/bin
>>> 2011-12-16 17:01:54,663:7696(0x7f824874f700):ZOO_INFO@zookeeper_init@727:
>>> Initiating client connection,
>>> host=zk1:2181,zk2:2181,zk3:2181,zk4:2181,zk5:2181 sessionTimeout=10000
>>> watcher=0x4cb750 sessionId=0 sessionPasswd=<null>
>>> context=0x7f823803c380 flags=0
>>> 2011-12-16 17:01:54,663:7696(0x7f824874f700):ZOO_DEBUG@start_threads@152:
>>> starting threads...
>>> 2011-12-16 17:01:54,680:7696(0x7f824674b700):ZOO_DEBUG@do_completion@326:
>>> started completion thread
>>> 2011-12-16 17:01:54,680:7696(0x7f8246f4c700):ZOO_DEBUG@do_io@279:
>>> started IO thread
>>> 2011-12-16 17:01:54,680:7696(0x7f8246f4c700):ZOO_INFO@check_events@1585:
>>> initiated connection to server [192.168.1.33:2181]
>>> 2011-12-16 17:01:54,719:7696(0x7f8246f4c700):ZOO_INFO@check_events@1632:
>>> session establishment complete on server [192.168.1.33:2181],
>>> sessionId=0x333cf926ee322ef, negotiated timeout=10000
>>> 2011-12-16 17:01:54,719:7696(0x7f8246f4c700):ZOO_DEBUG@check_events@1638:
>>> Calling a watcher for a ZOO_SESSION_EVENT and the
>>> state=ZOO_CONNECTED_STATE
>>> 2011-12-16 17:01:54,719:7696(0x7f824674b700):ZOO_DEBUG@process_completions
>>> @1765:
>>> Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
>>> I1216 17:01:54.719355  7699 detector.cpp:375] Master detector
>>> connected to ZooKeeper ...
>>> I1216 17:01:54.719421  7699 detector.cpp:404] Trying to create znode
>>> '/mesos_master' in ZooKeeper
>>> 2011-12-16 17:01:54,719:7696(0x7f824874f700):ZOO_DEBUG@zoo_acreate@2489:
>>> Sending request xid=0x4eeb07c9 for path [/mesos_master] to
>>> 192.168.1.33:2181
>>> 2011-12-16 17:01:54,723:7696(0x7f8246f4c700):ZOO_DEBUG@zookeeper_process
>>> @1980:
>>> Queueing asynchronous response
>>> 2011-12-16 17:01:54,723:7696(0x7f824674b700):ZOO_DEBUG@process_completions
>>> @1817:
>>> Calling COMPLETION_STRING for xid=0x4eeb07c9 rc=-110
>>> 2011-12-16 17:01:54,724:7696(0x7f824874f700):ZOO_DEBUG@zoo_awget@2400:
>>> Sending request xid=0x4eeb07ca for path [/mesos_master] to
>>> 192.168.1.33:2181
>>> 2011-12-16 17:01:54,724:7696(0x7f8246f4c700):ZOO_DEBUG@zookeeper_process
>>> @1980:
>>> Queueing asynchronous response
>>> 2011-12-16 17:01:54,724:7696(0x7f824674b700):ZOO_DEBUG@process_completions
>>> @1772:
>>> Calling COMPLETION_DATA for xid=0x4eeb07ca rc=0
>>> 2011-12-16 17:01:54,724:7696(0x7f824874f700):ZOO_DEBUG@zoo_acreate@2489:
>>> Sending request xid=0x4eeb07cb for path [/mesos_master/] to
>>> 192.168.1.33:2181
>>> 2011-12-16 17:01:54,726:7696(0x7f8246f4c700):ZOO_DEBUG@zookeeper_process
>>> @1980:
>>> Queueing asynchronous response
>>> 2011-12-16 17:01:54,726:7696(0x7f824674b700):ZOO_DEBUG@process_completions
>>> @1817:
>>> Calling COMPLETION_STRING for xid=0x4eeb07cb rc=0
>>> I1216 17:01:54.726915  7699 detector.cpp:443] Created
>>> ephemeral/sequence:0000000344
>>> 2011-12-16 17:01:54,727:7696(0x7f824874f700):ZOO_DEBUG@zoo_awget_children_
>>> @2612:
>>> Sending request xid=0x4eeb07cc for path [/mesos_master] to
>>> 192.168.1.33:2181
>>> W1216 17:01:54.727063  7699 master.cpp:299] Dropping unknown message
>>> 'mesos.internal.GotMasterTokenMessage' from: @0.0.0.0:0
>>> 2011-12-16 17:01:54,727:7696(0x7f8246f4c700):ZOO_DEBUG@zookeeper_process
>>> @1980:
>>> Queueing asynchronous response
>>> 2011-12-16 17:01:54,727:7696(0x7f824674b700):ZOO_DEBUG@process_completions
>>> @1795:
>>> Calling COMPLETION_STRINGLIST for xid=0x4eeb07cc rc=0
>>> I1216 17:01:54.727489  7699 detector.cpp:562] Master detector found 1
>>> registered masters
>>> 2011-12-16 17:01:54,727:7696(0x7f824874f700):ZOO_DEBUG@zoo_awget@2400:
>>> Sending request xid=0x4eeb07cd for path [/mesos_master/0000000344] to
>>> 192.168.1.33:2181
>>> 2011-12-16 17:01:54,728:7696(0x7f8246f4c700):ZOO_DEBUG@zookeeper_process
>>> @1980:
>>> Queueing asynchronous response
>>> 2011-12-16 17:01:54,728:7696(0x7f824674b700):ZOO_DEBUG@process_completions
>>> @1772:
>>> Calling COMPLETION_DATA for xid=0x4eeb07cd rc=0
>>> I1216 17:01:54.728397  7699 detector.cpp:592] Master detector got new
>>> master pid: master@192.168.1.13:5050
>>> I1216 17:01:54.728489  7699 master.cpp:467] Still acting as master!
>>>
>>> Our Mesos is lastest master branch at github, ZooKeeper is 3.3.3, but
>>> the client is the included one, which is 3.3.1,
>>> is it the reason ?
>>>
>>> One week before, we use Mesos 0.3-alpha, it's more stable with
>>> ZooKeeper than now.
>>>
>>> Any suggestion are welcomed, thank you.
>>>
>>> --
>>>  - Davies
>>>
>
>
>
> --
>  - Davies



-- 
 - Davies

Re: Mesos master suicide after ZooKeeper session expired

Posted by Davies Liu <da...@gmail.com>.
Thank you for the reply.

We run master under deamon-tools, so it can restart after shutdown.
When it restart, some msg from master or slaves may be lost,
then some not finished application will hang, waiting for tasks to
finish for ever.

If the master do not restart so many times, the apps will run more stable.

Why the session expiration occurs in ZooKeeper ? It was expected or not ?

Davies

On Sun, Dec 18, 2011 at 2:01 AM, Benjamin Hindman
<be...@eecs.berkeley.edu> wrote:
> Currently, the master is programmed to shutdown if a session expiration
> occurs. This could be considered a bug, because when only using one master
> once ZooKeeper has sorted itself out that master will obviously get
> re-elected (translation: I'd be happy if you filed a bug). In the mean
> time, most HA deployments of Mesos have used multiple masters (either on
> separate machines or not). When one master kills itself another takes over
> and the cluster continues to operate as normal.
>
> Hope that helps!
>
>
> On Fri, Dec 16, 2011 at 2:03 AM, Davies Liu <da...@gmail.com> wrote:
>
>> Hi devs:
>>
>> Our Mesos master suicide after ZooKeeper session expired several times
>> in a day with the following msg:
>>
>>
>> 2011-12-16 17:06:35,594:7696(0x7f824874f700):ZOO_DEBUG@zoo_awget_children_
>> @2612:
>> Sending request xid=0x4eeb07ce for path [/mesos_master] to
>> 192.168.1.32:2181
>> 2011-12-16
>> 17:06:35,594:7696(0x7f8246f4c700):ZOO_ERROR@handle_socket_error_msg@1621:
>> Socket [192.168.1.32:2181] zk retcode=-112, errno=116(Stale NFS file
>> handle): sessionId=0x333cf926ee322ef has expired.
>> 2011-12-16 17:06:35,594:7696(0x7f8246f4c700):ZOO_DEBUG@handle_error@1138:
>> Calling a watcher for a ZOO_SESSION_EVENT and the
>> state=ZOO_EXPIRED_SESSION_STATE
>> 2011-12-16 17:06:35,594:7696(0x7f8246f4c700):ZOO_DEBUG@do_io@317: IO
>> thread terminated
>> 2011-12-16 17:06:35,594:7696(0x7f824674b700):ZOO_DEBUG@process_completions
>> @1765:
>> Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
>> 2011-12-16 17:06:35,594:7696(0x7f824674b700):ZOO_DEBUG@process_completions
>> @1795:
>> Calling COMPLETION_STRINGLIST for xid=0x4eeb07ce rc=-112
>> E1216 17:06:35.594744  7699 detector.cpp:559] Master detector failed
>> to get masters: session expired
>> I1216 17:06:48.371737  7699 detector.cpp:457] Master detector lost
>> connection to ZooKeeper, attempting to reconnect ...
>> W1216 17:06:48.371817  7699 detector.cpp:500] Master detector
>> ZooKeeper session expired!
>> 2011-12-16 17:06:48,371:7696(0x7f824674b700):ZOO_DEBUG@do_completion@336:
>> completion thread terminated
>> 2011-12-16 17:06:48,371:7696(0x7f824874f700):ZOO_INFO@zookeeper_close
>> @2307:
>> Freeing zookeeper resources for sessionId=0x333cf926ee322ef
>> F1216 17:06:48.374893  7699 master.cpp:475] No longer elected master
>> ... committing suicide!
>> *** Check failure stack trace: ***
>>    @           0x5aa61d  google::LogMessage::Fail()
>>    @           0x5ace2f  google::LogMessage::SendToLog()
>>    @           0x5aa206  google::LogMessage::Flush()
>>    @           0x5ad6ad  google::LogMessageFatal::~LogMessageFatal()
>>    @           0x439951
>>  mesos::internal::master::Master::noMasterDetected()
>>    @           0x440158  mesos::internal::master::Master::operator()()
>>    @           0x5bfc39  process::ProcessManager::run()
>>    @           0x5bfd8a  process::trampoline()
>>    @     0x7f8248791160  (unknown)
>>
>>
>>
>>
>>
>> At some times, it can reconnect to ZooKeeper without suicide:
>>
>>
>> 2011-12-16 17:01:28,580:7696(0x7f824674b700):ZOO_WARN@zookeeper_interest
>> @1461:
>> Exceeded deadline by 13ms
>> 2011-12-16 17:01:28,581:7696(0x7f824674b700):ZOO_DEBUG@zookeeper_process
>> @1974:
>> Got ping response in 0 ms
>> 2011-12-16 17:01:31,921:7696(0x7f824674b700):ZOO_DEBUG@zookeeper_process
>> @1974:
>> Got ping response in 0 ms
>> 2011-12-16 17:01:54,649:7696(0x7f824674b700):ZOO_WARN@zookeeper_interest
>> @1461:
>> Exceeded deadline by 19401ms
>> 2011-12-16
>> 17:01:54,649:7696(0x7f824674b700):ZOO_ERROR@handle_socket_error_msg@1528:
>> Socket [192.168.1.33:2181] zk retcode=-7, errno=110(Connection timed
>> out): connection timed out (exceeded timeout by 16062ms)
>> 2011-12-16 17:01:54,649:7696(0x7f824674b700):ZOO_DEBUG@handle_error@1141:
>> Calling a watcher for a ZOO_SESSION_EVENT and the
>> state=CONNECTING_STATE
>> 2011-12-16 17:01:54,649:7696(0x7f824674b700):ZOO_WARN@zookeeper_interest
>> @1461:
>> Exceeded deadline by 19401ms
>> 2011-12-16 17:01:54,649:7696(0x7f8246f4c700):ZOO_DEBUG@process_completions
>> @1765:
>> Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
>> I1216 17:01:54.657625  7699 detector.cpp:457] Master detector lost
>> connection to ZooKeeper, attempting to reconnect ...
>> 2011-12-16 17:01:54,660:7696(0x7f824674b700):ZOO_INFO@check_events@1585:
>> initiated connection to server [192.168.1.34:2181]
>> 2011-12-16
>> 17:01:54,662:7696(0x7f824674b700):ZOO_ERROR@handle_socket_error_msg@1621:
>> Socket [192.168.1.34:2181] zk retcode=-112, errno=116(Stale NFS file
>> handle): sessionId=0x333cf926ee322ee has expired.
>> 2011-12-16 17:01:54,662:7696(0x7f824674b700):ZOO_DEBUG@handle_error@1138:
>> Calling a watcher for a ZOO_SESSION_EVENT and the
>> state=ZOO_EXPIRED_SESSION_STATE
>> 2011-12-16 17:01:54,662:7696(0x7f824674b700):ZOO_DEBUG@do_io@317: IO
>> thread terminated
>> 2011-12-16 17:01:54,662:7696(0x7f8246f4c700):ZOO_DEBUG@process_completions
>> @1765:
>> Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
>> W1216 17:01:54.662583  7699 detector.cpp:500] Master detector
>> ZooKeeper session expired!
>> 2011-12-16 17:01:54,662:7696(0x7f8246f4c700):ZOO_DEBUG@do_completion@336:
>> completion thread terminated
>> 2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@zookeeper_close
>> @2307:
>> Freeing zookeeper resources for sessionId=0x333cf926ee322ee
>>
>> 2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@log_env@658:
>> Client environment:zookeeper.version=zookeeper C client 3.3.1
>> 2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@log_env@662:
>> Client environment:host.name=boromir
>> 2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@log_env@669:
>> Client environment:os.name=Linux
>> 2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@log_env@670:
>> Client environment:os.arch=2.6.39-gentoo-r3
>> 2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@log_env@671:
>> Client environment:os.version=#4 SMP Mon Sep 19 17:12:59 CST 2011
>> 2011-12-16 17:01:54,663:7696(0x7f824874f700):ZOO_INFO@log_env@679:
>> Client environment:user.name=(null)
>> 2011-12-16 17:01:54,663:7696(0x7f824874f700):ZOO_INFO@log_env@687:
>> Client environment:user.home=/root
>> 2011-12-16 17:01:54,663:7696(0x7f824874f700):ZOO_INFO@log_env@699:
>> Client environment:user.dir=/opt/mesos/bin
>> 2011-12-16 17:01:54,663:7696(0x7f824874f700):ZOO_INFO@zookeeper_init@727:
>> Initiating client connection,
>> host=zk1:2181,zk2:2181,zk3:2181,zk4:2181,zk5:2181 sessionTimeout=10000
>> watcher=0x4cb750 sessionId=0 sessionPasswd=<null>
>> context=0x7f823803c380 flags=0
>> 2011-12-16 17:01:54,663:7696(0x7f824874f700):ZOO_DEBUG@start_threads@152:
>> starting threads...
>> 2011-12-16 17:01:54,680:7696(0x7f824674b700):ZOO_DEBUG@do_completion@326:
>> started completion thread
>> 2011-12-16 17:01:54,680:7696(0x7f8246f4c700):ZOO_DEBUG@do_io@279:
>> started IO thread
>> 2011-12-16 17:01:54,680:7696(0x7f8246f4c700):ZOO_INFO@check_events@1585:
>> initiated connection to server [192.168.1.33:2181]
>> 2011-12-16 17:01:54,719:7696(0x7f8246f4c700):ZOO_INFO@check_events@1632:
>> session establishment complete on server [192.168.1.33:2181],
>> sessionId=0x333cf926ee322ef, negotiated timeout=10000
>> 2011-12-16 17:01:54,719:7696(0x7f8246f4c700):ZOO_DEBUG@check_events@1638:
>> Calling a watcher for a ZOO_SESSION_EVENT and the
>> state=ZOO_CONNECTED_STATE
>> 2011-12-16 17:01:54,719:7696(0x7f824674b700):ZOO_DEBUG@process_completions
>> @1765:
>> Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
>> I1216 17:01:54.719355  7699 detector.cpp:375] Master detector
>> connected to ZooKeeper ...
>> I1216 17:01:54.719421  7699 detector.cpp:404] Trying to create znode
>> '/mesos_master' in ZooKeeper
>> 2011-12-16 17:01:54,719:7696(0x7f824874f700):ZOO_DEBUG@zoo_acreate@2489:
>> Sending request xid=0x4eeb07c9 for path [/mesos_master] to
>> 192.168.1.33:2181
>> 2011-12-16 17:01:54,723:7696(0x7f8246f4c700):ZOO_DEBUG@zookeeper_process
>> @1980:
>> Queueing asynchronous response
>> 2011-12-16 17:01:54,723:7696(0x7f824674b700):ZOO_DEBUG@process_completions
>> @1817:
>> Calling COMPLETION_STRING for xid=0x4eeb07c9 rc=-110
>> 2011-12-16 17:01:54,724:7696(0x7f824874f700):ZOO_DEBUG@zoo_awget@2400:
>> Sending request xid=0x4eeb07ca for path [/mesos_master] to
>> 192.168.1.33:2181
>> 2011-12-16 17:01:54,724:7696(0x7f8246f4c700):ZOO_DEBUG@zookeeper_process
>> @1980:
>> Queueing asynchronous response
>> 2011-12-16 17:01:54,724:7696(0x7f824674b700):ZOO_DEBUG@process_completions
>> @1772:
>> Calling COMPLETION_DATA for xid=0x4eeb07ca rc=0
>> 2011-12-16 17:01:54,724:7696(0x7f824874f700):ZOO_DEBUG@zoo_acreate@2489:
>> Sending request xid=0x4eeb07cb for path [/mesos_master/] to
>> 192.168.1.33:2181
>> 2011-12-16 17:01:54,726:7696(0x7f8246f4c700):ZOO_DEBUG@zookeeper_process
>> @1980:
>> Queueing asynchronous response
>> 2011-12-16 17:01:54,726:7696(0x7f824674b700):ZOO_DEBUG@process_completions
>> @1817:
>> Calling COMPLETION_STRING for xid=0x4eeb07cb rc=0
>> I1216 17:01:54.726915  7699 detector.cpp:443] Created
>> ephemeral/sequence:0000000344
>> 2011-12-16 17:01:54,727:7696(0x7f824874f700):ZOO_DEBUG@zoo_awget_children_
>> @2612:
>> Sending request xid=0x4eeb07cc for path [/mesos_master] to
>> 192.168.1.33:2181
>> W1216 17:01:54.727063  7699 master.cpp:299] Dropping unknown message
>> 'mesos.internal.GotMasterTokenMessage' from: @0.0.0.0:0
>> 2011-12-16 17:01:54,727:7696(0x7f8246f4c700):ZOO_DEBUG@zookeeper_process
>> @1980:
>> Queueing asynchronous response
>> 2011-12-16 17:01:54,727:7696(0x7f824674b700):ZOO_DEBUG@process_completions
>> @1795:
>> Calling COMPLETION_STRINGLIST for xid=0x4eeb07cc rc=0
>> I1216 17:01:54.727489  7699 detector.cpp:562] Master detector found 1
>> registered masters
>> 2011-12-16 17:01:54,727:7696(0x7f824874f700):ZOO_DEBUG@zoo_awget@2400:
>> Sending request xid=0x4eeb07cd for path [/mesos_master/0000000344] to
>> 192.168.1.33:2181
>> 2011-12-16 17:01:54,728:7696(0x7f8246f4c700):ZOO_DEBUG@zookeeper_process
>> @1980:
>> Queueing asynchronous response
>> 2011-12-16 17:01:54,728:7696(0x7f824674b700):ZOO_DEBUG@process_completions
>> @1772:
>> Calling COMPLETION_DATA for xid=0x4eeb07cd rc=0
>> I1216 17:01:54.728397  7699 detector.cpp:592] Master detector got new
>> master pid: master@192.168.1.13:5050
>> I1216 17:01:54.728489  7699 master.cpp:467] Still acting as master!
>>
>> Our Mesos is lastest master branch at github, ZooKeeper is 3.3.3, but
>> the client is the included one, which is 3.3.1,
>> is it the reason ?
>>
>> One week before, we use Mesos 0.3-alpha, it's more stable with
>> ZooKeeper than now.
>>
>> Any suggestion are welcomed, thank you.
>>
>> --
>>  - Davies
>>



-- 
 - Davies

Re: Mesos master suicide after ZooKeeper session expired

Posted by Benjamin Hindman <be...@eecs.berkeley.edu>.
Currently, the master is programmed to shutdown if a session expiration
occurs. This could be considered a bug, because when only using one master
once ZooKeeper has sorted itself out that master will obviously get
re-elected (translation: I'd be happy if you filed a bug). In the mean
time, most HA deployments of Mesos have used multiple masters (either on
separate machines or not). When one master kills itself another takes over
and the cluster continues to operate as normal.

Hope that helps!


On Fri, Dec 16, 2011 at 2:03 AM, Davies Liu <da...@gmail.com> wrote:

> Hi devs:
>
> Our Mesos master suicide after ZooKeeper session expired several times
> in a day with the following msg:
>
>
> 2011-12-16 17:06:35,594:7696(0x7f824874f700):ZOO_DEBUG@zoo_awget_children_
> @2612:
> Sending request xid=0x4eeb07ce for path [/mesos_master] to
> 192.168.1.32:2181
> 2011-12-16
> 17:06:35,594:7696(0x7f8246f4c700):ZOO_ERROR@handle_socket_error_msg@1621:
> Socket [192.168.1.32:2181] zk retcode=-112, errno=116(Stale NFS file
> handle): sessionId=0x333cf926ee322ef has expired.
> 2011-12-16 17:06:35,594:7696(0x7f8246f4c700):ZOO_DEBUG@handle_error@1138:
> Calling a watcher for a ZOO_SESSION_EVENT and the
> state=ZOO_EXPIRED_SESSION_STATE
> 2011-12-16 17:06:35,594:7696(0x7f8246f4c700):ZOO_DEBUG@do_io@317: IO
> thread terminated
> 2011-12-16 17:06:35,594:7696(0x7f824674b700):ZOO_DEBUG@process_completions
> @1765:
> Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
> 2011-12-16 17:06:35,594:7696(0x7f824674b700):ZOO_DEBUG@process_completions
> @1795:
> Calling COMPLETION_STRINGLIST for xid=0x4eeb07ce rc=-112
> E1216 17:06:35.594744  7699 detector.cpp:559] Master detector failed
> to get masters: session expired
> I1216 17:06:48.371737  7699 detector.cpp:457] Master detector lost
> connection to ZooKeeper, attempting to reconnect ...
> W1216 17:06:48.371817  7699 detector.cpp:500] Master detector
> ZooKeeper session expired!
> 2011-12-16 17:06:48,371:7696(0x7f824674b700):ZOO_DEBUG@do_completion@336:
> completion thread terminated
> 2011-12-16 17:06:48,371:7696(0x7f824874f700):ZOO_INFO@zookeeper_close
> @2307:
> Freeing zookeeper resources for sessionId=0x333cf926ee322ef
> F1216 17:06:48.374893  7699 master.cpp:475] No longer elected master
> ... committing suicide!
> *** Check failure stack trace: ***
>    @           0x5aa61d  google::LogMessage::Fail()
>    @           0x5ace2f  google::LogMessage::SendToLog()
>    @           0x5aa206  google::LogMessage::Flush()
>    @           0x5ad6ad  google::LogMessageFatal::~LogMessageFatal()
>    @           0x439951
>  mesos::internal::master::Master::noMasterDetected()
>    @           0x440158  mesos::internal::master::Master::operator()()
>    @           0x5bfc39  process::ProcessManager::run()
>    @           0x5bfd8a  process::trampoline()
>    @     0x7f8248791160  (unknown)
>
>
>
>
>
> At some times, it can reconnect to ZooKeeper without suicide:
>
>
> 2011-12-16 17:01:28,580:7696(0x7f824674b700):ZOO_WARN@zookeeper_interest
> @1461:
> Exceeded deadline by 13ms
> 2011-12-16 17:01:28,581:7696(0x7f824674b700):ZOO_DEBUG@zookeeper_process
> @1974:
> Got ping response in 0 ms
> 2011-12-16 17:01:31,921:7696(0x7f824674b700):ZOO_DEBUG@zookeeper_process
> @1974:
> Got ping response in 0 ms
> 2011-12-16 17:01:54,649:7696(0x7f824674b700):ZOO_WARN@zookeeper_interest
> @1461:
> Exceeded deadline by 19401ms
> 2011-12-16
> 17:01:54,649:7696(0x7f824674b700):ZOO_ERROR@handle_socket_error_msg@1528:
> Socket [192.168.1.33:2181] zk retcode=-7, errno=110(Connection timed
> out): connection timed out (exceeded timeout by 16062ms)
> 2011-12-16 17:01:54,649:7696(0x7f824674b700):ZOO_DEBUG@handle_error@1141:
> Calling a watcher for a ZOO_SESSION_EVENT and the
> state=CONNECTING_STATE
> 2011-12-16 17:01:54,649:7696(0x7f824674b700):ZOO_WARN@zookeeper_interest
> @1461:
> Exceeded deadline by 19401ms
> 2011-12-16 17:01:54,649:7696(0x7f8246f4c700):ZOO_DEBUG@process_completions
> @1765:
> Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
> I1216 17:01:54.657625  7699 detector.cpp:457] Master detector lost
> connection to ZooKeeper, attempting to reconnect ...
> 2011-12-16 17:01:54,660:7696(0x7f824674b700):ZOO_INFO@check_events@1585:
> initiated connection to server [192.168.1.34:2181]
> 2011-12-16
> 17:01:54,662:7696(0x7f824674b700):ZOO_ERROR@handle_socket_error_msg@1621:
> Socket [192.168.1.34:2181] zk retcode=-112, errno=116(Stale NFS file
> handle): sessionId=0x333cf926ee322ee has expired.
> 2011-12-16 17:01:54,662:7696(0x7f824674b700):ZOO_DEBUG@handle_error@1138:
> Calling a watcher for a ZOO_SESSION_EVENT and the
> state=ZOO_EXPIRED_SESSION_STATE
> 2011-12-16 17:01:54,662:7696(0x7f824674b700):ZOO_DEBUG@do_io@317: IO
> thread terminated
> 2011-12-16 17:01:54,662:7696(0x7f8246f4c700):ZOO_DEBUG@process_completions
> @1765:
> Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
> W1216 17:01:54.662583  7699 detector.cpp:500] Master detector
> ZooKeeper session expired!
> 2011-12-16 17:01:54,662:7696(0x7f8246f4c700):ZOO_DEBUG@do_completion@336:
> completion thread terminated
> 2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@zookeeper_close
> @2307:
> Freeing zookeeper resources for sessionId=0x333cf926ee322ee
>
> 2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@log_env@658:
> Client environment:zookeeper.version=zookeeper C client 3.3.1
> 2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@log_env@662:
> Client environment:host.name=boromir
> 2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@log_env@669:
> Client environment:os.name=Linux
> 2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@log_env@670:
> Client environment:os.arch=2.6.39-gentoo-r3
> 2011-12-16 17:01:54,662:7696(0x7f824874f700):ZOO_INFO@log_env@671:
> Client environment:os.version=#4 SMP Mon Sep 19 17:12:59 CST 2011
> 2011-12-16 17:01:54,663:7696(0x7f824874f700):ZOO_INFO@log_env@679:
> Client environment:user.name=(null)
> 2011-12-16 17:01:54,663:7696(0x7f824874f700):ZOO_INFO@log_env@687:
> Client environment:user.home=/root
> 2011-12-16 17:01:54,663:7696(0x7f824874f700):ZOO_INFO@log_env@699:
> Client environment:user.dir=/opt/mesos/bin
> 2011-12-16 17:01:54,663:7696(0x7f824874f700):ZOO_INFO@zookeeper_init@727:
> Initiating client connection,
> host=zk1:2181,zk2:2181,zk3:2181,zk4:2181,zk5:2181 sessionTimeout=10000
> watcher=0x4cb750 sessionId=0 sessionPasswd=<null>
> context=0x7f823803c380 flags=0
> 2011-12-16 17:01:54,663:7696(0x7f824874f700):ZOO_DEBUG@start_threads@152:
> starting threads...
> 2011-12-16 17:01:54,680:7696(0x7f824674b700):ZOO_DEBUG@do_completion@326:
> started completion thread
> 2011-12-16 17:01:54,680:7696(0x7f8246f4c700):ZOO_DEBUG@do_io@279:
> started IO thread
> 2011-12-16 17:01:54,680:7696(0x7f8246f4c700):ZOO_INFO@check_events@1585:
> initiated connection to server [192.168.1.33:2181]
> 2011-12-16 17:01:54,719:7696(0x7f8246f4c700):ZOO_INFO@check_events@1632:
> session establishment complete on server [192.168.1.33:2181],
> sessionId=0x333cf926ee322ef, negotiated timeout=10000
> 2011-12-16 17:01:54,719:7696(0x7f8246f4c700):ZOO_DEBUG@check_events@1638:
> Calling a watcher for a ZOO_SESSION_EVENT and the
> state=ZOO_CONNECTED_STATE
> 2011-12-16 17:01:54,719:7696(0x7f824674b700):ZOO_DEBUG@process_completions
> @1765:
> Calling a watcher for node [], type = -1 event=ZOO_SESSION_EVENT
> I1216 17:01:54.719355  7699 detector.cpp:375] Master detector
> connected to ZooKeeper ...
> I1216 17:01:54.719421  7699 detector.cpp:404] Trying to create znode
> '/mesos_master' in ZooKeeper
> 2011-12-16 17:01:54,719:7696(0x7f824874f700):ZOO_DEBUG@zoo_acreate@2489:
> Sending request xid=0x4eeb07c9 for path [/mesos_master] to
> 192.168.1.33:2181
> 2011-12-16 17:01:54,723:7696(0x7f8246f4c700):ZOO_DEBUG@zookeeper_process
> @1980:
> Queueing asynchronous response
> 2011-12-16 17:01:54,723:7696(0x7f824674b700):ZOO_DEBUG@process_completions
> @1817:
> Calling COMPLETION_STRING for xid=0x4eeb07c9 rc=-110
> 2011-12-16 17:01:54,724:7696(0x7f824874f700):ZOO_DEBUG@zoo_awget@2400:
> Sending request xid=0x4eeb07ca for path [/mesos_master] to
> 192.168.1.33:2181
> 2011-12-16 17:01:54,724:7696(0x7f8246f4c700):ZOO_DEBUG@zookeeper_process
> @1980:
> Queueing asynchronous response
> 2011-12-16 17:01:54,724:7696(0x7f824674b700):ZOO_DEBUG@process_completions
> @1772:
> Calling COMPLETION_DATA for xid=0x4eeb07ca rc=0
> 2011-12-16 17:01:54,724:7696(0x7f824874f700):ZOO_DEBUG@zoo_acreate@2489:
> Sending request xid=0x4eeb07cb for path [/mesos_master/] to
> 192.168.1.33:2181
> 2011-12-16 17:01:54,726:7696(0x7f8246f4c700):ZOO_DEBUG@zookeeper_process
> @1980:
> Queueing asynchronous response
> 2011-12-16 17:01:54,726:7696(0x7f824674b700):ZOO_DEBUG@process_completions
> @1817:
> Calling COMPLETION_STRING for xid=0x4eeb07cb rc=0
> I1216 17:01:54.726915  7699 detector.cpp:443] Created
> ephemeral/sequence:0000000344
> 2011-12-16 17:01:54,727:7696(0x7f824874f700):ZOO_DEBUG@zoo_awget_children_
> @2612:
> Sending request xid=0x4eeb07cc for path [/mesos_master] to
> 192.168.1.33:2181
> W1216 17:01:54.727063  7699 master.cpp:299] Dropping unknown message
> 'mesos.internal.GotMasterTokenMessage' from: @0.0.0.0:0
> 2011-12-16 17:01:54,727:7696(0x7f8246f4c700):ZOO_DEBUG@zookeeper_process
> @1980:
> Queueing asynchronous response
> 2011-12-16 17:01:54,727:7696(0x7f824674b700):ZOO_DEBUG@process_completions
> @1795:
> Calling COMPLETION_STRINGLIST for xid=0x4eeb07cc rc=0
> I1216 17:01:54.727489  7699 detector.cpp:562] Master detector found 1
> registered masters
> 2011-12-16 17:01:54,727:7696(0x7f824874f700):ZOO_DEBUG@zoo_awget@2400:
> Sending request xid=0x4eeb07cd for path [/mesos_master/0000000344] to
> 192.168.1.33:2181
> 2011-12-16 17:01:54,728:7696(0x7f8246f4c700):ZOO_DEBUG@zookeeper_process
> @1980:
> Queueing asynchronous response
> 2011-12-16 17:01:54,728:7696(0x7f824674b700):ZOO_DEBUG@process_completions
> @1772:
> Calling COMPLETION_DATA for xid=0x4eeb07cd rc=0
> I1216 17:01:54.728397  7699 detector.cpp:592] Master detector got new
> master pid: master@192.168.1.13:5050
> I1216 17:01:54.728489  7699 master.cpp:467] Still acting as master!
>
> Our Mesos is lastest master branch at github, ZooKeeper is 3.3.3, but
> the client is the included one, which is 3.3.1,
> is it the reason ?
>
> One week before, we use Mesos 0.3-alpha, it's more stable with
> ZooKeeper than now.
>
> Any suggestion are welcomed, thank you.
>
> --
>  - Davies
>