Posted to issues@mesos.apache.org by "Benjamin Mahler (JIRA)" <ji...@apache.org> on 2015/02/10 03:43:36 UTC

[jira] [Closed] (MESOS-2329) Mesos master crashes after ZooKeeper session expires

     [ https://issues.apache.org/jira/browse/MESOS-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benjamin Mahler closed MESOS-2329.
----------------------------------
    Resolution: Not a Problem

Yes, ideally the master would transition out of leadership without exiting; however, this is non-trivial to implement. Also, a requirement for HA is that the master must be restarted if it exits or crashes for any reason. We lean on this requirement to get away with just exiting when losing leadership (as it's a trivial implementation of leadership loss :)).
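
To illustrate the restart requirement described above, here is a minimal sketch of a supervisor loop that restarts a process whenever it exits. It is not part of Mesos; the function name `supervise` and its parameters are hypothetical, and the loop is bounded by `max_restarts` only to keep the example finite. In practice a process supervisor such as an init system would typically fill this role.

```python
import subprocess
import time


def supervise(cmd, max_restarts=3, backoff_seconds=1.0):
    """Run cmd, restarting it each time it exits, up to max_restarts restarts.

    Hypothetical helper for illustration only. Returns the list of exit
    codes observed, one per run.
    """
    exit_codes = []
    for attempt in range(max_restarts + 1):
        # Block until the child exits, for whatever reason: a crash, an
        # abort, or a deliberate exit such as on leadership loss.
        result = subprocess.run(cmd)
        exit_codes.append(result.returncode)
        if attempt < max_restarts:
            # Pause briefly so a persistently failing process does not spin.
            time.sleep(backoff_seconds)
    return exit_codes
```

Called with whatever command line starts the master in a given deployment, the helper would bring the process back after each exit, so a master that exits on leadership loss can rejoin the election once its ZooKeeper connection is healthy again.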

Closing this. Please feel free to open an improvement ticket for the master to transition out of leadership without exiting! Also, if there's anywhere that the documentation could be improved here, please let us know. :)

> Mesos master crashes after ZooKeeper session expires
> ----------------------------------------------------
>
>                 Key: MESOS-2329
>                 URL: https://issues.apache.org/jira/browse/MESOS-2329
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.21.1
>         Environment: CentOS 6.5 (kernel 2.6.32-431), Java 1.7.0_55, ZooKeeper 3.4.5
>            Reporter: Craig W
>
> In a test environment I have experienced an issue where the Mesos Master process crashes after its ZooKeeper session expires. The last few messages in the INFO log file look like this:
> {noformat}
> group.cpp:418] Lost connection to ZooKeeper, attempting to reconnect ...
> group.cpp:418] Lost connection to ZooKeeper, attempting to reconnect ...
> group.cpp:313] Group process (group(4)@192.168.1.4:5050) reconnected to ZooKeeper
> group.cpp:418] Lost connection to ZooKeeper, attempting to reconnect ...
> group.cpp:790] Syncing group operations: queue size (joins, cancels datas) = (0, 0, 0)
> group.cpp:418] Lost connection to ZooKeeper, attempting to reconnect ...
> group.cpp:472] ZooKeeper session expired
> detector.cpp:138] Detected a new leader: None
> master.cpp:1263] The newly elected leader is None
> {noformat}
> In my environment, I had a single master, 7 slaves, and a single-node ZooKeeper ensemble.
> Restarting the master process "fixes" the issue.


