Posted to user@mesos.apache.org by Frederic LE BRIS <fl...@pagesjaunes.fr> on 2015/12/02 15:01:28 UTC

Sync Mesos-Master to Slaves

Hi,

I manage a Mesos 0.23.0 cluster based on the .deb packages from Mesosphere on Ubuntu 14.04.

We deployed 3 ZooKeepers, 3 Mesos masters, and 3 Marathons in HA mode,

and deployed 6 mesos-slaves, plus a slave process on the 3 Mesos masters.

So I have the following topology:

3 servers : Mesos-master / Marathon
3 servers : Zookeeper / mesos-slaves.
3 servers : mesos-slave

I followed the HA configuration for Mesos-master and Marathon.

The problem is that when I kill the leading mesos-master, we lose the existing tasks on the slaves, and the available resources stay locked by the slave even though the master sees no activity on that slave.

My Mesos cluster is in production, so I'm not able to restart from scratch; I'm looking for a procedure to re-synchronise the cluster.

I'm also looking for a way to check that my mesos-masters are working together correctly, with one leader and two standbys properly synchronised.
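
For example, something like this minimal sketch (a rough check, not an official tool) is what I have in mind: ask each master which leader it sees and how many slaves it considers active. It assumes the default master port 5050 and the /master/state.json endpoint with its "leader" and "activated_slaves" fields, which I believe are there in 0.23; the host names are placeholders for my three masters.

    import json
    import urllib2

    # Placeholder host names; replace with the real master addresses.
    MASTERS = ["master1", "master2", "master3"]

    for host in MASTERS:
        url = "http://%s:5050/master/state.json" % host
        try:
            state = json.load(urllib2.urlopen(url, timeout=5))
            # "leader" looks like "master@<ip>:5050"; all three masters should
            # report the same value. "activated_slaves" is how many slaves this
            # master currently considers registered.
            print("%s -> leader=%s, activated_slaves=%s"
                  % (host, state.get("leader"), state.get("activated_slaves")))
        except Exception as e:
            print("%s unreachable: %s" % (host, e))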

I guess I'm missing something, but I need some help...

Regards,

Fred



Re: Sync Mesos-Master to Slaves

Posted by Alex Rukletsov <al...@mesosphere.com>.
Hi Fred,

Hm, if the bug depends on the Ubuntu version, my random guess is that it's
systemd related. Were you able to solve the problem? If not, it would be
helpful if you provide more context and describe a minimal setup that
reproduces the issue.
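
In the meantime, here is a rough, unofficial sketch for pulling the counters that usually matter in this kind of situation from each master's /metrics/snapshot endpoint (default port 5050). I would expect metric names like master/elected, master/tasks_lost, master/slave_removals and master/slave_reregistrations in 0.23, but please double-check them against your build.

    import json
    import urllib2

    # Master addresses taken from the logs earlier in this thread.
    MASTERS = ["192.168.37.59", "192.168.37.58", "192.168.37.104"]
    KEYS = ["master/elected", "master/tasks_lost",
            "master/slave_removals", "master/slave_reregistrations"]

    for host in MASTERS:
        url = "http://%s:5050/metrics/snapshot" % host
        try:
            snap = json.load(urllib2.urlopen(url, timeout=5))
            values = ", ".join("%s=%s" % (k, snap.get(k)) for k in KEYS)
            print("%s: %s" % (host, values))
        except Exception as e:
            print("%s unreachable: %s" % (host, e))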

On Thu, Dec 10, 2015 at 10:15 AM, Frederic LE BRIS <fl...@pagesjaunes.fr>
wrote:

> Thanks Alex.
>
> About the context, we use Spark on Mesos, and Marathon to launch some
> Elasticsearch instances.
>
> I kill each leader one-by-one.
>
> By the way, as I said, our configuration has the Mesos masters on Ubuntu 12
> and the mesos-slaves on Ubuntu 14, which reproduces this behaviour.
>
> When I deploy both master and slave only on Ubuntu 14, the issue disappears…
>
> Fred
>
> On 09 Dec 2015, at 16:30, Alex Rukletsov <al...@mesosphere.com> wrote:
>
> Frederic,
>
> I have skimmed through the logs and they do not seem to be complete
> (especially for master1). Could you please say what task has been killed
> (id) and which master failover triggered that? I see at least three
> failovers in the logs : ). Also, could you please share some background
> about your setup? I believe you're on systemd, do you use docker tasks?
>
> To connect our conversation to particular events, let me post here the
> chain of (potentially) interesting events and some info I mined from the
> logs.
> master1: 192.168.37.59 ?
> master2: 192.168.37.58
> master3: 192.168.37.104
>
> timestamp   observed by   event
> 13:48:38     master1          master1 killed by sigterm
> 13:48:48     master2,3       new leader elected (192.168.37.104), id=5
> 13:49:25     master2          master2 killed by sigterm
> 13:50:44     master2,3       new leader elected (192.168.37.59), id=7
> 14:23:34     master1          master1 killed by sigterm
> 14:23:44     master2,3       new leader elected (192.168.37.58), id=8
>
> One interesting thing I cannot understand is why master3 did not commit
> suicide when it lost leadership?
>
>
> On Mon, Dec 7, 2015 at 4:08 PM, Frederic LE BRIS <fl...@pagesjaunes.fr>
> wrote:
>
>> With the context .. sorry
>>
>>
>
>

Re: Sync Mesos-Master to Slaves

Posted by Frederic LE BRIS <fl...@pagesjaunes.fr>.
Thanks Alex.

About the context, we use Spark on Mesos, and Marathon to launch some Elasticsearch instances.

I kill each leader one-by-one.

By the way, as I said, our configuration has the Mesos masters on Ubuntu 12 and the mesos-slaves on Ubuntu 14, which reproduces this behaviour.

When I deploy both master and slave only on Ubuntu 14, the issue disappears…

Fred

On 09 Dec 2015, at 16:30, Alex Rukletsov <al...@mesosphere.com> wrote:

Frederic,

I have skimmed through the logs and they do not seem to be complete (especially for master1). Could you please say what task has been killed (id) and which master failover triggered that? I see at least three failovers in the logs : ). Also, could you please share some background about your setup? I believe you're on systemd, do you use docker tasks?

To connect our conversation to particular events, let me post here the chain of (potentially) interesting events and some info I mined from the logs.
master1: 192.168.37.59 ?
master2: 192.168.37.58
master3: 192.168.37.104

timestamp   observed by   event
13:48:38     master1          master1 killed by sigterm
13:48:48     master2,3       new leader elected (192.168.37.104), id=5
13:49:25     master2          master2 killed by sigterm
13:50:44     master2,3       new leader elected (192.168.37.59), id=7
14:23:34     master1          master1 killed by sigterm
14:23:44     master2,3       new leader elected (192.168.37.58), id=8

One interesting thing I cannot understand is why master3 did not commit suicide when it lost leadership?


On Mon, Dec 7, 2015 at 4:08 PM, Frederic LE BRIS <fl...@pagesjaunes.fr> wrote:
With the context .. sorry




Re: Sync Mesos-Master to Slaves

Posted by Alex Rukletsov <al...@mesosphere.com>.
Frederic,

I have skimmed through the logs and they do not seem to be complete
(especially for master1). Could you please say what task has been killed
(id) and which master failover triggered that? I see at least three
failovers in the logs : ). Also, could you please share some background
about your setup? I believe you're on systemd, do you use docker tasks?

To connect our conversation to particular events, let me post here the
chain of (potentially) interesting events and some info I mined from the
logs.
master1: 192.168.37.59 ?
master2: 192.168.37.58
master3: 192.168.37.104

timestamp   observed by   event
13:48:38     master1          master1 killed by sigterm
13:48:48     master2,3       new leader elected (192.168.37.104), id=5
13:49:25     master2          master2 killed by sigterm
13:50:44     master2,3       new leader elected (192.168.37.59), id=7
14:23:34     master1          master1 killed by sigterm
14:23:44     master2,3       new leader elected (192.168.37.58), id=8

One interesting thing I cannot understand is why master3 did not commit
suicide when it lost leadership?
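
One way to see this from the outside is to list the leader-election entries under the Mesos znode in ZooKeeper: every running master holds an ephemeral sequence node there, and the lowest sequence number is the elected leader, so a master that lost leadership but kept running would still show up. A minimal sketch, assuming the default /mesos znode and the kazoo Python client (the ZooKeeper hosts are placeholders):

    from kazoo.client import KazooClient

    # Placeholder hosts; use your three ZooKeeper servers.
    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
    zk.start()
    try:
        # Expect one entry per running master (names like info_0000000005),
        # possibly alongside other bookkeeping nodes.
        for child in sorted(zk.get_children("/mesos")):
            print(child)
    finally:
        zk.stop()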


On Mon, Dec 7, 2015 at 4:08 PM, Frederic LE BRIS <fl...@pagesjaunes.fr>
wrote:

> With the context .. sorry
>
>

Re: Sync Mesos-Master to Slaves

Posted by Frederic LE BRIS <fl...@pagesjaunes.fr>.
With the context .. sorry


Re: Sync Mesos-Master to Slaves

Posted by Frederic LE BRIS <fl...@pagesjaunes.fr>.
Thanks Alex,

This is the logs from the 3 masters and a slave.


Re: Sync Mesos-Master to Slaves

Posted by Alex Rukletsov <al...@mesosphere.com>.
Hey Fred,

Logs (master and slave) can be helpful to shed some light on the problem.
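
With the Mesosphere packages the masters and slaves often log under /var/log/mesos (when --log_dir points there). If that is the case for you, a rough sketch like this pulls out the lines about leader election and slave (re-)registration that I would want to see; the keyword filter is approximate and the exact log messages vary between versions.

    import os

    LOG_DIR = "/var/log/mesos"        # adjust if your --log_dir differs
    LOG_FILE = "mesos-master.INFO"    # use "mesos-slave.INFO" on a slave

    # Rough keyword filter over the current glog INFO log.
    KEYWORDS = ("leading master", "elected", "re-register", "removed slave")

    path = os.path.join(LOG_DIR, LOG_FILE)
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                if any(k in line.lower() for k in KEYWORDS):
                    print(line.rstrip())
    else:
        print("no %s under %s" % (LOG_FILE, LOG_DIR))
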
On 2 Dec 2015 3:01 pm, "Frederic LE BRIS" <fl...@pagesjaunes.fr> wrote:

> Hi,
>
> I manage a Mesos 0.23.0 cluster based on the .deb packages from Mesosphere
> on Ubuntu 14.04.
>
> We deployed 3 ZooKeepers, 3 Mesos masters, and 3 Marathons in HA mode,
>
> and deployed 6 mesos-slaves, plus a slave process on the 3 Mesos masters.
>
> So I have the following topology:
>
> 3 servers : Mesos-master / Marathon
> 3 servers : Zookeeper / mesos-slaves.
> 3 servers : mesos-slave
>
> I followed the HA configuration for Mesos-master and Marathon.
>
> The problem is that when I kill the leading mesos-master, we lose the
> existing tasks on the slaves, and the available resources stay locked by
> the slave even though the master sees no activity on that slave.
>
> My Mesos cluster is in production, so I'm not able to restart from
> scratch; I'm looking for a procedure to re-synchronise the cluster.
>
> I'm also looking for a way to check that my mesos-masters are working
> together correctly, with one leader and two standbys properly synchronised.
>
> I guess I'm missing something, but I need some help...
>
> Regards,
>
> Fred
>
>
>