Posted to dev@mesos.apache.org by Guilherme Moro <gu...@ammeon.com> on 2015/11/24 13:45:48 UTC

Initial leader election

Hi,

I'm having a problem creating the initial cluster: no leader is elected.
For a start, let me explain my setup:
3 nodes
3 zookeepers
3 mesos-master services, configured as initctl services and controlled by
puppet. The RPMs are from the RHEL repository at mesosphere
(installed through puppet as well), running on RHEL 6.6.
Quorum is set to 2, as expected. All the remaining configs were double
checked and appear to be correct.
Most of the time I can get the cluster to bootstrap after rebooting the nodes
(sometimes more than once).
The whole thing somewhat resembles
https://issues.apache.org/jira/browse/MESOS-2148 and
https://issues.apache.org/jira/browse/MESOS-2014

Even when a master is elected, it sometimes takes another couple of reboots
or service restarts before all the slave nodes are added (the slaves run on
the same nodes as the masters).

I can quite easily reproduce this behavior. If someone cares to look at the
logs, tell me exactly what to collect and which logging flags I should enable.

So, should I open a bug, or is there a trick to bootstrapping the cluster
that I'm missing here?

Regards,

Guilherme Moro
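
On the quorum setting mentioned in the message above: a minimal sketch of the
arithmetic, assuming the usual strict-majority rule (quorum = floor(N/2) + 1).
The function name is hypothetical and not part of Mesos.

```python
def majority_quorum(num_masters: int) -> int:
    """Return the smallest count of masters that forms a strict majority."""
    if num_masters < 1:
        raise ValueError("need at least one master")
    # Integer division drops the remainder, so adding 1 always exceeds half.
    return num_masters // 2 + 1


# The setup described above: 3 masters.
print(majority_quorum(3))
```

Under that assumption, 3 masters give a quorum of 2, which matches the value
reported in the message.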

-- 
This email and any files transmitted with it are confidential and intended 
solely for the use of the individual or entity to whom they are addressed. 
If you have received this email in error please notify the system manager. 
This message contains confidential information and is intended only for the 
individual named. If you are not the named addressee you should not 
disseminate, distribute or copy this e-mail.

Re: Initial leader election

Posted by Guilherme Moro <gu...@ammeon.com>.

The nodes are quite fast to come up, but I will try increasing that timeout
as a test anyway.
Either way, shouldn't the system try again automatically, instead of just
repeatedly logging "Replica in EMPTY status received a broadcasted recover
request" after a couple of failures?


Thanks for the answer.



On 25 November 2015 at 17:31, Marco Massenzio <ma...@mesosphere.io> wrote:

> A quick glance at the logs doesn't show anything that stands out, apart
> from:
>
> --zk_session_timeout="10secs"
>
> which seems to lead to:
>
> Nov 23 16:50:13 node1 mesos-master[17501]: I1123 16:50:13.594151 17521
> recover.cpp:111] Unable to finish the recover protocol in 10secs,
> retrying
>
> That is the default value, but your setup may need longer than that
> (it is possible that the time it takes for all master nodes to come up and
> reach quorum is the issue).
>
> --
> *Marco Massenzio*
> Distributed Systems Engineer
> http://codetrips.com
>


Re: Initial leader election

Posted by Marco Massenzio <ma...@mesosphere.io>.

A quick glance at the logs doesn't show anything that stands out, apart
from:

--zk_session_timeout="10secs"

which seems to lead to:

Nov 23 16:50:13 node1 mesos-master[17501]: I1123 16:50:13.594151 17521
recover.cpp:111] Unable to finish the recover protocol in 10secs,
retrying

That is the default value, but your setup may need longer than that
(it is possible that the time it takes for all master nodes to come up and
reach quorum is the issue).

--
*Marco Massenzio*
Distributed Systems Engineer
http://codetrips.com
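
To see how often each master hits the recover timeout quoted above, the logs
can be scanned for that message. This is only a sketch: the regular expression
is inferred from the single sample line in this thread (assumed to be one
unwrapped line in the real log), and the function name is hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the one syslog-prefixed sample quoted above
# (an assumption, not a documented Mesos log format).
RECOVER_TIMEOUT = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<host>\S+)\s+mesos-master\[\d+\]:.*"
    r"Unable to finish the recover protocol in (?P<timeout>[^,\s]+)"
)


def count_recover_timeouts(lines):
    """Count recover-protocol timeout messages per host."""
    counts = Counter()
    for line in lines:
        match = RECOVER_TIMEOUT.search(line)
        if match:
            counts[match.group("host")] += 1
    return counts


# The sample line from this thread, joined back into a single line.
sample = [
    "Nov 23 16:50:13 node1 mesos-master[17501]: I1123 16:50:13.594151 17521 "
    "recover.cpp:111] Unable to finish the recover protocol in 10secs, retrying",
]
print(count_recover_timeouts(sample))
```

Any iterable of lines works, so an open log file can be passed directly in
place of `sample`.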

On Wed, Nov 25, 2015 at 3:06 AM, Guilherme Moro <gu...@ammeon.com>
wrote:

> https://issues.apache.org/jira/browse/MESOS-4010

Re: Initial leader election

Posted by Guilherme Moro <gu...@ammeon.com>.

https://issues.apache.org/jira/browse/MESOS-4010

On 24 November 2015 at 13:55, Klaus Ma <kl...@gmail.com> wrote:

> I'd suggest opening a JIRA to track the issue; I think you can attach
> master.log & slave.log for the owner's reference.


Re: Initial leader election

Posted by Klaus Ma <kl...@gmail.com>.

I'd suggest opening a JIRA to track the issue; I think you can attach
master.log & slave.log for the owner's reference.

----
Da (Klaus), Ma (马达) | PMP® | Advisory Software Engineer
Platform Symphony/DCOS Development & Support, STG, IBM GCG
+86-10-8245 4084 | klaus1982.cn@gmail.com | http://k82.me
