Posted to user@mesos.apache.org by Nikolay Borodachev <nb...@adobe.com> on 2015/04/27 20:41:54 UTC

Marathon change of leader and stalled deployments

Hello All,

I noticed some strange behavior in a Marathon cluster. The cluster consists of three Mesos/Marathon masters and three slaves.

When the cluster is freshly started, I can launch a process (e.g. httpd) and scale it up and down without any problems. Everything works as it should.
However, if the Marathon leader goes down or gets restarted, the managed processes can no longer be scaled. The scaling request gets queued but is never executed by the new Marathon leader.
I found that the scaling request does not move unless I recycle the current leader until the original server becomes the leader again.
Only when the server that was the leader at the time the tasks were created becomes the leader again can those tasks be scaled.

Is this a known and expected behavior?

Thanks
Nikolay


Re: Marathon change of leader and stalled deployments

Posted by Dario Rexin <da...@mesosphere.io>.
You're most welcome!



> On 28.04.2015, at 20:06, Nikolay Borodachev <nb...@adobe.com> wrote:
> 
> That did the trick! Thank you very much, Dario!

Re: Marathon change of leader and stalled deployments

Posted by Dario Rexin <da...@mesosphere.io>.
Hi Egor,

You can do that by setting $LIBPROCESS_PORT.

Cheers,
Dario



> On 28.04.2015, at 20:28, Egor Guz <EG...@walmartlabs.com> wrote:
> 
> Dario,
> 
> Is it possible to specify the port that libprocess uses to listen for messages from the master? I opened the range "30000 60000 tcp 0.0.0.0/0", which works for us, but I really want to narrow it down.
> 
> —
> Egor
> 
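
For illustration, pinning the libprocess address and port for Marathon might look roughly like the lines below; the address 10.0.0.11, the port 9090, and the launch style are assumptions, not details from this thread:

    # Export before starting Marathon so the Mesos scheduler driver
    # (libprocess) binds to a known, routable address and a fixed port.
    export LIBPROCESS_IP=10.0.0.11
    export LIBPROCESS_PORT=9090
    # ...then start Marathon as usual on this host.

With the port fixed, the master-to-framework firewall rule can be narrowed from the 30000 60000 range down to that single TCP port.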

RE: Marathon change of leader and stalled deployments

Posted by Nikolay Borodachev <nb...@adobe.com>.
That did the trick! Thank you very much, Dario!

From: Dario Rexin [mailto:dario@mesosphere.io]
Sent: Tuesday, April 28, 2015 10:00 AM
To: user@mesos.apache.org
Subject: Re: Marathon change of leader and stalled deployments

Yes. Unfortunately that’s the only way to set the IP in the Mesos Java bindings.



Re: Marathon change of leader and stalled deployments

Posted by Dario Rexin <da...@mesosphere.io>.
Yes. Unfortunately that’s the only way to set the IP in the Mesos Java bindings.

> On 28 Apr 2015, at 16:57, Nikolay Borodachev <nb...@adobe.com> wrote:
> 
> I actually have the '--ip' parameter set for both master and slave. So, LIBPROCESS_IP should only be set for Marathon?
>  
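
For illustration, because the env variable is the only way to set this for Marathon, it has to be placed in whatever environment Marathon is started with; the file path and address below are assumptions, not details from this thread:

    # Hypothetical environment file sourced by Marathon's init script,
    # e.g. /etc/default/marathon; use a routable address of this host.
    LIBPROCESS_IP=10.0.0.11

After adding it, restart the Marathon service on each master so the scheduler driver re-binds to the routable address.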


RE: Marathon change of leader and stalled deployments

Posted by Nikolay Borodachev <nb...@adobe.com>.
I actually have the '--ip' parameter set for both master and slave. So, LIBPROCESS_IP should only be set for Marathon?

From: Dario Rexin [mailto:dario@mesosphere.io]
Sent: Tuesday, April 28, 2015 9:56 AM
To: user@mesos.apache.org
Subject: Re: Marathon change of leader and stalled deployments

On the master and slave you should be able to start them with the --ip parameter instead of using the env variable. But you should set the IP to a fixed value for all processes.



Re: Marathon change of leader and stalled deployments

Posted by Dario Rexin <da...@mesosphere.io>.
On the master and slave you should be able to start them with the --ip parameter instead of using the env variable. But you should set the IP to a fixed value for all processes.

> On 28 Apr 2015, at 16:52, Nikolay Borodachev <nb...@adobe.com> wrote:
> 
> Is it for all 3 processes: master, slave, and marathon?
>  
> Thanks
> Nikolay
>  
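
For illustration, the corresponding flags on the Mesos daemons might look roughly like this; the addresses, ZooKeeper URL, quorum, and work directory are hypothetical, not taken from this thread:

    # On each master: bind to a fixed, routable address.
    mesos-master --ip=10.0.0.11 \
      --zk=zk://10.0.0.11:2181,10.0.0.12:2181,10.0.0.13:2181/mesos \
      --quorum=2 --work_dir=/var/lib/mesos

    # On each slave: same idea, pointing at the same ZooKeeper ensemble.
    mesos-slave --ip=10.0.0.21 \
      --master=zk://10.0.0.11:2181,10.0.0.12:2181,10.0.0.13:2181/mesos

Marathon itself has to use LIBPROCESS_IP instead, as noted elsewhere in the thread.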


RE: Marathon change of leader and stalled deployments

Posted by Nikolay Borodachev <nb...@adobe.com>.
Is it for all three processes: master, slave, and Marathon?

Thanks
Nikolay

From: Dario Rexin [mailto:dario@mesosphere.io]
Sent: Tuesday, April 28, 2015 9:47 AM
To: user@mesos.apache.org
Subject: Re: Marathon change of leader and stalled deployments

On each host you have to set it to the interface that is connected to the network your cluster is running in.




Re: Marathon change of leader and stalled deployments

Posted by Dario Rexin <da...@mesosphere.io>.
On each host you have to set it to the interface that is connected to the network your cluster is running in. 



> On 28.04.2015, at 16:41, Nikolay Borodachev <nb...@adobe.com> wrote:
> 
> Hi Dario,
>  
> This could be the reason but why would it not bind to all network interfaces by default?
> To test it out, should I set LIBPROCESS_IP to an IP address of mesos1 server?
>  
> Thank you
> Nikolay
>  

RE: Marathon change of leader and stalled deployments

Posted by Nikolay Borodachev <nb...@adobe.com>.
Hi Dario,

This could be the reason, but why does it not bind to all network interfaces by default?
To test it out, should I set LIBPROCESS_IP to the IP address of the mesos1 server?

Thank you
Nikolay

From: Dario Rexin [mailto:dario@mesosphere.io]
Sent: Tuesday, April 28, 2015 4:31 AM
To: user@mesos.apache.org
Subject: Re: Marathon change of leader and stalled deployments

Hi Nikolay,

could this be the problem?

Apr 27 22:36:00 mesos1 marathon[6289]: **************************************************
Apr 27 22:36:00 mesos1 marathon[6289]: Scheduler driver bound to loopback interface! Cannot communicate with remote master(s). You might want to set 'LIBPROCESS_IP' environment variable to use a routable IP address.
Apr 27 22:36:00 mesos1 marathon[6289]: **************************************************

This would explain why only a certain node (most likely the one that’s running on the same machine as the current Mesos leader) can start tasks.

Cheers,
Dario



Re: Marathon change of leader and stalled deployments

Posted by Dario Rexin <da...@mesosphere.io>.
Hi Nikolay,

Could this be the problem?

Apr 27 22:36:00 mesos1 marathon[6289]: **************************************************
Apr 27 22:36:00 mesos1 marathon[6289]: Scheduler driver bound to loopback interface! Cannot communicate with remote master(s). You might want to set 'LIBPROCESS_IP' environment variable to use a routable IP address.
Apr 27 22:36:00 mesos1 marathon[6289]: **************************************************

This would explain why only a certain node (most likely the one that’s running on the same machine as the current Mesos leader) can start tasks.

Cheers,
Dario
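
A quick way to confirm this symptom is to check what the hostname resolves to and where the Marathon JVM is actually listening; a common cause of the loopback binding is the hostname mapping to 127.0.1.1 in /etc/hosts. The commands below are only a sketch, and their output format varies by distribution:

    # Does the hostname resolve to a loopback address?
    getent hosts "$(hostname -f)"

    # Which addresses is the Marathon process (usually shown as 'java') listening on?
    sudo ss -ltnp | grep java

If only 127.x addresses show up, pointing LIBPROCESS_IP at a routable address, as suggested here, is what ultimately resolved the issue in this thread.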

> On 27 Apr 2015, at 23:49, Nikolay Borodachev <nb...@adobe.com> wrote:
> 
> Dario,
>  
> The logs are quite lengthy, so I sent them to you directly. Marathon version is 0.8.1.
>  
> Thank you
> Nikolay
>  


RE: Marathon change of leader and stalled deployments

Posted by Nikolay Borodachev <nb...@adobe.com>.
Dario,

The logs are quite lengthy, so I sent them to you directly. Marathon version is 0.8.1.

Thank you
Nikolay

From: Dario Rexin [mailto:dario@mesosphere.io]
Sent: Monday, April 27, 2015 4:01 PM
To: user@mesos.apache.org
Subject: Re: Marathon change of leader and stalled deployments

Hi Nikolay,

this is an unexpected behavior. Could you please post the log output from the leading node around the time you try to scale? Also, what version of Marathon are you running?

Thanks,
Dario




Re: Marathon change of leader and stalled deployments

Posted by Dario Rexin <da...@mesosphere.io>.
Hi Nikolay,

This is unexpected behavior. Could you please post the log output from the leading node around the time you try to scale? Also, what version of Marathon are you running?

Thanks,
Dario
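
For reference, on a host where Marathon logs through syslog/journald (as the "mesos1 marathon[6289]" lines quoted elsewhere in the thread suggest), pulling the relevant output might look like this; the unit name, log path, and time window are assumptions:

    # systemd/journald:
    sudo journalctl -u marathon --since "2015-04-27 22:30" --until "2015-04-27 22:45"

    # or, on a syslog-based setup:
    grep 'marathon\[' /var/log/syslog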


