Posted to user@mesos.apache.org by Maciej Strzelecki <ma...@crealytics.com> on 2015/07/16 14:29:56 UTC

Marathon can no longer deploy any apps after a failover

Problem:


If I restart the current Marathon framework leader (the host shown in the Active Frameworks tab of the Mesos UI), a new leader is elected after a moment, but any new deployment stays stuck in the 'deploying' state indefinitely (empty black bar, 0/1, hanging). Even at debug log level I don't see any errors in the Marathon or Mesos logs.

The existing tasks are also untouchable at that point: they keep running, but I can't kill, restart, or scale them.


When that happens, I can recover as follows:

1. Stop Marathon on all masters.
2. Remove the framework with a curl call to the Mesos API /shutdown endpoint.
3. Purge /marathon using the ZooKeeper CLI.
4. Restart the Docker service on all slaves (this kills the zombie containers).
5. Restart the mesos-slave service on all slaves (pampering my paranoia here).

After that I can deploy apps again.
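
The recovery steps above could be written down as a dry-run plan before anyone touches the cluster. The sketch below only assembles command strings and executes nothing. Everything specific in it is an assumption and not taken from the original post: the ssh transport, the service names (marathon, docker, mesos-slave), the master endpoint path /master/shutdown with a frameworkId form field, and the zkCli.sh rmr command. Check each one against your installed Mesos and ZooKeeper versions.

```python
from typing import List


def recovery_plan(masters: List[str], slaves: List[str],
                  mesos_master: str, framework_id: str,
                  zk_server: str) -> List[str]:
    """Return the shell commands for the recovery steps, in order.

    Nothing is executed here; the commands are only assembled so they can
    be reviewed first. Host names, service names, the endpoint path, and
    the ssh transport are all assumptions.
    """
    plan = []
    # 1. Stop Marathon on all masters.
    for host in masters:
        plan.append(f"ssh {host} service marathon stop")
    # 2. Remove the framework through the Mesos master API (assumed path).
    plan.append(
        f"curl -X POST -d frameworkId={framework_id} "
        f"http://{mesos_master}/master/shutdown"
    )
    # 3. Purge Marathon's state from ZooKeeper (assumed CLI invocation).
    plan.append(f"zkCli.sh -server {zk_server} rmr /marathon")
    # 4. Restart Docker on every slave.
    for host in slaves:
        plan.append(f"ssh {host} service docker restart")
    # 5. Restart mesos-slave on every slave.
    for host in slaves:
        plan.append(f"ssh {host} service mesos-slave restart")
    return plan
```

Printing the returned list, one command per line, gives a checklist that can be reviewed or pasted step by step. Because step 3 wipes Marathon's stored state, it is worth keeping that command separate from any automation.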


How can I avoid this problem? Am I missing any basic settings? This is scary: rebooting a single master (out of 3 or 5 servers) freezes everything deployed through Marathon, and the steps to regain control introduce downtime for every single app running there.





Configuration:


Running Ubuntu 14.04.2 LTS

mesos                               0.22.1-1.0.ubuntu1404

marathon                            0.9.0-1.0.381.ubuntu1404

chronos                             2.3.4-1.0.81.ubuntu1404


The cluster uses 3 masters and 15 slaves. The master machines also run the mesos-slave process (although they offer only a portion of their resources).


The Mesos/Marathon configuration relies heavily on defaults; the options that are set explicitly are shown below. The quorum is 2.
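
As a quick sanity check on the quorum value: a Mesos master quorum is normally a strict majority of the masters. The helper below is a minimal sketch of that arithmetic; the function name is hypothetical.

```python
def majority_quorum(num_masters: int) -> int:
    """Smallest number of masters that forms a strict majority."""
    if num_masters < 1:
        raise ValueError("need at least one master")
    return num_masters // 2 + 1
```

For the 3 masters described here this gives 2, which matches the configured quorum; for 5 masters it would give 3.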


The Marathon service runs on the 3 master machines.


root@mesos-master1 ~ # tree /etc/marathon/
/etc/marathon/
`-- conf
    |-- event_subscriber
    |-- framework_name
    |-- hostname
    |-- logging_level
    `-- zk

1 directory, 5 files
root@mesos-master1 ~ # tree /etc/mesos
/etc/mesos
`-- zk

0 directories, 1 file
root@mesos-master1 ~ # tree /etc/mesos-slave/
/etc/mesos-slave/
|-- containerizers
|-- docker_stop_timeout
|-- executor_registration_timeout
|-- executor_shutdown_grace_period
|-- hostname
|-- ip
|-- logging_level
`-- resources

0 directories, 8 files
root@mesos-master1 ~ # tree /etc/mesos-master
/etc/mesos-master
|-- cluster
|-- hostname
|-- ip
|-- logging_level
|-- quorum
`-- work_dir

Re: Marathon can no longer deploy any apps after a failover

Posted by tommy xiao <xi...@gmail.com>.

Sometimes you need to check the ZooKeeper log, the slave log, and the master log. This is
a pain point of Mesos: weird cases like this are very difficult to debug.
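
Besides the logs, it can help to see what the Mesos master itself reports about the framework after the failover. The sketch below assumes the master's state JSON (served at a path such as /master/state.json on this Mesos generation) has already been fetched and parsed, and that each entry under "frameworks" carries "name", "id", "active", and "hostname" fields. Those field names and the function name are assumptions to verify against your own master's output.

```python
from typing import Any, Dict, List


def frameworks_named(state: Dict[str, Any], name: str) -> List[Dict[str, Any]]:
    """Summarize every framework entry in the master state with this name.

    `state` is the parsed JSON document from the Mesos master. The field
    names used here ("frameworks", "name", "id", "active", "hostname")
    are assumptions about that document's layout.
    """
    return [
        {
            "id": fw.get("id"),
            "active": fw.get("active"),
            "hostname": fw.get("hostname"),
        }
        for fw in state.get("frameworks", [])
        if fw.get("name") == name
    ]
```

If the result shows the framework as inactive, or still registered from the host that was restarted, that narrows down which component's log to read first.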

-- 
Deshi Xiao
Twitter: xds2000
E-mail: xiaods(AT)gmail.com

Re: Marathon can no longer deploy any apps after a failover

Posted by Maciej Strzelecki <ma...@crealytics.com>.

Thanks for the guidelines! I'll try these paths out and join the Marathon mailing list (I was oblivious that there was one ;))


Maciej Strzelecki
Operations Engineer
Tel: +49 30 6098381-50
Fax: +49 851-213728-88
E-mail: mstrzelecki@crealytics.de
www.crealytics.com
blog.crealytics.com

crealytics GmbH - Semantic PPC Advertising Technology

Brunngasse 1 - 94032 Passau - Germany
Oranienstraße 185 - 10999 Berlin - Germany

Managing directors: Andreas Reiffen, Christof König, Dr. Markus Kurch
Register court: Amtsgericht Passau, HRB 7466
Geschäftsführer: Andreas Reiffen, Christof König, Daniel Trost
Reg.-Gericht: Amtsgericht Passau, HRB 7466

Re: Marathon can no longer deploy any apps after a failover

Posted by Vinod Kone <vi...@gmail.com>.

Sounds like a marathon issue. Mind asking in marathon's mailing list?

RE: Marathon can no longer deploy any apps after a failover

Posted by Nikolay Borodachev <nb...@adobe.com>.

Maciej,

I had a similar problem, and it was solved by setting the LIBPROCESS_IP environment variable to the host's IP address for the Marathon process.

Nikolay
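
One way to apply the LIBPROCESS_IP suggestion is to build the environment for the Marathon process explicitly. This is a minimal sketch, not a description of how the Marathon packages start the service: the function name is hypothetical, and rejecting loopback addresses is an added assumption (a loopback address would not be reachable from the Mesos masters on other hosts).

```python
import ipaddress
import os
from typing import Dict, Mapping, Optional


def marathon_environment(host_ip: str,
                         base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with LIBPROCESS_IP set to host_ip.

    Raises ValueError if host_ip is not a valid IP address or is a
    loopback address (the loopback check is an assumption of this sketch).
    """
    address = ipaddress.ip_address(host_ip)  # ValueError on malformed input
    if address.is_loopback:
        raise ValueError(f"{host_ip} is a loopback address")
    env = dict(os.environ if base_env is None else base_env)
    env["LIBPROCESS_IP"] = host_ip
    return env
```

The returned mapping can be passed as the env argument of subprocess.Popen when starting Marathon from a wrapper script. How Marathon is actually launched on a given host (init script, upstart job, or something else) is outside this sketch.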
