Posted to user@mesos.apache.org by Maciej Strzelecki <ma...@crealytics.com> on 2015/07/07 16:42:58 UTC

Can marathon cancel a deployment if the application is "sick"?

How can I make Marathon cancel a deployment if the app does not start after several tries?

I saw these three settings (with their defaults) in the documentation:

"backoffSeconds": 1,
"backoffFactor": 1.15,
"maxLaunchDelaySeconds": 3600,

backoffSeconds, backoffFactor and maxLaunchDelaySeconds

Configures exponential backoff behavior when launching potentially sick apps. This prevents sandboxes associated with consecutively failing tasks from filling up the hard disk on Mesos slaves. The backoff period is multiplied by the factor for each consecutive failure until it reaches maxLaunchDelaySeconds. This applies also to tasks that are killed due to failing too many health checks.
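
To make the description above concrete, here is a minimal sketch of the delay schedule it implies. It assumes the first delay equals backoffSeconds and that each further consecutive failure multiplies the delay by backoffFactor, capped at maxLaunchDelaySeconds. The function name launch_delays is hypothetical, and Marathon's actual implementation may differ in detail.

```python
def launch_delays(backoff_seconds, backoff_factor, max_launch_delay_seconds, failures):
    """Return the assumed launch delay (seconds) after each consecutive failure.

    Assumption: delay starts at backoff_seconds and is multiplied by
    backoff_factor per failure, never exceeding max_launch_delay_seconds.
    """
    delays = []
    delay = float(backoff_seconds)
    cap = float(max_launch_delay_seconds)
    for _ in range(failures):
        delays.append(min(delay, cap))
        delay *= backoff_factor
    return delays


if __name__ == "__main__":
    # Documented defaults: the delay grows slowly, so many launches fit in an hour.
    print(launch_delays(1, 1.15, 3600, 5))
    # An aggressive factor reaches the cap after two failures.
    print(launch_delays(5, 100, 600, 4))  # [5.0, 500.0, 600.0, 600.0]
```

Note that this only spaces the retries out; under these assumptions nothing in the schedule ever stops retrying.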



I would expect to be able to tell Marathon to "give up" after it has tried a few times. Is there a way?


backoffSeconds - 5

backoffFactor - high, 100-200ish (so it reaches the max delay very quickly, after just a few failures)

maxLaunchDelaySeconds - 600 (to allow for a docker pull to finish and for general startup lag)


Root cause: a developer deploys an application with either a code failure (a skipped test, for example) or a Docker image that can't be pulled. If this task is left retrying in Marathon for some time, the Mesos UI shows thousands of failed tasks. I'd love to see one, maybe two, failed start attempts, and then a "back-off".





Maciej Strzelecki
Operations Engineer
Tel: +49 30 6098381-50
Fax: +49 851-213728-88
E-mail: mstrzelecki@crealytics.de
www.crealytics.com<http://www.crealytics.com>
blog.crealytics.com

crealytics GmbH - Semantic PPC Advertising Technology

Brunngasse 1 - 94032 Passau - Germany
Oranienstraße 185 - 10999 Berlin - Germany

Managing directors: Andreas Reiffen, Christof König, Dr. Markus Kurch
Register court: Amtsgericht Passau, HRB 7466
Geschäftsführer: Andreas Reiffen, Christof König, Daniel Trost
Reg.-Gericht: Amtsgericht Passau, HRB 7466

RE: Can marathon cancel a deployment if the application is "sick"?

Posted by David Kesler <DK...@yodle.com>.
I don't believe so. We ran into a similar issue. A look through Marathon's GitHub issues turned up the following relevant tickets:

https://github.com/mesosphere/marathon/issues/1504
https://github.com/mesosphere/marathon/issues/1111
https://github.com/mesosphere/marathon/issues/1470


Basically, the issue is that as soon as the Mesos task reaches the RUNNING state, Marathon clears the exponential backoff, even if the task eventually fails. A ticket to fix this is currently slated for 0.10.0, but it has previously been slated for other releases and slipped.

(We actually set up our deploy process to create the new deployment and then periodically check on its status, so that we can kill it if it times out. That way we don't end up with perma-failing deployments in Marathon.)
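
A workaround of this kind could look roughly like the sketch below. This is not our actual deploy code. It assumes Marathon's REST API lists in-flight deployments at GET /v2/deployments and cancels one with DELETE /v2/deployments/{id}; check both against the API documentation for your Marathon version. The names wait_or_cancel and marathon_client, and the timeout and poll values, are hypothetical.

```python
import json
import time
import urllib.request


def marathon_client(base_url):
    """Build list/cancel callables for a Marathon instance (assumed endpoints)."""
    def list_deployment_ids():
        with urllib.request.urlopen(f"{base_url}/v2/deployments") as resp:
            return [d["id"] for d in json.load(resp)]

    def cancel_deployment(deployment_id):
        req = urllib.request.Request(
            f"{base_url}/v2/deployments/{deployment_id}", method="DELETE")
        urllib.request.urlopen(req).close()

    return list_deployment_ids, cancel_deployment


def wait_or_cancel(deployment_id, list_deployment_ids, cancel_deployment,
                   timeout_seconds=600.0, poll_seconds=5.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll until the deployment disappears from the list or the timeout hits.

    Returns True if the deployment finished, False if it was cancelled.
    """
    deadline = clock() + timeout_seconds
    while True:
        # A deployment that is no longer listed is treated as finished.
        if deployment_id not in set(list_deployment_ids()):
            return True
        if clock() >= deadline:
            cancel_deployment(deployment_id)
            return False
        sleep(poll_seconds)
```

Usage would be along the lines of `list_ids, cancel = marathon_client("http://marathon.example:8080")` followed by `wait_or_cancel(deployment_id, list_ids, cancel)`, where the host name is a placeholder and deployment_id is the id returned when the deployment was created. The clock and sleep parameters exist only so the loop can be tested without waiting.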

