Posted to issues@aurora.apache.org by "Brian Weber (JIRA)" <ji...@apache.org> on 2015/10/26 18:54:28 UTC
[jira] [Commented] (AURORA-279) Allow scheduler to decide how to respond to task health check failures

    [ https://issues.apache.org/jira/browse/AURORA-279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974658#comment-14974658 ] 

Brian Weber commented on AURORA-279:
------------------------------------

It shouldn't be too much to ask for a guard rail to prevent a health check reaction from taking down an entire job. It also doesn't look like a huge addition to add an integer for max_concurrent_restarts or something like that (perhaps default to a batch size?) so that customers who don't have central remediation frameworks can let aurora manage the failure rates.

e.g.: a job with 1000 instances can serve well enough with 10% (100 instances) down. Suppose a bug runs wild and instances arbitrarily start reporting unhealthy. If a restart temporarily fixes the bug until the next deploy, cool. But if the bug hits enough instances that, between the bug and the restarts already in progress, over 100 instances are down, then the configured health check reaction would take down enough instances that the service could stop serving well at all.

Suppose instead that thermos queried aurora for permission to remediate, and aurora could then ratelimit remediations and send a notification to someone so they can respond more quickly. Aurora would then know that 10% of the fleet is down, and hold off while a human is notified. It would then be up to the notified party to decide whether to fix the bug right there.

- It may be 3am when nobody is awake, so the action may be to just restart the entire job.
- It may be a low traffic point, in which case one may decide to adjust the threshold.
- It may be a critical time because the entire site is on fire, and this one service is less important.
- It may be important enough that the decision is made to push a bugfix right then and there, which is not always an easy task.

The only action in thermos would be to query aurora for permission, which would be a boolean response. The only action in aurora would be to compare the number of not-healthy instances to a ratelimit (e.g., if not_serving_instances > rate_limit: return False). This doesn't seem too complicated to build in and would give aurora a great deal of repair power.
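
A minimal sketch of that check in Python, using the 1000-instance / 10% numbers from the example above. The function names (rate_limit_for, may_remediate) and the max_down_percent parameter are hypothetical, made up for illustration; nothing here is an existing aurora or thermos API.

```python
def rate_limit_for(total_instances: int, max_down_percent: int) -> int:
    """Hypothetical helper: how many instances of a job may be down at once."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return total_instances * max_down_percent // 100


def may_remediate(not_serving_instances: int, rate_limit: int) -> bool:
    """Hypothetical scheduler-side check answering the executor's query.

    Mirrors the comparison in the comment above: permission is refused
    once the number of not-serving instances exceeds the rate limit.
    """
    if not_serving_instances > rate_limit:
        return False
    return True


# Job with 1000 instances that serves well enough with 10% down.
limit = rate_limit_for(1000, 10)

for down in (50, 100, 101):
    print(down, may_remediate(down, limit))
```

Note that with a strict ">" comparison, a request made while exactly 100 instances are down is still granted; whether the boundary should be ">" or ">=" is a policy choice this sketch leaves open.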

> Allow scheduler to decide how to respond to task health check failures
> ----------------------------------------------------------------------
>
>                 Key: AURORA-279
>                 URL: https://issues.apache.org/jira/browse/AURORA-279
>             Project: Aurora
>          Issue Type: Story
>          Components: Executor, Scheduler
>            Reporter: Bill Farner
>            Priority: Minor
>
> The executor is currently autonomous in deciding to kill tasks that have failed health checks.  If health check failures synchronize across a service, the service could suffer an outage.  SLA considerations may also need to be made before deciding to kill a task for health check failures.


