Posted to dev@aurora.apache.org by Tengfei Mu <te...@gmail.com> on 2018/06/18 19:47:17 UTC

Massive instance rescheduling outage upon traffic spike

Hi,

We have had a few incidents in which a service came under an unexpected
traffic/load spike, its containers started to respond slowly or fail health
checks, and this caused massive instance rescheduling in Aurora. This can
become a vicious cycle: the instances being rescheduled (restarted) put more
load on the remaining instances, so more and more instances get knocked down.
Can anyone share best practices or lessons for preventing outages like this,
caused by dynamic rescheduling in a production cluster?


Best,
Tengfei

Re: Massive instance rescheduling outage upon traffic spike

Posted by Stephan Erb <st...@blue-yonder.com>.

Hey Tengfei,

The Aurora health checks cannot differentiate a service instance that has deadlocked from one that is extremely slow. The decision to restart is then made by the executor, without central coordination by the scheduler. Your best course of action is therefore to prevent the overload in the first place, for example via load shedding and graceful degradation. You can find further details in the Google SRE Book [1].
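
As a rough illustration of load shedding, here is a minimal sketch in Python. It assumes a simple cap on in-flight requests; the names LoadShedder, handle and max_in_flight are hypothetical and not part of Aurora or any particular framework. Once the cap is reached, new requests are rejected quickly with a 503 instead of queueing up and slowing every other request down.

```python
import threading


class LoadShedder:
    """Hypothetical helper: rejects work once too many requests are in flight."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight  # assumed per-instance capacity
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        # Admit the request only if we are below the cap.
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return False
            self._in_flight += 1
            return True

    def release(self):
        with self._lock:
            self._in_flight -= 1


def handle(shedder, work):
    """Run work() if admitted; otherwise shed the request with a 503."""
    if not shedder.try_acquire():
        return 503, "overloaded, retry later"
    try:
        return 200, work()
    finally:
        shedder.release()
```

The cap would normally be derived from load tests of a single instance; the value is left to the caller here.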

Specifically, you will want to do tighter health checking in your load balancers, so that instances drop out of rotation before they hit their capacity limit. In addition, I have had good experience with also protecting each instance with an HAProxy/Nginx that limits incoming traffic and runs as a sidecar within the Aurora task.
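
One way to picture the tighter load balancer check is to give the load balancer and the Aurora health checker different signals. The sketch below is only an assumption about how this could be wired up: lb_check, aurora_check and drain_ratio are hypothetical names, and exposing them as HTTP endpoints is left out.

```python
def lb_check(in_flight, capacity, drain_ratio=0.8):
    """Status for the load balancer: fail early, before the instance saturates.

    drain_ratio is an assumed safety margin, not a recommended value.
    """
    return 200 if in_flight < capacity * drain_ratio else 503


def aurora_check(process_alive):
    """Status for the Aurora health check: fail only if the process is broken."""
    return 200 if process_alive else 500
```

With these two checks, a busy instance fails lb_check and drops out of rotation while still passing aurora_check, so it is not restarted merely for being busy.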

I hope this gets you started.

Best regards,
Stephan

[1] https://landing.google.com/sre/book/chapters/addressing-cascading-failures.html

