Posted to user@aurora.apache.org by Kaiwen Xu <ke...@kevxu.net> on 2017/09/19 00:13:24 UTC

Way to kill failed instances during an unsuccessful job update

Hi,

I am wondering whether there is any way for Aurora to kill the failed
instances when a job update is not successful (e.g. apps on some
backends fail to start up)?

Right now we have turned off the "rollback" feature for job updates,
because one or two backends (out of tens to hundreds) failing is
acceptable for us, and we don't want to roll back the whole fleet
because of that. However, it seems that with "rollback" off, those
failed backends are just left there, and they will keep trying to
restart indefinitely.

Just curious what would be a recommended approach for this situation?
Should we try to identify those instances and stop them in our own
deployment scripts?

Thanks,
Kaiwen

Re: Way to kill failed instances during an unsuccessful job update

Posted by Bill Farner <wf...@apache.org>.

Aurora doesn't currently offer a way to do what you describe.

A job in the scheduler describes a provisioning goal (number of instances),
and we assume the scheduler shouldn't choose to modify that goal over
time.  To that end, the scheduler doesn't consider it a problem to
infinitely restart the failed instances; it is hopeful that the environment
will eventually self-heal.
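
For the deployment-script approach raised in the question, a minimal
sketch in Python might look like the following. It is only an
illustration under stated assumptions: the failure counts are assumed to
be collected by your own tooling (how they are gathered is not shown),
the helper names and the threshold are hypothetical, and the command it
builds assumes the client's "aurora job kill <job_key>/<instances>" form
with comma-separated IDs and ranges. Check the result against your
client version before running it.

```python
from typing import Dict, Iterable, List


def compress_ranges(ids: Iterable[int]) -> str:
    """Collapse instance IDs into a compact spec such as "1-3,7"."""
    parts: List[str] = []
    start = prev = None
    for i in sorted(set(ids)):
        if start is None:
            start = prev = i
        elif i == prev + 1:
            prev = i
        else:
            parts.append(str(start) if start == prev else f"{start}-{prev}")
            start = prev = i
    if start is not None:
        parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


def pick_failed(failure_counts: Dict[int, int], max_failures: int) -> List[int]:
    """Return instance IDs whose failure count reached the threshold.

    failure_counts is a hypothetical mapping of instance ID to the number
    of failed starts observed by the deployment script.
    """
    return sorted(i for i, n in failure_counts.items() if n >= max_failures)


def build_kill_command(job_key: str, instance_ids: Iterable[int]) -> List[str]:
    """Build (but do not run) the kill command for the given instances.

    Assumes the "aurora job kill <job_key>/<instances>" CLI form.
    """
    return ["aurora", "job", "kill", f"{job_key}/{compress_ranges(instance_ids)}"]


if __name__ == "__main__":
    # Hypothetical observations: instances 4 and 9 keep failing.
    counts = {0: 0, 1: 1, 4: 12, 9: 5}
    failed = pick_failed(counts, max_failures=5)
    if failed:
        print(" ".join(build_kill_command("devcluster/www-data/prod/hello", failed)))
```

The sketch only prints the command, so a script can log or review it
before handing it to subprocess.run. Since the scheduler treats the
configured instance count as the goal, anything killed this way is a
deliberate reduction made by the script rather than by Aurora.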
