Posted to dev@aurora.apache.org by Maxim Khutornenko <ma...@apache.org> on 2016/02/05 23:43:56 UTC

[PROPOSAL] Disallow instance removal in job update

This is mostly a survey rather than a proposal. What would people think
about limiting the updater to only adding/updating instances and letting
killTasks take care of instance removals?

We have all heard stories (or happened to create some ourselves) where an
outdated instance count value in the .aurora config caused unexpected
instance removals. Granted, there are plenty of other values in the
config that can cause a service-wide outage, but instance count seems to
be the worst in that sense.

After the recent refactoring of addInstances and killTasks to act as
scaleOut/scaleIn APIs [1], the outdated instance count problem will
only get worse, as automated scaling tools will quickly render the existing
.aurora config value obsolete. With that in mind, should we block
instance removal in the updater and let an explicit killTasks call be
the only acceptable action to reduce instance count? Is there any
value (aside from an arguable convenience factor) in having
startJobUpdate ever kill instances?
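
To make the question concrete, here is a minimal sketch of what "block
instance removal in the updater" could look like. It is an illustration only:
UpdateDiff, compute_diff, validate_update and InstanceRemovalError are
hypothetical names, not scheduler code, and the sketch assumes instance IDs
are the contiguous range 0..N-1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpdateDiff:
    """Instance IDs an update would add or remove (hypothetical type)."""
    added: frozenset
    removed: frozenset


class InstanceRemovalError(Exception):
    """Raised when an update would shrink the job."""


def compute_diff(active_ids, desired_count):
    """Compare the active instance IDs with the desired IDs 0..desired_count-1."""
    desired = frozenset(range(desired_count))
    active = frozenset(active_ids)
    return UpdateDiff(added=desired - active, removed=active - desired)


def validate_update(active_ids, desired_count):
    """Return the diff, or raise if the update would remove any instance."""
    diff = compute_diff(active_ids, desired_count)
    if diff.removed:
        raise InstanceRemovalError(
            "update would remove instances %s; use killTasks instead"
            % sorted(diff.removed))
    return diff
```

With five active instances, validate_update(range(5), 7) reports instances 5
and 6 as added, while validate_update(range(5), 3) raises instead of killing
instances 3 and 4.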

Thanks,
Maxim

[1] - http://markmail.org/message/2smaej5n5e54li3g

Re: [PROPOSAL] Disallow instance removal in job update

Posted by John Sirois <jo...@conductant.com>.
On Fri, Feb 5, 2016 at 4:31 PM, Bill Farner <wf...@apache.org> wrote:

> Or without any persistence at all.  The client could refuse to adjust the
> instance count on a job unless there's additional command line argument.
> The same arguments of responsibility could be said here of users of old
> clients or custom clients.
>

I guess that's true.  I concur.



-- 
John Sirois
303-512-3301

Re: [PROPOSAL] Disallow instance removal in job update

Posted by Bill Farner <wf...@apache.org>.
Or without any persistence at all.  The client could refuse to adjust the
instance count on a job unless there's an additional command-line argument.
The same arguments about responsibility could be made here for users of old
clients or custom clients.
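
A rough sketch of that client-side check, under stated assumptions: the flag
name --allow-instance-count-change, the program name and the helper functions
are all hypothetical (none of them exist in the Aurora client), and the current
instance count is passed in rather than fetched from the scheduler.

```python
import argparse


def build_parser():
    """Parser with a single hypothetical opt-in flag."""
    parser = argparse.ArgumentParser(prog="update-start-sketch")
    parser.add_argument(
        "--allow-instance-count-change",
        action="store_true",
        help="hypothetical flag: permit the update to change the instance count")
    return parser


def may_proceed(current_count, configured_count, allow_change):
    """An update that leaves the count unchanged is always allowed."""
    if configured_count == current_count:
        return True
    return allow_change


def main(argv, current_count, configured_count):
    """Return a process exit code: 0 to proceed, 1 to refuse."""
    args = build_parser().parse_args(argv)
    if may_proceed(current_count, configured_count,
                   args.allow_instance_count_change):
        return 0
    print("refusing to change instance count from %d to %d without "
          "--allow-instance-count-change" % (current_count, configured_count))
    return 1
```

Because the check lives entirely in the client, an old or custom client would
bypass it, which is the responsibility trade-off mentioned above.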

On Fri, Feb 5, 2016 at 3:17 PM, John Sirois <jo...@conductant.com> wrote:

> I'd recommend that an API call that reduces instance count require a
> `confirm_instance_reduction=true` parameter; this could be plumbed back
> to a flag in the official Aurora client.
> That said, since Aurora immediately forgets jobs and splits things into
> tasks, I'm not sure this is even sanely possible today.
>
> Assuming it is possible, any human who turns that flag on by default with
> a shell alias or an rc file can take responsibility for their own problem.
> If a tool passes the boolean, again, that's the tool's problem.  Hopefully
> it's a carefully developed and vetted auto-scaling tool.

Re: [PROPOSAL] Disallow instance removal in job update

Posted by John Sirois <jo...@conductant.com>.
On Fri, Feb 5, 2016 at 4:07 PM, Maxim Khutornenko <ma...@apache.org> wrote:

> We have had attempts to safeguard client updater command with a
> "dangerous change" warning before but it did not get good feedback.
> Besides, automated tools/scripts just ignored it.
>
> An alternative could be what George suggest on the scaling API thread
> mentioned earlier: automatically bump up instance count to the job
> active task count. I'd say this could be an implementation to the
> proposal above rather than a safeguard as it accomplishes the exact
> same goal.
>
> Bill, do you have any ideas of what that safeguard could be?
>

I'd recommend that an API call that reduces instance count require a
`confirm_instance_reduction=true` parameter; this could be plumbed back
to a flag in the official Aurora client.
That said, since Aurora immediately forgets jobs and splits things into
tasks, I'm not sure this is even sanely possible today.

Assuming it is possible, any human who turns that flag on by default with
a shell alias or an rc file can take responsibility for their own problem.
If a tool passes the boolean, again, that's the tool's problem.  Hopefully
it's a carefully developed and vetted auto-scaling tool.
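
As an illustration of the API shape, here is a minimal sketch that assumes the
server can tell the current instance count (which, per the caveat above, may
not be straightforward today). start_job_update_sketch is a hypothetical
stand-in, not the real startJobUpdate RPC; only the parameter name
confirm_instance_reduction comes from the suggestion itself.

```python
class InstanceReductionNotConfirmed(Exception):
    """Raised when a shrinking update lacks explicit confirmation."""


def start_job_update_sketch(current_count, requested_count,
                            confirm_instance_reduction=False):
    """Accept the update unless it shrinks the job without confirmation.

    Returns the instance count the job would have after the update.
    """
    if requested_count < current_count and not confirm_instance_reduction:
        raise InstanceReductionNotConfirmed(
            "update would reduce instance count from %d to %d; "
            "pass confirm_instance_reduction=True to proceed"
            % (current_count, requested_count))
    return requested_count
```

Scaling out or keeping the count unchanged needs no confirmation in this
sketch; only a reduction has to be requested explicitly.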



-- 
John Sirois
303-512-3301

Re: [PROPOSAL] Disallow instance removal in job update

Posted by Tony Dong <td...@twitter.com.INVALID>.
Definitely +1 on the idea of a safeguard.
I don't really have any proposals beyond the ones that have been
mentioned in this thread already.

W.R.T. automatically scaling up the instances via a binding helper (George
talked about it in the scaling API discussion): essentially, we had a
binding helper that talks to the scheduler and figures out how many active
task instances there are, similar to what Stephan brought up.
However, we've still had incidents where an engineer forgot to use the
binding helper or accidentally left it out.
To avoid operator error, I'd like whatever safeguard we choose to be
activated automatically, rather than requiring an explicit flag.


On Mon, Feb 8, 2016 at 9:42 AM, Maxim Khutornenko <ma...@apache.org> wrote:

> > Or without any persistence at all.  The client could refuse to adjust the
> > instance count on a job unless there's additional command line argument.
> > The same arguments of responsibility could be said here of users of old
> > clients or custom clients.
>
> Bill, are you suggesting 'aurora update start' client command call a
> scheduler to acquire an update diff first and block startJobUpdate RPC
> call unless a special command line flag is present?
>
> > When updating a job, the scheduler would fill in the current instance
> > count. However, when I want to change the number of instances, I could
> > simply bind another value locally when triggering the update.
>
> Stephan, this sounds like increasing instances would also require a
> binding helper, which makes an update process less deterministic (i.e.
> .aurora config file is no longer self-contained).

Re: [PROPOSAL] Disallow instance removal in job update

Posted by Maxim Khutornenko <ma...@apache.org>.
> Or without any persistence at all.  The client could refuse to adjust the
> instance count on a job unless there's additional command line argument.
> The same arguments of responsibility could be said here of users of old
> clients or custom clients.

Bill, are you suggesting that the 'aurora update start' client command call
the scheduler to acquire an update diff first and block the startJobUpdate
RPC call unless a special command-line flag is present?

> When updating a job, the scheduler would fill in the current instance count.
> However, when I want to change the number of instances, I could simply
> bind another value locally when triggering the update.

Stephan, this sounds like increasing instances would also require a
binding helper, which makes the update process less deterministic (i.e.
the .aurora config file is no longer self-contained).

On Sun, Feb 7, 2016 at 3:02 PM, Erb, Stephan
<St...@blue-yonder.com> wrote:
> A related idea that recently crossed my mind was some kind of pystachio variable / binding helper:  {{aurora.instances}}.
>
> When updating a job, the scheduler would fill in the current instance count. However, when I want to change the number of instances, I could simply bind another value locally when triggering the update.

Re: [PROPOSAL] Disallow instance removal in job update

Posted by "Erb, Stephan" <St...@blue-yonder.com>.
A related idea that recently crossed my mind was some kind of pystachio variable / binding helper:  {{aurora.instances}}.

When updating a job, the scheduler would fill in the current instance count. However, when I want to change the number of instances, I could simply bind another value locally when triggering the update.
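
A minimal sketch of the binding idea, assuming a simplified mustache-style
substitution rather than Pystachio itself. The helper names resolve and
resolve_instances are hypothetical, and the scheduler-reported count is passed
in as a plain integer instead of being looked up.

```python
import re

# Matches references such as {{aurora.instances}}.
_REF = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def resolve(template, bindings):
    """Replace every bound {{name}} reference; leave unbound ones untouched."""
    def substitute(match):
        name = match.group(1)
        if name not in bindings:
            return match.group(0)
        return str(bindings[name])
    return _REF.sub(substitute, template)


def resolve_instances(template, scheduler_count, local_count=None):
    """A locally bound value wins; otherwise use the scheduler's count."""
    value = scheduler_count if local_count is None else local_count
    return resolve(template, {"aurora.instances": value})
```

With no local binding, resolve_instances("instances = {{aurora.instances}}", 12)
keeps the job at its current 12 instances; passing local_count=20 resizes it
explicitly.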

Re: [PROPOSAL] Disallow instance removal in job update

Posted by Maxim Khutornenko <ma...@apache.org>.
We have had attempts to safeguard the client updater command with a
"dangerous change" warning before, but it did not get good feedback.
Besides, automated tools/scripts just ignored it.

An alternative could be what George suggested on the scaling API thread
mentioned earlier: automatically bump up the instance count to the job's
active task count. I'd say this could be an implementation of the
proposal above rather than a safeguard, as it accomplishes the exact
same goal.
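
The bump-up rule can be sketched in a couple of lines. The function name
effective_instance_count is hypothetical, and the active task count is assumed
to be known to the caller.

```python
def effective_instance_count(configured_count, active_task_count):
    """Never let an update shrink the job below its active task count."""
    return max(configured_count, active_task_count)


# A stale config saying 4 instances cannot shrink a job that has been
# scaled out to 10 active tasks; a larger config value still scales out.
stale = effective_instance_count(4, 10)
scale_out = effective_instance_count(12, 10)
```

Under this rule an update can only add or update instances, so any reduction
would have to go through an explicit killTasks call.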

Bill, do you have any ideas of what that safeguard could be?

On Fri, Feb 5, 2016 at 2:56 PM, Bill Farner <wf...@apache.org> wrote:
>>
>> the outdated instance count problem will only get worse as automated
>> scaling tools will quickly render existing .aurora config value obsolete
>
>
> This is not a compelling reason to remove functionality.  Sounds like a
> safeguard is needed instead.

Re: [PROPOSAL] Disallow instance removal in job update

Posted by Bill Farner <wf...@apache.org>.
>
> the outdated instance count problem will only get worse as automated
> scaling tools will quickly render existing .aurora config value obsolete


This is not a compelling reason to remove functionality.  Sounds like a
safeguard is needed instead.
