Posted to dev@aurora.apache.org by Maxim Khutornenko <ma...@apache.org> on 2016/01/14 20:40:16 UTC

[PROPOSAL] Job instance scaling APIs

Our rolling update APIs can be quite inconvenient to work with when it
comes to instance scaling [1]. It's especially frustrating when
adding/removing instances has to be done in an automated fashion (e.g., by
an external autoscaling process), as it requires holding on to the original
Aurora config at all times.

I propose we add simple instance scaling APIs to address the above. Since
an Aurora job may have instances running different configs at any given
moment, I propose we accept an InstanceKey as the reference point when
scaling out. For example:

    /** Scales out a given job by adding more instances with the task
        config of the templateKey. */
    Response scaleOut(1: InstanceKey templateKey, 2: i32 incrementCount)

    /** Scales in a given job by removing existing instances. */
    Response scaleIn(1: JobKey job, 2: i32 decrementCount)

A corresponding client command could then look like:

    aurora job scale-out devcluster/vagrant/test/hello/1 10

For the above command, the scheduler would take the task config of instance
1 of the 'hello' job and replicate it 10 more times, thus adding 10
additional instances to the job.
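
For illustration, here is a minimal sketch of how an external autoscaling
process might drive these two RPCs. The JobKey/InstanceKey dataclasses and
the AuroraClient interface below are illustrative stand-ins that only mirror
the proposed Thrift signatures; they are not an existing client library.

    from dataclasses import dataclass
    from typing import Protocol


    @dataclass(frozen=True)
    class JobKey:
        role: str
        environment: str
        name: str


    @dataclass(frozen=True)
    class InstanceKey:
        job_key: JobKey
        instance_id: int


    class AuroraClient(Protocol):
        def scale_out(self, template_key: InstanceKey, increment_count: int) -> None: ...
        def scale_in(self, job: JobKey, decrement_count: int) -> None: ...


    def reconcile_instances(client: AuroraClient, job: JobKey, current: int,
                            desired: int, template_instance: int = 0) -> None:
        """Move the job's instance count toward the desired target."""
        if desired > current:
            # Replicate the config of an existing instance (the "template").
            client.scale_out(InstanceKey(job, template_instance), desired - current)
        elif desired < current:
            client.scale_in(job, current - desired)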

There are, of course, some details to work out, such as making sure no
active update is in flight and that scaling out does not violate quota. I
intend to address those during implementation as things progress.

Does the above make sense? Any concerns/suggestions?

Thanks,
Maxim

[1] - https://issues.apache.org/jira/browse/AURORA-1258

Re: [PROPOSAL] Job instance scaling APIs

Posted by Bill Farner <wf...@apache.org>.
How is scaling down different from killing instances?

On Thursday, January 14, 2016, Joshua Cohen <jc...@apache.org> wrote:

> What happens if a job has been scaled out, but the underlying config is not
> updated to take that scaling into account? Would the next update on that
> job revert the number of instances (presumably, because what else could we
> do)? Is there anything we can do, tooling-wise, to improve upon this?

Re: [PROPOSAL] Job instance scaling APIs

Posted by Maxim Khutornenko <ma...@apache.org>.
Refactoring killTasks to get rid of TaskQuery sounds good to me. Filed
AURORA-1583.

On Thu, Jan 14, 2016 at 12:33 PM, Bill Farner <wf...@apache.org> wrote:
> Fwiw I think an API call being "too powerful" is not a good reason to
> introduce another.
>
> As for the specific behavior here, I know there has been a desire to
> eliminate the use of TaskQuery in killTasks because it is rarely used, and
> kills spanning jobs can be implemented on top of the API.
>
> On Thursday, January 14, 2016, Maxim Khutornenko <ma...@apache.org> wrote:
>
>> "How is scaling down different from killing instances?"
>>
>> I found 'killTasks' syntax too different and way much more powerful to
>> be used for scaling in. The TaskQuery allows killing instances across
>> jobs/roles, whereas 'scaleIn' is narrowed down to just a single job.
>> Additional benefit: it can be ACLed independently by allowing external
>> process kill tasks only within a given job. We may also add rate
>> limiting or backoff to it later.
>>
>> As for Joshua's question, I feel it should be an operator's
>> responsibility to diff a job with its aurora config before applying an
>> update. That said, if there is enough demand we can definitely
>> consider adding something similar to what George suggested or
>> resurrecting a 'large change' warning message we used to have in
>> client updater.
>>
>> On Thu, Jan 14, 2016 at 12:06 PM, George Sirois <george@tellapart.com> wrote:
>> > As a point of reference, we solved this problem by adding a binding
>> > helper that queries the scheduler for the current number of instances
>> > and uses that number instead of a hardcoded config:
>> >
>> >    instances='{{scaling_instances[60]}}'
>> >
>> > In this example, instances will be set to the currently running number
>> > (unless there are none, in which case 60 instances will be created).
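
For illustration, a rough sketch of the fallback logic behind such a
binding helper; the count_fn callable is a stand-in for a scheduler query
and is an assumption, not an existing Aurora API:

    from typing import Callable

    def resolve_instance_count(count_fn: Callable[[], int], default: int) -> int:
        # Use the live instance count when the job already exists, otherwise
        # fall back to the default baked into the config (e.g. 60 above).
        current = count_fn()
        return current if current > 0 else default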

Re: [PROPOSAL] Job instance scaling APIs

Posted by Bill Farner <wf...@apache.org>.
Fwiw I think an API call being "too powerful" is not a good reason to
introduce another.

As for the specific behavior here, I know there has been a desire to
eliminate the use of TaskQuery in killTasks because it is rarely used, and
kills spanning jobs can be implemented on top of the API.

Re: [PROPOSAL] Job instance scaling APIs

Posted by Tony Dong <td...@twitter.com.INVALID>.
The API change is basically what I had imagined. +1

I had imagined a scale-down command as a wrapper around job kill, although
another approach would be to augment the current job kill command to kill
the last N instances of a job, e.g.:

    aurora job kill devcluster/vagrant/test/hello 10

On Fri, Jan 15, 2016 at 10:06 AM, Maxim Khutornenko <ma...@apache.org>
wrote:

> I wasn't planning on using the rolling updater functionality given the
> simplicity of the operation. I'd second Steve's earlier concerns about
> scaleOut() looking more like startJobUpdate() if we keep adding
> features. If health watching, throttling (batch_size), or rollback on
> failure is required, then I believe startJobUpdate() should be used
> instead of scaleOut().
>
> On Fri, Jan 15, 2016 at 1:09 AM, Erb, Stephan <St...@blue-yonder.com> wrote:
> > I really like the proposal. The gain in simplicity on the client side,
> > by not having to provide an Aurora config, is quite significant.
> >
> > The implementation on the scheduler side is probably rather
> > straightforward, as the update can be reused. That would also provide
> > us with the update UI, which has proven quite useful when tracing
> > autoscaler events.
> >
> > Regards,
> > Stephan
> > ________________________________________
> > From: Maxim Khutornenko <ma...@apache.org>
> > Sent: Thursday, January 14, 2016 9:50 PM
> > To: dev@aurora.apache.org
> > Subject: Re: [PROPOSAL] Job instance scaling APIs
> >
> > "I'd be concerned that any
> > scaling API to be powerful enough to fit all (most) use cases would just
> > end up looking like the update API."
> >
> > There is a big difference between scaleOut and startJobUpdate APIs
> > that justifies the inclusion of the former. Namely, scaleOut may only
> > replicate the existing instances without changing/introducing any new
> > scheduling requirements or performing instance rollout/rollback. I
> > don't see scaleOut ever becoming more powerful to threaten
> > startJobUpdate. At the same time, the absence of aurora config
> > requirement is a huge boost to autoscaling client simplification.
> >
> > "For example, when scaling down we don't just kill the last N instances,
> we
> > actually look at the least loaded hosts (globally) and kill tasks from
> > those."
> >
> > I don't quite see why the same wouldn't be possible with a scaleIn
> > API. Isn't it always external process responsibility to pay due
> > diligence before killing instances?
> >
> >
> > On Thu, Jan 14, 2016 at 12:35 PM, Steve Niemitz <sn...@apache.org> wrote:
> >> As some background, we handle scale up / down purely from the client
> >> side, using the update API for both directions. I'd be concerned that
> >> any scaling API to be powerful enough to fit all (most) use cases would
> >> just end up looking like the update API.
> >>
> >> For example, when scaling down we don't just kill the last N instances,
> >> we actually look at the least loaded hosts (globally) and kill tasks
> >> from those.
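
For illustration, a rough sketch of that selection strategy; the
instance-to-host mapping and the host load figures are assumed to come from
external monitoring, not from an existing Aurora API:

    from typing import Dict, List

    def pick_instances_to_kill(instance_hosts: Dict[int, str],
                               host_load: Dict[str, float],
                               count: int) -> List[int]:
        # Order instances by the load of the host they run on, least loaded
        # first, and take the requested number from the front.
        ordered = sorted(instance_hosts,
                         key=lambda i: host_load.get(instance_hosts[i], 0.0))
        return ordered[:count]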

Re: [PROPOSAL] Job instance scaling APIs

Posted by Maxim Khutornenko <ma...@apache.org>.
Hmm, I have no idea where I got that old version from :) Sounds good to me.

On Tue, Jan 19, 2016 at 5:02 PM, Bill Farner <wf...@apache.org> wrote:
> Nit: your examples above include SessionKey, which no longer exists.
>
> Otherwise LGTM.
>
> On Tue, Jan 19, 2016 at 4:58 PM, Maxim Khutornenko <ma...@apache.org> wrote:
>
>> I think the behavior is spot on.
>>
>> So the plan is to move the current:
>>
>> Response addInstances(
>>     1: AddInstancesConfig config,
>>     2: Lock lock,
>>     3: SessionKey session)
>>
>> to:
>>
>> Response addInstances(
>>     1: AddInstancesConfig config,
>>     2: Lock lock,
>>     3: SessionKey session,
>>     4: InstanceKey instanceKey,
>>     5: int32 instanceCount)
>>
>> In the current release we will fork the behavior depending on whether
>> the InstanceKey is present. In the next release, we drop 1, 2, and 3 and
>> end up with only 4 and 5.
>>
>> Any concerns?
>>
>> On Tue, Jan 19, 2016 at 3:57 PM, Bill Farner <wf...@apache.org> wrote:
>> > Unless the behavior differs significantly, repurposing sounds like the
>> > right thing to do.  We don't *need* a separate RPC to accomplish this, we
>> > can just add arguments or struct fields to accomplish dual use.
>> >
>> > On Tue, Jan 19, 2016 at 3:53 PM, Maxim Khutornenko <ma...@apache.org> wrote:
>> >
>> >> That thought crossed my mind but due to addInstances() being reserved
>> >> for an existing RPC I filed AURORA-1581 to handle its graceful
>> >> deprecation first.
>> >>
>> >> Bill, are you suggesting re-purposing the existing 'addInstances()'
>> >> into a 'scaleOut()' equivalent? I am mostly +1 on the change but if we
>> >> do it within the same release we will not be following our deprecation
>> >> guidelines. Perhaps we can name it something like 'addTaskInstances()'
>> >> instead and let addInstances() go away naturally?
>> >>
>> >> On Tue, Jan 19, 2016 at 3:06 PM, Tony Dong <td...@twitter.com.invalid>
>> >> wrote:
>> >> > +1 to addInstances
>> >> >
>> >> > On Tue, Jan 19, 2016 at 3:00 PM, Bill Farner <wf...@apache.org> wrote:
>> >> >
>> >> >> At risk of devolving the discussion, is it worth calling the method
>> >> >> addInstances as opposed to scaleOut? I find the former more
>> >> >> descriptive.
>> >> >>
>> >> >> On Tue, Jan 19, 2016 at 11:12 AM, Maxim Khutornenko <maxim@apache.org> wrote:
>> >> >>
>> >> >> > "Of course, the scaler could manually health check that all
>> instances
>> >> >> > have come up and are being used as expected, but I guess that is
>> what
>> >> >> > Aurora is for."
>> >> >> >
>> >> >> > I'd argue the updater "watch_secs" health checking isn't enough to
>> >> >> > ensure graceful rollout as instances may start flapping right after
>> >> >> > the updater signs off. Instances outside of update window may also
>> >> >> > flap (e.g. due to backend pressure) and updater will not be able to
>> >> >> > catch that. That's why a robust autoscaler has to rely on external
>> >> >> > monitoring tools and overall job health instead.
>> >> >> >
>> >> >> > A very basic approach, as you mentioned above, could be querying
>> job
>> >> >> > status repeatedly and count the ratio of tasks in RUNNING vs active
>> >> >> > (ASSIGNED, PENDING, THROTTLED, STARTING, etc.) states in order to
>> make
>> >> >> > a scaleOut decision. The more reliable approach though would also
>> rely
>> >> >> > on external monitoring stats exposed by user processes. That would
>> be
>> >> >> > a much higher fidelity signal than a decision based on task status
>> >> >> > alone. Scheduler does not (and should not for scalability reasons)
>> >> >> > have visibility into those stats, so the autoscaler would be in a
>> much
>> >> >> > better position to make an executive decision there.
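
For illustration, a rough sketch of that basic RUNNING-vs-active check; the
task state list is assumed to come from a scheduler task status query:

    from typing import List

    ACTIVE_STATES = {"PENDING", "ASSIGNED", "THROTTLED", "STARTING", "RUNNING"}

    def job_looks_healthy(task_states: List[str], threshold: float = 0.9) -> bool:
        # Only consider a scale-out when most active tasks are actually RUNNING.
        active = [s for s in task_states if s in ACTIVE_STATES]
        if not active:
            return False
        return active.count("RUNNING") / len(active) >= threshold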
>> >> >> >
>> >> >> > On Sun, Jan 17, 2016 at 9:00 AM, Erb, Stephan <St...@blue-yonder.com> wrote:
>> >> >> > > I believe the operation is not that simple when you look at the
>> >> >> > > end-to-end scenario.
>> >> >> > >
>> >> >> > > For example, the implementation of an auto-scaler using the new
>> >> >> > > scaleOut() API could look like:
>> >> >> > >
>> >> >> > > 1) Check some KPI.
>> >> >> > > 2) Infer an action based on this KPI, such as scaleUp() or
>> >> >> > >    scaleDown().
>> >> >> > > 3) Wait until the effects of the adjusted instance count are
>> >> >> > >    reflected in the KPI. Go to 1 and repeat.
>> >> >> > >
>> >> >> > > The health checking capabilities of the existing updater (in
>> >> >> > > particular together with [1]) would be really helpful here.
>> >> >> > > Still, the simplified scaleOut() API would offer the great
>> >> >> > > benefit that the auto-scaler would not need to know about the
>> >> >> > > Aurora configuration in use.
>> >> >> > >
>> >> >> > > We even had an incident with a sub-optimal implementation of
>> >> >> > > step 3): an overloaded package backend led to slow service
>> >> >> > > startups. The service startup took longer than the grace period
>> >> >> > > of our auto-scaler. It therefore decided to add more and more
>> >> >> > > instances, because the KPI wasn't improving as expected. It had
>> >> >> > > no way of knowing that these instances were not even 'running'.
>> >> >> > > The additionally added instances aggravated the overload
>> >> >> > > situation of the package backend. Of course, the scaler could
>> >> >> > > manually health check that all instances have come up and are
>> >> >> > > being used as expected, but I guess that is what Aurora is for.
>> >> >> > >
>> >> >> > > [1] https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit?pref=2&pli=1#heading=h.n0kb37aiy8ua
>> >> >> > >
>> >> >> > > Best Regards,
>> >> >> > > Stephan

Re: [PROPOSAL] Job instance scaling APIs

Posted by Bill Farner <wf...@apache.org>.
Nit: your examples above include SessionKey, which no longer exists.

Otherwise LGTM.


Re: [PROPOSAL] Job instance scaling APIs

Posted by Maxim Khutornenko <ma...@apache.org>.
I think the behavior is spot on.

So the plan is to move the current:

Response addInstances(
    1: AddInstancesConfig config,
    2: Lock lock,
    3: SessionKey session)

to:

Response addInstances(
    1: AddInstancesConfig config,
    2: Lock lock,
    3: SessionKey session,
    4: InstanceKey instanceKey,
    5: int32 instanceCount)

In the current release we will fork the behavior depending on whether
the InstanceKey is present. In the next release, we drop 1, 2, and 3 and
end up with only 4 and 5.

Any concerns?
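
For illustration, a rough sketch of the forked behavior during the
transition (Python pseudocode; the actual handler lives in the Java
scheduler, and the return values here are placeholders):

    def add_instances(config=None, lock=None, session=None,
                      instance_key=None, instance_count=None):
        if instance_key is not None:
            # New behavior: replicate the task config of the referenced instance.
            return ("scale_out", instance_key, instance_count)
        # Legacy behavior (AddInstancesConfig/Lock/SessionKey), kept for one
        # release and dropped in the next.
        return ("legacy_add", config, lock, session)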

>> out.
>> >> > For
>> >> > >>>> >> > example:
>> >> > >>>> >> >
>> >> > >>>> >> >     /** Scales out a given job by adding more instances with
>> >> the
>> >> > task
>> >> > >>>> >> > config of the templateKey. */
>> >> > >>>> >> >     Response scaleOut(1: InstanceKey templateKey, 2: i32
>> >> > >>>> incrementCount)
>> >> > >>>> >> >
>> >> > >>>> >> >     /** Scales in a given job by removing existing
>> instances.
>> >> */
>> >> > >>>> >> >     Response scaleIn(1: JobKey job, 2: i32 decrementCount)
>> >> > >>>> >> >
>> >> > >>>> >> > A correspondent client command could then look like:
>> >> > >>>> >> >
>> >> > >>>> >> >     aurora job scale-out devcluster/vagrant/test/hello/1 10
>> >> > >>>> >> >
>> >> > >>>> >> > For the above command, a scheduler would take task config of
>> >> > instance
>> >> > >>>> 1
>> >> > >>>> >> of
>> >> > >>>> >> > the 'hello' job and replicate it 10 more times thus adding
>> 10
>> >> > >>>> additional
>> >> > >>>> >> > instances to the job.
>> >> > >>>> >> >
>> >> > >>>> >> > There are, of course, some details to work out like making
>> sure
>> >> > no
>> >> > >>>> active
>> >> > >>>> >> > update is in flight, scale out does not violate quota and
>> etc.
>> >> I
>> >> > >>>> intend
>> >> > >>>> >> to
>> >> > >>>> >> > address those during the implementation as things progress.
>> >> > >>>> >> >
>> >> > >>>> >> > Does the above make sense? Any concerns/suggestions?
>> >> > >>>> >> >
>> >> > >>>> >> > Thanks,
>> >> > >>>> >> > Maxim
>> >> > >>>> >> >
>> >> > >>>> >> > [1] - https://issues.apache.org/jira/browse/AURORA-1258
>> >> > >>>> >> >
>> >> > >>>> >>
>> >> > >>>>
>> >> >
>> >>
>>

Re: [PROPOSAL] Job instance scaling APIs

Posted by Bill Farner <wf...@apache.org>.
Unless the behavior differs significantly, repurposing sounds like the
right thing to do.  We don't *need* a separate RPC for this; we can just
add arguments or struct fields to support dual use.
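
For illustration, a rough Python sketch of what that dual use could
look like from a caller's perspective (the client handle, the
JobKey/InstanceKey shapes, and the exact addInstances() signature
below are assumptions for the sake of example, not the final API):

    # Caller-side sketch only; 'client' and addInstances(key, count) are
    # assumed shapes, not a committed Aurora API.
    from collections import namedtuple

    JobKey = namedtuple('JobKey', ['role', 'environment', 'name'])
    InstanceKey = namedtuple('InstanceKey', ['jobKey', 'instanceId'])

    def scale_out(client, job_key, template_instance_id, count):
        """Add 'count' instances shaped like an existing instance of the job."""
        template = InstanceKey(jobKey=job_key, instanceId=template_instance_id)
        return client.addInstances(template, count)

    # e.g. replicate instance 1 of vagrant/test/hello ten more times:
    # scale_out(client, JobKey('vagrant', 'test', 'hello'), 1, 10)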

On Tue, Jan 19, 2016 at 3:53 PM, Maxim Khutornenko <ma...@apache.org> wrote:

> That thought crossed my mind but due to addInstances() being reserved
> for an existing RPC I filed AURORA-1581 to handle its graceful
> deprecation first.
>
> Bill, are you suggesting re-purposing the existing 'addInstances()'
> into a 'scaleOut()' equivalent? I am mostly +1 on the change but if we
> do it within the same release we will not be following our deprecation
> guidelines. Perhaps we can name it something like 'addTaskInstances()'
> instead and let addInstances() go away naturally?
>
> On Tue, Jan 19, 2016 at 3:06 PM, Tony Dong <td...@twitter.com.invalid>
> wrote:
> > +1 to addInstances
> >
> > On Tue, Jan 19, 2016 at 3:00 PM, Bill Farner <wf...@apache.org> wrote:
> >
> >> At risk of devolving the discussion, is it worth calling the method
> >> addInstances as opposed to scaleOut?  I find the former more
> descriptive.
> >>
> >> On Tue, Jan 19, 2016 at 11:12 AM, Maxim Khutornenko <ma...@apache.org>
> >> wrote:
> >>
> >> > "Of course, the scaler could manually health check that all instances
> >> > have come up and are being used as expected, but I guess that is what
> >> > Aurora is for."
> >> >
> >> > I'd argue the updater "watch_secs" health checking isn't enough to
> >> > ensure graceful rollout as instances may start flapping right after
> >> > the updater signs off. Instances outside of update window may also
> >> > flap (e.g. due to backend pressure) and updater will not be able to
> >> > catch that. That's why a robust autoscaler has to rely on external
> >> > monitoring tools and overall job health instead.
> >> >
> >> > A very basic approach, as you mentioned above, could be querying job
> >> > status repeatedly and count the ratio of tasks in RUNNING vs active
> >> > (ASSIGNED, PENDING, THROTTLED, STARTING, etc.) states in order to make
> >> > a scaleOut decision. The more reliable approach though would also rely
> >> > on external monitoring stats exposed by user processes. That would be
> >> > a much higher fidelity signal than a decision based on task status
> >> > alone. Scheduler does not (and should not for scalability reasons)
> >> > have visibility into those stats, so the autoscaler would be in a much
> >> > better position to make an executive decision there.
> >> >
> >> > On Sun, Jan 17, 2016 at 9:00 AM, Erb, Stephan
> >> > <St...@blue-yonder.com> wrote:
> >> > > I believe the operation is not that simple when you look at the
> >> > end-to-end scenario.
> >> > >
> >> > > For example, the implementation of an auto-scaler  using the new
> >> > scaleOut() API could look like:
> >> > >
> >> > > 1) check some KPI
> >> > > 2) Infer an action based on this KPI such as scaleUp() or
> scaleDown()
> >> > > 3) wait until the effects of the adjusted instance count is
> reflected
> >> in
> >> > the KPI. Go to  1 and repeat.
> >> > >
> >> > > The health checking capabilities of the existing updater (in
> particular
> >> > together with [1]) would be really helpful here. Still, the simplified
> >> > scaleOut() API would offer the great benefit that the auto-scaler
> would
> >> not
> >> > need to know about the used aurora configuration.
> >> > >
> >> > > We even had an incident with a sub-optimal implementation of step
> 3):
> >> An
> >> > overloaded package backend lead to slow service startups. The service
> >> > startup took longer than the grace-period of our auto-scaler. It
> >> therefore
> >> > decided to add more and more instances, because the KPI wasn't
> improving
> >> as
> >> > expected. It had no way of knowing that these instances were not even
> >> > 'running'. The additionally added instances aggravated the overload
> >> > situation of the package backend.  Of course, the scaler could
> manually
> >> > health check that all instances have come up and are being used as
> >> > expected, but I guess that is what Aurora is for.
> >> > >
> >> > > [1]
> >> >
> >>
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit?pref=2&pli=1#heading=h.n0kb37aiy8ua
> >> > >
> >> > > Best Regards,
> >> > > Stephan
> >> > > ________________________________________
> >> > > From: Maxim Khutornenko <ma...@apache.org>
> >> > > Sent: Friday, January 15, 2016 7:06 PM
> >> > > To: dev@aurora.apache.org
> >> > > Subject: Re: [PROPOSAL] Job instance scaling APIs
> >> > >
> >> > > I wasn't planning on using the rolling updater functionality given
> the
> >> > > simplicity of the operation. I'd second Steve's earlier concerns
> about
> >> > > scaleOut() looking more like startJobUpdate() if we keep adding
> >> > > features. If health watching, throttling (batch_size) or rollback on
> >> > > failure is required then I believe the startJobUpdate() should be
> used
> >> > > instead of scaleOut().
> >> > >
> >> > > On Fri, Jan 15, 2016 at 1:09 AM, Erb, Stephan
> >> > > <St...@blue-yonder.com> wrote:
> >> > >> I really like the proposal. The gain in simplicity on the
> client-side
> >> > by not having to provide an aurora config is quite significant.
> >> > >>
> >> > >> The implementation on the scheduler side is probably rather
> straight
> >> > forward as the update can be reused. That would also provide us with
> the
> >> > update UI, which has shown to be quite useful when tracing autoscaler
> >> > events.
> >> > >>
> >> > >> Regards,
> >> > >> Stephan
> >> > >> ________________________________________
> >> > >> From: Maxim Khutornenko <ma...@apache.org>
> >> > >> Sent: Thursday, January 14, 2016 9:50 PM
> >> > >> To: dev@aurora.apache.org
> >> > >> Subject: Re: [PROPOSAL] Job instance scaling APIs
> >> > >>
> >> > >> "I'd be concerned that any
> >> > >> scaling API to be powerful enough to fit all (most) use cases would
> >> just
> >> > >> end up looking like the update API."
> >> > >>
> >> > >> There is a big difference between scaleOut and startJobUpdate APIs
> >> > >> that justifies the inclusion of the former. Namely, scaleOut may
> only
> >> > >> replicate the existing instances without changing/introducing any
> new
> >> > >> scheduling requirements or performing instance rollout/rollback. I
> >> > >> don't see scaleOut ever becoming more powerful to threaten
> >> > >> startJobUpdate. At the same time, the absence of aurora config
> >> > >> requirement is a huge boost to autoscaling client simplification.
> >> > >>
> >> > >> "For example, when scaling down we don't just kill the last N
> >> > instances, we
> >> > >> actually look at the least loaded hosts (globally) and kill tasks
> from
> >> > >> those."
> >> > >>
> >> > >> I don't quite see why the same wouldn't be possible with a scaleIn
> >> > >> API. Isn't it always external process responsibility to pay due
> >> > >> diligence before killing instances?
> >> > >>
> >> > >>
> >> > >> On Thu, Jan 14, 2016 at 12:35 PM, Steve Niemitz <
> sniemitz@apache.org>
> >> > wrote:
> >> > >>> As some background, we handle scale up / down purely from the
> client
> >> > side,
> >> > >>> using the update API for both directions.  I'd be concerned that
> any
> >> > >>> scaling API to be powerful enough to fit all (most) use cases
> would
> >> > just
> >> > >>> end up looking like the update API.
> >> > >>>
> >> > >>> For example, when scaling down we don't just kill the last N
> >> > instances, we
> >> > >>> actually look at the least loaded hosts (globally) and kill tasks
> >> from
> >> > >>> those.
> >> > >>>
> >> > >>>
> >> > >>> On Thu, Jan 14, 2016 at 3:28 PM, Maxim Khutornenko <
> maxim@apache.org
> >> >
> >> > wrote:
> >> > >>>
> >> > >>>> "How is scaling down different from killing instances?"
> >> > >>>>
> >> > >>>> I found 'killTasks' syntax too different and way much more
> powerful
> >> to
> >> > >>>> be used for scaling in. The TaskQuery allows killing instances
> >> across
> >> > >>>> jobs/roles, whereas 'scaleIn' is narrowed down to just a single
> job.
> >> > >>>> Additional benefit: it can be ACLed independently by allowing
> >> external
> >> > >>>> process kill tasks only within a given job. We may also add rate
> >> > >>>> limiting or backoff to it later.
> >> > >>>>
> >> > >>>> As for Joshua's question, I feel it should be an operator's
> >> > >>>> responsibility to diff a job with its aurora config before
> applying
> >> an
> >> > >>>> update. That said, if there is enough demand we can definitely
> >> > >>>> consider adding something similar to what George suggested or
> >> > >>>> resurrecting a 'large change' warning message we used to have in
> >> > >>>> client updater.
> >> > >>>>
> >> > >>>> On Thu, Jan 14, 2016 at 12:06 PM, George Sirois <
> >> george@tellapart.com
> >> > >
> >> > >>>> wrote:
> >> > >>>> > As a point of reference, we solved this problem by adding a
> >> binding
> >> > >>>> helper
> >> > >>>> > that queries the scheduler for the current number of instances
> and
> >> > uses
> >> > >>>> > that number instead of a hardcoded config:
> >> > >>>> >
> >> > >>>> >    instances='{{scaling_instances[60]}}'
> >> > >>>> >
> >> > >>>> > In this example, instances will be set to the currently running
> >> > number
> >> > >>>> > (unless there are none, in which case 60 instances will be
> >> created).
> >> > >>>> >
> >> > >>>> > On Thu, Jan 14, 2016 at 2:44 PM, Joshua Cohen <
> jcohen@apache.org>
> >> > wrote:
> >> > >>>> >
> >> > >>>> >> What happens if a job has been scaled out, but the underlying
> >> > config is
> >> > >>>> not
> >> > >>>> >> updated to take that scaling into account? Would the next
> update
> >> > on that
> >> > >>>> >> job revert the number of instances (presumably, because what
> else
> >> > could
> >> > >>>> we
> >> > >>>> >> do)? Is there anything we can do, tooling-wise, to improve
> upon
> >> > this?
> >> > >>>> >>
> >> > >>>> >> On Thu, Jan 14, 2016 at 1:40 PM, Maxim Khutornenko <
> >> > maxim@apache.org>
> >> > >>>> >> wrote:
> >> > >>>> >>
> >> > >>>> >> > Our rolling update APIs can be quite inconvenient to work
> with
> >> > when it
> >> > >>>> >> > comes to instance scaling [1]. It's especially frustrating
> when
> >> > >>>> >> > adding/removing instances has to be done in an automated
> >> fashion
> >> > >>>> (e.g.:
> >> > >>>> >> by
> >> > >>>> >> > an external autoscaling process) as it requires holding on
> to
> >> the
> >> > >>>> >> original
> >> > >>>> >> > aurora config at all times.
> >> > >>>> >> >
> >> > >>>> >> > I propose we add simple instance scaling APIs to address the
> >> > above.
> >> > >>>> Since
> >> > >>>> >> > Aurora job may have instances at different configs at any
> >> > moment, I
> >> > >>>> >> propose
> >> > >>>> >> > we accept an InstanceKey as a reference point when scaling
> out.
> >> > For
> >> > >>>> >> > example:
> >> > >>>> >> >
> >> > >>>> >> >     /** Scales out a given job by adding more instances with
> >> the
> >> > task
> >> > >>>> >> > config of the templateKey. */
> >> > >>>> >> >     Response scaleOut(1: InstanceKey templateKey, 2: i32
> >> > >>>> incrementCount)
> >> > >>>> >> >
> >> > >>>> >> >     /** Scales in a given job by removing existing
> instances.
> >> */
> >> > >>>> >> >     Response scaleIn(1: JobKey job, 2: i32 decrementCount)
> >> > >>>> >> >
> >> > >>>> >> > A correspondent client command could then look like:
> >> > >>>> >> >
> >> > >>>> >> >     aurora job scale-out devcluster/vagrant/test/hello/1 10
> >> > >>>> >> >
> >> > >>>> >> > For the above command, a scheduler would take task config of
> >> > instance
> >> > >>>> 1
> >> > >>>> >> of
> >> > >>>> >> > the 'hello' job and replicate it 10 more times thus adding
> 10
> >> > >>>> additional
> >> > >>>> >> > instances to the job.
> >> > >>>> >> >
> >> > >>>> >> > There are, of course, some details to work out like making
> sure
> >> > no
> >> > >>>> active
> >> > >>>> >> > update is in flight, scale out does not violate quota and
> etc.
> >> I
> >> > >>>> intend
> >> > >>>> >> to
> >> > >>>> >> > address those during the implementation as things progress.
> >> > >>>> >> >
> >> > >>>> >> > Does the above make sense? Any concerns/suggestions?
> >> > >>>> >> >
> >> > >>>> >> > Thanks,
> >> > >>>> >> > Maxim
> >> > >>>> >> >
> >> > >>>> >> > [1] - https://issues.apache.org/jira/browse/AURORA-1258
> >> > >>>> >> >
> >> > >>>> >>
> >> > >>>>
> >> >
> >>
>

Re: [PROPOSAL] Job instance scaling APIs

Posted by Maxim Khutornenko <ma...@apache.org>.
That thought crossed my mind, but since addInstances() is already taken
by an existing RPC, I filed AURORA-1581 to handle its graceful
deprecation first.

Bill, are you suggesting re-purposing the existing 'addInstances()'
into a 'scaleOut()' equivalent? I am mostly +1 on the change, but if we
do it within the same release we will not be following our deprecation
guidelines. Perhaps we can name the new RPC something like
'addTaskInstances()' instead and let addInstances() go away naturally?

On Tue, Jan 19, 2016 at 3:06 PM, Tony Dong <td...@twitter.com.invalid> wrote:
> +1 to addInstances
>
> On Tue, Jan 19, 2016 at 3:00 PM, Bill Farner <wf...@apache.org> wrote:
>
>> At risk of devolving the discussion, is it worth calling the method
>> addInstances as opposed to scaleOut?  I find the former more descriptive.
>>
>> On Tue, Jan 19, 2016 at 11:12 AM, Maxim Khutornenko <ma...@apache.org>
>> wrote:
>>
>> > "Of course, the scaler could manually health check that all instances
>> > have come up and are being used as expected, but I guess that is what
>> > Aurora is for."
>> >
>> > I'd argue the updater "watch_secs" health checking isn't enough to
>> > ensure graceful rollout as instances may start flapping right after
>> > the updater signs off. Instances outside of update window may also
>> > flap (e.g. due to backend pressure) and updater will not be able to
>> > catch that. That's why a robust autoscaler has to rely on external
>> > monitoring tools and overall job health instead.
>> >
>> > A very basic approach, as you mentioned above, could be querying job
>> > status repeatedly and count the ratio of tasks in RUNNING vs active
>> > (ASSIGNED, PENDING, THROTTLED, STARTING, etc.) states in order to make
>> > a scaleOut decision. The more reliable approach though would also rely
>> > on external monitoring stats exposed by user processes. That would be
>> > a much higher fidelity signal than a decision based on task status
>> > alone. Scheduler does not (and should not for scalability reasons)
>> > have visibility into those stats, so the autoscaler would be in a much
>> > better position to make an executive decision there.
>> >
>> > On Sun, Jan 17, 2016 at 9:00 AM, Erb, Stephan
>> > <St...@blue-yonder.com> wrote:
>> > > I believe the operation is not that simple when you look at the
>> > end-to-end scenario.
>> > >
>> > > For example, the implementation of an auto-scaler  using the new
>> > scaleOut() API could look like:
>> > >
>> > > 1) check some KPI
>> > > 2) Infer an action based on this KPI such as scaleUp() or scaleDown()
>> > > 3) wait until the effects of the adjusted instance count is reflected
>> in
>> > the KPI. Go to  1 and repeat.
>> > >
>> > > The health checking capabilities of the existing updater (in particular
>> > together with [1]) would be really helpful here. Still, the simplified
>> > scaleOut() API would offer the great benefit that the auto-scaler would
>> not
>> > need to know about the used aurora configuration.
>> > >
>> > > We even had an incident with a sub-optimal implementation of step 3):
>> An
>> > overloaded package backend lead to slow service startups. The service
>> > startup took longer than the grace-period of our auto-scaler. It
>> therefore
>> > decided to add more and more instances, because the KPI wasn't improving
>> as
>> > expected. It had no way of knowing that these instances were not even
>> > 'running'. The additionally added instances aggravated the overload
>> > situation of the package backend.  Of course, the scaler could manually
>> > health check that all instances have come up and are being used as
>> > expected, but I guess that is what Aurora is for.
>> > >
>> > > [1]
>> >
>> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit?pref=2&pli=1#heading=h.n0kb37aiy8ua
>> > >
>> > > Best Regards,
>> > > Stephan
>> > > ________________________________________
>> > > From: Maxim Khutornenko <ma...@apache.org>
>> > > Sent: Friday, January 15, 2016 7:06 PM
>> > > To: dev@aurora.apache.org
>> > > Subject: Re: [PROPOSAL] Job instance scaling APIs
>> > >
>> > > I wasn't planning on using the rolling updater functionality given the
>> > > simplicity of the operation. I'd second Steve's earlier concerns about
>> > > scaleOut() looking more like startJobUpdate() if we keep adding
>> > > features. If health watching, throttling (batch_size) or rollback on
>> > > failure is required then I believe the startJobUpdate() should be used
>> > > instead of scaleOut().
>> > >
>> > > On Fri, Jan 15, 2016 at 1:09 AM, Erb, Stephan
>> > > <St...@blue-yonder.com> wrote:
>> > >> I really like the proposal. The gain in simplicity on the client-side
>> > by not having to provide an aurora config is quite significant.
>> > >>
>> > >> The implementation on the scheduler side is probably rather straight
>> > forward as the update can be reused. That would also provide us with the
>> > update UI, which has shown to be quite useful when tracing autoscaler
>> > events.
>> > >>
>> > >> Regards,
>> > >> Stephan
>> > >> ________________________________________
>> > >> From: Maxim Khutornenko <ma...@apache.org>
>> > >> Sent: Thursday, January 14, 2016 9:50 PM
>> > >> To: dev@aurora.apache.org
>> > >> Subject: Re: [PROPOSAL] Job instance scaling APIs
>> > >>
>> > >> "I'd be concerned that any
>> > >> scaling API to be powerful enough to fit all (most) use cases would
>> just
>> > >> end up looking like the update API."
>> > >>
>> > >> There is a big difference between scaleOut and startJobUpdate APIs
>> > >> that justifies the inclusion of the former. Namely, scaleOut may only
>> > >> replicate the existing instances without changing/introducing any new
>> > >> scheduling requirements or performing instance rollout/rollback. I
>> > >> don't see scaleOut ever becoming more powerful to threaten
>> > >> startJobUpdate. At the same time, the absence of aurora config
>> > >> requirement is a huge boost to autoscaling client simplification.
>> > >>
>> > >> "For example, when scaling down we don't just kill the last N
>> > instances, we
>> > >> actually look at the least loaded hosts (globally) and kill tasks from
>> > >> those."
>> > >>
>> > >> I don't quite see why the same wouldn't be possible with a scaleIn
>> > >> API. Isn't it always external process responsibility to pay due
>> > >> diligence before killing instances?
>> > >>
>> > >>
>> > >> On Thu, Jan 14, 2016 at 12:35 PM, Steve Niemitz <sn...@apache.org>
>> > wrote:
>> > >>> As some background, we handle scale up / down purely from the client
>> > side,
>> > >>> using the update API for both directions.  I'd be concerned that any
>> > >>> scaling API to be powerful enough to fit all (most) use cases would
>> > just
>> > >>> end up looking like the update API.
>> > >>>
>> > >>> For example, when scaling down we don't just kill the last N
>> > instances, we
>> > >>> actually look at the least loaded hosts (globally) and kill tasks
>> from
>> > >>> those.
>> > >>>
>> > >>>
>> > >>> On Thu, Jan 14, 2016 at 3:28 PM, Maxim Khutornenko <maxim@apache.org
>> >
>> > wrote:
>> > >>>
>> > >>>> "How is scaling down different from killing instances?"
>> > >>>>
>> > >>>> I found 'killTasks' syntax too different and way much more powerful
>> to
>> > >>>> be used for scaling in. The TaskQuery allows killing instances
>> across
>> > >>>> jobs/roles, whereas 'scaleIn' is narrowed down to just a single job.
>> > >>>> Additional benefit: it can be ACLed independently by allowing
>> external
>> > >>>> process kill tasks only within a given job. We may also add rate
>> > >>>> limiting or backoff to it later.
>> > >>>>
>> > >>>> As for Joshua's question, I feel it should be an operator's
>> > >>>> responsibility to diff a job with its aurora config before applying
>> an
>> > >>>> update. That said, if there is enough demand we can definitely
>> > >>>> consider adding something similar to what George suggested or
>> > >>>> resurrecting a 'large change' warning message we used to have in
>> > >>>> client updater.
>> > >>>>
>> > >>>> On Thu, Jan 14, 2016 at 12:06 PM, George Sirois <
>> george@tellapart.com
>> > >
>> > >>>> wrote:
>> > >>>> > As a point of reference, we solved this problem by adding a
>> binding
>> > >>>> helper
>> > >>>> > that queries the scheduler for the current number of instances and
>> > uses
>> > >>>> > that number instead of a hardcoded config:
>> > >>>> >
>> > >>>> >    instances='{{scaling_instances[60]}}'
>> > >>>> >
>> > >>>> > In this example, instances will be set to the currently running
>> > number
>> > >>>> > (unless there are none, in which case 60 instances will be
>> created).
>> > >>>> >
>> > >>>> > On Thu, Jan 14, 2016 at 2:44 PM, Joshua Cohen <jc...@apache.org>
>> > wrote:
>> > >>>> >
>> > >>>> >> What happens if a job has been scaled out, but the underlying
>> > config is
>> > >>>> not
>> > >>>> >> updated to take that scaling into account? Would the next update
>> > on that
>> > >>>> >> job revert the number of instances (presumably, because what else
>> > could
>> > >>>> we
>> > >>>> >> do)? Is there anything we can do, tooling-wise, to improve upon
>> > this?
>> > >>>> >>
>> > >>>> >> On Thu, Jan 14, 2016 at 1:40 PM, Maxim Khutornenko <
>> > maxim@apache.org>
>> > >>>> >> wrote:
>> > >>>> >>
>> > >>>> >> > Our rolling update APIs can be quite inconvenient to work with
>> > when it
>> > >>>> >> > comes to instance scaling [1]. It's especially frustrating when
>> > >>>> >> > adding/removing instances has to be done in an automated
>> fashion
>> > >>>> (e.g.:
>> > >>>> >> by
>> > >>>> >> > an external autoscaling process) as it requires holding on to
>> the
>> > >>>> >> original
>> > >>>> >> > aurora config at all times.
>> > >>>> >> >
>> > >>>> >> > I propose we add simple instance scaling APIs to address the
>> > above.
>> > >>>> Since
>> > >>>> >> > Aurora job may have instances at different configs at any
>> > moment, I
>> > >>>> >> propose
>> > >>>> >> > we accept an InstanceKey as a reference point when scaling out.
>> > For
>> > >>>> >> > example:
>> > >>>> >> >
>> > >>>> >> >     /** Scales out a given job by adding more instances with
>> the
>> > task
>> > >>>> >> > config of the templateKey. */
>> > >>>> >> >     Response scaleOut(1: InstanceKey templateKey, 2: i32
>> > >>>> incrementCount)
>> > >>>> >> >
>> > >>>> >> >     /** Scales in a given job by removing existing instances.
>> */
>> > >>>> >> >     Response scaleIn(1: JobKey job, 2: i32 decrementCount)
>> > >>>> >> >
>> > >>>> >> > A correspondent client command could then look like:
>> > >>>> >> >
>> > >>>> >> >     aurora job scale-out devcluster/vagrant/test/hello/1 10
>> > >>>> >> >
>> > >>>> >> > For the above command, a scheduler would take task config of
>> > instance
>> > >>>> 1
>> > >>>> >> of
>> > >>>> >> > the 'hello' job and replicate it 10 more times thus adding 10
>> > >>>> additional
>> > >>>> >> > instances to the job.
>> > >>>> >> >
>> > >>>> >> > There are, of course, some details to work out like making sure
>> > no
>> > >>>> active
>> > >>>> >> > update is in flight, scale out does not violate quota and etc.
>> I
>> > >>>> intend
>> > >>>> >> to
>> > >>>> >> > address those during the implementation as things progress.
>> > >>>> >> >
>> > >>>> >> > Does the above make sense? Any concerns/suggestions?
>> > >>>> >> >
>> > >>>> >> > Thanks,
>> > >>>> >> > Maxim
>> > >>>> >> >
>> > >>>> >> > [1] - https://issues.apache.org/jira/browse/AURORA-1258
>> > >>>> >> >
>> > >>>> >>
>> > >>>>
>> >
>>

Re: [PROPOSAL] Job instance scaling APIs

Posted by Tony Dong <td...@twitter.com.INVALID>.
+1 to addInstances

On Tue, Jan 19, 2016 at 3:00 PM, Bill Farner <wf...@apache.org> wrote:

> At risk of devolving the discussion, is it worth calling the method
> addInstances as opposed to scaleOut?  I find the former more descriptive.
>
> On Tue, Jan 19, 2016 at 11:12 AM, Maxim Khutornenko <ma...@apache.org>
> wrote:
>
> > "Of course, the scaler could manually health check that all instances
> > have come up and are being used as expected, but I guess that is what
> > Aurora is for."
> >
> > I'd argue the updater "watch_secs" health checking isn't enough to
> > ensure graceful rollout as instances may start flapping right after
> > the updater signs off. Instances outside of update window may also
> > flap (e.g. due to backend pressure) and updater will not be able to
> > catch that. That's why a robust autoscaler has to rely on external
> > monitoring tools and overall job health instead.
> >
> > A very basic approach, as you mentioned above, could be querying job
> > status repeatedly and count the ratio of tasks in RUNNING vs active
> > (ASSIGNED, PENDING, THROTTLED, STARTING, etc.) states in order to make
> > a scaleOut decision. The more reliable approach though would also rely
> > on external monitoring stats exposed by user processes. That would be
> > a much higher fidelity signal than a decision based on task status
> > alone. Scheduler does not (and should not for scalability reasons)
> > have visibility into those stats, so the autoscaler would be in a much
> > better position to make an executive decision there.
> >
> > On Sun, Jan 17, 2016 at 9:00 AM, Erb, Stephan
> > <St...@blue-yonder.com> wrote:
> > > I believe the operation is not that simple when you look at the
> > end-to-end scenario.
> > >
> > > For example, the implementation of an auto-scaler  using the new
> > scaleOut() API could look like:
> > >
> > > 1) check some KPI
> > > 2) Infer an action based on this KPI such as scaleUp() or scaleDown()
> > > 3) wait until the effects of the adjusted instance count is reflected
> in
> > the KPI. Go to  1 and repeat.
> > >
> > > The health checking capabilities of the existing updater (in particular
> > together with [1]) would be really helpful here. Still, the simplified
> > scaleOut() API would offer the great benefit that the auto-scaler would
> not
> > need to know about the used aurora configuration.
> > >
> > > We even had an incident with a sub-optimal implementation of step 3):
> An
> > overloaded package backend lead to slow service startups. The service
> > startup took longer than the grace-period of our auto-scaler. It
> therefore
> > decided to add more and more instances, because the KPI wasn't improving
> as
> > expected. It had no way of knowing that these instances were not even
> > 'running'. The additionally added instances aggravated the overload
> > situation of the package backend.  Of course, the scaler could manually
> > health check that all instances have come up and are being used as
> > expected, but I guess that is what Aurora is for.
> > >
> > > [1]
> >
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit?pref=2&pli=1#heading=h.n0kb37aiy8ua
> > >
> > > Best Regards,
> > > Stephan
> > > ________________________________________
> > > From: Maxim Khutornenko <ma...@apache.org>
> > > Sent: Friday, January 15, 2016 7:06 PM
> > > To: dev@aurora.apache.org
> > > Subject: Re: [PROPOSAL] Job instance scaling APIs
> > >
> > > I wasn't planning on using the rolling updater functionality given the
> > > simplicity of the operation. I'd second Steve's earlier concerns about
> > > scaleOut() looking more like startJobUpdate() if we keep adding
> > > features. If health watching, throttling (batch_size) or rollback on
> > > failure is required then I believe the startJobUpdate() should be used
> > > instead of scaleOut().
> > >
> > > On Fri, Jan 15, 2016 at 1:09 AM, Erb, Stephan
> > > <St...@blue-yonder.com> wrote:
> > >> I really like the proposal. The gain in simplicity on the client-side
> > by not having to provide an aurora config is quite significant.
> > >>
> > >> The implementation on the scheduler side is probably rather straight
> > forward as the update can be reused. That would also provide us with the
> > update UI, which has shown to be quite useful when tracing autoscaler
> > events.
> > >>
> > >> Regards,
> > >> Stephan
> > >> ________________________________________
> > >> From: Maxim Khutornenko <ma...@apache.org>
> > >> Sent: Thursday, January 14, 2016 9:50 PM
> > >> To: dev@aurora.apache.org
> > >> Subject: Re: [PROPOSAL] Job instance scaling APIs
> > >>
> > >> "I'd be concerned that any
> > >> scaling API to be powerful enough to fit all (most) use cases would
> just
> > >> end up looking like the update API."
> > >>
> > >> There is a big difference between scaleOut and startJobUpdate APIs
> > >> that justifies the inclusion of the former. Namely, scaleOut may only
> > >> replicate the existing instances without changing/introducing any new
> > >> scheduling requirements or performing instance rollout/rollback. I
> > >> don't see scaleOut ever becoming more powerful to threaten
> > >> startJobUpdate. At the same time, the absence of aurora config
> > >> requirement is a huge boost to autoscaling client simplification.
> > >>
> > >> "For example, when scaling down we don't just kill the last N
> > instances, we
> > >> actually look at the least loaded hosts (globally) and kill tasks from
> > >> those."
> > >>
> > >> I don't quite see why the same wouldn't be possible with a scaleIn
> > >> API. Isn't it always external process responsibility to pay due
> > >> diligence before killing instances?
> > >>
> > >>
> > >> On Thu, Jan 14, 2016 at 12:35 PM, Steve Niemitz <sn...@apache.org>
> > wrote:
> > >>> As some background, we handle scale up / down purely from the client
> > side,
> > >>> using the update API for both directions.  I'd be concerned that any
> > >>> scaling API to be powerful enough to fit all (most) use cases would
> > just
> > >>> end up looking like the update API.
> > >>>
> > >>> For example, when scaling down we don't just kill the last N
> > instances, we
> > >>> actually look at the least loaded hosts (globally) and kill tasks
> from
> > >>> those.
> > >>>
> > >>>
> > >>> On Thu, Jan 14, 2016 at 3:28 PM, Maxim Khutornenko <maxim@apache.org
> >
> > wrote:
> > >>>
> > >>>> "How is scaling down different from killing instances?"
> > >>>>
> > >>>> I found 'killTasks' syntax too different and way much more powerful
> to
> > >>>> be used for scaling in. The TaskQuery allows killing instances
> across
> > >>>> jobs/roles, whereas 'scaleIn' is narrowed down to just a single job.
> > >>>> Additional benefit: it can be ACLed independently by allowing
> external
> > >>>> process kill tasks only within a given job. We may also add rate
> > >>>> limiting or backoff to it later.
> > >>>>
> > >>>> As for Joshua's question, I feel it should be an operator's
> > >>>> responsibility to diff a job with its aurora config before applying
> an
> > >>>> update. That said, if there is enough demand we can definitely
> > >>>> consider adding something similar to what George suggested or
> > >>>> resurrecting a 'large change' warning message we used to have in
> > >>>> client updater.
> > >>>>
> > >>>> On Thu, Jan 14, 2016 at 12:06 PM, George Sirois <
> george@tellapart.com
> > >
> > >>>> wrote:
> > >>>> > As a point of reference, we solved this problem by adding a
> binding
> > >>>> helper
> > >>>> > that queries the scheduler for the current number of instances and
> > uses
> > >>>> > that number instead of a hardcoded config:
> > >>>> >
> > >>>> >    instances='{{scaling_instances[60]}}'
> > >>>> >
> > >>>> > In this example, instances will be set to the currently running
> > number
> > >>>> > (unless there are none, in which case 60 instances will be
> created).
> > >>>> >
> > >>>> > On Thu, Jan 14, 2016 at 2:44 PM, Joshua Cohen <jc...@apache.org>
> > wrote:
> > >>>> >
> > >>>> >> What happens if a job has been scaled out, but the underlying
> > config is
> > >>>> not
> > >>>> >> updated to take that scaling into account? Would the next update
> > on that
> > >>>> >> job revert the number of instances (presumably, because what else
> > could
> > >>>> we
> > >>>> >> do)? Is there anything we can do, tooling-wise, to improve upon
> > this?
> > >>>> >>
> > >>>> >> On Thu, Jan 14, 2016 at 1:40 PM, Maxim Khutornenko <
> > maxim@apache.org>
> > >>>> >> wrote:
> > >>>> >>
> > >>>> >> > Our rolling update APIs can be quite inconvenient to work with
> > when it
> > >>>> >> > comes to instance scaling [1]. It's especially frustrating when
> > >>>> >> > adding/removing instances has to be done in an automated
> fashion
> > >>>> (e.g.:
> > >>>> >> by
> > >>>> >> > an external autoscaling process) as it requires holding on to
> the
> > >>>> >> original
> > >>>> >> > aurora config at all times.
> > >>>> >> >
> > >>>> >> > I propose we add simple instance scaling APIs to address the
> > above.
> > >>>> Since
> > >>>> >> > Aurora job may have instances at different configs at any
> > moment, I
> > >>>> >> propose
> > >>>> >> > we accept an InstanceKey as a reference point when scaling out.
> > For
> > >>>> >> > example:
> > >>>> >> >
> > >>>> >> >     /** Scales out a given job by adding more instances with
> the
> > task
> > >>>> >> > config of the templateKey. */
> > >>>> >> >     Response scaleOut(1: InstanceKey templateKey, 2: i32
> > >>>> incrementCount)
> > >>>> >> >
> > >>>> >> >     /** Scales in a given job by removing existing instances.
> */
> > >>>> >> >     Response scaleIn(1: JobKey job, 2: i32 decrementCount)
> > >>>> >> >
> > >>>> >> > A correspondent client command could then look like:
> > >>>> >> >
> > >>>> >> >     aurora job scale-out devcluster/vagrant/test/hello/1 10
> > >>>> >> >
> > >>>> >> > For the above command, a scheduler would take task config of
> > instance
> > >>>> 1
> > >>>> >> of
> > >>>> >> > the 'hello' job and replicate it 10 more times thus adding 10
> > >>>> additional
> > >>>> >> > instances to the job.
> > >>>> >> >
> > >>>> >> > There are, of course, some details to work out like making sure
> > no
> > >>>> active
> > >>>> >> > update is in flight, scale out does not violate quota and etc.
> I
> > >>>> intend
> > >>>> >> to
> > >>>> >> > address those during the implementation as things progress.
> > >>>> >> >
> > >>>> >> > Does the above make sense? Any concerns/suggestions?
> > >>>> >> >
> > >>>> >> > Thanks,
> > >>>> >> > Maxim
> > >>>> >> >
> > >>>> >> > [1] - https://issues.apache.org/jira/browse/AURORA-1258
> > >>>> >> >
> > >>>> >>
> > >>>>
> >
>

Re: [PROPOSAL] Job instance scaling APIs

Posted by Bill Farner <wf...@apache.org>.
At the risk of devolving the discussion, is it worth calling the method
addInstances as opposed to scaleOut?  I find the former more descriptive.

On Tue, Jan 19, 2016 at 11:12 AM, Maxim Khutornenko <ma...@apache.org>
wrote:

> "Of course, the scaler could manually health check that all instances
> have come up and are being used as expected, but I guess that is what
> Aurora is for."
>
> I'd argue the updater "watch_secs" health checking isn't enough to
> ensure graceful rollout as instances may start flapping right after
> the updater signs off. Instances outside of update window may also
> flap (e.g. due to backend pressure) and updater will not be able to
> catch that. That's why a robust autoscaler has to rely on external
> monitoring tools and overall job health instead.
>
> A very basic approach, as you mentioned above, could be querying job
> status repeatedly and count the ratio of tasks in RUNNING vs active
> (ASSIGNED, PENDING, THROTTLED, STARTING, etc.) states in order to make
> a scaleOut decision. The more reliable approach though would also rely
> on external monitoring stats exposed by user processes. That would be
> a much higher fidelity signal than a decision based on task status
> alone. Scheduler does not (and should not for scalability reasons)
> have visibility into those stats, so the autoscaler would be in a much
> better position to make an executive decision there.
>
> On Sun, Jan 17, 2016 at 9:00 AM, Erb, Stephan
> <St...@blue-yonder.com> wrote:
> > I believe the operation is not that simple when you look at the
> end-to-end scenario.
> >
> > For example, the implementation of an auto-scaler  using the new
> scaleOut() API could look like:
> >
> > 1) check some KPI
> > 2) Infer an action based on this KPI such as scaleUp() or scaleDown()
> > 3) wait until the effects of the adjusted instance count is reflected in
> the KPI. Go to  1 and repeat.
> >
> > The health checking capabilities of the existing updater (in particular
> together with [1]) would be really helpful here. Still, the simplified
> scaleOut() API would offer the great benefit that the auto-scaler would not
> need to know about the used aurora configuration.
> >
> > We even had an incident with a sub-optimal implementation of step 3): An
> overloaded package backend lead to slow service startups. The service
> startup took longer than the grace-period of our auto-scaler. It  therefore
> decided to add more and more instances, because the KPI wasn't improving as
> expected. It had no way of knowing that these instances were not even
> 'running'. The additionally added instances aggravated the overload
> situation of the package backend.  Of course, the scaler could manually
> health check that all instances have come up and are being used as
> expected, but I guess that is what Aurora is for.
> >
> > [1]
> https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit?pref=2&pli=1#heading=h.n0kb37aiy8ua
> >
> > Best Regards,
> > Stephan
> > ________________________________________
> > From: Maxim Khutornenko <ma...@apache.org>
> > Sent: Friday, January 15, 2016 7:06 PM
> > To: dev@aurora.apache.org
> > Subject: Re: [PROPOSAL] Job instance scaling APIs
> >
> > I wasn't planning on using the rolling updater functionality given the
> > simplicity of the operation. I'd second Steve's earlier concerns about
> > scaleOut() looking more like startJobUpdate() if we keep adding
> > features. If health watching, throttling (batch_size) or rollback on
> > failure is required then I believe the startJobUpdate() should be used
> > instead of scaleOut().
> >
> > On Fri, Jan 15, 2016 at 1:09 AM, Erb, Stephan
> > <St...@blue-yonder.com> wrote:
> >> I really like the proposal. The gain in simplicity on the client-side
> by not having to provide an aurora config is quite significant.
> >>
> >> The implementation on the scheduler side is probably rather straight
> forward as the update can be reused. That would also provide us with the
> update UI, which has shown to be quite useful when tracing autoscaler
> events.
> >>
> >> Regards,
> >> Stephan
> >> ________________________________________
> >> From: Maxim Khutornenko <ma...@apache.org>
> >> Sent: Thursday, January 14, 2016 9:50 PM
> >> To: dev@aurora.apache.org
> >> Subject: Re: [PROPOSAL] Job instance scaling APIs
> >>
> >> "I'd be concerned that any
> >> scaling API to be powerful enough to fit all (most) use cases would just
> >> end up looking like the update API."
> >>
> >> There is a big difference between scaleOut and startJobUpdate APIs
> >> that justifies the inclusion of the former. Namely, scaleOut may only
> >> replicate the existing instances without changing/introducing any new
> >> scheduling requirements or performing instance rollout/rollback. I
> >> don't see scaleOut ever becoming more powerful to threaten
> >> startJobUpdate. At the same time, the absence of aurora config
> >> requirement is a huge boost to autoscaling client simplification.
> >>
> >> "For example, when scaling down we don't just kill the last N
> instances, we
> >> actually look at the least loaded hosts (globally) and kill tasks from
> >> those."
> >>
> >> I don't quite see why the same wouldn't be possible with a scaleIn
> >> API. Isn't it always external process responsibility to pay due
> >> diligence before killing instances?
> >>
> >>
> >> On Thu, Jan 14, 2016 at 12:35 PM, Steve Niemitz <sn...@apache.org>
> wrote:
> >>> As some background, we handle scale up / down purely from the client
> side,
> >>> using the update API for both directions.  I'd be concerned that any
> >>> scaling API to be powerful enough to fit all (most) use cases would
> just
> >>> end up looking like the update API.
> >>>
> >>> For example, when scaling down we don't just kill the last N
> instances, we
> >>> actually look at the least loaded hosts (globally) and kill tasks from
> >>> those.
> >>>
> >>>
> >>> On Thu, Jan 14, 2016 at 3:28 PM, Maxim Khutornenko <ma...@apache.org>
> wrote:
> >>>
> >>>> "How is scaling down different from killing instances?"
> >>>>
> >>>> I found 'killTasks' syntax too different and way much more powerful to
> >>>> be used for scaling in. The TaskQuery allows killing instances across
> >>>> jobs/roles, whereas 'scaleIn' is narrowed down to just a single job.
> >>>> Additional benefit: it can be ACLed independently by allowing external
> >>>> process kill tasks only within a given job. We may also add rate
> >>>> limiting or backoff to it later.
> >>>>
> >>>> As for Joshua's question, I feel it should be an operator's
> >>>> responsibility to diff a job with its aurora config before applying an
> >>>> update. That said, if there is enough demand we can definitely
> >>>> consider adding something similar to what George suggested or
> >>>> resurrecting a 'large change' warning message we used to have in
> >>>> client updater.
> >>>>
> >>>> On Thu, Jan 14, 2016 at 12:06 PM, George Sirois <george@tellapart.com
> >
> >>>> wrote:
> >>>> > As a point of reference, we solved this problem by adding a binding
> >>>> helper
> >>>> > that queries the scheduler for the current number of instances and
> uses
> >>>> > that number instead of a hardcoded config:
> >>>> >
> >>>> >    instances='{{scaling_instances[60]}}'
> >>>> >
> >>>> > In this example, instances will be set to the currently running
> number
> >>>> > (unless there are none, in which case 60 instances will be created).
> >>>> >
> >>>> > On Thu, Jan 14, 2016 at 2:44 PM, Joshua Cohen <jc...@apache.org>
> wrote:
> >>>> >
> >>>> >> What happens if a job has been scaled out, but the underlying
> config is
> >>>> not
> >>>> >> updated to take that scaling into account? Would the next update
> on that
> >>>> >> job revert the number of instances (presumably, because what else
> could
> >>>> we
> >>>> >> do)? Is there anything we can do, tooling-wise, to improve upon
> this?
> >>>> >>
> >>>> >> On Thu, Jan 14, 2016 at 1:40 PM, Maxim Khutornenko <
> maxim@apache.org>
> >>>> >> wrote:
> >>>> >>
> >>>> >> > Our rolling update APIs can be quite inconvenient to work with
> when it
> >>>> >> > comes to instance scaling [1]. It's especially frustrating when
> >>>> >> > adding/removing instances has to be done in an automated fashion
> >>>> (e.g.:
> >>>> >> by
> >>>> >> > an external autoscaling process) as it requires holding on to the
> >>>> >> original
> >>>> >> > aurora config at all times.
> >>>> >> >
> >>>> >> > I propose we add simple instance scaling APIs to address the
> above.
> >>>> Since
> >>>> >> > Aurora job may have instances at different configs at any
> moment, I
> >>>> >> propose
> >>>> >> > we accept an InstanceKey as a reference point when scaling out.
> For
> >>>> >> > example:
> >>>> >> >
> >>>> >> >     /** Scales out a given job by adding more instances with the
> task
> >>>> >> > config of the templateKey. */
> >>>> >> >     Response scaleOut(1: InstanceKey templateKey, 2: i32
> >>>> incrementCount)
> >>>> >> >
> >>>> >> >     /** Scales in a given job by removing existing instances. */
> >>>> >> >     Response scaleIn(1: JobKey job, 2: i32 decrementCount)
> >>>> >> >
> >>>> >> > A correspondent client command could then look like:
> >>>> >> >
> >>>> >> >     aurora job scale-out devcluster/vagrant/test/hello/1 10
> >>>> >> >
> >>>> >> > For the above command, a scheduler would take task config of
> instance
> >>>> 1
> >>>> >> of
> >>>> >> > the 'hello' job and replicate it 10 more times thus adding 10
> >>>> additional
> >>>> >> > instances to the job.
> >>>> >> >
> >>>> >> > There are, of course, some details to work out like making sure
> no
> >>>> active
> >>>> >> > update is in flight, scale out does not violate quota and etc. I
> >>>> intend
> >>>> >> to
> >>>> >> > address those during the implementation as things progress.
> >>>> >> >
> >>>> >> > Does the above make sense? Any concerns/suggestions?
> >>>> >> >
> >>>> >> > Thanks,
> >>>> >> > Maxim
> >>>> >> >
> >>>> >> > [1] - https://issues.apache.org/jira/browse/AURORA-1258
> >>>> >> >
> >>>> >>
> >>>>
>

Re: [PROPOSAL] Job instance scaling APIs

Posted by Maxim Khutornenko <ma...@apache.org>.
"Of course, the scaler could manually health check that all instances
have come up and are being used as expected, but I guess that is what
Aurora is for."

I'd argue the updater's "watch_secs" health checking isn't enough to
ensure a graceful rollout, as instances may start flapping right after
the updater signs off. Instances outside of the update window may also
flap (e.g. due to backend pressure), and the updater will not be able
to catch that. That's why a robust autoscaler has to rely on external
monitoring tools and overall job health instead.

A very basic approach, as you mentioned above, could be to query job
status repeatedly and count the ratio of tasks in RUNNING vs. active
(ASSIGNED, PENDING, THROTTLED, STARTING, etc.) states in order to make
a scaleOut decision. A more reliable approach, though, would also rely
on external monitoring stats exposed by user processes. That would be
a much higher-fidelity signal than a decision based on task status
alone. The scheduler does not (and, for scalability reasons, should
not) have visibility into those stats, so the autoscaler would be in a
much better position to make an executive decision there.
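
To make the basic approach concrete, here is a minimal Python sketch of
the RUNNING-vs-active ratio check (how the list of task statuses is
obtained is deliberately left out, and the state names simply follow
the ones mentioned above):

    # Minimal sketch of the status-ratio check. 'statuses' is a list of
    # status strings for the job's tasks, e.g. the result of a scheduler
    # task-status query performed by the autoscaler.
    ACTIVE_STATES = {'PENDING', 'ASSIGNED', 'STARTING', 'THROTTLED', 'RUNNING'}

    def running_ratio(statuses):
        """Fraction of the job's active tasks that have reached RUNNING."""
        active = [s for s in statuses if s in ACTIVE_STATES]
        if not active:
            return 0.0
        return sum(1 for s in active if s == 'RUNNING') / float(len(active))

    def should_scale_out(statuses, kpi_wants_more, min_running_ratio=0.9):
        # Only ask for more instances when the KPI calls for it AND most of
        # the instances already requested are actually up and running.
        return kpi_wants_more and running_ratio(statuses) >= min_running_ratio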

On Sun, Jan 17, 2016 at 9:00 AM, Erb, Stephan
<St...@blue-yonder.com> wrote:
> I believe the operation is not that simple when you look at the end-to-end scenario.
>
> For example, the implementation of an auto-scaler  using the new scaleOut() API could look like:
>
> 1) check some KPI
> 2) Infer an action based on this KPI such as scaleUp() or scaleDown()
> 3) wait until the effects of the adjusted instance count is reflected in the KPI. Go to  1 and repeat.
>
> The health checking capabilities of the existing updater (in particular together with [1]) would be really helpful here. Still, the simplified scaleOut() API would offer the great benefit that the auto-scaler would not need to know about the used aurora configuration.
>
> We even had an incident with a sub-optimal implementation of step 3): An overloaded package backend lead to slow service startups. The service startup took longer than the grace-period of our auto-scaler. It  therefore decided to add more and more instances, because the KPI wasn't improving as expected. It had no way of knowing that these instances were not even 'running'. The additionally added instances aggravated the overload situation of the package backend.  Of course, the scaler could manually health check that all instances have come up and are being used as expected, but I guess that is what Aurora is for.
>
> [1] https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit?pref=2&pli=1#heading=h.n0kb37aiy8ua
>
> Best Regards,
> Stephan
> ________________________________________
> From: Maxim Khutornenko <ma...@apache.org>
> Sent: Friday, January 15, 2016 7:06 PM
> To: dev@aurora.apache.org
> Subject: Re: [PROPOSAL] Job instance scaling APIs
>
> I wasn't planning on using the rolling updater functionality given the
> simplicity of the operation. I'd second Steve's earlier concerns about
> scaleOut() looking more like startJobUpdate() if we keep adding
> features. If health watching, throttling (batch_size) or rollback on
> failure is required then I believe the startJobUpdate() should be used
> instead of scaleOut().
>

Re: [PROPOSAL] Job instance scaling APIs

Posted by "Erb, Stephan" <St...@blue-yonder.com>.
I believe the operation is not that simple when you look at the end-to-end scenario. 

For example, the implementation of an auto-scaler using the new scaleOut() API could look like this:

1) Check some KPI.
2) Infer an action from this KPI, such as scaleOut() or scaleIn().
3) Wait until the effect of the adjusted instance count is reflected in the KPI, then go to 1 and repeat (sketched below).
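
A minimal sketch of such a loop, in Python. This is purely illustrative: 'client' is assumed to wrap the proposed scaleOut()/scaleIn() RPCs (which do not exist yet), and get_kpi() stands in for whatever metric source the auto-scaler consults.

    import time

    SCALE_STEP = 5          # instances to add/remove per adjustment (made-up value)
    HIGH, LOW = 0.8, 0.3    # hypothetical KPI thresholds
    SETTLE_SECS = 300       # grace period before re-evaluating

    def autoscale(client, template_key, job_key, get_kpi):
        while True:
            kpi = get_kpi()                              # 1) check some KPI
            if kpi > HIGH:                               # 2) infer an action
                client.scaleOut(template_key, SCALE_STEP)
            elif kpi < LOW:
                client.scaleIn(job_key, SCALE_STEP)
            time.sleep(SETTLE_SECS)                      # 3) wait for the effect, then repeat

Note that the sleep in step 3) is exactly where the grace-period problem described below bites.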

The health checking capabilities of the existing updater (in particular together with [1]) would be really helpful here. Still, the simplified scaleOut() API would offer the significant benefit that the auto-scaler would not need to know about the aurora configuration in use.

We even had an incident caused by a sub-optimal implementation of step 3): an overloaded package backend led to slow service startups. The service startup took longer than the grace period of our auto-scaler, which therefore decided to add more and more instances because the KPI wasn't improving as expected. It had no way of knowing that these instances were not even 'running'. The newly added instances aggravated the overload of the package backend. Of course, the scaler could manually health check that all instances have come up and are being used as expected, but I guess that is what Aurora is for.

[1] https://docs.google.com/document/d/1ZdgW8S4xMhvKW7iQUX99xZm10NXSxEWR0a-21FP5d94/edit?pref=2&pli=1#heading=h.n0kb37aiy8ua

Best Regards,
Stephan 

Re: [PROPOSAL] Job instance scaling APIs

Posted by Maxim Khutornenko <ma...@apache.org>.
I wasn't planning on using the rolling updater functionality given the
simplicity of the operation. I'd second Steve's earlier concerns about
scaleOut() looking more like startJobUpdate() if we keep adding
features. If health watching, throttling (batch_size), or rollback on
failure is required, then I believe startJobUpdate() should be used
instead of scaleOut().
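
For reference, those knobs already live on the update path today, e.g. via UpdateConfig in a job's .aurora file (values below are purely illustrative):

    update_config = UpdateConfig(
      batch_size = 5,               # throttling
      watch_secs = 45,              # health watching
      max_per_shard_failures = 1,
      rollback_on_failure = True    # rollback on failure
    )

which is then attached to the Job/Service via its update_config field.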


Re: [PROPOSAL] Job instance scaling APIs

Posted by "Erb, Stephan" <St...@blue-yonder.com>.
I really like the proposal. The gain in simplicity on the client side from not having to provide an aurora config is quite significant.

The implementation on the scheduler side is probably rather straightforward, as the updater can be reused. That would also give us the update UI, which has proven quite useful when tracing autoscaler events.

Regards,
Stephan

Re: [PROPOSAL] Job instance scaling APIs

Posted by Maxim Khutornenko <ma...@apache.org>.
"I'd be concerned that any
scaling API to be powerful enough to fit all (most) use cases would just
end up looking like the update API."

There is a big difference between the scaleOut and startJobUpdate APIs
that justifies the inclusion of the former. Namely, scaleOut may only
replicate existing instances, without changing or introducing any new
scheduling requirements and without performing instance rollout/rollback.
I don't see scaleOut ever becoming powerful enough to threaten
startJobUpdate. At the same time, not having to supply an aurora config
is a huge simplification for an autoscaling client.

"For example, when scaling down we don't just kill the last N instances, we
actually look at the least loaded hosts (globally) and kill tasks from
those."

I don't quite see why the same wouldn't be possible with a scaleIn
API. Isn't it always the external process's responsibility to exercise
due diligence before killing instances?
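
To make the client-simplification point concrete, a scale-out call would
reduce to a single RPC with no config in sight. Hypothetical Python
sketch against the proposed API, assuming the generated Thrift bindings
and an already-constructed scheduler client:

    from gen.apache.aurora.api.ttypes import InstanceKey, JobKey

    # Clone 10 more instances from instance 1 of devcluster/vagrant/test/hello.
    template = InstanceKey(
        jobKey=JobKey(role='vagrant', environment='test', name='hello'),
        instanceId=1)
    resp = client.scaleOut(template, 10)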



Re: [PROPOSAL] Job instance scaling APIs

Posted by Steve Niemitz <sn...@apache.org>.
As some background, we handle scale-up/scale-down purely from the client
side, using the update API for both directions. I'd be concerned that any
scaling API powerful enough to fit all (or most) use cases would just
end up looking like the update API.

For example, when scaling down we don't just kill the last N instances, we
actually look at the least loaded hosts (globally) and kill tasks from
those.
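
A rough sketch of that selection step (Python; the instance-to-host and
host-load maps are assumed inputs that external tooling would already
track, and the names here are made up):

    def pick_instances_to_kill(instance_hosts, host_load, count):
        """instance_hosts: {instance_id: hostname}; host_load: {hostname: load}."""
        by_load = sorted(instance_hosts,
                         key=lambda i: host_load.get(instance_hosts[i], 0.0))
        return by_load[:count]

    # Instances 3 and 7 live on the least loaded hosts, so they go first:
    pick_instances_to_kill({3: 'host-a', 7: 'host-b', 9: 'host-c'},
                           {'host-a': 0.1, 'host-b': 0.2, 'host-c': 0.9},
                           2)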



Re: [PROPOSAL] Job instance scaling APIs

Posted by Maxim Khutornenko <ma...@apache.org>.
"How is scaling down different from killing instances?"

I found the 'killTasks' syntax too different and far too powerful to
be used for scaling in. The TaskQuery allows killing instances across
jobs/roles, whereas 'scaleIn' is narrowed down to a single job. An
additional benefit: it can be ACLed independently, allowing an external
process to kill tasks only within a given job. We may also add rate
limiting or backoff to it later.
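
For symmetry with the scale-out example in the proposal, the client
command could look like the following (name and shape are not final):

    aurora job scale-in devcluster/vagrant/test/hello 10

which would remove 10 instances from the 'hello' job.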

As for Joshua's question, I feel it should be an operator's
responsibility to diff a job against its aurora config before applying an
update. That said, if there is enough demand we can definitely
consider adding something similar to what George suggested, or
resurrecting the 'large change' warning message we used to have in
the client updater.


Re: [PROPOSAL] Job instance scaling APIs

Posted by George Sirois <ge...@tellapart.com>.
As a point of reference, we solved this problem by adding a binding helper
that queries the scheduler for the current number of instances and uses
that number instead of a hardcoded instance count in the config:

   instances='{{scaling_instances[60]}}'

In this example, instances will be set to the currently running number
(unless there are none, in which case 60 instances will be created).
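
A trimmed example of how a job config consumes it ('scaling_instances'
is our site-specific helper, not stock Aurora, and 'hello_task' stands
in for the real task definition):

    jobs = [
      Service(
        cluster = 'devcluster',
        role = 'vagrant',
        environment = 'test',
        name = 'hello',
        task = hello_task,
        instances = '{{scaling_instances[60]}}',
      )
    ]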


Re: [PROPOSAL] Job instance scaling APIs

Posted by Joshua Cohen <jc...@apache.org>.
What happens if a job has been scaled out, but the underlying config is not
updated to take that scaling into account? Would the next update on that
job revert the number of instances (presumably, because what else could we
do)? Is there anything we can do, tooling-wise, to improve upon this?
