Posted to user@aurora.apache.org by Brian Hatfield <bh...@twitter.com> on 2016/01/12 23:56:40 UTC

Pre-checking if job can be scheduled?

Hi,

We currently run a (relatively) small Mesos/Aurora cluster, and don't
always have significant resource headroom available.

Sometimes, we go to schedule a job and we're just short of what we
estimated-by-hand we'd need in the cluster for it. Most of the tasks
schedule - but a few stay "PENDING" because of the resource constraint.
This often confuses users, or in some cases, causes the command to block
for a while until it eventually times out.

We're currently working on automating somewhat-more-precise basic
estimation with information sourced from /offers to get a sense of "nope,
your task won't schedule" to provide fast feedback that doesn't manipulate
the state of the cluster.
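
A minimal sketch of this kind of offers-based estimate, assuming the free
resources have already been fetched and parsed into simple records. The
`Resources` type, its field names, and `count_schedulable` are hypothetical
names for illustration, not part of Aurora or Mesos. It only considers raw
CPU, RAM, and disk, and assumes every resource amount in the task is positive.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Resources:
    """Hypothetical resource record; CPU is in milli-cores to avoid float math."""
    cpu_millis: int
    ram_mb: int
    disk_mb: int


def count_schedulable(task: Resources, offers: Sequence[Resources]) -> int:
    """Return how many identical copies of `task` fit into the given offers.

    Each offer can hold as many copies as its scarcest resource allows, and
    copies cannot be split across offers.
    """
    total = 0
    for offer in offers:
        total += min(
            offer.cpu_millis // task.cpu_millis,
            offer.ram_mb // task.ram_mb,
            offer.disk_mb // task.disk_mb,
        )
    return total


def can_schedule(task: Resources, instances: int, offers: Sequence[Resources]) -> bool:
    """Read-only check: does nothing to cluster state, only compares numbers."""
    return count_schedulable(task, offers) >= instances
```

For example, with one offer of 4000 milli-cores and one of 1000, a task that
needs 1500 milli-cores fits twice in total, so a three-instance job would be
reported as not schedulable.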

A friend recommended that I suggest to this mailing list something
integrated into Aurora to accomplish this instead - since our basic
estimation doesn't include co-scheduling constraints, quotas, etc.

So: We believe that this feature doesn't exist in Aurora today, and wanted
to suggest it as a future feature for the project.

Thanks :-)
Brian

Re: Pre-checking if job can be scheduled?

Posted by Bill Farner <wf...@apache.org>.
Quick pointers for after you read the contributing doc:

1. Skim the doc for developing on the scheduler
https://github.com/apache/aurora/blob/master/docs/developing-aurora-scheduler.md

2. Add the new API method
https://github.com/apache/aurora/blob/master/api/src/main/thrift/org/apache/aurora/gen/api.thrift#L953

3. Implement the API method
https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/thrift/ReadOnlySchedulerImpl.java#L113

4. To answer the question you're asking, you need a Guice-injected
OfferManager:
https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/offers/OfferManager.java#L113
and
SchedulingFilter:
https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/filter/SchedulingFilter.java#L326-L334
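
The shape of step 4 can be sketched as a read-only check that walks the
outstanding offers and asks a filter whether each one is acceptable. The
sketch below is in Python rather than the scheduler's Java, and every name in
it (`Offer`, `TaskRequest`, `vetoes`, `can_schedule`) is hypothetical; it does
not reproduce the actual OfferManager or SchedulingFilter interfaces. It
assumes a filter that reports the reasons an offer is unsuitable, and treats
constraints as simple attribute equality.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Offer:
    """Hypothetical view of one outstanding offer."""
    host: str
    resources: Dict[str, int]  # e.g. {"cpu_millis": 4000, "ram_mb": 8192}
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class TaskRequest:
    """Hypothetical task description: resources plus attribute constraints."""
    resources: Dict[str, int]
    constraints: Dict[str, str] = field(default_factory=dict)


def vetoes(offer: Offer, task: TaskRequest) -> List[str]:
    """Return the reasons `offer` cannot host `task`; empty means it fits."""
    reasons = []
    for name, needed in task.resources.items():
        available = offer.resources.get(name, 0)
        if needed > available:
            reasons.append(f"insufficient {name}: need {needed}, have {available}")
    for attr, required in task.constraints.items():
        if offer.attributes.get(attr) != required:
            reasons.append(f"constraint not satisfied: {attr}={required}")
    return reasons


def can_schedule(task: TaskRequest, offers: Sequence[Offer]) -> bool:
    """True if at least one offer has no vetoes. Never mutates the offers."""
    return any(not vetoes(offer, task) for offer in offers)
```

Returning reasons rather than a bare boolean means the same check could also
tell the user why a task would not schedule.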

Re: Pre-checking if job can be scheduled?

Posted by Brian Hatfield <bh...@twitter.com>.
Wow!

Thanks for the positive feedback and fast responses!

@john/bill - Yes, I'd be happy to do at a minimum [1], and I am willing to
do [2] but am currently completely unfamiliar with the codebase. I'll read
the contributing docs and pull down the code and see if I can figure out a
guess of a way forward, and then report in if I think I can do it.

Thanks!
Brian

Re: Pre-checking if job can be scheduled?

Posted by Andrew Jorgensen <an...@andrewjorgensen.com>.
One other case to take into account, which complicates the logic a bit: we
have some jobs that need to be stopped and then started again, usually
with either code changes or capacity increases. In this case we would
need to have the resources already consumed for the job factored back in
to determine whether there is enough room to run the job. I think for a
first pass a simple yes/no on outstanding offers would be good but for
our use case we would need to supply an existing job as an argument to
tell the offers check to add those resources back when considering
whether there is enough room or not. 
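
A rough sketch of the "add those resources back" idea, under the assumption
that free capacity is tracked per host and that the existing job's running
tasks are known by host. All names here (`add_back`, `instances_that_fit`,
the `(cpu_millis, ram_mb)` tuple) are hypothetical and only two resource
dimensions are modelled; task sizes are assumed to be positive.

```python
from typing import Dict, List, Tuple

Res = Tuple[int, int]  # (cpu_millis, ram_mb), a hypothetical simplification


def add_back(free: Dict[str, Res], running: List[Tuple[str, Res]]) -> Dict[str, Res]:
    """Return free capacity per host as if the running tasks had been stopped."""
    result = dict(free)  # copy so the caller's view of the cluster is untouched
    for host, (cpu, ram) in running:
        free_cpu, free_ram = result.get(host, (0, 0))
        result[host] = (free_cpu + cpu, free_ram + ram)
    return result


def instances_that_fit(task: Res, free: Dict[str, Res]) -> int:
    """Count how many identical copies of `task` fit across all hosts."""
    cpu, ram = task
    return sum(min(free_cpu // cpu, free_ram // ram) for free_cpu, free_ram in free.values())
```

The resources are credited to the hosts where the old tasks ran, because
freed capacity on one host cannot be combined with spare capacity on another.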

This can get a bit race conditiony if you have multiple people starting
and stopping jobs in the cluster. It may also be interesting to have an
addition to the deploy task that says something like "if you can deploy
this, do it; if not, don't do anything and exit with an error" or
something like that. I'm not sure what guarantees you can make between
the check and the actual deploy based on other things that are going on
in the cluster but that would definitely be an awesome improvement for
that use case. 
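
To illustrate why the check and the deploy would need to happen as one step,
here is a toy sketch in which both are done under a single lock. `ToyCluster`
and `deploy_if_fits` are hypothetical names, a single CPU number stands in for
all resources, and a real scheduler would have to provide this guarantee
itself; a client-side lock like this one would not be enough.

```python
import threading


class ToyCluster:
    """Hypothetical in-memory stand-in for cluster capacity."""

    def __init__(self, free_cpu_millis: int) -> None:
        self._free = free_cpu_millis
        self._lock = threading.Lock()

    @property
    def free(self) -> int:
        with self._lock:
            return self._free

    def deploy_if_fits(self, needed_cpu_millis: int) -> bool:
        """Reserve capacity only if it fits; otherwise change nothing.

        The check and the reservation share one lock, so no other caller can
        take the capacity in between.
        """
        with self._lock:
            if needed_cpu_millis > self._free:
                return False
            self._free -= needed_cpu_millis
            return True
```

A separate "can I deploy?" call followed by a "deploy" call would leave a
window in which another user's job could consume the same resources.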

-- 
Andrew Jorgensen
@ajorgensen


Re: Pre-checking if job can be scheduled?

Posted by Brian Hatfield <bh...@twitter.com>.
Thanks!

As John requested, I filed https://issues.apache.org/jira/browse/AURORA-1579.

I also spent a little time sketching out this feature last night, and
though Bill mentioned that this seems like an easy feature and provided
excellent references, I did hit a little friction simply due to my lack of
knowledge in this codebase. That said, I'll plug away a little bit, and
perhaps work with a teammate who could potentially help as well.

Brian

Re: Pre-checking if job can be scheduled?

Posted by Chris Lambert <ch...@gmail.com>.
Hi Brian,

This is great!  As Maxim said, something like this has been long-desired.

I have one additional thought to add: it still may make sense to improve
the messaging around this, because tasks could still end up pending (consider
the case where an oddly shaped task is created but later rescheduled).
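
As one possible shape for such messaging, the sketch below reports how far
the closest offer falls short of a pending task. The function name
`explain_pending` and the resource keys are hypothetical, and "closest" is
judged crudely: fewest short resources first, then smallest total shortfall.

```python
from typing import Dict, List


def explain_pending(task: Dict[str, int], offers: List[Dict[str, int]]) -> str:
    """Return a one-line, human-readable reason a task would stay pending."""
    if not offers:
        return "No offers available."

    def shortfall(offer: Dict[str, int]) -> Dict[str, int]:
        # Map each resource the offer lacks to the missing amount.
        return {
            name: needed - offer.get(name, 0)
            for name, needed in task.items()
            if needed > offer.get(name, 0)
        }

    best = min(
        (shortfall(offer) for offer in offers),
        key=lambda short: (len(short), sum(short.values())),
    )
    if not best:
        return "Task fits an available offer."
    parts = ", ".join(f"{name} short by {amount}" for name, amount in sorted(best.items()))
    return f"No offer fits; closest offer is {parts}."
```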

Chris


Re: Pre-checking if job can be scheduled?

Posted by Maxim Khutornenko <ma...@apache.org>.
+1 to this feature! We have always wanted something like this but
never had bandwidth to pull it off.

Re: Pre-checking if job can be scheduled?

Posted by John Sirois <js...@apache.org>.
I think this would be a great feature, from a simple yes/no to more
sophisticated likelihood estimates, even ones based on time of day (taking
cron job scheduling into account). Two things would help move it forward:
1. A ticket [1] describing the minimum viable feature.
2. Work towards implementation [2].

Would you be willing to do any of these? I'd be willing to review designs
and reviews.

[1] https://issues.apache.org/jira/secure/CreateIssue!default.jspa
[2] http://aurora.apache.org/documentation/latest/contributing/


Re: Pre-checking if job can be scheduled?

Posted by Bill Farner <wf...@apache.org>.
I think that would be a cool addition to the API, and relatively easy to
implement.  I'd be happy to shepherd if you are willing to take a crack at
a patch!
