Posted to architecture@airavata.apache.org by Suresh Marru <sm...@apache.org> on 2014/09/02 13:50:12 UTC

Scheduling strategies for Airavata

Hi All,

Need some guidance on identifying a scheduling strategy and a pluggable third-party implementation for Airavata's scheduling needs. For context, let me describe the use cases for scheduling within Airavata:

* If a gateway/user submits a series of jobs, Airavata currently does not throttle them; it sends them straight to the compute clusters in a FIFO manner. Resources enforce per-user job limits within a queue to ensure fair use of the clusters (example: Stampede allows 50 jobs per user in the normal queue [1]). Airavata will need to implement internal queues and throttle jobs, respecting the max-jobs-per-queue limit of the underlying resource queue (a rough sketch of such a throttle appears after this list). 
 
* The current version of Airavata also does not schedule jobs across the available computational resources; it expects gateways/users to pick resources at experiment launch. Airavata will need to implement schedulers that are aware of the existing load on the clusters and spread jobs efficiently. The scheduler should have access to heuristics from previous executions as well as current requirements, including job size (number of nodes/cores), memory requirements, wall-time estimates, and so forth. 

* As Airavata is mapping multiple individual user jobs into one or more community account submissions, it also becomes critical to implement fair-share scheduling among these users to ensure fair use of allocations as well as allowable queue limits.
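
To make the first bullet concrete, here is a very rough sketch of the kind of throttle I have in mind (the class and the calls it makes are invented purely for illustration and are not existing Airavata APIs; the limit itself would come from the resource/queue description):

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Illustrative throttle: jobs wait in an internal FIFO and are only released while
    // the remote queue has head-room under its max-jobs-per-queue limit.
    // Job, countActiveJobs() and submitToResource() are placeholders, not Airavata APIs.
    public class QueueThrottle {

        private final String resourceQueue;   // e.g. "stampede/normal"
        private final int maxJobsPerQueue;    // e.g. 50 for Stampede's normal queue [1]
        private final Deque<Job> pending = new ArrayDeque<Job>();

        public QueueThrottle(String resourceQueue, int maxJobsPerQueue) {
            this.resourceQueue = resourceQueue;
            this.maxJobsPerQueue = maxJobsPerQueue;
        }

        /** A gateway hands a job to Airavata; it is buffered here instead of sent straight out. */
        public synchronized void enqueue(Job job) {
            pending.addLast(job);
            drain();
        }

        /** Called on enqueue and whenever a remote job completes, to release more work in FIFO order. */
        public synchronized void drain() {
            int inFlight = countActiveJobs(resourceQueue);   // jobs queued or running on the cluster
            while (!pending.isEmpty() && inFlight < maxJobsPerQueue) {
                submitToResource(pending.removeFirst(), resourceQueue);
                inFlight++;
            }
        }

        // --- placeholders for whatever job model and submission layer Airavata provides ---
        public static class Job { }
        private int countActiveJobs(String queue) { return 0; }
        private void submitToResource(Job job, String queue) { }
    }

Presumably one such throttle would exist per (resource, queue, community account) combination.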

Other use cases? 

We would greatly appreciate it if folks on this list could shed light on experiences using the schedulers implemented in Hadoop, Mesos, Storm, or other frameworks outside of their intended use. For instance, the Hadoop (YARN) capacity [2] and fair schedulers [3][4][5] seem to meet Airavata's needs. Is it a good idea to attempt to reuse these implementations? Are there other pluggable third-party alternatives? 

Thanks in advance for your time and insights,

Suresh

[1] - https://www.tacc.utexas.edu/user-services/user-guides/stampede-user-guide#running
[2] - http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
[3] - http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html
[4] - https://issues.apache.org/jira/browse/HADOOP-3746
[5] - https://issues.apache.org/jira/browse/YARN-326




Re: Scheduling strategies for Airavata

Posted by Marlon Pierce <ma...@iu.edu>.
We are motivated by a parameter sweep problem, but this is really a 
general problem for any gateway using a community credential.

Marlon

On 9/2/14, 7:50 AM, Suresh Marru wrote:
> Hi All,
>
> Need some guidance on identifying a scheduling strategy and a pluggable third party implementation for airavata scheduling needs. For context let me describe the use cases for scheduling within airavata:
>
> * If we gateway/user is submitting a series of jobs, airavata is currently not throttling them and sending them to compute clusters (in a FIFO way). Resources enforce per user job limit within a queue and ensure fair use of the clusters ((example: stampede allows 50 jobs per user in the normal queue [1]). Airavata will need to implement queues and throttle jobs respecting the max-job-per-queue limits of a underlying resource queue.
>   
> * Current version of Airavata is also not performing job scheduling across available computational resources and expecting gateways/users to pick resources during experiment launch. Airavata will need to implement schedulers which become aware of existing loads on the clusters and spread jobs efficiently. The scheduler should be able to get access to heuristics on previous executions and current requirements which includes job size (number of nodes/cores), memory requirements, wall time estimates and so forth.
>
> * As Airavata is mapping multiple individual user jobs into one or more community account submissions, it also becomes critical to implement fair-share scheduling among these users to ensure fair use of allocations as well as allowable queue limits.
>
> Other use cases?
>
> We will greatly appreciate if folks on this list can shed light on experiences using schedulers implemented in hadoop, mesos, storm or other frameworks outside of their intended use. For instance, hadoop (yarn) capacity [2] and fair schedulers [3][4][5] seem to meet the needs of airavata. Is it a good idea to attempt to reuse these implementations? Any other pluggable third-party alternatives.
>
> Thanks in advance for your time and insights,
>
> Suresh
>
> [1] - https://www.tacc.utexas.edu/user-services/user-guides/stampede-user-guide#running
> [2] - http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
> [3] - http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html
> [4] - https://issues.apache.org/jira/browse/HADOOP-3746
> [5] - https://issues.apache.org/jira/browse/YARN-326
>
>
>


Re: Scheduling strategies for Airavata

Posted by Marlon Pierce <ma...@iu.edu>.
One internal note: I think we need to include the "Launched" state when
determining how many jobs a gateway is currently running.
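
Something along these lines, purely as an illustration (the state names are made up and not Airavata's actual state model; the point is that anything already handed off, including Launched, should count against the limit):

    import java.util.EnumSet;
    import java.util.List;
    import java.util.Set;

    // Illustrative only: count every job that already occupies a slot on the resource,
    // including jobs that are launched but not yet visible as queued/running remotely.
    public class ActiveJobCounter {

        public enum JobState { CREATED, LAUNCHED, SUBMITTED, QUEUED, ACTIVE, COMPLETED, FAILED, CANCELED }

        private static final Set<JobState> COUNTS_AGAINST_LIMIT =
                EnumSet.of(JobState.LAUNCHED, JobState.SUBMITTED, JobState.QUEUED, JobState.ACTIVE);

        public static int countRunning(List<JobState> gatewayJobStates) {
            int n = 0;
            for (JobState s : gatewayJobStates) {
                if (COUNTS_AGAINST_LIMIT.contains(s)) {
                    n++;   // already consumes a slot even before the cluster reports it
                }
            }
            return n;
        }
    }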

Marlon

On 9/2/14, 10:29 AM, Miller, Mark wrote:
> We would have the same issue Borries mentioned: the community account user under xsede owns all the jobs. Luckily, Sdsc makes allowances for known gateway, trusting that we represent many individuals. We are building throttling tools to prevent users from submitting more than x running jobs, and placing reserves against their allocation for running jobs.
>
>
> I don't see how to solve the problem of xsede or other resources seeing a gateway user as equivalent to a regular user without help from xsede policy decisions/infrastructure changes; esp for the case where  a code requires a single resource, and is submitted by many users at once.
>
> I think solving that would require a resource providers to disambiguate regular and community users.
>
> Mark
>
>> On Sep 2, 2014, at 10:11 AM, "Borries Demeler" <de...@biochem.uthscsa.edu> wrote:
>>
>> Our application involves submission of several hundred quite small (a couple of minutes for most
>> clusters, ~128 cores, give or take) computational jobs, running the same code on multiple datasets.
>>
>> We are hitting the limit of 50 jobs on TACC resources, with all others failing. The problem is
>> made worse because all users submit under a community account, which treats every submission to
>> be part of the same allocation account.
>>
>> I see a few possibilities:
>>
>> 1. a separate FIFO queue, making sure none of the resources get overloaded by any community account user
>>
>> 2. submitting all jobs as a single job somehow to where the job is submitted for the aggregate walltime
>> for all jobs. A special workscript would spawn jobs underneath the parent submission. Not sure if this
>> is feasable or reasonable.
>>
>> 3. spreading the jobs around all possible resources
>>
>> 4. a combination of 1 and 3.
>>
>> -Borries
>>
>>
>>
>>
>>> On Tue, Sep 02, 2014 at 07:50:12AM -0400, Suresh Marru wrote:
>>> Hi All,
>>>
>>> Need some guidance on identifying a scheduling strategy and a pluggable third party implementation for airavata scheduling needs. For context let me describe the use cases for scheduling within airavata:
>>>
>>> * If we gateway/user is submitting a series of jobs, airavata is currently not throttling them and sending them to compute clusters (in a FIFO way). Resources enforce per user job limit within a queue and ensure fair use of the clusters ((example: stampede allows 50 jobs per user in the normal queue [1]). Airavata will need to implement queues and throttle jobs respecting the max-job-per-queue limits of a underlying resource queue.
>>>
>>> * Current version of Airavata is also not performing job scheduling across available computational resources and expecting gateways/users to pick resources during experiment launch. Airavata will need to implement schedulers which become aware of existing loads on the clusters and spread jobs efficiently. The scheduler should be able to get access to heuristics on previous executions and current requirements which includes job size (number of nodes/cores), memory requirements, wall time estimates and so forth.
>>>
>>> * As Airavata is mapping multiple individual user jobs into one or more community account submissions, it also becomes critical to implement fair-share scheduling among these users to ensure fair use of allocations as well as allowable queue limits.
>>>
>>> Other use cases?
>>>
>>> We will greatly appreciate if folks on this list can shed light on experiences using schedulers implemented in hadoop, mesos, storm or other frameworks outside of their intended use. For instance, hadoop (yarn) capacity [2] and fair schedulers [3][4][5] seem to meet the needs of airavata. Is it a good idea to attempt to reuse these implementations? Any other pluggable third-party alternatives.
>>>
>>> Thanks in advance for your time and insights,
>>>
>>> Suresh
>>>
>>> [1] - https://www.tacc.utexas.edu/user-services/user-guides/stampede-user-guide#running
>>> [2] - http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
>>> [3] - http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html
>>> [4] - https://issues.apache.org/jira/browse/HADOOP-3746
>>> [5] - https://issues.apache.org/jira/browse/YARN-326
>>>
>>>


Re: Scheduling strategies for Airavata

Posted by "Miller, Mark" <mm...@sdsc.edu>.
We would have the same issue Borries mentioned: the community account user under XSEDE owns all the jobs. Luckily, SDSC makes allowances for known gateways, trusting that we represent many individuals. We are building throttling tools to prevent users from submitting more than x running jobs, and to place reserves against their allocation for running jobs. 
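
To give a feel for the shape of those tools, here is a rough sketch (not our actual code; the names and the SU bookkeeping are invented purely for illustration):

    import java.util.HashMap;
    import java.util.Map;

    // Rough illustration only: cap running jobs per end user behind the community
    // account, and reserve each running job's estimated SUs so the allocation cannot
    // be overdrawn by work that is already in flight.
    public class CommunityAccountGuard {

        private final int maxRunningJobsPerUser;           // the "x running jobs" cap
        private double remainingAllocationSUs;             // SUs left on the community allocation
        private final Map<String, Integer> runningByUser = new HashMap<String, Integer>();

        public CommunityAccountGuard(int maxRunningJobsPerUser, double allocationSUs) {
            this.maxRunningJobsPerUser = maxRunningJobsPerUser;
            this.remainingAllocationSUs = allocationSUs;
        }

        /** Returns true if the job may be released now; reserves its estimated SUs if so. */
        public synchronized boolean tryReserve(String userId, double estimatedSUs) {
            Integer running = runningByUser.get(userId);
            int count = (running == null) ? 0 : running;
            if (count >= maxRunningJobsPerUser || estimatedSUs > remainingAllocationSUs) {
                return false;                              // hold the job back in the gateway
            }
            runningByUser.put(userId, count + 1);
            remainingAllocationSUs -= estimatedSUs;        // reserve against the allocation
            return true;
        }

        /** Called when a job finishes: free the slot and return any unused reserve. */
        public synchronized void release(String userId, double estimatedSUs, double actualSUs) {
            Integer running = runningByUser.get(userId);
            runningByUser.put(userId, (running == null || running <= 1) ? 0 : running - 1);
            remainingAllocationSUs += Math.max(0, estimatedSUs - actualSUs);
        }
    }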


I don't see how to solve the problem of XSEDE or other resources seeing a gateway user as equivalent to a regular user without help from XSEDE policy decisions/infrastructure changes, especially for the case where a code requires a single resource and is submitted by many users at once.

I think solving that would require resource providers to disambiguate regular and community users.

Mark

> On Sep 2, 2014, at 10:11 AM, "Borries Demeler" <de...@biochem.uthscsa.edu> wrote:
> 
> Our application involves submission of several hundred quite small (a couple of minutes for most
> clusters, ~128 cores, give or take) computational jobs, running the same code on multiple datasets.
> 
> We are hitting the limit of 50 jobs on TACC resources, with all others failing. The problem is 
> made worse because all users submit under a community account, which treats every submission to
> be part of the same allocation account.
> 
> I see a few possibilities:
> 
> 1. a separate FIFO queue, making sure none of the resources get overloaded by any community account user
> 
> 2. submitting all jobs as a single job somehow to where the job is submitted for the aggregate walltime
> for all jobs. A special workscript would spawn jobs underneath the parent submission. Not sure if this
> is feasable or reasonable.
> 
> 3. spreading the jobs around all possible resources
> 
> 4. a combination of 1 and 3.
> 
> -Borries
> 
> 
> 
> 
>> On Tue, Sep 02, 2014 at 07:50:12AM -0400, Suresh Marru wrote:
>> Hi All,
>> 
>> Need some guidance on identifying a scheduling strategy and a pluggable third party implementation for airavata scheduling needs. For context let me describe the use cases for scheduling within airavata:
>> 
>> * If we gateway/user is submitting a series of jobs, airavata is currently not throttling them and sending them to compute clusters (in a FIFO way). Resources enforce per user job limit within a queue and ensure fair use of the clusters ((example: stampede allows 50 jobs per user in the normal queue [1]). Airavata will need to implement queues and throttle jobs respecting the max-job-per-queue limits of a underlying resource queue. 
>> 
>> * Current version of Airavata is also not performing job scheduling across available computational resources and expecting gateways/users to pick resources during experiment launch. Airavata will need to implement schedulers which become aware of existing loads on the clusters and spread jobs efficiently. The scheduler should be able to get access to heuristics on previous executions and current requirements which includes job size (number of nodes/cores), memory requirements, wall time estimates and so forth. 
>> 
>> * As Airavata is mapping multiple individual user jobs into one or more community account submissions, it also becomes critical to implement fair-share scheduling among these users to ensure fair use of allocations as well as allowable queue limits.
>> 
>> Other use cases? 
>> 
>> We will greatly appreciate if folks on this list can shed light on experiences using schedulers implemented in hadoop, mesos, storm or other frameworks outside of their intended use. For instance, hadoop (yarn) capacity [2] and fair schedulers [3][4][5] seem to meet the needs of airavata. Is it a good idea to attempt to reuse these implementations? Any other pluggable third-party alternatives. 
>> 
>> Thanks in advance for your time and insights,
>> 
>> Suresh
>> 
>> [1] - https://www.tacc.utexas.edu/user-services/user-guides/stampede-user-guide#running
>> [2] - http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
>> [3] - http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html
>> [4] - https://issues.apache.org/jira/browse/HADOOP-3746
>> [5] - https://issues.apache.org/jira/browse/YARN-326
>> 
>> 

Re: Scheduling strategies for Airavata

Posted by Borries Demeler <de...@biochem.uthscsa.edu>.
Yes, Kenneth, this is a good idea. I have already contacted TACC (Chris
Hempel) and he is looking into it. At this point, we will likely
get some relief on Lonestar, which is no longer an XSEDE resource,
but as a UT member I continue to get access there through a separate
non-XSEDE, UT-only allocation. They have less pressure on Lonestar now
that XSEDE users are no longer using it.

-Borries



On Tue, Sep 02, 2014 at 09:46:57AM -0700, K Yoshimoto wrote:
> 
> You could try requesting an increased job limit for the community user.
> SDSC sets different queued job limits for gateway vs individual users.
> I think TACC would probably be receptive to that.
> 
> On Tue, Sep 02, 2014 at 09:11:00AM -0500, Borries Demeler wrote:
> > Our application involves submission of several hundred quite small (a couple of minutes for most
> > clusters, ~128 cores, give or take) computational jobs, running the same code on multiple datasets.
> > 
> > We are hitting the limit of 50 jobs on TACC resources, with all others failing. The problem is 
> > made worse because all users submit under a community account, which treats every submission to
> > be part of the same allocation account.
> > 
> > I see a few possibilities:
> > 
> > 1. a separate FIFO queue, making sure none of the resources get overloaded by any community account user
> > 
> > 2. submitting all jobs as a single job somehow to where the job is submitted for the aggregate walltime
> > for all jobs. A special workscript would spawn jobs underneath the parent submission. Not sure if this
> > is feasable or reasonable.
> > 
> > 3. spreading the jobs around all possible resources
> > 
> > 4. a combination of 1 and 3.
> > 
> > -Borries
> > 
> > 
> > 
> > 
> > On Tue, Sep 02, 2014 at 07:50:12AM -0400, Suresh Marru wrote:
> > > Hi All,
> > > 
> > > Need some guidance on identifying a scheduling strategy and a pluggable third party implementation for airavata scheduling needs. For context let me describe the use cases for scheduling within airavata:
> > > 
> > > * If we gateway/user is submitting a series of jobs, airavata is currently not throttling them and sending them to compute clusters (in a FIFO way). Resources enforce per user job limit within a queue and ensure fair use of the clusters ((example: stampede allows 50 jobs per user in the normal queue [1]). Airavata will need to implement queues and throttle jobs respecting the max-job-per-queue limits of a underlying resource queue. 
> > >  
> > > * Current version of Airavata is also not performing job scheduling across available computational resources and expecting gateways/users to pick resources during experiment launch. Airavata will need to implement schedulers which become aware of existing loads on the clusters and spread jobs efficiently. The scheduler should be able to get access to heuristics on previous executions and current requirements which includes job size (number of nodes/cores), memory requirements, wall time estimates and so forth. 
> > > 
> > > * As Airavata is mapping multiple individual user jobs into one or more community account submissions, it also becomes critical to implement fair-share scheduling among these users to ensure fair use of allocations as well as allowable queue limits.
> > > 
> > > Other use cases? 
> > > 
> > > We will greatly appreciate if folks on this list can shed light on experiences using schedulers implemented in hadoop, mesos, storm or other frameworks outside of their intended use. For instance, hadoop (yarn) capacity [2] and fair schedulers [3][4][5] seem to meet the needs of airavata. Is it a good idea to attempt to reuse these implementations? Any other pluggable third-party alternatives. 
> > > 
> > > Thanks in advance for your time and insights,
> > > 
> > > Suresh
> > > 
> > > [1] - https://www.tacc.utexas.edu/user-services/user-guides/stampede-user-guide#running
> > > [2] - http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
> > > [3] - http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html
> > > [4] - https://issues.apache.org/jira/browse/HADOOP-3746
> > > [5] - https://issues.apache.org/jira/browse/YARN-326
> > > 
> > > 

Re: Scheduling strategies for Airavata

Posted by K Yoshimoto <ke...@sdsc.edu>.
You could try requesting an increased job limit for the community user.
SDSC sets different queued job limits for gateway vs individual users.
I think TACC would probably be receptive to that.

On Tue, Sep 02, 2014 at 09:11:00AM -0500, Borries Demeler wrote:
> Our application involves submission of several hundred quite small (a couple of minutes for most
> clusters, ~128 cores, give or take) computational jobs, running the same code on multiple datasets.
> 
> We are hitting the limit of 50 jobs on TACC resources, with all others failing. The problem is 
> made worse because all users submit under a community account, which treats every submission to
> be part of the same allocation account.
> 
> I see a few possibilities:
> 
> 1. a separate FIFO queue, making sure none of the resources get overloaded by any community account user
> 
> 2. submitting all jobs as a single job somehow to where the job is submitted for the aggregate walltime
> for all jobs. A special workscript would spawn jobs underneath the parent submission. Not sure if this
> is feasable or reasonable.
> 
> 3. spreading the jobs around all possible resources
> 
> 4. a combination of 1 and 3.
> 
> -Borries
> 
> 
> 
> 
> On Tue, Sep 02, 2014 at 07:50:12AM -0400, Suresh Marru wrote:
> > Hi All,
> > 
> > Need some guidance on identifying a scheduling strategy and a pluggable third party implementation for airavata scheduling needs. For context let me describe the use cases for scheduling within airavata:
> > 
> > * If we gateway/user is submitting a series of jobs, airavata is currently not throttling them and sending them to compute clusters (in a FIFO way). Resources enforce per user job limit within a queue and ensure fair use of the clusters ((example: stampede allows 50 jobs per user in the normal queue [1]). Airavata will need to implement queues and throttle jobs respecting the max-job-per-queue limits of a underlying resource queue. 
> >  
> > * Current version of Airavata is also not performing job scheduling across available computational resources and expecting gateways/users to pick resources during experiment launch. Airavata will need to implement schedulers which become aware of existing loads on the clusters and spread jobs efficiently. The scheduler should be able to get access to heuristics on previous executions and current requirements which includes job size (number of nodes/cores), memory requirements, wall time estimates and so forth. 
> > 
> > * As Airavata is mapping multiple individual user jobs into one or more community account submissions, it also becomes critical to implement fair-share scheduling among these users to ensure fair use of allocations as well as allowable queue limits.
> > 
> > Other use cases? 
> > 
> > We will greatly appreciate if folks on this list can shed light on experiences using schedulers implemented in hadoop, mesos, storm or other frameworks outside of their intended use. For instance, hadoop (yarn) capacity [2] and fair schedulers [3][4][5] seem to meet the needs of airavata. Is it a good idea to attempt to reuse these implementations? Any other pluggable third-party alternatives. 
> > 
> > Thanks in advance for your time and insights,
> > 
> > Suresh
> > 
> > [1] - https://www.tacc.utexas.edu/user-services/user-guides/stampede-user-guide#running
> > [2] - http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
> > [3] - http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html
> > [4] - https://issues.apache.org/jira/browse/HADOOP-3746
> > [5] - https://issues.apache.org/jira/browse/YARN-326
> > 
> > 

Re: Scheduling strategies for Airavata

Posted by Borries Demeler <de...@biochem.uthscsa.edu>.
Our application involves submitting several hundred quite small computational jobs
(a couple of minutes on most clusters, ~128 cores, give or take), running the same
code on multiple datasets.

We are hitting the limit of 50 jobs on TACC resources, with all others failing. The
problem is made worse because all users submit under a community account, which
treats every submission as part of the same allocation account.

I see a few possibilities:

1. a separate FIFO queue, making sure none of the resources get overloaded by any community account user

2. somehow submitting all jobs as a single job, requested for the aggregate walltime
of all jobs. A special work script would spawn the individual jobs underneath the
parent submission. Not sure if this is feasible or reasonable.

3. spreading the jobs around all possible resources

4. a combination of 1 and 3.
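
To give a rough idea of what I mean by 4 (purely a sketch with made-up names; the active-job counts would have to come from whatever globally tracks our community account):

    import java.util.List;

    // Sketch of option 4: jobs wait in a per-community-account FIFO, and when one is
    // released it goes to whichever resource currently has the most free slots left
    // under that account's per-queue limit. Names and fields are illustrative only.
    public class SpreadingDispatcher {

        public static class Resource {
            final String name;
            final int jobLimitForCommunityAccount;   // e.g. 50 on Stampede's normal queue
            int activeCommunityJobs;                 // queued + running under the account

            Resource(String name, int jobLimitForCommunityAccount) {
                this.name = name;
                this.jobLimitForCommunityAccount = jobLimitForCommunityAccount;
            }

            int freeSlots() { return jobLimitForCommunityAccount - activeCommunityJobs; }
        }

        /** Pick the resource with the most head-room, or null if every queue is full. */
        public static Resource pickResource(List<Resource> resources) {
            Resource best = null;
            for (Resource r : resources) {
                if (r.freeSlots() > 0 && (best == null || r.freeSlots() > best.freeSlots())) {
                    best = r;
                }
            }
            return best;   // null means: keep the job in the FIFO until a slot frees up
        }
    }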

-Borries




On Tue, Sep 02, 2014 at 07:50:12AM -0400, Suresh Marru wrote:
> Hi All,
> 
> Need some guidance on identifying a scheduling strategy and a pluggable third party implementation for airavata scheduling needs. For context let me describe the use cases for scheduling within airavata:
> 
> * If we gateway/user is submitting a series of jobs, airavata is currently not throttling them and sending them to compute clusters (in a FIFO way). Resources enforce per user job limit within a queue and ensure fair use of the clusters ((example: stampede allows 50 jobs per user in the normal queue [1]). Airavata will need to implement queues and throttle jobs respecting the max-job-per-queue limits of a underlying resource queue. 
>  
> * Current version of Airavata is also not performing job scheduling across available computational resources and expecting gateways/users to pick resources during experiment launch. Airavata will need to implement schedulers which become aware of existing loads on the clusters and spread jobs efficiently. The scheduler should be able to get access to heuristics on previous executions and current requirements which includes job size (number of nodes/cores), memory requirements, wall time estimates and so forth. 
> 
> * As Airavata is mapping multiple individual user jobs into one or more community account submissions, it also becomes critical to implement fair-share scheduling among these users to ensure fair use of allocations as well as allowable queue limits.
> 
> Other use cases? 
> 
> We will greatly appreciate if folks on this list can shed light on experiences using schedulers implemented in hadoop, mesos, storm or other frameworks outside of their intended use. For instance, hadoop (yarn) capacity [2] and fair schedulers [3][4][5] seem to meet the needs of airavata. Is it a good idea to attempt to reuse these implementations? Any other pluggable third-party alternatives. 
> 
> Thanks in advance for your time and insights,
> 
> Suresh
> 
> [1] - https://www.tacc.utexas.edu/user-services/user-guides/stampede-user-guide#running
> [2] - http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
> [3] - http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html
> [4] - https://issues.apache.org/jira/browse/HADOOP-3746
> [5] - https://issues.apache.org/jira/browse/YARN-326
> 
> 

Re: Scheduling strategies for Airavata

Posted by K Yoshimoto <ke...@sdsc.edu>.
On Tue, Sep 02, 2014 at 01:02:18PM -0400, Suresh Marru wrote:
> Hi Kenneth,
> 
> On Sep 2, 2014, at 12:44 PM, K Yoshimoto <ke...@sdsc.edu> wrote:
> 
> > 
> > The tricky thing is the need to maintain an internal queue of
> > jobs when the Stampede queued jobs limit is reached.  If airavata
> > has an internal representation for jobs to be submitted, I think you
> > are most of the way there.
> 
> Airavata has an internal representation of jobs, but there is no good global view of all the jobs running on a given resource for a given community account. We are trying to fix this, once this is done, as you say, the FIFO implementation should be straight forward. 
> 
> > It is tricky to do resource-matching scheduling when the job mix
> > is not known.  For example, the scheduler does not know whether
> > to preserve memory vs cores when deciding where to place a job.
> > Also, the interactions of the gateway scheduler and the local
> > schedulers may be complicated to predict.
> > 
> > Fair share is probably not a good idea.  In practice, it tends
> > to disrupt the other scheduling policies such that one group is
> > penalized and the others don't run much earlier.
> 
> Interesting. What do you think of the capacity based scheduling algorithm (linked below)?

I scanned through the YARN stuff, and it was not clear to me
what their scheduling algorithm is.  It looks like they only do
resource-based scheduling for memory requirements.  Also, it
looks more like a way to schedule a cluster than a metascheduler.

> 
> > 
> > One option is to maintain the gateway job queue internally,
> > then use the MCP brute force approach: submit to all resources,
> > then cancel after the first job start.  You may also want to
> > allow the gateway to set per-resource policy limits on
> > number of jobs, job duration, job core size, SUs, etc.
> 
> MCP is something we should try. The limits per gateway per resource exists, but we need to exercise these capabilities. 

I don't think there's a need to use any of the MCP Python code.  Instead,
just implement the simple brute-force approach in Airavata's scheduling
routines.
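
Something like the following, in rough Java terms (ResourceClient and its methods are placeholders for whatever submission/monitoring layer airavata already provides; this is only a sketch of the idea, not a recommendation on the actual interfaces):

    import java.util.ArrayList;
    import java.util.List;

    // Brute-force "submit everywhere, keep the first copy that starts" sketch.
    public class BruteForcePlacement {

        public interface ResourceClient {
            String submitCopy(String jobSpec);        // returns the remote job id
            boolean hasStarted(String remoteJobId);
            void cancel(String remoteJobId);
        }

        public static class Placement {
            final ResourceClient resource;
            final String remoteJobId;
            Placement(ResourceClient resource, String remoteJobId) {
                this.resource = resource;
                this.remoteJobId = remoteJobId;
            }
        }

        /** Submit the same job to every resource, then cancel all but the first that starts. */
        public static Placement run(String jobSpec, List<ResourceClient> resources)
                throws InterruptedException {
            List<Placement> copies = new ArrayList<Placement>();
            for (ResourceClient r : resources) {
                copies.add(new Placement(r, r.submitCopy(jobSpec)));
            }
            while (true) {
                for (Placement candidate : copies) {
                    if (candidate.resource.hasStarted(candidate.remoteJobId)) {
                        for (Placement other : copies) {
                            if (other != candidate) {
                                other.resource.cancel(other.remoteJobId);   // withdraw the losers
                            }
                        }
                        return candidate;
                    }
                }
                Thread.sleep(30000);   // poll; a real version would react to job status events instead
            }
        }
    }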

Kenneth

> 
> Suresh
> 
> > 
> > On Tue, Sep 02, 2014 at 07:50:12AM -0400, Suresh Marru wrote:
> >> Hi All,
> >> 
> >> Need some guidance on identifying a scheduling strategy and a pluggable third party implementation for airavata scheduling needs. For context let me describe the use cases for scheduling within airavata:
> >> 
> >> * If we gateway/user is submitting a series of jobs, airavata is currently not throttling them and sending them to compute clusters (in a FIFO way). Resources enforce per user job limit within a queue and ensure fair use of the clusters ((example: stampede allows 50 jobs per user in the normal queue [1]). Airavata will need to implement queues and throttle jobs respecting the max-job-per-queue limits of a underlying resource queue. 
> >> 
> >> * Current version of Airavata is also not performing job scheduling across available computational resources and expecting gateways/users to pick resources during experiment launch. Airavata will need to implement schedulers which become aware of existing loads on the clusters and spread jobs efficiently. The scheduler should be able to get access to heuristics on previous executions and current requirements which includes job size (number of nodes/cores), memory requirements, wall time estimates and so forth. 
> >> 
> >> * As Airavata is mapping multiple individual user jobs into one or more community account submissions, it also becomes critical to implement fair-share scheduling among these users to ensure fair use of allocations as well as allowable queue limits.
> >> 
> >> Other use cases? 
> >> 
> >> We will greatly appreciate if folks on this list can shed light on experiences using schedulers implemented in hadoop, mesos, storm or other frameworks outside of their intended use. For instance, hadoop (yarn) capacity [2] and fair schedulers [3][4][5] seem to meet the needs of airavata. Is it a good idea to attempt to reuse these implementations? Any other pluggable third-party alternatives. 
> >> 
> >> Thanks in advance for your time and insights,
> >> 
> >> Suresh
> >> 
> >> [1] - https://www.tacc.utexas.edu/user-services/user-guides/stampede-user-guide#running
> >> [2] - http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
> >> [3] - http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html
> >> [4] - https://issues.apache.org/jira/browse/HADOOP-3746
> >> [5] - https://issues.apache.org/jira/browse/YARN-326
> >> 
> >> 

Re: Scheduling strategies for Airavata

Posted by Borries Demeler <de...@biochem.uthscsa.edu>.
Thanks, Suresh.
One comment for now: I can't tell from your write-up how your
scheduler proposal will deal with the fact that we have many community
users submitting to a common community account. So what the scheduler
really needs to do, somehow, is to *globally* track all jobs from all
gateways that potentially use our community account on the same
resource before it can decide how to throttle/buffer.

You already have a field there for checking the health of a resource. I
think your "Validate allocations..." field also needs to include
something like "Count current jobs on the resource that charge the same
community account", because you will need to know how many more jobs can
be submitted from any user and any gateway so that the Throttle-Job
component can make the right decision. Otherwise, it looks good. I also
agree that we should investigate other programs that have been written
with metascheduling in mind, to see if they have good solutions that we
can integrate rather than reinvent. You guys are the experts there :-)
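
In code terms, the check I have in mind is roughly this (the record and field names are invented; the important part is that the count spans all gateways charging the same account on that resource):

    import java.util.List;

    // Illustrative only: before releasing another job, count every in-flight job on the
    // target resource that charges the same community allocation, across ALL gateways,
    // not just the gateway that is about to submit.
    public class CommunityAccountView {

        public static class JobRecord {
            String gatewayId;
            String resourceId;
            String communityAccount;   // the allocation the job is charged to
            boolean inFlight;          // launched, queued or running
        }

        public static int countChargedJobs(List<JobRecord> allJobsKnownToAiravata,
                                           String resourceId, String communityAccount) {
            int n = 0;
            for (JobRecord j : allJobsKnownToAiravata) {
                if (j.inFlight
                        && resourceId.equals(j.resourceId)
                        && communityAccount.equals(j.communityAccount)) {
                    n++;   // counts jobs from every gateway sharing this account
                }
            }
            return n;
        }
    }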

Thanks, -b.

On Wed, Sep 03, 2014 at 08:50:15AM -0400, Suresh Marru wrote:
> Thank you all for comments and suggestions. I summarized the discussion as an implementation plan on a wiki page:
> 
> https://cwiki.apache.org/confluence/display/AIRAVATA/Airavata+Metascheduler
> 
> If this is amenable, we can take this to the dev list to plan the development in two phases. First implement Throttle-Job in the short term, and then plan the Auto-Scheduling capabilities. 
> 
> Suresh
> 
> On Sep 2, 2014, at 1:50 PM, Gary E. Gorbet <ge...@gmail.com> wrote:
> 
> > It seems to me that among many possible functions a metascheduler (MS) would provide, there are two separate ones that must be addressed first. The two use cases implied are as follows.
> > 
> > (1) The gateway submits a group of jobs to a specified resource where the count of jobs exceeds the resource’s queued job limit. Let’s say 300 very quick jobs are submitted, where the limit is 50 per community user. The MS must maintain an internal queue and release jobs to the resource in groups with job counts under the limit (say, 40 at a time).
> > 
> > (2) The gateway submits a job or set of jobs with a flag that specifies that Airavata choose the resource. Here, MCP or some other mechanism arrives eventually at the specific resource that completes the job(s).
> > 
> > Where both uses are needed - unspecified resource and a group of jobs with count exceeding limits - the MS action would be best defined by knowing the definitions and mechanisms employed in the two separate functions. For example, if MCP is employed, the initial brute force test submissions might need to be done using the determined number of jobs at a time (e.g., 40). But the design here must adhere to design criteria arrived at for both function (1) and function (2).
> > 
> > In UltraScan’s case, the most immediate need is for (1). The user could manually determine the best resource or just make a reasonable guess. What the user does not want to do is manually release jobs 40 at a time. The gateway interface allows specification of a group of 300 jobs and the user does not care what is going on under the covers to effect the running of all of them eventually. So, I guess I am lobbying for addressing (1) first; both to meet UltraScan’s immediate need and to elucidate the design of more sophisticated functionality.
> > 
> > - Gary
> > 
> > On Sep 2, 2014, at 12:02 PM, Suresh Marru <sm...@apache.org> wrote:
> > 
> >> Hi Kenneth,
> >> 
> >> On Sep 2, 2014, at 12:44 PM, K Yoshimoto <ke...@sdsc.edu> wrote:
> >> 
> >>> 
> >>> The tricky thing is the need to maintain an internal queue of
> >>> jobs when the Stampede queued jobs limit is reached.  If airavata
> >>> has an internal representation for jobs to be submitted, I think you
> >>> are most of the way there.
> >> 
> >> Airavata has an internal representation of jobs, but there is no good global view of all the jobs running on a given resource for a given community account. We are trying to fix this, once this is done, as you say, the FIFO implementation should be straight forward. 
> >> 
> >>> It is tricky to do resource-matching scheduling when the job mix
> >>> is not known.  For example, the scheduler does not know whether
> >>> to preserve memory vs cores when deciding where to place a job.
> >>> Also, the interactions of the gateway scheduler and the local
> >>> schedulers may be complicated to predict.
> >>> 
> >>> Fair share is probably not a good idea.  In practice, it tends
> >>> to disrupt the other scheduling policies such that one group is
> >>> penalized and the others don't run much earlier.
> >> 
> >> Interesting. What do you think of the capacity based scheduling algorithm (linked below)?
> >> 
> >>> 
> >>> One option is to maintain the gateway job queue internally,
> >>> then use the MCP brute force approach: submit to all resources,
> >>> then cancel after the first job start.  You may also want to
> >>> allow the gateway to set per-resource policy limits on
> >>> number of jobs, job duration, job core size, SUs, etc.
> >> 
> >> MCP is something we should try. The limits per gateway per resource exists, but we need to exercise these capabilities. 
> >> 
> >> Suresh
> >> 
> >>> 
> >>> On Tue, Sep 02, 2014 at 07:50:12AM -0400, Suresh Marru wrote:
> >>>> Hi All,
> >>>> 
> >>>> Need some guidance on identifying a scheduling strategy and a pluggable third party implementation for airavata scheduling needs. For context let me describe the use cases for scheduling within airavata:
> >>>> 
> >>>> * If we gateway/user is submitting a series of jobs, airavata is currently not throttling them and sending them to compute clusters (in a FIFO way). Resources enforce per user job limit within a queue and ensure fair use of the clusters ((example: stampede allows 50 jobs per user in the normal queue [1]). Airavata will need to implement queues and throttle jobs respecting the max-job-per-queue limits of a underlying resource queue. 
> >>>> 
> >>>> * Current version of Airavata is also not performing job scheduling across available computational resources and expecting gateways/users to pick resources during experiment launch. Airavata will need to implement schedulers which become aware of existing loads on the clusters and spread jobs efficiently. The scheduler should be able to get access to heuristics on previous executions and current requirements which includes job size (number of nodes/cores), memory requirements, wall time estimates and so forth. 
> >>>> 
> >>>> * As Airavata is mapping multiple individual user jobs into one or more community account submissions, it also becomes critical to implement fair-share scheduling among these users to ensure fair use of allocations as well as allowable queue limits.
> >>>> 
> >>>> Other use cases? 
> >>>> 
> >>>> We will greatly appreciate if folks on this list can shed light on experiences using schedulers implemented in hadoop, mesos, storm or other frameworks outside of their intended use. For instance, hadoop (yarn) capacity [2] and fair schedulers [3][4][5] seem to meet the needs of airavata. Is it a good idea to attempt to reuse these implementations? Any other pluggable third-party alternatives. 
> >>>> 
> >>>> Thanks in advance for your time and insights,
> >>>> 
> >>>> Suresh
> >>>> 
> >>>> [1] - https://www.tacc.utexas.edu/user-services/user-guides/stampede-user-guide#running
> >>>> [2] - http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
> >>>> [3] - http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html
> >>>> [4] - https://issues.apache.org/jira/browse/HADOOP-3746
> >>>> [5] - https://issues.apache.org/jira/browse/YARN-326
> >>>> 
> >>>> 
> >> 
> > 

Re: Scheduling strategies for Airavata

Posted by Amila Jayasekara <th...@gmail.com>.
On Thu, Sep 4, 2014 at 5:55 PM, Suresh Marru <sm...@apache.org> wrote:

> Eran,
>
> This is a good read and infact sounds very similar in situation (picking a
> well known solution vs writing our own).



> "As you may recollect, Airavata’s key challenge is in identifying the
> resources which have the shortest queue time across many resources." -
> Well... to be precise, Airavata needs to identify the resource that allows
> the user's application to execute in the minimum time. Queue time is only
> one factor in that decision. The set of resources accessible to the
> community account is another factor. There are more factors the scheduler
> needs to take into account, e.g. speed, memory, number of cores per node,
> and so on. If you want to make the scheduler more interesting, you can also
> consider parameters such as job placement within nodes, network
> connectivity, NUMA patterns, etc. But I think those are too much, at least
> for an initial version of the scheduler.
>

Thanks
-Amila



> And of course, it will have use cases like re-using cloud resources for
> individual jobs part of a larger workflow (a flavor of your thesis topic if
> you still remember) and so on. So my question is, are Mesos or Aurora’s use
> cases in managing a fixed set of resources, I mean the challenge in
> spreading M jobs across N resources efficiently with fair-share, varying
> memory and I/O requirements and so on? Or did you also come across examples
> which will resonate with meta-schedulers interacting with multiple lower
> level schedulers?
>
> Thanks,
> Suresh
>
> On Sep 4, 2014, at 5:38 PM, Eran Chinthaka Withana <
> eran.chinthaka@gmail.com> wrote:
>
> > oops, sorry. Here it is:
> > http://www.mail-archive.com/user@mesos.apache.org/msg01417.html
> >
> > Thanks,
> > Eran Chinthaka Withana
> >
> >
> > On Thu, Sep 4, 2014 at 2:22 PM, Suresh Marru <sm...@apache.org> wrote:
> >
> >> Hi Eran, Jijoe
> >>
> >> Can you share the missing reference you indicate below?
> >>
> >> Ofcourse by all means its good for Airavata to build over projects like
> >> Mesos, thats my motivation for this discussion. I am not yet suggesting
> >> implementing a scheduler, that will be a distraction. The meta
> scheduler I
> >> illustrated is a mere routing to be injected into airavata job
> management
> >> with a simple FIFO. We looking forward to hearing options from you all
> on
> >> whats the right third party software is. Manu Singh a first year
> graduate
> >> student at IU volunteers to do a academic study of these solutions, so
> will
> >> appreciate pointers.
> >>
> >> Suresh
> >>
> >> On Sep 3, 2014, at 11:59 AM, Eran Chinthaka Withana <
> >> eran.chinthaka@gmail.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> Before you go ahead and implement on your own, consider reading this
> mail
> >>> thread[1] and looking at how frameworks like Apache Aurora does it on
> top
> >>> of Apache Mesos. These may provide good inputs for this implementation.
> >>>
> >>> (thanks to Jijoe also who provided input for this)
> >>>
> >>>
> >>>
> >>> Thanks,
> >>> Eran Chinthaka Withana
> >>>
> >>>
> >>> On Wed, Sep 3, 2014 at 5:50 AM, Suresh Marru <sm...@apache.org>
> wrote:
> >>>
> >>>> Thank you all for comments and suggestions. I summarized the
> discussion
> >> as
> >>>> a implementation plan on a wiki page:
> >>>>
> >>>>
> >>
> https://cwiki.apache.org/confluence/display/AIRAVATA/Airavata+Metascheduler
> >>>>
> >>>> If this is amenable, we can take this to dev list to plan the
> >> development
> >>>> in two phases. First implement the Throttle-Job in and short term and
> >> then
> >>>> plan the Auto-Scheduling capabilities.
> >>>>
> >>>> Suresh
> >>>>
> >>>> On Sep 2, 2014, at 1:50 PM, Gary E. Gorbet <ge...@gmail.com>
> wrote:
> >>>>
> >>>>> It seems to me that among many possible functions a metascheduler
> (MS)
> >>>> would provide, there are two separate ones that must be addressed
> first.
> >>>> The two use cases implied are as follows.
> >>>>>
> >>>>> (1) The gateway submits a group of jobs to a specified resource where
> >>>> the count of jobs exceeds the resource’s queued job limit. Let’s say
> 300
> >>>> very quick jobs are submitted, where the limit is 50 per community
> user.
> >>>> The MS must maintain an internal queue and release jobs to the
> resource
> >> in
> >>>> groups with job counts under the limit (say, 40 at a time).
> >>>>>
> >>>>> (2) The gateway submits a job or set of jobs with a flag that
> specifies
> >>>> that Airavata choose the resource. Here, MCP or some other mechanism
> >>>> arrives eventually at the specific resource that completes the job(s).
> >>>>>
> >>>>> Where both uses are needed - unspecified resource and a group of jobs
> >>>> with count exceeding limits - the MS action would be best defined by
> >>>> knowing the definitions and mechanisms employed in the two separate
> >>>> functions. For example, if MCP is employed, the initial brute force
> test
> >>>> submissions might need to be done using the determined number of jobs
> >> at a
> >>>> time (e.g., 40). But the design here must adhere to design criteria
> >> arrived
> >>>> at for both function (1) and function (2).
> >>>>>
> >>>>> In UltraScan’s case, the most immediate need is for (1). The user
> could
> >>>> manually determine the best resource or just make a reasonable guess.
> >> What
> >>>> the user does not want to do is manually release jobs 40 at a time.
> The
> >>>> gateway interface allows specification of a group of 300 jobs and the
> >> user
> >>>> does not care what is going on under the covers to effect the running
> of
> >>>> all of them eventually. So, I guess I am lobbying for addressing (1)
> >> first;
> >>>> both to meet UltraScan’s immediate need and to elucidate the design of
> >> more
> >>>> sophisticated functionality.
> >>>>>
> >>>>> - Gary
> >>>>>
> >>>>> On Sep 2, 2014, at 12:02 PM, Suresh Marru <sm...@apache.org> wrote:
> >>>>>
> >>>>>> Hi Kenneth,
> >>>>>>
> >>>>>> On Sep 2, 2014, at 12:44 PM, K Yoshimoto <ke...@sdsc.edu> wrote:
> >>>>>>
> >>>>>>>
> >>>>>>> The tricky thing is the need to maintain an internal queue of
> >>>>>>> jobs when the Stampede queued jobs limit is reached.  If airavata
> >>>>>>> has an internal representation for jobs to be submitted, I think
> you
> >>>>>>> are most of the way there.
> >>>>>>
> >>>>>> Airavata has an internal representation of jobs, but there is no
> good
> >>>> global view of all the jobs running on a given resource for a given
> >>>> community account. We are trying to fix this, once this is done, as
> you
> >>>> say, the FIFO implementation should be straight forward.
> >>>>>>
> >>>>>>> It is tricky to do resource-matching scheduling when the job mix
> >>>>>>> is not known.  For example, the scheduler does not know whether
> >>>>>>> to preserve memory vs cores when deciding where to place a job.
> >>>>>>> Also, the interactions of the gateway scheduler and the local
> >>>>>>> schedulers may be complicated to predict.
> >>>>>>>
> >>>>>>> Fair share is probably not a good idea.  In practice, it tends
> >>>>>>> to disrupt the other scheduling policies such that one group is
> >>>>>>> penalized and the others don't run much earlier.
> >>>>>>
> >>>>>> Interesting. What do you think of the capacity based scheduling
> >>>> algorithm (linked below)?
> >>>>>>
> >>>>>>>
> >>>>>>> One option is to maintain the gateway job queue internally,
> >>>>>>> then use the MCP brute force approach: submit to all resources,
> >>>>>>> then cancel after the first job start.  You may also want to
> >>>>>>> allow the gateway to set per-resource policy limits on
> >>>>>>> number of jobs, job duration, job core size, SUs, etc.
> >>>>>>
> >>>>>> MCP is something we should try. The limits per gateway per resource
> >>>> exists, but we need to exercise these capabilities.
> >>>>>>
> >>>>>> Suresh
> >>>>>>
> >>>>>>>
> >>>>>>> On Tue, Sep 02, 2014 at 07:50:12AM -0400, Suresh Marru wrote:
> >>>>>>>> Hi All,
> >>>>>>>>
> >>>>>>>> Need some guidance on identifying a scheduling strategy and a
> >>>> pluggable third party implementation for airavata scheduling needs.
> For
> >>>> context let me describe the use cases for scheduling within airavata:
> >>>>>>>>
> >>>>>>>> * If we gateway/user is submitting a series of jobs, airavata is
> >>>> currently not throttling them and sending them to compute clusters
> (in a
> >>>> FIFO way). Resources enforce per user job limit within a queue and
> >> ensure
> >>>> fair use of the clusters ((example: stampede allows 50 jobs per user
> in
> >> the
> >>>> normal queue [1]). Airavata will need to implement queues and throttle
> >> jobs
> >>>> respecting the max-job-per-queue limits of a underlying resource
> queue.
> >>>>>>>>
> >>>>>>>> * Current version of Airavata is also not performing job
> scheduling
> >>>> across available computational resources and expecting gateways/users
> to
> >>>> pick resources during experiment launch. Airavata will need to
> implement
> >>>> schedulers which become aware of existing loads on the clusters and
> >> spread
> >>>> jobs efficiently. The scheduler should be able to get access to
> >> heuristics
> >>>> on previous executions and current requirements which includes job
> size
> >>>> (number of nodes/cores), memory requirements, wall time estimates and
> so
> >>>> forth.
> >>>>>>>>
> >>>>>>>> * As Airavata is mapping multiple individual user jobs into one or
> >>>> more community account submissions, it also becomes critical to
> >> implement
> >>>> fair-share scheduling among these users to ensure fair use of
> >> allocations
> >>>> as well as allowable queue limits.
> >>>>>>>>
> >>>>>>>> Other use cases?
> >>>>>>>>
> >>>>>>>> We will greatly appreciate if folks on this list can shed light on
> >>>> experiences using schedulers implemented in hadoop, mesos, storm or
> >> other
> >>>> frameworks outside of their intended use. For instance, hadoop (yarn)
> >>>> capacity [2] and fair schedulers [3][4][5] seem to meet the needs of
> >>>> airavata. Is it a good idea to attempt to reuse these implementations?
> >> Any
> >>>> other pluggable third-party alternatives.
> >>>>>>>>
> >>>>>>>> Thanks in advance for your time and insights,
> >>>>>>>>
> >>>>>>>> Suresh
> >>>>>>>>
> >>>>>>>> [1] -
> >>>>
> >>
> https://www.tacc.utexas.edu/user-services/user-guides/stampede-user-guide#running
> >>>>>>>> [2] -
> >>>>
> >>
> http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
> >>>>>>>> [3] -
> >>>>
> >>
> http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html
> >>>>>>>> [4] - https://issues.apache.org/jira/browse/HADOOP-3746
> >>>>>>>> [5] - https://issues.apache.org/jira/browse/YARN-326
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>
> >>
>
>

Re: Scheduling strategies for Airavata

Posted by Suresh Marru <sm...@apache.org>.
Eran, 

This is a good read and in fact the situation sounds very similar (picking a well-known solution vs. writing our own). As you may recollect, Airavata’s key challenge is identifying the resources with the shortest queue time across many resources. And of course, it will have use cases like re-using cloud resources for individual jobs that are part of a larger workflow (a flavor of your thesis topic, if you still remember) and so on. So my question is: are Mesos’s or Aurora’s use cases limited to managing a fixed set of resources? I mean the challenge of spreading M jobs across N resources efficiently, with fair-share, varying memory and I/O requirements, and so on. Or did you also come across examples that resonate with meta-schedulers interacting with multiple lower-level schedulers? 
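
To make the question concrete, the kind of per-job decision I picture on the Airavata side is roughly the following, leaving fair-share aside for the moment (all the fields and the wait-time estimate are placeholders that would have to be fed from monitoring data and heuristics over past runs):

    import java.util.List;

    // Illustrative scoring of candidate resources for one job: filter on hard
    // requirements (cores, memory), then prefer the shortest estimated queue wait.
    public class ResourceChooser {

        public static class Candidate {
            String name;
            int freeCoresPerNode;
            double memoryPerNodeGb;
            double estimatedQueueWaitMinutes;   // from monitoring/heuristics, not exact
        }

        public static Candidate choose(List<Candidate> candidates,
                                       int coresNeeded, double memGbNeeded) {
            Candidate best = null;
            for (Candidate c : candidates) {
                if (c.freeCoresPerNode < coresNeeded || c.memoryPerNodeGb < memGbNeeded) {
                    continue;   // cannot satisfy the job's hard requirements
                }
                if (best == null || c.estimatedQueueWaitMinutes < best.estimatedQueueWaitMinutes) {
                    best = c;
                }
            }
            return best;   // null: nothing fits; the job stays in the internal queue
        }
    }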

Thanks,
Suresh

On Sep 4, 2014, at 5:38 PM, Eran Chinthaka Withana <er...@gmail.com> wrote:

> oops, sorry. Here it is:
> http://www.mail-archive.com/user@mesos.apache.org/msg01417.html
> 
> Thanks,
> Eran Chinthaka Withana
> 
> 
> On Thu, Sep 4, 2014 at 2:22 PM, Suresh Marru <sm...@apache.org> wrote:
> 
>> Hi Eran, Jijoe
>> 
>> Can you share the missing reference you indicate below?
>> 
>> Ofcourse by all means its good for Airavata to build over projects like
>> Mesos, thats my motivation for this discussion. I am not yet suggesting
>> implementing a scheduler, that will be a distraction. The meta scheduler I
>> illustrated is a mere routing to be injected into airavata job management
>> with a simple FIFO. We looking forward to hearing options from you all on
>> whats the right third party software is. Manu Singh a first year graduate
>> student at IU volunteers to do a academic study of these solutions, so will
>> appreciate pointers.
>> 
>> Suresh
>> 
>> On Sep 3, 2014, at 11:59 AM, Eran Chinthaka Withana <
>> eran.chinthaka@gmail.com> wrote:
>> 
>>> Hi,
>>> 
>>> Before you go ahead and implement on your own, consider reading this mail
>>> thread[1] and looking at how frameworks like Apache Aurora does it on top
>>> of Apache Mesos. These may provide good inputs for this implementation.
>>> 
>>> (thanks to Jijoe also who provided input for this)
>>> 
>>> 
>>> 
>>> Thanks,
>>> Eran Chinthaka Withana
>>> 
>>> 
>>> On Wed, Sep 3, 2014 at 5:50 AM, Suresh Marru <sm...@apache.org> wrote:
>>> 
>>>> Thank you all for comments and suggestions. I summarized the discussion
>> as
>>>> a implementation plan on a wiki page:
>>>> 
>>>> 
>> https://cwiki.apache.org/confluence/display/AIRAVATA/Airavata+Metascheduler
>>>> 
>>>> If this is amenable, we can take this to dev list to plan the
>> development
>>>> in two phases. First implement the Throttle-Job in and short term and
>> then
>>>> plan the Auto-Scheduling capabilities.
>>>> 
>>>> Suresh
>>>> 
>>>> On Sep 2, 2014, at 1:50 PM, Gary E. Gorbet <ge...@gmail.com> wrote:
>>>> 
>>>>> It seems to me that among many possible functions a metascheduler (MS)
>>>> would provide, there are two separate ones that must be addressed first.
>>>> The two use cases implied are as follows.
>>>>> 
>>>>> (1) The gateway submits a group of jobs to a specified resource where
>>>> the count of jobs exceeds the resource’s queued job limit. Let’s say 300
>>>> very quick jobs are submitted, where the limit is 50 per community user.
>>>> The MS must maintain an internal queue and release jobs to the resource
>> in
>>>> groups with job counts under the limit (say, 40 at a time).
>>>>> 
>>>>> (2) The gateway submits a job or set of jobs with a flag that specifies
>>>> that Airavata choose the resource. Here, MCP or some other mechanism
>>>> arrives eventually at the specific resource that completes the job(s).
>>>>> 
>>>>> Where both uses are needed - unspecified resource and a group of jobs
>>>> with count exceeding limits - the MS action would be best defined by
>>>> knowing the definitions and mechanisms employed in the two separate
>>>> functions. For example, if MCP is employed, the initial brute force test
>>>> submissions might need to be done using the determined number of jobs
>> at a
>>>> time (e.g., 40). But the design here must adhere to design criteria
>> arrived
>>>> at for both function (1) and function (2).
>>>>> 
>>>>> In UltraScan’s case, the most immediate need is for (1). The user could
>>>> manually determine the best resource or just make a reasonable guess.
>> What
>>>> the user does not want to do is manually release jobs 40 at a time. The
>>>> gateway interface allows specification of a group of 300 jobs and the
>> user
>>>> does not care what is going on under the covers to effect the running of
>>>> all of them eventually. So, I guess I am lobbying for addressing (1)
>> first;
>>>> both to meet UltraScan’s immediate need and to elucidate the design of
>> more
>>>> sophisticated functionality.
>>>>> 
>>>>> - Gary
>>>>> 
>>>>> On Sep 2, 2014, at 12:02 PM, Suresh Marru <sm...@apache.org> wrote:
>>>>> 
>>>>>> Hi Kenneth,
>>>>>> 
>>>>>> On Sep 2, 2014, at 12:44 PM, K Yoshimoto <ke...@sdsc.edu> wrote:
>>>>>> 
>>>>>>> 
>>>>>>> The tricky thing is the need to maintain an internal queue of
>>>>>>> jobs when the Stampede queued jobs limit is reached.  If airavata
>>>>>>> has an internal representation for jobs to be submitted, I think you
>>>>>>> are most of the way there.
>>>>>> 
>>>>>> Airavata has an internal representation of jobs, but there is no good
>>>> global view of all the jobs running on a given resource for a given
>>>> community account. We are trying to fix this, once this is done, as you
>>>> say, the FIFO implementation should be straight forward.
>>>>>> 
>>>>>>> It is tricky to do resource-matching scheduling when the job mix
>>>>>>> is not known.  For example, the scheduler does not know whether
>>>>>>> to preserve memory vs cores when deciding where to place a job.
>>>>>>> Also, the interactions of the gateway scheduler and the local
>>>>>>> schedulers may be complicated to predict.
>>>>>>> 
>>>>>>> Fair share is probably not a good idea.  In practice, it tends
>>>>>>> to disrupt the other scheduling policies such that one group is
>>>>>>> penalized and the others don't run much earlier.
>>>>>> 
>>>>>> Interesting. What do you think of the capacity based scheduling
>>>> algorithm (linked below)?
>>>>>> 
>>>>>>> 
>>>>>>> One option is to maintain the gateway job queue internally,
>>>>>>> then use the MCP brute force approach: submit to all resources,
>>>>>>> then cancel after the first job start.  You may also want to
>>>>>>> allow the gateway to set per-resource policy limits on
>>>>>>> number of jobs, job duration, job core size, SUs, etc.
>>>>>> 
>>>>>> MCP is something we should try. The limits per gateway per resource
>>>> exists, but we need to exercise these capabilities.
>>>>>> 
>>>>>> Suresh
>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Sep 02, 2014 at 07:50:12AM -0400, Suresh Marru wrote:
>>>>>>>> Hi All,
>>>>>>>> 
>>>>>>>> Need some guidance on identifying a scheduling strategy and a
>>>> pluggable third party implementation for airavata scheduling needs. For
>>>> context let me describe the use cases for scheduling within airavata:
>>>>>>>> 
>>>>>>>> * If we gateway/user is submitting a series of jobs, airavata is
>>>> currently not throttling them and sending them to compute clusters (in a
>>>> FIFO way). Resources enforce per user job limit within a queue and
>> ensure
>>>> fair use of the clusters ((example: stampede allows 50 jobs per user in
>> the
>>>> normal queue [1]). Airavata will need to implement queues and throttle
>> jobs
>>>> respecting the max-job-per-queue limits of a underlying resource queue.
>>>>>>>> 
>>>>>>>> * Current version of Airavata is also not performing job scheduling
>>>> across available computational resources and expecting gateways/users to
>>>> pick resources during experiment launch. Airavata will need to implement
>>>> schedulers which become aware of existing loads on the clusters and
>> spread
>>>> jobs efficiently. The scheduler should be able to get access to
>> heuristics
>>>> on previous executions and current requirements which includes job size
>>>> (number of nodes/cores), memory requirements, wall time estimates and so
>>>> forth.
>>>>>>>> 
>>>>>>>> * As Airavata is mapping multiple individual user jobs into one or
>>>> more community account submissions, it also becomes critical to
>> implement
>>>> fair-share scheduling among these users to ensure fair use of
>> allocations
>>>> as well as allowable queue limits.
>>>>>>>> 
>>>>>>>> Other use cases?
>>>>>>>> 
>>>>>>>> We will greatly appreciate if folks on this list can shed light on
>>>> experiences using schedulers implemented in hadoop, mesos, storm or
>> other
>>>> frameworks outside of their intended use. For instance, hadoop (yarn)
>>>> capacity [2] and fair schedulers [3][4][5] seem to meet the needs of
>>>> airavata. Is it a good idea to attempt to reuse these implementations?
>> Any
>>>> other pluggable third-party alternatives.
>>>>>>>> 
>>>>>>>> Thanks in advance for your time and insights,
>>>>>>>> 
>>>>>>>> Suresh
>>>>>>>> 
>>>>>>>> [1] -
>>>> 
>> https://www.tacc.utexas.edu/user-services/user-guides/stampede-user-guide#running
>>>>>>>> [2] -
>>>> 
>> http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
>>>>>>>> [3] -
>>>> 
>> http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html
>>>>>>>> [4] - https://issues.apache.org/jira/browse/HADOOP-3746
>>>>>>>> [5] - https://issues.apache.org/jira/browse/YARN-326
>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>> 
>> 


Re: Scheduling stratergies for Airavata

Posted by Eran Chinthaka Withana <er...@gmail.com>.
Oops, sorry. Here it is:
http://www.mail-archive.com/user@mesos.apache.org/msg01417.html

Thanks,
Eran Chinthaka Withana


On Thu, Sep 4, 2014 at 2:22 PM, Suresh Marru <sm...@apache.org> wrote:

> Hi Eran, Jijoe
>
> Can you share the missing reference you indicate below?
>
> Ofcourse by all means its good for Airavata to build over projects like
> Mesos, thats my motivation for this discussion. I am not yet suggesting
> implementing a scheduler, that will be a distraction. The meta scheduler I
> illustrated is a mere routing to be injected into airavata job management
> with a simple FIFO. We looking forward to hearing options from you all on
> whats the right third party software is. Manu Singh a first year graduate
> student at IU volunteers to do a academic study of these solutions, so will
> appreciate pointers.
>
> Suresh
>
> On Sep 3, 2014, at 11:59 AM, Eran Chinthaka Withana <
> eran.chinthaka@gmail.com> wrote:
>
> > Hi,
> >
> > Before you go ahead and implement on your own, consider reading this mail
> > thread[1] and looking at how frameworks like Apache Aurora does it on top
> > of Apache Mesos. These may provide good inputs for this implementation.
> >
> > (thanks to Jijoe also who provided input for this)
> >
> >
> >
> > Thanks,
> > Eran Chinthaka Withana
> >
> >
> > On Wed, Sep 3, 2014 at 5:50 AM, Suresh Marru <sm...@apache.org> wrote:
> >
> >> Thank you all for comments and suggestions. I summarized the discussion
> as
> >> a implementation plan on a wiki page:
> >>
> >>
> https://cwiki.apache.org/confluence/display/AIRAVATA/Airavata+Metascheduler
> >>
> >> If this is amenable, we can take this to dev list to plan the
> development
> >> in two phases. First implement the Throttle-Job in and short term and
> then
> >> plan the Auto-Scheduling capabilities.
> >>
> >> Suresh
> >>
> >> On Sep 2, 2014, at 1:50 PM, Gary E. Gorbet <ge...@gmail.com> wrote:
> >>
> >>> It seems to me that among many possible functions a metascheduler (MS)
> >> would provide, there are two separate ones that must be addressed first.
> >> The two use cases implied are as follows.
> >>>
> >>> (1) The gateway submits a group of jobs to a specified resource where
> >> the count of jobs exceeds the resource’s queued job limit. Let’s say 300
> >> very quick jobs are submitted, where the limit is 50 per community user.
> >> The MS must maintain an internal queue and release jobs to the resource
> in
> >> groups with job counts under the limit (say, 40 at a time).
> >>>
> >>> (2) The gateway submits a job or set of jobs with a flag that specifies
> >> that Airavata choose the resource. Here, MCP or some other mechanism
> >> arrives eventually at the specific resource that completes the job(s).
> >>>
> >>> Where both uses are needed - unspecified resource and a group of jobs
> >> with count exceeding limits - the MS action would be best defined by
> >> knowing the definitions and mechanisms employed in the two separate
> >> functions. For example, if MCP is employed, the initial brute force test
> >> submissions might need to be done using the determined number of jobs
> at a
> >> time (e.g., 40). But the design here must adhere to design criteria
> arrived
> >> at for both function (1) and function (2).
> >>>
> >>> In UltraScan’s case, the most immediate need is for (1). The user could
> >> manually determine the best resource or just make a reasonable guess.
> What
> >> the user does not want to do is manually release jobs 40 at a time. The
> >> gateway interface allows specification of a group of 300 jobs and the
> user
> >> does not care what is going on under the covers to effect the running of
> >> all of them eventually. So, I guess I am lobbying for addressing (1)
> first;
> >> both to meet UltraScan’s immediate need and to elucidate the design of
> more
> >> sophisticated functionality.
> >>>
> >>> - Gary
> >>>
> >>> On Sep 2, 2014, at 12:02 PM, Suresh Marru <sm...@apache.org> wrote:
> >>>
> >>>> Hi Kenneth,
> >>>>
> >>>> On Sep 2, 2014, at 12:44 PM, K Yoshimoto <ke...@sdsc.edu> wrote:
> >>>>
> >>>>>
> >>>>> The tricky thing is the need to maintain an internal queue of
> >>>>> jobs when the Stampede queued jobs limit is reached.  If airavata
> >>>>> has an internal representation for jobs to be submitted, I think you
> >>>>> are most of the way there.
> >>>>
> >>>> Airavata has an internal representation of jobs, but there is no good
> >> global view of all the jobs running on a given resource for a given
> >> community account. We are trying to fix this, once this is done, as you
> >> say, the FIFO implementation should be straight forward.
> >>>>
> >>>>> It is tricky to do resource-matching scheduling when the job mix
> >>>>> is not known.  For example, the scheduler does not know whether
> >>>>> to preserve memory vs cores when deciding where to place a job.
> >>>>> Also, the interactions of the gateway scheduler and the local
> >>>>> schedulers may be complicated to predict.
> >>>>>
> >>>>> Fair share is probably not a good idea.  In practice, it tends
> >>>>> to disrupt the other scheduling policies such that one group is
> >>>>> penalized and the others don't run much earlier.
> >>>>
> >>>> Interesting. What do you think of the capacity based scheduling
> >> algorithm (linked below)?
> >>>>
> >>>>>
> >>>>> One option is to maintain the gateway job queue internally,
> >>>>> then use the MCP brute force approach: submit to all resources,
> >>>>> then cancel after the first job start.  You may also want to
> >>>>> allow the gateway to set per-resource policy limits on
> >>>>> number of jobs, job duration, job core size, SUs, etc.
> >>>>
> >>>> MCP is something we should try. The limits per gateway per resource
> >> exists, but we need to exercise these capabilities.
> >>>>
> >>>> Suresh
> >>>>
> >>>>>
> >>>>> On Tue, Sep 02, 2014 at 07:50:12AM -0400, Suresh Marru wrote:
> >>>>>> Hi All,
> >>>>>>
> >>>>>> Need some guidance on identifying a scheduling strategy and a
> >> pluggable third party implementation for airavata scheduling needs. For
> >> context let me describe the use cases for scheduling within airavata:
> >>>>>>
> >>>>>> * If we gateway/user is submitting a series of jobs, airavata is
> >> currently not throttling them and sending them to compute clusters (in a
> >> FIFO way). Resources enforce per user job limit within a queue and
> ensure
> >> fair use of the clusters ((example: stampede allows 50 jobs per user in
> the
> >> normal queue [1]). Airavata will need to implement queues and throttle
> jobs
> >> respecting the max-job-per-queue limits of a underlying resource queue.
> >>>>>>
> >>>>>> * Current version of Airavata is also not performing job scheduling
> >> across available computational resources and expecting gateways/users to
> >> pick resources during experiment launch. Airavata will need to implement
> >> schedulers which become aware of existing loads on the clusters and
> spread
> >> jobs efficiently. The scheduler should be able to get access to
> heuristics
> >> on previous executions and current requirements which includes job size
> >> (number of nodes/cores), memory requirements, wall time estimates and so
> >> forth.
> >>>>>>
> >>>>>> * As Airavata is mapping multiple individual user jobs into one or
> >> more community account submissions, it also becomes critical to
> implement
> >> fair-share scheduling among these users to ensure fair use of
> allocations
> >> as well as allowable queue limits.
> >>>>>>
> >>>>>> Other use cases?
> >>>>>>
> >>>>>> We will greatly appreciate if folks on this list can shed light on
> >> experiences using schedulers implemented in hadoop, mesos, storm or
> other
> >> frameworks outside of their intended use. For instance, hadoop (yarn)
> >> capacity [2] and fair schedulers [3][4][5] seem to meet the needs of
> >> airavata. Is it a good idea to attempt to reuse these implementations?
> Any
> >> other pluggable third-party alternatives.
> >>>>>>
> >>>>>> Thanks in advance for your time and insights,
> >>>>>>
> >>>>>> Suresh
> >>>>>>
> >>>>>> [1] -
> >>
> https://www.tacc.utexas.edu/user-services/user-guides/stampede-user-guide#running
> >>>>>> [2] -
> >>
> http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
> >>>>>> [3] -
> >>
> http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html
> >>>>>> [4] - https://issues.apache.org/jira/browse/HADOOP-3746
> >>>>>> [5] - https://issues.apache.org/jira/browse/YARN-326
> >>>>>>
> >>>>>>
> >>>>
> >>>
> >>
> >>
>
>

Re: Scheduling stratergies for Airavata

Posted by Suresh Marru <sm...@apache.org>.
Hi Eran, Jijoe

Can you share the missing reference you indicate below? 

Of course, by all means it is good for Airavata to build over projects like Mesos; that is my motivation for this discussion. I am not yet suggesting implementing a scheduler, which would be a distraction. The meta-scheduler I illustrated is mere routing to be injected into Airavata job management with a simple FIFO. We look forward to hearing options from you all on what the right third-party software is. Manu Singh, a first-year graduate student at IU, has volunteered to do an academic study of these solutions, so we will appreciate pointers. 
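
To make that concrete, here is a minimal sketch of what such a routing hook could look like; this is hypothetical Java with illustrative names only, not existing Airavata interfaces:

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Consumer;

// Hypothetical job description; Airavata's real experiment/job model is richer.
class JobRequest {
    final String gatewayId;
    final String resourceId;   // may be empty if the gateway wants Airavata to choose

    JobRequest(String gatewayId, String resourceId) {
        this.gatewayId = gatewayId;
        this.resourceId = resourceId;
    }
}

// Pluggable routing point between experiment launch and actual job submission.
interface MetaScheduler {
    void accept(JobRequest request);                 // called when a gateway launches a job
    void drainTo(Consumer<JobRequest> submitter);    // releases jobs to the submission path
}

// Simplest policy: first in, first out, with no throttling and no resource choice yet.
class FifoMetaScheduler implements MetaScheduler {
    private final Queue<JobRequest> pending = new ConcurrentLinkedQueue<>();

    public void accept(JobRequest request) {
        pending.add(request);
    }

    public void drainTo(Consumer<JobRequest> submitter) {
        JobRequest next;
        while ((next = pending.poll()) != null) {
            submitter.accept(next);
        }
    }
}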

Suresh

On Sep 3, 2014, at 11:59 AM, Eran Chinthaka Withana <er...@gmail.com> wrote:

> Hi,
> 
> Before you go ahead and implement on your own, consider reading this mail
> thread[1] and looking at how frameworks like Apache Aurora does it on top
> of Apache Mesos. These may provide good inputs for this implementation.
> 
> (thanks to Jijoe also who provided input for this)
> 
> 
> 
> Thanks,
> Eran Chinthaka Withana
> 
> 
> On Wed, Sep 3, 2014 at 5:50 AM, Suresh Marru <sm...@apache.org> wrote:
> 
>> Thank you all for comments and suggestions. I summarized the discussion as
>> a implementation plan on a wiki page:
>> 
>> https://cwiki.apache.org/confluence/display/AIRAVATA/Airavata+Metascheduler
>> 
>> If this is amenable, we can take this to dev list to plan the development
>> in two phases. First implement the Throttle-Job in and short term and then
>> plan the Auto-Scheduling capabilities.
>> 
>> Suresh
>> 
>> On Sep 2, 2014, at 1:50 PM, Gary E. Gorbet <ge...@gmail.com> wrote:
>> 
>>> It seems to me that among many possible functions a metascheduler (MS)
>> would provide, there are two separate ones that must be addressed first.
>> The two use cases implied are as follows.
>>> 
>>> (1) The gateway submits a group of jobs to a specified resource where
>> the count of jobs exceeds the resource’s queued job limit. Let’s say 300
>> very quick jobs are submitted, where the limit is 50 per community user.
>> The MS must maintain an internal queue and release jobs to the resource in
>> groups with job counts under the limit (say, 40 at a time).
>>> 
>>> (2) The gateway submits a job or set of jobs with a flag that specifies
>> that Airavata choose the resource. Here, MCP or some other mechanism
>> arrives eventually at the specific resource that completes the job(s).
>>> 
>>> Where both uses are needed - unspecified resource and a group of jobs
>> with count exceeding limits - the MS action would be best defined by
>> knowing the definitions and mechanisms employed in the two separate
>> functions. For example, if MCP is employed, the initial brute force test
>> submissions might need to be done using the determined number of jobs at a
>> time (e.g., 40). But the design here must adhere to design criteria arrived
>> at for both function (1) and function (2).
>>> 
>>> In UltraScan’s case, the most immediate need is for (1). The user could
>> manually determine the best resource or just make a reasonable guess. What
>> the user does not want to do is manually release jobs 40 at a time. The
>> gateway interface allows specification of a group of 300 jobs and the user
>> does not care what is going on under the covers to effect the running of
>> all of them eventually. So, I guess I am lobbying for addressing (1) first;
>> both to meet UltraScan’s immediate need and to elucidate the design of more
>> sophisticated functionality.
>>> 
>>> - Gary
>>> 
>>> On Sep 2, 2014, at 12:02 PM, Suresh Marru <sm...@apache.org> wrote:
>>> 
>>>> Hi Kenneth,
>>>> 
>>>> On Sep 2, 2014, at 12:44 PM, K Yoshimoto <ke...@sdsc.edu> wrote:
>>>> 
>>>>> 
>>>>> The tricky thing is the need to maintain an internal queue of
>>>>> jobs when the Stampede queued jobs limit is reached.  If airavata
>>>>> has an internal representation for jobs to be submitted, I think you
>>>>> are most of the way there.
>>>> 
>>>> Airavata has an internal representation of jobs, but there is no good
>> global view of all the jobs running on a given resource for a given
>> community account. We are trying to fix this, once this is done, as you
>> say, the FIFO implementation should be straight forward.
>>>> 
>>>>> It is tricky to do resource-matching scheduling when the job mix
>>>>> is not known.  For example, the scheduler does not know whether
>>>>> to preserve memory vs cores when deciding where to place a job.
>>>>> Also, the interactions of the gateway scheduler and the local
>>>>> schedulers may be complicated to predict.
>>>>> 
>>>>> Fair share is probably not a good idea.  In practice, it tends
>>>>> to disrupt the other scheduling policies such that one group is
>>>>> penalized and the others don't run much earlier.
>>>> 
>>>> Interesting. What do you think of the capacity based scheduling
>> algorithm (linked below)?
>>>> 
>>>>> 
>>>>> One option is to maintain the gateway job queue internally,
>>>>> then use the MCP brute force approach: submit to all resources,
>>>>> then cancel after the first job start.  You may also want to
>>>>> allow the gateway to set per-resource policy limits on
>>>>> number of jobs, job duration, job core size, SUs, etc.
>>>> 
>>>> MCP is something we should try. The limits per gateway per resource
>> exists, but we need to exercise these capabilities.
>>>> 
>>>> Suresh
>>>> 
>>>>> 
>>>>> On Tue, Sep 02, 2014 at 07:50:12AM -0400, Suresh Marru wrote:
>>>>>> Hi All,
>>>>>> 
>>>>>> Need some guidance on identifying a scheduling strategy and a
>> pluggable third party implementation for airavata scheduling needs. For
>> context let me describe the use cases for scheduling within airavata:
>>>>>> 
>>>>>> * If we gateway/user is submitting a series of jobs, airavata is
>> currently not throttling them and sending them to compute clusters (in a
>> FIFO way). Resources enforce per user job limit within a queue and ensure
>> fair use of the clusters ((example: stampede allows 50 jobs per user in the
>> normal queue [1]). Airavata will need to implement queues and throttle jobs
>> respecting the max-job-per-queue limits of a underlying resource queue.
>>>>>> 
>>>>>> * Current version of Airavata is also not performing job scheduling
>> across available computational resources and expecting gateways/users to
>> pick resources during experiment launch. Airavata will need to implement
>> schedulers which become aware of existing loads on the clusters and spread
>> jobs efficiently. The scheduler should be able to get access to heuristics
>> on previous executions and current requirements which includes job size
>> (number of nodes/cores), memory requirements, wall time estimates and so
>> forth.
>>>>>> 
>>>>>> * As Airavata is mapping multiple individual user jobs into one or
>> more community account submissions, it also becomes critical to implement
>> fair-share scheduling among these users to ensure fair use of allocations
>> as well as allowable queue limits.
>>>>>> 
>>>>>> Other use cases?
>>>>>> 
>>>>>> We will greatly appreciate if folks on this list can shed light on
>> experiences using schedulers implemented in hadoop, mesos, storm or other
>> frameworks outside of their intended use. For instance, hadoop (yarn)
>> capacity [2] and fair schedulers [3][4][5] seem to meet the needs of
>> airavata. Is it a good idea to attempt to reuse these implementations? Any
>> other pluggable third-party alternatives.
>>>>>> 
>>>>>> Thanks in advance for your time and insights,
>>>>>> 
>>>>>> Suresh
>>>>>> 
>>>>>> [1] -
>> https://www.tacc.utexas.edu/user-services/user-guides/stampede-user-guide#running
>>>>>> [2] -
>> http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
>>>>>> [3] -
>> http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html
>>>>>> [4] - https://issues.apache.org/jira/browse/HADOOP-3746
>>>>>> [5] - https://issues.apache.org/jira/browse/YARN-326
>>>>>> 
>>>>>> 
>>>> 
>>> 
>> 
>> 


Re: Scheduling stratergies for Airavata

Posted by Eran Chinthaka Withana <er...@gmail.com>.
Hi,

Before you go ahead and implement your own, consider reading this mail
thread [1] and looking at how frameworks like Apache Aurora do it on top
of Apache Mesos. These may provide good input for this implementation.

(Thanks also to Jijoe, who provided input for this.)



Thanks,
Eran Chinthaka Withana


On Wed, Sep 3, 2014 at 5:50 AM, Suresh Marru <sm...@apache.org> wrote:

> Thank you all for comments and suggestions. I summarized the discussion as
> a implementation plan on a wiki page:
>
> https://cwiki.apache.org/confluence/display/AIRAVATA/Airavata+Metascheduler
>
> If this is amenable, we can take this to dev list to plan the development
> in two phases. First implement the Throttle-Job in and short term and then
> plan the Auto-Scheduling capabilities.
>
> Suresh
>
> On Sep 2, 2014, at 1:50 PM, Gary E. Gorbet <ge...@gmail.com> wrote:
>
> > It seems to me that among many possible functions a metascheduler (MS)
> would provide, there are two separate ones that must be addressed first.
> The two use cases implied are as follows.
> >
> > (1) The gateway submits a group of jobs to a specified resource where
> the count of jobs exceeds the resource’s queued job limit. Let’s say 300
> very quick jobs are submitted, where the limit is 50 per community user.
> The MS must maintain an internal queue and release jobs to the resource in
> groups with job counts under the limit (say, 40 at a time).
> >
> > (2) The gateway submits a job or set of jobs with a flag that specifies
> that Airavata choose the resource. Here, MCP or some other mechanism
> arrives eventually at the specific resource that completes the job(s).
> >
> > Where both uses are needed - unspecified resource and a group of jobs
> with count exceeding limits - the MS action would be best defined by
> knowing the definitions and mechanisms employed in the two separate
> functions. For example, if MCP is employed, the initial brute force test
> submissions might need to be done using the determined number of jobs at a
> time (e.g., 40). But the design here must adhere to design criteria arrived
> at for both function (1) and function (2).
> >
> > In UltraScan’s case, the most immediate need is for (1). The user could
> manually determine the best resource or just make a reasonable guess. What
> the user does not want to do is manually release jobs 40 at a time. The
> gateway interface allows specification of a group of 300 jobs and the user
> does not care what is going on under the covers to effect the running of
> all of them eventually. So, I guess I am lobbying for addressing (1) first;
> both to meet UltraScan’s immediate need and to elucidate the design of more
> sophisticated functionality.
> >
> > - Gary
> >
> > On Sep 2, 2014, at 12:02 PM, Suresh Marru <sm...@apache.org> wrote:
> >
> >> Hi Kenneth,
> >>
> >> On Sep 2, 2014, at 12:44 PM, K Yoshimoto <ke...@sdsc.edu> wrote:
> >>
> >>>
> >>> The tricky thing is the need to maintain an internal queue of
> >>> jobs when the Stampede queued jobs limit is reached.  If airavata
> >>> has an internal representation for jobs to be submitted, I think you
> >>> are most of the way there.
> >>
> >> Airavata has an internal representation of jobs, but there is no good
> global view of all the jobs running on a given resource for a given
> community account. We are trying to fix this, once this is done, as you
> say, the FIFO implementation should be straight forward.
> >>
> >>> It is tricky to do resource-matching scheduling when the job mix
> >>> is not known.  For example, the scheduler does not know whether
> >>> to preserve memory vs cores when deciding where to place a job.
> >>> Also, the interactions of the gateway scheduler and the local
> >>> schedulers may be complicated to predict.
> >>>
> >>> Fair share is probably not a good idea.  In practice, it tends
> >>> to disrupt the other scheduling policies such that one group is
> >>> penalized and the others don't run much earlier.
> >>
> >> Interesting. What do you think of the capacity based scheduling
> algorithm (linked below)?
> >>
> >>>
> >>> One option is to maintain the gateway job queue internally,
> >>> then use the MCP brute force approach: submit to all resources,
> >>> then cancel after the first job start.  You may also want to
> >>> allow the gateway to set per-resource policy limits on
> >>> number of jobs, job duration, job core size, SUs, etc.
> >>
> >> MCP is something we should try. The limits per gateway per resource
> exists, but we need to exercise these capabilities.
> >>
> >> Suresh
> >>
> >>>
> >>> On Tue, Sep 02, 2014 at 07:50:12AM -0400, Suresh Marru wrote:
> >>>> Hi All,
> >>>>
> >>>> Need some guidance on identifying a scheduling strategy and a
> pluggable third party implementation for airavata scheduling needs. For
> context let me describe the use cases for scheduling within airavata:
> >>>>
> >>>> * If we gateway/user is submitting a series of jobs, airavata is
> currently not throttling them and sending them to compute clusters (in a
> FIFO way). Resources enforce per user job limit within a queue and ensure
> fair use of the clusters ((example: stampede allows 50 jobs per user in the
> normal queue [1]). Airavata will need to implement queues and throttle jobs
> respecting the max-job-per-queue limits of a underlying resource queue.
> >>>>
> >>>> * Current version of Airavata is also not performing job scheduling
> across available computational resources and expecting gateways/users to
> pick resources during experiment launch. Airavata will need to implement
> schedulers which become aware of existing loads on the clusters and spread
> jobs efficiently. The scheduler should be able to get access to heuristics
> on previous executions and current requirements which includes job size
> (number of nodes/cores), memory requirements, wall time estimates and so
> forth.
> >>>>
> >>>> * As Airavata is mapping multiple individual user jobs into one or
> more community account submissions, it also becomes critical to implement
> fair-share scheduling among these users to ensure fair use of allocations
> as well as allowable queue limits.
> >>>>
> >>>> Other use cases?
> >>>>
> >>>> We will greatly appreciate if folks on this list can shed light on
> experiences using schedulers implemented in hadoop, mesos, storm or other
> frameworks outside of their intended use. For instance, hadoop (yarn)
> capacity [2] and fair schedulers [3][4][5] seem to meet the needs of
> airavata. Is it a good idea to attempt to reuse these implementations? Any
> other pluggable third-party alternatives.
> >>>>
> >>>> Thanks in advance for your time and insights,
> >>>>
> >>>> Suresh
> >>>>
> >>>> [1] -
> https://www.tacc.utexas.edu/user-services/user-guides/stampede-user-guide#running
> >>>> [2] -
> http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
> >>>> [3] -
> http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html
> >>>> [4] - https://issues.apache.org/jira/browse/HADOOP-3746
> >>>> [5] - https://issues.apache.org/jira/browse/YARN-326
> >>>>
> >>>>
> >>
> >
>
>

Re: Scheduling stratergies for Airavata

Posted by Suresh Marru <sm...@apache.org>.
Thank you all for the comments and suggestions. I have summarized the discussion as an implementation plan on a wiki page:

https://cwiki.apache.org/confluence/display/AIRAVATA/Airavata+Metascheduler

If this is amenable, we can take this to the dev list and plan the development in two phases: first implement the Throttle-Job capability in the short term, and then plan the Auto-Scheduling capabilities. 
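
As a rough illustration of how the two phases could stay separable in code, the throttling and auto-scheduling concerns could sit behind distinct interfaces so that phase one can ship on its own; the sketch below is hypothetical Java with made-up names, not content from the wiki page:

import java.util.List;
import java.util.Optional;

// Phase 1: may one more job be released to a given resource queue right now?
interface JobThrottler {
    boolean mayRelease(String resourceId, String queueName,
                       int jobsCurrentlyQueued, int maxJobsPerQueue);
}

// Phase 2: pick a resource when the gateway has not specified one.
interface ResourceSelector {
    Optional<String> select(List<String> candidateResources);  // load heuristics would plug in here
}

// Phase-1-only deployment: a plain limit check; a ResourceSelector can be added later.
class SimpleThrottler implements JobThrottler {
    public boolean mayRelease(String resourceId, String queueName,
                              int jobsCurrentlyQueued, int maxJobsPerQueue) {
        return jobsCurrentlyQueued < maxJobsPerQueue;
    }
}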

Suresh

On Sep 2, 2014, at 1:50 PM, Gary E. Gorbet <ge...@gmail.com> wrote:

> It seems to me that among many possible functions a metascheduler (MS) would provide, there are two separate ones that must be addressed first. The two use cases implied are as follows.
> 
> (1) The gateway submits a group of jobs to a specified resource where the count of jobs exceeds the resource’s queued job limit. Let’s say 300 very quick jobs are submitted, where the limit is 50 per community user. The MS must maintain an internal queue and release jobs to the resource in groups with job counts under the limit (say, 40 at a time).
> 
> (2) The gateway submits a job or set of jobs with a flag that specifies that Airavata choose the resource. Here, MCP or some other mechanism arrives eventually at the specific resource that completes the job(s).
> 
> Where both uses are needed - unspecified resource and a group of jobs with count exceeding limits - the MS action would be best defined by knowing the definitions and mechanisms employed in the two separate functions. For example, if MCP is employed, the initial brute force test submissions might need to be done using the determined number of jobs at a time (e.g., 40). But the design here must adhere to design criteria arrived at for both function (1) and function (2).
> 
> In UltraScan’s case, the most immediate need is for (1). The user could manually determine the best resource or just make a reasonable guess. What the user does not want to do is manually release jobs 40 at a time. The gateway interface allows specification of a group of 300 jobs and the user does not care what is going on under the covers to effect the running of all of them eventually. So, I guess I am lobbying for addressing (1) first; both to meet UltraScan’s immediate need and to elucidate the design of more sophisticated functionality.
> 
> - Gary
> 
> On Sep 2, 2014, at 12:02 PM, Suresh Marru <sm...@apache.org> wrote:
> 
>> Hi Kenneth,
>> 
>> On Sep 2, 2014, at 12:44 PM, K Yoshimoto <ke...@sdsc.edu> wrote:
>> 
>>> 
>>> The tricky thing is the need to maintain an internal queue of
>>> jobs when the Stampede queued jobs limit is reached.  If airavata
>>> has an internal representation for jobs to be submitted, I think you
>>> are most of the way there.
>> 
>> Airavata has an internal representation of jobs, but there is no good global view of all the jobs running on a given resource for a given community account. We are trying to fix this, once this is done, as you say, the FIFO implementation should be straight forward. 
>> 
>>> It is tricky to do resource-matching scheduling when the job mix
>>> is not known.  For example, the scheduler does not know whether
>>> to preserve memory vs cores when deciding where to place a job.
>>> Also, the interactions of the gateway scheduler and the local
>>> schedulers may be complicated to predict.
>>> 
>>> Fair share is probably not a good idea.  In practice, it tends
>>> to disrupt the other scheduling policies such that one group is
>>> penalized and the others don't run much earlier.
>> 
>> Interesting. What do you think of the capacity based scheduling algorithm (linked below)?
>> 
>>> 
>>> One option is to maintain the gateway job queue internally,
>>> then use the MCP brute force approach: submit to all resources,
>>> then cancel after the first job start.  You may also want to
>>> allow the gateway to set per-resource policy limits on
>>> number of jobs, job duration, job core size, SUs, etc.
>> 
>> MCP is something we should try. The limits per gateway per resource exists, but we need to exercise these capabilities. 
>> 
>> Suresh
>> 
>>> 
>>> On Tue, Sep 02, 2014 at 07:50:12AM -0400, Suresh Marru wrote:
>>>> Hi All,
>>>> 
>>>> Need some guidance on identifying a scheduling strategy and a pluggable third party implementation for airavata scheduling needs. For context let me describe the use cases for scheduling within airavata:
>>>> 
>>>> * If we gateway/user is submitting a series of jobs, airavata is currently not throttling them and sending them to compute clusters (in a FIFO way). Resources enforce per user job limit within a queue and ensure fair use of the clusters ((example: stampede allows 50 jobs per user in the normal queue [1]). Airavata will need to implement queues and throttle jobs respecting the max-job-per-queue limits of a underlying resource queue. 
>>>> 
>>>> * Current version of Airavata is also not performing job scheduling across available computational resources and expecting gateways/users to pick resources during experiment launch. Airavata will need to implement schedulers which become aware of existing loads on the clusters and spread jobs efficiently. The scheduler should be able to get access to heuristics on previous executions and current requirements which includes job size (number of nodes/cores), memory requirements, wall time estimates and so forth. 
>>>> 
>>>> * As Airavata is mapping multiple individual user jobs into one or more community account submissions, it also becomes critical to implement fair-share scheduling among these users to ensure fair use of allocations as well as allowable queue limits.
>>>> 
>>>> Other use cases? 
>>>> 
>>>> We will greatly appreciate if folks on this list can shed light on experiences using schedulers implemented in hadoop, mesos, storm or other frameworks outside of their intended use. For instance, hadoop (yarn) capacity [2] and fair schedulers [3][4][5] seem to meet the needs of airavata. Is it a good idea to attempt to reuse these implementations? Any other pluggable third-party alternatives. 
>>>> 
>>>> Thanks in advance for your time and insights,
>>>> 
>>>> Suresh
>>>> 
>>>> [1] - https://www.tacc.utexas.edu/user-services/user-guides/stampede-user-guide#running
>>>> [2] - http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
>>>> [3] - http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html
>>>> [4] - https://issues.apache.org/jira/browse/HADOOP-3746
>>>> [5] - https://issues.apache.org/jira/browse/YARN-326
>>>> 
>>>> 
>> 
> 


Re: Scheduling stratergies for Airavata

Posted by "Gary E. Gorbet" <ge...@gmail.com>.
It seems to me that among many possible functions a metascheduler (MS) would provide, there are two separate ones that must be addressed first. The two use cases implied are as follows.

(1) The gateway submits a group of jobs to a specified resource where the count of jobs exceeds the resource’s queued job limit. Let’s say 300 very quick jobs are submitted, where the limit is 50 per community user. The MS must maintain an internal queue and release jobs to the resource in groups with job counts under the limit (say, 40 at a time).

(2) The gateway submits a job or set of jobs with a flag that specifies that Airavata choose the resource. Here, MCP or some other mechanism arrives eventually at the specific resource that completes the job(s).

Where both uses are needed - unspecified resource and a group of jobs with count exceeding limits - the MS action would be best defined by knowing the definitions and mechanisms employed in the two separate functions. For example, if MCP is employed, the initial brute force test submissions might need to be done using the determined number of jobs at a time (e.g., 40). But the design here must adhere to design criteria arrived at for both function (1) and function (2).

In UltraScan’s case, the most immediate need is for (1). The user could manually determine the best resource or just make a reasonable guess. What the user does not want to do is manually release jobs 40 at a time. The gateway interface allows specification of a group of 300 jobs and the user does not care what is going on under the covers to effect the running of all of them eventually. So, I guess I am lobbying for addressing (1) first; both to meet UltraScan’s immediate need and to elucidate the design of more sophisticated functionality.
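
A minimal sketch of the batched release described in (1), assuming a hypothetical job identifier model and a callback standing in for the real submission path; none of these names come from UltraScan or Airavata:

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;

// Holds the full group (e.g. 300 jobs) and releases at most `batchSize`
// (e.g. 40) whenever the count queued on the resource leaves enough headroom.
class BatchedReleaseQueue {
    private final Deque<String> waitingJobIds = new ArrayDeque<>();
    private final int resourceJobLimit;   // e.g. 50 per community user
    private final int batchSize;          // e.g. 40

    BatchedReleaseQueue(int resourceJobLimit, int batchSize) {
        this.resourceJobLimit = resourceJobLimit;
        this.batchSize = batchSize;
    }

    void enqueueAll(Iterable<String> jobIds) {
        jobIds.forEach(waitingJobIds::add);
    }

    // Called periodically, or on every job-completion event from the resource.
    void release(int jobsCurrentlyQueuedOnResource, Consumer<String> submitToResource) {
        int headroom = Math.min(batchSize, resourceJobLimit - jobsCurrentlyQueuedOnResource);
        while (headroom-- > 0 && !waitingJobIds.isEmpty()) {
            submitToResource.accept(waitingJobIds.poll());
        }
    }
}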

- Gary

On Sep 2, 2014, at 12:02 PM, Suresh Marru <sm...@apache.org> wrote:

> Hi Kenneth,
> 
> On Sep 2, 2014, at 12:44 PM, K Yoshimoto <ke...@sdsc.edu> wrote:
> 
>> 
>> The tricky thing is the need to maintain an internal queue of
>> jobs when the Stampede queued jobs limit is reached.  If airavata
>> has an internal representation for jobs to be submitted, I think you
>> are most of the way there.
> 
> Airavata has an internal representation of jobs, but there is no good global view of all the jobs running on a given resource for a given community account. We are trying to fix this, once this is done, as you say, the FIFO implementation should be straight forward. 
> 
>> It is tricky to do resource-matching scheduling when the job mix
>> is not known.  For example, the scheduler does not know whether
>> to preserve memory vs cores when deciding where to place a job.
>> Also, the interactions of the gateway scheduler and the local
>> schedulers may be complicated to predict.
>> 
>> Fair share is probably not a good idea.  In practice, it tends
>> to disrupt the other scheduling policies such that one group is
>> penalized and the others don't run much earlier.
> 
> Interesting. What do you think of the capacity based scheduling algorithm (linked below)?
> 
>> 
>> One option is to maintain the gateway job queue internally,
>> then use the MCP brute force approach: submit to all resources,
>> then cancel after the first job start.  You may also want to
>> allow the gateway to set per-resource policy limits on
>> number of jobs, job duration, job core size, SUs, etc.
> 
> MCP is something we should try. The limits per gateway per resource exists, but we need to exercise these capabilities. 
> 
> Suresh
> 
>> 
>> On Tue, Sep 02, 2014 at 07:50:12AM -0400, Suresh Marru wrote:
>>> Hi All,
>>> 
>>> Need some guidance on identifying a scheduling strategy and a pluggable third party implementation for airavata scheduling needs. For context let me describe the use cases for scheduling within airavata:
>>> 
>>> * If we gateway/user is submitting a series of jobs, airavata is currently not throttling them and sending them to compute clusters (in a FIFO way). Resources enforce per user job limit within a queue and ensure fair use of the clusters ((example: stampede allows 50 jobs per user in the normal queue [1]). Airavata will need to implement queues and throttle jobs respecting the max-job-per-queue limits of a underlying resource queue. 
>>> 
>>> * Current version of Airavata is also not performing job scheduling across available computational resources and expecting gateways/users to pick resources during experiment launch. Airavata will need to implement schedulers which become aware of existing loads on the clusters and spread jobs efficiently. The scheduler should be able to get access to heuristics on previous executions and current requirements which includes job size (number of nodes/cores), memory requirements, wall time estimates and so forth. 
>>> 
>>> * As Airavata is mapping multiple individual user jobs into one or more community account submissions, it also becomes critical to implement fair-share scheduling among these users to ensure fair use of allocations as well as allowable queue limits.
>>> 
>>> Other use cases? 
>>> 
>>> We will greatly appreciate if folks on this list can shed light on experiences using schedulers implemented in hadoop, mesos, storm or other frameworks outside of their intended use. For instance, hadoop (yarn) capacity [2] and fair schedulers [3][4][5] seem to meet the needs of airavata. Is it a good idea to attempt to reuse these implementations? Any other pluggable third-party alternatives. 
>>> 
>>> Thanks in advance for your time and insights,
>>> 
>>> Suresh
>>> 
>>> [1] - https://www.tacc.utexas.edu/user-services/user-guides/stampede-user-guide#running
>>> [2] - http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
>>> [3] - http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html
>>> [4] - https://issues.apache.org/jira/browse/HADOOP-3746
>>> [5] - https://issues.apache.org/jira/browse/YARN-326
>>> 
>>> 
> 


Re: Scheduling stratergies for Airavata

Posted by Suresh Marru <sm...@apache.org>.
Hi Kenneth,

On Sep 2, 2014, at 12:44 PM, K Yoshimoto <ke...@sdsc.edu> wrote:

> 
> The tricky thing is the need to maintain an internal queue of
> jobs when the Stampede queued jobs limit is reached.  If airavata
> has an internal representation for jobs to be submitted, I think you
> are most of the way there.

Airavata has an internal representation of jobs, but there is no good global view of all the jobs running on a given resource for a given community account. We are trying to fix this; once it is done, as you say, the FIFO implementation should be straightforward. 
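
That global view could be as simple as a shared counter keyed by resource and community account, updated from job status events. A rough sketch under that assumption, with purely illustrative names:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Tracks how many jobs are currently queued or running per (resource, community account).
class ActiveJobRegistry {
    private final Map<String, AtomicInteger> counts = new ConcurrentHashMap<>();

    private String key(String resourceId, String communityAccount) {
        return resourceId + "|" + communityAccount;
    }

    void jobSubmitted(String resourceId, String communityAccount) {
        counts.computeIfAbsent(key(resourceId, communityAccount),
                               k -> new AtomicInteger()).incrementAndGet();
    }

    void jobFinished(String resourceId, String communityAccount) {
        AtomicInteger c = counts.get(key(resourceId, communityAccount));
        if (c != null) {
            c.decrementAndGet();
        }
    }

    int activeJobs(String resourceId, String communityAccount) {
        AtomicInteger c = counts.get(key(resourceId, communityAccount));
        return c == null ? 0 : c.get();
    }
}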

> It is tricky to do resource-matching scheduling when the job mix
> is not known.  For example, the scheduler does not know whether
> to preserve memory vs cores when deciding where to place a job.
> Also, the interactions of the gateway scheduler and the local
> schedulers may be complicated to predict.
> 
> Fair share is probably not a good idea.  In practice, it tends
> to disrupt the other scheduling policies such that one group is
> penalized and the others don't run much earlier.

Interesting. What do you think of the capacity-based scheduling algorithm (linked below)?
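
For concreteness, the core idea behind capacity scheduling is proportional entitlement: each gateway (or queue) gets a configured fraction of the community account's job slots, and the scheduler favors whoever is furthest below its share. A toy sketch of that idea only, not of the YARN CapacityScheduler itself:

import java.util.Map;
import java.util.Optional;

// Toy capacity policy: pick the gateway with the most unused entitlement.
class CapacityPicker {
    // Configured share per gateway, e.g. {"gatewayA": 0.5, "gatewayB": 0.3, "gatewayC": 0.2}.
    private final Map<String, Double> configuredShare;
    private final int totalJobSlots;  // e.g. the 50-job queue limit on the resource

    CapacityPicker(Map<String, Double> configuredShare, int totalJobSlots) {
        this.configuredShare = configuredShare;
        this.totalJobSlots = totalJobSlots;
    }

    // Given current usage, return the gateway that should release the next job, if any.
    Optional<String> nextGateway(Map<String, Integer> jobsInUsePerGateway) {
        String best = null;
        double bestHeadroom = 0.0;
        for (Map.Entry<String, Double> entry : configuredShare.entrySet()) {
            double entitlement = entry.getValue() * totalJobSlots;
            double headroom = entitlement - jobsInUsePerGateway.getOrDefault(entry.getKey(), 0);
            if (headroom > bestHeadroom) {
                best = entry.getKey();
                bestHeadroom = headroom;
            }
        }
        return Optional.ofNullable(best);
    }
}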

> 
> One option is to maintain the gateway job queue internally,
> then use the MCP brute force approach: submit to all resources,
> then cancel after the first job start.  You may also want to
> allow the gateway to set per-resource policy limits on
> number of jobs, job duration, job core size, SUs, etc.

MCP is something we should try. The limits per gateway per resource exist, but we need to exercise these capabilities. 
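
A sketch of that brute-force placement, with placeholder callbacks standing in for the real submission and cancellation calls; the class and method names are illustrative, not Airavata code:

import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.BiConsumer;

// Submit the same job everywhere, keep the copy that starts first, cancel the rest.
class BruteForcePlacer {
    private final Map<String, List<String>> liveSubmissions = new ConcurrentHashMap<>();

    void submitEverywhere(String jobId, List<String> resources,
                          BiConsumer<String, String> submit) {
        liveSubmissions.put(jobId, new CopyOnWriteArrayList<>(resources));
        for (String resource : resources) {
            submit.accept(jobId, resource);
        }
    }

    // Invoked from job status monitoring when one copy reports that it has started.
    void onJobStarted(String jobId, String winningResource,
                      BiConsumer<String, String> cancel) {
        List<String> submitted = liveSubmissions.remove(jobId);
        if (submitted == null) {
            return;  // another status event already handled this job
        }
        for (String resource : submitted) {
            if (!resource.equals(winningResource)) {
                cancel.accept(jobId, resource);
            }
        }
    }
}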

Suresh

> 
> On Tue, Sep 02, 2014 at 07:50:12AM -0400, Suresh Marru wrote:
>> Hi All,
>> 
>> Need some guidance on identifying a scheduling strategy and a pluggable third party implementation for airavata scheduling needs. For context let me describe the use cases for scheduling within airavata:
>> 
>> * If we gateway/user is submitting a series of jobs, airavata is currently not throttling them and sending them to compute clusters (in a FIFO way). Resources enforce per user job limit within a queue and ensure fair use of the clusters ((example: stampede allows 50 jobs per user in the normal queue [1]). Airavata will need to implement queues and throttle jobs respecting the max-job-per-queue limits of a underlying resource queue. 
>> 
>> * Current version of Airavata is also not performing job scheduling across available computational resources and expecting gateways/users to pick resources during experiment launch. Airavata will need to implement schedulers which become aware of existing loads on the clusters and spread jobs efficiently. The scheduler should be able to get access to heuristics on previous executions and current requirements which includes job size (number of nodes/cores), memory requirements, wall time estimates and so forth. 
>> 
>> * As Airavata is mapping multiple individual user jobs into one or more community account submissions, it also becomes critical to implement fair-share scheduling among these users to ensure fair use of allocations as well as allowable queue limits.
>> 
>> Other use cases? 
>> 
>> We will greatly appreciate if folks on this list can shed light on experiences using schedulers implemented in hadoop, mesos, storm or other frameworks outside of their intended use. For instance, hadoop (yarn) capacity [2] and fair schedulers [3][4][5] seem to meet the needs of airavata. Is it a good idea to attempt to reuse these implementations? Any other pluggable third-party alternatives. 
>> 
>> Thanks in advance for your time and insights,
>> 
>> Suresh
>> 
>> [1] - https://www.tacc.utexas.edu/user-services/user-guides/stampede-user-guide#running
>> [2] - http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
>> [3] - http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html
>> [4] - https://issues.apache.org/jira/browse/HADOOP-3746
>> [5] - https://issues.apache.org/jira/browse/YARN-326
>> 
>> 


Re: Scheduling stratergies for Airavata

Posted by K Yoshimoto <ke...@sdsc.edu>.
The tricky thing is the need to maintain an internal queue of
jobs when the Stampede queued jobs limit is reached.  If airavata
has an internal representation for jobs to be submitted, I think you
are most of the way there.

It is tricky to do resource-matching scheduling when the job mix
is not known.  For example, the scheduler does not know whether
to preserve memory vs cores when deciding where to place a job.
Also, the interactions of the gateway scheduler and the local
schedulers may be complicated to predict.

Fair share is probably not a good idea.  In practice, it tends
to disrupt the other scheduling policies such that one group is
penalized and the others don't run much earlier.

One option is to maintain the gateway job queue internally,
then use the MCP brute-force approach: submit to all resources,
then cancel the remaining submissions after the first job starts.
You may also want to allow the gateway to set per-resource policy
limits on the number of jobs, job duration, job core size, SUs, etc.
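
Those per-resource policy limits could be captured in a small value object that the gateway configures and the internal queue consults before releasing a job. A sketch with made-up field names:

// Gateway-configured ceiling for one compute resource; all fields are illustrative.
class ResourcePolicy {
    final int maxQueuedJobs;       // e.g. stay under the 50-job queue limit
    final int maxWallTimeMinutes;  // longest job the gateway will send here
    final int maxCoresPerJob;
    final long maxServiceUnits;    // SU budget the gateway is willing to spend

    ResourcePolicy(int maxQueuedJobs, int maxWallTimeMinutes,
                   int maxCoresPerJob, long maxServiceUnits) {
        this.maxQueuedJobs = maxQueuedJobs;
        this.maxWallTimeMinutes = maxWallTimeMinutes;
        this.maxCoresPerJob = maxCoresPerJob;
        this.maxServiceUnits = maxServiceUnits;
    }

    // True if one more job with these requirements may be released to the resource.
    boolean admits(int queuedJobs, int wallTimeMinutes, int cores, long estimatedServiceUnits) {
        return queuedJobs < maxQueuedJobs
                && wallTimeMinutes <= maxWallTimeMinutes
                && cores <= maxCoresPerJob
                && estimatedServiceUnits <= maxServiceUnits;
    }
}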

On Tue, Sep 02, 2014 at 07:50:12AM -0400, Suresh Marru wrote:
> Hi All,
> 
> Need some guidance on identifying a scheduling strategy and a pluggable third party implementation for airavata scheduling needs. For context let me describe the use cases for scheduling within airavata:
> 
> * If we gateway/user is submitting a series of jobs, airavata is currently not throttling them and sending them to compute clusters (in a FIFO way). Resources enforce per user job limit within a queue and ensure fair use of the clusters ((example: stampede allows 50 jobs per user in the normal queue [1]). Airavata will need to implement queues and throttle jobs respecting the max-job-per-queue limits of a underlying resource queue. 
>  
> * Current version of Airavata is also not performing job scheduling across available computational resources and expecting gateways/users to pick resources during experiment launch. Airavata will need to implement schedulers which become aware of existing loads on the clusters and spread jobs efficiently. The scheduler should be able to get access to heuristics on previous executions and current requirements which includes job size (number of nodes/cores), memory requirements, wall time estimates and so forth. 
> 
> * As Airavata is mapping multiple individual user jobs into one or more community account submissions, it also becomes critical to implement fair-share scheduling among these users to ensure fair use of allocations as well as allowable queue limits.
> 
> Other use cases? 
> 
> We will greatly appreciate if folks on this list can shed light on experiences using schedulers implemented in hadoop, mesos, storm or other frameworks outside of their intended use. For instance, hadoop (yarn) capacity [2] and fair schedulers [3][4][5] seem to meet the needs of airavata. Is it a good idea to attempt to reuse these implementations? Any other pluggable third-party alternatives. 
> 
> Thanks in advance for your time and insights,
> 
> Suresh
> 
> [1] - https://www.tacc.utexas.edu/user-services/user-guides/stampede-user-guide#running
> [2] - http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
> [3] - http://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html
> [4] - https://issues.apache.org/jira/browse/HADOOP-3746
> [5] - https://issues.apache.org/jira/browse/YARN-326
> 
>