Posted to common-dev@hadoop.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2006/05/19 12:10:04 UTC

Job scheduling (Re: Unable to run more than one job concurrently)

Andrzej Bialecki wrote:
> Hi all,
>
> I'm running Hadoop on a relatively small cluster (5 nodes) with 
> growing datasets.
>
> I noticed that if I start a job that is configured to run more map 
> tasks than is the cluster capacity (mapred.tasktracker.tasks.maximum * 
> number of nodes, 20 in this case), of course only that many map tasks 
> will run, and when they are finished the next map tasks from that job 
> will be scheduled.
>
> However, when I try to start another job in parallel, only its reduce 
> tasks will be scheduled (uselessly spin-waiting for map output, and 
> only reducing the number of available tasks in the cluster...), and no 
> map tasks from this job will be scheduled - until the first job 
> completes. This feels wrong - not only am I not making progress on the 
> second job, but I'm also taking the slots away from the first job!
>
> I'm somewhat miffed about this - I'd think that jobtracker should 
> split the available resources evenly between these two jobs, i.e. it 
> should schedule some map tasks from the first job and some from the 
> second one. This is not what is happening, though ...
>
> Is this a configuration error, a bug, or a feature? :)
>

It seems it's a feature - I found the code in 
JobTracker.pollForNewTask(), and I'm not too happy about it.

Let's consider the following example: if I'm running a Nutch fetcher, 
the main limitation is the available bandwidth to fetch pages, and not 
the capacity of the cluster. I'd love to be able to execute other jobs 
in parallel, so that I don't have to wait until fetcher completes. I 
could sacrifice some of the task slots on tasktrackers for that other 
job, because the fetcher job wouldn't suffer from this anyway (at least 
not too much).

So, I'd like to change this code to pick a random job from the 
jobsByArrival list, and call job.obtainNewMapTask on that randomly 
selected job. Would that work? Additionally, if no map tasks from that 
job have been allocated yet, I'd like to skip adding reduce tasks from 
that job later, in lines 721-750.

Perhaps we should extend JobInProgress to include a priority, and 
implement something a la Unix scheduler.
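
For what it's worth, here is a tiny standalone sketch of the random-pick 
part (the Job class below is a made-up stand-in for JobInProgress, not 
the real API -- just the shape of the change, not a patch):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class RandomJobPick {
    // Invented stand-in for JobInProgress: just enough state to show the idea.
    static class Job {
        final String name;
        int pendingMaps;                 // map tasks not yet handed out
        Job(String name, int pendingMaps) { this.name = name; this.pendingMaps = pendingMaps; }
    }

    static final Random RAND = new Random();

    // Instead of always serving the earliest job in jobsByArrival, collect
    // every job that still has map work and pick one of them at random.
    static Job pickJobForNextMap(List<Job> jobsByArrival) {
        List<Job> runnable = new ArrayList<>();
        for (Job job : jobsByArrival) {
            if (job.pendingMaps > 0) {
                runnable.add(job);
            }
        }
        if (runnable.isEmpty()) {
            return null;                 // nothing to schedule right now
        }
        return runnable.get(RAND.nextInt(runnable.size()));
    }

    public static void main(String[] args) {
        List<Job> jobsByArrival = new ArrayList<>();
        jobsByArrival.add(new Job("fetcher", 100));   // long, bandwidth-bound
        jobsByArrival.add(new Job("linkdb", 10));     // short, CPU-bound
        for (int slot = 0; slot < 5; slot++) {
            Job picked = pickJobForNextMap(jobsByArrival);
            picked.pendingMaps--;                     // pretend we handed out a map task
            System.out.println("slot " + slot + " -> " + picked.name);
        }
    }
}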

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Job scheduling (Re: Unable to run more than one job concurrently)

Posted by Eric Baldeschwieler <er...@yahoo-inc.com>.
Yes, please keep it simple.

I know folks in research with many ideas for how to add features to 
the programming model, and we still have a lot of work to do to make 
one job run efficiently.  I think it is best not to make those 
directions harder by adding many features that will need to be 
supported in the scheduler.  Let's master running one job well before 
putting a lot of effort into running lots of jobs.

That said, priority queuing seems simple and separable from my  
concerns.  Interleaving jobs gets more complicated and is something  
I'd be happy to see deferred.  Doing so by default with decaying  
priorities and such doesn't seem like the right trade-off to me.   
This is primarily a batch system.

On May 19, 2006, at 2:09 PM, Paul Sutter wrote:

> right now, i would prefer that we have a really really simple  
> scheme that
> can be done quickly.
>
> being able to run a huge days-long job in the background, such that a
> smaller hours-long job basically takes over and runs to completion,  
> that
> would be a big win.
>
> having multiple jobs with time slicing? cool, but only 20% better and
> probably 80% more work.
>
> if we want a thorough scheme, i think eric has the right idea of  
> using one
> of the existing scheduler packages.
>
>
> On 5/19/06, Andrzej Bialecki <ab...@getopt.org> wrote:
>>
>> Doug Cutting wrote:
>> > Paul Sutter wrote:
>> >> (1) Allow submission times in the future, enabling the creation of
>> >> "background" jobs. My understanding is that job submission  
>> times are
>> >> used to
>> >> prioritize scheduling. All tasks from a job submitted early run to
>> >> completion before those of a job submitted later. If we could  
>> submit
>> any
>> >> days-long jobs with a submission time in the future, say the year
>> >> 2010, and
>> >> any short hours-long jobs with the current time, that short job  
>> would
>> be
>> >> able to interrupt the long job. Hack? Yes. Useful? I think so.
>> >
>> > I think this is equivalent to adding a job priority, where tasks  
>> with
>> > the highest priority job are run first.  If jobs are at the same
>> > priority, then the first submitted would run.  Adding priority  
>> would
>> > add a bit more complexity, but would also be less of a hack.
>>
>>
>> Hmm.. If you compare it to a Unix scheduler, processes at the same
>> priority have even chances of being run, regardless of which was  
>> started
>> first - not only that, processes undergo a "priority decay", in  
>> that if
>> they are running longer then their priority is lowered - this enables
>> new processes to start quickly (and maybe quickly finish), and then
>> fairly compete with other processes.
>>
>> In our case, this would mean that jobs with the same priority would
>> execute concurrently, sharing available map/reduce slots, and long
>> running jobs would be gradually de-prioritized. This also means  
>> that the
>> first job will slow down when the second one is started, but the  
>> second
>> job will have a chance to make a good start (and perhaps quickly  
>> finish)
>> and then, subject to the priority decay, run in parallel with  
>> other jobs
>> (albeit slower) instead of being stuck in the wait queue.
>>
>> And if the second job is started with a higher priority, it should
>> preempt the first job (i.e. it should get proportionally more  
>> slots than
>> the first job). If you need all cluster resources for a specific job,
>> and don't want any other jobs to run, just set the priority to the
>> highest value, thus preempting all other jobs (actually, it would
>> suspend other already executing jobs, which would resume when your  
>> job
>> is done - not a bad feature either!).
>>
>> I think this is a relatively simple and well understood mechanism.
>>
>>
>> >> (2) Have a per-job total task count limit. Currently, we  
>> establish the
>> >> number of tasks each node runs, and how many map or reduce  
>> tasks we
>> have
>> >> total in a given job. But it would be great if we could set a  
>> ceiling
>> >> on the
>> >> number of tasks that run concurrently for a given job. This may  
>> help
>> >> with
>> >> Andrzej's fetcher (since it is bandwidth constrained, maybe fewer
>> >> concurrent
>> >> jobs would be fine?).
>> >
>> > I like this idea.  So if the highest-priority job is already  
>> running
>> > at its task limit, then tasks can be run from the next
>> > highest-priority job.  Should there be separate limits for maps and
>> > reduces?
>>
>> I like this idea too. I think a similar setting for the minimum  
>> number
>> of tasks would be needed too? That would solve my problem. In  
>> fact, it
>> would be probably better than the schema I described above,  
>> because it
>> would guarantee certain minimum tasks running at any time.
>>
>> This reminds me of the "idle time" and "real time" policies in BSD
>> scheduler ... man 1 rtprio. The "real time" policy prevents the  
>> priority
>> decay that normally occurs, and "idle time" policy allows  
>> processes to
>> run only if CPU is idle.
>>
>> --
>> Best regards,
>> Andrzej Bialecki     <><
>> ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
>>
>>
>>


Re: Job scheduling (Re: Unable to run more than one job concurrently)

Posted by Paul Sutter <su...@gmail.com>.
right now, i would prefer that we have a really really simple scheme that
can be done quickly.

being able to run a huge days-long job in the background, such that a
smaller hours-long job basically takes over and runs to completion, that
would be a big win.

having multiple jobs with time slicing? cool, but only 20% better and
probably 80% more work.

if we want a thorough scheme, i think eric has the right idea of using one
of the existing scheduler packages.


On 5/19/06, Andrzej Bialecki <ab...@getopt.org> wrote:
>
> Doug Cutting wrote:
> > Paul Sutter wrote:
> >> (1) Allow submission times in the future, enabling the creation of
> >> "background" jobs. My understanding is that job submission times are
> >> used to
> >> prioritize scheduling. All tasks from a job submitted early run to
> >> completion before those of a job submitted later. If we could submit
> any
> >> days-long jobs with a submission time in the future, say the year
> >> 2010, and
> >> any short hours-long jobs with the current time, that short job would
> be
> >> able to interrupt the long job. Hack? Yes. Useful? I think so.
> >
> > I think this is equivalent to adding a job priority, where tasks with
> > the highest priority job are run first.  If jobs are at the same
> > priority, then the first submitted would run.  Adding priority would
> > add a bit more complexity, but would also be less of a hack.
>
>
> Hmm.. If you compare it to a Unix scheduler, processes at the same
> priority have even chances of being run, regardless of which was started
> first - not only that, processes undergo a "priority decay", in that if
> they are running longer then their priority is lowered - this enables
> new processes to start quickly (and maybe quickly finish), and then
> fairly compete with other processes.
>
> In our case, this would mean that jobs with the same priority would
> execute concurrently, sharing available map/reduce slots, and long
> running jobs would be gradually de-prioritized. This also means that the
> first job will slow down when the second one is started, but the second
> job will have a chance to make a good start (and perhaps quickly finish)
> and then, subject to the priority decay, run in parallel with other jobs
> (albeit slower) instead of being stuck in the wait queue.
>
> And if the second job is started with a higher priority, it should
> preempt the first job (i.e. it should get proportionally more slots than
> the first job). If you need all cluster resources for a specific job,
> and don't want any other jobs to run, just set the priority to the
> highest value, thus preempting all other jobs (actually, it would
> suspend other already executing jobs, which would resume when your job
> is done - not a bad feature either!).
>
> I think this is a relatively simple and well understood mechanism.
>
>
> >> (2) Have a per-job total task count limit. Currently, we establish the
> >> number of tasks each node runs, and how many map or reduce tasks we
> have
> >> total in a given job. But it would be great if we could set a ceiling
> >> on the
> >> number of tasks that run concurrently for a given job. This may help
> >> with
> >> Andrzej's fetcher (since it is bandwidth constrained, maybe fewer
> >> concurrent
> >> jobs would be fine?).
> >
> > I like this idea.  So if the highest-priority job is already running
> > at its task limit, then tasks can be run from the next
> > highest-priority job.  Should there be separate limits for maps and
> > reduces?
>
> I like this idea too. I think a similar setting for the minimum number
> of tasks would be needed too? That would solve my problem. In fact, it
> would be probably better than the schema I described above, because it
> would guarantee certain minimum tasks running at any time.
>
> This reminds me of the "idle time" and "real time" policies in BSD
> scheduler ... man 1 rtprio. The "real time" policy prevents the priority
> decay that normally occurs, and "idle time" policy allows processes to
> run only if CPU is idle.
>
> --
> Best regards,
> Andrzej Bialecki     <><
> ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
>

Re: Job scheduling (Re: Unable to run more than one job concurrently)

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:
> Paul Sutter wrote:
>> (1) Allow submission times in the future, enabling the creation of
>> "background" jobs. My understanding is that job submission times are 
>> used to
>> prioritize scheduling. All tasks from a job submitted early run to
>> completion before those of a job submitted later. If we could submit any
>> days-long jobs with a submission time in the future, say the year 
>> 2010, and
>> any short hours-long jobs with the current time, that short job would be
>> able to interrupt the long job. Hack? Yes. Useful? I think so.
>
> I think this is equivalent to adding a job priority, where tasks with 
> the highest priority job are run first.  If jobs are at the same 
> priority, then the first submitted would run.  Adding priority would 
> add a bit more complexity, but would also be less of a hack.


Hmm.. If you compare it to a Unix scheduler, processes at the same 
priority have even chances of being run, regardless of which was started 
first. Not only that, processes undergo a "priority decay": the longer 
they run, the lower their priority becomes. This enables new processes 
to start quickly (and maybe quickly finish), and then compete fairly 
with other processes.

In our case, this would mean that jobs with the same priority would 
execute concurrently, sharing available map/reduce slots, and long 
running jobs would be gradually de-prioritized. This also means that the 
first job will slow down when the second one is started, but the second 
job will have a chance to make a good start (and perhaps quickly finish) 
and then, subject to the priority decay, run in parallel with other jobs 
(albeit slower) instead of being stuck in the wait queue.

And if the second job is started with a higher priority, it should 
preempt the first job (i.e. it should get proportionally more slots than 
the first job). If you need all cluster resources for a specific job, 
and don't want any other jobs to run, just set the priority to the 
highest value, thus preempting all other jobs (actually, it would 
suspend other already executing jobs, which would resume when your job 
is done - not a bad feature either!).

I think this is a relatively simple and well understood mechanism.
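
For concreteness, here is a toy standalone sketch of the decay idea (the 
class and field names are invented, nothing to do with the real 
JobTracker/JobInProgress): effective priority = base priority minus a 
penalty that grows with the slots a job has already consumed, and the 
next free slot goes to whichever job currently scores highest.

import java.util.ArrayList;
import java.util.List;

public class DecayScheduler {
    static class Job {
        final String name;
        final int basePriority;     // higher = more important
        int slotsUsedSoFar = 0;     // crude analogue of CPU time consumed
        Job(String name, int basePriority) { this.name = name; this.basePriority = basePriority; }
        double effectivePriority(double decayPerSlot) {
            return basePriority - decayPerSlot * slotsUsedSoFar;
        }
    }

    static final double DECAY_PER_SLOT = 0.1;   // made-up tuning knob

    // Give the next free slot to the job with the highest effective priority.
    static Job pickNext(List<Job> jobs) {
        Job best = null;
        for (Job j : jobs) {
            if (best == null
                || j.effectivePriority(DECAY_PER_SLOT) > best.effectivePriority(DECAY_PER_SLOT)) {
                best = j;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Job> jobs = new ArrayList<>();
        jobs.add(new Job("long-fetch", 5));
        jobs.add(new Job("quick-index", 5));   // same base priority
        for (int slot = 0; slot < 10; slot++) {
            Job j = pickNext(jobs);
            j.slotsUsedSoFar++;
            System.out.println("slot " + slot + " -> " + j.name);
        }
    }
}

With both jobs at the same base priority the slots alternate between 
them, which is the sharing behaviour described above; a higher base 
priority simply shifts the balance.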


>> (2) Have a per-job total task count limit. Currently, we establish the
>> number of tasks each node runs, and how many map or reduce tasks we have
>> total in a given job. But it would be great if we could set a ceiling 
>> on the
>> number of tasks that run concurrently for a given job. This may help 
>> with
>> Andrzej's fetcher (since it is bandwidth constrained, maybe fewer 
>> concurrent
>> jobs would be fine?).
>
> I like this idea.  So if the highest-priority job is already running 
> at its task limit, then tasks can be run from the next 
> highest-priority job.  Should there be separate limits for maps and 
> reduces?

I like this idea too. I think a similar setting for the minimum number 
of tasks would be needed as well? That would solve my problem. In fact, 
it would probably be better than the scheme I described above, because 
it would guarantee a certain minimum number of tasks running at any time.

This reminds me of the "idle time" and "real time" policies in the BSD 
scheduler ... man 1 rtprio. The "real time" policy prevents the priority 
decay that normally occurs, and the "idle time" policy allows processes 
to run only if the CPU is idle.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Job scheduling (Re: Unable to run more than one job concurrently)

Posted by Rod Taylor <rb...@sitesell.com>.
> You have no guarantee that your time sensitive data is safe /  
> committed until after your reduce has completed.  If you care about  
> reliability or data integrity, simply run a full map-reduce job in  
> your collection window and store the result in the HDFS.

Perhaps I explained it incorrectly. It's NOT the data that is time
sensitive; it is the resource availability, with a given availability
window for retrieval. So long as sorting is a requirement of reduce,
the overhead of saving is going to remain significant.

> Do the expensive post-processing, which you have a quarter to 
> complete, as another job.  Being able to preempt a long job with a 
> time-sensitive short job seems to really be your requirement.

Fetch has the same problem. Even running fetches end-to-end (starting a
new one the instant the previous one has finished), you end up with lulls
between fetches. For me this is about 15% of the time (15% wasted
bandwidth, since you pay a flat rate).

My machines all have 12GB of RAM -- temporary storage is in memory -- and
reasonably fast processors. I really don't want to hold up a new fetch
map for a previous round's fetch reduce.

> On May 21, 2006, at 11:22 AM, Rod Taylor wrote:
> 
> >
> >>> (2) Have a per-job total task count limit. Currently, we  
> >>> establish the
> >>> number of tasks each node runs, and how many map or reduce tasks  
> >>> we have
> >>> total in a given job. But it would be great if we could set a  
> >>> ceiling on the
> >>> number of tasks that run concurrently for a given job. This may  
> >>> help with
> >>> Andrzej's fetcher (since it is bandwidth constrained, maybe fewer  
> >>> concurrent
> >>> jobs would be fine?).
> >>
> >> I like this idea.  So if the highest-priority job is already  
> >> running at
> >> its task limit, then tasks can be run from the next highest-priority
> >> job.  Should there be separate limits for maps and reduces?
> >
> > Limits for map and reduce are useful for a job class. Not so much  
> > for a
> > specific job instance.  Data collection may be best achieved with 15
> > parallel maps pulling data from remote data sources. But if the fact
> > there are 3 from one job and 12 from another isn't important. It's
> > important that 15 makes best use of resources.
> >
> > A different priority for map and reduce would also be useful. Many  
> > times
> > data collection in a set timeframe is far more important than reducing
> > it for storage or post processing, particularly when data  
> > collection is
> > retrieving it from a remote resource.
> >
> >
> > Data warehousing activities often require that data collection occur
> > once a night between set hours (very high priority) but processing of
> > the data collected can occur any time until the end of the quarter.
> >
> >
> > For Nutch, with both of the above you should be able to achieve N  
> > number
> > of Fetch Map processes running at all times with everything else being
> > secondary within the remaining resources. This could make use of  
> > 100% of
> > available remote bandwidth.
> >
> > -- 
> > Rod Taylor <rb...@sitesell.com>
> >
> 
> 
-- 
Rod Taylor <rb...@sitesell.com>


Re: Job scheduling (Re: Unable to run more than one job concurrently)

Posted by Eric Baldeschwieler <er...@yahoo-inc.com>.
??

You have no guarantee that your time sensitive data is safe /  
committed until after your reduce has completed.  If you care about  
reliability or data integrity, simply run a full map-reduce job in  
your collection window and store the result in the HDFS.

Do the expensive post-processing, which you have a quarter to complete, 
as another job.  Being able to preempt a long job with a time-sensitive 
short job seems to really be your requirement.

On May 21, 2006, at 11:22 AM, Rod Taylor wrote:

>
>>> (2) Have a per-job total task count limit. Currently, we  
>>> establish the
>>> number of tasks each node runs, and how many map or reduce tasks  
>>> we have
>>> total in a given job. But it would be great if we could set a  
>>> ceiling on the
>>> number of tasks that run concurrently for a given job. This may  
>>> help with
>>> Andrzej's fetcher (since it is bandwidth constrained, maybe fewer  
>>> concurrent
>>> jobs would be fine?).
>>
>> I like this idea.  So if the highest-priority job is already  
>> running at
>> its task limit, then tasks can be run from the next highest-priority
>> job.  Should there be separate limits for maps and reduces?
>
> Limits for map and reduce are useful for a job class. Not so much  
> for a
> specific job instance.  Data collection may be best achieved with 15
> parallel maps pulling data from remote data sources. But if the fact
> there are 3 from one job and 12 from another isn't important. It's
> important that 15 makes best use of resources.
>
> A different priority for map and reduce would also be useful. Many  
> times
> data collection in a set timeframe is far more important than reducing
> it for storage or post processing, particularly when data  
> collection is
> retrieving it from a remote resource.
>
>
> Data warehousing activities often require that data collection occur
> once a night between set hours (very high priority) but processing of
> the data collected can occur any time until the end of the quarter.
>
>
> For Nutch, with both of the above you should be able to achieve N  
> number
> of Fetch Map processes running at all times with everything else being
> secondary within the remaining resources. This could make use of  
> 100% of
> available remote bandwidth.
>
> -- 
> Rod Taylor <rb...@sitesell.com>
>


Re: Job scheduling (Re: Unable to run more than one job concurrently)

Posted by Rod Taylor <rb...@sitesell.com>.
> > (2) Have a per-job total task count limit. Currently, we establish the
> > number of tasks each node runs, and how many map or reduce tasks we have
> > total in a given job. But it would be great if we could set a ceiling on the
> > number of tasks that run concurrently for a given job. This may help with
> > Andrzej's fetcher (since it is bandwidth constrained, maybe fewer concurrent
> > jobs would be fine?).
> 
> I like this idea.  So if the highest-priority job is already running at 
> its task limit, then tasks can be run from the next highest-priority 
> job.  Should there be separate limits for maps and reduces?

Limits for map and reduce are useful for a job class, not so much for a
specific job instance.  Data collection may be best achieved with 15
parallel maps pulling data from remote data sources, but the fact that
there are 3 from one job and 12 from another isn't important. What's
important is that those 15 make the best use of resources.

A different priority for map and reduce would also be useful. Many times
data collection in a set timeframe is far more important than reducing
it for storage or post processing, particularly when data collection is
retrieving it from a remote resource.


Data warehousing activities often require that data collection occur
once a night between set hours (very high priority) but processing of
the data collected can occur any time until the end of the quarter.


For Nutch, with both of the above you should be able to keep N fetch
map processes running at all times, with everything else being
secondary within the remaining resources. This could make use of 100% of
the available remote bandwidth.

-- 
Rod Taylor <rb...@sitesell.com>


Re: Job scheduling (Re: Unable to run more than one job concurrently)

Posted by Andrzej Bialecki <ab...@getopt.org>.
Paul Sutter wrote:
> priority vs pretend dates: whatever is most trivial, i'm in favor. we 
> could
> certainly use it today.
>
> delaying reducers: regardless of the bottleneck in the system, andrzej's
> problem was that reducers were sitting idle while the mappers of another
> task were running.

Exactly. There is no point in starting any reducers if we haven't even 
scheduled a single map task for the job. As soon as at least one map 
task is scheduled, I'm fine with starting reducers; I agree with the 
arguments about network bottlenecks that Doug presented.
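
In code, the guard is about this small (JobState and its field are 
invented for illustration, not the real JobInProgress):

public class ReduceGate {
    static class JobState {
        int mapsScheduled;   // map tasks handed out to tasktrackers so far
    }

    // Don't hand out any reduce task for a job until at least one of its
    // map tasks has actually been scheduled; otherwise the reducer would
    // just spin-wait for map output that cannot exist yet.
    static boolean mayScheduleReduce(JobState job) {
        return job.mapsScheduled > 0;
    }

    public static void main(String[] args) {
        JobState job = new JobState();
        System.out.println("before any map: " + mayScheduleReduce(job));   // false
        job.mapsScheduled = 1;
        System.out.println("after one map:  " + mayScheduleReduce(job));   // true
    }
}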

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Job scheduling (Re: Unable to run more than one job concurrently)

Posted by Paul Sutter <su...@gmail.com>.
priority vs pretend dates: whatever is most trivial, i'm in favor. we could
certainly use it today.

delaying reducers: regardless of the bottleneck in the system, andrzej's
problem was that reducers were sitting idle while the mappers of another
task were running.

paul


On 5/19/06, Doug Cutting <cu...@apache.org> wrote:
>
> Paul Sutter wrote:
> > (1) Allow submission times in the future, enabling the creation of
> > "background" jobs. My understanding is that job submission times are
> used to
> > prioritize scheduling. All tasks from a job submitted early run to
> > completion before those of a job submitted later. If we could submit any
> > days-long jobs with a submission time in the future, say the year 2010,
> and
> > any short hours-long jobs with the current time, that short job would be
> > able to interrupt the long job. Hack? Yes. Useful? I think so.
>
> I think this is equivalent to adding a job priority, where tasks with
> the highest priority job are run first.  If jobs are at the same
> priority, then the first submitted would run.  Adding priority would add
> a bit more complexity, but would also be less of a hack.
>
> > (2) Have a per-job total task count limit. Currently, we establish the
> > number of tasks each node runs, and how many map or reduce tasks we have
> > total in a given job. But it would be great if we could set a ceiling on
> the
> > number of tasks that run concurrently for a given job. This may help
> with
> > Andrzej's fetcher (since it is bandwidth constrained, maybe fewer
> concurrent
> > jobs would be fine?).
>
> I like this idea.  So if the highest-priority job is already running at
> its task limit, then tasks can be run from the next highest-priority
> job.  Should there be separate limits for maps and reduces?
>
> > (3) Don't start the reducers until a certain number of mappers have
> > completed (25%? 75%? 90%?). This optimization of starting early will be
> less
> > important when we've solved the map output copy problems.
>
> Would this only be done if there were other jobs competing for reduce
> slots?  Otherwise this could have an impact on performance.
>
> In theory, network bandwidth is the primary MapReduce bottleneck.  Map
> input usually doesn't cross the network, but map output must generally
> cross the network to reduce nodes.  Sorting can be done in parallel with
> mapping and copying of map output to reduce nodes, but reduce output
> cannot begin until all map output has arrived.  Finally, in general,
> reduce output must cross the network, for replication.  So the data must
> cross the network twice: once between map and reduce and once on reduce
> output, and these two transfers are serialized.  Since the network is
> the bottleneck, we will get the best performance when we saturate it
> continuously.  So we should thus begin transferring map outputs as soon
> as they are available.  That's the theory.
>
> Doug
>

Re: Job scheduling (Re: Unable to run more than one job concurrently)

Posted by Doug Cutting <cu...@apache.org>.
Paul Sutter wrote:
> (1) Allow submission times in the future, enabling the creation of
> "background" jobs. My understanding is that job submission times are used to
> prioritize scheduling. All tasks from a job submitted early run to
> completion before those of a job submitted later. If we could submit any
> days-long jobs with a submission time in the future, say the year 2010, and
> any short hours-long jobs with the current time, that short job would be
> able to interrupt the long job. Hack? Yes. Useful? I think so.

I think this is equivalent to adding a job priority, where tasks with 
the highest priority job are run first.  If jobs are at the same 
priority, then the first submitted would run.  Adding priority would add 
a bit more complexity, but would also be less of a hack.

> (2) Have a per-job total task count limit. Currently, we establish the
> number of tasks each node runs, and how many map or reduce tasks we have
> total in a given job. But it would be great if we could set a ceiling on the
> number of tasks that run concurrently for a given job. This may help with
> Andrzej's fetcher (since it is bandwidth constrained, maybe fewer concurrent
> jobs would be fine?).

I like this idea.  So if the highest-priority job is already running at 
its task limit, then tasks can be run from the next highest-priority 
job.  Should there be separate limits for maps and reduces?
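
To make that rule concrete, here's a rough standalone sketch (hypothetical 
classes, not Hadoop code): walk jobs in priority order and skip any job 
that is already at its concurrent-map ceiling, so the next job down gets 
the slot.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TaskCeiling {
    static class Job {
        final String name;
        final int priority;            // higher runs first
        final int maxConcurrentMaps;   // per-job ceiling
        int runningMaps = 0;
        int pendingMaps;
        Job(String name, int priority, int maxConcurrentMaps, int pendingMaps) {
            this.name = name; this.priority = priority;
            this.maxConcurrentMaps = maxConcurrentMaps; this.pendingMaps = pendingMaps;
        }
    }

    // Find the highest-priority job that still has work and is under its cap.
    static Job assignSlot(List<Job> jobs) {
        List<Job> byPriority = new ArrayList<>(jobs);
        byPriority.sort(Comparator.comparingInt((Job j) -> j.priority).reversed());
        for (Job j : byPriority) {
            if (j.pendingMaps > 0 && j.runningMaps < j.maxConcurrentMaps) {
                j.pendingMaps--;
                j.runningMaps++;
                return j;
            }
        }
        return null;   // every job is finished or at its ceiling
    }

    public static void main(String[] args) {
        List<Job> jobs = new ArrayList<>();
        jobs.add(new Job("fetcher", 10, 3, 100));   // bandwidth-bound: cap at 3 maps
        jobs.add(new Job("indexer", 5, 20, 50));
        for (int slot = 0; slot < 6; slot++) {
            Job j = assignSlot(jobs);
            System.out.println("slot " + slot + " -> " + (j == null ? "idle" : j.name));
        }
    }
}

Running it, the capped fetcher only ever occupies its 3 slots and 
everything beyond that falls through to the lower-priority indexer.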

> (3) Don't start the reducers until a certain number of mappers have
> completed (25%? 75%? 90%?). This optimization of starting early will be less
> important when we've solved the map output copy problems.

Would this only be done if there were other jobs competing for reduce 
slots?  Otherwise this could have an impact on performance.

In theory, network bandwidth is the primary MapReduce bottleneck.  Map 
input usually doesn't cross the network, but map output must generally 
cross the network to reduce nodes.  Sorting can be done in parallel with 
mapping and copying of map output to reduce nodes, but reduce output 
cannot begin until all map output has arrived.  Finally, in general, 
reduce output must cross the network, for replication.  So the data must 
cross the network twice: once between map and reduce and once on reduce 
output, and these two transfers are serialized.  Since the network is 
the bottleneck, we will get the best performance when we saturate it 
continuously.  We should thus begin transferring map outputs as soon 
as they are available.  That's the theory.

Doug

Re: Job scheduling (Re: Unable to run more than one job concurrently)

Posted by Paul Sutter <su...@gmail.com>.
Using an existing full-featured scheduler will be great.

Being able to submit jobs with a future submission time would be nice
though, as an interim step -- if it would work.

Question for any Hadoop scheduler experts out there: would that work as a
quick-hack way to implement background jobs?

Paul


On 5/19/06, Eric Baldeschwieler <er...@yahoo-inc.com> wrote:
>
> We're planning to experiment with using existing batch scheduling
> systems for addressing these concerns later in the year.  Condor and
> Torque being the leading contenders.
>
> The thinking is that these systems have huge investments in
> configurable scheduling policies and that it is best to keep hadoop
> simple and leverage these systems to get fine grained / multi-user
> scheduling control.
>
> If this works, the idea is to run a durable HDFS cluster and have the
> batch systems setup task tracker networks for each user on demand.
> This approach is probably more applicable if you have large clusters
> with distinct users / systems sharing them, so this may not address
> your requirements.
>
> In any case this is why my team is not putting a lot of thought into
> this problem in the short term.  That said, I've always anticipated
> that others in the hadoop community might pursue improved
> scheduling.  I just advocate keeping it simple, because when you look
> at condor or torque you will quickly appreciate how unsimple it can
> become!
>
>
> On May 19, 2006, at 11:01 AM, Paul Sutter wrote:
>
> >
> > A few suggestions to allow for a very simple extension to the current
> > scheduling:
> >
> > (1) Allow submission times in the future, enabling the creation of
> > "background" jobs. My understanding is that job submission times
> > are used to
> > prioritize scheduling. All tasks from a job submitted early run to
> > completion before those of a job submitted later. If we could
> > submit any
> > days-long jobs with a submission time in the future, say the year
> > 2010, and
> > any short hours-long jobs with the current time, that short job
> > would be
> > able to interrupt the long job. Hack? Yes. Useful? I think so.
> >
> > (2) Have a per-job total task count limit. Currently, we establish the
> > number of tasks each node runs, and how many map or reduce tasks we
> > have
> > total in a given job. But it would be great if we could set a
> > ceiling on the
> > number of tasks that run concurrently for a given job. This may
> > help with
> > Andrzej's fetcher (since it is bandwidth constrained, maybe fewer
> > concurrent
> > jobs would be fine?).
> >
> > (3) Don't start the reducers until a certain number of mappers have
> > completed (25%? 75%? 90%?). This optimization of starting early
> > will be less
> > important when we've solved the map output copy problems.
> >
> > Just a few ideas.
> >
> > -----Original Message-----
> > From: bpendleton@gmail.com [mailto:bpendleton@gmail.com] On Behalf
> > Of Bryan
> > A. Pendleton
> > Sent: Friday, May 19, 2006 10:44 AM
> > To: hadoop-dev@lucene.apache.org
> > Subject: Re: Job scheduling (Re: Unable to run more than one job
> > concurrently)
> >
> > There are some additional risks to running simultaneous jobs. Right
> > now,
> > Hadoop does a very bad job dealing with out-of-space conditions. If
> > you run
> > two jobs, where the total amount of temporary space (for map outputs)
> > between both jobs is greater than the amount of space available on the
> > cluster, then they will both fail. If you run them serially, they
> > should
> > both succeed.
> >
> > In the very least, it's probably wise to take into account more
> > than just
> > scheduling priority in any scheduler. (Expected) temporary space
> > demands,
> > bandwidth limits, and size of jobs should be some of the criteria
> > available
> > to the scheduler.
> >
> > On 5/19/06, Andrzej Bialecki <ab...@getopt.org> wrote:
> >>
> >> Andrzej Bialecki wrote:
> >>> Hi all,
> >>>
> >>> I'm running Hadoop on a relatively small cluster (5 nodes) with
> >>> growing datasets.
> >>>
> >>> I noticed that if I start a job that is configured to run more map
> >>> tasks than is the cluster capacity
> >>> (mapred.tasktracker.tasks.maximum *
> >>> number of nodes, 20 in this case), of course only that many map
> >>> tasks
> >>> will run, and when they are finished the next map tasks from that
> >>> job
> >>> will be scheduled.
> >>>
> >>> However, when I try to start another job in parallel, only its
> >>> reduce
> >>> tasks will be scheduled (uselessly spin-waiting for map output, and
> >>> only reducing the number of available tasks in the cluster...),
> >>> and no
> >>> map tasks from this job will be scheduled - until the first job
> >>> completes. This feels wrong - not only I'm not making progress on
> >>> the
> >>> second job, but I'm also taking the slots away from the first job!
> >>>
> >>> I'm somewhat miffed about this - I'd think that jobtracker should
> >>> split the available resources evenly between these two jobs, i.e. it
> >>> should schedule some map tasks from the first job and some from the
> >>> second one. This is not what is happening, though ...
> >>>
> >>> Is this a configuration error, a bug, or a feature? :)
> >>>
> >>
> >> It seems it's a feature - I found the code in
> >> JobTracker.pollForNewTask(), and I'm not too happy about it.
> >>
> >> Let's consider the following example: if I'm running a Nutch fetcher,
> >> the main limitation is the available bandwidth to fetch pages, and
> >> not
> >> the capacity of the cluster. I'd love to be able to execute other
> >> jobs
> >> in parallel, so that I don't have to wait until fetcher completes. I
> >> could sacrifice some of the task slots on tasktrackers for that other
> >> job, because the fetcher job wouldn't suffer from this anyway (at
> >> least
> >> not too much).
> >>
> >> So, I'd like to change this code to pick up a random job from the
> >> list
> >> jobsByArrival, and take job.obtainNewMapTask from that randomly
> >> selected
> >> job. Would that work? Additionally, if no map tasks from that job
> >> have
> >> been allocated I'd like to skip adding reduce tasks from that job,
> >> later
> >> in lines 721-750.
> >>
> >> Perhaps we should extend JobInProgress to include a priority, and
> >> implement something a la Unix scheduler.
> >>
> >> --
> >> Best regards,
> >> Andrzej Bialecki     <><
> >> ___. ___ ___ ___ _ _   __________________________________
> >> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> >> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> >> http://www.sigram.com  Contact: info at sigram dot com
> >>
> >>
> >>
> >
> >
> > --
> > Bryan A. Pendleton
> > Ph: (877) geek-1-bp
> >
>
>

Re: Job scheduling (Re: Unable to run more than one job concurrently)

Posted by Eric Baldeschwieler <er...@yahoo-inc.com>.
We're planning to experiment with using existing batch scheduling  
systems for addressing these concerns later in the year.  Condor and  
Torque being the leading contenders.

The thinking is that these systems have huge investments in  
configurable scheduling policies and that it is best to keep hadoop  
simple and leverage these systems to get fine grained / multi-user  
scheduling control.

If this works, the idea is to run a durable HDFS cluster and have the  
batch systems set up task tracker networks for each user on demand.   
This approach is probably more applicable if you have large clusters  
with distinct users / systems sharing them, so this may not address  
your requirements.

In any case this is why my team is not putting a lot of thought into  
this problem in the short term.  That said, I've always anticipated  
that others in the hadoop community might pursue improved  
scheduling.  I just advocate keeping it simple, because when you look  
at condor or torque you will quickly appreciate how unsimple it can  
become!


On May 19, 2006, at 11:01 AM, Paul Sutter wrote:

>
> A few suggestions to allow for a very simple extension to the current
> scheduling:
>
> (1) Allow submission times in the future, enabling the creation of
> "background" jobs. My understanding is that job submission times  
> are used to
> prioritize scheduling. All tasks from a job submitted early run to
> completion before those of a job submitted later. If we could  
> submit any
> days-long jobs with a submission time in the future, say the year  
> 2010, and
> any short hours-long jobs with the current time, that short job  
> would be
> able to interrupt the long job. Hack? Yes. Useful? I think so.
>
> (2) Have a per-job total task count limit. Currently, we establish the
> number of tasks each node runs, and how many map or reduce tasks we  
> have
> total in a given job. But it would be great if we could set a  
> ceiling on the
> number of tasks that run concurrently for a given job. This may  
> help with
> Andrzej's fetcher (since it is bandwidth constrained, maybe fewer  
> concurrent
> jobs would be fine?).
>
> (3) Don't start the reducers until a certain number of mappers have
> completed (25%? 75%? 90%?). This optimization of starting early  
> will be less
> important when we've solved the map output copy problems.
>
> Just a few ideas.
>
> -----Original Message-----
> From: bpendleton@gmail.com [mailto:bpendleton@gmail.com] On Behalf  
> Of Bryan
> A. Pendleton
> Sent: Friday, May 19, 2006 10:44 AM
> To: hadoop-dev@lucene.apache.org
> Subject: Re: Job scheduling (Re: Unable to run more than one job
> concurrently)
>
> There are some additional risks to running simultaneous jobs. Right  
> now,
> Hadoop does a very bad job dealing with out-of-space conditions. If  
> you run
> two jobs, where the total amount of temporary space (for map outputs)
> between both jobs is greater than the amount of space available on the
> cluster, then they will both fail. If you run them serially, they  
> should
> both succeed.
>
> In the very least, it's probably wise to take into account more  
> than just
> scheduling priority in any scheduler. (Expected) temporary space  
> demands,
> bandwidth limits, and size of jobs should be some of the criteria  
> available
> to the scheduler.
>
> On 5/19/06, Andrzej Bialecki <ab...@getopt.org> wrote:
>>
>> Andrzej Bialecki wrote:
>>> Hi all,
>>>
>>> I'm running Hadoop on a relatively small cluster (5 nodes) with
>>> growing datasets.
>>>
>>> I noticed that if I start a job that is configured to run more map
>>> tasks than is the cluster capacity  
>>> (mapred.tasktracker.tasks.maximum *
>>> number of nodes, 20 in this case), of course only that many map  
>>> tasks
>>> will run, and when they are finished the next map tasks from that  
>>> job
>>> will be scheduled.
>>>
>>> However, when I try to start another job in parallel, only its  
>>> reduce
>>> tasks will be scheduled (uselessly spin-waiting for map output, and
>>> only reducing the number of available tasks in the cluster...),  
>>> and no
>>> map tasks from this job will be scheduled - until the first job
>>> completes. This feels wrong - not only I'm not making progress on  
>>> the
>>> second job, but I'm also taking the slots away from the first job!
>>>
>>> I'm somewhat miffed about this - I'd think that jobtracker should
>>> split the available resources evenly between these two jobs, i.e. it
>>> should schedule some map tasks from the first job and some from the
>>> second one. This is not what is happening, though ...
>>>
>>> Is this a configuration error, a bug, or a feature? :)
>>>
>>
>> It seems it's a feature - I found the code in
>> JobTracker.pollForNewTask(), and I'm not too happy about it.
>>
>> Let's consider the following example: if I'm running a Nutch fetcher,
>> the main limitation is the available bandwidth to fetch pages, and  
>> not
>> the capacity of the cluster. I'd love to be able to execute other  
>> jobs
>> in parallel, so that I don't have to wait until fetcher completes. I
>> could sacrifice some of the task slots on tasktrackers for that other
>> job, because the fetcher job wouldn't suffer from this anyway (at  
>> least
>> not too much).
>>
>> So, I'd like to change this code to pick up a random job from the  
>> list
>> jobsByArrival, and take job.obtainNewMapTask from that randomly  
>> selected
>> job. Would that work? Additionally, if no map tasks from that job  
>> have
>> been allocated I'd like to skip adding reduce tasks from that job,  
>> later
>> in lines 721-750.
>>
>> Perhaps we should extend JobInProgress to include a priority, and
>> implement something a la Unix scheduler.
>>
>> --
>> Best regards,
>> Andrzej Bialecki     <><
>> ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
>>
>>
>>
>
>
> -- 
> Bryan A. Pendleton
> Ph: (877) geek-1-bp
>


RE: Job scheduling (Re: Unable to run more than one job concurrently)

Posted by Paul Sutter <ps...@quantcast.com>.
A few suggestions to allow for a very simple extension to the current
scheduling:

(1) Allow submission times in the future, enabling the creation of
"background" jobs. My understanding is that job submission times are used to
prioritize scheduling. All tasks from a job submitted early run to
completion before those of a job submitted later. If we could submit any
days-long jobs with a submission time in the future, say the year 2010, and
any short hours-long jobs with the current time, that short job would be
able to interrupt the long job. Hack? Yes. Useful? I think so.

(2) Have a per-job total task count limit. Currently, we establish the
number of tasks each node runs, and how many map or reduce tasks we have
total in a given job. But it would be great if we could set a ceiling on the
number of tasks that run concurrently for a given job. This may help with
Andrzej's fetcher (since it is bandwidth constrained, maybe fewer concurrent
jobs would be fine?).

(3) Don't start the reducers until a certain number of mappers have
completed (25%? 75%? 90%?). This optimization of starting early will be less
important when we've solved the map output copy problems.

Just a few ideas.
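
As a rough illustration of (1) -- with invented classes, not the actual 
JobTracker code -- ordering purely by claimed submission time means a 
"background" job stamped with a far-future time always yields to 
normally stamped jobs:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class FakeSubmitTime {
    static class Job {
        final String name;
        final long submitTimeMillis;
        Job(String name, long submitTimeMillis) { this.name = name; this.submitTimeMillis = submitTimeMillis; }
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long year2010 = now + 4L * 365 * 24 * 3600 * 1000;   // "the year 2010", roughly

        List<Job> queue = new ArrayList<>();
        queue.add(new Job("big-background-crawl", year2010)); // submitted first, but deprioritized
        queue.add(new Job("urgent-report", now));

        // Earlier submission time wins, exactly like today's by-arrival ordering.
        queue.sort(Comparator.comparingLong((Job j) -> j.submitTimeMillis));
        for (Job j : queue) {
            System.out.println(j.name);
        }
        // prints urgent-report ahead of big-background-crawl
    }
}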

-----Original Message-----
From: bpendleton@gmail.com [mailto:bpendleton@gmail.com] On Behalf Of Bryan
A. Pendleton
Sent: Friday, May 19, 2006 10:44 AM
To: hadoop-dev@lucene.apache.org
Subject: Re: Job scheduling (Re: Unable to run more than one job
concurrently)

There are some additional risks to running simultaneous jobs. Right now,
Hadoop does a very bad job dealing with out-of-space conditions. If you run
two jobs, where the total amount of temporary space (for map outputs)
between both jobs is greater than the amount of space available on the
cluster, then they will both fail. If you run them serially, they should
both succeed.

At the very least, it's probably wise to take into account more than just
scheduling priority in any scheduler. (Expected) temporary space demands,
bandwidth limits, and size of jobs should be some of the criteria available
to the scheduler.

On 5/19/06, Andrzej Bialecki <ab...@getopt.org> wrote:
>
> Andrzej Bialecki wrote:
> > Hi all,
> >
> > I'm running Hadoop on a relatively small cluster (5 nodes) with
> > growing datasets.
> >
> > I noticed that if I start a job that is configured to run more map
> > tasks than is the cluster capacity (mapred.tasktracker.tasks.maximum *
> > number of nodes, 20 in this case), of course only that many map tasks
> > will run, and when they are finished the next map tasks from that job
> > will be scheduled.
> >
> > However, when I try to start another job in parallel, only its reduce
> > tasks will be scheduled (uselessly spin-waiting for map output, and
> > only reducing the number of available tasks in the cluster...), and no
> > map tasks from this job will be scheduled - until the first job
> > completes. This feels wrong - not only I'm not making progress on the
> > second job, but I'm also taking the slots away from the first job!
> >
> > I'm somewhat miffed about this - I'd think that jobtracker should
> > split the available resources evenly between these two jobs, i.e. it
> > should schedule some map tasks from the first job and some from the
> > second one. This is not what is happening, though ...
> >
> > Is this a configuration error, a bug, or a feature? :)
> >
>
> It seems it's a feature - I found the code in
> JobTracker.pollForNewTask(), and I'm not too happy about it.
>
> Let's consider the following example: if I'm running a Nutch fetcher,
> the main limitation is the available bandwidth to fetch pages, and not
> the capacity of the cluster. I'd love to be able to execute other jobs
> in parallel, so that I don't have to wait until fetcher completes. I
> could sacrifice some of the task slots on tasktrackers for that other
> job, because the fetcher job wouldn't suffer from this anyway (at least
> not too much).
>
> So, I'd like to change this code to pick up a random job from the list
> jobsByArrival, and take job.obtainNewMapTask from that randomly selected
> job. Would that work? Additionally, if no map tasks from that job have
> been allocated I'd like to skip adding reduce tasks from that job, later
> in lines 721-750.
>
> Perhaps we should extend JobInProgress to include a priority, and
> implement something a la Unix scheduler.
>
> --
> Best regards,
> Andrzej Bialecki     <><
> ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
>


-- 
Bryan A. Pendleton
Ph: (877) geek-1-bp


Re: Job scheduling (Re: Unable to run more than one job concurrently)

Posted by "Bryan A. Pendleton" <bp...@geekdom.net>.
There are some additional risks to running simultaneous jobs. Right now,
Hadoop does a very bad job dealing with out-of-space conditions. If you run
two jobs, where the total amount of temporary space (for map outputs)
between both jobs is greater than the amount of space available on the
cluster, then they will both fail. If you run them serially, they should
both succeed.

At the very least, it's probably wise to take into account more than just
scheduling priority in any scheduler. (Expected) temporary space demands,
bandwidth limits, and size of jobs should be some of the criteria available
to the scheduler.

On 5/19/06, Andrzej Bialecki <ab...@getopt.org> wrote:
>
> Andrzej Bialecki wrote:
> > Hi all,
> >
> > I'm running Hadoop on a relatively small cluster (5 nodes) with
> > growing datasets.
> >
> > I noticed that if I start a job that is configured to run more map
> > tasks than is the cluster capacity (mapred.tasktracker.tasks.maximum *
> > number of nodes, 20 in this case), of course only that many map tasks
> > will run, and when they are finished the next map tasks from that job
> > will be scheduled.
> >
> > However, when I try to start another job in parallel, only its reduce
> > tasks will be scheduled (uselessly spin-waiting for map output, and
> > only reducing the number of available tasks in the cluster...), and no
> > map tasks from this job will be scheduled - until the first job
> > completes. This feels wrong - not only I'm not making progress on the
> > second job, but I'm also taking the slots away from the first job!
> >
> > I'm somewhat miffed about this - I'd think that jobtracker should
> > split the available resources evenly between these two jobs, i.e. it
> > should schedule some map tasks from the first job and some from the
> > second one. This is not what is happening, though ...
> >
> > Is this a configuration error, a bug, or a feature? :)
> >
>
> It seems it's a feature - I found the code in
> JobTracker.pollForNewTask(), and I'm not too happy about it.
>
> Let's consider the following example: if I'm running a Nutch fetcher,
> the main limitation is the available bandwidth to fetch pages, and not
> the capacity of the cluster. I'd love to be able to execute other jobs
> in parallel, so that I don't have to wait until fetcher completes. I
> could sacrifice some of the task slots on tasktrackers for that other
> job, because the fetcher job wouldn't suffer from this anyway (at least
> not too much).
>
> So, I'd like to change this code to pick up a random job from the list
> jobsByArrival, and take job.obtainNewMapTask from that randomly selected
> job. Would that work? Additionally, if no map tasks from that job have
> been allocated I'd like to skip adding reduce tasks from that job, later
> in lines 721-750.
>
> Perhaps we should extend JobInProgress to include a priority, and
> implement something a la Unix scheduler.
>
> --
> Best regards,
> Andrzej Bialecki     <><
> ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
>


-- 
Bryan A. Pendleton
Ph: (877) geek-1-bp