Posted to mapreduce-dev@hadoop.apache.org by Vasco Visser <va...@gmail.com> on 2012/09/02 18:34:00 UTC

On the topic of task scheduling

Hi,

I am new to the list, I am working with hadoop in the context of my
MSc graduation project (has nothing to do with task scheduling per
se). I came across task scheduling because I ran into the fifo
starvation bug (MAPREDUCE-4613). Now, I am running 2.1.0 branch where
the fifo starvation issue is solved. The behavior of task scheduling I
observe in this branch is as follows. It begins with all containers
allocated to mappers. Pretty quickly, reducers start to be
scheduled. The share of containers given to reducers grows roughly
linearly until about 50% (does anybody know why 50%?) of the
available containers are reducers (this point is reached when ~50%
of the mappers are finished). It stays at ~50-50 until all mappers
are scheduled; only then is the proportion of containers allocated
to reducers increased to > 50%.

I don't think this is, in general, the optimal scheduling behavior
in terms of total job completion time. The reason is that the last
reducer can only be scheduled once a free container becomes
available after all mappers have been scheduled. Thus, to shorten
total job completion time, the last reducer must be scheduled as
early as possible.

For the following gedankenexperiment, assume the number of reducers
is set to 99% of capacity, as suggested somewhere in the hadoop
docs, and that each reducer will process roughly the same amount of
work. I am going to schedule as in 2.1.0, but instead of slowly
allocating reducers up to 50% of capacity, I am just going to take
those containers away. Thus, the amount of map work is the same as
in 2.1.0, only no reduce work will be done. At the point where the
proportion of reducers would have increased to more than 50% of the
containers (i.e., near the end of the map phase), I schedule all
reducers in the containers I took away, making sure that the last
reducer is scheduled at the same moment as it would be in 2.1.0.
My claim is that the job completion time of this hypothetical
schedule is about the same as with the scheduling in 2.1.0 (as the
last reducer is scheduled at the same time), even though I took
away 50% of the available resources for a large part of the job!
The conclusion is that it would be better to allocate all available
containers to mappers, and to start scheduling reducers only when
the map phase is nearing its end, instead of right at the beginning
of the job.
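
To make the claim concrete, here is a minimal back-of-the-envelope
sketch in Java. All numbers (100 containers, a 10-unit map phase, a
5-unit reduce tail) are made up for illustration; it only compares
the container time consumed by the two schedules, which finish at
the same wall-clock moment.

    // Hypothetical numbers only; not a model of the actual MR AM.
    public class GedankenSchedule {
        public static void main(String[] args) {
            double containers = 100, mapPhase = 10, reduceTail = 5;

            // 2.1.0-style: from the halfway point of the map phase, ~50% of
            // the containers hold reducers that mostly wait for map output.
            double used210 = containers * mapPhase + containers * reduceTail;

            // Thought experiment: those reducer containers are simply taken
            // away until the end of the map phase; all else is identical.
            double usedExp = containers * (mapPhase / 2)
                           + (containers / 2) * (mapPhase / 2)
                           + containers * reduceTail;

            System.out.printf("both schedules finish at t = %.0f%n", mapPhase + reduceTail);
            System.out.printf("container time, 2.1.0-style: %.0f%n", used210);
            System.out.printf("container time, experiment : %.0f%n", usedExp);
        }
    }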

Scheduling reducers early seems to me the way to go only when: 1) the
output from mappers is very skewed, i.e., some reducers are expected
to need much more time than others, 2) the network connection between
nodes is (expected to be) a big bottleneck, i.e., schedule reducers
early to smear out data transfer over the lifetime of a job, or 3)
there is no contention for resource containers.

With regard to point 1: skewness can be estimated by looking at the
relative sizes of the partitioned mapper output.
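
As an illustration of point 1, one simple (hypothetical) measure is
to flag skew when some partition of the map output is much larger
than the mean partition size. This is only a sketch, not an existing
Hadoop API:

    import java.util.Arrays;

    // Sketch: estimate skew from per-partition map-output sizes (bytes).
    // A partition much larger than the mean suggests its reducer will run long.
    public class SkewEstimate {
        static boolean isSkewed(long[] partitionBytes, double factor) {
            double mean = Arrays.stream(partitionBytes).average().orElse(0);
            long max = Arrays.stream(partitionBytes).max().orElse(0);
            return mean > 0 && max > factor * mean;
        }

        public static void main(String[] args) {
            long[] sizes = {100, 120, 90, 110, 600};   // made-up partition sizes
            System.out.println(isSkewed(sizes, 2.0));  // true: one partition dominates
        }
    }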

With regard to point 2: I think the network is only a bottleneck if
it feeds tuples more slowly than the reducer can merge-sort them
(am I right?). Also, it might be a nice optimization to transfer
the intermediate data to the machine that is going (or likely) to
run a specific reducer before the reducer is actually run there
(e.g., something like a per-machine prefetch manager?). A
per-machine task scheduling queue would be needed for this, to
determine where a reducer is going (or likely) to be scheduled.
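
To illustrate the prefetch idea (purely hypothetical, not an
existing MapReduce component), a per-machine prefetch manager could
drain a per-node queue of reducers that are likely to be scheduled
there and start fetching their map outputs ahead of time:

    import java.util.ArrayDeque;
    import java.util.Queue;

    // Hypothetical per-machine prefetch manager: a queue of reducer IDs
    // expected to run on this node; for each one, intermediate data would be
    // fetched before the reducer container is actually launched.
    public class PrefetchManager {
        private final Queue<Integer> likelyReducers = new ArrayDeque<>();

        void expectReducer(int reducerId) {   // fed by a (hypothetical) per-node schedule
            likelyReducers.add(reducerId);
        }

        void prefetchNext() {
            Integer reducerId = likelyReducers.poll();
            if (reducerId != null) {
                // placeholder for fetching this reducer's partitions from finished maps
                System.out.println("prefetching map outputs for reducer " + reducerId);
            }
        }
    }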

Just my two cents. I'm interested in hearing opinions on this matter.

Regards, Vasco

Re: On the topic of task scheduling

Posted by Robert Evans <ev...@yahoo-inc.com>.
You are correct about my typo, should be launching reducers, not maps.

We do want a solution that is good in most cases, and preferably
automatic, because most users are not going to change any default values.
But I think you also want to give cluster administrators, as well
as individual users, the knobs to adjust whether resources are
better spent on improving the overall throughput of the cluster or
whether the run time of a job is a higher priority.  On our
clusters some jobs have a tight SLA.  We ideally want to do what we
can to meet their SLA, even if it requires using more resources.
On the other hand, running on the same cluster will be jobs with
either no SLA or a very lenient one.  In those cases we want to use
the resources as wisely as possible so that as many jobs as
possible can complete in the given time frame.  This has bigger
ramifications for the RM's scheduling, but ideally the AM would
also adjust the timing of its requests so that both work together
for a common goal.

--Bobby Evans  

On 9/4/12 8:59 AM, "Vasco Visser" <va...@gmail.com> wrote:

>On Tue, Sep 4, 2012 at 3:11 PM, Robert Evans <ev...@yahoo-inc.com> wrote:
>> The other thing to point out too is that in order to solve this problem
>> perfectly you litterly have to solve the halting problem.  You have to
>> predict if the maps are going to finish quickly or slowly.  If they
>>finish
>> quickly then you want to launch reduces quickly to start fetching data
>> from the mappers, if they are going to finish very slowly, then you
>>have a
>> lot of reducers taking up resources not doing anything.
>
>I agree with you that a perfect solution is not going to be feasible.
>The aim should probably be a solution that is good in many cases.
>
>> That is why there
>> is the config parameter that can be set on a per job basis to tell the
>>AM
>> when to start launch maps.
>
>I assume you mean start launching reducers
>
>> We have actually been experimenting with
>> setting this to 100% because it improves utilization of the cluster a
>>lot.
>
>thanks for pointing this out, I didn't know about this config option.
>That the utilization of the cluster improves by setting this to 1
>doesn't surprise me.
>
>Maybe it is a good idea to introduce a concept like "job container
>time" that captures how much resources a job uses in its life time.
>For example, if a job uses 10 mappers each for a minute and 10
>reducers also each for a minute, then the container time would be 20
>minutes. Having idle reducer will increase container time.
>
>A conceptually simple method to optimize the container time of a job
>is to let the AM monitor for each scheduled reducer how much of the
>time it is waiting for mappers to produce intermediate data  (maybe
>embed this in the heartbeat?). If the average waiting for all
>scheduled reducers is above a certain proportion (say waiting more
>than 25% of the time or smt), then the AM can decide to discard
>some/all reducers and give the freed resources to mappers.
>
>This is just an idea, I don't know about the feasibility. Also I
>didn't think about the relationship between optimizing container time
>for a single job and optimizing it for all jobs utilizing on the
>cluster. Might be that minimizing for each job gives minimal overall,
>but not sure.
>
>> On 9/2/12 1:46 PM, "Arun C Murthy" <ac...@hortonworks.com> wrote:
>>
>>>Vasco,
>>>
>>> Welcome to Hadoop!
>>>
>>> You observations are all correct - in simplest case you launch all
>>>reduces up front (we used to do that initially) and get a good
>>>'pipeline'
>>>between maps, shuffle (i.e. moving map-outputs to reduces) and the
>>>reduce
>>>itself.
>>>
>>> However, one thing to remember is that keeping reduces up and running
>>>without sufficient maps being completed is a waste of resources in the
>>>cluster. As a result, we have a simple heuristic in hadoop-1 i.e. do not
>>>launch reduces until a certain percentage of the job's maps are complete
>>>- by default it's set to 5%. However, there still is a flaw with it
>>>(regardless of what you set it to be i.e. 5% or 50%). If it's too high,
>>>you lose the 'pipeline' and too low (5%), reduces still spin waiting for
>>>all maps to complete wasting resources in the cluster.
>>>
>>> Given that, we've implemented the heuristic you've described below for
>>>hadoop-2 which is better at balancing resource-utilization v/s
>>>pipelining
>>>or job latency.
>>>
>>> However, as you've pointed out there are several improvements which are
>>>feasible. But, remember that the complexity involved has on a number of
>>>factors you've already mentioned:
>>> # Job size (a job with 100m/10r v/s 100000m/10000r)
>>> # Skew for reduces
>>> # Resource availability i.e. other active jobs/shuffles in the system,
>>>network bandwidth etc.
>>>
>>> If you look at an ideal shuffle it will look so (pardon my primitive
>>>scribble):
>>> http://people.apache.org/~acmurthy/ideal-shuffle.png
>>>
>>> From that graph:
>>> # X i.e. when to launch reduces depends on resource availability, job
>>>size & maps' completion rate.
>>> # Slope of shuffles (red worm) depends on network b/w, skew etc.
>>>
>>> None of your points are invalid - I'm just pointing out the
>>>possibilities and complexities.
>>>
>>> Your points about aggregation are also valid, look at
>>>http://code.google.com/p/sailfish/ for e.g.
>>>
>>> One of the advantages of hadoop-2 is that anyone can play with these
>>>heuristics and implement your own - I'd love to help if you are
>>>interested in playing with them.
>>>
>>> Related jiras:
>>> https://issues.apache.org/jira/browse/MAPREDUCE-4584
>>>
>>>hth,
>>>Arun
>>>
>>>On Sep 2, 2012, at 9:34 AM, Vasco Visser wrote:
>>>
>>>> Hi,
>>>>
>>>> I am new to the list, I am working with hadoop in the context of my
>>>> MSc graduation project (has nothing to do with task scheduling per
>>>> se). I came across task scheduling because I ran into the fifo
>>>> starvation bug (MAPREDUCE-4613). Now, I am running 2.1.0 branch where
>>>> the fifo starvation issue is solved. The behavior of task scheduling I
>>>> observe in this branch is as follows. It begins with all containers
>>>> allocated to mappers. Pretty quickly reducers are starting to be
>>>> scheduled. In a linear way more containers are given to reducers,
>>>> until about 50% (does anybody know why 50%?) of available containers
>>>> are reducers (this point is reached when ~ 50% of the mappers are
>>>> finished). It stays ~50-50 for until all mappers are scheduled. Only
>>>> then the proportion of containers allocated to reducers is increased
>>>> to > 50%.
>>>>
>>>> I don't think this is in general quite the optimal (in terms of total
>>>> job completion time) scheduling behavior. The reason being that the
>>>> last reducer can only be scheduled when a free container becomes
>>>> available after all mappers are scheduled. Thus, in order to shorten
>>>> total job completion time the last reducer must be scheduled as early
>>>> as possible.
>>>>
>>>> For the following gedankenexperiment, assume # reducer is set to 99%
>>>> capacity, as suggested somewhere in the hadoop docs, and that each
>>>> reducer will process roughly the same amount of work. I am going to
>>>> schedule as in 2.1.0, but instead of allocating reducers slowly up to
>>>> 50 % of capacity, I am just going to take away containers. Thus, the
>>>> amount of map work is the same as in 2.1.0, only no reduce work will
>>>> be done. At the point that the proportion of reducers would increased
>>>> to more than 50% of the containers (i.e., near the end of the map
>>>> phase), I schedule all reducers in the containers I took away, making
>>>> sure that the last reducer is scheduled at the same moment as it would
>>>> be in 2.1.0.  My claim is that the job completion time of this
>>>> hypothetical scheduling is about the same as the scheduling in 2.1.0
>>>> (as the last reducer is scheduled at the same time), even though I
>>>> took away 50% of the available resources for a large part of the job!
>>>> The conclusion is that it would be better to allocate all available
>>>> containers to mappers, and that reducers are starting to be scheduled
>>>> when the map phase is nearing its end, instead of right at the
>>>> beginning of the job.
>>>>
>>>> Scheduling reducers early seems to me the way to go only when: 1) the
>>>> output from mappers is very skewed, i.e., some reducers are expected
>>>> to need much more time than others, 2) the network connection between
>>>> nodes is (expected to be) a big bottleneck, i.e., schedule reducers
>>>> early to smear out data transfer over the lifetime of a job, or 3)
>>>> there is no contention for resource containers.
>>>>
>>>> with regard to point 1: skewedness can be determined by looking at
>>>> relative sizes of partitioned mapper output.
>>>>
>>>> with regard to point 2: I think the network is only a bottleneck if it
>>>> feeds tuples slower than the reducer can merge sort the tuples (am I
>>>> right?). Also, it might be a nice optimization to transfer the
>>>> intermediate data to the machine that is going/likely to run a
>>>> specific reducer before the reducer is actually ran there (e.g.,
>>>> something like a per machine prefetch manager?). A per machine task
>>>> scheduling queue would be needed for this, to determine where a
>>>> reducer is going/likely to be scheduled.
>>>>
>>>> Just my two cents. I'm interested in hearing opinions on this matter.
>>>>
>>>> Regards, Vasco
>>>
>>>--
>>>Arun C. Murthy
>>>Hortonworks Inc.
>>>http://hortonworks.com/
>>>
>>>
>>


Re: On the topic of task scheduling

Posted by Vasco Visser <va...@gmail.com>.
On Tue, Sep 4, 2012 at 3:11 PM, Robert Evans <ev...@yahoo-inc.com> wrote:
> The other thing to point out too is that in order to solve this problem
> perfectly you litterly have to solve the halting problem.  You have to
> predict if the maps are going to finish quickly or slowly.  If they finish
> quickly then you want to launch reduces quickly to start fetching data
> from the mappers, if they are going to finish very slowly, then you have a
> lot of reducers taking up resources not doing anything.

I agree with you that a perfect solution is not going to be feasible.
The aim should probably be a solution that is good in many cases.

> That is why there
> is the config parameter that can be set on a per job basis to tell the AM
> when to start launch maps.

I assume you mean start launching reducers.

> We have actually been experimenting with
> setting this to 100% because it improves utilization of the cluster a lot.

Thanks for pointing this out, I didn't know about this config
option. It doesn't surprise me that the utilization of the cluster
improves by setting this to 1.

Maybe it is a good idea to introduce a concept like "job container
time" that captures how many resources a job uses in its lifetime.
For example, if a job uses 10 mappers for a minute each and 10
reducers also for a minute each, then the container time would be
20 minutes. Having idle reducers will increase the container time.

A conceptually simple method to optimize the container time of a
job is to let the AM monitor, for each scheduled reducer, how much
of the time it is waiting for mappers to produce intermediate data
(maybe embed this in the heartbeat?). If the average waiting time
over all scheduled reducers is above a certain proportion (say,
waiting more than 25% of the time or so), then the AM can decide to
discard some or all reducers and give the freed resources to
mappers.
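
A minimal sketch of that decision rule (hypothetical, not how the
MR AM is actually written), assuming each reducer reports the
fraction of its lifetime spent waiting for map output:

    import java.util.List;

    // Hypothetical heuristic: if the reducers scheduled so far spend, on
    // average, more than some fraction of their time waiting for map output,
    // give some of those containers back to the mappers.
    public class ReducerWaitHeuristic {
        static final double MAX_AVG_WAIT_FRACTION = 0.25;

        static boolean shouldPreemptReducers(List<Double> waitFractions) {
            if (waitFractions.isEmpty()) {
                return false;
            }
            double avg = waitFractions.stream()
                    .mapToDouble(Double::doubleValue).average().orElse(0);
            return avg > MAX_AVG_WAIT_FRACTION;
        }

        public static void main(String[] args) {
            // e.g. three reducers waiting 10%, 40% and 50% of the time -> preempt
            System.out.println(shouldPreemptReducers(List.of(0.1, 0.4, 0.5)));
        }
    }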

This is just an idea, I don't know about its feasibility. Also, I
haven't thought about the relationship between optimizing container
time for a single job and optimizing it for all jobs running on the
cluster. It might be that minimizing it for each job gives the
minimum overall, but I am not sure.

> On 9/2/12 1:46 PM, "Arun C Murthy" <ac...@hortonworks.com> wrote:
>
>>Vasco,
>>
>> Welcome to Hadoop!
>>
>> You observations are all correct - in simplest case you launch all
>>reduces up front (we used to do that initially) and get a good 'pipeline'
>>between maps, shuffle (i.e. moving map-outputs to reduces) and the reduce
>>itself.
>>
>> However, one thing to remember is that keeping reduces up and running
>>without sufficient maps being completed is a waste of resources in the
>>cluster. As a result, we have a simple heuristic in hadoop-1 i.e. do not
>>launch reduces until a certain percentage of the job's maps are complete
>>- by default it's set to 5%. However, there still is a flaw with it
>>(regardless of what you set it to be i.e. 5% or 50%). If it's too high,
>>you lose the 'pipeline' and too low (5%), reduces still spin waiting for
>>all maps to complete wasting resources in the cluster.
>>
>> Given that, we've implemented the heuristic you've described below for
>>hadoop-2 which is better at balancing resource-utilization v/s pipelining
>>or job latency.
>>
>> However, as you've pointed out there are several improvements which are
>>feasible. But, remember that the complexity involved has on a number of
>>factors you've already mentioned:
>> # Job size (a job with 100m/10r v/s 100000m/10000r)
>> # Skew for reduces
>> # Resource availability i.e. other active jobs/shuffles in the system,
>>network bandwidth etc.
>>
>> If you look at an ideal shuffle it will look so (pardon my primitive
>>scribble):
>> http://people.apache.org/~acmurthy/ideal-shuffle.png
>>
>> From that graph:
>> # X i.e. when to launch reduces depends on resource availability, job
>>size & maps' completion rate.
>> # Slope of shuffles (red worm) depends on network b/w, skew etc.
>>
>> None of your points are invalid - I'm just pointing out the
>>possibilities and complexities.
>>
>> Your points about aggregation are also valid, look at
>>http://code.google.com/p/sailfish/ for e.g.
>>
>> One of the advantages of hadoop-2 is that anyone can play with these
>>heuristics and implement your own - I'd love to help if you are
>>interested in playing with them.
>>
>> Related jiras:
>> https://issues.apache.org/jira/browse/MAPREDUCE-4584
>>
>>hth,
>>Arun
>>
>>On Sep 2, 2012, at 9:34 AM, Vasco Visser wrote:
>>
>>> Hi,
>>>
>>> I am new to the list, I am working with hadoop in the context of my
>>> MSc graduation project (has nothing to do with task scheduling per
>>> se). I came across task scheduling because I ran into the fifo
>>> starvation bug (MAPREDUCE-4613). Now, I am running 2.1.0 branch where
>>> the fifo starvation issue is solved. The behavior of task scheduling I
>>> observe in this branch is as follows. It begins with all containers
>>> allocated to mappers. Pretty quickly reducers are starting to be
>>> scheduled. In a linear way more containers are given to reducers,
>>> until about 50% (does anybody know why 50%?) of available containers
>>> are reducers (this point is reached when ~ 50% of the mappers are
>>> finished). It stays ~50-50 for until all mappers are scheduled. Only
>>> then the proportion of containers allocated to reducers is increased
>>> to > 50%.
>>>
>>> I don't think this is in general quite the optimal (in terms of total
>>> job completion time) scheduling behavior. The reason being that the
>>> last reducer can only be scheduled when a free container becomes
>>> available after all mappers are scheduled. Thus, in order to shorten
>>> total job completion time the last reducer must be scheduled as early
>>> as possible.
>>>
>>> For the following gedankenexperiment, assume # reducer is set to 99%
>>> capacity, as suggested somewhere in the hadoop docs, and that each
>>> reducer will process roughly the same amount of work. I am going to
>>> schedule as in 2.1.0, but instead of allocating reducers slowly up to
>>> 50 % of capacity, I am just going to take away containers. Thus, the
>>> amount of map work is the same as in 2.1.0, only no reduce work will
>>> be done. At the point that the proportion of reducers would increased
>>> to more than 50% of the containers (i.e., near the end of the map
>>> phase), I schedule all reducers in the containers I took away, making
>>> sure that the last reducer is scheduled at the same moment as it would
>>> be in 2.1.0.  My claim is that the job completion time of this
>>> hypothetical scheduling is about the same as the scheduling in 2.1.0
>>> (as the last reducer is scheduled at the same time), even though I
>>> took away 50% of the available resources for a large part of the job!
>>> The conclusion is that it would be better to allocate all available
>>> containers to mappers, and that reducers are starting to be scheduled
>>> when the map phase is nearing its end, instead of right at the
>>> beginning of the job.
>>>
>>> Scheduling reducers early seems to me the way to go only when: 1) the
>>> output from mappers is very skewed, i.e., some reducers are expected
>>> to need much more time than others, 2) the network connection between
>>> nodes is (expected to be) a big bottleneck, i.e., schedule reducers
>>> early to smear out data transfer over the lifetime of a job, or 3)
>>> there is no contention for resource containers.
>>>
>>> with regard to point 1: skewedness can be determined by looking at
>>> relative sizes of partitioned mapper output.
>>>
>>> with regard to point 2: I think the network is only a bottleneck if it
>>> feeds tuples slower than the reducer can merge sort the tuples (am I
>>> right?). Also, it might be a nice optimization to transfer the
>>> intermediate data to the machine that is going/likely to run a
>>> specific reducer before the reducer is actually ran there (e.g.,
>>> something like a per machine prefetch manager?). A per machine task
>>> scheduling queue would be needed for this, to determine where a
>>> reducer is going/likely to be scheduled.
>>>
>>> Just my two cents. I'm interested in hearing opinions on this matter.
>>>
>>> Regards, Vasco
>>
>>--
>>Arun C. Murthy
>>Hortonworks Inc.
>>http://hortonworks.com/
>>
>>
>

Re: On the topic of task scheduling

Posted by Robert Evans <ev...@yahoo-inc.com>.
The other thing to point out is that in order to solve this problem
perfectly you literally have to solve the halting problem.  You have
to predict whether the maps are going to finish quickly or slowly.
If they finish quickly then you want to launch reduces quickly to
start fetching data from the mappers; if they are going to finish
very slowly, then you have a lot of reducers taking up resources not
doing anything.  That is why there is the config parameter that can
be set on a per-job basis to tell the AM when to start launching
maps.  We have actually been experimenting with setting this to 100%
because it improves utilization of the cluster a lot.  But be
careful, there are a lot of bugs that you might run into if you do
this.  I think we have fixed all of them, but I don't know how many
have been merged into 2.1 and how many are still sitting on 2.2.
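
For reference, the knob discussed here appears to be the reducer
slowstart threshold. A minimal per-job sketch, assuming the Hadoop 2
property name mapreduce.job.reduce.slowstart.completedmaps, could
look like this (1.0f corresponds to the "100%" experiment above):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    // Sketch: set the fraction of maps that must complete before the AM
    // starts launching reducers for this job.
    public class SlowstartExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 1.0f);
            Job job = Job.getInstance(conf, "slowstart-demo");
            // ... set mapper/reducer/input/output as usual, then submit the job
        }
    }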

--Bobby

On 9/2/12 1:46 PM, "Arun C Murthy" <ac...@hortonworks.com> wrote:

>Vasco,
>
> Welcome to Hadoop!
>
> You observations are all correct - in simplest case you launch all
>reduces up front (we used to do that initially) and get a good 'pipeline'
>between maps, shuffle (i.e. moving map-outputs to reduces) and the reduce
>itself.
>
> However, one thing to remember is that keeping reduces up and running
>without sufficient maps being completed is a waste of resources in the
>cluster. As a result, we have a simple heuristic in hadoop-1 i.e. do not
>launch reduces until a certain percentage of the job's maps are complete
>- by default it's set to 5%. However, there still is a flaw with it
>(regardless of what you set it to be i.e. 5% or 50%). If it's too high,
>you lose the 'pipeline' and too low (5%), reduces still spin waiting for
>all maps to complete wasting resources in the cluster.
>
> Given that, we've implemented the heuristic you've described below for
>hadoop-2 which is better at balancing resource-utilization v/s pipelining
>or job latency.
>
> However, as you've pointed out there are several improvements which are
>feasible. But, remember that the complexity involved has on a number of
>factors you've already mentioned:
> # Job size (a job with 100m/10r v/s 100000m/10000r)
> # Skew for reduces
> # Resource availability i.e. other active jobs/shuffles in the system,
>network bandwidth etc.
>
> If you look at an ideal shuffle it will look so (pardon my primitive
>scribble):
> http://people.apache.org/~acmurthy/ideal-shuffle.png
>
> From that graph:
> # X i.e. when to launch reduces depends on resource availability, job
>size & maps' completion rate.
> # Slope of shuffles (red worm) depends on network b/w, skew etc.
>
> None of your points are invalid - I'm just pointing out the
>possibilities and complexities.
>
> Your points about aggregation are also valid, look at
>http://code.google.com/p/sailfish/ for e.g.
>
> One of the advantages of hadoop-2 is that anyone can play with these
>heuristics and implement your own - I'd love to help if you are
>interested in playing with them.
>
> Related jiras:
> https://issues.apache.org/jira/browse/MAPREDUCE-4584
>
>hth,
>Arun
>
>On Sep 2, 2012, at 9:34 AM, Vasco Visser wrote:
>
>> Hi,
>> 
>> I am new to the list, I am working with hadoop in the context of my
>> MSc graduation project (has nothing to do with task scheduling per
>> se). I came across task scheduling because I ran into the fifo
>> starvation bug (MAPREDUCE-4613). Now, I am running 2.1.0 branch where
>> the fifo starvation issue is solved. The behavior of task scheduling I
>> observe in this branch is as follows. It begins with all containers
>> allocated to mappers. Pretty quickly reducers are starting to be
>> scheduled. In a linear way more containers are given to reducers,
>> until about 50% (does anybody know why 50%?) of available containers
>> are reducers (this point is reached when ~ 50% of the mappers are
>> finished). It stays ~50-50 for until all mappers are scheduled. Only
>> then the proportion of containers allocated to reducers is increased
>> to > 50%.
>> 
>> I don't think this is in general quite the optimal (in terms of total
>> job completion time) scheduling behavior. The reason being that the
>> last reducer can only be scheduled when a free container becomes
>> available after all mappers are scheduled. Thus, in order to shorten
>> total job completion time the last reducer must be scheduled as early
>> as possible.
>> 
>> For the following gedankenexperiment, assume # reducer is set to 99%
>> capacity, as suggested somewhere in the hadoop docs, and that each
>> reducer will process roughly the same amount of work. I am going to
>> schedule as in 2.1.0, but instead of allocating reducers slowly up to
>> 50 % of capacity, I am just going to take away containers. Thus, the
>> amount of map work is the same as in 2.1.0, only no reduce work will
>> be done. At the point that the proportion of reducers would increased
>> to more than 50% of the containers (i.e., near the end of the map
>> phase), I schedule all reducers in the containers I took away, making
>> sure that the last reducer is scheduled at the same moment as it would
>> be in 2.1.0.  My claim is that the job completion time of this
>> hypothetical scheduling is about the same as the scheduling in 2.1.0
>> (as the last reducer is scheduled at the same time), even though I
>> took away 50% of the available resources for a large part of the job!
>> The conclusion is that it would be better to allocate all available
>> containers to mappers, and that reducers are starting to be scheduled
>> when the map phase is nearing its end, instead of right at the
>> beginning of the job.
>> 
>> Scheduling reducers early seems to me the way to go only when: 1) the
>> output from mappers is very skewed, i.e., some reducers are expected
>> to need much more time than others, 2) the network connection between
>> nodes is (expected to be) a big bottleneck, i.e., schedule reducers
>> early to smear out data transfer over the lifetime of a job, or 3)
>> there is no contention for resource containers.
>> 
>> with regard to point 1: skewedness can be determined by looking at
>> relative sizes of partitioned mapper output.
>> 
>> with regard to point 2: I think the network is only a bottleneck if it
>> feeds tuples slower than the reducer can merge sort the tuples (am I
>> right?). Also, it might be a nice optimization to transfer the
>> intermediate data to the machine that is going/likely to run a
>> specific reducer before the reducer is actually ran there (e.g.,
>> something like a per machine prefetch manager?). A per machine task
>> scheduling queue would be needed for this, to determine where a
>> reducer is going/likely to be scheduled.
>> 
>> Just my two cents. I'm interested in hearing opinions on this matter.
>> 
>> Regards, Vasco
>
>--
>Arun C. Murthy
>Hortonworks Inc.
>http://hortonworks.com/
>
>


Re: On the topic of task scheduling

Posted by Arun C Murthy <ac...@hortonworks.com>.
Vasco,

 Welcome to Hadoop!

 Your observations are all correct - in the simplest case you launch all reduces up front (we used to do that initially) and get a good 'pipeline' between the maps, the shuffle (i.e. moving map outputs to reduces) and the reduce itself.

 However, one thing to remember is that keeping reduces up and running without sufficient maps being completed is a waste of resources in the cluster. As a result, we have a simple heuristic in hadoop-1, i.e. do not launch reduces until a certain percentage of the job's maps are complete - by default it's set to 5%. However, there is still a flaw with it (regardless of what you set it to, i.e. 5% or 50%): if it's too high, you lose the 'pipeline', and if it's too low (5%), reduces still spin waiting for all maps to complete, wasting resources in the cluster.
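
As an illustration of that hadoop-1 knob (assuming the classic
property name mapred.reduce.slowstart.completed.maps, default 0.05),
a job could raise the threshold like this:

    import org.apache.hadoop.mapred.JobConf;

    // Sketch: launch reduces only after half of the job's maps have
    // completed, instead of the 5% default.
    public class SlowstartHadoop1 {
        public static void main(String[] args) {
            JobConf conf = new JobConf();
            conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.5f);
        }
    }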

 Given that, we've implemented the heuristic you've described below for hadoop-2 which is better at balancing resource-utilization v/s pipelining or job latency.

 However, as you've pointed out, there are several improvements which are feasible. But remember that the complexity involved depends on a number of factors you've already mentioned:
 # Job size (a job with 100m/10r v/s 100000m/10000r)
 # Skew for reduces
 # Resource availability i.e. other active jobs/shuffles in the system, network bandwidth etc.

 If you look at an ideal shuffle it will look so (pardon my primitive scribble):
 http://people.apache.org/~acmurthy/ideal-shuffle.png

 From that graph:
 # X i.e. when to launch reduces depends on resource availability, job size & maps' completion rate.
 # Slope of shuffles (red worm) depends on network b/w, skew etc.

 None of your points are invalid - I'm just pointing out the possibilities and complexities.

 Your points about aggregation are also valid, look at http://code.google.com/p/sailfish/ for example.

 One of the advantages of hadoop-2 is that anyone can play with these heuristics and implement your own - I'd love to help if you are interested in playing with them.

 Related jiras:
 https://issues.apache.org/jira/browse/MAPREDUCE-4584

hth,
Arun

On Sep 2, 2012, at 9:34 AM, Vasco Visser wrote:

> Hi,
> 
> I am new to the list, I am working with hadoop in the context of my
> MSc graduation project (has nothing to do with task scheduling per
> se). I came across task scheduling because I ran into the fifo
> starvation bug (MAPREDUCE-4613). Now, I am running 2.1.0 branch where
> the fifo starvation issue is solved. The behavior of task scheduling I
> observe in this branch is as follows. It begins with all containers
> allocated to mappers. Pretty quickly reducers are starting to be
> scheduled. In a linear way more containers are given to reducers,
> until about 50% (does anybody know why 50%?) of available containers
> are reducers (this point is reached when ~ 50% of the mappers are
> finished). It stays ~50-50 for until all mappers are scheduled. Only
> then the proportion of containers allocated to reducers is increased
> to > 50%.
> 
> I don't think this is in general quite the optimal (in terms of total
> job completion time) scheduling behavior. The reason being that the
> last reducer can only be scheduled when a free container becomes
> available after all mappers are scheduled. Thus, in order to shorten
> total job completion time the last reducer must be scheduled as early
> as possible.
> 
> For the following gedankenexperiment, assume # reducer is set to 99%
> capacity, as suggested somewhere in the hadoop docs, and that each
> reducer will process roughly the same amount of work. I am going to
> schedule as in 2.1.0, but instead of allocating reducers slowly up to
> 50 % of capacity, I am just going to take away containers. Thus, the
> amount of map work is the same as in 2.1.0, only no reduce work will
> be done. At the point that the proportion of reducers would increased
> to more than 50% of the containers (i.e., near the end of the map
> phase), I schedule all reducers in the containers I took away, making
> sure that the last reducer is scheduled at the same moment as it would
> be in 2.1.0.  My claim is that the job completion time of this
> hypothetical scheduling is about the same as the scheduling in 2.1.0
> (as the last reducer is scheduled at the same time), even though I
> took away 50% of the available resources for a large part of the job!
> The conclusion is that it would be better to allocate all available
> containers to mappers, and that reducers are starting to be scheduled
> when the map phase is nearing its end, instead of right at the
> beginning of the job.
> 
> Scheduling reducers early seems to me the way to go only when: 1) the
> output from mappers is very skewed, i.e., some reducers are expected
> to need much more time than others, 2) the network connection between
> nodes is (expected to be) a big bottleneck, i.e., schedule reducers
> early to smear out data transfer over the lifetime of a job, or 3)
> there is no contention for resource containers.
> 
> with regard to point 1: skewedness can be determined by looking at
> relative sizes of partitioned mapper output.
> 
> with regard to point 2: I think the network is only a bottleneck if it
> feeds tuples slower than the reducer can merge sort the tuples (am I
> right?). Also, it might be a nice optimization to transfer the
> intermediate data to the machine that is going/likely to run a
> specific reducer before the reducer is actually ran there (e.g.,
> something like a per machine prefetch manager?). A per machine task
> scheduling queue would be needed for this, to determine where a
> reducer is going/likely to be scheduled.
> 
> Just my two cents. I'm interested in hearing opinions on this matter.
> 
> Regards, Vasco

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/