Posted to common-dev@hadoop.apache.org by Zhenhua Guo <je...@gmail.com> on 2010/10/28 23:52:49 UTC

mapper and reducer scheduling

Hi, all
  I wonder how Hadoop schedules mappers and reducers (e.g., does it
consider load balancing and affinity to data?). For example, how does it
decide on which nodes mappers and reducers are executed, and when?
  Thanks!

Gerald

Re: mapper and reducer scheduling

Posted by Zhenhua Guo <je...@gmail.com>.
Thanks, Jeff, Harsh, He, and Hemanth. That information is quite helpful!

Gerald

On Mon, Nov 1, 2010 at 12:01 AM, Hemanth Yamijala <yh...@gmail.com> wrote:
> Hi,
>
> On Mon, Nov 1, 2010 at 9:13 AM, He Chen <ai...@gmail.com> wrote:
>> If you use the default scheduler of hadoop 0.20.2 or higher. The
>> jobQueueScheduler will take the data locality into account.
>
> This is true irrespective of the scheduler in use. Other schedulers
> currently add a layer to decide which job to pick up first based on
> constraints they choose to satisfy - like fairness, queue capacities
> etc. Once a job is picked up, the logic for picking up a task within
> the job is currently in framework code that all schedulers use.
>
>> That means when
>> a heart beat from TT arrives, the JT will first check a cache which is a map
>> of node and data-local tasks this node has.  The JT will assign node local
>> task first, then the rack local, non-local, recover and speculative tasks if
>> they have default priorities.
>>
>> If a TT get a non-local task, it will query the nodes which have the data
>> and finish this task, you can also decide to keep those fetched data on this
>> TT or not by configuring the Hadoop mapred-site.xml file.
>>
>> BTW, even TT get a data local task, it may also ask other data owner (if you
>> have more than one replica)for data to accelerate the process. (??? my
>> understanding, any one can confirm)
>
> Not that I am aware of. The task's input location is used directly to
> read the data.
>
> Thanks
> Hemanth
>>
>> Hope this will help.
>>
>> Chen
>>
>> On Sun, Oct 31, 2010 at 9:49 PM, Zhenhua Guo <je...@gmail.com> wrote:
>>
>>> Thanks!
>>> One more question. Is the input file replicated on each node where a
>>> mapper is run? Or just the portion processed by a mapper is
>>> transferred?
>>>
>>> Gerald
>>>
>>> On Fri, Oct 29, 2010 at 10:11 AM, Harsh J <qw...@gmail.com> wrote:
>>> > Hello,
>>> >
>>> > On Fri, Oct 29, 2010 at 12:45 PM, Jeff Zhang <zj...@gmail.com> wrote:
>>> >> TaskTracker will tell JobTracker how many free slots it has through
>>> >> heartbeat. And JobTracker will choose the best tasktracker with the
>>> >> consideration of data locality.
>>> >
>>> > Yes. To add some more, a scheduler is responsible to do assignments of
>>> > tasks (based on various stats, including data locality) to proper
>>> > tasktrackers. Scheduler.assignTasks(TaskTracker) is used to assign a
>>> > TaskTracker its tasks, and the scheduler type is configurable (Some
>>> > examples are Eager/FIFO scheduler, Capacity scheduler, etc.).
>>> >
>>> > This scheduling is done when a heart beat response is to be sent back
>>> > to a TaskTracker that called JobTracker.heartbeat(...).
>>> >
>>> >>
>>> >>
>>> >> On Thu, Oct 28, 2010 at 2:52 PM, Zhenhua Guo <je...@gmail.com> wrote:
>>> >>> Hi, all
>>> >>>  I wonder how Hadoop schedules mappers and reducers (e.g. consider
>>> >>> load balancing, affinity to data?). For example, how to decide on
>>> >>> which nodes mappers and reducers are to be executed and when.
>>> >>>  Thanks!
>>> >>>
>>> >>> Gerald
>>> >>>
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Best Regards
>>> >>
>>> >> Jeff Zhang
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > Harsh J
>>> > www.harshj.com
>>> >
>>>
>>
>

Re: mapper and reducer scheduling

Posted by Hemanth Yamijala <yh...@gmail.com>.
Hi,

On Mon, Nov 1, 2010 at 9:13 AM, He Chen <ai...@gmail.com> wrote:
> If you use the default scheduler of hadoop 0.20.2 or higher. The
> jobQueueScheduler will take the data locality into account.

This is true irrespective of the scheduler in use. Other schedulers
currently add a layer to decide which job to pick up first based on
constraints they choose to satisfy - like fairness, queue capacities
etc. Once a job is picked up, the logic for picking up a task within
the job is currently in framework code that all schedulers use.
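
As a rough plain-Java sketch of that two-layer split (all names below are
made up, not Hadoop's actual classes): the pluggable scheduler only decides
which job goes first, and a shared routine then hands out a task from the
chosen job.

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch (not Hadoop's real classes): schedulers differ in how
// they order jobs; picking a task inside the chosen job is shared logic.
public class TwoLayerSchedulingSketch {

    static class Job {
        final String name;
        final long submitTime;
        int pendingTasks;
        Job(String name, long submitTime, int pendingTasks) {
            this.name = name;
            this.submitTime = submitTime;
            this.pendingTasks = pendingTasks;
        }
    }

    // Layer 1: pluggable policy. FIFO here; a fair or capacity scheduler
    // would order the jobs differently but reuse the same layer below.
    static Job pickJob(List<Job> jobs) {
        Job earliest = null;
        for (Job j : jobs) {
            if (j.pendingTasks > 0
                    && (earliest == null || j.submitTime < earliest.submitTime)) {
                earliest = j;
            }
        }
        return earliest;
    }

    // Layer 2: shared "framework" logic that picks a task within the chosen
    // job (this is where locality preferences would be applied).
    static String pickTaskWithin(Job job) {
        job.pendingTasks--;
        return job.name + "-task";
    }

    public static void main(String[] args) {
        List<Job> jobs = new ArrayList<Job>();
        jobs.add(new Job("wordcount", 100, 2));
        jobs.add(new Job("grep", 50, 1));
        Job chosen = pickJob(jobs);                  // FIFO picks "grep" (submitted earlier)
        System.out.println(pickTaskWithin(chosen));  // prints "grep-task"
    }
}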

> That means when
> a heart beat from TT arrives, the JT will first check a cache which is a map
> of node and data-local tasks this node has.  The JT will assign node local
> task first, then the rack local, non-local, recover and speculative tasks if
> they have default priorities.
>
> If a TT get a non-local task, it will query the nodes which have the data
> and finish this task, you can also decide to keep those fetched data on this
> TT or not by configuring the Hadoop mapred-site.xml file.
>
> BTW, even TT get a data local task, it may also ask other data owner (if you
> have more than one replica)for data to accelerate the process. (??? my
> understanding, any one can confirm)

Not that I am aware of. The task's input location is used directly to
read the data.

Thanks
Hemanth
>
> Hope this will help.
>
> Chen
>
> On Sun, Oct 31, 2010 at 9:49 PM, Zhenhua Guo <je...@gmail.com> wrote:
>
>> Thanks!
>> One more question. Is the input file replicated on each node where a
>> mapper is run? Or just the portion processed by a mapper is
>> transferred?
>>
>> Gerald
>>
>> On Fri, Oct 29, 2010 at 10:11 AM, Harsh J <qw...@gmail.com> wrote:
>> > Hello,
>> >
>> > On Fri, Oct 29, 2010 at 12:45 PM, Jeff Zhang <zj...@gmail.com> wrote:
>> >> TaskTracker will tell JobTracker how many free slots it has through
>> >> heartbeat. And JobTracker will choose the best tasktracker with the
>> >> consideration of data locality.
>> >
>> > Yes. To add some more, a scheduler is responsible to do assignments of
>> > tasks (based on various stats, including data locality) to proper
>> > tasktrackers. Scheduler.assignTasks(TaskTracker) is used to assign a
>> > TaskTracker its tasks, and the scheduler type is configurable (Some
>> > examples are Eager/FIFO scheduler, Capacity scheduler, etc.).
>> >
>> > This scheduling is done when a heart beat response is to be sent back
>> > to a TaskTracker that called JobTracker.heartbeat(...).
>> >
>> >>
>> >>
>> >> On Thu, Oct 28, 2010 at 2:52 PM, Zhenhua Guo <je...@gmail.com> wrote:
>> >>> Hi, all
>> >>>  I wonder how Hadoop schedules mappers and reducers (e.g. consider
>> >>> load balancing, affinity to data?). For example, how to decide on
>> >>> which nodes mappers and reducers are to be executed and when.
>> >>>  Thanks!
>> >>>
>> >>> Gerald
>> >>>
>> >>
>> >>
>> >>
>> >> --
>> >> Best Regards
>> >>
>> >> Jeff Zhang
>> >>
>> >
>> >
>> >
>> > --
>> > Harsh J
>> > www.harshj.com
>> >
>>
>

Re: mapper and reducer scheduling

Posted by He Chen <ai...@gmail.com>.
If you use the default scheduler of Hadoop 0.20.2 or higher, the
JobQueueScheduler will take data locality into account. That means that when
a heartbeat from a TT arrives, the JT first checks a cache that maps each
node to the data-local tasks it has. The JT assigns node-local tasks first,
then rack-local, non-local, recovery, and speculative tasks, assuming they
have the default priorities.
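
A minimal plain-Java sketch of that preference order, assuming a hypothetical
cache from node/rack name to the tasks whose data lives there (each task sits
in only one queue here, which glosses over the real bookkeeping):

import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch of the preference order described above: on a heartbeat,
// try a node-local task first, then a rack-local one, then any remaining task.
public class LocalityOrderSketch {

    // Cache: node or rack name -> tasks whose input data lives there.
    static Map<String, Queue<String>> localTasks = new HashMap<String, Queue<String>>();
    static Queue<String> remainingTasks = new ArrayDeque<String>();

    static String assignTask(String node, String rack) {
        Queue<String> nodeLocal = localTasks.get(node);
        if (nodeLocal != null && !nodeLocal.isEmpty()) {
            return nodeLocal.poll();       // node-local
        }
        Queue<String> rackLocal = localTasks.get(rack);
        if (rackLocal != null && !rackLocal.isEmpty()) {
            return rackLocal.poll();       // rack-local
        }
        return remainingTasks.poll();      // non-local (null if nothing is left)
    }

    public static void main(String[] args) {
        Queue<String> onNode1 = new ArrayDeque<String>();
        onNode1.add("map-0");
        Queue<String> onRackA = new ArrayDeque<String>();
        onRackA.add("map-1");
        localTasks.put("node1", onNode1);
        localTasks.put("/rack-A", onRackA);
        remainingTasks.add("map-2");
        System.out.println(assignTask("node1", "/rack-A")); // map-0 (node-local)
        System.out.println(assignTask("node2", "/rack-A")); // map-1 (rack-local)
        System.out.println(assignTask("node3", "/rack-B")); // map-2 (non-local)
    }
}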

If a TT gets a non-local task, it will fetch the data from the nodes that
have it and finish the task. You can also decide whether to keep that fetched
data on the TT by configuring the Hadoop mapred-site.xml file.

BTW, even when a TT gets a data-local task, it may also ask other owners of
the data (if you have more than one replica) for data to accelerate the
process. (??? my understanding, can anyone confirm?)

Hope this will help.

Chen

On Sun, Oct 31, 2010 at 9:49 PM, Zhenhua Guo <je...@gmail.com> wrote:

> Thanks!
> One more question. Is the input file replicated on each node where a
> mapper is run? Or just the portion processed by a mapper is
> transferred?
>
> Gerald
>
> On Fri, Oct 29, 2010 at 10:11 AM, Harsh J <qw...@gmail.com> wrote:
> > Hello,
> >
> > On Fri, Oct 29, 2010 at 12:45 PM, Jeff Zhang <zj...@gmail.com> wrote:
> >> TaskTracker will tell JobTracker how many free slots it has through
> >> heartbeat. And JobTracker will choose the best tasktracker with the
> >> consideration of data locality.
> >
> > Yes. To add some more, a scheduler is responsible to do assignments of
> > tasks (based on various stats, including data locality) to proper
> > tasktrackers. Scheduler.assignTasks(TaskTracker) is used to assign a
> > TaskTracker its tasks, and the scheduler type is configurable (Some
> > examples are Eager/FIFO scheduler, Capacity scheduler, etc.).
> >
> > This scheduling is done when a heart beat response is to be sent back
> > to a TaskTracker that called JobTracker.heartbeat(...).
> >
> >>
> >>
> >> On Thu, Oct 28, 2010 at 2:52 PM, Zhenhua Guo <je...@gmail.com> wrote:
> >>> Hi, all
> >>>  I wonder how Hadoop schedules mappers and reducers (e.g. consider
> >>> load balancing, affinity to data?). For example, how to decide on
> >>> which nodes mappers and reducers are to be executed and when.
> >>>  Thanks!
> >>>
> >>> Gerald
> >>>
> >>
> >>
> >>
> >> --
> >> Best Regards
> >>
> >> Jeff Zhang
> >>
> >
> >
> >
> > --
> > Harsh J
> > www.harshj.com
> >
>

Re: mapper and reducer scheduling

Posted by Harsh J <qw...@gmail.com>.
Hi,

On Mon, Nov 1, 2010 at 8:19 AM, Zhenhua Guo <je...@gmail.com> wrote:
> Thanks!
> One more question. Is the input file replicated on each node where a
> mapper is run? Or just the portion processed by a mapper is
> transferred?

With the use of HDFS, this is what happens: Mappers are run on nodes
where the input file's blocks are already present [Data-local map
tasks]. If TaskTracker slots are unavailable on that node for the
mapper to run, it is run somewhere else and the input block ("portion
processed by a mapper") is fetched from one of the DataNodes in the
same rack [Rack-local map tasks].
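
To illustrate the two cases, here is a small self-contained sketch
(hypothetical host and rack names, no HDFS API calls) that classifies where a
map task runs relative to the replica locations of its input block:

import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: classify a map task's locality from the hosts that hold
// replicas of its input block and the node the task actually runs on.
public class MapLocalitySketch {

    enum Locality { DATA_LOCAL, RACK_LOCAL, OFF_RACK }

    static Locality classify(String taskHost, String taskRack,
                             List<String> blockHosts, List<String> blockRacks) {
        if (blockHosts.contains(taskHost)) {
            return Locality.DATA_LOCAL;    // block already on this node
        }
        if (blockRacks.contains(taskRack)) {
            return Locality.RACK_LOCAL;    // fetched from a DataNode in the same rack
        }
        return Locality.OFF_RACK;          // fetched across racks
    }

    public static void main(String[] args) {
        List<String> hosts = Arrays.asList("node1", "node4");   // replica locations
        List<String> racks = Arrays.asList("/rack-A", "/rack-B");
        System.out.println(classify("node1", "/rack-A", hosts, racks)); // DATA_LOCAL
        System.out.println(classify("node2", "/rack-A", hosts, racks)); // RACK_LOCAL
        System.out.println(classify("node7", "/rack-C", hosts, racks)); // OFF_RACK
    }
}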

-- 
Harsh J
www.harshj.com

Re: mapper and reducer scheduling

Posted by Zhenhua Guo <je...@gmail.com>.
Thanks!
One more question: is the input file replicated on each node where a
mapper runs, or is just the portion processed by that mapper
transferred?

Gerald

On Fri, Oct 29, 2010 at 10:11 AM, Harsh J <qw...@gmail.com> wrote:
> Hello,
>
> On Fri, Oct 29, 2010 at 12:45 PM, Jeff Zhang <zj...@gmail.com> wrote:
>> TaskTracker will tell JobTracker how many free slots it has through
>> heartbeat. And JobTracker will choose the best tasktracker with the
>> consideration of data locality.
>
> Yes. To add some more, a scheduler is responsible to do assignments of
> tasks (based on various stats, including data locality) to proper
> tasktrackers. Scheduler.assignTasks(TaskTracker) is used to assign a
> TaskTracker its tasks, and the scheduler type is configurable (Some
> examples are Eager/FIFO scheduler, Capacity scheduler, etc.).
>
> This scheduling is done when a heart beat response is to be sent back
> to a TaskTracker that called JobTracker.heartbeat(...).
>
>>
>>
>> On Thu, Oct 28, 2010 at 2:52 PM, Zhenhua Guo <je...@gmail.com> wrote:
>>> Hi, all
>>>  I wonder how Hadoop schedules mappers and reducers (e.g. consider
>>> load balancing, affinity to data?). For example, how to decide on
>>> which nodes mappers and reducers are to be executed and when.
>>>  Thanks!
>>>
>>> Gerald
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> --
> Harsh J
> www.harshj.com
>

Re: mapper and reducer scheduling

Posted by Harsh J <qw...@gmail.com>.
Hello,

On Fri, Oct 29, 2010 at 12:45 PM, Jeff Zhang <zj...@gmail.com> wrote:
> TaskTracker will tell JobTracker how many free slots it has through
> heartbeat. And JobTracker will choose the best tasktracker with the
> consideration of data locality.

Yes. To add some more: a scheduler is responsible for assigning tasks
(based on various stats, including data locality) to the appropriate
TaskTrackers. Scheduler.assignTasks(TaskTracker) is used to assign a
TaskTracker its tasks, and the scheduler type is configurable (some
examples are the Eager/FIFO scheduler, the Capacity scheduler, etc.).

This scheduling happens when a heartbeat response is about to be sent back
to a TaskTracker that called JobTracker.heartbeat(...).
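
A minimal sketch of that heartbeat-driven flow; the interface below is
loosely modeled on the pluggable-scheduler idea, but the names and signatures
are illustrative, not Hadoop's real API:

import java.util.Collections;
import java.util.List;

// Illustrative sketch only: a heartbeat arrives, the pluggable scheduler is
// asked for assignments, and they are returned in the heartbeat response.
public class HeartbeatSchedulingSketch {

    interface TaskScheduler {   // stand-in for the pluggable scheduler concept
        List<String> assignTasks(String tracker, int freeMapSlots, int freeReduceSlots);
    }

    static class FifoScheduler implements TaskScheduler {
        @Override
        public List<String> assignTasks(String tracker, int freeMapSlots, int freeReduceSlots) {
            if (freeMapSlots > 0) {
                return Collections.singletonList("some-map-task");
            }
            return Collections.emptyList();
        }
    }

    // What a heartbeat handler conceptually does before replying to the TaskTracker.
    static List<String> heartbeat(TaskScheduler scheduler, String tracker,
                                  int freeMapSlots, int freeReduceSlots) {
        return scheduler.assignTasks(tracker, freeMapSlots, freeReduceSlots);
    }

    public static void main(String[] args) {
        System.out.println(heartbeat(new FifoScheduler(), "tracker_node1", 2, 1));
    }
}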

>
>
> On Thu, Oct 28, 2010 at 2:52 PM, Zhenhua Guo <je...@gmail.com> wrote:
>> Hi, all
>>  I wonder how Hadoop schedules mappers and reducers (e.g. consider
>> load balancing, affinity to data?). For example, how to decide on
>> which nodes mappers and reducers are to be executed and when.
>>  Thanks!
>>
>> Gerald
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>



-- 
Harsh J
www.harshj.com

Re: mapper and reducer scheduling

Posted by Jeff Zhang <zj...@gmail.com>.
A TaskTracker tells the JobTracker how many free slots it has through its
heartbeat, and the JobTracker then chooses the best TaskTracker, taking
data locality into consideration.
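
For illustration, a tiny sketch of the kind of per-tracker status a heartbeat
might carry, from which free slots are derived; the fields are made up, not
the real TaskTrackerStatus members:

// Illustrative only: the sort of per-tracker status a heartbeat could report,
// from which the free map and reduce slots are computed.
public class TrackerStatusSketch {

    final String trackerName;
    final String host;
    final int maxMapSlots, maxReduceSlots;
    final int runningMaps, runningReduces;

    TrackerStatusSketch(String trackerName, String host,
                        int maxMapSlots, int maxReduceSlots,
                        int runningMaps, int runningReduces) {
        this.trackerName = trackerName;
        this.host = host;
        this.maxMapSlots = maxMapSlots;
        this.maxReduceSlots = maxReduceSlots;
        this.runningMaps = runningMaps;
        this.runningReduces = runningReduces;
    }

    int freeMapSlots()    { return maxMapSlots - runningMaps; }
    int freeReduceSlots() { return maxReduceSlots - runningReduces; }

    public static void main(String[] args) {
        TrackerStatusSketch s =
                new TrackerStatusSketch("tracker_node1", "node1", 2, 2, 1, 0);
        System.out.println(s.freeMapSlots() + " free map slots, "
                + s.freeReduceSlots() + " free reduce slots");
    }
}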


On Thu, Oct 28, 2010 at 2:52 PM, Zhenhua Guo <je...@gmail.com> wrote:
> Hi, all
>  I wonder how Hadoop schedules mappers and reducers (e.g. consider
> load balancing, affinity to data?). For example, how to decide on
> which nodes mappers and reducers are to be executed and when.
>  Thanks!
>
> Gerald
>



-- 
Best Regards

Jeff Zhang