Posted to common-user@hadoop.apache.org by Saptarshi Guha <sa...@gmail.com> on 2008/06/30 15:41:26 UTC
Data-local tasks
Hello,
I recall asking this question before, but this follows on from what I
asked.
Firstly, to recap my question and Arun's specific response:
-- On May 20, 2008, at 9:03 AM, Saptarshi Guha wrote:
-- Does the "Data-local map tasks" counter mean the number of tasks
that had their input data already present on the machine they
are running on?
-- i.e. there wasn't a need to ship the data to them.
Response from Arun
-- Yes. Your understanding is correct. More specifically, it means that
the map task got scheduled on a machine on which one of the
replicas of its input-split block was present and was served by
the datanode running on that machine. *smile* Arun
Now, is Hadoop designed to schedule a map task on a machine which has
one of the replicas of its input split block?
Failing that, does it then assign the map task to a machine close to one
that contains a replica of its input split block?
Are there any performance metrics for this?
Many thanks
Saptarshi
Saptarshi Guha | saptarshi.guha@gmail.com | http://www.stat.purdue.edu/~sguha
Re: Data-local tasks
Posted by Amar Kamat <am...@yahoo-inc.com>.
Saptarshi Guha wrote:
> Now, is Hadoop designed to schedule a map task on a machine which has
> one of the replicas of its input split block?
Yes.
> Failing that, does it then assign the map task to a machine close to one
> that contains a replica of its input split block?
Scheduling is tasktracker-based rather than split-based. By that I mean
that the tasktracker asks for a task and the JT (JobTracker) schedules a
task to that tracker.
If there is any split that is data-local to the tasktracker and not yet
scheduled, it will be assigned to the tracker. If no such split can be
found, the JT will assign a high-priority split to it. The priority
among the splits is based on their ordering as given by the jobclient; by
default they are sorted on split size (decreasing order). A split is either
data-local (on the same machine), rack-local (within the same rack), or
non-local. There is no other measure of closeness. The scheduling
problem is 'given a tasktracker, find the best split' rather than
'given a split, find the best/closest tracker'.
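The tracker-driven selection described above can be illustrated with a small sketch. This is not Hadoop's actual code: the names Split, locality, assign_split and schedule, the host and rack names, and the split sizes are all hypothetical, and the logic assumes only what the description states (a data-local split is preferred, otherwise the highest-priority pending split is handed out, with priority being decreasing split size).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Split:
    name: str
    size: int            # larger splits get higher priority by default
    hosts: frozenset     # machines holding a replica of the split's block

def locality(split, host, rack_of):
    """Classify an assignment as data-local, rack-local or non-local."""
    if host in split.hosts:
        return "data-local"
    if rack_of[host] in {rack_of[h] for h in split.hosts}:
        return "rack-local"
    return "non-local"

def assign_split(pending, host):
    """Pick a split for the tracker on `host`; `pending` is in priority order."""
    for i, split in enumerate(pending):
        if host in split.hosts:          # a data-local split wins if one exists
            return pending.pop(i)
    # Otherwise fall back to the highest-priority split still pending.
    return pending.pop(0) if pending else None

def schedule(splits, requests, rack_of):
    """Serve tracker requests in arrival order; return (host, split, locality)."""
    pending = sorted(splits, key=lambda s: s.size, reverse=True)
    out = []
    for host in requests:
        split = assign_split(pending, host)
        if split is not None:
            out.append((host, split.name, locality(split, host, rack_of)))
    return out

# Hypothetical cluster: two racks with two machines each.
rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
splits = [
    Split("a", 64, frozenset({"n1"})),
    Split("b", 128, frozenset({"n3"})),
    Split("c", 32, frozenset({"n3"})),
]
for row in schedule(splits, ["n1", "n2", "n4"], rack_of):
    print(row)
```

With these made-up inputs, n1 receives split a (data-local), n2 receives split b (non-local, since no pending split is local to it and b is the largest), and n4 receives split c (rack-local). The point of the sketch is that the loop runs over requesting trackers, not over splits, so a split never "looks for" its closest machine.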
> Are there any performance metrics for this?
Re: Data-local tasks
Posted by heyongqiang <he...@software.ict.ac.cn>.
Hadoop does not implement a clever task scheduler. When a data node sends a heartbeat to the namenode and wants a task, it is simply given one.
The selection does not consider the task's input file at all.
Best regards,
Yongqiang He
2008-06-25