Posted to common-user@hadoop.apache.org by Saptarshi Guha <sa...@gmail.com> on 2008/06/30 15:41:26 UTC

Data-local tasks

Hello,
	I recall asking this question before, but this is in addition to what
I asked.
	Firstly, to recap my question and Arun's specific response:

--	On May 20, 2008, at 9:03 AM, Saptarshi Guha wrote:
--	Hello,
--	Does the "Data-local map tasks" counter mean the number of tasks
that had their input data already present on the machine they
are running on?
--	i.e., there wasn't a need to ship the data to them.

	Response from Arun
--	Yes. Your understanding is correct. More specifically, it means that
the map-task got scheduled on a machine on which one of the
--	replicas of its input-split-block was present and was served by
the datanode running on that machine. *smile* Arun


	Now, is Hadoop designed to schedule a map task on a machine which has
one of the replicas of its input split block?
	Failing that, does it then assign the map task to a machine close to one
that contains a replica of its input split block?
	Are there any performance metrics for this?

	Many thanks
	Saptarshi


Saptarshi Guha | saptarshi.guha@gmail.com | http://www.stat.purdue.edu/~sguha

Re: Data-local tasks

Posted by Amar Kamat <am...@yahoo-inc.com>.

Saptarshi Guha wrote:
> Now, is Hadoop designed to schedule a map task on a machine which has
> one of the replicas of its input split block?
Yes.
> Failing that, does it then assign the map task to a machine close to one
> that contains a replica of its input split block?
The scheduling is tasktracker based rather than split based. By that I
mean that the tasktracker asks for a task and the JT (JobTracker)
schedules a task to that tracker.
If there is any split that is data-local to the tasktracker and not yet
scheduled, it will be assigned to the tracker. If no such split can be
found, the JT will assign a high-priority split to it. The priority
amongst the splits is based on their ordering given by the jobclient. By
default it is sorted on split size (decreasing order). A split is either
data-local (on the same machine), rack-local (within the same rack) or
non-local. There is no other measure of closeness. The scheduling
problem is 'given a tasktracker, find the best split' rather than
'given a split, find the best/closest tracker'.
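
The selection described above can be sketched roughly as follows. This is
an illustrative Python model, not Hadoop's actual (Java) JobTracker code:
the names Split, Tracker, locality, assign_split and rack_of are all
hypothetical, and the sketch assumes the pending list is already ordered
by priority (largest split first, as in the default described above).

```python
from dataclasses import dataclass


@dataclass
class Split:
    name: str
    size: int
    hosts: frozenset  # hosts that hold a replica of this split's block


@dataclass
class Tracker:
    host: str
    rack: str


def locality(split, tracker, rack_of):
    """Classify a split relative to a tracker; only three levels exist."""
    if tracker.host in split.hosts:
        return "data-local"
    if any(rack_of.get(h) == tracker.rack for h in split.hosts):
        return "rack-local"
    return "non-local"


def assign_split(tracker, pending, rack_of):
    """Given a tracker asking for work, pick a split for it.

    `pending` is a priority-ordered list of unscheduled splits. The chosen
    split is removed from the list. Returns (split, locality) or None.
    """
    if not pending:
        return None
    # Prefer the first pending split that is data-local to this tracker.
    for i, split in enumerate(pending):
        if tracker.host in split.hosts:
            chosen = pending.pop(i)
            break
    else:
        # No data-local split: fall back to the highest-priority split.
        chosen = pending.pop(0)
    return chosen, locality(chosen, tracker, rack_of)


# Hypothetical cluster: h1 and h2 share rack r1, h3 sits in rack r2.
rack_of = {"h1": "r1", "h2": "r1", "h3": "r2"}
pending = sorted(
    [
        Split("a", 64, frozenset({"h3"})),
        Split("b", 128, frozenset({"h2"})),
        Split("c", 32, frozenset({"h1"})),
    ],
    key=lambda s: s.size,
    reverse=True,  # default priority: decreasing split size
)
tracker = Tracker("h1", "r1")
while (result := assign_split(tracker, pending, rack_of)) is not None:
    print(result[0].name, result[1])
```

Note that the loop is driven by the tracker asking for work: the sketch
never searches for the best tracker for a given split, which mirrors the
'given a tasktracker, find the best split' framing.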
> Are there any performance metrics for this?

Re: Data-local tasks

Posted by heyongqiang <he...@software.ict.ac.cn>.

Hadoop has not implemented a clever task scheduler. When a data node heartbeats with the namenode and the data node wants a job, it simply gets one.
The selection does not consider the task's input file at all.




  
Best regards,
 
Yongqiang He
2008-06-25



From: Saptarshi Guha
Sent: 2008-06-30 21:12:24
To: core-user@hadoop.apache.org
Cc:
Subject: Data-local tasks

Hello,
I recall asking this question before, but this is in addition to what I asked.
Firstly, to recap my question and Arun's specific response:

-- On May 20, 2008, at 9:03 AM, Saptarshi Guha wrote:
-- Hello,
-- Does the "Data-local map tasks" counter mean the number of tasks that had their input data already present on the machine they are running on?
-- i.e., there wasn't a need to ship the data to them.

Response from Arun

-- Yes. Your understanding is correct. More specifically, it means that the map-task got scheduled on a machine on which one of the
-- replicas of its input-split-block was present and was served by the datanode running on that machine. *smile* Arun

Now, is Hadoop designed to schedule a map task on a machine which has one of the replicas of its input split block?

Failing that, does it then assign the map task to a machine close to one that contains a replica of its input split block?

Are there any performance metrics for this?

Many thanks

Saptarshi

Saptarshi Guha | saptarshi.guha@gmail.com | http://www.stat.purdue.edu/~sguha