Posted to common-user@hadoop.apache.org by praveenesh kumar <pr...@gmail.com> on 2011/12/16 08:42:46 UTC

How Jobtracker choose DataNodes to run TaskTracker ?

Okay so I have one question in mind.

Suppose I have a replication factor of 3 on my cluster of N nodes,
where N > 3, and there is a data block B1 that exists on 3 DataNodes:
DD1, DD2, DD3.

I want to run a Mapper function on this block. My JT will
communicate with the NN to find out where the block is located.
My assumption is that the NN will give the JT information about all
the DataNodes where the block resides, in this case DD1, DD2, DD3.
Am I right on this?

Now my question is: how will the JT know which DD it should send
its mapper code to?

Suppose it chose DD1, and the TaskTracker on that machine starts
running the task. For some reason, DD1 takes more time than the task
would have taken on DD2. How does Hadoop detect this and make these
decisions?

Thanks,
Praveenesh

RE: How Jobtracker choose DataNodes to run TaskTracker ?

Posted by Sh...@cognizant.com.

Hi Praveenesh,

The NN sends the list of DNs holding the block to the client in sorted
order (nodes nearer to the client come first in the list).
If one DN takes more time than expected, Hadoop has a mechanism to
detect and work around that: speculative execution.
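
To illustrate the sorted replica list described above, here is a
minimal sketch in plain Java. It is not Hadoop code: the class
ReplicaSorter, its Node type, and the distance values (0 for the same
host, 2 for the same rack, 4 for a different rack) are assumptions
made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative only: a toy model of ordering replica locations by how
// close they are to the reader. All names here are hypothetical.
class ReplicaSorter {

    // A node is identified by its host name and the rack it sits in.
    static final class Node {
        final String host;
        final String rack;

        Node(String host, String rack) {
            this.host = host;
            this.rack = rack;
        }
    }

    // Assumed distance model: 0 = same host, 2 = same rack, 4 = other rack.
    static int distance(Node reader, Node replica) {
        if (reader.host.equals(replica.host)) {
            return 0;
        }
        if (reader.rack.equals(replica.rack)) {
            return 2;
        }
        return 4;
    }

    // Returns a new list with the nearest replicas first.
    // The sort is stable, so equally distant replicas keep their input order.
    static List<Node> sortByDistance(Node reader, List<Node> replicas) {
        List<Node> sorted = new ArrayList<>(replicas);
        sorted.sort(Comparator.comparingInt(r -> distance(reader, r)));
        return sorted;
    }

    public static void main(String[] args) {
        Node reader = new Node("DD2", "rackA");
        List<Node> replicas = Arrays.asList(
                new Node("DD1", "rackB"),
                new Node("DD2", "rackA"),
                new Node("DD3", "rackA"));
        for (Node n : sortByDistance(reader, replicas)) {
            System.out.println(n.host + " distance=" + distance(reader, n));
        }
    }
}
```

With a reader on DD2, the sketch lists DD2 first (same host), then DD3
(same rack), then DD1 (different rack).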

Speculative execution: One problem with the Hadoop system is that by
dividing the tasks across many nodes, it is possible for a few slow
nodes to rate-limit the rest of the program. For example, if one node
has a slow disk controller, it may be reading its input at only 10% of
the speed of all the other nodes. So when 99 map tasks are already
complete, the system is still waiting for the final map task to check
in, which takes much longer than all the others.

By forcing tasks to run in isolation from one another, individual tasks
do not know where their inputs come from. Tasks trust the Hadoop
platform to just deliver the appropriate input. Therefore, the same
input can be processed multiple times in parallel, to exploit
differences in machine capabilities. As most of the tasks in a job are
coming to a close, the Hadoop platform will schedule redundant copies of
the remaining tasks across several nodes which do not have other work to
perform. This process is known as speculative execution. When tasks
complete, they announce this fact to the JobTracker. Whichever copy of a
task finishes first becomes the definitive copy. If other copies were
executing speculatively, Hadoop tells the TaskTrackers to abandon the
tasks and discard their outputs. The Reducers then receive their inputs
from whichever Mapper completed successfully first.
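
The "first copy to finish wins" idea above can be sketched in plain
Java. This is only an analogy, not how Hadoop implements it: the class
SpeculativeDemo and its methods are hypothetical, the node speeds are
simulated with sleeps, and both attempts start at the same time,
whereas Hadoop only schedules redundant copies as a job nears its end.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative only: runs redundant copies of one task and keeps the
// result of whichever copy finishes first.
class SpeculativeDemo {

    // One attempt at the task: sleeps to mimic the node's speed, then
    // returns the name of the node that produced the output.
    static Callable<String> attempt(String node, long millis) {
        return () -> {
            Thread.sleep(millis);
            return node;
        };
    }

    // Runs all attempts in parallel and returns the first successful
    // result. invokeAny cancels the attempts that have not completed.
    static String runSpeculatively(List<Callable<String>> attempts) {
        ExecutorService pool = Executors.newFixedThreadPool(attempts.size());
        try {
            return pool.invokeAny(attempts);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("no attempt completed", e);
        } finally {
            pool.shutdownNow(); // interrupt any copy that is still running
        }
    }

    public static void main(String[] args) {
        String winner = runSpeculatively(Arrays.asList(
                attempt("DD1", 2000),   // slow node
                attempt("DD2", 100)));  // faster node running a redundant copy
        System.out.println("definitive copy came from " + winner);
    }
}
```

The losing attempt is interrupted and its result is never used, which
mirrors Hadoop telling the TaskTrackers to abandon the redundant tasks
and discard their outputs.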

Speculative execution is enabled by default. You can disable speculative
execution for the mappers and reducers by setting the
mapred.map.tasks.speculative.execution and
mapred.reduce.tasks.speculative.execution JobConf options to false,
respectively.


Thanks and Regards,
Shreya Pal
Technical Architect DWBI&PM
Vnet: 205594
+91-9766310680

