Posted to common-user@hadoop.apache.org by abhishek sharma <ab...@usc.edu> on 2010/04/15 21:53:33 UTC

interaction between HDFS and Job/Task trackers

Hi all,

I read the "Anatomy of a MapReduce Job Run with Hadoop" tutorial by
Tom White (http://answers.oreilly.com/topic/459-anatomy-of-a-mapreduce-job-run-with-hadoop/),
and have a few questions about how the input splits are computed,
stored and retrieved.

As per the tutorial, the JobClient's submitJob() function computes the
input splits and stores them in a directory named after the JobID.
Is this directory stored in the shared filesystem? (Figure 6.1
seems to indicate so.)

1) If yes, then what does it mean to store input splits? Does the
following happen during the process of computing the splits?

The input file to the job is first retrieved from HDFS, the splits are
created, and they are then stored in a directory (named after the
JobID) on HDFS.

For example, I have a 128MB file stored on HDFS with a 64MB block size,
so HDFS stores the file as two 64MB blocks. However, if I set
mapred.min.split.size to 128MB and pass this file as input, then I see
only 1 map task being launched.
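
To make the arithmetic behind this observation concrete, here is a
minimal sketch of how I understand the split size to be chosen. It
assumes the rule used by the old-API FileInputFormat, namely
max(minSize, min(goalSize, blockSize)), plus a 10% slack when cutting
the last split. The class name SplitSizeSketch, the helper countSplits
and the choice of goalSize (the whole file, i.e. a hint of one map
task) are my own assumptions for illustration, not Hadoop code.

```java
// Hypothetical, self-contained sketch; not taken from the Hadoop source.
class SplitSizeSketch {
    static final long MB = 1024L * 1024L;
    // Assumed slack factor: a trailing remainder of up to 10% over the
    // split size is folded into the last split instead of a new one.
    static final double SPLIT_SLOP = 1.1;

    // Assumed rule: never go below minSize; otherwise use the smaller
    // of the goal size and the block size.
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Counts how many splits (and hence map tasks) a single file yields.
    static int countSplits(long fileSize, long splitSize) {
        int splits = 0;
        long remaining = fileSize;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining > 0) {
            splits++; // the final, possibly shorter or slightly longer, split
        }
        return splits;
    }

    public static void main(String[] args) {
        long fileSize = 128 * MB;
        long blockSize = 64 * MB;
        long goalSize = fileSize; // assumption: hint of a single map task

        // Minimum split size of 1 byte: one split per 64MB block.
        long small = computeSplitSize(goalSize, 1L, blockSize);
        System.out.println(countSplits(fileSize, small));  // prints 2

        // mapred.min.split.size = 128MB: one split covering both blocks.
        long large = computeSplitSize(goalSize, 128 * MB, blockSize);
        System.out.println(countSplits(fileSize, large));  // prints 1
    }
}
```

If this rule is right, the single map task follows from the split size
alone and says nothing yet about where the two blocks physically live,
which is what my question below is about.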

I see an entry like "<task_id> has split on node: <tasktracker_id>" in
my JobTracker log.

What does the above entry mean? HDFS has the file stored as two
different blocks. Are the two 64MB blocks first retrieved from HDFS
and then merged into a single 128MB file that is then stored as a
single input split?

2) Also, how does a TaskTracker access a split assigned to it?

I mean, is there a difference in the way a local (on the same machine)
vs. a non-local (machine on the same or a different rack) split is
retrieved by the TaskTracker? I assume a TaskTracker has to connect to
the DataNode hosting the block(s) corresponding to a split. Are
different methods invoked for accessing local vs. non-local splits?

Is the input split first stored on the TaskTracker's local filesystem
or is it loaded directly into memory as it is retrieved? Does this
depend on whether the split is on the same machine as the TaskTracker
or not?

Thanks,
Abhishek