Posted to mapreduce-user@hadoop.apache.org by Sean Bigdatafun <se...@gmail.com> on 2011/02/09 22:09:09 UTC

Who actually does the split computation?

http://answers.oreilly.com/topic/459-anatomy-of-a-mapreduce-job-run-with-hadoop/
"Computes the input splits for the job. If the splits cannot be computed,
because the input paths don’t exist, for example, then the job is not
submitted and an error is thrown to the MapReduce program.

Copies the resources needed to run the job, including the job JAR file, the
configuration file and the computed input splits, to the jobtracker’s
filesystem in a directory named after the job ID. The job JAR is copied with
a high replication factor (controlled by the mapred.submit.replication
property,
which defaults to 10) so that there are lots of copies across the cluster
for the tasktrackers to access when they run tasks for the job (step 3)."
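For reference, that property is a standard job/site configuration knob; a minimal fragment (typically placed in mapred-site.xml; the value 10 is the default cited above) would look like:

```xml
<!-- Replication factor for job staging files (job JAR, config, splits).
     Default is 10 so tasktrackers across the cluster have nearby copies. -->
<property>
  <name>mapred.submit.replication</name>
  <value>10</value>
</property>
```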

1. My first question: who is responsible for computing the input splits? Is
it the JobClient's work or the JobTracker's? --- from the statement above it
sounds like the JobClient's work. But I do not understand how the JobClient
is able to compute this info, because it does not seem to hold enough
information to do so. To compute the input splits, the party must at least
know how many blocks the target input spans, AFAIK, but the JobClient does
not seem to have that information.

Here is my understanding of splits, using an example: a 256MB file stored
in 4 blocks in HDFS can be split into 4 splits if it is the target input
for the MR job. Is a block the minimum split size, or can a split be smaller
than that? How exactly is a split size computed?
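For what it's worth, a split can indeed be smaller than a block: in the new-API FileInputFormat the split size boils down to max(minSize, min(maxSize, blockSize)), driven by the min/max split-size properties. A standalone sketch of that rule (a simplification from memory of Hadoop's FileInputFormat, not the exact source):

```java
// Sketch of the split-size rule used by the new-API FileInputFormat:
//   splitSize = max(minSize, min(maxSize, blockSize))
// With default min (1) and max (Long.MAX_VALUE), a split equals one block;
// lowering the max split size yields splits smaller than a block.
public class SplitSizeSketch {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long blockSize = 64 * mb;  // a 64 MB HDFS block
        // Defaults: split size == block size
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE)); // 67108864
        // Capping the max split size at 32 MB: splits are half a block
        System.out.println(computeSplitSize(blockSize, 1L, 32 * mb));        // 33554432
    }
}
```

So the 256MB/4-block file above produces 4 splits under the defaults, but could produce 8 splits of 32MB each if the maximum split size were lowered.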


-- 
--Sean

Re: Who actually does the split computation?

Posted by Todd Lipcon <to...@cloudera.com>.
On Wed, Feb 9, 2011 at 1:49 PM, Sean Bigdatafun
<se...@gmail.com> wrote:

> Where does this computation happen (in the context of the original picture
> in the posted link)?
>
> JobClient? or JobTracker? (Either way I think they need to contact the HDFS
> NameNode to do such work, which did not seem to be described in that
> link) --- I can't post on the mapreduce-user mailing list, so I have to ask
> it here.
>
>
Happens in the JobClient. See o.a.h.mapreduce.JobSubmitter.java:357 in
trunk.

The InputFormat's getSplits() method will call out to the NN to find the
block locations for the input files. See the implementation of
FileInputFormat for details.
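To make the flow concrete, here is a much-simplified, self-contained model of what getSplits() does: carve each input file into split-sized chunks and record each chunk's offset and length. The types here are hypothetical stand-ins; the real FileInputFormat builds o.a.h.mapreduce.lib.input.FileSplit objects and attaches the hosts it learned from the NameNode to each split so the scheduler can place map tasks near the data:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of FileInputFormat.getSplits(). The real implementation
// additionally asks the NameNode (via FileSystem) for each block's locations
// and stores the hosts on the split; this sketch keeps only offset/length.
public class GetSplitsSketch {
    record Split(String path, long offset, long length) {}

    static List<Split> getSplits(String path, long fileLen, long splitSize) {
        List<Split> splits = new ArrayList<>();
        long remaining = fileLen;
        while (remaining > 0) {
            long len = Math.min(splitSize, remaining);
            splits.add(new Split(path, fileLen - remaining, len));
            remaining -= len;
        }
        return splits;
    }

    public static void main(String[] args) {
        // The 256 MB / 64 MB example from earlier in the thread -> 4 splits
        long mb = 1024L * 1024;
        List<Split> splits = getSplits("/data/input.txt", 256 * mb, 64 * mb);
        System.out.println(splits.size()); // 4
    }
}
```

Note that the split list is what gets serialized into the job's staging directory for the JobTracker; only the (path, offset, length, hosts) metadata moves, never the data itself.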

-Todd


> On Wed, Feb 9, 2011 at 1:13 PM, David Rosenstrauch <da...@darose.net> wrote:
>
>> On 02/09/2011 04:09 PM, Sean Bigdatafun wrote:
>>
>>> 1. My first question: who is responsible to compute the input splits?
>>>
>>
>> The InputFormat computes InputSplits.  See:
>> http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/mapreduce/InputFormat.html
>>
>> DR
>>
>
>
>
> --
> --Sean
>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Who actually does the split computation?

Posted by Sean Bigdatafun <se...@gmail.com>.
Where does this computation happen (in the context of the original picture
in the posted link)?

JobClient? or JobTracker? (Either way I think they need to contact the HDFS
NameNode to do such work, which did not seem to be described in that
link) --- I can't post on the mapreduce-user mailing list, so I have to ask
it here.

On Wed, Feb 9, 2011 at 1:13 PM, David Rosenstrauch <da...@darose.net> wrote:

> On 02/09/2011 04:09 PM, Sean Bigdatafun wrote:
>
>> 1. My first question: who is responsible to compute the input splits?
>>
>
> The InputFormat computes InputSplits.  See:
> http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/mapreduce/InputFormat.html
>
> DR
>



-- 
--Sean

Re: Who actually does the split computation?

Posted by David Rosenstrauch <da...@darose.net>.
On 02/09/2011 04:09 PM, Sean Bigdatafun wrote:
> 1. My first question: who is responsible to compute the input splits?

The InputFormat computes InputSplits.  See: 
http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/mapreduce/InputFormat.html

DR