Posted to common-user@hadoop.apache.org by Pramy Bhats <pr...@googlemail.com> on 2010/08/13 19:26:45 UTC
Why is getSplits() called twice?
Hi,
I am trying to modify the splitter for mappers. But while looking at the code
base, I fail to understand why getSplits() is called twice before the
Mapper is launched. It essentially does the same thing both times
(calling the getSplits(JobContext job) method in FileInputFormat.class).
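For context, the split computation itself is simple arithmetic over the file
length. Below is a minimal self-contained sketch of the kind of logic
FileInputFormat applies; the clamping formula Math.max(minSize,
Math.min(maxSize, blockSize)) matches FileInputFormat.computeSplitSize, but the
class and the offset/length representation here are simplified illustrations,
not Hadoop's actual code:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Mirrors FileInputFormat.computeSplitSize: clamp the HDFS block size
    // between the configured minimum and maximum split sizes.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Chunk a file of the given length into (offset, length) pairs.
    static List<long[]> getSplits(long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long remaining = fileLength;
        while (remaining > 0) {
            long offset = fileLength - remaining;
            long length = Math.min(splitSize, remaining);
            splits.add(new long[] { offset, length });
            remaining -= length;
        }
        return splits;
    }

    public static void main(String[] args) {
        long splitSize = computeSplitSize(64, 1, Long.MAX_VALUE);
        for (long[] s : getSplits(200, splitSize)) {
            System.out.println("offset=" + s[0] + " length=" + s[1]);
        }
    }
}
```

A 200-byte file with a 64-byte split size yields three full splits and one
8-byte tail split.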
The first time getSplits() is invoked:
    if (job.getUseNewMapper()) {
      maps = writeNewSplits(context, submitSplitFile);
    }

    int writeNewSplits(JobContext job, Path submitSplitFile)
        throws IOException, InterruptedException, ClassNotFoundException {
      JobConf conf = job.getJobConf();
      org.apache.hadoop.mapreduce.InputFormat<?,?> input =
        ReflectionUtils.newInstance(job.getInputFormatClass(),
                                    job.getJobConf());
      List<org.apache.hadoop.mapreduce.InputSplit> splits =
        input.getSplits(job);
      ...
The second time getSplits() is invoked:
    public void run() {
      JobID jobId = profile.getJobID();
      JobContext jContext = new JobContext(conf, jobId);
      OutputCommitter outputCommitter = job.getOutputCommitter();
      try {
        // split input into minimum number of splits
        RawSplit[] rawSplits;
        if (job.getUseNewMapper()) {
          org.apache.hadoop.mapreduce.InputFormat<?,?> input =
            ReflectionUtils.newInstance(jContext.getInputFormatClass(),
                                        jContext.getJobConf());
          List<org.apache.hadoop.mapreduce.InputSplit> splits =
            input.getSplits(jContext);
          ...
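One observation: if getSplits() depends only on the job configuration and the
input files (as FileInputFormat's does), both invocations produce the same
list, so the repeated call costs duplicate work but cannot disagree with
itself. A toy illustration of that determinism, with hypothetical names rather
than Hadoop code:

```java
import java.util.Arrays;
import java.util.List;

public class DoubleCall {
    // A deterministic split computation: same inputs always yield the same
    // offsets, so invoking it twice merely repeats the same work.
    static List<Long> getSplitOffsets(long fileLength, long splitSize) {
        int n = (int) ((fileLength + splitSize - 1) / splitSize);
        Long[] offsets = new Long[n];
        for (int i = 0; i < n; i++) {
            offsets[i] = i * splitSize;
        }
        return Arrays.asList(offsets);
    }

    public static void main(String[] args) {
        List<Long> first = getSplitOffsets(200, 64);
        List<Long> second = getSplitOffsets(200, 64);
        // Both invocations agree element for element.
        System.out.println(first.equals(second));
    }
}
```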
Could you please help me understand the reasoning behind this?

thanks,
--PB