Posted to common-user@hadoop.apache.org by Pramy Bhats <pr...@googlemail.com> on 2010/08/13 19:26:45 UTC

Why is getSplits() called twice?

Hi,

I am trying to modify the input splitter for the mappers. But while looking at
the code base, I fail to understand why getSplits() is called twice before the
Mapper is launched. It does essentially the same thing both times (namely,
calling the getSplits(JobContext job) method in FileInputFormat.class).


The first time getSplits() is invoked:

    if (job.getUseNewMapper()) {
      maps = writeNewSplits(context, submitSplitFile);
    }

    int writeNewSplits(JobContext job, Path submitSplitFile
                       ) throws IOException, InterruptedException,
                                ClassNotFoundException {
      JobConf conf = job.getJobConf();
      org.apache.hadoop.mapreduce.InputFormat<?,?> input =
        ReflectionUtils.newInstance(job.getInputFormatClass(), job.getJobConf());

      List<org.apache.hadoop.mapreduce.InputSplit> splits = input.getSplits(job);
The second time getSplits() is invoked:
    public void run() {
      JobID jobId = profile.getJobID();
      JobContext jContext = new JobContext(conf, jobId);
      OutputCommitter outputCommitter = job.getOutputCommitter();
      try {
        // split input into minimum number of splits
        RawSplit[] rawSplits;
        if (job.getUseNewMapper()) {
          org.apache.hadoop.mapreduce.InputFormat<?,?> input =
              ReflectionUtils.newInstance(jContext.getInputFormatClass(),
                  jContext.getJobConf());

          List<org.apache.hadoop.mapreduce.InputSplit> splits =
              input.getSplits(jContext);
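Since both call sites compute splits independently, a custom getSplits() should
be deterministic and side-effect free. Below is a minimal sketch of that
pattern in plain Java (no Hadoop dependency; the class and method names here
are illustrative stand-ins, not Hadoop APIs) that counts invocations the way
one might with a log line, and simulates the two call sites from the snippets
above:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.atomic.AtomicInteger;

    // Illustrative stand-in for a custom split computation: same input
    // always yields the same splits, no matter how often it runs.
    public class SplitSketch {
        static final AtomicInteger calls = new AtomicInteger();

        // Splits a file of fileLength bytes into chunks of at most splitSize.
        static List<String> getSplits(long fileLength, long splitSize) {
            calls.incrementAndGet(); // count invocations (cf. adding a log line)
            List<String> splits = new ArrayList<>();
            for (long start = 0; start < fileLength; start += splitSize) {
                long len = Math.min(splitSize, fileLength - start);
                splits.add("offset=" + start + ",length=" + len);
            }
            return splits;
        }

        public static void main(String[] args) {
            // Simulate the two call sites: job submission and the job
            // runner each compute the splits on their own.
            List<String> first = getSplits(250, 100);
            List<String> second = getSplits(250, 100);
            if (!first.equals(second)) {
                throw new AssertionError("split computation must be deterministic");
            }
            System.out.println("getSplits called " + calls.get()
                    + " times, " + first.size() + " splits: " + first);
            // prints: getSplits called 2 times, 3 splits: [offset=0,length=100,
            //         offset=100,length=100, offset=200,length=50]
        }
    }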
Could you please help me understand the reasoning behind it?

thanks,
--PB