Posted to mapreduce-user@hadoop.apache.org by 周杰 <zh...@126.com> on 2011/09/26 11:02:06 UTC

A question about "input split"

Hello, everyone!
While reading the Hadoop source, I ran into a problem.
As we all know, when mapred.max.split.size >= blocksize is set in the conf, splitSize == blocksize.
My question is about the case mapred.max.split.size < blocksize, where splitSize is smaller than blocksize. Look at the function "getSplits()" of class FileInputFormat:
  public List<InputSplit> getSplits(JobContext job
                                    ) throws IOException {
 ......
    for (FileStatus file: listStatus(job)) {
      Path path = file.getPath();
      FileSystem fs = path.getFileSystem(job.getConfiguration());
      long length = file.getLen();
      BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
      if ((length != 0) && isSplitable(job, path)) { 
        long blockSize = file.getBlockSize();
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);


        long bytesRemaining = length;
        while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
          splits.add(new FileSplit(path, length-bytesRemaining, splitSize, 
                                   blkLocations[blkIndex].getHosts()));
          bytesRemaining -= splitSize;
        }
......
  }
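
For reference, here is a minimal sketch of how I believe computeSplitSize() combines the three values. The formula max(minSize, min(maxSize, blockSize)) is my reading of the source and is not part of the excerpt above; the class name SplitSizeSketch and the sample numbers are only for illustration.

```java
public class SplitSizeSketch {
    // Assumed formula: clamp the block size between minSize and maxSize.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        // maxSize >= blockSize: the split size equals the block size.
        System.out.println(computeSplitSize(64, 1, 128)); // prints 64
        // maxSize < blockSize: the split size drops below the block size.
        System.out.println(computeSplitSize(64, 1, 50));  // prints 50
    }
}
```

Under this assumption, the first statement (splitSize == blocksize) only holds as long as minSize is not larger than blocksize.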


Notice the while loop: if splitSize is smaller than blocksize, something confusing happens.
For example, with blocksize = 64, splitSize = 50, file length = 200:
            bytesRemaining   splitSize   bytesRemaining/splitSize   length-bytesRemaining
1st loop:   200              50          4                          0
2nd loop:   150              50          3                          50
That means that in the 2nd loop the following call is made:

new FileSplit(path, length-bytesRemaining, splitSize, 
                                   blkLocations[blkIndex].getHosts()));
Here start = length-bytesRemaining = 50 and length = splitSize = 50, so the split created in the 2nd loop covers bytes 50-100. The blocks are 0-64, 64-128, 128-192, ..., so this split spans two blocks: it covers 50-64 of the first block and 64-100 of the second.
But the constructor call new FileSplit() is given the info of only one block (blkLocations[blkIndex].getHosts()).
I could not understand this.
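
To make the example concrete, here is a small sketch that replays the while loop with the numbers above and reports which blocks each split touches. It does not use any Hadoop classes. The value SPLIT_SLOP = 1.1 and the handling of the leftover bytes after the loop are my assumptions about the elided parts of getSplits(); the class SplitLoopSketch and the helper describe() are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitLoopSketch {
    // Assumed value; the constant is not shown in the excerpt.
    static final double SPLIT_SLOP = 1.1;

    // Each entry is {start, length, firstBlock, lastBlock}.
    static List<long[]> simulate(long length, long blockSize, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long bytesRemaining = length;
        // Same loop condition as in the getSplits() excerpt.
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits.add(describe(length - bytesRemaining, splitSize, blockSize));
            bytesRemaining -= splitSize;
        }
        // Assumption: whatever is left becomes one final split.
        if (bytesRemaining != 0) {
            splits.add(describe(length - bytesRemaining, bytesRemaining, blockSize));
        }
        return splits;
    }

    // Index of the block holding the first byte and the last byte of a split.
    static long[] describe(long start, long len, long blockSize) {
        return new long[] {start, len, start / blockSize, (start + len - 1) / blockSize};
    }

    public static void main(String[] args) {
        for (long[] s : simulate(200, 64, 50)) {
            System.out.println("start=" + s[0] + " length=" + s[1]
                + " firstBlock=" + s[2] + " lastBlock=" + s[3]);
        }
    }
}
```

With these numbers the split starting at 50 begins in block 0 and ends in block 1, which is exactly the case I am asking about: only the hosts of the block containing the start offset are passed to the split.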