Posted to mapreduce-user@hadoop.apache.org by 周杰 <zh...@126.com> on 2011/09/26 11:02:06 UTC
A question about "input split"
Hello, everyone!
While reading the Hadoop source code, I ran into a problem.
As we all know, when we set mapred.max.split.size >= blockSize in the configuration, splitSize == blockSize.
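I believe splitSize comes from computeSplitSize(); here is a minimal sketch of the clamping I understand that helper to do (my own paraphrase, not a verbatim copy of the source):

  // Sketch of the clamping I believe computeSplitSize() performs (not a verbatim copy).
  protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
    // maxSize >= blockSize  -> splitSize == blockSize
    // maxSize <  blockSize  -> splitSize == maxSize (assuming minSize <= maxSize)
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }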
My question is about the case where mapred.max.split.size < blockSize, so splitSize is smaller than blockSize. Look at the method getSplits() of class FileInputFormat:
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    ......
    for (FileStatus file: listStatus(job)) {
      Path path = file.getPath();
      FileSystem fs = path.getFileSystem(job.getConfiguration());
      long length = file.getLen();
      BlockLocation[] blkLocations = fs.getFileBlockLocations(file, 0, length);
      if ((length != 0) && isSplitable(job, path)) {
        long blockSize = file.getBlockSize();
        long splitSize = computeSplitSize(blockSize, minSize, maxSize);

        long bytesRemaining = length;
        while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
          int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
          splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
                                   blkLocations[blkIndex].getHosts()));
          bytesRemaining -= splitSize;
        }
        ......
      }
Notice the while loop: if splitSize is smaller than blockSize, there is a confusing problem.
For example, blockSize = 64, splitSize = 50, fileLength = 200:
                  bytesRemaining   splitSize   bytesRemaining/splitSize   length-bytesRemaining
  first loop:     200              50          4                          0
  second loop:    150              50          3                          50
That means that in the second loop,

  new FileSplit(path, length-bytesRemaining, splitSize,
                blkLocations[blkIndex].getHosts())

is called with start = length-bytesRemaining = 50 and length = splitSize = 50. The blocks are 0---64, 64---128, 128---192, ..., so this second split spans two blocks: it covers 50---64 of the first block and 64---100 of the second.
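To make the arithmetic concrete, here is a small standalone sketch of my own (not the Hadoop source) that replays the while loop with these numbers and prints which byte range and which blocks each split covers:

  // Standalone illustration only -- not the Hadoop code.
  public class SplitArithmetic {
    public static void main(String[] args) {
      long fileLength = 200, blockSize = 64, splitSize = 50;
      double SPLIT_SLOP = 1.1;                      // same constant as in FileInputFormat
      long bytesRemaining = fileLength;
      while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
        long start = fileLength - bytesRemaining;   // split start offset in the file
        long end = start + splitSize;               // exclusive end offset
        long firstBlock = start / blockSize;        // block containing the first byte
        long lastBlock = (end - 1) / blockSize;     // block containing the last byte
        System.out.println("split [" + start + ", " + end + ") covers blocks "
            + firstBlock + ".." + lastBlock);
        bytesRemaining -= splitSize;
      }
    }
  }

With blockSize = 64 this prints that the splits starting at 50 and at 100 each span two blocks.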
But the FileSplit constructor call only stores a single block's host info ( blkLocations[blkIndex].getHosts() ).
I could not understand this.
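As far as I can tell, getBlockIndex() just picks the one block that contains the split's start offset; a sketch of what I understand it to do (paraphrased from memory, using org.apache.hadoop.fs.BlockLocation, so please correct me if I am wrong):

  // Paraphrase of what I think getBlockIndex() does -- not the exact source.
  protected int getBlockIndex(BlockLocation[] blkLocations, long offset) {
    for (int i = 0; i < blkLocations.length; i++) {
      // return the index of the block whose byte range contains 'offset'
      if ((blkLocations[i].getOffset() <= offset) &&
          (offset < blkLocations[i].getOffset() + blkLocations[i].getLength())) {
        return i;
      }
    }
    throw new IllegalArgumentException("Offset " + offset + " is outside the file");
  }

So for the second split (start = 50), blkIndex points at the 0---64 block, and only that block's hosts are stored in the FileSplit, even though the split also reads bytes 64---100 from the next block. That is exactly what I do not understand.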