Posted to mapreduce-dev@hadoop.apache.org by Yoonmin Nam <ro...@dgist.ac.kr> on 2013/10/17 17:21:55 UTC

(Re)About block splitting, input split and TextInputFormat in MapReduce

Hi.

Let us consider this situation:

1.     Block size = 67108864 bytes (64 MB)

2.     Data size = 2.2 GB (larger than the block size)
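
(A quick sanity check, assuming 2.2 GB here means 2.2 x 2^30 = 2362232013
bytes: 2362232013 / 67108864 is about 35.2, so the file is stored as 36 HDFS
blocks, namely 35 full 64 MB blocks plus one tail block of about 12.8 MB.)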

Then, when I put the input into HDFS, I got the following block placement
result:

http://infolab.dgist.ac.kr/~ronymin/pictures/1.png

Then, I checked each HDFS block and, unfortunately (but naturally), blocks 2
and 3 contain a record broken across the block boundary, like this:

At the end of block 2:

...
<username>R. fi

At the start of block 3:

end</username>

This means the original data looks like this (it is XML-format data):

<username>R. fiend</username>

If I use TextInputFormat (LineRecordReader and LineReader), I thought that
mapper 3, which handles block 2, would also read the start of block 3 so that
the incomplete, broken line becomes one meaningful record.

And mapper 4, which handles block 3, would start reading from the element
after end</username> (the next element is actually the id: <id>55767</id>).
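
For reference, here is a simplified sketch of what I understand
LineRecordReader (org.apache.hadoop.mapreduce.lib.input) to do, paraphrased
from my reading of the Hadoop source; compression handling and the real field
names are trimmed, so treat it as a sketch rather than the actual code:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

// Paraphrased sketch of LineRecordReader's split-boundary handling.
public class SketchLineRecordReader {
  private long start, end, pos;
  private LineReader in;
  private Text value = new Text();
  private int maxLineLength = Integer.MAX_VALUE;

  public void initialize(FileSplit split, TaskAttemptContext ctx)
      throws IOException {
    start = split.getStart();
    end = start + split.getLength();
    FSDataInputStream fileIn = split.getPath()
        .getFileSystem(ctx.getConfiguration()).open(split.getPath());
    fileIn.seek(start);
    in = new LineReader(fileIn, ctx.getConfiguration());
    // Every split except the first skips its first (possibly partial)
    // line: the reader of the PREVIOUS split already consumed it.
    if (start != 0) {
      start += in.readLine(new Text(), 0, Integer.MAX_VALUE);
    }
    pos = start;
  }

  public boolean nextKeyValue() throws IOException {
    // Note the "<=": the reader deliberately reads one whole line PAST
    // the split boundary. That is how the mapper for block 2 completes
    // "<username>R. fi" with the "end</username>" from block 3.
    while (pos <= end) {
      int newSize = in.readLine(value, maxLineLength, Integer.MAX_VALUE);
      if (newSize == 0) {
        return false;             // end of file
      }
      pos += newSize;
      if (newSize <= maxLineLength) {
        return true;              // got a complete line
      }
      // Line longer than maxLineLength: skip it and read the next one.
    }
    return false;                 // past the split end: this split is done
  }
}

If this sketch is right, no coordination between mappers is needed: every
reader independently applies the rule "skip the first partial line unless
start == 0, and read one line past the end of the split", so each line ends
up with exactly one mapper.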

If that is right, then a mapper could get a real performance gain when it has
a block and its adjacent block on the same node, because handling a record
that spans the boundary would not require network I/O to fetch the start of
the next block.

In the block placement result I showed, block 0 and block 1 are on the same
datanode (10.40.3.78), and block 1 and block 2 are also on the same datanode
(10.40.3.83).

However, block 3 and block 4 do not share even one datanode (the two blocks
are on completely different nodes).

At this point, I want to ask you the following questions:

1.     Does the block placement policy take this kind of situation into
account?

2.     Is anything wrong in my reasoning above, especially the idea that one
mapper handles the end of its own block plus the start of the next block in
order to complete the broken line?

3.     Why does SPLIT_SLOP have the value 1.1 in FileInputFormat?

4.     I know the HDFS block generation mechanism splits the input data
strictly on dfs.block.size, and I assumed that value was also an upper bound
on the size of an InputSplit. Because of SPLIT_SLOP, that assumption is not
quite right, so please let me know the exact reasoning behind the input
splitting mechanism!

(Consider the case where the last remaining data (bytesRemaining) is 64.8 MB
and splitSize is 64 MB: bytesRemaining / splitSize == 1.0125 < SPLIT_SLOP, so
it all becomes one input split!)
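
And here is the split-sizing loop as I understand it, condensed from
FileInputFormat.getSplits(); the real method also handles block locations,
compressed files, and multiple input files, so again this is only a sketch:

import java.util.ArrayList;
import java.util.List;

// Condensed sketch of the split-sizing loop in FileInputFormat.getSplits().
public class SplitSlopSketch {
  private static final double SPLIT_SLOP = 1.1;   // 10% slop

  // Returns {offset, length} pairs for one file of the given length.
  static List<long[]> computeSplits(long length, long splitSize) {
    List<long[]> splits = new ArrayList<long[]>();
    long bytesRemaining = length;
    // Carve off full-size splits while the remainder is MORE than
    // 110% of splitSize...
    while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
      splits.add(new long[] { length - bytesRemaining, splitSize });
      bytesRemaining -= splitSize;
    }
    // ...then emit whatever is left (up to 1.1 * splitSize) as one
    // final split. For a 64.8 MB tail: 64.8 / 64 = 1.0125, which is
    // not > 1.1, so the loop stops and the tail stays one 64.8 MB
    // split instead of a 64 MB split plus a tiny 0.8 MB split.
    if (bytesRemaining != 0) {
      splits.add(new long[] { length - bytesRemaining, bytesRemaining });
    }
    return splits;
  }
}

If I read this correctly, splitSize (normally dfs.block.size) is only the
nominal split size, and the last split of a file can be up to 10% larger;
presumably the 1.1 exists so that a tiny tail does not get a whole map task
of its own.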

Thank you for reading my very long question!