Posted to hdfs-user@hadoop.apache.org by Qiming He <qi...@openresearchinc.com> on 2013/03/28 00:49:46 UTC

Hadoop: using NLineInputFormat with compression?

$ cat abook.txt | base64 -w 0 > onelinetext.b64
$ hadoop fs -put onelinetext.b64 /input/onelinetext.b64
$ hadoop jar hadoop-streaming.jar \
    -input /input/onelinetext.b64 -output /output \
    -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
    -mapper wc

Number of tasks: 1, and the output has one line:

Line 1: 1 2 202699

This makes sense, because one line per mapper is intended.

$ bzip2 onelinetext.b64
$ hadoop fs -put onelinetext.b64.bz2 /input/onelinetext.b64.bz2
$ hadoop jar hadoop-streaming.jar \
    -Dmapred.input.compress=true \
    -Dmapred.input.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
    -input /input/onelinetext.b64.bz2 -output /output \
    -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
    -mapper wc

I expected the same result as above, because decompression should occur
before the one-line text is processed (i.e. by wc). Instead, I am getting:

Number of tasks: 397, and the output has 397 lines:

Lines 1-396: 0 0 0
Line 397: 1 2 202699

Any idea why there are so many map tasks (mapred.map.tasks is not 1)? Is
the input being split? I purposely chose gzip because I believe it is NOT
splittable. I got similar results when using the bzip2 and lzop codecs.
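
One hypothesis worth testing, stated here as an assumption rather than
something confirmed in this thread: if the line counting that decides the
splits runs over the raw, still-compressed bytes, then every newline (0x0A)
byte inside the .bz2 file would look like a line boundary and could produce
an extra map task. The sketch below uses a hypothetical helper,
count_newline_bytes, to count those bytes locally so the number can be
compared with the 397 tasks observed.

```shell
#!/bin/sh
# Hypothetical helper (not part of Hadoop): count newline (0x0A) bytes
# in a file exactly as stored on disk, without decompressing it.
count_newline_bytes() {
    # tr -cd '\n' drops every byte except newlines; wc -c counts the rest.
    tr -cd '\n' < "$1" | wc -c | tr -d ' '
}

# Example usage, assuming the local files from the commands above exist:
#   count_newline_bytes onelinetext.b64.bz2              # raw compressed bytes
#   bzip2 -dc onelinetext.b64.bz2 | tr -cd '\n' | wc -c  # decompressed text
```

If the count for the compressed file is close to the number of map tasks,
that would support the hypothesis; if it is not, the cause lies elsewhere.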

Thanks for your answer in advance.

-- 
Dr. Qiming He
Qiming.He@openresearchinc.com
301-525-6612 (Phone)
815-327-2122 (Fax)