Posted to mapreduce-user@hadoop.apache.org by Jason Wang <ja...@gmail.com> on 2012/10/18 22:12:24 UTC

Hadoop streaming inserts tabs into mapper output

With Hadoop streaming and no reducer, I would expect the output written to
HDFS to be exactly what the mapper writes to STDOUT.  I noticed that a tab
character (0x9) is inserted before every newline character (0xA).  This is
a problem for me because my mapper's output is binary data, which I would
like written to HDFS unaltered.

I've narrowed the issue down to a very simple example that anybody can run.
Create a test.txt file with 4 or more lines of text (it must contain
newline characters to reproduce the problem).  Copy it to HDFS, and run a
simple streaming job with "cat" as the mapper:

hadoop jar ../contrib/streaming/hadoop-streaming-1.0.3.jar \
    -input /Users/hadoop/test/test.txt \
    -output /Users/hadoop/test/output \
    -mapper "cat" \
    -reducer NONE

Copy the output/part-00000 file to local disk and hexdump it.  You'll
notice that every 0xA byte has become 0x9 0xA.
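
For anyone who would rather not read a hexdump by eye, the same check can be
sketched in a few lines of Python.  This is only an illustration: it assumes
both the original test.txt and the job's part-00000 have already been copied
to local disk, and the function names (newline_stats, report) are made up
for this example, not part of Hadoop.

```python
def newline_stats(data: bytes) -> tuple[int, int]:
    """Return (total newline bytes, newlines immediately preceded by a tab)."""
    total = data.count(b"\n")           # every 0xA byte
    tab_preceded = data.count(b"\t\n")  # every 0x9 0xA pair
    return total, tab_preceded


def report(original_path: str, output_path: str) -> None:
    """Print newline statistics for the input file and the job output."""
    for label, path in (("input", original_path), ("output", output_path)):
        # Read in binary mode so no newline translation hides the bytes.
        with open(path, "rb") as handle:
            data = handle.read()
        total, tab_preceded = newline_stats(data)
        print(f"{label}: {len(data)} bytes, {total} newlines, "
              f"{tab_preceded} preceded by a tab")
```

Calling report("test.txt", "part-00000") on the local copies should show the
output file growing by one byte per line if the behavior described above is
present, with every newline in the output counted as tab-preceded.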

There must be a parameter to streaming that can fix this, but I have not
been able to find it.

Thanks in advance,
Jason