Posted to mapreduce-user@hadoop.apache.org by Jason Wang <ja...@gmail.com> on 2012/10/18 22:12:24 UTC
Hadoop streaming inserts tabs into mapper output
With hadoop streaming and no reducer, I would expect the output written to
HDFS to be the exact STDOUT from the mapper. However, I noticed that tab characters
(0x09) are being inserted before every newline character (0x0A). This is
problematic for me because the output of my mapper is binary data which I
would like to be written to HDFS unaltered.
I've narrowed my issue down to a very simple example that anybody can run.
Create a simple test.txt file with 4 or more lines of text (it must contain
newline characters to reproduce the problem). Copy this to HDFS, and run a
simple streaming job with "cat" as the mapper:
    hadoop jar ../contrib/streaming/hadoop-streaming-1.0.3.jar \
        -input /Users/hadoop/test/test.txt \
        -output /Users/hadoop/test/output \
        -mapper "cat" \
        -reducer NONE
Copy the output/part-00000 file to the local filesystem and hexdump it. You'll
notice that each 0x0A byte has become 0x09 0x0A.
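For anyone reproducing this without reading a hexdump by eye, here is a minimal Python sketch of the same check. It assumes the part file has already been copied to a local path; the function names and the command-line usage are illustrative only and are not part of Hadoop.

```python
import sys


def count_tab_newline(data: bytes) -> int:
    """Count newline bytes (0x0A) immediately preceded by a tab byte (0x09)."""
    return data.count(b"\t\n")


def summarize(path: str) -> tuple:
    """Return (tab+newline pairs, total newlines) for the file at `path`."""
    # Read in binary mode so no newline translation hides the extra bytes.
    with open(path, "rb") as f:
        data = f.read()
    return count_tab_newline(data), data.count(b"\n")


if __name__ == "__main__":
    # Hypothetical usage: python check_tabs.py part-00000
    pairs, newlines = summarize(sys.argv[1])
    print(f"{pairs} of {newlines} newlines are preceded by a tab")
```

Running it on both the original test.txt and the downloaded part-00000 should show the difference: if the behavior described above is present, every newline in the output file is reported as preceded by a tab, while the input file shows none (assuming the input itself has no trailing tabs).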
There must be a parameter to streaming that can fix this, but I have not
been able to find it.
Thanks in advance,
Jason