Posted to mapreduce-user@hadoop.apache.org by Jay Hacker <ja...@gmail.com> on 2013/02/06 22:19:23 UTC

Using hadoop streaming with binary data

Is it possible to pass unmolested binary data through a map-only streaming
job from the command line?  I.e., is there a way to avoid extra tabs and
newlines in the output?  I don't need input splits or key/value pairs, I
just want one whole input file fed unmodified into a program, and its
output written unmodified to HDFS.  For example, I'd like to run:

    hadoop jar hadoop-streaming.jar -mapper cat -numReduceTasks 0 -input in -output out

and have 'out' be exactly the same as 'in'.

There does not seem to be a way to set
mapreduce.output.textoutputformat.separator to the empty string, and
typedbytes prepends the size.  Is there a way to leave data alone out of
the box, or will I have to write a custom InputFormat and OutputFormat?

Thanks!

RE: Using hadoop streaming with binary data

Posted by Venkatesh Kavuluri <vk...@outlook.com>.

You can use hadoop's DistCp to copy files via map/reduce.
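
A minimal sketch of what that DistCp suggestion might look like on the command line, reusing the 'in' and 'out' paths from the question. The variable names SRC, DST and DISTCP_CMD are made up for illustration, and this only covers the plain copy case (the "-mapper cat" example), since DistCp copies files and does not run a program over them.

```shell
# Paths taken from the question's example; substitute real HDFS paths.
SRC=in
DST=out

# Build the DistCp invocation as a string so it can be inspected first.
DISTCP_CMD="hadoop distcp $SRC $DST"
echo "$DISTCP_CMD"

# Run the copy only where the hadoop CLI is actually installed.
# The unquoted expansion relies on the paths containing no spaces.
if command -v hadoop >/dev/null 2>&1; then
    $DISTCP_CMD
fi
```

DistCp runs the copy as map tasks and writes the file bytes as-is, so no key/value separator or newline handling should be involved.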

Date: Wed, 6 Feb 2013 16:19:23 -0500
Subject: Using hadoop streaming with binary data
From: jayqhacker@gmail.com
To: user@hadoop.apache.org

Is it possible to pass unmolested binary data through a map-only streaming job from the command line?  I.e., is there a way to avoid extra tabs and newlines in the output?  I don't need input splits or key/value pairs, I just want one whole input file fed unmodified into a program, and its output written unmodified to HDFS.  For example, I'd like to run:

    hadoop jar hadoop-streaming.jar -mapper cat -numReduceTasks 0 -input in -output out

and have 'out' be exactly the same as 'in'.

There does not seem to be a way to set mapreduce.output.textoutputformat.separator to the empty string, and typedbytes prepends the size.  Is there a way to leave data alone out of the box, or will I have to write a custom InputFormat and OutputFormat?

Thanks!
