Posted to common-dev@hadoop.apache.org by "zhuweimin (JIRA)" <ji...@apache.org> on 2009/03/03 06:58:56 UTC

[jira] Issue Comment Edited: (HADOOP-3227) Implement a binary input/output format for Streaming

    [ https://issues.apache.org/jira/browse/HADOOP-3227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12678244#action_12678244 ] 

chinashuimin edited comment on HADOOP-3227 at 3/2/09 9:58 PM:
-----------------------------------------------------------

I created two classes for processing standard binary files: BinaryInputFormat and BinaryOutputFormat.
To support them, the PipeMapper, PipeMapRed, and PipeReducer classes have to be modified.
The version is Hadoop 0.19.1.

Usage:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /bin/wc \
    -inputformat org.apache.hadoop.streaming.BinaryInputFormat \
    -outputformat org.apache.hadoop.streaming.BinaryOutputFormat

Examples:
1. The map task's input is binary, its output is text, and there is no reducer.
bin/hadoop jar contrib/streaming/hadoop-0.19.1.1-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper "wc -c" \
    -numReduceTasks 0 \
    -inputformat org.apache.hadoop.streaming.BinaryInputFormat

2. The map task's input is a binary file, its output is also binary, and there is no reducer.
bin/hadoop jar contrib/streaming/hadoop-0.19.1.1-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper "convert -resize 200% - -" \
    -numReduceTasks 0 \
    -inputformat org.apache.hadoop.streaming.BinaryInputFormat \
    -outputformat org.apache.hadoop.streaming.BinaryOutputFormat

Note: convert is part of ImageMagick.

3. The map task's input and output are both binary, and the reducer's input is binary, but the reducer's output is text.
bin/hadoop jar contrib/streaming/hadoop-0.19.1.1-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper "convert -resize 200% - -" \
    -reducer "identify -" \
    -numReduceTasks 1 \
    -inputformat org.apache.hadoop.streaming.BinaryInputFormat \
    -outputformat org.apache.hadoop.streaming.BinaryOutputFormat

4. The map task's input and output are both binary, and the reducer's input and output are both binary as well.

This case is not supported.
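
For the first example, a custom script can stand in for "wc -c" as the mapper. The following is only a minimal Python sketch, assuming that BinaryInputFormat hands the mapper the raw bytes of its input on standard input; the comment implies this but does not spell out the details. The name count_bytes and the 64 KiB read size are made up for illustration.

```python
import io

BLOCK_SIZE = 64 * 1024  # read size chosen for illustration, not taken from the patch


def count_bytes(stream, block_size=BLOCK_SIZE):
    """Count the bytes in a binary stream by reading it block by block until EOF."""
    total = 0
    while True:
        block = stream.read(block_size)
        if not block:  # an empty read means end of stream
            break
        total += len(block)
    return total


# Demonstration on an in-memory stream standing in for the mapper's stdin.
sample = io.BytesIO(b"\x00\xff" * 50000)
total = count_bytes(sample)
print(total)  # prints 100000
```

In a streaming job the script would call count_bytes(sys.stdin.buffer) instead of using the in-memory sample, and the printed number becomes the mapper's text output.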

> Implement a binary input/output format for Streaming
> ----------------------------------------------------
>
>                 Key: HADOOP-3227
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3227
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: Arun C Murthy
>            Assignee: Arun C Murthy
>         Attachments: hadoop-0.19.1.1-streaming.jar
>
>
> Lots of streaming applications process textual data with 1 record per line and fields separated by a delimiter. It turns out that there is no point in using any of Hadoop's input/output formats, since the streaming script/binary itself will parse the input and break it into records and fields. In such cases we should provide users with a binary input/output format which just sends 64k (or so) blocks of data directly from HDFS to the streaming application.
> I did something very similar for Pig-Streaming (PIG-94 - BinaryStorage), which resulted in a 300%+ speedup for scanning (identity mapper & map-only jobs) data... the parsing done by input/output formats in these cases was pure overhead.
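
The idea in the description, a streaming script that receives raw blocks and does its own record and field parsing, can be sketched roughly as below. This is an illustration under assumptions, not code from the issue or the attached jar: the name iter_records, the tab delimiter, and newline-terminated records are all hypothetical choices. The main point is that a record may straddle two blocks, so the unfinished tail of each block is carried over to the next one.

```python
import io

BLOCK_SIZE = 64 * 1024  # "64k (or so)", as in the issue description


def iter_records(stream, delimiter=b"\t", block_size=BLOCK_SIZE):
    """Yield each newline-terminated record as a list of byte-string fields."""
    pending = b""  # tail of the previous block that did not end in a newline
    while True:
        block = stream.read(block_size)
        if not block:  # end of stream
            break
        lines = (pending + block).split(b"\n")
        pending = lines.pop()  # the last piece may be an incomplete record
        for line in lines:
            yield line.split(delimiter)
    if pending:  # final record without a trailing newline
        yield pending.split(delimiter)


# Demonstration on an in-memory stream; the tiny block size forces
# records to straddle block boundaries.
sample = io.BytesIO(b"a\t1\nb\t2\nc\t3")
records = list(iter_records(sample, block_size=3))
```

A real streaming script would pass sys.stdin.buffer as the stream and keep the default block size.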
