Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2010/09/29 17:21:10 UTC

[Hadoop Wiki] Update of "HadoopStreaming" by WimDepoorter

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "HadoopStreaming" page has been changed by WimDepoorter.
The comment on this change is: changed the location of streaming jar from "build/hadoop-streaming.jar" to "$HADOOP_HOME/mapred/contrib/streaming/hadoop-0.xx.y-streaming.jar".
http://wiki.apache.org/hadoop/HadoopStreaming?action=diff&rev1=11&rev2=12

--------------------------------------------------

  Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer.
  
  {{{
- Usage: $HADOOP_HOME/bin/hadoop jar build/hadoop-streaming.jar [options]
+ Usage: $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/mapred/contrib/streaming/hadoop-streaming.jar [options]
  Options:
    -input    <path>                   DFS input file(s) for the Map step
    -output   <path>                   DFS output directory for the Reduce step
@@ -55, +54 @@

     -cmdenv EXAMPLE_DIR=/home/example/dictionaries/
  
  Shortcut to run from any directory:
-    setenv HSTREAMING "$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/build/hadoop-streaming.jar"
+    setenv HSTREAMING "$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/mapred/contrib/streaming/hadoop-streaming.jar"
  
  Example: $HSTREAMING -mapper "/usr/local/bin/perl5 filter.pl"
             -file /local/filter.pl -input "/logs/0604*/*" [...]
    Ships a script, invokes the non-shipped perl interpreter
    Shipped files go to the working directory so filter.pl is found by perl
    Input files are all the daily logs for days in month 2006-04
  }}}
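  For orientation, here is one minimal end-to-end invocation using only standard Unix utilities as the mapper and reducer.  This is a sketch rather than part of the usage text above: the input and output paths are placeholders, and `setenv` above is csh syntax (in bash the shortcut would be `export HSTREAMING="..."`).
  {{{
  $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/mapred/contrib/streaming/hadoop-streaming.jar \
      -input   /user/me/streaming-demo/input  \
      -output  /user/me/streaming-demo/output \
      -mapper  /bin/cat \
      -reducer /usr/bin/wc
  }}}
  With `cat` as the mapper and `wc` as the reducer, a single reduce task simply reports the line, word and byte counts of the input; the example is only meant to show the shape of a complete command.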
  == Practical Help ==
  Using the streaming system you can develop working Hadoop jobs with ''extremely'' limited knowledge of Java.  At its simplest, your development task is to write two shell scripts that work well together; let's call them '''shellMapper.sh''' and '''shellReducer.sh'''.  On a machine that doesn't even have Hadoop installed, you can get first drafts of these working by writing them to work in this way:
  
  {{{
  cat someInputFile | shellMapper.sh | shellReducer.sh > someOutputFile
  }}}
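  As a first concrete sketch (these two scripts are invented for illustration, not taken from the page), a pair that counts the lines mentioning ERROR could be as simple as:
  {{{
  #!/bin/sh
  # shellMapper.sh: pass through only the lines that mention ERROR
  awk '/ERROR/'
  }}}
  {{{
  #!/bin/sh
  # shellReducer.sh: count how many lines the mappers emitted
  wc -l
  }}}
  A pair like this only yields one grand total when there is a single reduce task, but it shows that ordinary Unix filters can stand in for the mapper and reducer.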
  With streaming, Hadoop essentially becomes a system for making shell-script pipes work (with some fudging) on a cluster.  There's a strong logical correspondence between the Unix shell scripting environment and Hadoop Streaming jobs.  The cat pipeline above has somewhat less elegant syntax as a Hadoop job, but this is what it looks like:
  
  {{{
  stream -input /dfsInputDir/someInputData -file shellMapper.sh -mapper "shellMapper.sh" -file shellReducer.sh -reducer "shellReducer.sh" -output /dfsOutputDir/myResults
  }}}
  The real place the logical correspondence breaks down is that in a single-machine scripting environment shellMapper.sh and shellReducer.sh each run as a single process and data flows directly from one process to the other.  With Hadoop, the shellMapper.sh file is sent to every machine on the cluster that holds data chunks, and each such machine runs its own chunks through its shellMapper.sh process.  The output from those scripts ''doesn't'' get reduced on each of those machines.  Instead the output is sorted so that different lines from the various mapping jobs are streamed across the network to different machines (Hadoop defaults to four machines) where the reduce(s) can be performed.
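  In other words, the closest single-machine analogue of what Hadoop actually does puts a sort between the two scripts (again only a sketch of the analogy, not a command from the page):
  {{{
  cat someInputFile | shellMapper.sh | sort | shellReducer.sh > someOutputFile
  }}}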
  
  Here are practical tips for getting things working well: