Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2008/01/28 23:02:23 UTC

[Hadoop Wiki] Update of "HadoopStreaming" by JenniferRM

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The following page has been changed by JenniferRM:
http://wiki.apache.org/hadoop/HadoopStreaming

The comment on the change is:
Added some practical advice about getting streaming code to work.

------------------------------------------------------------------------------
  
  }}}
  
+ 
+ == Practical Help ==
+ Using the streaming system you can develop working Hadoop jobs with ''extremely'' limited knowledge of Java.  At its simplest, your development task is to write two shell scripts that work well together; let's call them '''shellMapper.sh''' and '''shellReducer.sh'''.  On a machine that doesn't even have Hadoop installed, you can get first drafts of these working by writing them so that this pipeline runs correctly:
+ 
+ {{{
+ cat someInputFile | shellMapper.sh | shellReducer.sh > someOutputFile
+ }}}
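+ 
+ For example, a minimal word-count pair might look something like this (a sketch only; any pair of filters that read stdin and write tab-separated lines to stdout will do):
+ 
+ {{{
+ #!/bin/bash
+ # shellMapper.sh - emit one "word<TAB>1" line for every word read from stdin
+ awk '{for (i = 1; i <= NF; i++) print $i "\t1"}'
+ }}}
+ 
+ {{{
+ #!/bin/bash
+ # shellReducer.sh - sum the counts for each word seen on stdin
+ awk -F'\t' '{sum[$1] += $2} END {for (w in sum) print w "\t" sum[w]}'
+ }}}
+ 
+ Remember to make both scripts executable (chmod +x) before testing the pipeline above.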
+ 
+ With streaming, Hadoop essentially becomes a system for making shell-scripting pipelines work (with some fudging) on a cluster.  There is a strong logical correspondence between the Unix shell scripting environment and Hadoop streaming jobs.  The Hadoop version of the example above has somewhat less elegant syntax, but this is what it looks like:
+ 
+ {{{
+ stream -input /dfsInputDir/someInputData -file shellMapper.sh -mapper "shellMapper.sh" -file shellReducer.sh  -reducer "shellReducer.sh" -output /dfsOutputDir/myResults  
+ }}}
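+ 
+ If your installation doesn't provide a ''stream'' command, the same job can usually be submitted through the streaming contrib jar instead; the exact jar path varies by release, so treat the one below as a placeholder:
+ 
+ {{{
+ bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
+     -input /dfsInputDir/someInputData \
+     -file shellMapper.sh  -mapper "shellMapper.sh" \
+     -file shellReducer.sh -reducer "shellReducer.sh" \
+     -output /dfsOutputDir/myResults
+ }}}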
+ 
+ The place where the logical correspondence really breaks down is that in a single-machine scripting environment shellMapper.sh and shellReducer.sh each run as one process, and data flows directly from one process to the other.  With Hadoop, the shellMapper.sh file is shipped to every machine in the cluster that holds a chunk of the input data, and each such machine runs its own chunk through its own copy of shellMapper.sh.  The output of those map processes is ''not'' reduced on the same machines.  Instead, the output is sorted so that lines from the various mapping jobs are streamed across the network to different machines (Hadoop defaults to four) where the reduce(s) can be performed.
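+ 
+ Because each reducer sees its input sorted by key, a closer single-machine approximation of what the cluster does is to put a sort between the two scripts:
+ 
+ {{{
+ cat someInputFile | shellMapper.sh | sort | shellReducer.sh > someOutputFile
+ }}}
+ 
+ If your reducer depends on all lines with the same key arriving together, this is the version to test against.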
+ 
+ Here are practical tips for getting things working well:
+ 
+ * '''Use shell scripts rather than inline commands''' - The "-file shellMapper.sh" part isn't strictly necessary.  You could pass a clause like "-mapper 'sed | grep | awk'" directly, but the complicated quoting that requires can introduce bugs.  Wrapping the job in a shell script avoids most of these issues.
+ 
+ * '''Don't expect shebangs to work''' - If you're going to run other scripts from inside your shell script, don't expect a line like #!/bin/python to work.  To be certain things will work, invoke the interpreter explicitly, e.g. "grep somethingInteresting | '''perl''' perlScript | sort | uniq -c".  A sketch combining both tips follows this list.
+ 
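+ Putting both tips together, a hypothetical shellMapper.sh that wraps the whole pipeline in one script and calls perl explicitly might look like:
+ 
+ {{{
+ #!/bin/bash
+ # shellMapper.sh - wrap the pipeline in a single script and call perl
+ # explicitly rather than relying on perlScript's #! line
+ grep somethingInteresting | perl perlScript | sort | uniq -c
+ }}}
+ 
+ Ship it to the nodes with "-file shellMapper.sh" so that every machine running a map task has a copy.
+ 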
+ For more, see HowToDebugMapReducePrograms.
+