Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2006/06/09 02:44:11 UTC

[Lucene-hadoop Wiki] Update of "HadoopStreaming" by MichelTourn

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by MichelTourn:
http://wiki.apache.org/lucene-hadoop/HadoopStreaming

------------------------------------------------------------------------------
  {{{
  
+ Usage: $HADOOP_HOME/bin/hadoop jar build/hadoop-streaming.jar [options]
- % bin/hadoopStreaming
- 
- Usage: hadoopStreaming [options]
  Options:
-   -input <path> DFS input file(s) for the Map step
+   -input    <path>     DFS input file(s) for the Map step
-   -output <path> DFS output directory for the Reduce step
+   -output   <path>     DFS output directory for the Reduce step
-   -mapper <cmd> The streaming command to run
+   -mapper   <cmd>      The streaming command to run
+   -combiner <cmd>      Not implemented. But you can pipe the mapper output through another command
-   -reducer <cmd> The streaming command to run
+   -reducer  <cmd>      The streaming command to run
-   -files <file> Additional files to be shipped in the Job jar file
+   -file     <file>     File/dir to be shipped in the Job jar file
-   -cluster <name> Default uses hadoop-default.xml and hadoop-site.xml
+   -cluster  <name>     Default uses hadoop-default.xml and hadoop-site.xml
-   -config <file> Optional. One or more paths to xml config files
+   -config   <file>     Optional. One or more paths to xml config files
+   -dfs      <h:p>      Optional. Override DFS configuration
+   -jt       <h:p>      Optional. Override JobTracker configuration
-   -inputreader <spec> Optional. See below
+   -inputreader <spec>  Optional.
+   -jobconf  <n>=<v>    Optional.
+   -cmdenv   <n>=<v>    Optional. Pass env.var to streaming commands
    -verbose
  
  In -input: globbing on <path> is supported, and multiple -input options may be given
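  A minimal end-to-end invocation (the paths and the choice of /bin/cat and
  wc as commands are illustrative only, not from this page):
    $HADOOP_HOME/bin/hadoop jar build/hadoop-streaming.jar \
      -input  "/user/me/in/*" \
      -output /user/me/out \
      -mapper /bin/cat \
      -reducer "/usr/bin/wc -l"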
@@ -24, +27 @@

    Ex: -inputreader 'StreamXmlRecordReader,begin=<doc>,end=</doc>'
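  Used in a full command this might look like (paths are made up):
    ... -input /data/docs.xml -inputreader 'StreamXmlRecordReader,begin=<doc>,end=</doc>' ...
  so that everything between <doc> and </doc> reaches the mapper as one record.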
  Map output format, reduce input/output format:
    Format defined by what mapper command outputs. Line-oriented
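  For example, a word-count mapper would emit one line per word, e.g.:
    red<TAB>1
    blue<TAB>1
  By the usual streaming convention the text up to the first tab of a line is
  the key used for sorting between map and reduce (a convention from later
  streaming docs, not spelled out on this page).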
+ 
- Mapper and Reducer <cmd> syntax:
-   If the mapper or reducer programs are prefixed with noship: then
-   the paths are assumed to be valid absolute paths on the task tracker machines
-   and are NOT packaged with the Job jar file.
  Use -cluster <name> to switch between "local" Hadoop and one or more remote
    Hadoop clusters.
    The default is to use the normal hadoop-default.xml and hadoop-site.xml
    Else configuration will use $HADOOP_HOME/conf/hadoop-<name>.xml
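  For example, a job can be pointed at a cluster described by
  $HADOOP_HOME/conf/hadoop-myGrid.xml (the name "myGrid" is invented here) with:
    -cluster myGrid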
  
+ To set the number of reduce tasks (num. of output files):
+   -jobconf mapred.reduce.tasks=10
+ To change the local temp directory:
+   -jobconf dfs.data.dir=/tmp
+ Additional local temp directories with -cluster local:
+   -jobconf mapred.local.dir=/tmp/local
+   -jobconf mapred.system.dir=/tmp/system
+   -jobconf mapred.temp.dir=/tmp/temp
+ For more details about jobconf parameters see:
+   http://wiki.apache.org/lucene-hadoop/JobConfFile
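  Several -jobconf options may be combined on one command line, e.g.:
    ... -jobconf mapred.reduce.tasks=10 -jobconf mapred.local.dir=/tmp/local ...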
+ To set an environment variable in a streaming command:
+    -cmdenv EXAMPLE_DIR=/home/example/dictionaries/
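  The streaming command then sees EXAMPLE_DIR like any other environment
  variable. A mapper script could use it like this (a sketch; the script and
  dictionary file names are invented):
    #!/bin/sh
    # keep only input lines that contain a dictionary word
    exec grep -F -f "$EXAMPLE_DIR/words.txt"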
+ 
+ Shortcut to run from any directory:
+    setenv HSTREAMING "$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/build/hadoop-streaming.jar"
+ 
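  (setenv is csh syntax; the sh/bash equivalent is
   export HSTREAMING="$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/build/hadoop-streaming.jar")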
- Example: hadoopStreaming -mapper "noship:/usr/local/bin/perl5 filter.pl"
+ Example: $HSTREAMING -mapper "/usr/local/bin/perl5 filter.pl"
-            -files /local/filter.pl -input "/logs/0604*/*" [...]
+            -file /local/filter.pl -input "/logs/0604*/*" [...]
    Ships a script, invokes the non-shipped perl interpreter
    Shipped files go to the working directory so filter.pl is found by perl
-   Input files are all the daily logs for days in month 2006-04 
+   Input files are all the daily logs for days in month 2006-04
+ 
  }}}
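
A complete run in the same style, as a sketch (the input/output paths and the
word-count mapper/reducer below are this note's assumptions, not part of the
page above):

{{{
# wcmap.sh -- emit one word per line; each word becomes a key
cat > wcmap.sh <<'EOF'
#!/bin/sh
tr ' ' '\n'
EOF

$HSTREAMING \
    -input   "/user/me/books/*" \
    -output  /user/me/wordcount \
    -mapper  "sh wcmap.sh" \
    -reducer "uniq -c" \
    -file    wcmap.sh
}}}

Because streaming sorts the mapper output lines before the reduce step,
"uniq -c" sees identical words grouped together and so prints a count per
distinct word.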