Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2006/06/09 02:44:11 UTC
[Lucene-hadoop Wiki] Update of "HadoopStreaming" by MichelTourn
Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.
The following page has been changed by MichelTourn:
http://wiki.apache.org/lucene-hadoop/HadoopStreaming
------------------------------------------------------------------------------
{{{
+ Usage: $HADOOP_HOME/bin/hadoop jar build/hadoop-streaming.jar [options]
- % bin/hadoopStreaming
-
- Usage: hadoopStreaming [options]
Options:
- -input <path> DFS input file(s) for the Map step
+ -input <path> DFS input file(s) for the Map step
- -output <path> DFS output directory for the Reduce step
+ -output <path> DFS output directory for the Reduce step
- -mapper <cmd> The streaming command to run
+ -mapper <cmd> The streaming command to run
+ -combiner <cmd> Not implemented. But you can pipe the mapper output
- -reducer <cmd> The streaming command to run
+ -reducer <cmd> The streaming command to run
- -files <file> Additional files to be shipped in the Job jar file
+ -file <file> File/dir to be shipped in the Job jar file
- -cluster <name> Default uses hadoop-default.xml and hadoop-site.xml
+ -cluster <name> Default uses hadoop-default.xml and hadoop-site.xml
- -config <file> Optional. One or more paths to xml config files
+ -config <file> Optional. One or more paths to xml config files
+ -dfs <h:p> Optional. Override DFS configuration
+ -jt <h:p> Optional. Override JobTracker configuration
- -inputreader <spec> Optional. See below
+ -inputreader <spec> Optional.
+ -jobconf <n>=<v> Optional.
+ -cmdenv <n>=<v> Optional. Pass env.var to streaming commands
-verbose
In -input: globbing on <path> is supported and can have multiple -input
@@ -24, +27 @@
Ex: -inputreader 'StreamXmlRecordReader,begin=<doc>,end=</doc>'
Map output format, reduce input/output format:
Format defined by what mapper command outputs. Line-oriented
+
- Mapper and Reducer <cmd> syntax:
- If the mapper or reducer programs are prefixed with noship: then
- the paths are assumed to be valid absolute paths on the task tracker machines
- and are NOT packaged with the Job jar file.
Use -cluster <name> to switch between "local" Hadoop and one or more remote
Hadoop clusters.
The default is to use the normal hadoop-default.xml and hadoop-site.xml
Else configuration will use $HADOOP_HOME/conf/hadoop-<name>.xml
+ To set the number of reduce tasks (num. of output files):
+ -jobconf mapred.reduce.tasks=10
+ To change the local temp directory:
+ -jobconf dfs.data.dir=/tmp
+ Additional local temp directories with -cluster local:
+ -jobconf mapred.local.dir=/tmp/local
+ -jobconf mapred.system.dir=/tmp/system
+ -jobconf mapred.temp.dir=/tmp/temp
+ For more details about jobconf parameters see:
+ http://wiki.apache.org/lucene-hadoop/JobConfFile
+ To set an environment variable in a streaming command:
+ -cmdenv EXAMPLE_DIR=/home/example/dictionaries/
+
+ Shortcut to run from any directory:
+ setenv HSTREAMING "$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/build/hadoop-streaming.jar"
+
- Example: hadoopStreaming -mapper "noship:/usr/local/bin/perl5 filter.pl"
+ Example: $HSTREAMING -mapper "/usr/local/bin/perl5 filter.pl"
- -files /local/filter.pl -input "/logs/0604*/*" [...]
+ -file /local/filter.pl -input "/logs/0604*/*" [...]
Ships a script, invokes the non-shipped perl interpreter
Shipped files go to the working directory so filter.pl is found by perl
- Input files are all the daily logs for days in month 2006-04
+ Input files are all the daily logs for days in month 2006-04
+
}}}
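
The usage text above describes the streaming contract: the mapper and reducer are ordinary commands that read lines on stdin and write lines on stdout, with Hadoop sorting the mapper output by key before the reduce step. As a rough sketch (not from the wiki page itself), that data flow can be simulated locally with plain Unix pipes; the inline word-count mapper/reducer here is a hypothetical example, not part of the Hadoop distribution:

```shell
# Simulated streaming job: mapper | shuffle | reducer, all line-oriented.
# mapper: emit one word (key) per line
# shuffle: sort groups identical keys together, as Hadoop does between map and reduce
# reducer: count consecutive occurrences of each key and emit "key<TAB>count"
printf 'a b a\nb a\n' \
  | tr ' ' '\n' \
  | sort \
  | uniq -c \
  | awk '{print $2 "\t" $1}'
```

On a real cluster the same mapper and reducer commands would be passed via `-mapper` and `-reducer`, with Hadoop supplying the input split, shuffle, and output directory in place of the pipes.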