Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2006/06/08 23:16:32 UTC
[Lucene-hadoop Wiki] Update of "HadoopStreaming" by DougCutting
The following page has been changed by DougCutting:
http://wiki.apache.org/lucene-hadoop/HadoopStreaming
The comment on the change is:
initial page on HadoopStreaming
New page:
{{{
% bin/hadoopStreaming
Usage: hadoopStreaming [options]
Options:
  -input <path>       DFS input file(s) for the Map step
  -output <path>      DFS output directory for the Reduce step
  -mapper <cmd>       The streaming command to run as the Map step
  -reducer <cmd>      The streaming command to run as the Reduce step
  -files <file>       Additional files to be shipped in the Job jar file
  -cluster <name>     Default uses hadoop-default.xml and hadoop-site.xml
  -config <file>      Optional. One or more paths to xml config files
  -inputreader <spec> Optional. See below
  -verbose
In -input: globbing on <path> is supported, and multiple -input options may be given
Default Map input format: a line is a record in UTF-8;
the key part ends at the first TAB, the rest of the line is the value
Custom Map input format: -inputreader package.MyRecordReader,n=v,n=v
comma-separated name-values can be specified to configure the InputFormat
Ex: -inputreader 'StreamXmlRecordReader,begin=<doc>,end=</doc>'
Map output format, reduce input/output format:
Format defined by what mapper command outputs. Line-oriented
Mapper and Reducer <cmd> syntax:
If the mapper or reducer programs are prefixed with noship: then
the paths are assumed to be valid absolute paths on the task tracker machines
and are NOT packaged with the Job jar file.
Use -cluster <name> to switch between "local" Hadoop and one or more remote
Hadoop clusters.
The default is to use the normal hadoop-default.xml and hadoop-site.xml.
Otherwise configuration will use $HADOOP_HOME/conf/hadoop-<name>.xml
Example: hadoopStreaming -mapper "noship:/usr/local/bin/perl5 filter.pl"
-files /local/filter.pl -input "/logs/0604*/*" [...]
Ships a script and invokes the non-shipped perl interpreter.
Shipped files go to the working directory, so filter.pl is found by perl.
Input files are all the daily logs for month 2006-04.
}}}
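Not part of the original page: a minimal sketch of a mapper/reducer pair that follows the line-oriented, TAB-separated record format described above. The word-count logic, function names, and the local pipeline simulation are illustrative assumptions, not code from the wiki; a real job would ship two such scripts with -files and pass them via -mapper and -reducer.

```python
#!/usr/bin/env python
# Hypothetical word-count mapper/reducer pair illustrating the streaming
# record contract: each output line is "key<TAB>value" in UTF-8.
import sys


def mapper(lines):
    """Emit one 'word\t1' record per word, as a streaming -mapper would."""
    for line in lines:
        for word in line.strip().split():
            yield "%s\t1" % word


def reducer(lines):
    """Sum counts per key; assumes input is sorted by key, which the
    framework guarantees between the Map and Reduce steps."""
    current, total = None, 0
    for line in lines:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current:
            if current is not None:
                yield "%s\t%d" % (current, total)
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield "%s\t%d" % (current, total)


if __name__ == "__main__":
    # Simulate the streaming pipeline locally: map, sort by key, reduce.
    records = ["the quick brown fox", "the lazy dog"]
    for out in reducer(sorted(mapper(records))):
        print(out)
```

On a cluster the same pair could be invoked, per the usage above, as something like `hadoopStreaming -mapper "python mapper.py" -reducer "python reducer.py" -files ...` with each script reading records from stdin and writing them to stdout.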