Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2006/06/08 23:16:32 UTC

[Lucene-hadoop Wiki] Update of "HadoopStreaming" by DougCutting

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by DougCutting:
http://wiki.apache.org/lucene-hadoop/HadoopStreaming

The comment on the change is:
initial page on HadoopStreaming

New page:
{{{

% bin/hadoopStreaming

Usage: hadoopStreaming [options]
Options:
  -input <path> DFS input file(s) for the Map step
  -output <path> DFS output directory for the Reduce step
  -mapper <cmd> The streaming command to run as the mapper
  -reducer <cmd> The streaming command to run as the reducer
  -files <file> Additional files to be shipped in the Job jar file
  -cluster <name> Optional. Default uses hadoop-default.xml and hadoop-site.xml
  -config <file> Optional. One or more paths to xml config files
  -inputreader <spec> Optional. See below
  -verbose

For -input: globbing on <path> is supported, and -input may be given multiple times
Default Map input format: a line is a record in UTF-8
  the key part ends at first TAB, the rest of the line is the value
Custom Map input format: -inputreader package.MyRecordReader,n=v,n=v
  comma-separated name-values can be specified to configure the InputFormat
  Ex: -inputreader 'StreamXmlRecordReader,begin=<doc>,end=</doc>'
Map output format, reduce input/output format:
  Defined by what the mapper command outputs; line-oriented
Mapper and Reducer <cmd> syntax:
  If the mapper or reducer programs are prefixed with noship: then
  the paths are assumed to be valid absolute paths on the task tracker machines
  and are NOT packaged with the Job jar file.
Use -cluster <name> to switch between "local" Hadoop and one or more remote
  Hadoop clusters.
  The default uses the normal hadoop-default.xml and hadoop-site.xml;
  otherwise configuration is read from $HADOOP_HOME/conf/hadoop-<name>.xml

Example: hadoopStreaming -mapper "noship:/usr/local/bin/perl5 filter.pl"
           -files /local/filter.pl -input "/logs/0604*/*" [...]
  Ships a script and invokes the non-shipped perl interpreter
  Shipped files go to the task's working directory, so filter.pl is found by perl
  Input files are all the daily logs for the month 2006-04
}}}
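
The record format above (line-oriented, key ending at the first TAB) means any program that reads stdin and writes stdout can serve as a mapper or reducer. As a sketch, a word-count pair in Python might look like the following; the script name wc.py and the word-count task are illustrative, not part of the page:

```python
import sys

# Hypothetical example of the streaming record format described above:
# each output line is one record of the form "key<TAB>value".

def map_records(lines):
    # Mapper: emit one "word<TAB>1" record per word of input.
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reduce_records(lines):
    # Reducer: sum the counts for each key.
    totals = {}
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        totals[key] = totals.get(key, 0) + int(value)
    for key in sorted(totals):
        yield "%s\t%d" % (key, totals[key])

if __name__ == "__main__" and len(sys.argv) > 1:
    # Run as "python wc.py map" or "python wc.py reduce"; each step
    # reads records from stdin and writes records to stdout.
    step = map_records if sys.argv[1] == "map" else reduce_records
    for record in step(sys.stdin):
        print(record)
```

Such a script would be shipped with -files wc.py and invoked as -mapper 'python wc.py map' and -reducer 'python wc.py reduce' (assuming a python interpreter is available on the task tracker machines, or shipped/referenced via noship: as shown in the perl example).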