Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2006/10/13 06:00:21 UTC

[Lucene-hadoop Wiki] Update of "HadoopMapReduce" by NigelDaley

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by NigelDaley:
http://wiki.apache.org/lucene-hadoop/HadoopMapReduce

The comment on the change is:
clean up some wording, formatting, and defects

------------------------------------------------------------------------------
  several Splits. The splitting does not know anything about the
  input file's internal logical structure; for example,
  line-oriented text files are split on arbitrary byte boundaries.
- Then a new !MapTask is created per !FileSplit.
+ Then a new map task is created per !FileSplit.
  
- When an individual !MapTask task starts it will open a new output
+ When an individual map task starts it will open a new output
- writer per configured Reduce task. It will then proceed to read
+ writer per configured reduce task. It will then proceed to read
  its !FileSplit using the [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/RecordReader.html RecordReader] it gets from the specified
  [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/InputFormat.html InputFormat]. !InputFormat parses the input and generates
- key-value pairs. !InputFormat must also handle records that may be split on the !FileSplit boundary - for example [http://svn.apache.org/viewcvs.cgi/lucene/hadoop/trunk/src/java/org/apache/hadoop/mapred/TextInputFormat.java?view=markup TextInputFormat] reads the last line of the !FileSplit past the split boundary and when it starts reading other than the first !FileSplit it ignores the content up to the first newline.
+ key-value pairs. !InputFormat must also handle records that may be split on the !FileSplit boundary. For example, [http://svn.apache.org/viewcvs.cgi/lucene/hadoop/trunk/src/java/org/apache/hadoop/mapred/TextInputFormat.java?view=markup TextInputFormat] will read the last line of the !FileSplit past the split boundary and, when reading any !FileSplit other than the first, !TextInputFormat ignores the content up to the first newline.
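
  The boundary handling can be illustrated with a small, self-contained
  sketch in plain Java (illustrative only, not the actual !TextInputFormat
  code): every split except the first skips the partial line at its start,
  and a split keeps reading past its end until the record it has started is
  complete.
  
  {{{
  import java.io.IOException;
  import java.io.RandomAccessFile;
  
  // Illustrative only: how a line-oriented reader can deliver complete
  // records even though split boundaries fall on arbitrary bytes.
  public class SplitLineReaderSketch {
  
    /** Prints the lines owned by the split covering bytes start to end of the file. */
    public static void readSplit(String path, long start, long end) throws IOException {
      RandomAccessFile in = new RandomAccessFile(path, "r");
      try {
        in.seek(start);
        if (start != 0) {
          in.readLine();  // skip the partial line; it belongs to the previous split
        }
        long pos;
        while ((pos = in.getFilePointer()) <= end) {
          String line = in.readLine();  // may read past 'end' to finish the last record
          if (line == null) {
            break;  // end of file
          }
          System.out.println(pos + "\t" + line);  // line-start offset as key, line as value
        }
      } finally {
        in.close();
      }
    }
  
    public static void main(String[] args) throws IOException {
      readSplit(args[0], Long.parseLong(args[1]), Long.parseLong(args[2]));
    }
  }
  }}}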
  
  It is not necessary for the !InputFormat to
- generate both "meaningful" keys and values. For example the
+ generate both meaningful keys ''and'' values. For example, the
- default !TextInputFormat's output consists of input lines as
+ default output from !TextInputFormat consists of input lines as
  values and, somewhat meaninglessly, line-start file offsets as
- values - most applications only use the lines and ignore the
+ keys - most applications only use the lines and ignore the
  offsets.
  
  As key-value pairs are read from the !RecordReader they are
  passed to the configured [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/Mapper.html Mapper]. The user-supplied Mapper does
  whatever it wants with the input pair and calls [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/OutputCollector.html#collect(org.apache.hadoop.io.WritableComparable,%20org.apache.hadoop.io.Writable) OutputCollector.collect] with key-value pairs of its own choosing. The output it
- generates must use one key class and one value class, because
+ generates must use one key class and one value class.  This is because
- the Map output will be eventually written into a [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/io/SequenceFile.html SequenceFile],
+ the Map output will be written into a [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/io/SequenceFile.html SequenceFile]
- which has per file type information and all the records must
+ which has per-file type information and all the records must
  have the same type (use subclassing if you want to output
  different data structures). The Map input and output key-value
- pairs are not related typewise or in cardinality.
+ pairs are not necessarily related typewise or in cardinality.
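
  As a concrete illustration, here is a minimal word-count style Mapper
  sketched against the old {{{org.apache.hadoop.mapred}}} API. The class
  name {{{WordCountMapper}}} is invented for this page, and the
  {{{Text}}}/{{{IntWritable}}} classes may differ from the exact classes
  shipped with the release this page describes:
  
  {{{
  import java.io.IOException;
  import java.util.StringTokenizer;
  
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.io.WritableComparable;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;
  
  public class WordCountMapper implements Mapper {
    private static final IntWritable ONE = new IntWritable(1);
  
    public void configure(JobConf job) {}      // no per-job setup needed
    public void close() throws IOException {}  // nothing to release
  
    // key   = byte offset of the line (ignored, as most applications do)
    // value = one line of input text
    public void map(WritableComparable key, Writable value,
                    OutputCollector output, Reporter reporter) throws IOException {
      StringTokenizer words = new StringTokenizer(value.toString());
      while (words.hasMoreTokens()) {
        // Every collected pair uses the same key class (Text) and the same
        // value class (IntWritable), as required for the map-side SequenceFile.
        output.collect(new Text(words.nextToken()), ONE);
      }
    }
  }
  }}}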
  
  When Mapper output is collected, it is partitioned, which means
  that it will be written to the output specified by the
- [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/Partitioner.html Partitioner]. The default [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/lib/HashPartitioner.html HashPartitioner] uses the key value's
+ [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/Partitioner.html Partitioner]. The default [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/lib/HashPartitioner.html HashPartitioner] uses the
+ key's hashCode method (which means that hashCode must distribute keys well in order to achieve an even workload across the reduce tasks). See [http://svn.apache.org/viewcvs.cgi/lucene/hadoop/trunk/src/java/org/apache/hadoop/mapred/MapTask.java?view=markup MapTask] for details.
- hashcode (which means that for even workload on the Reduce tasks
- the key class hashCode must be good).
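
  Roughly, the default partitioning boils down to hashing the key, as in
  the following plain-Java sketch (illustrative only, not the actual
  !HashPartitioner source):
  
  {{{
  // Chooses the reduce task that a given map output key is routed to.
  public class PartitionSketch {
    public static int getPartition(Object key, int numReduceTasks) {
      // Masking with Integer.MAX_VALUE keeps the hash non-negative before the
      // modulo; a poor hashCode on the key class skews work across reducers.
      return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
  }
  }}}
  
  A job that needs different routing can supply its own Partitioner
  implementation through the job configuration
  ({{{JobConf.setPartitionerClass}}}).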
  
- N input files will generate m map tasks to be run and each map
+ N input files will generate M map tasks to be run, and each map
  task will generate as many output files as there are reduce
  tasks configured in the system. Each output file will be
  targeted at a specific reduce task and the map output pairs from
- all the Map tasks will be routed so that all pairs for a given
+ all the map tasks will be routed so that all pairs for a given
  key end up in files targeted at a specific reduce task.
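
  For instance, a job whose input is divided into 10 !FileSplits runs 10
  map tasks; with 4 configured reduce tasks this yields 10 × 4 = 40
  intermediate map output files, and each reduce task later fetches the
  10 files targeted at it.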
  
  == Combine ==
- The rationale behind using a Combiner is that as the Map
+ When the map
  operation outputs its pairs, they are already available in
+ memory. For efficiency reasons, sometimes it makes sense to
+ take advantage of this fact by supplying a combiner class
- memory for a reduce-type function. If a Combiner is used the
+ to perform a reduce-type function. If a combiner is used then the
- Map output is not immediately written to the output. Instead it
+ map key-value pairs are not immediately written to the output. Instead they
  will be collected in lists, one list per key. When a
- certain number of key-value pairs has been written this buffer
+ certain number of key-value pairs have been written, this buffer
  is flushed by passing all the values of each key to the
- Combiner's reduce method and outputting the key-value pairs of
+ combiner's reduce method and outputting the key-value pairs of
- the combine operation as they were created by the original map
+ the combine operation as if they were created by the original map
  operation.
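
  The buffering described above can be pictured with a small plain-Java
  sketch (illustrative only, not Hadoop code; it assumes integer values,
  as in the word-count example that follows, and an arbitrary flush
  threshold):
  
  {{{
  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;
  
  // Map output is buffered in per-key lists and periodically flushed through
  // a reduce-type (combine) function before anything is written out.
  public class CombineBufferSketch {
    private static final int FLUSH_THRESHOLD = 100000;  // "a certain number of pairs"
  
    private final Map<String, List<Integer>> buffer = new HashMap<String, List<Integer>>();
    private int buffered = 0;
  
    /** Called for every key-value pair the map operation collects. */
    public void collect(String key, int value) {
      List<Integer> values = buffer.get(key);
      if (values == null) {
        values = new ArrayList<Integer>();
        buffer.put(key, values);
      }
      values.add(value);
      if (++buffered >= FLUSH_THRESHOLD) {
        flush();
      }
    }
  
    /** Runs the combine function once per key and emits its output pairs. */
    public void flush() {
      for (Map.Entry<String, List<Integer>> entry : buffer.entrySet()) {
        int sum = 0;  // the word-count combine function: sum the buffered values
        for (int v : entry.getValue()) {
          sum += v;
        }
        System.out.println(entry.getKey() + "\t" + sum);  // emitted as if the map had produced it
      }
      buffer.clear();
      buffered = 0;
    }
  }
  }}}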
  
- For example a word count !MapReduce application whose Map
+ For example, a word count !MapReduce application whose map
- operation outputs (word, 1) pairs as words are encountered in
+ operation outputs (''word'', 1) pairs as words are encountered in
  the input can use a combiner to speed up processing. A combine
- operation will start gathering the output in in-memory (instead
+ operation will start gathering the output in in-memory lists (instead
- of on disk) lists, one list per word. Once a certain number of
+ of on disk), one list per word. Once a certain number of
- pairs is output the combine operation will be called once per
+ pairs is output, the combine operation will be called once per
  unique word with the list available as an iterator. The combiner
- then emits (word, count-in-this-part-of-the-input) pairs. From
+ then emits (''word'', count-in-this-part-of-the-input) pairs. From
  the viewpoint of the Reduce operation, this contains the same
- information as the original Map output, but there might be a lot
+ information as the original Map output, but there should be far fewer
- less bits to output to disk and read from disk.
+ pairs output to disk and read from disk.
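
  The combine function for the word-count example could look like the
  following sketch against the old {{{org.apache.hadoop.mapred.Reducer}}}
  interface (the class name {{{SumReducer}}} is invented for this page):
  
  {{{
  import java.io.IOException;
  import java.util.Iterator;
  
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.WritableComparable;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;
  
  public class SumReducer implements Reducer {
  
    public void configure(JobConf job) {}      // no per-job setup needed
    public void close() throws IOException {}  // nothing to release
  
    // As the combiner: key = word, values = the 1s buffered on one map task.
    // As the reducer:  key = word, values = the partial counts from all map tasks.
    public void reduce(WritableComparable key, Iterator values,
                       OutputCollector output, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += ((IntWritable) values.next()).get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
  }}}
  
  Because its output pairs have the same classes as its input pairs, the
  same class can serve as both the combiner and the reducer of the job
  (registered with {{{JobConf.setCombinerClass}}} and
  {{{JobConf.setReducerClass}}} respectively).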
  
  == Reduce ==
- When a reduce task starts its input is scattered in many files, across all nodes where map tasks ran. If run in
+ When a reduce task starts, its input is scattered in many files across all the nodes where map tasks ran. If run in
  distributed mode, these first need to be copied to the local
  filesystem in a ''copy phase'' (see [http://svn.apache.org/viewcvs.cgi/lucene/hadoop/trunk/src/java/org/apache/hadoop/mapred/ReduceTaskRunner.java?view=markup ReduceTaskRunner]).
  
  Once all the data is available locally, it is appended to one
- file (''append phase''). The file is then merge sorted so that the key-value pairs for
+ file in an ''append phase''. The file is then merge sorted so that the key-value pairs for
  a given key are contiguous (''sort phase''). This makes the actual reduce operation simple: the file is
  read sequentially and the values are passed to the reduce method
  with an iterator reading the input file until the next key
  value is encountered. See [http://svn.apache.org/viewcvs.cgi/lucene/hadoop/trunk/src/java/org/apache/hadoop/mapred/ReduceTask.java?view=markup	 ReduceTask] for details.
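
  The effect of the sort can be pictured with a small plain-Java sketch
  (illustrative only, not the actual !ReduceTask code): because all pairs
  for a key are contiguous, one sequential pass over the sorted file is
  enough to feed each group of values to the reduce function.
  
  {{{
  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;
  
  // Reads a merge-sorted file of tab-separated "key<TAB>value" lines and
  // applies a word-count style reduce (sum) to each contiguous key group.
  public class SortedReduceSketch {
    public static void main(String[] args) throws IOException {
      BufferedReader in = new BufferedReader(new FileReader(args[0]));
      String currentKey = null;
      int sum = 0;
      String line;
      while ((line = in.readLine()) != null) {
        int tab = line.indexOf('\t');
        String key = line.substring(0, tab);
        int value = Integer.parseInt(line.substring(tab + 1));
        if (currentKey != null && !key.equals(currentKey)) {
          System.out.println(currentKey + "\t" + sum);  // key changed: emit the finished group
          sum = 0;
        }
        currentKey = key;
        sum += value;
      }
      if (currentKey != null) {
        System.out.println(currentKey + "\t" + sum);    // emit the last group
      }
      in.close();
    }
  }
  }}}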
  
- In the end the output will consist of one output file per Reduce
+ At the end, the output will consist of one output file per executed reduce
- task run. The format of the files can be specified with
+ task. The format of the files can be specified with
- [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/JobConf.html#setOutputFormat(java.lang.Class) JobConf.setOutputFormat]. If !SequentialOutputFormat is used the output Key and Value
+ [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/JobConf.html#setOutputFormat(java.lang.Class) JobConf.setOutputFormat]. If !SequenceFileOutputFormat is used then the output key and value
  classes must also be specified.
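
  To make the configuration concrete, here is a hedged sketch of a
  complete driver for the word-count example, written against the old
  {{{org.apache.hadoop.mapred}}} API. The class names ({{{WordCount}}},
  {{{WordCountMapper}}}, {{{SumReducer}}}) are the hypothetical ones used
  in the sketches above, and the path-setting and output-format calls
  changed across releases, so treat this as illustrative rather than exact:
  
  {{{
  import java.io.IOException;
  
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.SequenceFileOutputFormat;
  
  // Hypothetical driver tying the earlier sketches together.
  public class WordCount {
    public static void main(String[] args) throws IOException {
      JobConf conf = new JobConf(WordCount.class);
      conf.setJobName("wordcount");
  
      conf.setMapperClass(WordCountMapper.class);  // map sketch above
      conf.setCombinerClass(SumReducer.class);     // combine sketch above
      conf.setReducerClass(SumReducer.class);      // the reduce phase reuses the same class
  
      // A SequenceFile output needs the key and value classes declared explicitly.
      conf.setOutputFormat(SequenceFileOutputFormat.class);
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);
  
      conf.setInputPath(new Path(args[0]));
      conf.setOutputPath(new Path(args[1]));
  
      JobClient.runJob(conf);
    }
  }
  }}}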