Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2011/08/26 18:00:35 UTC

[Nutch Wiki] Trivial Update of "MapReduce" by LewisJohnMcgibbney

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "MapReduce" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/MapReduce?action=diff&rev1=7&rev2=8

+ = How Map and Reduce operations are actually carried out =
+ == Introduction ==
+ 
+ This document describes how MapReduce operations are carried out in Hadoop. If you are not familiar with the Google [[http://labs.google.com/papers/mapreduce.html|MapReduce]] programming model you should get acquainted with it first.
+ 
  [[http://weblogs.java.net/blog/tomwhite/archive/2005/09/mapreduce.html#more|"Excerpt from TomWhite's blog: MapReduce"]]<<BR>>
- 
-  * MapReduce is the brainchild of Google and is very well documented by Jeffrey Dean and Sanjay Ghemawat in their paper [[http://labs.google.com/papers/mapreduce.html|"MapReduce: Simplified Data Processing on Large Clusters"]].
  
   * In essence, it allows massive data sets to be processed in a distributed fashion by breaking the processing into many small computations of two types:
    1. A Map operation that transforms the input into an intermediate representation.
@@ -10, +13 @@

  
   * This processing model is ideal for the operations a search engine indexer like Nutch or Google needs to perform - like computing inlinks for URLs, or building inverted indexes - and it will [[attachment:Presentations/mapred.pdf|"transform Nutch"]] into a scalable, distributed search engine.
  
+ <<TableOfContents(3)>>
+ 
+ == Map ==
+ 
+ Because the Map operation is parallelized, the input file set is first
+ split into several pieces called [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapreduce/lib/input/FileSplit.html|FileSplits]]. If an individual file
+ is so large that it would affect seek time, it is split into
+ several !FileSplits. The splitting does not take the
+ input file's internal logical structure into account; for example,
+ line-oriented text files are split on arbitrary byte boundaries.
+ A new map task is then created per !FileSplit.
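+ 
+ As a rough illustration (this sketch is not from the Hadoop sources and assumes an already configured !JobConf named ''job''), the framework obtains the splits from the old org.apache.hadoop.mapred API roughly like this:
+ 
+ {{{
+ // Illustrative sketch only (old org.apache.hadoop.mapred API).
+ // The configured number of map tasks is just a hint; the InputFormat
+ // decides the actual splits and ignores the files' record structure.
+ InputSplit[] splits = job.getInputFormat().getSplits(job, job.getNumMapTasks());
+ for (InputSplit split : splits) {
+   if (split instanceof FileSplit) {
+     FileSplit fileSplit = (FileSplit) split;
+     // each FileSplit is just (path, start offset, length);
+     // one map task will be created for it
+     System.out.println(fileSplit.getPath() + " start=" + fileSplit.getStart()
+         + " length=" + fileSplit.getLength());
+   }
+ }
+ }}}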
+ 
+ When an individual map task starts, it opens a new output
+ writer per configured reduce task. It then proceeds to read
+ its !FileSplit using the [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/RecordReader.html|RecordReader]] it gets from the specified
+ [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapreduce/InputFormat.html|InputFormat]]. The !InputFormat parses the input and generates
+ key-value pairs. The !InputFormat must also handle records that are split on a !FileSplit boundary. For example, [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.html|TextInputFormat]] reads the last line of a !FileSplit past the split boundary and, when reading any !FileSplit other than the first, ignores the content up to the first newline.
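+ 
+ The read loop of a map task looks roughly like the following sketch, which approximates what the default !MapRunner does in the old org.apache.hadoop.mapred API; the variables ''inputFormat'', ''split'', ''mapper'', ''output'' and ''reporter'' are assumed to be in scope, and a text input is assumed:
+ 
+ {{{
+ // Roughly what the framework's default MapRunner does (old mapred API);
+ // names and types here are illustrative, assuming a text input.
+ RecordReader<LongWritable, Text> reader =
+     inputFormat.getRecordReader(split, job, reporter);
+ LongWritable key = reader.createKey();
+ Text value = reader.createValue();
+ try {
+   while (reader.next(key, value)) {   // false once the split is exhausted
+     mapper.map(key, value, output, reporter);
+   }
+ } finally {
+   reader.close();
+ }
+ }}}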
+ 
+ It is not necessary for the !InputFormat to
+ generate both meaningful keys ''and'' values. For example, the
+ default output from !TextInputFormat consists of input lines as
+ values and the (largely meaningless) byte offsets of the line starts as
+ keys - most applications only use the lines and ignore the
+ offsets.
+ 
+ As key-value pairs are read from the !RecordReader they are
+ passed to the configured [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapreduce/Mapper.html|Mapper]]. The user-supplied Mapper does
+ whatever it wants with the input pair and calls [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/OutputCollector.html#collect(org.apache.hadoop.io.WritableComparable,%20org.apache.hadoop.io.Writable)|OutputCollector.collect]] with key-value pairs of its own choosing. The output it
+ generates must use one key class and one value class. This is because
+ the Map output is written into a SequenceFile,
+ which has per-file type information, so all the records must
+ have the same type (use subclassing if you want to output
+ different data structures). The Map input and output key-value
+ pairs are not necessarily related in type or in cardinality.
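+ 
+ As an illustration, a minimal word-count style Mapper against the old org.apache.hadoop.mapred API could look like the sketch below; the class name !WordCountMapper is made up for this example:
+ 
+ {{{
+ import java.io.IOException;
+ import java.util.StringTokenizer;
+ 
+ import org.apache.hadoop.io.IntWritable;
+ import org.apache.hadoop.io.LongWritable;
+ import org.apache.hadoop.io.Text;
+ import org.apache.hadoop.mapred.MapReduceBase;
+ import org.apache.hadoop.mapred.Mapper;
+ import org.apache.hadoop.mapred.OutputCollector;
+ import org.apache.hadoop.mapred.Reporter;
+ 
+ // Input: (line offset, line text) from TextInputFormat; the offset key is ignored.
+ // Output: (word, 1) pairs - one key class (Text) and one value class (IntWritable).
+ public class WordCountMapper extends MapReduceBase
+     implements Mapper<LongWritable, Text, Text, IntWritable> {
+ 
+   private final static IntWritable ONE = new IntWritable(1);
+   private final Text word = new Text();
+ 
+   public void map(LongWritable key, Text value,
+                   OutputCollector<Text, IntWritable> output, Reporter reporter)
+       throws IOException {
+     StringTokenizer tokenizer = new StringTokenizer(value.toString());
+     while (tokenizer.hasMoreTokens()) {
+       word.set(tokenizer.nextToken());
+       output.collect(word, ONE);   // key and value classes stay fixed
+     }
+   }
+ }
+ }}}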
+ 
+ When Mapper output is collected it is partitioned, which means
+ that it will be written to the output specified by the
+ [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapreduce/Partitioner.html|Partitioner]]. The default [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapreduce/lib/partition/HashPartitioner.html|HashPartitioner]] uses the
+ key's hashCode method (which means that this hash function must distribute keys well in order to achieve an even workload across the reduce tasks). See [[http://svn.apache.org/viewcvs.cgi/hadoop/core/trunk/src/mapred/org/apache/hadoop/mapred/MapTask.java?view=markup|MapTask]] for details.
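+ 
+ The partitioning decision of !HashPartitioner boils down to something like the following simplified sketch (not a verbatim copy of the Hadoop source):
+ 
+ {{{
+ // Simplified sketch of what HashPartitioner does: the target reduce task
+ // is derived purely from the key's hashCode, so a poor hashCode skews the
+ // reduce workload.
+ public int getPartition(K key, V value, int numReduceTasks) {
+   return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
+ }
+ }}}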
+ 
+ N input files will generate M map tasks to be run and each map
+ task will generate as many output files as there are reduce
+ tasks configured in the system. Each output file will be
+ targeted at a specific reduce task and the map output pairs from
+ all the map tasks will be routed so that all pairs for a given
+ key end up in files targeted at a specific reduce task.
+ 
+ == Combine ==
+ When the map
+ operation outputs its pairs they are already available in
+ memory. For efficiency reasons, it sometimes makes sense to
+ take advantage of this fact by supplying a combiner class
+ to perform a reduce-type function. If a combiner is used, the
+ map key-value pairs are not immediately written to the output. Instead they
+ are collected in lists, one list per key. When a
+ certain number of key-value pairs have been collected, this buffer
+ is flushed by passing all the values of each key to the
+ combiner's reduce method and outputting the key-value pairs of
+ the combine operation as if they had been created by the original map
+ operation.
+ 
+ For example, a word count !MapReduce application whose map
+ operation outputs (''word'', 1) pairs as words are encountered in
+ the input can use a combiner to speed up processing. A combine
+ operation will start gathering the output in in-memory lists (instead
+ of on disk), one list per word. Once a certain number of
+ pairs is output, the combine operation will be called once per
+ unique word with the list available as an iterator. The combiner
+ then emits (''word'', count-in-this-part-of-the-input) pairs. From
+ the viewpoint of the Reduce operation this contains the same
+ information as the original Map output, but there should be far fewer
+ pairs output to disk and read from disk.
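+ 
+ In practice the combiner is often just the reducer class registered a second time. For the word-count sketch above, and assuming a !JobConf named ''conf'', the (illustrative) job setup could include:
+ 
+ {{{
+ // The combiner runs a reduce-type function on the map side, so its input
+ // and output types must match the map output types (Text, IntWritable here).
+ conf.setCombinerClass(WordCountReducer.class);   // reducer reused as combiner
+ }}}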
+ 
+ == Reduce ==
+ When a reduce task starts, its input is scattered in many files across all the nodes where map tasks ran. If run in
+ distributed mode, these first need to be copied to the local
+ filesystem in a ''copy phase'' (see [[https://svn.apache.org/viewcvs.cgi/hadoop/core/trunk/src/mapred/org/apache/hadoop/mapred/ReduceTaskRunner.java?view=markup|ReduceTaskRunner]]).
+ 
+ Once all the data is available locally it is appended to one
+ file in an ''append phase''. The file is then merge-sorted so that the key-value pairs for
+ a given key are contiguous (''sort phase''). This makes the actual reduce operation simple: the file is
+ read sequentially and the values are passed to the reduce method
+ with an iterator that reads the input file until the next key
+ is encountered. See [[http://svn.apache.org/viewcvs.cgi/hadoop/core/trunk/src/mapred/org/apache/hadoop/mapred/ReduceTask.java?view=markup|ReduceTask]] for details.
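+ 
+ A matching word-count Reducer for the old org.apache.hadoop.mapred API could look like the sketch below (again illustrative, with a made-up class name); note how the values for one key arrive as an iterator:
+ 
+ {{{
+ import java.io.IOException;
+ import java.util.Iterator;
+ 
+ import org.apache.hadoop.io.IntWritable;
+ import org.apache.hadoop.io.Text;
+ import org.apache.hadoop.mapred.MapReduceBase;
+ import org.apache.hadoop.mapred.OutputCollector;
+ import org.apache.hadoop.mapred.Reducer;
+ import org.apache.hadoop.mapred.Reporter;
+ 
+ // Called once per key with an iterator over all values for that key,
+ // which the framework reads from the merge-sorted input file.
+ public class WordCountReducer extends MapReduceBase
+     implements Reducer<Text, IntWritable, Text, IntWritable> {
+ 
+   public void reduce(Text key, Iterator<IntWritable> values,
+                      OutputCollector<Text, IntWritable> output, Reporter reporter)
+       throws IOException {
+     int sum = 0;
+     while (values.hasNext()) {
+       sum += values.next().get();
+     }
+     output.collect(key, new IntWritable(sum));
+   }
+ }
+ }}}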
+ 
+ At the end, the output will consist of one output file per executed reduce
+ task. The format of the files can be specified with
+ [[http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/JobConf.html#setOutputFormat(java.lang.Class)|JobConf.setOutputFormat]]. If !SequenceFileOutputFormat is used then the output key and value
+ classes must also be specified.
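+ 
+ Putting it together, a job could be configured roughly as follows (old org.apache.hadoop.mapred API; paths and class names are placeholders for this sketch):
+ 
+ {{{
+ // Classes come from org.apache.hadoop.mapred, org.apache.hadoop.io and
+ // org.apache.hadoop.fs; WordCount* names are placeholders.
+ JobConf conf = new JobConf(WordCount.class);
+ conf.setJobName("wordcount");
+ 
+ conf.setMapperClass(WordCountMapper.class);
+ conf.setCombinerClass(WordCountReducer.class);
+ conf.setReducerClass(WordCountReducer.class);
+ 
+ conf.setInputFormat(TextInputFormat.class);
+ // With SequenceFileOutputFormat the output key/value classes must be set too.
+ conf.setOutputFormat(SequenceFileOutputFormat.class);
+ conf.setOutputKeyClass(Text.class);
+ conf.setOutputValueClass(IntWritable.class);
+ 
+ FileInputFormat.setInputPaths(conf, new Path("input"));
+ FileOutputFormat.setOutputPath(conf, new Path("output"));
+ 
+ JobClient.runJob(conf);
+ }}}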
+