Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2006/10/13 06:00:21 UTC

[Lucene-hadoop Wiki] Update of "HadoopMapReduce" by NigelDaley

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Lucene-hadoop Wiki" for change notification.

The following page has been changed by NigelDaley:
http://wiki.apache.org/lucene-hadoop/HadoopMapReduce

The comment on the change is:
clean up some wording, formatting, and defects

------------------------------------------------------------------------------
  several Splits. The splitting does not know anything about the
  input file's internal logical structure; for example,
  line-oriented text files are split on arbitrary byte boundaries.
- Then a new !MapTask is created per !FileSplit.
+ Then a new map task is created per !FileSplit.
  
- When an individual !MapTask task starts it will open a new output
+ When an individual map task starts it will open a new output
- writer per configured Reduce task. It will then proceed to read
+ writer per configured reduce task. It will then proceed to read
  its !FileSplit using the [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/RecordReader.html RecordReader] it gets from the specified
  [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/InputFormat.html InputFormat]. !InputFormat parses the input and generates
- key-value pairs. !InputFormat must also handle records that may be split on the !FileSplit boundary - for example [http://svn.apache.org/viewcvs.cgi/lucene/hadoop/trunk/src/java/org/apache/hadoop/mapred/TextInputFormat.java?view=markup TextInputFormat] reads the last line of the !FileSplit past the split boundary and when it starts reading other than the first !FileSplit it ignores the content up to the first newline.
+ key-value pairs. !InputFormat must also handle records that may be split on the !FileSplit boundary. For example, [http://svn.apache.org/viewcvs.cgi/lucene/hadoop/trunk/src/java/org/apache/hadoop/mapred/TextInputFormat.java?view=markup TextInputFormat] will read the last line of the !FileSplit past the split boundary and, when reading any !FileSplit other than the first, !TextInputFormat ignores the content up to the first newline.
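
  The boundary handling can be illustrated with a small, self-contained
  sketch in plain Java (illustrative only, not the actual !TextInputFormat
  code): every split except the first skips the partial line at its start,
  and a split keeps reading past its end until the record it has started is
  complete.
  
  {{{
  import java.io.IOException;
  import java.io.RandomAccessFile;
  
  // Illustrative only: how a line-oriented reader can deliver complete
  // records even though split boundaries fall on arbitrary bytes.
  public class SplitLineReaderSketch {
  
    /** Prints the lines owned by the split covering bytes start to end of the file. */
    public static void readSplit(String path, long start, long end) throws IOException {
      RandomAccessFile in = new RandomAccessFile(path, "r");
      try {
        in.seek(start);
        if (start != 0) {
          in.readLine();  // skip the partial line; it belongs to the previous split
        }
        long pos;
        while ((pos = in.getFilePointer()) <= end) {
          String line = in.readLine();  // may read past 'end' to finish the last record
          if (line == null) {
            break;  // end of file
          }
          System.out.println(pos + "\t" + line);  // line-start offset as key, line as value
        }
      } finally {
        in.close();
      }
    }
  
    public static void main(String[] args) throws IOException {
      readSplit(args[0], Long.parseLong(args[1]), Long.parseLong(args[2]));
    }
  }
  }}}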
  
  It is not necessary for the !InputFormat to
- generate both "meaningful" keys and values. For example the
+ generate both meaningful keys ''and'' values. For example, the
- default !TextInputFormat's output consists of input lines as
+ default output from !TextInputFormat consists of input lines as
  values and, somewhat meaninglessly, line-start file offsets as
- values - most applications only use the lines and ignore the
+ keys - most applications only use the lines and ignore the
  offsets.
  
  As key-value pairs are read from the !RecordReader they are
  passed to the configured [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/Mapper.html Mapper]. The user-supplied Mapper does
  whatever it wants with the input pair and calls [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/OutputCollector.html#collect(org.apache.hadoop.io.WritableComparable,%20org.apache.hadoop.io.Writable) OutputCollector.collect] with key-value pairs of its own choosing. The output it
- generates must use one key class and one value class, because
+ generates must use one key class and one value class.  This is because
- the Map output will be eventually written into a [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/io/SequenceFile.html SequenceFile],
+ the Map output will be written into a [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/io/SequenceFile.html SequenceFile]
- which has per file type information and all the records must
+ which has per-file type information and all the records must
  have the same type (use subclassing if you want to output
  different data structures). The Map input and output key-value
- pairs are not related typewise or in cardinality.
+ pairs are not necessarily related typewise or in cardinality.
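
  As a concrete illustration, here is a minimal word-count style Mapper
  sketched against the old {{{org.apache.hadoop.mapred}}} API. The class
  name {{{WordCountMapper}}} is invented for this page, and the
  {{{Text}}}/{{{IntWritable}}} classes may differ from the exact classes
  shipped with the release this page describes:
  
  {{{
  import java.io.IOException;
  import java.util.StringTokenizer;
  
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.io.WritableComparable;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;
  
  public class WordCountMapper implements Mapper {
    private static final IntWritable ONE = new IntWritable(1);
  
    public void configure(JobConf job) {}      // no per-job setup needed
    public void close() throws IOException {}  // nothing to release
  
    // key   = byte offset of the line (ignored, as most applications do)
    // value = one line of input text
    public void map(WritableComparable key, Writable value,
                    OutputCollector output, Reporter reporter) throws IOException {
      StringTokenizer words = new StringTokenizer(value.toString());
      while (words.hasMoreTokens()) {
        // Every collected pair uses the same key class (Text) and the same
        // value class (IntWritable), as required for the map-side SequenceFile.
        output.collect(new Text(words.nextToken()), ONE);
      }
    }
  }
  }}}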
  
  When Mapper output is collected, it is partitioned, which means
  that it will be written to the output specified by the
- [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/Partitioner.html Partitioner]. The default [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/lib/HashPartitioner.html HashPartitioner] uses the key value's
+ [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/Partitioner.html Partitioner]. The default [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/lib/HashPartitioner.html HashPartitioner] uses the
+ key's hashCode method (which means that hashCode must distribute keys well in order to achieve an even workload across the reduce tasks). See [http://svn.apache.org/viewcvs.cgi/lucene/hadoop/trunk/src/java/org/apache/hadoop/mapred/MapTask.java?view=markup MapTask] for details.
- hashcode (which means that for even workload on the Reduce tasks
- the key class hashCode must be good).
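
  Roughly, the default partitioning boils down to hashing the key, as in
  the following plain-Java sketch (illustrative only, not the actual
  !HashPartitioner source):
  
  {{{
  // Chooses the reduce task that a given map output key is routed to.
  public class PartitionSketch {
    public static int getPartition(Object key, int numReduceTasks) {
      // Masking with Integer.MAX_VALUE keeps the hash non-negative before the
      // modulo; a poor hashCode on the key class skews work across reducers.
      return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
  }
  }}}
  
  A job that needs different routing can supply its own Partitioner
  implementation through the job configuration
  ({{{JobConf.setPartitionerClass}}}).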
  
- N input files will generate m map tasks to be run and each map
+ N input files will generate M map tasks to be run, and each map
  task will generate as many output files as there are reduce
  tasks configured in the system. Each output file will be
  targeted at a specific reduce task and the map output pairs from
- all the Map tasks will be routed so that all pairs for a given
+ all the map tasks will be routed so that all pairs for a given
  key end up in files targeted at a specific reduce task.
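
  For instance, a job whose input is divided into 10 !FileSplits runs 10
  map tasks; with 4 configured reduce tasks this yields 10 × 4 = 40
  intermediate map output files, and each reduce task later fetches the
  10 files targeted at it.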
  
  == Combine ==
- The rationale behind using a Combiner is that as the Map
+ When the map
  operation outputs its pairs, they are already available in
+ memory. For efficiency reasons, sometimes it makes sense to
+ take advantage of this fact by supplying a combiner class
- memory for a reduce-type function. If a Combiner is used the
+ to perform a reduce-type function. If a combiner is used then the
- Map output is not immediately written to the output. Instead it
+ map key-value pairs are not immediately written to the output. Instead they
  will be collected in lists, one list per key. When a
- certain number of key-value pairs has been written this buffer
+ certain number of key-value pairs have been written, this buffer
  is flushed by passing all the values of each key to the
- Combiner's reduce method and outputting the key-value pairs of
+ combiner's reduce method and outputting the key-value pairs of
- the combine operation as they were created by the original map
+ the combine operation as if they were created by the original map
  operation.
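
  The buffering described above can be pictured with a small plain-Java
  sketch (illustrative only, not Hadoop code; it assumes integer values,
  as in the word-count example that follows, and an arbitrary flush
  threshold):
  
  {{{
  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;
  
  // Map output is buffered in per-key lists and periodically flushed through
  // a reduce-type (combine) function before anything is written out.
  public class CombineBufferSketch {
    private static final int FLUSH_THRESHOLD = 100000;  // "a certain number of pairs"
  
    private final Map<String, List<Integer>> buffer = new HashMap<String, List<Integer>>();
    private int buffered = 0;
  
    /** Called for every key-value pair the map operation collects. */
    public void collect(String key, int value) {
      List<Integer> values = buffer.get(key);
      if (values == null) {
        values = new ArrayList<Integer>();
        buffer.put(key, values);
      }
      values.add(value);
      if (++buffered >= FLUSH_THRESHOLD) {
        flush();
      }
    }
  
    /** Runs the combine function once per key and emits its output pairs. */
    public void flush() {
      for (Map.Entry<String, List<Integer>> entry : buffer.entrySet()) {
        int sum = 0;  // the word-count combine function: sum the buffered values
        for (int v : entry.getValue()) {
          sum += v;
        }
        System.out.println(entry.getKey() + "\t" + sum);  // emitted as if the map had produced it
      }
      buffer.clear();
      buffered = 0;
    }
  }
  }}}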
  
- For example a word count !MapReduce application whose Map
+ For example, a word count !MapReduce application whose map
- operation outputs (word, 1) pairs as words are encountered in
+ operation outputs (''word'', 1) pairs as words are encountered in
  the input can use a combiner to speed up processing. A combine
- operation will start gathering the output in in-memory (instead
+ operation will start gathering the output in in-memory lists (instead
- of on disk) lists, one list per word. Once a certain number of
+ of on disk), one list per word. Once a certain number of
- pairs is output the combine operation will be called once per
+ pairs is output, the combine operation will be called once per
  unique word with the list available as an iterator. The combiner
- then emits (word, count-in-this-part-of-the-input) pairs. From
+ then emits (''word'', count-in-this-part-of-the-input) pairs. From
  the viewpoint of the Reduce operation, this contains the same
- information as the original Map output, but there might be a lot
+ information as the original Map output, but there should be far fewer
- less bits to output to disk and read from disk.
+ pairs output to disk and read from disk.
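
  The combine function for the word-count example could look like the
  following sketch against the old {{{org.apache.hadoop.mapred.Reducer}}}
  interface (the class name {{{SumReducer}}} is invented for this page):
  
  {{{
  import java.io.IOException;
  import java.util.Iterator;
  
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.WritableComparable;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reducer;
  import org.apache.hadoop.mapred.Reporter;
  
  public class SumReducer implements Reducer {
  
    public void configure(JobConf job) {}      // no per-job setup needed
    public void close() throws IOException {}  // nothing to release
  
    // As the combiner: key = word, values = the 1s buffered on one map task.
    // As the reducer:  key = word, values = the partial counts from all map tasks.
    public void reduce(WritableComparable key, Iterator values,
                       OutputCollector output, Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += ((IntWritable) values.next()).get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
  }}}
  
  Because its output pairs have the same classes as its input pairs, the
  same class can serve as both the combiner and the reducer of the job
  (registered with {{{JobConf.setCombinerClass}}} and
  {{{JobConf.setReducerClass}}} respectively).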
  
  == Reduce ==
- When a reduce task starts its input is scattered in many files, across all nodes where map tasks ran. If run in
+ When a reduce task starts, its input is scattered in many files across all the nodes where map tasks ran. If run in
  distributed mode, these first need to be copied to the local
  filesystem in a ''copy phase'' (see [http://svn.apache.org/viewcvs.cgi/lucene/hadoop/trunk/src/java/org/apache/hadoop/mapred/ReduceTaskRunner.java?view=markup ReduceTaskRunner]).
  
  Once all the data is available locally, it is appended to one
- file (''append phase''). The file is then merge sorted so that the key-value pairs for
+ file in an ''append phase''. The file is then merge sorted so that the key-value pairs for
  a given key are contiguous (''sort phase''). This makes the actual reduce operation simple: the file is
  read sequentially and the values are passed to the reduce method
  with an iterator reading the input file until the next key
  value is encountered. See [http://svn.apache.org/viewcvs.cgi/lucene/hadoop/trunk/src/java/org/apache/hadoop/mapred/ReduceTask.java?view=markup	 ReduceTask] for details.
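
  The effect of the sort can be pictured with a small plain-Java sketch
  (illustrative only, not the actual !ReduceTask code): because all pairs
  for a key are contiguous, one sequential pass over the sorted file is
  enough to feed each group of values to the reduce function.
  
  {{{
  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;
  
  // Reads a merge-sorted file of tab-separated "key<TAB>value" lines and
  // applies a word-count style reduce (sum) to each contiguous key group.
  public class SortedReduceSketch {
    public static void main(String[] args) throws IOException {
      BufferedReader in = new BufferedReader(new FileReader(args[0]));
      String currentKey = null;
      int sum = 0;
      String line;
      while ((line = in.readLine()) != null) {
        int tab = line.indexOf('\t');
        String key = line.substring(0, tab);
        int value = Integer.parseInt(line.substring(tab + 1));
        if (currentKey != null && !key.equals(currentKey)) {
          System.out.println(currentKey + "\t" + sum);  // key changed: emit the finished group
          sum = 0;
        }
        currentKey = key;
        sum += value;
      }
      if (currentKey != null) {
        System.out.println(currentKey + "\t" + sum);    // emit the last group
      }
      in.close();
    }
  }
  }}}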
  
- In the end the output will consist of one output file per Reduce
+ At the end, the output will consist of one output file per executed reduce
- task run. The format of the files can be specified with
+ task. The format of the files can be specified with
- [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/JobConf.html#setOutputFormat(java.lang.Class) JobConf.setOutputFormat]. If !SequentialOutputFormat is used the output Key and Value
+ [http://lucene.apache.org/hadoop/docs/api/org/apache/hadoop/mapred/JobConf.html#setOutputFormat(java.lang.Class) JobConf.setOutputFormat]. If !SequenceFileOutputFormat is used then the output key and value
  classes must also be specified.
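
  To make the configuration concrete, here is a hedged sketch of a
  complete driver for the word-count example, written against the old
  {{{org.apache.hadoop.mapred}}} API. The class names ({{{WordCount}}},
  {{{WordCountMapper}}}, {{{SumReducer}}}) are the hypothetical ones used
  in the sketches above, and the path-setting and output-format calls
  changed across releases, so treat this as illustrative rather than exact:
  
  {{{
  import java.io.IOException;
  
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.SequenceFileOutputFormat;
  
  // Hypothetical driver tying the earlier sketches together.
  public class WordCount {
    public static void main(String[] args) throws IOException {
      JobConf conf = new JobConf(WordCount.class);
      conf.setJobName("wordcount");
  
      conf.setMapperClass(WordCountMapper.class);  // map sketch above
      conf.setCombinerClass(SumReducer.class);     // combine sketch above
      conf.setReducerClass(SumReducer.class);      // the reduce phase reuses the same class
  
      // A SequenceFile output needs the key and value classes declared explicitly.
      conf.setOutputFormat(SequenceFileOutputFormat.class);
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);
  
      conf.setInputPath(new Path(args[0]));
      conf.setOutputPath(new Path(args[1]));
  
      JobClient.runJob(conf);
    }
  }
  }}}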