Posted to common-commits@hadoop.apache.org by dd...@apache.org on 2008/10/16 14:05:30 UTC

svn commit: r705216 [5/5] - in /hadoop/core/branches/branch-0.19: ./ conf/ docs/ src/docs/src/documentation/content/xdocs/ src/mapred/org/apache/hadoop/mapred/ src/mapred/org/apache/hadoop/mapred/pipes/

Modified: hadoop/core/branches/branch-0.19/src/docs/src/documentation/content/xdocs/streaming.xml
URL: http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.19/src/docs/src/documentation/content/xdocs/streaming.xml?rev=705216&r1=705215&r2=705216&view=diff
==============================================================================
--- hadoop/core/branches/branch-0.19/src/docs/src/documentation/content/xdocs/streaming.xml (original)
+++ hadoop/core/branches/branch-0.19/src/docs/src/documentation/content/xdocs/streaming.xml Thu Oct 16 05:05:28 2008
@@ -312,13 +312,20 @@
 </p><p>
 Similarly, you can use "-D stream.reduce.output.field.separator=SEP" and "-D stream.num.reduce.output.fields=NUM" to specify the key/value split for the reduce outputs: the prefix of a line up to the NUMth occurrence of SEP is the key, and the rest of the line is the value.
 </p>
+<p> Similarly, you can specify "stream.map.input.field.separator" and 
+"stream.reduce.input.field.separator" as the field separator for the 
+map/reduce inputs. By default, the separator is the tab character.</p>
 </section>
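For a rough feel of what these separator settings do, the split can be mimicked with plain Unix tools (this sketch uses `cut` and an invented sample line, not Hadoop itself):

```shell
# Mimic stream.map.output.field.separator=. together with
# stream.num.map.output.key.fields=4: the prefix up to the 4th "."
# is the key, and the remainder of the line is the value.
line="11.12.1.2.extra.payload"        # invented sample record
key=$(echo "$line" | cut -d. -f1-4)   # first 4 "."-separated fields
value=$(echo "$line" | cut -d. -f5-)  # everything after the 4th "."
echo "key=$key value=$value"
```

Running this prints `key=11.12.1.2 value=extra.payload`.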
 
 
 <section>
 <title>A Useful Partitioner Class (secondary sort, the -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner option) </title>
 <p>
-Hadoop has a library class, org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner, that is useful for many applications. This class allows the Map/Reduce framework to partition the map outputs based on prefixes of keys, not the whole keys. For example:
+Hadoop has a library class, 
+<a href="ext:api/org/apache/hadoop/mapred/lib/keyfieldbasedpartitioner">KeyFieldBasedPartitioner</a>, 
+that is useful for many applications. This class allows the Map/Reduce 
+framework to partition the map outputs based on certain key fields, not
+the whole keys. For example:
 </p>
 <source>
 $HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
@@ -330,13 +337,19 @@
     -D stream.map.output.field.separator=. \
     -D stream.num.map.output.key.fields=4 \
     -D map.output.key.field.separator=. \
-    -D num.key.fields.for.partition=2 \
+    -D mapred.text.key.partitioner.options=-k1,2 \
     -D mapred.reduce.tasks=12
 </source>
 <p>
 Here, <em>-D stream.map.output.field.separator=.</em> and <em>-D stream.num.map.output.key.fields=4</em> are as explained in the previous example. Streaming uses these two properties to identify the key/value pairs of the map outputs. 
 </p><p>
-The map output keys of the above Map/Reduce job normally have four fields separated by ".". However, the Map/Reduce framework will partition the map outputs by the first two fields of the keys using the <em>-D num.key.fields.for.partition=2</em> option. Here, <em>-D map.output.key.field.separator=.</em> specifies the separator for the partition. This guarantees that all the key/value pairs with the same first two fields in the keys will be partitioned into the same reducer.
+The map output keys of the above Map/Reduce job normally have four fields
+separated by ".". However, the Map/Reduce framework will partition the map
+outputs by the first two fields of the keys using the 
+<em>-D mapred.text.key.partitioner.options=-k1,2</em> option. 
+Here, <em>-D map.output.key.field.separator=.</em> specifies the separator 
+for the partition. This guarantees that all the key/value pairs with the 
+same first two fields in the keys will be partitioned into the same reducer.
 </p><p>
 <em>This is effectively equivalent to specifying the first two fields as the primary key and the next two fields as the secondary. The primary key is used for partitioning, and the combination of the primary and secondary keys is used for sorting.</em> A simple illustration is shown here:
 </p>
@@ -370,13 +383,61 @@
 11.14.2.3
 </source>
 </section>
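The partitioning described above can be approximated with standard Unix tools; the sketch below tags each sample key with the first two fields that `-k1,2` partitions on (the sample keys are taken from the illustration, everything else is invented):

```shell
# Group sample keys by their first two "."-separated fields -- the part
# of the key that -D mapred.text.key.partitioner.options=-k1,2 uses.
printf '11.12.1.2\n11.14.2.3\n11.11.4.1\n11.12.1.1\n11.14.2.2\n' |
awk -F. '{ print $1 "." $2 " -> " $0 }' | sort
```

Keys sharing a partition prefix (for example the two `11.14` keys) end up adjacent, which is the grouping a single reducer would see.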
+<section>
+<title>A Useful Comparator Class</title>
+<p>
+Hadoop has a library class, 
+<a href="ext:api/org/apache/hadoop/mapred/lib/keyfieldbasedcomparator">KeyFieldBasedComparator</a>, 
+that is useful for many applications. This class provides a subset of the 
+features of Unix/GNU sort. For example:
+</p>
+<source>
+$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
+    -input myInputDirs \
+    -output myOutputDir \
+    -mapper org.apache.hadoop.mapred.lib.IdentityMapper \
+    -reducer org.apache.hadoop.mapred.lib.IdentityReducer \
+    -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
+    -D stream.map.output.field.separator=. \
+    -D stream.num.map.output.key.fields=4 \
+    -D map.output.key.field.separator=. \
+    -D mapred.text.key.comparator.options=-k2,2nr \
+    -D mapred.reduce.tasks=12
+</source>
+<p>
+The map output keys of the above Map/Reduce job normally have four fields
+separated by ".". However, the Map/Reduce framework will sort the 
+outputs by the second field of the keys using the 
+<em>-D mapred.text.key.comparator.options=-k2,2nr</em> option. 
+Here, <em>-n</em> specifies that the sorting is numerical and 
+<em>-r</em> specifies that the sort order should be reversed. A simple 
+illustration is shown below:
+</p>
+<p>
+Output of map (the keys)</p>
+<source>
+11.12.1.2
+11.14.2.3
+11.11.4.1
+11.12.1.1
+11.14.2.2
+</source>
+<p>
+Sorted output for the reducer (sorted numerically, in reverse, on the second field)</p>
+<source>
+11.14.2.3
+11.14.2.2
+11.12.1.2
+11.12.1.1
+11.11.4.1
+</source>
+</section>
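Since the comparator mirrors a subset of Unix sort, the result above can be reproduced with GNU sort itself (using a stable sort so that equal keys keep their input order, as in the illustration):

```shell
# Approximate -D mapred.text.key.comparator.options=-k2,2nr with GNU sort:
# "." as the field separator, sort on field 2 only, numeric (-n),
# reversed (-r), and stable (-s) so ties stay in input order.
printf '11.12.1.2\n11.14.2.3\n11.11.4.1\n11.12.1.1\n11.14.2.2\n' |
sort -t. -s -k2,2nr
```

This prints the five keys in the same order as the reducer input shown above.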
 
 <section>
 <title>Working with the Hadoop Aggregate Package (the -reduce aggregate option) </title>
 <p>
-Hadoop has a library package called "Aggregate" (
-<a href="https://svn.apache.org/repos/asf/hadoop/core/trunk/src/mapred/org/apache/hadoop/mapred/lib/aggregate">
-https://svn.apache.org/repos/asf/hadoop/core/trunk/src/mapred/org/apache/hadoop/mapred/lib/aggregate</a>).
+Hadoop has a library package called 
+<a href="ext:api/org/apache/hadoop/mapred/lib/aggregate/package-summary">Aggregate</a>.
 Aggregate provides a special reducer class and a special combiner class, and
 a list of simple aggregators that perform aggregations such as "sum", "max",
 "min" and so on  over a sequence of values. Aggregate allows you to define a
@@ -434,7 +495,7 @@
     -reducer org.apache.hadoop.mapred.lib.FieldSelectionMapReduce\
     -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
     -D map.output.key.field.separator=. \
-    -D num.key.fields.for.partition=2 \
+    -D mapred.text.key.partitioner.options=-k1,2 \
     -D mapred.data.field.separator=. \
     -D map.output.key.value.fields.spec=6,5,1-3:0- \
     -D reduce.output.key.value.fields.spec=0-2:5- \
@@ -444,7 +505,11 @@
 The option "-D map.output.key.value.fields.spec=6,5,1-3:0-" specifies key/value selection for the map outputs. Key selection spec and value selection spec are separated by ":". In this case, the map output key will consist of fields 6, 5, 1, 2, and 3. The map output value will consist of all fields (0- means field 0 and all 
 the subsequent fields). 
 </p><p>
-The option "-D reduce.output.key.value.fields.spec=0-2:0-" specifies key/value selection for the reduce outputs. In this case, the reduce output key will consist of fields 0, 1, 2 (corresponding to the original fields 6, 5, 1). The reduce output value will consist of all fields starting from field 5 (corresponding to all the original fields).  
+The option "-D reduce.output.key.value.fields.spec=0-2:5-" specifies 
+key/value selection for the reduce outputs. In this case, the reduce 
+output key will consist of fields 0, 1, 2 (corresponding to the original 
+fields 6, 5, 1). The reduce output value will consist of all fields starting
+from field 5 (corresponding to all the original fields).  
 </p>
 </section>
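The effect of a fields spec such as "6,5,1-3:0-" can be sketched with `awk` (fields in the spec are 0-based, so spec field 6 is awk's `$7`; the input line here is invented):

```shell
# Mimic map.output.key.value.fields.spec=6,5,1-3:0- on one "."-separated line:
# key = input fields 6, 5, 1, 2, 3; value = all fields ("0-").
echo "f0.f1.f2.f3.f4.f5.f6.f7" |
awk -F. -v OFS=. '{
  key = $7 OFS $6 OFS $2 OFS $3 OFS $4   # spec fields 6,5,1-3
  print key "\t" $0                      # value spec "0-": the whole line
}'
```

This prints `f6.f5.f1.f2.f3` as the key, tab-separated from the full line as the value.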
 </section>

Modified: hadoop/core/branches/branch-0.19/src/mapred/org/apache/hadoop/mapred/package.html
URL: http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.19/src/mapred/org/apache/hadoop/mapred/package.html?rev=705216&r1=705215&r2=705216&view=diff
==============================================================================
--- hadoop/core/branches/branch-0.19/src/mapred/org/apache/hadoop/mapred/package.html (original)
+++ hadoop/core/branches/branch-0.19/src/mapred/org/apache/hadoop/mapred/package.html Thu Oct 16 05:05:28 2008
@@ -172,8 +172,8 @@
     
     grepJob.setJobName("grep");
 
-    grepJob.setInputPath(new Path(args[0]));
-    grepJob.setOutputPath(args[1]);
+    FileInputFormat.setInputPaths(grepJob, new Path(args[0]));
+    FileOutputFormat.setOutputPath(grepJob, args[1]);
 
     grepJob.setMapperClass(GrepMapper.class);
     grepJob.setCombinerClass(GrepReducer.class);

Modified: hadoop/core/branches/branch-0.19/src/mapred/org/apache/hadoop/mapred/pipes/package.html
URL: http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.19/src/mapred/org/apache/hadoop/mapred/pipes/package.html?rev=705216&r1=705215&r2=705216&view=diff
==============================================================================
--- hadoop/core/branches/branch-0.19/src/mapred/org/apache/hadoop/mapred/pipes/package.html (original)
+++ hadoop/core/branches/branch-0.19/src/mapred/org/apache/hadoop/mapred/pipes/package.html Thu Oct 16 05:05:28 2008
@@ -117,5 +117,11 @@
 called by the C++ framework before the key/value pair is sent back to
 Java.
 
+<p>
+
+The application programs can also register counters with a group and a name, 
+increment them, and retrieve their values. A word-count example illustrating 
+the use of counters with pipes is available at 
+<a href="https://svn.apache.org/repos/asf/hadoop/core/trunk/src/examples/pipes/impl/wordcount-simple.cc">wordcount-simple.cc</a>.
 </body>
 </html>