Posted to commits@hbase.apache.org by st...@apache.org on 2009/08/26 20:13:04 UTC
svn commit: r808144 - in /hadoop/hbase/trunk: CHANGES.txt src/java/org/apache/hadoop/hbase/mapreduce/package-info.java
Author: stack
Date: Wed Aug 26 18:13:04 2009
New Revision: 808144
URL: http://svn.apache.org/viewvc?rev=808144&view=rev
Log:
HBASE-1698 Review documentation for o.a.h.h.mapreduce
Modified:
hadoop/hbase/trunk/CHANGES.txt
hadoop/hbase/trunk/src/java/org/apache/hadoop/hbase/mapreduce/package-info.java
Modified: hadoop/hbase/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/hadoop/hbase/trunk/CHANGES.txt?rev=808144&r1=808143&r2=808144&view=diff
==============================================================================
--- hadoop/hbase/trunk/CHANGES.txt (original)
+++ hadoop/hbase/trunk/CHANGES.txt Wed Aug 26 18:13:04 2009
@@ -8,6 +8,7 @@
HBASE-1737 Regions unbalanced when adding new node (recommit)
HBASE-1792 [Regression] Cannot save timestamp in the future
HBASE-1793 [Regression] HTable.get/getRow with a ts is broken
+ HBASE-1698 Review documentation for o.a.h.h.mapreduce
IMPROVEMENTS
HBASE-1760 Cleanup TODOs in HTable
Modified: hadoop/hbase/trunk/src/java/org/apache/hadoop/hbase/mapreduce/package-info.java
URL: http://svn.apache.org/viewvc/hadoop/hbase/trunk/src/java/org/apache/hadoop/hbase/mapreduce/package-info.java?rev=808144&r1=808143&r2=808144&view=diff
==============================================================================
--- hadoop/hbase/trunk/src/java/org/apache/hadoop/hbase/mapreduce/package-info.java (original)
+++ hadoop/hbase/trunk/src/java/org/apache/hadoop/hbase/mapreduce/package-info.java Wed Aug 26 18:13:04 2009
@@ -33,41 +33,34 @@
<p>MapReduce jobs deployed to a MapReduce cluster do not by default have access
to the HBase configuration under <code>$HBASE_CONF_DIR</code> nor to HBase classes.
You could add <code>hbase-site.xml</code> to $HADOOP_HOME/conf and add
-<code>hbase-X.X.X.jar</code> to the <code>$HADOOP_HOME/lib</code> and copy these
-changes across your cluster but the cleanest means of adding hbase configuration
+hbase jars to the <code>$HADOOP_HOME/lib</code> and copy these
+changes across your cluster but a cleaner means of adding hbase configuration
and classes to the cluster <code>CLASSPATH</code> is by uncommenting
<code>HADOOP_CLASSPATH</code> in <code>$HADOOP_HOME/conf/hadoop-env.sh</code>
-and adding the path to the hbase jar and <code>$HBASE_CONF_DIR</code> directory.
-Then copy the amended configuration around the cluster.
-You'll probably need to restart the MapReduce cluster if you want it to notice
-the new configuration.
-</p>
-
-<p>For example, here is how you would amend <code>hadoop-env.sh</code> adding the
-built hbase jar, hbase conf, and the <code>PerformanceEvaluation</code> class from
-the built hbase test jar to the hadoop <code>CLASSPATH<code>:
+and adding hbase dependencies there. For example, here is how you would amend
+<code>hadoop-env.sh</code> adding the
+built hbase jar, zookeeper (needed by hbase client), hbase conf, and the
+<code>PerformanceEvaluation</code> class from the built hbase test jar to the
+hadoop <code>CLASSPATH</code>:
<blockquote><pre># Extra Java CLASSPATH elements. Optional.
# export HADOOP_CLASSPATH=
-export HADOOP_CLASSPATH=$HBASE_HOME/build/test:$HBASE_HOME/build/hbase-X.X.X.jar:$HBASE_HOME/build/hbase-X.X.X-test.jar:$HBASE_HOME/conf</pre></blockquote>
+export HADOOP_CLASSPATH=$HBASE_HOME/build/hbase-X.X.X.jar:$HBASE_HOME/build/hbase-X.X.X-test.jar:$HBASE_HOME/conf:${HBASE_HOME}/lib/zookeeper-X.X.X.jar</pre></blockquote>
<p>Expand <code>$HBASE_HOME</code> in the above appropriately to suit your
local environment.</p>
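<p>For instance, an illustrative expansion (substitute your own install
location and the actual version numbers in place of <code>X.X.X</code>;
the <code>/home/user/hbase</code> path is made up):</p>
<blockquote><pre># Illustrative only: assumes hbase was built under /home/user/hbase.
export HADOOP_CLASSPATH=/home/user/hbase/build/hbase-X.X.X.jar:/home/user/hbase/build/hbase-X.X.X-test.jar:/home/user/hbase/conf:/home/user/hbase/lib/zookeeper-X.X.X.jar</pre></blockquote>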
-<p>After copying the above change around your cluster, this is how you would run
-the PerformanceEvaluation MR job to put up 4 clients (Presumes a ready mapreduce
-cluster):
+<p>After copying the above change around your cluster (and restarting), this is
+how you would run the PerformanceEvaluation MR job to put up 4 clients (presumes
+a ready mapreduce cluster):
<blockquote><pre>$HADOOP_HOME/bin/hadoop org.apache.hadoop.hbase.PerformanceEvaluation sequentialWrite 4</pre></blockquote>
-
-The PerformanceEvaluation class wil be found on the CLASSPATH because you
-added <code>$HBASE_HOME/build/test</code> to HADOOP_CLASSPATH
</p>
<p>Another possibility, if for example you do not have access to hadoop-env.sh or
-are unable to restart the hadoop cluster, is bundling the hbase jar into a mapreduce
+are unable to restart the hadoop cluster, is bundling the hbase jars into a mapreduce
job jar, adding them and their dependencies under the job jar <code>lib/</code>
-directory and the hbase conf into a job jar <code>conf/</code> directory.
+directory and the hbase conf into the job jar's top-level directory.
</p>
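<p>As a rough sketch of this second approach (the layout and names below
are illustrative, not part of this patch), the unpacked job jar contents
might look as follows before repackaging:</p>
<blockquote><pre># Hypothetical layout; 'myjob' and MyJob are made-up names.
myjob/
  com/example/MyJob.class      # your job classes
  hbase-site.xml               # hbase conf at the job jar top level
  lib/
    hbase-X.X.X.jar            # hbase and its dependencies
    zookeeper-X.X.X.jar
# then repackage and submit:
% jar cf myjob.jar -C myjob .</pre></blockquote>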
<h2><a name="sink">HBase as MapReduce job data source and sink</a></h2>
@@ -79,7 +72,7 @@
{@link org.apache.hadoop.hbase.mapreduce.TableReducer TableReducer}. See the do-nothing
pass-through classes {@link org.apache.hadoop.hbase.mapreduce.IdentityTableMapper IdentityTableMapper} and
{@link org.apache.hadoop.hbase.mapreduce.IdentityTableReducer IdentityTableReducer} for basic usage. For a more
-involved example, see {@link org.apache.hadoop.hbase.mapreduce.BuildTableIndex BuildTableIndex}
+involved example, see {@link org.apache.hadoop.hbase.mapreduce.RowCounter}
or review the <code>org.apache.hadoop.hbase.mapreduce.TestTableMapReduce</code> unit test.
</p>
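<p>As a minimal sketch (not part of this patch; the class and table names
are hypothetical), a map-only job reading from a table could be wired up
like so:</p>
<blockquote><pre>import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

/** Hypothetical example: map-only scan of a table named 'mytable'. */
public class MyTableScan {
  /** Gets each row as (row key, Result) and just bumps a counter. */
  static class MyMapper extends TableMapper<ImmutableBytesWritable, Result> {
    public void map(ImmutableBytesWritable row, Result values, Context context) {
      context.getCounter("example", "rows").increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new HBaseConfiguration(), "mytablescan");
    job.setJarByClass(MyTableScan.class);
    // Sets TableInputFormat, the scan, and the map output types.
    TableMapReduceUtil.initTableMapperJob("mytable", new Scan(),
      MyMapper.class, ImmutableBytesWritable.class, Result.class, job);
    job.setNumReduceTasks(0);  // map-only; no table sink
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}</pre></blockquote>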
@@ -106,162 +99,22 @@
currently existing regions. The
{@link org.apache.hadoop.hbase.mapreduce.HRegionPartitioner} is suitable
when your table is large and your upload is not such that it will greatly
-alter the number of existing regions when done; other use the default
+alter the number of existing regions when done; otherwise use the default
partitioner.
</p>
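<p>Opting in is a one-liner on an otherwise-configured job (a sketch;
assumes a <code>Job</code> instance already set up with a table sink):</p>
<blockquote><pre>// Align reduce partitions with the table's existing regions.
job.setPartitionerClass(org.apache.hadoop.hbase.mapreduce.HRegionPartitioner.class);</pre></blockquote>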
<h2><a name="examples">Example Code</a></h2>
<h3>Sample Row Counter</h3>
-<p>See {@link org.apache.hadoop.hbase.mapreduce.RowCounter}. You should be able to run
+<p>See {@link org.apache.hadoop.hbase.mapreduce.RowCounter}. This job uses
+{@link org.apache.hadoop.hbase.mapreduce.TableInputFormat TableInputFormat} and
+does a count of all rows in the specified table.
+You should be able to run
it by doing: <code>% ./bin/hadoop jar hbase-X.X.X.jar</code>. This will invoke
the hbase MapReduce Driver class. Select 'rowcounter' from the choice of jobs
-offered. You may need to add the hbase conf directory to <code>$HADOOP_HOME/conf/hadoop-env.sh#HADOOP_CLASSPATH</code>
+offered. This will emit the rowcounter 'usage'. Specify tablename, column to count
+and output directory. You may need to add the hbase conf directory to <code>$HADOOP_HOME/conf/hadoop-env.sh#HADOOP_CLASSPATH</code>
so the rowcounter gets pointed at the right hbase cluster (or, build a new jar
with an appropriate hbase-site.xml built into your job jar).
</p>
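<p>A full invocation might then look like the following (the output
directory, table, and column names are made up; defer to the emitted
usage for the exact argument order):</p>
<blockquote><pre>% ./bin/hadoop jar hbase-X.X.X.jar rowcounter outdir mytable info:somecolumn</pre></blockquote>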
-<h3>PerformanceEvaluation</h3>
-<p>See org.apache.hadoop.hbase.PerformanceEvaluation from hbase src/test. It runs
-a mapreduce job to run concurrent clients reading and writing hbase.
-</p>
-
-<h3>Sample MR Bulk Uploader</h3>
-<p>A students/classes example based on a contribution by Naama Kraus with logs of
-documentation can be found over in src/examples/mapred.
-Its the <code>org.apache.hadoop.hbase.mapreduce.SampleUploader</code> class.
-Just copy it under src/java/org/apache/hadoop/hbase/mapred to compile and try it
-(until we start generating an hbase examples jar). The class reads a data file
-from HDFS and per line, does an upload to HBase using TableReduce.
-Read the class comment for specification of inputs, prerequisites, etc.
-</p>
-
-<h3>Example to bulk import/load a text file into an HTable
-</h3>
-
-<p>Here's a sample program from
-<a href="http://www.spicylogic.com/allenday/blog/category/computing/distributed-systems/hadoop/hbase/">Allen Day</a>
-that takes an HDFS text file path and an HBase table name as inputs, and loads the contents of the text file to the table
-all up in the map phase.
-</p>
-
-<blockquote><pre>
-package com.spicylogic.hbase;
-package org.apache.hadoop.hbase.mapreduce;
-import java.io.IOException;
-
-import org.apache.hadoop.conf.Configuration;
-import org.apache.hadoop.fs.Path;
-import org.apache.hadoop.hbase.HBaseConfiguration;
-import org.apache.hadoop.hbase.client.HTable;
-import org.apache.hadoop.hbase.io.BatchUpdate;
-import org.apache.hadoop.io.LongWritable;
-import org.apache.hadoop.io.Text;
-import org.apache.hadoop.mapred.FileInputFormat;
-import org.apache.hadoop.mapred.JobClient;
-import org.apache.hadoop.mapred.JobConf;
-import org.apache.hadoop.mapred.MapReduceBase;
-import org.apache.hadoop.mapred.Mapper;
-import org.apache.hadoop.mapred.OutputCollector;
-import org.apache.hadoop.mapred.Reporter;
-import org.apache.hadoop.mapred.lib.NullOutputFormat;
-import org.apache.hadoop.util.Tool;
-import org.apache.hadoop.util.ToolRunner;
-
-/**
- * Class that adds the parsed line from the input to hbase
- * in the map function. Map has no emissions and job
- * has no reduce.
- */
-public class BulkImport implements Tool {
- private static final String NAME = "BulkImport";
- private Configuration conf;
-
- public static class InnerMap extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
- private HTable table;
- private HBaseConfiguration HBconf;
-
- public void map(LongWritable key, Text value,
- OutputCollector<Text, Text> output, Reporter reporter)
- throws IOException {
- if ( table == null )
- throw new IOException("table is null");
-
- // Split input line on tab character
- String [] splits = value.toString().split("\t");
- if ( splits.length != 4 )
- return;
-
- String rowID = splits[0];
- int timestamp = Integer.parseInt( splits[1] );
- String colID = splits[2];
- String cellValue = splits[3];
-
- reporter.setStatus("Map emitting cell for row='" + rowID +
- "', column='" + colID + "', time='" + timestamp + "'");
-
- BatchUpdate bu = new BatchUpdate( rowID );
- if ( timestamp > 0 )
- bu.setTimestamp( timestamp );
-
- bu.put(colID, cellValue.getBytes());
- table.commit( bu );
- }
-
- public void configure(JobConf job) {
- HBconf = new HBaseConfiguration(job);
- try {
- table = new HTable( HBconf, job.get("input.table") );
- } catch (IOException e) {
- // TODO Auto-generated catch block
- e.printStackTrace();
- }
- }
- }
-
- public JobConf createSubmittableJob(String[] args) {
- JobConf c = new JobConf(getConf(), BulkImport.class);
- c.setJobName(NAME);
- FileInputFormat.setInputPaths(c, new Path(args[0]));
-
- c.set("input.table", args[1]);
- c.setMapperClass(InnerMap.class);
- c.setNumReduceTasks(0);
- c.setOutputFormat(NullOutputFormat.class);
- return c;
- }
-
- static int printUsage() {
- System.err.println("Usage: " + NAME + " <input> <table_name>");
- System.err.println("\twhere <input> is a tab-delimited text file with 4 columns.");
- System.err.println("\t\tcolumn 1 = row ID");
- System.err.println("\t\tcolumn 2 = timestamp (use a negative value for current time)");
- System.err.println("\t\tcolumn 3 = column ID");
- System.err.println("\t\tcolumn 4 = cell value");
- return -1;
- }
-
- public int run(@SuppressWarnings("unused") String[] args) throws Exception {
- // Make sure there are exactly 3 parameters left.
- if (args.length != 2) {
- return printUsage();
- }
- JobClient.runJob(createSubmittableJob(args));
- return 0;
- }
-
- public Configuration getConf() {
- return this.conf;
- }
-
- public void setConf(final Configuration c) {
- this.conf = c;
- }
-
- public static void main(String[] args) throws Exception {
- int errCode = ToolRunner.run(new Configuration(), new BulkImport(), args);
- System.exit(errCode);
- }
-}
-</pre></blockquote>
-
*/
package org.apache.hadoop.hbase.mapreduce;