Posted to commits@hbase.apache.org by st...@apache.org on 2012/06/03 23:59:52 UTC

svn commit: r1345788 [6/10] - in /hbase/trunk: ./ hbase-assembly/ hbase-common/ hbase-server/ hbase-site/ src/ src/assembly/ src/docbkx/ src/site/ src/site/resources/ src/site/resources/css/ src/site/resources/images/ src/site/xdoc/ src/xslt/

Added: hbase/trunk/src/docbkx/ops_mgt.xml
URL: http://svn.apache.org/viewvc/hbase/trunk/src/docbkx/ops_mgt.xml?rev=1345788&view=auto
==============================================================================
--- hbase/trunk/src/docbkx/ops_mgt.xml (added)
+++ hbase/trunk/src/docbkx/ops_mgt.xml Sun Jun  3 21:59:50 2012
@@ -0,0 +1,681 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<chapter version="5.0" xml:id="ops_mgt"
+         xmlns="http://docbook.org/ns/docbook"
+         xmlns:xlink="http://www.w3.org/1999/xlink"
+         xmlns:xi="http://www.w3.org/2001/XInclude"
+         xmlns:svg="http://www.w3.org/2000/svg"
+         xmlns:m="http://www.w3.org/1998/Math/MathML"
+         xmlns:html="http://www.w3.org/1999/xhtml"
+         xmlns:db="http://docbook.org/ns/docbook">
+<!--
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+-->
+  <title>HBase Operational Management</title>
+  <para>This chapter covers operational tools and practices required of a running HBase cluster.
+  The subject of operations is related to the topics of <xref linkend="trouble" />, <xref linkend="performance"/>,
+  and <xref linkend="configuration" /> but is a distinct topic in itself.</para>
+  
+  <section xml:id="tools">
+    <title >HBase Tools and Utilities</title>
+
+    <para>Here we list HBase tools for administration, analysis, fixup, and
+    debugging.</para>
+    <section xml:id="driver"><title>Driver</title>
+      <para>There is a <code>Driver</code> class executed by the HBase jar that can be used to invoke frequently accessed utilities.  For example, 
+<programlisting>HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar 
+</programlisting>
+... will return...
+<programlisting>
+An example program must be given as the first argument.
+Valid program names are:
+  completebulkload: Complete a bulk data load.
+  copytable: Export a table from local cluster to peer cluster
+  export: Write table data to HDFS.
+  import: Import data written by Export.
+  importtsv: Import data in TSV format.
+  rowcounter: Count rows in HBase table
+  verifyrep: Compare the data from tables in two different clusters. WARNING: It doesn't work for incrementColumnValues'd cells since the timestamp is changed after being appended to the log.
+</programlisting>
+... for allowable program names.
+      </para>
+    </section>
+    <section xml:id="hbck">
+        <title>HBase <application>hbck</application></title>
+        <subtitle>An <emphasis>fsck</emphasis> for your HBase install</subtitle>
+        <para>To run <application>hbck</application> against your HBase cluster run
+        <programlisting>$ ./bin/hbase hbck</programlisting>
+        At the end of the command's output it prints <emphasis>OK</emphasis>
+        or <emphasis>INCONSISTENCY</emphasis>. If your cluster reports
+        inconsistencies, pass <command>-details</command> to see more detail emitted.
+        If inconsistencies are reported, run <command>hbck</command> a few times because an
+        inconsistency may be transient (e.g. the cluster is starting up or a region is
+        splitting).
+        Passing <command>-fix</command> may correct the inconsistency (this latter
+        is an experimental feature).
+        </para>
+        <para>For more information, see <xref linkend="hbck.in.depth"/>.
+        </para>
+    </section>
+    <section xml:id="hfile_tool2"><title>HFile Tool</title>
+        <para>See <xref linkend="hfile_tool" />.</para>
+    </section>
+    <section xml:id="wal_tools">
+      <title>WAL Tools</title>
+
+      <section xml:id="hlog_tool">
+        <title><classname>HLog</classname> tool</title>
+
+        <para>The main method on <classname>HLog</classname> offers manual
+        split and dump facilities. Pass it WALs or the product of a split, the
+        content of the <filename>recovered.edits</filename> directory.</para>
+
+        <para>You can get a textual dump of a WAL file content by doing the
+        following:<programlisting> <code>$ ./bin/hbase org.apache.hadoop.hbase.regionserver.wal.HLog --dump hdfs://example.org:8020/hbase/.logs/example.org,60020,1283516293161/10.10.21.10%3A60020.1283973724012</code> </programlisting>The
+        return code will be non-zero if there are issues with the file, so you can test the
+        wholesomeness of a file by redirecting <varname>STDOUT</varname> to
+        <code>/dev/null</code> and testing the program's return code.</para>
+
+        <para>Similarly you can force a split of a log file directory by
+        doing:<programlisting> <code>$ ./bin/hbase org.apache.hadoop.hbase.regionserver.wal.HLog --split hdfs://example.org:8020/hbase/.logs/example.org,60020,1283516293161/</code></programlisting></para>
+
+        <section xml:id="hlog_tool.prettyprint">
+          <title><classname>HLogPrettyPrinter</classname></title>
+          <para><classname>HLogPrettyPrinter</classname> is a tool with configurable options to print the contents of an HLog.
+          </para>
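+          <para>For example, a minimal sketch of dumping a single WAL file (run the class with
+          no arguments for the full option list, which varies by version):
+<programlisting>$ ./bin/hbase org.apache.hadoop.hbase.regionserver.wal.HLogPrettyPrinter hdfs://example.org:8020/hbase/.logs/example.org,60020,1283516293161/10.10.21.10%3A60020.1283973724012</programlisting>
+          </para>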
+        </section>
+
+      </section>
+    </section>
+    <section xml:id="compression.tool"><title>Compression Tool</title>
+        <para>See <xref linkend="compression.test" />.</para>
+    </section>
+        <section xml:id="copytable">
+        <title>CopyTable</title>
+      <para>
+            CopyTable is a utility that can copy part or all of a table, either to the same cluster or another cluster. The usage is as follows:
+<programlisting>$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable [--starttime=X] [--endtime=Y] [--new.name=NEW] [--peer.adr=ADR] tablename
+</programlisting>
+        </para>
+        <para>
+        Options:
+        <itemizedlist>
+          <listitem><varname>starttime</varname>  Beginning of the time range.  Without endtime means starttime to forever.</listitem>
+          <listitem><varname>endtime</varname>  End of the time range.  Ignored if no starttime specified.</listitem>
+          <listitem><varname>versions</varname>  Number of cell versions to copy.</listitem>
+          <listitem><varname>new.name</varname>  New table's name.</listitem>
+          <listitem><varname>peer.adr</varname>  Address of the peer cluster given in the format hbase.zookeeper.quorum:hbase.zookeeper.client.port:zookeeper.znode.parent</listitem>
+          <listitem><varname>families</varname>  Comma-separated list of ColumnFamilies to copy.</listitem>
+          <listitem><varname>all.cells</varname>  Also copy delete markers and uncollected deleted cells (advanced option).</listitem>
+        </itemizedlist>
+         Args:
+        <itemizedlist>
+          <listitem>tablename  Name of table to copy.</listitem>
+        </itemizedlist>
+        </para>
+        <para>Example of copying a one-hour window of 'TestTable' to a peer cluster:
+<programlisting>$ bin/hbase org.apache.hadoop.hbase.mapreduce.CopyTable
+--starttime=1265875194289 --endtime=1265878794289
+--peer.adr=server1,server2,server3:2181:/hbase TestTable</programlisting>
+        </para>
+        <para>Note:  caching for the input Scan is configured via <code>hbase.client.scanner.caching</code> in the job configuration.
+        </para>
+    </section>
+    <section xml:id="export">
+       <title>Export</title>
+       <para>Export is a utility that will dump the contents of a table to HDFS in a sequence file.  Invoke via:
+<programlisting>$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export &lt;tablename&gt; &lt;outputdir&gt; [&lt;versions&gt; [&lt;starttime&gt; [&lt;endtime&gt;]]]
+</programlisting>
+       </para>
+        <para>Note:  caching for the input Scan is configured via <code>hbase.client.scanner.caching</code> in the job configuration.
+        </para>
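+        <para>For example, a sketch of overriding the caching on the command line; the table
+        name and output directory are placeholders, and this assumes the job accepts generic
+        Hadoop <code>-D</code> options:
+<programlisting>$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export -Dhbase.client.scanner.caching=100 MyTable /export/MyTable
+</programlisting>
+        </para>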
+    </section>
+    <section xml:id="import">
+       <title>Import</title>
+       <para>Import is a utility that will load data that has been exported back into HBase.  Invoke via:
+<programlisting>$ bin/hbase org.apache.hadoop.hbase.mapreduce.Import &lt;tablename&gt; &lt;inputdir&gt;
+</programlisting>
+       </para>
+    </section>
+    <section xml:id="importtsv">
+       <title>ImportTsv</title>
+       <para>ImportTsv is a utility that will load data in TSV format into HBase.  It has two distinct usages:  loading data from TSV format in HDFS 
+       into HBase via Puts, and preparing StoreFiles to be loaded via the <code>completebulkload</code> utility.
+       </para>
+       <para>To load data via Puts (i.e., non-bulk loading):
+<programlisting>$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c &lt;tablename&gt; &lt;hdfs-inputdir&gt;
+</programlisting>
+       </para>
+       <para>To generate StoreFiles for bulk-loading:
+<programlisting>$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c -Dimporttsv.bulk.output=hdfs://storefile-outputdir &lt;tablename&gt; &lt;hdfs-data-inputdir&gt;
+</programlisting>
+       </para>
+       <para>These generated StoreFiles can be loaded into HBase via <xref linkend="completebulkload"/>. 
+       </para>
+       <section xml:id="importtsv.options"><title>ImportTsv Options</title>
+       <para>Running ImportTsv with no arguments prints brief usage information:</para>
+<programlisting>
+Usage: importtsv -Dimporttsv.columns=a,b,c &lt;tablename&gt; &lt;inputdir&gt;
+
+Imports the given input directory of TSV data into the specified table.
+
+The column names of the TSV data must be specified using the -Dimporttsv.columns
+option. This option takes the form of comma-separated column names, where each
+column name is either a simple column family, or a columnfamily:qualifier. The special
+column name HBASE_ROW_KEY is used to designate that this column should be used
+as the row key for each imported record. You must specify exactly one column
+to be the row key, and you must specify a column name for every column that exists in the
+input data.
+
+By default importtsv will load data directly into HBase. To instead generate
+HFiles of data to prepare for a bulk data load, pass the option:
+  -Dimporttsv.bulk.output=/path/for/output
+  Note: if you do not use this option, then the target table must already exist in HBase
+
+Other options that may be specified with -D include:
+  -Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
+  '-Dimporttsv.separator=|' - eg separate on pipes instead of tabs
+  -Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
+  -Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of org.apache.hadoop.hbase.mapreduce.TsvImporterMapper
+</programlisting>       
+       </section>
+       <section xml:id="importtsv.example"><title>ImportTsv Example</title>
+         <para>For example, assume that we are loading data into a table called 'datatsv' with a ColumnFamily called 'd' with two columns "c1" and "c2".
+         </para>
+         <para>Assume that an input file exists as follows:
+<programlisting>
+row1	c1	c2
+row2	c1	c2
+row3	c1	c2
+row4	c1	c2
+row5	c1	c2
+row6	c1	c2
+row7	c1	c2
+row8	c1	c2
+row9	c1	c2
+row10	c1	c2
+</programlisting>
+         </para>
+         <para>For ImportTsv to use this input file, the command line needs to look like this:
+ <programlisting>
+ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,d:c1,d:c2 -Dimporttsv.bulk.output=hdfs://storefileoutput  datatsv hdfs://inputfile
+ </programlisting>
+         ... and in this example the first column is the rowkey, which is why HBASE_ROW_KEY is used.  The second and third columns in the file will be imported as "d:c1" and "d:c2", respectively.
+         </para>
+       </section>
+       <section xml:id="importtsv.warning"><title>ImportTsv Warning</title>
+         <para>If you are preparing a lot of data for bulk loading, make sure the target HBase table is pre-split appropriately.
+         </para>
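+         <para>For example, a sketch of pre-splitting from the HBase shell, assuming your shell
+         version supports the <code>SPLITS</code> option; the split points are illustrative and
+         should match your rowkey distribution:
+<programlisting>hbase(main):001:0> create 'datatsv', 'd', {SPLITS => ['row3', 'row5', 'row7']}</programlisting>
+         </para>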
+       </section>
+       <section xml:id="importtsv.also"><title>See Also</title>
+       <para>For more information about bulk-loading HFiles into HBase, see <xref linkend="arch.bulk.load"/>.</para>
+       </section>       
+    </section>
+    
+    <section xml:id="completebulkload">
+       <title>CompleteBulkLoad</title>
+	   <para>The <code>completebulkload</code> utility will move generated StoreFiles into an HBase table.  This utility is often used
+	   in conjunction with output from <xref linkend="importtsv"/>.  
+	   </para>
+	   <para>There are two ways to invoke this utility, with explicit classname and via the driver: 
+<programlisting>$ bin/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles &lt;hdfs://storefileoutput&gt; &lt;tablename&gt;
+</programlisting>
+... and via the Driver:
+<programlisting>HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-VERSION.jar completebulkload &lt;hdfs://storefileoutput&gt; &lt;tablename&gt;
+</programlisting>
+	  </para>
+       <para>For more information about bulk-loading HFiles into HBase, see <xref linkend="arch.bulk.load"/>.
+       </para>
+    </section>
+    <section xml:id="walplayer">
+       <title>WALPlayer</title>
+       <para>WALPlayer is a utility to replay WAL files into HBase.
+       </para>
+       <para>The WAL can be replayed for a set of tables or all tables, and a timerange can be provided (in milliseconds). The WAL is filtered to this set of tables. The output can optionally be mapped to another set of tables.
+       </para>
+       <para>WALPlayer can also generate HFiles for later bulk importing, in that case only a single table and no mapping can be specified.
+       </para>
+       <para>Invoke via:
+<programlisting>$ bin/hbase org.apache.hadoop.hbase.mapreduce.WALPlayer [options] &lt;wal inputdir&gt; &lt;tables&gt; [&lt;tableMappings&gt;]
+</programlisting>
+       </para>
+       <para>For example:
+<programlisting>$ bin/hbase org.apache.hadoop.hbase.mapreduce.WALPlayer /backuplogdir oldTable1,oldTable2 newTable1,newTable2
+</programlisting>
+       </para>
+    </section>
+    <section xml:id="rowcounter">
+       <title>RowCounter</title>
+       <para>RowCounter is a utility that will count all the rows of a table.  This is a good utility to use
+       as a sanity check to ensure that HBase can read all the blocks of a table if there are any concerns of metadata inconsistency.
+<programlisting>$ bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter &lt;tablename&gt; [&lt;column1&gt; &lt;column2&gt;...]
+</programlisting>
+       </para>
+       <para>Note:  caching for the input Scan is configured via <code>hbase.client.scanner.caching</code> in the job configuration.
+       </para>
+    </section>
+           
+    </section>  <!--  tools -->
+
+  <section xml:id="ops.regionmgt">
+    <title>Region Management</title>
+    <section xml:id="ops.regionmgt.majorcompact">
+      <title>Major Compaction</title>
+      <para>Major compactions can be requested via the HBase shell or <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html#majorCompact%28java.lang.String%29">HBaseAdmin.majorCompact</link>.
+      </para>
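+      <para>For example, from the HBase shell (the table name is illustrative):
+<programlisting>hbase(main):001:0> major_compact 'myTable'</programlisting>
+      </para>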
+      <para>Note:  major compactions do NOT do region merges.  See <xref linkend="compaction"/> for more information about compactions.
+      
+      </para>
+    </section>
+    <section xml:id="ops.regionmgt.merge">
+      <title>Merge</title>
+      <para>Merge is a utility that can merge adjoining regions in the same table (see org.apache.hadoop.hbase.util.Merge).</para>
+<programlisting>$ bin/hbase org.apache.hadoop.hbase.util.Merge &lt;tablename&gt; &lt;region1&gt; &lt;region2&gt;
+</programlisting>
+      <para>If you feel you have too many regions and want to consolidate them, Merge is the utility you need.  Merge must
+      be run when the cluster is down.  
+      See the <link xlink:href="http://ofps.oreilly.com/titles/9781449396107/performance.html">O'Reilly HBase Book</link> for
+      an example of usage.
+      </para>
+      <para>Additionally, there is a Ruby script attached to <link xlink:href="https://issues.apache.org/jira/browse/HBASE-1621">HBASE-1621</link> 
+      for region merging.
+      </para>
+    </section>
+  </section>
+    
+    <section xml:id="node.management"><title>Node Management</title>
+     <section xml:id="decommission"><title>Node Decommission</title>
+        <para>You can stop an individual RegionServer by running the following
+            script in the HBase directory on the particular  node:
+            <programlisting>$ ./bin/hbase-daemon.sh stop regionserver</programlisting>
+            The RegionServer will first close all regions and then shut itself down.
+            On shutdown, the RegionServer's ephemeral node in ZooKeeper will expire.
+            The master will notice the RegionServer gone and will treat it as
+            a 'crashed' server; it will reassign the regions the RegionServer was carrying.
+            <note><title>Disable the Load Balancer before Decommissioning a node</title>
+             <para>If the load balancer runs while a node is shutting down, then
+                 there could be contention between the Load Balancer and the
+                 Master's recovery of the just decommissioned RegionServer.
+                 Avoid any problems by disabling the balancer first.
+                 See <xref linkend="lb" /> below.
+             </para>
+            </note>
+        </para>
+        <para>
+        A downside to the above stop of a RegionServer is that regions could be offline for
+        a good period of time.  Regions are closed in order.  If there are many regions on the server, the
+        first region to close may not be back online until all regions close and after the master
+        notices the RegionServer's znode gone.  HBase 0.90.2 added a facility for having
+        a node gradually shed its load and then shut itself down: the
+            <filename>graceful_stop.sh</filename> script.  Here is its usage:
+            <programlisting>$ ./bin/graceful_stop.sh 
+Usage: graceful_stop.sh [--config &lt;conf-dir&gt;] [--restart] [--reload] [--thrift] [--rest] &lt;hostname&gt;
+ thrift      If we should stop/start thrift before/after the hbase stop/start
+ rest        If we should stop/start rest before/after the hbase stop/start
+ restart     If we should restart after graceful stop
+ reload      Move offloaded regions back on to the stopped server
+ debug       Print helpful debug information
+ hostname    Hostname of server we are to stop</programlisting>
+        </para>
+        <para>
+            To decommission a loaded RegionServer, run the following:
+            <programlisting>$ ./bin/graceful_stop.sh HOSTNAME</programlisting>
+            where <varname>HOSTNAME</varname> is the host carrying the RegionServer
+            you would decommission.  
+            <note><title>On <varname>HOSTNAME</varname></title>
+                <para>The <varname>HOSTNAME</varname> passed to <filename>graceful_stop.sh</filename>
+            must match the hostname that HBase is using to identify RegionServers.
+            Check the list of RegionServers in the master UI for how HBase is
+            referring to servers. It's usually a hostname but can also be an FQDN.
+            Whatever HBase is using, this is what you should pass to the
+            <filename>graceful_stop.sh</filename> decommission
+            script.  If you pass IPs, the script is not yet smart enough to make
+            a hostname (or FQDN) of it and so it will fail when it checks if the server is
+            currently running; the graceful unloading of regions will not run.
+            </para>
+        </note> The <filename>graceful_stop.sh</filename> script will move the regions off the
+            decommissioned RegionServer one at a time to minimize region churn.
+            It will verify the region deployed in the new location before it
+            moves the next region, and so on, until the decommissioned server
+            is carrying zero regions.  At this point, <filename>graceful_stop.sh</filename>
+            tells the RegionServer to <command>stop</command>.  The master will at this point notice the
+            RegionServer gone but all regions will have already been redeployed
+            and because the RegionServer went down cleanly, there will be no
+            WAL logs to split.
+            <note xml:id="lb"><title>Load Balancer</title>
+            <para> 
+                It is assumed that the Region Load Balancer is disabled while the
+                <command>graceful_stop</command> script runs (otherwise the balancer
+                and the decommission script will end up fighting over region deployments).
+                Use the shell to disable the balancer:
+                <programlisting>hbase(main):001:0> balance_switch false
+true
+0 row(s) in 0.3590 seconds</programlisting>
+This turns the balancer OFF.  To reenable, do:
+                <programlisting>hbase(main):001:0> balance_switch true
+false
+0 row(s) in 0.3590 seconds</programlisting>
+            </para> 
+        </note>
+        </para>
+        </section>  
+        <section xml:id="rolling">
+            <title>Rolling Restart</title>
+        <para>
+            You can also ask this script to restart a RegionServer after the shutdown
+            AND move its old regions back into place.  The latter you might do to
+            retain data locality.  A primitive rolling restart might be effected by
+            running something like the following:
+            <programlisting>$ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i; done &amp;> /tmp/log.txt &amp;
+            </programlisting>
+            Tail the output of <filename>/tmp/log.txt</filename> to follow the script's
+            progress. The above does RegionServers only.  Be sure to disable the
+            load balancer before doing the above.  You'd need to do the master
+            update separately.  Do it before you run the above script.
+            Here is a pseudo-script for how you might craft a rolling restart script:
+            <orderedlist>
+                <listitem><para>Untar your release, make sure of its configuration and
+                        then rsync it across the cluster. If this is 0.90.2, patch it
+                        with HBASE-3744 and HBASE-3756.
+                    </para>
+                </listitem>
+                <listitem>
+                    <para>Run hbck to ensure the cluster is consistent
+                        <programlisting>$ ./bin/hbase hbck</programlisting>
+                    Effect repairs if inconsistent.
+                    </para>
+                </listitem>
+                <listitem>
+                    <para>Restart the Master: <programlisting>$ ./bin/hbase-daemon.sh stop master; ./bin/hbase-daemon.sh start master</programlisting>
+                    </para>
+                </listitem>
+                <listitem>
+                    <para>
+                       Disable the region balancer:<programlisting>$ echo "balance_switch false" | ./bin/hbase shell</programlisting>
+                    </para>
+                </listitem>
+                <listitem>
+                     <para>Run the <filename>graceful_stop.sh</filename> script per RegionServer.  For example:
+            <programlisting>$ for i in `cat conf/regionservers|sort`; do ./bin/graceful_stop.sh --restart --reload --debug $i; done &amp;> /tmp/log.txt &amp;
+            </programlisting>
+                     If you are running thrift or rest servers on the RegionServer, pass --thrift or --rest options (See usage
+                     for <filename>graceful_stop.sh</filename> script).
+                 </para>
+                </listitem>
+                <listitem>
+                    <para>Restart the Master again.  This will clear out the dead servers list and reenable the balancer.
+                    </para>
+                </listitem>
+                <listitem>
+                    <para>Run hbck to ensure the cluster is consistent.
+                    </para>
+                </listitem>
+            </orderedlist>
+        </para>
+    </section>
+    </section>  <!--  node mgt -->
+
+  <section xml:id="hbase_metrics">
+  <title>HBase Metrics</title>
+  <section xml:id="metric_setup">
+  <title>Metric Setup</title>
+  <para>See <link xlink:href="http://hbase.apache.org/metrics.html">Metrics</link> for
+  an introduction and how to enable Metrics emission.
+  </para>
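+  <para>As a rough sketch under the Hadoop metrics v1 framework, emission is configured in
+  <filename>hadoop-metrics.properties</filename>; the context and attribute names below are
+  illustrative, so consult the Metrics page for the authoritative settings:
+<programlisting># emit hbase metrics to a local file every 10 seconds (illustrative)
+hbase.class=org.apache.hadoop.metrics.file.FileContext
+hbase.period=10
+hbase.fileName=/tmp/hbase_metrics.log</programlisting>
+  </para>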
+  </section>
+   <section xml:id="rs_metrics">
+   <title>RegionServer Metrics</title>
+          <section xml:id="hbase.regionserver.blockCacheCount"><title><varname>hbase.regionserver.blockCacheCount</varname></title>
+          <para>Block cache item count in memory.  This is the number of blocks of StoreFiles (HFiles) in the cache.</para>
+		  </section>
+         <section xml:id="hbase.regionserver.blockCacheEvictedCount"><title><varname>hbase.regionserver.blockCacheEvictedCount</varname></title>
+          <para>Number of blocks that had to be evicted from the block cache due to heap size constraints.</para>
+		  </section>
+         <section xml:id="hbase.regionserver.blockCacheFree"><title><varname>hbase.regionserver.blockCacheFree</varname></title>
+          <para>Block cache memory available (bytes).</para>
+		  </section>
+          <section xml:id="hbase.regionserver.blockCacheHitCachingRatio"><title><varname>hbase.regionserver.blockCacheHitCachingRatio</varname></title>
+          <para>Block cache hit caching ratio (0 to 100).  The cache-hit ratio for reads configured to look in the cache (i.e., cacheBlocks=true). </para>
+		  </section>
+          <section xml:id="hbase.regionserver.blockCacheHitCount"><title><varname>hbase.regionserver.blockCacheHitCount</varname></title>
+          <para>Number of blocks of StoreFiles (HFiles) read from the cache.</para>
+		  </section>
+          <section xml:id="hbase.regionserver.blockCacheHitRatio"><title><varname>hbase.regionserver.blockCacheHitRatio</varname></title>
+          <para>Block cache hit ratio (0 to 100).  Includes all read requests, although those with cacheBlocks=false
+           will always read from disk and be counted as a "cache miss".</para>
+		  </section>
+          <section xml:id="hbase.regionserver.blockCacheMissCount"><title><varname>hbase.regionserver.blockCacheMissCount</varname></title>
+          <para>Number of blocks of StoreFiles (HFiles) requested but not read from the cache.</para>
+		  </section>
+          <section xml:id="hbase.regionserver.blockCacheSize"><title><varname>hbase.regionserver.blockCacheSize</varname></title>
+          <para>Block cache size in memory (bytes), i.e., the memory in use by the BlockCache.</para>
+		  </section>
+          <section xml:id="hbase.regionserver.compactionQueueSize"><title><varname>hbase.regionserver.compactionQueueSize</varname></title>
+          <para>Size of the compaction queue.  This is the number of Stores in the RegionServer that have been targeted for compaction.</para>
+		  </section>
+          <section xml:id="hbase.regionserver.flushQueueSize"><title><varname>hbase.regionserver.flushQueueSize</varname></title>
+          <para>Number of regions enqueued and awaiting MemStore flush.</para>
+		  </section>
+          <section xml:id="hbase.regionserver.fsReadLatency_avg_time"><title><varname>hbase.regionserver.fsReadLatency_avg_time</varname></title>
+          <para>Filesystem read latency (ms).  This is the average time to read from HDFS.</para>
+		  </section>
+          <section xml:id="hbase.regionserver.fsReadLatency_num_ops"><title><varname>hbase.regionserver.fsReadLatency_num_ops</varname></title>
+          <para>Filesystem read operations.</para>
+		  </section>
+          <section xml:id="hbase.regionserver.fsSyncLatency_avg_time"><title><varname>hbase.regionserver.fsSyncLatency_avg_time</varname></title>
+          <para>Filesystem sync latency (ms).  Latency to sync the write-ahead log records to the filesystem.</para>
+		  </section>
+          <section xml:id="hbase.regionserver.fsSyncLatency_num_ops"><title><varname>hbase.regionserver.fsSyncLatency_num_ops</varname></title>
+          <para>Number of operations to sync the write-ahead log records to the filesystem.</para>
+		  </section>
+          <section xml:id="hbase.regionserver.fsWriteLatency_avg_time"><title><varname>hbase.regionserver.fsWriteLatency_avg_time</varname></title>
+          <para>Filesystem write latency (ms).  Total latency for all writers, including StoreFiles and write-ahead log.</para>
+		  </section>
+          <section xml:id="hbase.regionserver.fsWriteLatency_num_ops"><title><varname>hbase.regionserver.fsWriteLatency_num_ops</varname></title>
+          <para>Number of filesystem write operations, including StoreFiles and write-ahead log.</para>
+		  </section>
+          <section xml:id="hbase.regionserver.memstoreSizeMB"><title><varname>hbase.regionserver.memstoreSizeMB</varname></title>
+          <para>Sum of all the memstore sizes in this RegionServer (MB)</para>
+		  </section>
+          <section xml:id="hbase.regionserver.regions"><title><varname>hbase.regionserver.regions</varname></title>
+          <para>Number of regions served by the RegionServer</para>
+		  </section>
+          <section xml:id="hbase.regionserver.requests"><title><varname>hbase.regionserver.requests</varname></title>
+          <para>Total number of read and write requests.  Requests correspond to RegionServer RPC calls, thus a single Get will result in 1 request, but a Scan with caching set to 1000 will result in 1 request for each 'next' call (i.e., not each row).  A bulk-load request will constitute 1 request per HFile.</para>
+		  </section>
+          <section xml:id="hbase.regionserver.storeFileIndexSizeMB"><title><varname>hbase.regionserver.storeFileIndexSizeMB</varname></title>
+          <para>Sum of all the StoreFile index sizes in this RegionServer (MB)</para>
+		  </section>
+          <section xml:id="hbase.regionserver.stores"><title><varname>hbase.regionserver.stores</varname></title>
+          <para>Number of Stores open on the RegionServer.  A Store corresponds to a ColumnFamily.  For example, if a table (which contains the column family) has 3 regions on a RegionServer, there will be 3 stores open for that column family. </para>
+		  </section>
+          <section xml:id="hbase.regionserver.storeFiles"><title><varname>hbase.regionserver.storeFiles</varname></title>
+          <para>Number of StoreFiles open on the RegionServer.  A store may have more than one StoreFile (HFile).</para>
+		  </section>
+   </section>
+  </section>
+
+  <section xml:id="ops.monitoring">
+    <title >HBase Monitoring</title>
+    <section xml:id="ops.monitoring.overview">
+    <title>Overview</title>
+      <para>The following metrics are arguably the most important to monitor for each RegionServer for
+      "macro monitoring", preferably with a system like <link xlink:href="http://opentsdb.net/">OpenTSDB</link>.
+      If your cluster is having performance issues it's likely that you'll see something unusual with 
+      this group.
+      </para>
+      <para>HBase: 
+      <itemizedlist>
+      <listitem>Requests</listitem>
+      <listitem>Compactions queue</listitem>
+      </itemizedlist>
+      </para> 
+      <para>OS: 
+      <itemizedlist>
+      <listitem>IO Wait</listitem>
+      <listitem>User CPU</listitem>
+      </itemizedlist>
+      </para> 
+      <para>Java: 
+      <itemizedlist>
+      <listitem>GC</listitem>
+      </itemizedlist>
+      </para> 
+      <para>
+      For more information on HBase metrics, see <xref linkend="hbase_metrics"/>.
+      </para>
+    </section>
+    
+    <section xml:id="ops.slow.query">
+    <title>Slow Query Log</title>
+<para>The HBase slow query log consists of parseable JSON structures describing the properties of those client operations (Gets, Puts, Deletes, etc.) that either took too long to run, or produced too much output. The thresholds for "too long to run" and "too much output" are configurable, as described below. The output is produced inline in the main region server logs so that it is easy to discover further details from context with other logged events. It is also prepended with identifying tags <constant>(responseTooSlow)</constant>, <constant>(responseTooLarge)</constant>, <constant>(operationTooSlow)</constant>, and <constant>(operationTooLarge)</constant> in order to enable easy filtering with grep, in case the user desires to see only slow queries.
+</para>
+
+<section><title>Configuration</title>
+<para>There are two configuration knobs that can be used to adjust the thresholds for when queries are logged; an example configuration snippet follows the list below.
+</para>
+
+<itemizedlist>
+<listitem>
+<varname>hbase.ipc.warn.response.time</varname> Maximum number of milliseconds that a query can be run without being logged. Defaults to 10000, or 10 seconds. Can be set to -1 to disable logging by time.
+</listitem>
+<listitem><varname>hbase.ipc.warn.response.size</varname> Maximum byte size of response that a query can return without being logged. Defaults to 100 megabytes. Can be set to -1 to disable logging by size.
+</listitem>
+</itemizedlist>
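+<para>A minimal <filename>hbase-site.xml</filename> sketch setting both thresholds (the values
+are illustrative; the size is in bytes):
+<programlisting>&lt;property&gt;
+  &lt;name&gt;hbase.ipc.warn.response.time&lt;/name&gt;
+  &lt;value&gt;5000&lt;/value&gt;
+&lt;/property&gt;
+&lt;property&gt;
+  &lt;name&gt;hbase.ipc.warn.response.size&lt;/name&gt;
+  &lt;value&gt;52428800&lt;/value&gt;
+&lt;/property&gt;</programlisting>
+</para>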
+</section>
+
+<section><title>Metrics</title>
+<para>The slow query log exposes two metrics to JMX.
+<itemizedlist><listitem><varname>hadoop.regionserver_rpc_slowResponse</varname> A global metric reflecting the durations of all responses that triggered logging.</listitem>
+<listitem><varname>hadoop.regionserver_rpc_methodName.aboveOneSec</varname> A metric reflecting the durations of all responses that lasted for more than one second.</listitem>
+</itemizedlist>
+</para>
+</section>
+
+<section><title>Output</title>
+<para>The output is tagged with operation e.g. <constant>(operationTooSlow)</constant> if the call was a client operation, such as a Put, Get, or Delete, which we expose detailed fingerprint information for. If not, it is tagged <constant>(responseTooSlow)</constant> and still produces parseable JSON output, but with less verbose information solely regarding its duration and size in the RPC itself. <constant>TooLarge</constant> is substituted for <constant>TooSlow</constant> if the response size triggered the logging, with <constant>TooLarge</constant> appearing even in the case that both size and duration triggered logging.
+</para>
+</section>
+<section><title>Example</title>
+<para>
+<programlisting>2011-09-08 10:01:25,824 WARN org.apache.hadoop.ipc.HBaseServer: (operationTooSlow): {"tables":{"riley2":{"puts":[{"totalColumns":11,"families":{"actions":[{"timestamp":1315501284459,"qualifier":"0","vlen":9667580},{"timestamp":1315501284459,"qualifier":"1","vlen":10122412},{"timestamp":1315501284459,"qualifier":"2","vlen":11104617},{"timestamp":1315501284459,"qualifier":"3","vlen":13430635}]},"row":"cfcd208495d565ef66e7dff9f98764da:0"}],"families":["actions"]}},"processingtimems":956,"client":"10.47.34.63:33623","starttimems":1315501284456,"queuetimems":0,"totalPuts":1,"class":"HRegionServer","responsesize":0,"method":"multiPut"}</programlisting>
+</para>
+
+<para>Note that everything inside the "tables" structure is output produced by MultiPut's fingerprint, while the rest of the information is RPC-specific, such as processing time and client IP/port. Other client operations follow the same pattern and the same general structure, with necessary differences due to the nature of the individual operations. In the case that the call is not a client operation, that detailed fingerprint information will be completely absent.
+</para>
+
+<para>This particular example would indicate that the likely cause of slowness is simply a very large (on the order of 100MB) multiput, as we can tell by the "vlen," or value length, fields of each put in the multiPut.
+</para>
+</section>
+</section>
+
+
+
+  </section>
+  
+  <section xml:id="cluster_replication">
+    <title>Cluster Replication</title>
+    <para>See <link xlink:href="http://hbase.apache.org/replication.html">Cluster Replication</link>.
+    </para>
+  </section>
+  <section xml:id="ops.backup">
+    <title >HBase Backup</title>
+    <para>There are two broad strategies for performing HBase backups: backing up with a full cluster shutdown, and backing up on a live cluster. 
+    Each approach has pros and cons.   
+    </para>
+    <para>For additional information, see <link xlink:href="http://blog.sematext.com/2011/03/11/hbase-backup-options/">HBase Backup Options</link> over on the Sematext Blog.
+    </para>
+    <section xml:id="ops.backup.fullshutdown"><title>Full Shutdown Backup</title>
+      <para>Some environments can tolerate a periodic full shutdown of their HBase cluster, for example if it is being used as a back-end analytic capacity
+      and not serving front-end web-pages.  The benefits are that the NameNode/Master and RegionServers are down, so there is no chance of missing
+      any in-flight changes to either StoreFiles or metadata.  The obvious con is that the cluster is down.  The steps include:
+      </para>
+      <section xml:id="ops.backup.fullshutdown.stop"><title>Stop HBase</title>
+        <para>Stop HBase, e.g. via <filename>bin/stop-hbase.sh</filename>, and wait for all processes to exit before copying files.
+        </para>
+      </section>
+      <section xml:id="ops.backup.fullshutdown.distcp"><title>Distcp</title>
+        <para>Distcp could be used to copy the contents of the HBase directory in HDFS either to another directory on the same cluster, or 
+        to a different cluster.
+        </para>
+        <para>Note:  Distcp works in this situation because the cluster is down and there are no in-flight edits to files.  
+        Distcp-ing of files in the HBase directory is not generally recommended on a live cluster.
+        </para>
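+        <para>A minimal sketch of the copy, assuming the default <filename>/hbase</filename>
+        root directory; the NameNode addresses and backup path are illustrative:
+<programlisting>$ hadoop distcp hdfs://namenode:8020/hbase hdfs://backupnamenode:8020/hbase-backup</programlisting>
+        </para>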
+      </section>
+      <section xml:id="ops.backup.fullshutdown.restore"><title>Restore (if needed)</title>
+        <para>The backup of the hbase directory from HDFS is copied onto the 'real' hbase directory via distcp.  The act of copying these files 
+        creates new HDFS metadata, which is why a restore of the NameNode edits from the time of the HBase backup isn't required for this kind of
+        restore, because it's a restore (via distcp) of a specific HDFS directory (i.e., the HBase part) not the entire HDFS file-system.
+        </para>
+      </section>
+    </section>
+    <section xml:id="ops.backup.live.replication"><title>Live Cluster Backup - Replication</title>
+      <para>This approach assumes that there is a second cluster.  
+      See the HBase page on <link xlink:href="http://hbase.apache.org/replication.html">replication</link> for more information.
+      </para>
+    </section>
+    <section xml:id="ops.backup.live.copytable"><title>Live Cluster Backup - CopyTable</title>
+      <para>The <xref linkend="copytable" /> utility could either be used to copy data from one table to another on the 
+      same cluster, or to copy data to another table on another cluster.
+      </para>
+      <para>Since the cluster is up, there is a risk that edits could be missed in the copy process.
+      </para>
+    </section>
+    <section xml:id="ops.backup.live.export"><title>Live Cluster Backup - Export</title>
+      <para>The <xref linkend="export" /> approach dumps the content of a table to HDFS on the same cluster.  To restore the data, the
+      <xref linkend="import" /> utility would be used.
+      </para>
+      <para>Since the cluster is up, there is a risk that edits could be missed in the export process.
+      </para>
+    </section>
+  </section>  <!--  backup -->
+  <section xml:id="ops.capacity"><title>Capacity Planning</title>
+    <section xml:id="ops.capacity.storage"><title>Storage</title>
+      <para>A common question for HBase administrators is estimating how much storage will be required for an HBase cluster.
+      There are several aspects to consider, the most important of which is what data will be loaded into the cluster.  Start
+      with a solid understanding of how HBase handles data internally (KeyValue).
+      </para>
+      <section xml:id="ops.capacity.storage.kv"><title>KeyValue</title>
+        <para>HBase storage will be dominated by KeyValues.  See <xref linkend="keyvalue" /> and <xref linkend="keysize" /> for 
+        how HBase stores data internally.  
+        </para>
+        <para>It is critical to understand that there is a KeyValue instance for every attribute stored in a row, and the 
+        rowkey-length, ColumnFamily name-length and attribute lengths will drive the size of the database more than any other
+        factor.
+        </para>
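+        <para>As a back-of-envelope sketch (byte counts are approximate and ignore per-KeyValue
+        infrastructure overhead), compare two ways of keying the same attribute:
+<programlisting># key portion stored with every cell = rowkey + ColumnFamily name + qualifier + timestamp
+# verbose: rowkey "com.example.www-20120601" (24) + cf "details" (7) + qualifier "pageviews" (9) + ts (8) = ~48 bytes/cell
+# compact: 8-byte rowkey hash (8) + cf "d" (1) + qualifier "pv" (2) + ts (8) = ~19 bytes/cell
+# over 10 billion cells the difference alone is ~290 GB, before HDFS replication</programlisting>
+        </para>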
+      </section>
+      <section xml:id="ops.capacity.storage.sf"><title>StoreFiles and Blocks</title>
+        <para>KeyValue instances are aggregated into blocks, and the blocksize is configurable on a per-ColumnFamily basis.
+        Blocks are aggregated into StoreFiles.  See <xref linkend="regions.arch" />.
+        </para>
+      </section>
+      <section xml:id="ops.capacity.storage.hdfs"><title>HDFS Block Replication</title>
+        <para>Because HBase runs on top of HDFS, factor HDFS block replication into storage calculations.
+        </para>
+      </section>
+    </section>
+    <section xml:id="ops.capacity.regions"><title>Regions</title>
+      <para>Another common question for HBase administrators is determining the right number of regions per
+      RegionServer.  This affects both storage and hardware planning. See <xref linkend="perf.number.of.regions" />.
+      </para>
+    </section>
+  </section>
+
+</chapter>

Added: hbase/trunk/src/docbkx/performance.xml
URL: http://svn.apache.org/viewvc/hbase/trunk/src/docbkx/performance.xml?rev=1345788&view=auto
==============================================================================
--- hbase/trunk/src/docbkx/performance.xml (added)
+++ hbase/trunk/src/docbkx/performance.xml Sun Jun  3 21:59:50 2012
@@ -0,0 +1,547 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<chapter version="5.0" xml:id="performance"
+         xmlns="http://docbook.org/ns/docbook"
+         xmlns:xlink="http://www.w3.org/1999/xlink"
+         xmlns:xi="http://www.w3.org/2001/XInclude"
+         xmlns:svg="http://www.w3.org/2000/svg"
+         xmlns:m="http://www.w3.org/1998/Math/MathML"
+         xmlns:html="http://www.w3.org/1999/xhtml"
+         xmlns:db="http://docbook.org/ns/docbook">
+<!--
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+-->
+  <title>Performance Tuning</title>
+
+  <section xml:id="perf.os">
+    <title>Operating System</title>
+        <section xml:id="perf.os.ram">
+          <title>Memory</title>
+          <para>RAM, RAM, RAM.  Don't starve HBase.</para>
+        </section>
+        <section xml:id="perf.os.64">
+          <title>64-bit</title>
+          <para>Use a 64-bit platform (and 64-bit JVM).</para>
+        </section>
+        <section xml:id="perf.os.swap">
+          <title>Swapping</title>
+          <para>Watch out for swapping.  Set swappiness to 0.</para>
+        </section>
+  </section>
+  <section xml:id="perf.network">
+    <title>Network</title>
+    <para>
+    Perhaps the most important factor in avoiding network issues degrading Hadoop and HBase performance is the switching hardware
+    that is used.  Decisions made early in the scope of the project can cause major problems when you double or triple the size of your cluster (or more). 
+    </para>
+    <para>
+    Important items to consider:
+        <itemizedlist>
+          <listitem>Switching capacity of the device</listitem>
+          <listitem>Number of systems connected</listitem>
+          <listitem>Uplink capacity</listitem>
+        </itemizedlist>
+    </para>
+    <section xml:id="perf.network.1switch">
+      <title>Single Switch</title>
+      <para>The single most important factor in this configuration is that the switching capacity of the hardware is capable of 
+      handling the traffic which can be generated by all systems connected to the switch. Some lower-priced commodity hardware
+      can have slower switching capacity than a fully loaded switch would require. 
+      </para>
+    </section>
+    <section xml:id="perf.network.2switch">
+      <title>Multiple Switches</title>
+      <para>Multiple switches are a potential pitfall in the architecture.   The most common configuration of lower priced hardware is a
+      simple 1Gbps uplink from one switch to another. This often overlooked pinch point can easily become a bottleneck for cluster communication. 
+      Especially with MapReduce jobs that are both reading and writing a lot of data, the communication across this uplink could be saturated.
+      </para>
+      <para>Mitigation of this issue is fairly simple and can be accomplished in multiple ways:
+      <itemizedlist>
+        <listitem>Use appropriate hardware for the scale of the cluster which you're attempting to build.</listitem>
+        <listitem>Use larger single switch configurations, i.e., a single 48-port switch as opposed to two 24-port switches.</listitem>
+        <listitem>Configure port trunking for uplinks to utilize multiple interfaces to increase cross switch bandwidth.</listitem>
+      </itemizedlist>
+      </para>
+    </section>
+    <section xml:id="perf.network.multirack">
+      <title>Multiple Racks</title>
+      <para>Multiple rack configurations carry the same potential issues as multiple switches, and can suffer performance degradation from two main areas:
+         <itemizedlist>
+           <listitem>Poor switch capacity performance</listitem>
+           <listitem>Insufficient uplink to another rack</listitem>
+         </itemizedlist>
+      If the switches in your rack have appropriate switching capacity to handle all the hosts at full speed, the next most likely issue will be caused by homing 
+      more of your cluster across racks.  The easiest way to avoid issues when spanning multiple racks is to use port trunking to create a bonded uplink to other racks.
+      The downside of this method, however, is in the overhead of ports that could potentially be used. An example of this is creating an 8Gbps port channel from rack
+      A to rack B, using 8 of your 24 ports to communicate between racks; this gives you a poor ROI, but using too few ports can mean you're not getting the most out of your cluster. 
+      </para>
+      <para>Using 10GbE links between racks will greatly increase performance, and assuming your switches support a 10GbE uplink or allow for an expansion card will allow you to
+      save your ports for machines as opposed to uplinks.
+      </para>
+    </section>
+    <section xml:id="perf.network.ints">
+      <title>Network Interfaces</title>
+      <para>Are all the network interfaces functioning correctly?  Are you sure?  See the Troubleshooting Case Study in <xref linkend="casestudies.slownode"/>.
+      </para>
+    </section>
+  </section>  <!-- network -->
+
+  <section xml:id="jvm">
+    <title>Java</title>
+
+    <section xml:id="gc">
+      <title>The Garbage Collector and HBase</title>
+
+      <section xml:id="gcpause">
+        <title>Long GC pauses</title>
+
+        <para xml:id="mslab">In his presentation, <link
+        xlink:href="http://www.slideshare.net/cloudera/hbase-hug-presentation">Avoiding
+        Full GCs with MemStore-Local Allocation Buffers</link>, Todd Lipcon
+        describes two cases of stop-the-world garbage collections common in
+        HBase, especially during loading; CMS failure modes and old generation
+        heap fragmentation brought. To address the first, start the CMS
+        earlier than default by adding
+        <code>-XX:CMSInitiatingOccupancyFraction</code> and setting it down
+        from defaults. Start at 60 or 70 percent (The lower you bring down the
+        threshold, the more GCing is done, the more CPU used). To address the
+        second fragmentation issue, Todd added an experimental facility,
+        <indexterm><primary>MSLAB</primary></indexterm>, that
+        must be explicitly enabled in HBase 0.90.x (Its defaulted to be on in
+        0.92.x HBase). See <code>hbase.hregion.memstore.mslab.enabled</code>
+        to true in your <classname>Configuration</classname>. See the cited
+        slides for background and detail<footnote><para>The latest jvms do better
+        regards fragmentation so make sure you are running a recent release.
+        Read down in the message,
+        <link xlink:href="http://osdir.com/ml/hotspot-gc-use/2011-11/msg00002.html">Identifying concurrent mode failures caused by fragmentation</link>.</para></footnote>.</para>
+        <para>For more information about GC logs, see <xref linkend="trouble.log.gc" />.
+        </para>
+      </section>
+    </section>
+  </section>
+
+  <section xml:id="perf.configurations">
+    <title>HBase Configurations</title>
+
+    <para>See <xref linkend="recommended_configurations" />.</para>
+
+    <section xml:id="perf.number.of.regions">
+      <title>Number of Regions</title>
+
+      <para>The number of regions for an HBase table is driven by the <xref
+              linkend="bigger.regions" />. Also, see the architecture
+          section on <xref linkend="arch.regions.size" />.</para>
+    </section>
+
+    <section xml:id="perf.compactions.and.splits">
+      <title>Managing Compactions</title>
+
+      <para>For larger systems, managing <link
+      linkend="disable.splitting">compactions and splits</link> may be
+      something you want to consider.</para>
+    </section>
+
+    <section xml:id="perf.handlers">
+        <title><varname>hbase.regionserver.handler.count</varname></title>
+        <para>See <xref linkend="hbase.regionserver.handler.count"/>. 
+	    </para>
+    </section>
+    <section xml:id="perf.hfile.block.cache.size">
+        <title><varname>hfile.block.cache.size</varname></title>
+        <para>See <xref linkend="hfile.block.cache.size"/>. 
+        A memory setting for the RegionServer process.
+        </para>
+    </section>    
+    <section xml:id="perf.rs.memstore.upperlimit">
+        <title><varname>hbase.regionserver.global.memstore.upperLimit</varname></title>
+        <para>See <xref linkend="hbase.regionserver.global.memstore.upperLimit"/>.  
+        This memory setting is often adjusted for the RegionServer process depending on needs.
+        </para>
+    </section>    
+    <section xml:id="perf.rs.memstore.lowerlimit">
+        <title><varname>hbase.regionserver.global.memstore.lowerLimit</varname></title>
+        <para>See <xref linkend="hbase.regionserver.global.memstore.lowerLimit"/>.  
+        This memory setting is often adjusted for the RegionServer process depending on needs.
+        </para>
+    </section>
+    <section xml:id="perf.hstore.blockingstorefiles">
+        <title><varname>hbase.hstore.blockingStoreFiles</varname></title>
+        <para>See <xref linkend="hbase.hstore.blockingStoreFiles"/>.  
+        If there is blocking in the RegionServer logs, increasing this can help.
+        </para>
+    </section>
+    <section xml:id="perf.hregion.memstore.block.multiplier">
+        <title><varname>hbase.hregion.memstore.block.multiplier</varname></title>
+        <para>See <xref linkend="hbase.hregion.memstore.block.multiplier"/>.  
+        If there is enough RAM, increasing this can help.  
+        </para>
+    </section>
+
+  </section>
+  <section xml:id="perf.zookeeper">
+    <title>ZooKeeper</title>
+    <para>See <xref linkend="zookeeper"/> for information on configuring ZooKeeper, and see the part
+    about having a dedicated disk.
+    </para>
+  </section>
+  <section xml:id="perf.schema">
+      <title>Schema Design</title>
+  
+    <section xml:id="perf.number.of.cfs">
+      <title>Number of Column Families</title>
+      <para>See <xref linkend="number.of.cfs" />.</para>
+    </section>
+    <section xml:id="perf.schema.keys">
+      <title>Key and Attribute Lengths</title>
+      <para>See <xref linkend="keysize" />.  See also <xref linkend="perf.compression.however" /> for 
+      compression caveats.</para>
+    </section>
+    <section xml:id="schema.regionsize"><title>Table RegionSize</title>
+    <para>The regionsize can be set on a per-table basis via <code>setMaxFileSize</code> on
+    <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HTableDescriptor.html">HTableDescriptor</link> in the 
+    event where certain tables require different regionsizes than the configured default regionsize.
+    </para>
+    <para>See <xref linkend="perf.number.of.regions"/> for more information.
+    </para>
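+    <para>A minimal sketch in Java; the table name, family, and size are illustrative, and
+    <code>admin</code> is assumed to be an existing <classname>HBaseAdmin</classname>:
+<programlisting>HTableDescriptor desc = new HTableDescriptor("myTable");
+desc.addFamily(new HColumnDescriptor("cf"));
+desc.setMaxFileSize(2L * 1024 * 1024 * 1024);  // split regions at roughly 2GB instead of the default
+admin.createTable(desc);</programlisting>
+    </para>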
+    </section>
+    <section xml:id="schema.bloom">
+    <title>Bloom Filters</title>
+    <para>Bloom Filters can be enabled per-ColumnFamily.
+        Use <code>HColumnDescriptor.setBloomFilterType(NONE | ROW |
+        ROWCOL)</code> to enable blooms per Column Family. Default =
+        <varname>NONE</varname> for no bloom filters. If
+        <varname>ROW</varname>, the hash of the row will be added to the bloom
+        on each insert. If <varname>ROWCOL</varname>, the hash of the row +
+        column family + column family qualifier will be added to the bloom on
+        each key insert.</para>
+    <para>See <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html">HColumnDescriptor</link> and 
+    <xref linkend="blooms"/> for more information.
+    </para>
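+    <para>A sketch of enabling a bloom filter from the HBase shell, assuming your shell version
+    supports the <code>BLOOMFILTER</code> attribute; the names are illustrative:
+<programlisting>hbase(main):001:0> create 'mytable', {NAME => 'cf', BLOOMFILTER => 'ROWCOL'}</programlisting>
+    </para>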
+    </section>
+    <section xml:id="schema.cf.blocksize"><title>ColumnFamily BlockSize</title>
+    <para>The blocksize can be configured for each ColumnFamily in a table, and this defaults to 64k.  Larger cell values require larger blocksizes. 
+    There is an inverse relationship between blocksize and the resulting StoreFile indexes (i.e., if the blocksize is doubled then the resulting
+    indexes should be roughly halved).
+    </para>
+    <para>See <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html">HColumnDescriptor</link> 
+    and <xref linkend="store"/> for more information.
+    </para>
+    </section>
+    <section xml:id="cf.in.memory">
+    <title>In-Memory ColumnFamilies</title>
+    <para>ColumnFamilies can optionally be defined as in-memory.  Data is still persisted to disk, just like any other ColumnFamily.  
+    In-memory blocks have the highest priority in the <xref linkend="block.cache" />, but it is not a guarantee that the entire table
+    will be in memory.
+    </para>
+    <para>See <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html">HColumnDescriptor</link> for more information.
+    </para>
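+    <para>A sketch of declaring an in-memory ColumnFamily from the HBase shell (names are
+    illustrative):
+<programlisting>hbase(main):001:0> create 'mytable', {NAME => 'cf', IN_MEMORY => 'true'}</programlisting>
+    </para>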
+    </section>
+    <section xml:id="perf.compression">
+      <title>Compression</title>
+      <para>Production systems should use compression with their ColumnFamily definitions.  See <xref linkend="compression" /> for more information.
+      </para>
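+      <para>A minimal sketch (the family name is hypothetical; assumes the
+      <classname>Compression.Algorithm</classname> enum and that the Snappy codec is installed on the cluster):
+<programlisting>HColumnDescriptor cf = new HColumnDescriptor("cf");  // hypothetical family
+cf.setCompressionType(Compression.Algorithm.SNAPPY);  // compress StoreFiles on disk</programlisting>
+      </para>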
+      <section xml:id="perf.compression.however"><title>However...</title>
+         <para>Compression deflates data <emphasis>on disk</emphasis>.  When it's in-memory (e.g., in the 
+         MemStore) or on the wire (e.g., transferring between RegionServer and Client) it's inflated.
+         So while using ColumnFamily compression is a best practice, it's not going to completely eliminate
+         the impact of over-sized Keys, over-sized ColumnFamily names, or over-sized Column names. 
+         </para>
+         <para>See <xref linkend="keysize" /> for schema design tips, and <xref linkend="keyvalue"/> for more information on how HBase stores data internally.
+         </para> 
+      </section>
+    </section>
+  </section>  <!--  perf schema -->
+  
+  <section xml:id="perf.writing">
+    <title>Writing to HBase</title>
+
+    <section xml:id="perf.batch.loading">
+      <title>Batch Loading</title>
+      <para>Use the bulk load tool if you can.  See
+        <xref linkend="arch.bulk.load"/>.
+        Otherwise, pay attention to the below.
+      </para>
+    </section>  <!-- batch loading -->
+
+    <section xml:id="precreate.regions">
+    <title>
+    Table Creation: Pre-Creating Regions
+    </title>
+<para>
+Tables in HBase are initially created with one region by default.  For bulk imports, this means that all clients will write to the same region until it is large enough to split and become distributed across the cluster.  A useful pattern to speed up the bulk import process is to pre-create empty regions.  Be somewhat conservative in this, because too many regions can actually degrade performance.  An example of pre-creation using hex keys follows (note: this example may need to be tweaked for an individual application's keys):
+</para>
+<para>
+<programlisting>import java.io.IOException;
+import java.math.BigInteger;
+
+import org.apache.hadoop.hbase.HTableDescriptor;
+import org.apache.hadoop.hbase.TableExistsException;
+import org.apache.hadoop.hbase.client.HBaseAdmin;
+
+// These methods live in an enclosing class that defines a logger, e.g.
+//   private static final Log logger = LogFactory.getLog(...);
+
+public static boolean createTable(HBaseAdmin admin, HTableDescriptor table, byte[][] splits)
+throws IOException {
+  try {
+    admin.createTable( table, splits );
+    return true;
+  } catch (TableExistsException e) {
+    logger.info("table " + table.getNameAsString() + " already exists");
+    // the table already exists...
+    return false;  
+  }
+}
+
+// Returns (numRegions - 1) split keys that divide the hex keyspace
+// [startKey, endKey] into numRegions roughly equal-sized regions.
+public static byte[][] getHexSplits(String startKey, String endKey, int numRegions) {
+  byte[][] splits = new byte[numRegions-1][];
+  BigInteger lowestKey = new BigInteger(startKey, 16);
+  BigInteger highestKey = new BigInteger(endKey, 16);
+  BigInteger range = highestKey.subtract(lowestKey);
+  BigInteger regionIncrement = range.divide(BigInteger.valueOf(numRegions));
+  lowestKey = lowestKey.add(regionIncrement);
+  for(int i=0; i &lt; numRegions-1;i++) {
+    BigInteger key = lowestKey.add(regionIncrement.multiply(BigInteger.valueOf(i)));
+    // zero-pad to a fixed-width hex string so the splits sort lexicographically
+    byte[] b = String.format("%016x", key).getBytes();
+    splits[i] = b;
+  }
+  return splits;
+}</programlisting>
+  </para>
+  </section>
+    <section xml:id="def.log.flush">
+    <title>
+    Table Creation: Deferred Log Flush
+    </title>
+<para>
+The default behavior for Puts using the Write Ahead Log (WAL) is that <classname>HLog</classname> edits will be written immediately.  If deferred log flush is used, 
+WAL edits are kept in memory until the flush period.  The benefit is aggregated and asynchronous <classname>HLog</classname> writes, but the potential downside is that if
+ the RegionServer goes down the yet-to-be-flushed edits are lost.  This is safer, however, than not using WAL at all with Puts.
+</para>
+<para>
+Deferred log flush can be configured on tables via <link
+      xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HTableDescriptor.html">HTableDescriptor</link>.  The default value of <varname>hbase.regionserver.optionallogflushinterval</varname> is 1000ms.
+</para>
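+<para>
+A minimal sketch of enabling it at table creation (the table and family names are hypothetical,
+and <code>admin</code> is an <classname>HBaseAdmin</classname>; assumes the
+<methodname>setDeferredLogFlush</methodname> setter on <classname>HTableDescriptor</classname>):
+<programlisting>HTableDescriptor htd = new HTableDescriptor("myTable");  // hypothetical table
+htd.addFamily(new HColumnDescriptor("cf"));              // hypothetical family
+htd.setDeferredLogFlush(true);  // WAL edits flushed on an interval, not per edit
+admin.createTable(htd);</programlisting>
+</para>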
+    </section>  
+
+    <section xml:id="perf.hbase.client.autoflush">
+      <title>HBase Client:  AutoFlush</title>
+
+      <para>When performing a lot of Puts, make sure that setAutoFlush is set
+      to false on your <link
+      xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html">HTable</link>
+      instance. Otherwise, the Puts will be sent one at a time to the
+      RegionServer. Puts added via <code>htable.put(Put)</code> and <code>htable.put(List&lt;Put&gt;)</code>
+      wind up in the same write buffer. If <code>autoFlush = false</code>,
+      these messages are not sent until the write-buffer is filled. To
+      explicitly flush the messages, call <methodname>flushCommits</methodname>.
+      Calling <methodname>close</methodname> on the <classname>HTable</classname>
+      instance will invoke <methodname>flushCommits</methodname>.</para>
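+      <para>A minimal sketch (the table name is hypothetical, and <code>conf</code> and
+      <code>puts</code> are assumed to be built elsewhere):
+<programlisting>HTable htable = new HTable(conf, "myTable");  // hypothetical table
+htable.setAutoFlush(false);
+for (Put put : puts) {
+  htable.put(put);       // buffered client-side; not yet sent to the RegionServer
+}
+htable.flushCommits();   // explicitly flush the write buffer
+htable.close();          // also invokes flushCommits()</programlisting>
+      </para>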
+    </section>
+    <section xml:id="perf.hbase.client.putwal">
+      <title>HBase Client:  Turn off WAL on Puts</title>
+      <para>A frequently discussed option for increasing throughput on <classname>Put</classname>s is to call <code>writeToWAL(false)</code>.  Turning this off means
+          that the RegionServer will <emphasis>not</emphasis> write the <classname>Put</classname> to the Write Ahead Log,
+          only into the memstore.  However, the consequence is that if there
+          is a RegionServer failure <emphasis>there will be data loss</emphasis>.
+          If <code>writeToWAL(false)</code> is used, do so with extreme caution.  You may find in actuality that
+          it makes little difference if your load is well distributed across the cluster.
+      </para>
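+      <para>For reference, a sketch of what this looks like (row, family, and qualifier names
+      are hypothetical; assumes the <methodname>setWriteToWAL</methodname> setter on
+      <classname>Put</classname>):
+<programlisting>Put put = new Put(Bytes.toBytes("rowKey"));
+put.add(Bytes.toBytes("cf"), Bytes.toBytes("attr1"), Bytes.toBytes("value"));
+put.setWriteToWAL(false);  // skipped WAL: fast, but lost on RegionServer failure
+htable.put(put);</programlisting>
+      </para>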
+      <para>In general, it is best to use WAL for Puts, and where loading throughput
+          is a concern to use <link linkend="perf.batch.loading">bulk loading</link> techniques instead.  
+      </para>
+    </section>
+    <section xml:id="perf.hbase.client.regiongroup">
+      <title>HBase Client: Group Puts by RegionServer</title>
+      <para>In addition to using the writeBuffer, grouping <classname>Put</classname>s by RegionServer can reduce the number of client RPC calls per writeBuffer flush. 
+      There is a utility <classname>HTableUtil</classname> currently on TRUNK that does this, but you can either copy that or implement your own version for
+      those still on 0.90.x or earlier.
+      </para>
+    </section>    
+    <section xml:id="perf.hbase.write.mr.reducer">
+      <title>MapReduce:  Skip The Reducer</title>
+      <para>When writing a lot of data to an HBase table from a MR job (e.g., with <link
+      xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html">TableOutputFormat</link>), and specifically where Puts are being emitted
+      from the Mapper, skip the Reducer step.  When a Reducer step is used, all of the output (Puts) from the Mapper will get spooled to disk, then sorted/shuffled to other 
+      Reducers that will most likely be off-node.  It's far more efficient to just write directly to HBase.   
+      </para>
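+      <para>A sketch of a map-only job setup (the job and table names are hypothetical):
+<programlisting>Job job = new Job(conf, "write-to-hbase");  // hypothetical job
+TableMapReduceUtil.initTableReducerJob("targetTable", null, job);  // no Reducer class
+job.setNumReduceTasks(0);  // map-only: Puts go straight from the Mapper to HBase</programlisting>
+      </para>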
+      <para>For summary jobs where HBase is used as a source and a sink, writes will come from the Reducer step (e.g., summarize values, then write out the result). 
+      This is a different processing problem than the above case. 
+      </para>
+    </section>
+
+  <section xml:id="perf.one.region">
+    <title>Anti-Pattern:  One Hot Region</title>
+    <para>If all your data is being written to one region at a time, then re-read the
+    section on processing <link linkend="timeseries">timeseries</link> data.</para>
+    <para>Also, if you are pre-splitting regions and all your data is <emphasis>still</emphasis> winding up in a single region even though
+    your keys aren't monotonically increasing, confirm that your keyspace actually works with the split strategy.  There are a 
+    variety of reasons that regions may appear "well split" but won't work with your data.   Because
+    the HBase client communicates directly with the RegionServers, the region that a given row key maps to can be obtained via 
+    <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#getRegionLocation%28byte[]%29">HTable.getRegionLocation</link>.
+    </para>
+    <para>See <xref linkend="precreate.regions"/>, as well as <xref linkend="perf.configurations"/> </para>   
+  </section>
+
+  </section>  <!--  writing -->
+  
+  <section xml:id="perf.reading">
+    <title>Reading from HBase</title>
+
+    <section xml:id="perf.hbase.client.caching">
+      <title>Scan Caching</title>
+
+      <para>If HBase is used as an input source for a MapReduce job, for
+      example, make sure that the input <link
+      xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html">Scan</link>
+      instance to the MapReduce job has <methodname>setCaching</methodname> set to something greater
+      than the default (which is 1). Using the default value means that the
+      map-task will make a call back to the region-server for every record
+      processed. Setting this value to 500, for example, will transfer 500
+      rows at a time to the client to be processed. There is a cost/benefit to
+      have the cache value be large because it costs more in memory for both
+      client and RegionServer, so bigger isn't always better.</para>
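+      <para>A minimal sketch (<code>htable</code> is assumed to be opened elsewhere):
+<programlisting>Scan scan = new Scan();
+scan.setCaching(500);  // fetch 500 rows per RPC instead of the default 1
+ResultScanner rs = htable.getScanner(scan);</programlisting>
+      </para>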
+      <section xml:id="perf.hbase.client.caching.mr">
+        <title>Scan Caching in MapReduce Jobs</title>
+        <para>Scan settings in MapReduce jobs deserve special attention.  Timeouts (e.g., UnknownScannerException) can result
+        in Map tasks if it takes too long to process a batch of records before the client goes back to the RegionServer for the
+        next set of data.  This problem can occur because there is non-trivial processing occurring per row.  If you process
+        rows quickly, set caching higher.  If you process rows more slowly (e.g., lots of transformations per row, writes), 
+        then set caching lower.
+        </para>
+        <para>Timeouts can also happen in a non-MapReduce use case (i.e., single threaded HBase client doing a Scan), but the
+        processing that is often performed in MapReduce jobs tends to exacerbate this issue.
+        </para>
+      </section>
+    </section>
+    <section xml:id="perf.hbase.client.selection">
+      <title>Scan Attribute Selection</title>
+
+      <para>Whenever a Scan is used to process large numbers of rows (and especially when used
+      as a MapReduce source), be aware of which attributes are selected.   If <code>scan.addFamily</code> is called
+      then <emphasis>all</emphasis> of the attributes in the specified ColumnFamily will be returned to the client.
+      If only a small number of the available attributes are to be processed, then only those attributes should be specified
+      in the input scan because attribute over-selection is a non-trivial performance penalty over large datasets.
+      </para>
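+      <para>For example, a sketch selecting one attribute instead of the whole family
+      (the family and qualifier names are hypothetical):
+<programlisting>Scan scan = new Scan();
+// scan.addFamily(Bytes.toBytes("cf"));  // would return every attribute in 'cf'
+scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("attr1"));  // only what's needed</programlisting>
+      </para>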
+    </section>
+    <section xml:id="perf.hbase.mr.input">
+        <title>MapReduce - Input Splits</title>
+        <para>For MapReduce jobs that use HBase tables as a source, if there is a pattern where the "slow" map tasks seem to 
+        have the same Input Split (i.e., the RegionServer serving the data), see the 
+        Troubleshooting Case Study in <xref linkend="casestudies.slownode"/>.
+        </para>
+    </section>
+
+    <section xml:id="perf.hbase.client.scannerclose">
+      <title>Close ResultScanners</title>
+
+      <para>This isn't so much about improving performance but rather
+      <emphasis>avoiding</emphasis> performance problems. If you forget to
+      close <link
+      xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/ResultScanner.html">ResultScanners</link>
+      you can cause problems on the RegionServers. Always have ResultScanner
+      processing enclosed in try/finally blocks... <programlisting>
+Scan scan = new Scan();
+// set attrs...
+ResultScanner rs = htable.getScanner(scan);
+try {
+  for (Result r = rs.next(); r != null; r = rs.next()) {
+    // process result...
+  }
+} finally {
+  rs.close();  // always close the ResultScanner!
+}
+htable.close();</programlisting></para>
+    </section>
+
+    <section xml:id="perf.hbase.client.blockcache">
+      <title>Block Cache</title>
+
+      <para><link
+      xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html">Scan</link>
+      instances can be set to use the block cache in the RegionServer via the
+      <methodname>setCacheBlocks</methodname> method. For input Scans to MapReduce jobs, this should be
+      <varname>false</varname>. For frequently accessed rows, it is advisable to use the block
+      cache.</para>
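+      <para>A minimal sketch for a MapReduce input Scan:
+<programlisting>Scan scan = new Scan();
+scan.setCacheBlocks(false);  // don't churn the block cache with a full-table scan</programlisting>
+      </para>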
+    </section>
+    <section xml:id="perf.hbase.client.rowkeyonly">
+      <title>Optimal Loading of Row Keys</title>
+      <para>When performing a table <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html">scan</link>
+            where only the row keys are needed (no families, qualifiers, values or timestamps), add a FilterList with a
+            <varname>MUST_PASS_ALL</varname> operator to the scanner using <methodname>setFilter</methodname>. The filter list
+            should include both a <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FirstKeyOnlyFilter.html">FirstKeyOnlyFilter</link>
+            and a <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/KeyOnlyFilter.html">KeyOnlyFilter</link>.
+            Using this filter combination will result in a worst case scenario of a RegionServer reading a single value from disk
+            and minimal network traffic to the client for a single row.
+      </para>
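+      <para>A sketch of this filter combination:
+<programlisting>FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
+filters.addFilter(new FirstKeyOnlyFilter());  // only the first KeyValue per row
+filters.addFilter(new KeyOnlyFilter());       // return keys only, no values
+Scan scan = new Scan();
+scan.setFilter(filters);</programlisting>
+      </para>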
+    </section>
+   <section xml:id="perf.hbase.read.dist">
+      <title>Concurrency:  Monitor Data Spread</title>
+      <para>When performing a high number of concurrent reads, monitor the data spread of the target tables.  If the target table(s) have 
+      too few regions then the reads will likely be served from too few nodes.  </para>
+      <para>See <xref linkend="precreate.regions"/>, as well as <xref linkend="perf.configurations"/>.</para>   
+   </section>
+    
+  </section>  <!--  reading -->
+  
+  <section xml:id="perf.deleting">
+    <title>Deleting from HBase</title>
+     <section xml:id="perf.deleting.queue">
+       <title>Using HBase Tables as Queues</title>
+       <para>HBase tables are sometimes used as queues.  In this case, special care must be taken to regularly perform major compactions on tables used in
+       this manner.  As is documented in <xref linkend="datamodel" />, marking rows as deleted creates additional StoreFiles which then need to be processed
+       on reads.  Tombstones only get cleaned up with major compactions.
+       </para>
+       <para>See also <xref linkend="compaction" /> and <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html#majorCompact%28java.lang.String%29">HBaseAdmin.majorCompact</link>.
+       </para>
+     </section>
+     <section xml:id="perf.deleting.rpc">
+       <title>Delete RPC Behavior</title>
+       <para>Be aware that <code>htable.delete(Delete)</code> doesn't use the writeBuffer.  It will execute a RegionServer RPC with each invocation.
+       For a large number of deletes, consider <code>htable.delete(List)</code>.
+       </para>
+       <para>See <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#delete%28org.apache.hadoop.hbase.client.Delete%29">HTable.delete</link>.
+       </para>
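+       <para>A minimal sketch of batching deletes (<code>rowsToDelete</code> is assumed
+       to be built elsewhere):
+<programlisting>List&lt;Delete&gt; deletes = new ArrayList&lt;Delete&gt;();
+for (byte[] row : rowsToDelete) {
+  deletes.add(new Delete(row));
+}
+htable.delete(deletes);  // one batched call rather than one RPC per Delete</programlisting>
+       </para>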
+     </section>
+  </section>  <!--  deleting -->
+
+  <section xml:id="perf.hdfs"><title>HDFS</title>
+   <para>Because HBase runs on <xref linkend="arch.hdfs" /> it is important to understand how it works and how it affects
+   HBase.
+   </para>
+    <section xml:id="perf.hdfs.curr"><title>Current Issues With Low-Latency Reads</title>
+      <para>The original use-case for HDFS was batch processing.  As such, low-latency reads were historically not a priority.
+      With the increased adoption of HBase this is changing, and several improvements are already in development.
+      See the 
+      <link xlink:href="https://issues.apache.org/jira/browse/HDFS-1599">Umbrella Jira Ticket for HDFS Improvements for HBase</link>.
+      </para>
+    </section>
+    <section xml:id="perf.hdfs.comp"><title>Performance Comparisons of HBase vs. HDFS</title>
+     <para>A fairly common question on the dist-list is why HBase isn't as performant as HDFS files in a batch context (e.g., as 
+     a MapReduce source or sink).  The short answer is that HBase is doing a lot more than HDFS (e.g., reading the KeyValues, 
+     returning the most current row or specified timestamps, etc.), and as such HBase is 4-5 times slower than HDFS in this 
+     processing context.  Not that there isn't room for improvement (and this gap will, over time, be reduced), but HDFS
+      will always be faster in this use-case.
+     </para>
+    </section>
+  </section>
+  
+  <section xml:id="perf.ec2"><title>Amazon EC2</title>
+   <para>Performance questions are common on Amazon EC2 environments because it is a shared environment.  You will
+   not see the same throughput as a dedicated server.  In terms of running tests on EC2, run them several times for the same
+   reason (i.e., it's a shared environment and you don't know what else is happening on the server).
+   </para>
+   <para>If you are running on EC2 and post performance questions on the dist-list, please state this fact up-front,
+    because EC2 issues are practically a separate class of performance issues.
+   </para>
+  </section>
+  
+  <section xml:id="perf.casestudy"><title>Case Studies</title>
+      <para>For Performance and Troubleshooting Case Studies, see <xref linkend="casestudies"/>.
+      </para>
+  </section>
+</chapter>

Added: hbase/trunk/src/docbkx/preface.xml
URL: http://svn.apache.org/viewvc/hbase/trunk/src/docbkx/preface.xml?rev=1345788&view=auto
==============================================================================
--- hbase/trunk/src/docbkx/preface.xml (added)
+++ hbase/trunk/src/docbkx/preface.xml Sun Jun  3 21:59:50 2012
@@ -0,0 +1,65 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<preface version="5.0" xml:id="preface" xmlns="http://docbook.org/ns/docbook"
+         xmlns:xlink="http://www.w3.org/1999/xlink"
+         xmlns:xi="http://www.w3.org/2001/XInclude"
+         xmlns:svg="http://www.w3.org/2000/svg"
+         xmlns:m="http://www.w3.org/1998/Math/MathML"
+         xmlns:html="http://www.w3.org/1999/xhtml"
+         xmlns:db="http://docbook.org/ns/docbook">
+<!--
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+-->
+  <title>Preface</title>
+
+  <para>This is the official reference guide for the <link
+  xlink:href="http://hbase.apache.org/">HBase</link> version it ships with.
+  This document describes HBase version <emphasis><?eval ${project.version}?></emphasis>.
+  Herein you will find either the definitive documentation on an HBase topic
+  as of its standing when the referenced HBase version shipped, or it
+  will point to the location in <link
+  xlink:href="http://hbase.apache.org/apidocs/index.html">javadoc</link>,
+  <link xlink:href="https://issues.apache.org/jira/browse/HBASE">JIRA</link>
+  or <link xlink:href="http://wiki.apache.org/hadoop/Hbase">wiki</link> where
+  the pertinent information can be found.</para>
+
+  <para>This reference guide is a work in progress.  Feel free to add content by adding
+  a patch to an issue up in the HBase <link
+  xlink:href="https://issues.apache.org/jira/browse/HBASE">JIRA</link>.</para>
+
+  <note xml:id="headsup">
+      <title>Heads-up</title>
+      <para>
+          If this is your first foray into the wonderful world of
+          Distributed Computing, then you are in for
+          some interesting times.  First off, distributed systems are
+          hard; making a distributed system hum requires a disparate
+          skillset that spans systems (hardware and software) and
+          networking.  Your cluster's operation can hiccup because of any
+          of a myriad of reasons, from bugs in HBase itself through misconfigurations
+          -- misconfiguration of HBase but also operating system misconfigurations --
+          through to hardware problems whether it be a bug in your network card
+          drivers or an underprovisioned RAM bus (to mention two recent
+          examples of hardware issues that manifested as "HBase is slow").
+          You will also need to do a recalibration if up to this point your
+          computing has been bound to a single box.  Here is one good
+          starting point:
+          <link xlink:href="http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing">Fallacies of Distributed Computing</link>.
+      </para>
+  </note>
+</preface>