Posted to commits@hbase.apache.org by mi...@apache.org on 2014/12/22 06:45:39 UTC

[1/8] hbase git commit: HBASE-12738 Chunk Ref Guide into file-per-chapter

Repository: hbase
Updated Branches:
  refs/heads/master d9f25e30a -> a1fe1e096


http://git-wip-us.apache.org/repos/asf/hbase/blob/a1fe1e09/src/main/docbkx/hbase_history.xml
----------------------------------------------------------------------
diff --git a/src/main/docbkx/hbase_history.xml b/src/main/docbkx/hbase_history.xml
new file mode 100644
index 0000000..f7b9064
--- /dev/null
+++ b/src/main/docbkx/hbase_history.xml
@@ -0,0 +1,41 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<appendix
+    xml:id="hbase.history"
+    version="5.0"
+    xmlns="http://docbook.org/ns/docbook"
+    xmlns:xlink="http://www.w3.org/1999/xlink"
+    xmlns:xi="http://www.w3.org/2001/XInclude"
+    xmlns:svg="http://www.w3.org/2000/svg"
+    xmlns:m="http://www.w3.org/1998/Math/MathML"
+    xmlns:html="http://www.w3.org/1999/xhtml"
+    xmlns:db="http://docbook.org/ns/docbook">
+    <!--/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+-->
+    <title>HBase History</title>
+    <itemizedlist>
+        <listitem><para>2006:  <link xlink:href="http://research.google.com/archive/bigtable.html">BigTable</link> paper published by Google.
+        </para></listitem>
+        <listitem><para>2006 (end of year):  HBase development starts.
+        </para></listitem>
+        <listitem><para>2008:  HBase becomes a Hadoop sub-project.
+        </para></listitem>
+        <listitem><para>2010:  HBase becomes an Apache top-level project.
+        </para></listitem>
+    </itemizedlist>
+</appendix>

http://git-wip-us.apache.org/repos/asf/hbase/blob/a1fe1e09/src/main/docbkx/hbck_in_depth.xml
----------------------------------------------------------------------
diff --git a/src/main/docbkx/hbck_in_depth.xml b/src/main/docbkx/hbck_in_depth.xml
new file mode 100644
index 0000000..e2ee34f
--- /dev/null
+++ b/src/main/docbkx/hbck_in_depth.xml
@@ -0,0 +1,237 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<appendix
+    xml:id="hbck.in.depth"
+    version="5.0"
+    xmlns="http://docbook.org/ns/docbook"
+    xmlns:xlink="http://www.w3.org/1999/xlink"
+    xmlns:xi="http://www.w3.org/2001/XInclude"
+    xmlns:svg="http://www.w3.org/2000/svg"
+    xmlns:m="http://www.w3.org/1998/Math/MathML"
+    xmlns:html="http://www.w3.org/1999/xhtml"
+    xmlns:db="http://docbook.org/ns/docbook">
+    <!--/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+-->
+
+        <title>hbck In Depth</title>
+        <para>HBaseFsck (hbck) is a tool for checking for region consistency and table integrity problems
+            and repairing a corrupted HBase. It works in two basic modes -- a read-only inconsistency
+            identifying mode and a multi-phase read-write repair mode.
+        </para>
+        <section>
+            <title>Running hbck to identify inconsistencies</title>
+            <para>To check whether your HBase cluster has corruptions, run hbck against your HBase cluster:</para>
+            <programlisting language="bourne">
+$ ./bin/hbase hbck
+</programlisting>
+            <para>
+                At the end of the command's output it prints OK or tells you the number of INCONSISTENCIES
+                present. You may also want to run hbck a few times because some inconsistencies can be
+                transient (e.g. the cluster is starting up or a region is splitting). Operationally you may want to run
+                hbck regularly and set up an alert (e.g. via Nagios) if it repeatedly reports inconsistencies.
+                A run of hbck will report a list of inconsistencies along with a brief description of the regions and
+                tables affected. Using the <code>-details</code> option will report more details, including a representative
+                listing of all the splits present in all the tables.
+            </para>
+            <programlisting language="bourne">
+$ ./bin/hbase hbck -details
+</programlisting>
+            <para>If you just want to know if some tables are corrupted, you can limit hbck to identify inconsistencies
+                in only specific tables. For example, the following command would only attempt to check tables
+                TableFoo and TableBar. The benefit is that hbck will run in less time.</para>
+            <programlisting language="bourne">
+$ ./bin/hbase hbck TableFoo TableBar
+</programlisting>
+        </section>
+        <section><title>Inconsistencies</title>
+            <para>
+                If, after several runs, inconsistencies continue to be reported, you may have encountered a
+                corruption. Corruptions should be rare, but in the event they occur, newer versions of HBase include
+                the hbck tool with automatic repair options enabled.
+            </para>
+            <para>
+                There are two invariants that when violated create inconsistencies in HBase:
+            </para>
+            <itemizedlist>
+                <listitem><para>HBase’s region consistency invariant is satisfied if every region is assigned and
+                    deployed on exactly one region server, and all places where this state is kept are in
+                    accordance.</para>
+                </listitem>
+                <listitem><para>HBase’s table integrity invariant is satisfied if for each table, every possible row key
+                    resolves to exactly one region.</para>
+                </listitem>
+            </itemizedlist>
+            <para>
+                Repairs generally work in three phases -- a read-only information gathering phase that identifies
+                inconsistencies, a table integrity repair phase that restores the table integrity invariant, and then
+                finally a region consistency repair phase that restores the region consistency invariant.
+                Starting with version 0.90.0, hbck could detect region consistency problems and report on a subset
+                of possible table integrity problems. It also included the ability to automatically fix the most
+                common inconsistencies, region assignment and deployment consistency problems. This repair
+                could be done by using the <code>-fix</code> command line option. These repairs close regions if they are
+                open on the wrong server or on multiple region servers, and also assign regions to region
+                servers if they are not open.
+            </para>
+            <para>
+                Starting with HBase versions 0.90.7, 0.92.2 and 0.94.0, several new command line options were
+                introduced to aid in repairing a corrupted HBase. This hbck sometimes goes by the nickname
+                “uberhbck”. Each particular version of uberhbck is compatible with HBase releases of the same
+                major version (for example, the 0.90.7 uberhbck can repair a 0.90.4). However, versions &lt;=0.90.6 and versions
+                &lt;=0.92.1 may require restarting the master or failing over to a backup master.
+            </para>
+        </section>
+        <section><title>Localized repairs</title>
+            <para>
+                When repairing a corrupted HBase, it is best to repair the lowest risk inconsistencies first.
+                These are generally region consistency repairs -- localized single-region repairs that only modify
+                in-memory data, ephemeral ZooKeeper data, or patch holes in the META table.
+                Region consistency requires that the state of the region’s data in HDFS
+                (.regioninfo files), the region’s row in the hbase:meta table, and the region’s deployment/assignments on
+                region servers and the master are in accordance. Options for repairing region consistency include:
+                <itemizedlist>
+                    <listitem><para><code>-fixAssignments</code> (equivalent to the 0.90 <code>-fix</code> option) repairs unassigned, incorrectly
+                        assigned or multiply assigned regions.</para>
+                    </listitem>
+                    <listitem><para><code>-fixMeta</code> which removes meta rows when corresponding regions are not present in
+                        HDFS and adds new meta rows if the regions are present in HDFS but not in META.</para>
+                    </listitem>
+                </itemizedlist>
+                To fix deployment and assignment problems you can run this command:
+            </para>
+            <programlisting language="bourne">
+$ ./bin/hbase hbck -fixAssignments
+</programlisting>
+            <para>To fix deployment and assignment problems as well as repairing incorrect meta rows you can
+                run this command:</para>
+            <programlisting language="bourne">
+$ ./bin/hbase hbck -fixAssignments -fixMeta
+</programlisting>
+            <para>There are a few classes of table integrity problems that are low-risk repairs. The first two are
+                degenerate (startkey == endkey) regions and backwards regions (startkey > endkey). These are
+                automatically handled by sidelining the data to a temporary directory (/hbck/xxxx).
+                The third low-risk class is HDFS region holes. These can be repaired by using:</para>
+            <itemizedlist>
+                <listitem><para><code>-fixHdfsHoles</code>, an option for fabricating new empty regions on the file system.
+                    If holes are detected, you can use <code>-fixHdfsHoles</code> and should include <code>-fixMeta</code> and <code>-fixAssignments</code> to make the new region consistent, as in the following command:</para>
+                </listitem>
+            </itemizedlist>
+            <programlisting language="bourne">
+$ ./bin/hbase hbck -fixAssignments -fixMeta -fixHdfsHoles
+</programlisting>
+            <para>Since this is a common operation, we’ve added the <code>-repairHoles</code> flag that is equivalent to the
+                previous command:</para>
+            <programlisting language="bourne">
+$ ./bin/hbase hbck -repairHoles
+</programlisting>
+            <para>If inconsistencies still remain after these steps, you most likely have table integrity problems
+                related to orphaned or overlapping regions.</para>
+        </section>
+        <section><title>Region Overlap Repairs</title>
+            <para>Table integrity problems can require repairs that deal with overlaps. This is a riskier operation
+                because it requires modifications to the file system, requires some decision making, and may
+                require some manual steps. For these repairs it is best to analyze the output of a <code>hbck -details</code>
+                run so that you isolate repair attempts to only the problems the checks identify. Because this is
+                riskier, there are safeguards that should be used to limit the scope of the repairs.
+                WARNING: This feature is relatively new and has only been tested on online but idle HBase instances
+                (no reads/writes). Use at your own risk in an active production environment!
+                The options for repairing table integrity violations include:</para>
+            <itemizedlist>
+                <listitem><para><code>-fixHdfsOrphans</code> option for “adopting” a region directory that is missing a region
+                    metadata file (the .regioninfo file).</para>
+                </listitem>
+                <listitem><para><code>-fixHdfsOverlaps</code>, an option for fixing overlapping regions.</para>
+                </listitem>
+            </itemizedlist>
+            <para>When repairing overlapping regions, a region’s data can be modified on the file system in two
+                ways: 1) by merging regions into a larger region or 2) by sidelining regions by moving data to
+                a “sideline” directory, where the data can be restored later. Merging a large number of regions is
+                technically correct but could result in an extremely large region that requires a series of costly
+                compactions and splitting operations. In these cases, it is probably better to sideline the regions
+                that overlap with the most other regions (likely the largest ranges) so that merges can happen on
+                a more reasonable scale. Since these sidelined regions are already laid out in HBase’s native
+                directory and HFile format, they can be restored by using HBase’s bulk load mechanism.
+                The default safeguard thresholds are conservative. These options let you override the default
+                thresholds and enable the large region sidelining feature.</para>
+            <itemizedlist>
+                <listitem><para><code>-maxMerge &lt;n&gt;</code> maximum number of overlapping regions to merge</para>
+                </listitem>
+                <listitem><para><code>-sidelineBigOverlaps</code> if more than maxMerge regions are overlapping, attempt
+                    to sideline the regions overlapping with the most other regions.</para>
+                </listitem>
+                <listitem><para><code>-maxOverlapsToSideline &lt;n&gt;</code> if sidelining large overlapping regions, sideline at most n
+                    regions.</para>
+                </listitem>
+            </itemizedlist>
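+            <para>As an illustration only (the threshold values below are arbitrary examples, not
+                recommendations), these safeguards can be combined with the overlap-fixing option in a
+                single invocation:</para>
+            <screen language="bourne">
+$ ./bin/hbase hbck -fixHdfsOverlaps -maxMerge 5 -sidelineBigOverlaps -maxOverlapsToSideline 2
+</screen>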
+            
+            <para>Since you often just want to get the tables repaired, you can use this option to turn
+                on all repair options:</para>
+            <itemizedlist>
+                <listitem><para><code>-repair</code> includes all the region consistency options and only the hole repairing table
+                    integrity options.</para>
+                </listitem>
+            </itemizedlist>
+            <para>Finally, there are safeguards to limit repairs to only specific tables. For example, the following
+                command would only attempt to check and repair tables TableFoo and TableBar.</para>
+            <screen language="bourne">
+$ ./bin/hbase hbck -repair TableFoo TableBar
+</screen>
+            <section><title>Special cases: Meta is not properly assigned</title>
+                <para>There are a few special cases that hbck can handle as well.
+                    Sometimes the meta table’s only region is inconsistently assigned or deployed. In this case
+                    there is a special <code>-fixMetaOnly</code> option that can try to fix meta assignments.</para>
+                <screen language="bourne">
+$ ./bin/hbase hbck -fixMetaOnly -fixAssignments
+</screen>
+            </section>
+            <section><title>Special cases: HBase version file is missing</title>
+                <para>HBase’s data on the file system requires a version file in order to start. If this file is missing, you
+                    can use the <code>-fixVersionFile</code> option to fabricate a new HBase version file. This assumes that
+                    the version of hbck you are running is the appropriate version for the HBase cluster.</para>
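+                <para>Assuming the rest of the cluster layout is intact, a minimal invocation might look
+                    like the following:</para>
+                <screen language="bourne">
+$ ./bin/hbase hbck -fixVersionFile
+</screen>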
+            </section>
+            <section><title>Special case: Root and META are corrupt.</title>
+                <para>The most drastic corruption scenario is the case where the ROOT or META is corrupted and
+                    HBase will not start. In this case you can use the OfflineMetaRepair tool to create new ROOT
+                    and META regions and tables.
+                    This tool assumes that HBase is offline. It then marches through the existing HBase home
+                    directory and loads as much information from region metadata files (.regioninfo files) as possible
+                    from the file system. If the region metadata has proper table integrity, it sidelines the original root
+                    and meta table directories, and builds new ones with pointers to the region directories and their
+                    data.</para>
+                <screen language="bourne">
+$ ./bin/hbase org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair
+</screen>
+                <para>NOTE: This tool is not as clever as uberhbck but can be used to bootstrap repairs that uberhbck
+                    can complete.
+                    If the tool succeeds, you should be able to start HBase and run online repairs if necessary.</para>
+            </section>
+            <section><title>Special cases: Offline split parent</title>
+                <para>
+                    Once a region is split, the offline parent will be cleaned up automatically. Sometimes, daughter regions
+                    are split again before their parents are cleaned up. HBase can clean up parents in the right order. However,
+                    sometimes there can be lingering offline split parents: they are in META and in HDFS, but not deployed,
+                    and HBase cannot clean them up. In this case, you can use the <code>-fixSplitParents</code> option to reset
+                    them in META to be online and not split. hbck can then merge them with other regions if the
+                    option for fixing overlapping regions is used.
+                </para>
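+                <para>
+                    An illustrative invocation (combine with other fix options as appropriate for your cluster):
+                </para>
+                <screen language="bourne">
+$ ./bin/hbase hbck -fixSplitParents
+</screen>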
+                <para>
+                    This option should not normally be used, and it is not in <code>-fixAll</code>.
+                </para>
+            </section>
+        </section>
+    
+</appendix>

http://git-wip-us.apache.org/repos/asf/hbase/blob/a1fe1e09/src/main/docbkx/mapreduce.xml
----------------------------------------------------------------------
diff --git a/src/main/docbkx/mapreduce.xml b/src/main/docbkx/mapreduce.xml
new file mode 100644
index 0000000..9e9e474
--- /dev/null
+++ b/src/main/docbkx/mapreduce.xml
@@ -0,0 +1,630 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<chapter
+    xml:id="mapreduce"
+    version="5.0"
+    xmlns="http://docbook.org/ns/docbook"
+    xmlns:xlink="http://www.w3.org/1999/xlink"
+    xmlns:xi="http://www.w3.org/2001/XInclude"
+    xmlns:svg="http://www.w3.org/2000/svg"
+    xmlns:m="http://www.w3.org/1998/Math/MathML"
+    xmlns:html="http://www.w3.org/1999/xhtml"
+    xmlns:db="http://docbook.org/ns/docbook">
+    <!--/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+-->
+
+    <title>HBase and MapReduce</title>
+    <para>Apache MapReduce is a software framework used to analyze large amounts of data, and is
+      the framework used most often with <link
+        xlink:href="http://hadoop.apache.org/">Apache Hadoop</link>. MapReduce itself is out of the
+      scope of this document. A good place to get started with MapReduce is <link
+        xlink:href="http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html" />. MapReduce version
+      2 (MR2) is now part of <link
+        xlink:href="http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/">YARN</link>. </para>
+
+    <para> This chapter discusses specific configuration steps you need to take to use MapReduce on
+      data within HBase. In addition, it discusses other interactions and issues between HBase and
+      MapReduce jobs.
+      <note> 
+      <title>mapred and mapreduce</title>
+      <para>There are two mapreduce packages in HBase as in MapReduce itself: <filename>org.apache.hadoop.hbase.mapred</filename>
+      and <filename>org.apache.hadoop.hbase.mapreduce</filename>. The former uses the old-style API and the latter
+      the new style.  The latter has more facilities, though you can usually find an equivalent in the older
+      package.  Pick the package that goes with your MapReduce deployment.  When in doubt or starting over, pick
+      <filename>org.apache.hadoop.hbase.mapreduce</filename>.  In the notes below, we refer to
+      o.a.h.h.mapreduce, but replace it with o.a.h.h.mapred if that is what you are using.
+      </para>
+      </note> 
+    </para>
+
+    <section
+      xml:id="hbase.mapreduce.classpath">
+      <title>HBase, MapReduce, and the CLASSPATH</title>
+      <para>By default, MapReduce jobs deployed to a MapReduce cluster do not have access to either
+        the HBase configuration under <envar>$HBASE_CONF_DIR</envar> or the HBase classes.</para>
+      <para>To give the MapReduce jobs the access they need, you could add
+          <filename>hbase-site.xml</filename> to the
+            <filename><replaceable>$HADOOP_HOME</replaceable>/conf/</filename> directory and add the
+        HBase JARs to the <filename><replaceable>$HADOOP_HOME</replaceable>/lib/</filename>
+        directory, then copy these changes across your cluster. Alternatively, you could edit
+          <filename><replaceable>$HADOOP_HOME</replaceable>/conf/hadoop-env.sh</filename> and add
+        the HBase dependencies to the <envar>HADOOP_CLASSPATH</envar> variable. Neither of these approaches is
+        recommended because it will pollute your Hadoop install with HBase references. It also
+        requires you to restart the Hadoop cluster before Hadoop can use the HBase data.</para>
+      <para> Since HBase 0.90.x, HBase adds its dependency JARs to the job configuration itself. The
+        dependencies only need to be available on the local CLASSPATH. The following example runs
+        the bundled HBase <link
+          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html">RowCounter</link>
+        MapReduce job against a table named <systemitem>usertable</systemitem>. If you have not set
+        the environment variables expected in the command (the parts prefixed by a
+          <literal>$</literal> sign and curly braces), you can use the actual system paths instead.
+        Be sure to use the correct version of the HBase JAR for your system. The backticks
+          (<literal>`</literal> symbols) cause the shell to execute the sub-commands, setting the
+        CLASSPATH as part of the command. This example assumes you use a BASH-compatible shell. </para>
+      <screen language="bourne">$ <userinput>HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar rowcounter usertable</userinput></screen>
+      <para>When the command runs, internally, the HBase JAR finds the dependencies it needs, such as
+        ZooKeeper and Guava, on the passed <envar>HADOOP_CLASSPATH</envar>
+        and adds the JARs to the MapReduce job configuration. See the source at
+        TableMapReduceUtil#addDependencyJars(org.apache.hadoop.mapreduce.Job) for how this is done. </para>
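+      <para>If you write your own job driver, the same mechanism can be invoked explicitly. The
+        following is a minimal sketch; it assumes a <code>Job</code> instance named
+        <code>job</code> that has already been configured for an HBase table.</para>
+      <programlisting language="java">
+// Sketch: ship the HBase dependency JARs (ZooKeeper, Guava, etc.) with the job
+// instead of relying on a system-wide change to the Hadoop lib/ directory.
+// Assumes "job" is an org.apache.hadoop.mapreduce.Job set up for an HBase table.
+TableMapReduceUtil.addDependencyJars(job);
+</programlisting>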
+      <note>
+        <para> The example may not work if you are running HBase from its build directory rather
+          than an installed location. You may see an error like the following:</para>
+        <screen>java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper</screen>
+        <para>If this occurs, try modifying the command as follows, so that it uses the HBase JARs
+          from the <filename>target/</filename> directory within the build environment.</para>
+        <screen language="bourne">$ <userinput>HADOOP_CLASSPATH=${HBASE_HOME}/hbase-server/target/hbase-server-VERSION-SNAPSHOT.jar:`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server/target/hbase-server-VERSION-SNAPSHOT.jar rowcounter usertable</userinput></screen>
+      </note>
+      <caution>
+        <title>Notice to Mapreduce users of HBase 0.96.1 and above</title>
+        <para>Some mapreduce jobs that use HBase fail to launch. The symptom is an exception similar
+          to the following:</para>
+        <screen>
+Exception in thread "main" java.lang.IllegalAccessError: class
+    com.google.protobuf.ZeroCopyLiteralByteString cannot access its superclass
+    com.google.protobuf.LiteralByteString
+    at java.lang.ClassLoader.defineClass1(Native Method)
+    at java.lang.ClassLoader.defineClass(ClassLoader.java:792)
+    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
+    at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
+    at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
+    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
+    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
+    at java.security.AccessController.doPrivileged(Native Method)
+    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
+    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
+    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
+    at
+    org.apache.hadoop.hbase.protobuf.ProtobufUtil.toScan(ProtobufUtil.java:818)
+    at
+    org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.convertScanToString(TableMapReduceUtil.java:433)
+    at
+    org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:186)
+    at
+    org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:147)
+    at
+    org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:270)
+    at
+    org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:100)
+...
+</screen>
+        <para>This is caused by an optimization introduced in <link
+            xlink:href="https://issues.apache.org/jira/browse/HBASE-9867">HBASE-9867</link> that
+          inadvertently introduced a classloader dependency. </para>
+        <para>This affects both jobs using the <code>-libjars</code> option and "fat jar" jobs, those
+          which package their runtime dependencies in a nested <code>lib</code> folder.</para>
+        <para>In order to satisfy the new classloader requirements, hbase-protocol.jar must be
+          included in Hadoop's classpath. See <xref
+            linkend="hbase.mapreduce.classpath" /> for current recommendations for resolving
+          classpath errors. The following is included for historical purposes.</para>
+        <para>This can be resolved system-wide by including a reference to the hbase-protocol.jar in
+          hadoop's lib directory, via a symlink or by copying the jar into the new location.</para>
+        <para>This can also be achieved on a per-job launch basis by including it in the
+            <code>HADOOP_CLASSPATH</code> environment variable at job submission time. When
+          launching jobs that package their dependencies, all three of the following job launching
+          commands satisfy this requirement:</para>
+        <screen language="bourne">
+$ <userinput>HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass</userinput>
+$ <userinput>HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass</userinput>
+$ <userinput>HADOOP_CLASSPATH=$(hbase classpath) hadoop jar MyJob.jar MyJobMainClass</userinput>
+        </screen>
+        <para>For jars that do not package their dependencies, the following command structure is
+          necessary:</para>
+        <screen language="bourne">
+$ <userinput>HADOOP_CLASSPATH=$(hbase mapredcp):/etc/hbase/conf hadoop jar MyApp.jar MyJobMainClass -libjars $(hbase mapredcp | tr ':' ',')</userinput> ...
+        </screen>
+        <para>See also <link
+            xlink:href="https://issues.apache.org/jira/browse/HBASE-10304">HBASE-10304</link> for
+          further discussion of this issue.</para>
+      </caution>
+    </section>
+
+    <section>
+      <title>MapReduce Scan Caching</title>
+      <para>TableMapReduceUtil now restores the option to set scanner caching (the number of rows
+        which are cached before returning the result to the client) on the Scan object that is
+        passed in. This functionality was lost due to a bug in HBase 0.95 (<link
+          xlink:href="https://issues.apache.org/jira/browse/HBASE-11558">HBASE-11558</link>), which
+        is fixed for HBase 0.98.5 and 0.96.3. The priority order for choosing the scanner caching is
+        as follows:</para>
+      <orderedlist>
+        <listitem>
+          <para>Caching settings which are set on the scan object.</para>
+        </listitem>
+        <listitem>
+          <para>Caching settings which are specified via the configuration option
+              <option>hbase.client.scanner.caching</option>, which can either be set manually in
+              <filename>hbase-site.xml</filename> or via the helper method
+              <code>TableMapReduceUtil.setScannerCaching()</code>.</para>
+        </listitem>
+        <listitem>
+          <para>The default value <code>HConstants.DEFAULT_HBASE_CLIENT_SCANNER_CACHING</code>, which is set to
+            <literal>100</literal>.</para>
+        </listitem>
+      </orderedlist>
+      <para>Optimizing the caching settings is a balance between the time the client waits for a
+        result and the number of sets of results the client needs to receive. If the caching setting
+        is too large, the client could end up waiting for a long time or the request could even time
+        out. If the setting is too small, the scan needs to return results in several pieces.
+        If you think of the scan as a shovel, a bigger cache setting is analogous to a bigger
+        shovel, and a smaller cache setting is equivalent to more shoveling in order to fill the
+        bucket.</para>
+      <para>The list of priorities mentioned above allows you to set a reasonable default, and
+        override it for specific operations.</para>
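+      <para>The following is a minimal sketch of both mechanisms; it assumes a <code>Job</code>
+        instance named <code>job</code> as in the examples later in this chapter.</para>
+      <programlisting language="java">
+// Highest priority: set caching directly on the Scan passed to the job.
+Scan scan = new Scan();
+scan.setCaching(500);
+
+// Lower priority: set a job-wide default via the helper method, which writes
+// hbase.client.scanner.caching into the job configuration.
+TableMapReduceUtil.setScannerCaching(job, 100);
+</programlisting>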
+      <para>See the API documentation for <link
+          xlink:href="https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html"
+          >Scan</link> for more details.</para>
+    </section>
+
+    <section>
+      <title>Bundled HBase MapReduce Jobs</title>
+      <para>The HBase JAR also serves as a Driver for some bundled mapreduce jobs. To learn about
+        the bundled MapReduce jobs, run the following command.</para>
+
+      <screen language="bourne">$ <userinput>${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar</userinput>
+<computeroutput>An example program must be given as the first argument.
+Valid program names are:
+  copytable: Export a table from local cluster to peer cluster
+  completebulkload: Complete a bulk data load.
+  export: Write table data to HDFS.
+  import: Import data written by Export.
+  importtsv: Import data in TSV format.
+  rowcounter: Count rows in HBase table</computeroutput>
+    </screen>
+      <para>Each of the valid program names are bundled MapReduce jobs. To run one of the jobs,
+        model your command after the following example.</para>
+      <screen language="bourne">$ <userinput>${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar rowcounter myTable</userinput></screen>
+    </section>
+
+    <section>
+      <title>HBase as a MapReduce Job Data Source and Data Sink</title>
+      <para>HBase can be used as a data source, <link
+          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html">TableInputFormat</link>,
+        and data sink, <link
+          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html">TableOutputFormat</link>
+        or <link
+          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/MultiTableOutputFormat.html">MultiTableOutputFormat</link>,
+        for MapReduce jobs. When writing MapReduce jobs that read or write HBase, it is advisable to
+        subclass <link
+          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapper.html">TableMapper</link>
+        and/or <link
+          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableReducer.html">TableReducer</link>.
+        See the do-nothing pass-through classes <link
+          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/IdentityTableMapper.html">IdentityTableMapper</link>
+        and <link
+          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/IdentityTableReducer.html">IdentityTableReducer</link>
+        for basic usage. For a more involved example, see <link
+          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html">RowCounter</link>
+        or review the <code>org.apache.hadoop.hbase.mapreduce.TestTableMapReduce</code> unit test. </para>
+      <para>If you run MapReduce jobs that use HBase as a source or sink, you need to specify the source and
+        sink table and column names in your configuration.</para>
+
+      <para>When you read from HBase, the <code>TableInputFormat</code> requests the list of regions
+        from HBase and makes a map, which is either a <code>map-per-region</code> or
+          <code>mapreduce.job.maps</code> map, whichever is smaller. If your job only has two maps,
+        raise <code>mapreduce.job.maps</code> to a number greater than the number of regions. Maps
+        will run on the adjacent TaskTracker if you are running a TaskTracker and RegionServer per
+        node. When writing to HBase, it may make sense to avoid the Reduce step and write back into
+        HBase from within your map. This approach works when your job does not need the sort and
+        collation that MapReduce does on the map-emitted data. On insert, HBase 'sorts' so there is
+        no point double-sorting (and shuffling data around your MapReduce cluster) unless you need
+        to. If you do not need the Reduce, your map might emit counts of records processed for
+        reporting at the end of the job, or set the number of Reduces to zero and use
+        TableOutputFormat. If running the Reduce step makes sense in your case, you should typically
+        use multiple reducers so that load is spread across the HBase cluster.</para>
+
+      <para>A new HBase partitioner, the <link
+          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/HRegionPartitioner.html">HRegionPartitioner</link>,
+        can run as many reducers as there are existing regions. The HRegionPartitioner is suitable
+        when your table is large and your upload will not greatly alter the number of existing
+        regions upon completion. Otherwise use the default partitioner. </para>
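+      <para>As a sketch (assuming the <code>targetTable</code>, <code>MyTableReducer</code>, and
+        <code>job</code> names used in the examples below), the partitioner is supplied when
+        initializing the reducer job:</para>
+      <programlisting language="java">
+// Sketch: partition reducer output by region so each reducer writes to one region.
+TableMapReduceUtil.initTableReducerJob(
+  targetTable,               // output table
+  MyTableReducer.class,      // reducer class
+  job,
+  HRegionPartitioner.class); // partitioner
+</programlisting>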
+    </section>
+
+    <section>
+      <title>Writing HFiles Directly During Bulk Import</title>
+      <para>If you are importing into a new table, you can bypass the HBase API and write your
+        content directly to the filesystem, formatted into HBase data files (HFiles). Your import
+        will run faster, perhaps an order of magnitude faster. For more on how this mechanism works,
+        see <xref
+          linkend="arch.bulk.load" />.</para>
+    </section>
+
+    <section>
+      <title>RowCounter Example</title>
+      <para>The included <link
+        xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html">RowCounter</link>
+        MapReduce job uses <code>TableInputFormat</code> and does a count of all rows in the specified
+        table. To run it, use the following command: </para>
+      <screen language="bourne">$ <userinput>./bin/hadoop jar hbase-X.X.X.jar</userinput></screen> 
+      <para>This will
+        invoke the HBase MapReduce Driver class. Select <literal>rowcounter</literal> from the choice of jobs
+        offered. This will print rowcounter usage advice to standard output. Specify the tablename,
+        column to count, and output
+        directory. If you have classpath errors, see <xref linkend="hbase.mapreduce.classpath" />.</para>
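+      <para>For example, a run against a hypothetical table named
+        <systemitem>usertable</systemitem>, counting only the <literal>cf:a</literal> column, might
+        look like the following (the table and column names are placeholders):</para>
+      <screen language="bourne">$ <userinput>./bin/hadoop jar hbase-X.X.X.jar rowcounter usertable cf:a</userinput></screen>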
+    </section>
+
+    <section
+      xml:id="splitter">
+      <title>Map-Task Splitting</title>
+      <section
+        xml:id="splitter.default">
+        <title>The Default HBase MapReduce Splitter</title>
+        <para>When <link
+            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html">TableInputFormat</link>
+          is used to source an HBase table in a MapReduce job, its splitter will make a map task for
+          each region of the table. Thus, if there are 100 regions in the table, there will be 100
+          map-tasks for the job - regardless of how many column families are selected in the
+          Scan.</para>
+      </section>
+      <section
+        xml:id="splitter.custom">
+        <title>Custom Splitters</title>
+        <para>For those interested in implementing custom splitters, see the method
+            <code>getSplits</code> in <link
+            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.html">TableInputFormatBase</link>.
+          That is where the logic for map-task assignment resides. </para>
+      </section>
+    </section>
+    <section
+      xml:id="mapreduce.example">
+      <title>HBase MapReduce Examples</title>
+      <section
+        xml:id="mapreduce.example.read">
+        <title>HBase MapReduce Read Example</title>
+        <para>The following is an example of using HBase as a MapReduce source in a read-only manner.
+          Specifically, there is a Mapper instance but no Reducer, and nothing is being emitted from
+          the Mapper. The job would be defined as follows...</para>
+        <programlisting language="java">
+Configuration config = HBaseConfiguration.create();
+Job job = new Job(config, "ExampleRead");
+job.setJarByClass(MyReadJob.class);     // class that contains mapper
+
+Scan scan = new Scan();
+scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
+scan.setCacheBlocks(false);  // don't set to true for MR jobs
+// set other scan attrs
+...
+
+TableMapReduceUtil.initTableMapperJob(
+  tableName,        // input HBase table name
+  scan,             // Scan instance to control CF and attribute selection
+  MyMapper.class,   // mapper
+  null,             // mapper output key
+  null,             // mapper output value
+  job);
+job.setOutputFormatClass(NullOutputFormat.class);   // because we aren't emitting anything from mapper
+
+boolean b = job.waitForCompletion(true);
+if (!b) {
+  throw new IOException("error with job!");
+}
+  </programlisting>
+        <para>...and the mapper instance would extend <link
+            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapper.html">TableMapper</link>...</para>
+        <programlisting language="java">
+public static class MyMapper extends TableMapper&lt;Text, Text&gt; {
+
+  public void map(ImmutableBytesWritable row, Result value, Context context) throws InterruptedException, IOException {
+    // process data for the row from the Result instance.
+   }
+}
+    </programlisting>
+      </section>
+      <section
+        xml:id="mapreduce.example.readwrite">
+        <title>HBase MapReduce Read/Write Example</title>
+        <para>The following is an example of using HBase both as a source and as a sink with
+          MapReduce. This example will simply copy data from one table to another.</para>
+        <programlisting language="java">
+Configuration config = HBaseConfiguration.create();
+Job job = new Job(config,"ExampleReadWrite");
+job.setJarByClass(MyReadWriteJob.class);    // class that contains mapper
+
+Scan scan = new Scan();
+scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
+scan.setCacheBlocks(false);  // don't set to true for MR jobs
+// set other scan attrs
+
+TableMapReduceUtil.initTableMapperJob(
+	sourceTable,      // input table
+	scan,	          // Scan instance to control CF and attribute selection
+	MyMapper.class,   // mapper class
+	null,	          // mapper output key
+	null,	          // mapper output value
+	job);
+TableMapReduceUtil.initTableReducerJob(
+	targetTable,      // output table
+	null,             // reducer class
+	job);
+job.setNumReduceTasks(0);
+
+boolean b = job.waitForCompletion(true);
+if (!b) {
+    throw new IOException("error with job!");
+}
+    </programlisting>
+        <para>An explanation is required of what <classname>TableMapReduceUtil</classname> is doing,
+          especially with the reducer. <link
+            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html">TableOutputFormat</link>
+          is being used as the outputFormat class, and several parameters are being set on the
+          config (e.g., TableOutputFormat.OUTPUT_TABLE), as well as setting the reducer output key
+          to <classname>ImmutableBytesWritable</classname> and reducer value to
+            <classname>Writable</classname>. These could be set by the programmer on the job and
+          conf, but <classname>TableMapReduceUtil</classname> tries to make things easier.</para>
+        <para>The following is the example mapper, which will create a <classname>Put</classname>
+          matching the input <classname>Result</classname> and emit it. Note: this is what the
+          CopyTable utility does. </para>
+        <programlisting language="java">
+public static class MyMapper extends TableMapper&lt;ImmutableBytesWritable, Put&gt;  {
+
+	public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
+		// this example is just copying the data from the source table...
+   		context.write(row, resultToPut(row,value));
+   	}
+
+  	private static Put resultToPut(ImmutableBytesWritable key, Result result) throws IOException {
+  		Put put = new Put(key.get());
+ 		for (KeyValue kv : result.raw()) {
+			put.add(kv);
+		}
+		return put;
+   	}
+}
+    </programlisting>
+        <para>There isn't actually a reducer step, so <classname>TableOutputFormat</classname> takes
+          care of sending the <classname>Put</classname> to the target table. </para>
+        <para>This is just an example, developers could choose not to use
+            <classname>TableOutputFormat</classname> and connect to the target table themselves.
+        </para>
+      </section>
+      <section
+        xml:id="mapreduce.example.readwrite.multi">
+        <title>HBase MapReduce Read/Write Example With Multi-Table Output</title>
+        <para>TODO: example for <classname>MultiTableOutputFormat</classname>. </para>
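+        <para>Until that example is written, the following is a minimal sketch of the write pattern
+          (the table and column names are placeholders): the job sets
+          <code>job.setOutputFormatClass(MultiTableOutputFormat.class)</code>, and the reducer (or
+          mapper) emits the destination table name as the key and a <classname>Put</classname> or
+          <classname>Delete</classname> as the value.</para>
+        <programlisting language="java">
+public static class MyMultiTableReducer extends Reducer&lt;Text, IntWritable, ImmutableBytesWritable, Mutation&gt; {
+  // Placeholder destination table; MultiTableOutputFormat routes each mutation
+  // to the table named by the key passed to context.write().
+  private static final ImmutableBytesWritable SUMMARY_TABLE =
+      new ImmutableBytesWritable(Bytes.toBytes("summaryTable"));
+
+  public void reduce(Text key, Iterable&lt;IntWritable&gt; values, Context context) throws IOException, InterruptedException {
+    int i = 0;
+    for (IntWritable val : values) {
+      i += val.get();
+    }
+    Put put = new Put(Bytes.toBytes(key.toString()));
+    put.add(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(i));
+    context.write(SUMMARY_TABLE, put);   // key selects the destination table
+  }
+}
+</programlisting>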
+      </section>
+      <section
+        xml:id="mapreduce.example.summary">
+        <title>HBase MapReduce Summary to HBase Example</title>
+        <para>The following example uses HBase as a MapReduce source and sink with a summarization
+          step. This example will count the number of distinct instances of a value in a table and
+          write those summarized counts in another table.
+          <programlisting language="java">
+Configuration config = HBaseConfiguration.create();
+Job job = new Job(config,"ExampleSummary");
+job.setJarByClass(MySummaryJob.class);     // class that contains mapper and reducer
+
+Scan scan = new Scan();
+scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
+scan.setCacheBlocks(false);  // don't set to true for MR jobs
+// set other scan attrs
+
+TableMapReduceUtil.initTableMapperJob(
+	sourceTable,        // input table
+	scan,               // Scan instance to control CF and attribute selection
+	MyMapper.class,     // mapper class
+	Text.class,         // mapper output key
+	IntWritable.class,  // mapper output value
+	job);
+TableMapReduceUtil.initTableReducerJob(
+	targetTable,        // output table
+	MyTableReducer.class,    // reducer class
+	job);
+job.setNumReduceTasks(1);   // at least one, adjust as required
+
+boolean b = job.waitForCompletion(true);
+if (!b) {
+	throw new IOException("error with job!");
+}
+    </programlisting>
+          In this example mapper a column with a String-value is chosen as the value to summarize
+          upon. This value is used as the key to emit from the mapper, and an
+            <classname>IntWritable</classname> represents an instance counter.
+          <programlisting language="java">
+public static class MyMapper extends TableMapper&lt;Text, IntWritable&gt;  {
+	public static final byte[] CF = "cf".getBytes();
+	public static final byte[] ATTR1 = "attr1".getBytes();
+
+	private final IntWritable ONE = new IntWritable(1);
+   	private Text text = new Text();
+
+   	public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
+        	String val = new String(value.getValue(CF, ATTR1));
+          	text.set(val);     // we can only emit Writables...
+
+        	context.write(text, ONE);
+   	}
+}
+    </programlisting>
+          In the reducer, the "ones" are counted (just like any other MR example that does this),
+          and then emits a <classname>Put</classname>.
+          <programlisting language="java">
+public static class MyTableReducer extends TableReducer&lt;Text, IntWritable, ImmutableBytesWritable&gt;  {
+	public static final byte[] CF = "cf".getBytes();
+	public static final byte[] COUNT = "count".getBytes();
+
+ 	public void reduce(Text key, Iterable&lt;IntWritable&gt; values, Context context) throws IOException, InterruptedException {
+    		int i = 0;
+    		for (IntWritable val : values) {
+    			i += val.get();
+    		}
+    		Put put = new Put(Bytes.toBytes(key.toString()));
+    		put.add(CF, COUNT, Bytes.toBytes(i));
+
+    		context.write(null, put);
+   	}
+}
+    </programlisting>
+        </para>
+      </section>
+      <section
+        xml:id="mapreduce.example.summary.file">
+        <title>HBase MapReduce Summary to File Example</title>
+        <para>This is very similar to the summary example above, with the exception that this example uses
+          HBase as a MapReduce source but HDFS as the sink. The differences are in the job setup and
+          in the reducer. The mapper remains the same. </para>
+        <programlisting language="java">
+Configuration config = HBaseConfiguration.create();
+Job job = new Job(config,"ExampleSummaryToFile");
+job.setJarByClass(MySummaryFileJob.class);     // class that contains mapper and reducer
+
+Scan scan = new Scan();
+scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
+scan.setCacheBlocks(false);  // don't set to true for MR jobs
+// set other scan attrs
+
+TableMapReduceUtil.initTableMapperJob(
+	sourceTable,        // input table
+	scan,               // Scan instance to control CF and attribute selection
+	MyMapper.class,     // mapper class
+	Text.class,         // mapper output key
+	IntWritable.class,  // mapper output value
+	job);
+job.setReducerClass(MyReducer.class);    // reducer class
+job.setNumReduceTasks(1);    // at least one, adjust as required
+FileOutputFormat.setOutputPath(job, new Path("/tmp/mr/mySummaryFile"));  // adjust directories as required
+
+boolean b = job.waitForCompletion(true);
+if (!b) {
+	throw new IOException("error with job!");
+}
+    </programlisting>
+        <para>As stated above, the previous Mapper can run unchanged with this example. As for the
+          Reducer, it is a "generic" Reducer instead of extending TableMapper and emitting
+          Puts.</para>
+        <programlisting language="java">
+ public static class MyReducer extends Reducer&lt;Text, IntWritable, Text, IntWritable&gt;  {
+
+	public void reduce(Text key, Iterable&lt;IntWritable&gt; values, Context context) throws IOException, InterruptedException {
+		int i = 0;
+		for (IntWritable val : values) {
+			i += val.get();
+		}
+		context.write(key, new IntWritable(i));
+	}
+}
+    </programlisting>
+      </section>
+      <section
+        xml:id="mapreduce.example.summary.noreducer">
+        <title>HBase MapReduce Summary to HBase Without Reducer</title>
+        <para>It is also possible to perform summaries without a reducer - if you use HBase as the
+          reducer. </para>
+        <para>An HBase target table would need to exist for the job summary. The Table method
+            <code>incrementColumnValue</code> would be used to atomically increment values. From a
+          performance perspective, it might make sense to keep a Map of values with their counts to
+          be incremented for each map-task, and make one update per key during the <code>
+            cleanup</code> method of the mapper. However, your mileage may vary depending on the
+          number of rows to be processed and unique keys. </para>
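+        <para>The following is a minimal sketch of this approach. The summary table, column family,
+          and qualifier names are placeholders, and the setup/cleanup code that opens and closes the
+          summary table (and the batching described above) is omitted for brevity.</para>
+        <programlisting language="java">
+public static class MyIncrementingMapper extends TableMapper&lt;ImmutableBytesWritable, Put&gt; {
+  public static final byte[] CF = "cf".getBytes();
+  public static final byte[] ATTR1 = "attr1".getBytes();
+  public static final byte[] COUNT = "count".getBytes();
+
+  private Table summaryTable;   // opened from a Connection in setup(), closed in cleanup() (not shown)
+
+  public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
+    String val = new String(value.getValue(CF, ATTR1));
+    // Atomically add one to the counter cell for this value in the summary table.
+    summaryTable.incrementColumnValue(Bytes.toBytes(val), CF, COUNT, 1);
+  }
+}
+</programlisting>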
+        <para>In the end, the summary results are in HBase. </para>
+      </section>
+      <section
+        xml:id="mapreduce.example.summary.rdbms">
+        <title>HBase MapReduce Summary to RDBMS</title>
+        <para>Sometimes it is more appropriate to generate summaries to an RDBMS. For these cases,
+          it is possible to generate summaries directly to an RDBMS via a custom reducer. The
+            <code>setup</code> method can connect to an RDBMS (the connection information can be
+          passed via custom parameters in the context) and the cleanup method can close the
+          connection. </para>
+        <para>It is critical to understand that the number of reducers for the job affects the
+          summarization implementation, and you'll have to design this into your reducer.
+          Specifically, whether it is designed to run as a singleton (one reducer) or multiple
+          reducers. Neither is right or wrong, it depends on your use-case. Recognize that the more
+          reducers that are assigned to the job, the more simultaneous connections to the RDBMS will
+          be created - this will scale, but only to a point. </para>
+        <programlisting language="java">
+ public static class MyRdbmsReducer extends Reducer&lt;Text, IntWritable, Text, IntWritable&gt;  {
+
+	private Connection c = null;
+
+	public void setup(Context context) {
+  		// create DB connection...
+  	}
+
+	public void reduce(Text key, Iterable&lt;IntWritable&gt; values, Context context) throws IOException, InterruptedException {
+		// do summarization
+		// in this example the keys are Text, but this is just an example
+	}
+
+	public void cleanup(Context context) {
+  		// close db connection
+  	}
+
+}
+    </programlisting>
+        <para>In the end, the summary results are written to your RDBMS table/s. </para>
+      </section>
+
+    </section>
+    <!--  mr examples -->
+    <section
+      xml:id="mapreduce.htable.access">
+      <title>Accessing Other HBase Tables in a MapReduce Job</title>
+      <para>Although the framework currently allows one HBase table as input to a MapReduce job,
+        other HBase tables can be accessed as lookup tables, etc., in a MapReduce job by creating
+        a Table instance in the setup method of the Mapper.
+        <programlisting language="java">public class MyMapper extends TableMapper&lt;Text, LongWritable&gt; {
+  private Table myOtherTable;
+
+  public void setup(Context context) {
+    // In here create a Connection to the cluster and save it or use the Connection
+    // from the existing table
+    myOtherTable = connection.getTable(TableName.valueOf("myOtherTable"));
+  }
+
+  public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
+	// process Result...
+	// use 'myOtherTable' for lookups
+  }
+}
+  </programlisting>
+      </para>
+    </section>
+    <section
+      xml:id="mapreduce.specex">
+      <title>Speculative Execution</title>
+      <para>It is generally advisable to turn off speculative execution for MapReduce jobs that use
+        HBase as a source. This can either be done on a per-Job basis through properties, or on the
+        entire cluster. Especially for longer running jobs, speculative execution will create
+        duplicate map-tasks which will double-write your data to HBase; this is probably not what
+        you want. </para>
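+      <para>The following is a minimal per-job sketch, using the MR2 property names; it assumes a
+        <code>Job</code> instance named <code>job</code>.</para>
+      <programlisting language="java">
+// Sketch: disable speculative execution for the map and reduce tasks of this job only.
+Configuration conf = job.getConfiguration();
+conf.setBoolean("mapreduce.map.speculative", false);
+conf.setBoolean("mapreduce.reduce.speculative", false);
+</programlisting>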
+      <para>See <xref
+          linkend="spec.ex" /> for more information. </para>
+    </section>
+  
+</chapter>

http://git-wip-us.apache.org/repos/asf/hbase/blob/a1fe1e09/src/main/docbkx/orca.xml
----------------------------------------------------------------------
diff --git a/src/main/docbkx/orca.xml b/src/main/docbkx/orca.xml
new file mode 100644
index 0000000..29d8727
--- /dev/null
+++ b/src/main/docbkx/orca.xml
@@ -0,0 +1,47 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<appendix
+    xml:id="orca"
+    version="5.0"
+    xmlns="http://docbook.org/ns/docbook"
+    xmlns:xlink="http://www.w3.org/1999/xlink"
+    xmlns:xi="http://www.w3.org/2001/XInclude"
+    xmlns:svg="http://www.w3.org/2000/svg"
+    xmlns:m="http://www.w3.org/1998/Math/MathML"
+    xmlns:html="http://www.w3.org/1999/xhtml"
+    xmlns:db="http://docbook.org/ns/docbook">
+    <!--/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+-->
+    <title>Apache HBase Orca</title>
+    <figure>
+        <title>Apache HBase Orca</title>
+        <mediaobject>
+            <imageobject>
+                <imagedata align="center" valign="right"
+                    fileref="jumping-orca_rotated_25percent.png"/>
+            </imageobject>
+        </mediaobject>
+    </figure>
+    <para><link xlink:href="https://issues.apache.org/jira/browse/HBASE-4920">An Orca is the Apache
+            HBase mascot.</link>
+        See NOTICES.txt.  Our Orca logo came from http://www.vectorfree.com/jumping-orca and is
+        licensed under Creative Commons Attribution 3.0 (see https://creativecommons.org/licenses/by/3.0/us/).
+        We changed the logo by stripping the colored background, inverting it, and then rotating it
+        slightly.
+    </para>
+</appendix>

http://git-wip-us.apache.org/repos/asf/hbase/blob/a1fe1e09/src/main/docbkx/other_info.xml
----------------------------------------------------------------------
diff --git a/src/main/docbkx/other_info.xml b/src/main/docbkx/other_info.xml
new file mode 100644
index 0000000..72ff274
--- /dev/null
+++ b/src/main/docbkx/other_info.xml
@@ -0,0 +1,83 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<appendix
+    xml:id="other.info"
+    version="5.0"
+    xmlns="http://docbook.org/ns/docbook"
+    xmlns:xlink="http://www.w3.org/1999/xlink"
+    xmlns:xi="http://www.w3.org/2001/XInclude"
+    xmlns:svg="http://www.w3.org/2000/svg"
+    xmlns:m="http://www.w3.org/1998/Math/MathML"
+    xmlns:html="http://www.w3.org/1999/xhtml"
+    xmlns:db="http://docbook.org/ns/docbook">
+    <!--/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+-->
+    <title>Other Information About HBase</title>
+    <section xml:id="other.info.videos"><title>HBase Videos</title>
+        <para>Introduction to HBase
+            <itemizedlist>
+                <listitem><para><link xlink:href="http://www.cloudera.com/content/cloudera/en/resources/library/presentation/chicago_data_summit_apache_hbase_an_introduction_todd_lipcon.html">Introduction to HBase</link> by Todd Lipcon (Chicago Data Summit 2011).
+                </para></listitem>
+                <listitem><para><link xlink:href="http://www.cloudera.com/videos/intorduction-hbase-todd-lipcon">Introduction to HBase</link> by Todd Lipcon (2010).
+                </para></listitem>
+            </itemizedlist>
+        </para>
+        <para><link xlink:href="http://www.cloudera.com/videos/hadoop-world-2011-presentation-video-building-realtime-big-data-services-at-facebook-with-hadoop-and-hbase">Building Real Time Services at Facebook with HBase</link> by Jonathan Gray (Hadoop World 2011).
+        </para>
+        <para><link xlink:href="http://www.cloudera.com/videos/hw10_video_how_stumbleupon_built_and_advertising_platform_using_hbase_and_hadoop">HBase and Hadoop, Mixing Real-Time and Batch Processing at StumbleUpon</link> by JD Cryans (Hadoop World 2010).
+        </para>
+    </section>
+    <section xml:id="other.info.pres"><title>HBase Presentations (Slides)</title>
+        <para><link xlink:href="http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/hadoop-world-2011-presentation-video-advanced-hbase-schema-design.html">Advanced HBase Schema Design</link> by Lars George (Hadoop World 2011).
+        </para>
+        <para><link xlink:href="http://www.slideshare.net/cloudera/chicago-data-summit-apache-hbase-an-introduction">Introduction to HBase</link> by Todd Lipcon (Chicago Data Summit 2011).
+        </para>
+        <para><link xlink:href="http://www.slideshare.net/cloudera/hw09-practical-h-base-getting-the-most-from-your-h-base-install">Getting The Most From Your HBase Install</link> by Ryan Rawson, Jonathan Gray (Hadoop World 2009).
+        </para>
+    </section>
+    <section xml:id="other.info.papers"><title>HBase Papers</title>
+        <para><link xlink:href="http://research.google.com/archive/bigtable.html">BigTable</link> by Google (2006).
+        </para>
+        <para><link xlink:href="http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html">HBase and HDFS Locality</link> by Lars George (2010).
+        </para>
+        <para><link xlink:href="http://ianvarley.com/UT/MR/Varley_MastersReport_Full_2009-08-07.pdf">No Relation: The Mixed Blessings of Non-Relational Databases</link> by Ian Varley (2009).
+        </para>
+    </section>
+    <section xml:id="other.info.sites"><title>HBase Sites</title>
+        <para><link xlink:href="http://www.cloudera.com/blog/category/hbase/">Cloudera's HBase Blog</link> has a lot of links to useful HBase information.
+            <itemizedlist>
+                <listitem><para><link xlink:href="http://www.cloudera.com/blog/2010/04/cap-confusion-problems-with-partition-tolerance/">CAP Confusion</link> is a relevant entry for background information on
+                    distributed storage systems.</para>
+                </listitem>
+            </itemizedlist>
+        </para>
+        <para><link xlink:href="http://wiki.apache.org/hadoop/HBase/HBasePresentations">HBase Wiki</link> has a page with a number of presentations.
+        </para>
+        <para><link xlink:href="http://refcardz.dzone.com/refcardz/hbase">HBase RefCard</link> from DZone.
+        </para>
+    </section>
+    <section xml:id="other.info.books"><title>HBase Books</title>
+        <para><link xlink:href="http://shop.oreilly.com/product/0636920014348.do">HBase:  The Definitive Guide</link> by Lars George.
+        </para>
+    </section>
+    <section xml:id="other.info.books.hadoop"><title>Hadoop Books</title>
+        <para><link xlink:href="http://shop.oreilly.com/product/9780596521981.do">Hadoop:  The Definitive Guide</link> by Tom White.
+        </para>
+    </section>
+
+</appendix>

http://git-wip-us.apache.org/repos/asf/hbase/blob/a1fe1e09/src/main/docbkx/performance.xml
----------------------------------------------------------------------
diff --git a/src/main/docbkx/performance.xml b/src/main/docbkx/performance.xml
index 1757d3f..42ed79b 100644
--- a/src/main/docbkx/performance.xml
+++ b/src/main/docbkx/performance.xml
@@ -273,7 +273,7 @@ tableDesc.addFamily(cfDesc);
         If there is enough RAM, increasing this can help.
         </para>
     </section>
-    <section xml:id="hbase.regionserver.checksum.verify">
+    <section xml:id="hbase.regionserver.checksum.verify.performance">
         <title><varname>hbase.regionserver.checksum.verify</varname></title>
         <para>Have HBase write the checksum into the datablock and save
         having to do the checksum seek whenever you read.</para>

http://git-wip-us.apache.org/repos/asf/hbase/blob/a1fe1e09/src/main/docbkx/sql.xml
----------------------------------------------------------------------
diff --git a/src/main/docbkx/sql.xml b/src/main/docbkx/sql.xml
new file mode 100644
index 0000000..40f43d6
--- /dev/null
+++ b/src/main/docbkx/sql.xml
@@ -0,0 +1,40 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<appendix
+    xml:id="sql"
+    version="5.0"
+    xmlns="http://docbook.org/ns/docbook"
+    xmlns:xlink="http://www.w3.org/1999/xlink"
+    xmlns:xi="http://www.w3.org/2001/XInclude"
+    xmlns:svg="http://www.w3.org/2000/svg"
+    xmlns:m="http://www.w3.org/1998/Math/MathML"
+    xmlns:html="http://www.w3.org/1999/xhtml"
+    xmlns:db="http://docbook.org/ns/docbook">
+    <!--/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+-->
+    <title>SQL over HBase</title>
+    <section xml:id="phoenix">
+        <title>Apache Phoenix</title>
+        <para><link xlink:href="http://phoenix.apache.org">Apache Phoenix</link></para>
+    </section>
+    <section xml:id="trafodion">
+        <title>Trafodion</title>
+        <para><link xlink:href="https://wiki.trafodion.org/">Trafodion: Transactional SQL-on-HBase</link></para>
+    </section>
+
+</appendix>

http://git-wip-us.apache.org/repos/asf/hbase/blob/a1fe1e09/src/main/docbkx/upgrading.xml
----------------------------------------------------------------------
diff --git a/src/main/docbkx/upgrading.xml b/src/main/docbkx/upgrading.xml
index d5708a4..5d71e0f 100644
--- a/src/main/docbkx/upgrading.xml
+++ b/src/main/docbkx/upgrading.xml
@@ -240,7 +240,7 @@
         </table>
       </section>
 
-	    <section xml:id="hbase.client.api">
+	    <section xml:id="hbase.client.api.surface">
 		  <title>HBase API surface</title>
 		  <para> HBase has a lot of API points, but for the compatibility matrix above, we differentiate between Client API, Limited Private API, and Private API. HBase uses a version of 
 		  <link xlink:href="https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Compatibility.html">Hadoop's Interface classification</link>. HBase's Interface classification classes can be found <link xlink:href="https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/classification/package-summary.html"> here</link>. 

http://git-wip-us.apache.org/repos/asf/hbase/blob/a1fe1e09/src/main/docbkx/ycsb.xml
----------------------------------------------------------------------
diff --git a/src/main/docbkx/ycsb.xml b/src/main/docbkx/ycsb.xml
new file mode 100644
index 0000000..695614c
--- /dev/null
+++ b/src/main/docbkx/ycsb.xml
@@ -0,0 +1,36 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<appendix xml:id="ycsb" version="5.0" xmlns="http://docbook.org/ns/docbook"
+    xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xi="http://www.w3.org/2001/XInclude"
+    xmlns:svg="http://www.w3.org/2000/svg" xmlns:m="http://www.w3.org/1998/Math/MathML"
+    xmlns:html="http://www.w3.org/1999/xhtml" xmlns:db="http://docbook.org/ns/docbook">
+    <!--/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+-->
+    <title>YCSB</title>
+    <para><link xlink:href="https://github.com/brianfrankcooper/YCSB/">YCSB: The
+            Yahoo! Cloud Serving Benchmark</link> and HBase</para>
+    <para>TODO: Describe how YCSB is poor for putting up a decent cluster load.</para>
+    <para>TODO: Describe setup of YCSB for HBase. In particular, presplit your tables before you
+        start a run. See <link xlink:href="https://issues.apache.org/jira/browse/HBASE-4163"
+            >HBASE-4163 Create Split Strategy for YCSB Benchmark</link> for why and a little shell
+        command for how to do it.</para>
+    <para>Ted Dunning redid YCSB so that it is mavenized, and added a facility for verifying
+            workloads. See <link xlink:href="https://github.com/tdunning/YCSB">Ted Dunning's YCSB</link>.</para>
+
+
+</appendix>


[4/8] hbase git commit: HBASE-12738 Chunk Ref Guide into file-per-chapter

Posted by mi...@apache.org.
http://git-wip-us.apache.org/repos/asf/hbase/blob/a1fe1e09/src/main/docbkx/compression.xml
----------------------------------------------------------------------
diff --git a/src/main/docbkx/compression.xml b/src/main/docbkx/compression.xml
new file mode 100644
index 0000000..d1971b1
--- /dev/null
+++ b/src/main/docbkx/compression.xml
@@ -0,0 +1,535 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<appendix
+    xml:id="compression"
+    version="5.0"
+    xmlns="http://docbook.org/ns/docbook"
+    xmlns:xlink="http://www.w3.org/1999/xlink"
+    xmlns:xi="http://www.w3.org/2001/XInclude"
+    xmlns:svg="http://www.w3.org/2000/svg"
+    xmlns:m="http://www.w3.org/1998/Math/MathML"
+    xmlns:html="http://www.w3.org/1999/xhtml"
+    xmlns:db="http://docbook.org/ns/docbook">
+    <!--/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+-->
+
+    <title>Compression and Data Block Encoding In
+          HBase<indexterm><primary>Compression</primary><secondary>Data Block
+          Encoding</secondary><seealso>codecs</seealso></indexterm></title>
+    <note>
+      <para>Codecs mentioned in this section are for encoding and decoding data blocks or row keys.
+       For information about replication codecs, see <xref
+          linkend="cluster.replication.preserving.tags" />.</para>
+    </note>
+    <para>Some of the information in this section is pulled from a <link
+        xlink:href="http://search-hadoop.com/m/lL12B1PFVhp1/v=threaded">discussion</link> on the
+      HBase Development mailing list.</para>
+    <para>HBase supports several different compression algorithms which can be enabled on a
+      ColumnFamily. Data block encoding attempts to limit duplication of information in keys, taking
+      advantage of some of the fundamental designs and patterns of HBase, such as sorted row keys
+      and the schema of a given table. Compressors reduce the size of large, opaque byte arrays in
+      cells, and can significantly reduce the storage space needed to store uncompressed
+      data.</para>
+    <para>Compressors and data block encoding can be used together on the same ColumnFamily.</para>
+    
+    <formalpara>
+      <title>Changes Take Effect Upon Compaction</title>
+      <para>If you change compression or encoding for a ColumnFamily, the changes take effect during
+       compaction.</para>
+    </formalpara>
+
+    <para>Some codecs take advantage of capabilities built into Java, such as GZip compression.
+      Others rely on native libraries. Native libraries may be available as part of Hadoop, such as
+      LZ4. In this case, HBase only needs access to the appropriate shared library. Other codecs,
+      such as Google Snappy, need to be installed first. Some codecs are licensed in ways that
+      conflict with HBase's license and cannot be shipped as part of HBase.</para>
+
+    <para>This section discusses common codecs that are used and tested with HBase. No matter what
+      codec you use, be sure to test that it is installed correctly and is available on all nodes in
+      your cluster. Extra operational steps may be necessary to be sure that codecs are available on
+      newly-deployed nodes. You can use the <xref
+        linkend="compression.test" /> utility to check that a given codec is correctly
+      installed.</para>
+
+    <para>To configure HBase to use a compressor, see <xref
+        linkend="compressor.install" />. To enable a compressor for a ColumnFamily, see <xref
+        linkend="changing.compression" />. To enable data block encoding for a ColumnFamily, see
+      <xref linkend="data.block.encoding.enable" />.</para>
+    <itemizedlist>
+      <title>Block Compressors</title>
+      <listitem>
+        <para>none</para>
+      </listitem>
+      <listitem>
+        <para>Snappy</para>
+      </listitem>
+      <listitem>
+        <para>LZO</para>
+      </listitem>
+      <listitem>
+        <para>LZ4</para>
+      </listitem>
+      <listitem>
+        <para>GZ</para>
+      </listitem>
+    </itemizedlist>
+
+
+    <itemizedlist xml:id="data.block.encoding.types">
+      <title>Data Block Encoding Types</title>
+      <listitem>
+        <para>Prefix - Often, keys are very similar. Specifically, keys often share a common prefix
+          and only differ near the end. For instance, one key might be
+            <literal>RowKey:Family:Qualifier0</literal> and the next key might be
+            <literal>RowKey:Family:Qualifier1</literal>. In Prefix encoding, an extra column is
+          added which holds the length of the prefix shared between the current key and the previous
+          key. Assuming the first key here is totally different from the key before, its prefix
+          length is 0. The second key's prefix length is <literal>23</literal>, since they have the
+          first 23 characters in common.</para>
+        <para>Obviously if the keys tend to have nothing in common, Prefix will not provide much
+          benefit.</para>
+        <para>The following image shows a hypothetical ColumnFamily with no data block encoding.</para>
+        <figure>
+          <title>ColumnFamily with No Encoding</title>
+          <mediaobject>
+            <imageobject>
+              <imagedata fileref="data_block_no_encoding.png" width="800"/>
+            </imageobject>
+            <caption><para>A ColumnFamily with no encoding</para></caption>
+          </mediaobject>
+        </figure>
+        <para>Here is the same data with prefix data encoding.</para>
+        <figure>
+          <title>ColumnFamily with Prefix Encoding</title>
+          <mediaobject>
+            <imageobject>
+              <imagedata fileref="data_block_prefix_encoding.png" width="800"/>
+            </imageobject>
+            <caption><para>A ColumnFamily with prefix encoding</para></caption>
+          </mediaobject>
+        </figure>
+      </listitem>
+      <listitem>
+        <para>Diff - Diff encoding expands upon Prefix encoding. Instead of considering the key
+          sequentially as a monolithic series of bytes, each key field is split so that each part of
+          the key can be compressed more efficiently. Two new fields are added: timestamp and type.
+          If the ColumnFamily is the same as the previous row, it is omitted from the current row.
+          If the key length, value length or type are the same as the previous row, the field is
+          omitted. In addition, for increased compression, the timestamp is stored as a Diff from
+          the previous row's timestamp, rather than being stored in full. Given the two row keys in
+          the Prefix example, and given an exact match on timestamp and the same type, neither the
+          value length nor the type needs to be stored for the second row, and the timestamp value
+          for the second row is just 0, rather than a full timestamp.</para>
+        <para>Diff encoding is disabled by default because writing and scanning are slower, although
+          more data is cached.</para>
+        <para>This image shows the same ColumnFamily from the previous images, with Diff encoding.</para>
+        <figure>
+          <title>ColumnFamily with Diff Encoding</title>
+          <mediaobject>
+            <imageobject>
+              <imagedata fileref="data_block_diff_encoding.png" width="800"/>
+            </imageobject>
+            <caption><para>A ColumnFamily with diff encoding</para></caption>
+          </mediaobject>
+        </figure>
+      </listitem>
+      <listitem>
+        <para>Fast Diff - Fast Diff works similar to Diff, but uses a faster implementation. It also
+          adds another field which stores a single bit to track whether the data itself is the same
+          as the previous row. If it is, the data is not stored again. Fast Diff is the recommended
+          codec to use if you have long keys or many columns. The data format is nearly identical to
+        Diff encoding, so there is not an image to illustrate it.</para>
+      </listitem>
+      <listitem>
+        <para>Prefix Tree encoding was introduced as an experimental feature in HBase 0.96. It
+          provides similar memory savings to the Prefix, Diff, and Fast Diff encoder, but provides
+          faster random access at a cost of slower encoding speed. Prefix Tree may be appropriate
+          for applications that have high block cache hit ratios. It introduces new 'tree' fields
+          for the row and column. The row tree field contains a list of offsets/references
+          corresponding to the cells in that row. This allows for a good deal of compression. For
+          more details about Prefix Tree encoding, see <link
+            xlink:href="https://issues.apache.org/jira/browse/HBASE-4676">HBASE-4676</link>. It is
+          difficult to graphically illustrate a prefix tree, so no image is included. See the
+          Wikipedia article for <link
+            xlink:href="http://en.wikipedia.org/wiki/Trie">Trie</link> for more general information
+          about this data structure.</para>
+      </listitem>
+    </itemizedlist>
+
+    <section>
+      <title>Which Compressor or Data Block Encoder To Use</title>
+      <para>The compression or codec type to use depends on the characteristics of your data.
+        Choosing the wrong type could cause your data to take more space rather than less, and can
+        have performance implications. In general, you need to weigh your options between smaller
+        size and faster compression/decompression. Following are some general guidelines, expanded from a discussion at <link xlink:href="http://search-hadoop.com/m/lL12B1PFVhp1">Documenting Guidance on compression and codecs</link>. </para>
+      <itemizedlist>
+        <listitem>
+          <para>If you have long keys (compared to the values) or many columns, use a prefix
+            encoder. FAST_DIFF is recommended, as more testing is needed for Prefix Tree
+            encoding.</para>
+        </listitem>
+        <listitem>
+          <para>If the values are large (and not precompressed, such as images), use a data block
+            compressor.</para>
+        </listitem>
+        <listitem>
+          <para>Use GZIP for <firstterm>cold data</firstterm>, which is accessed infrequently. GZIP
+            compression uses more CPU resources than Snappy or LZO, but provides a higher
+            compression ratio.</para>
+        </listitem>
+        <listitem>
+          <para>Use Snappy or LZO for <firstterm>hot data</firstterm>, which is accessed
+            frequently. Snappy and LZO use fewer CPU resources than GZIP, but do not provide as high
+          of a compression ratio.</para>
+        </listitem>
+        <listitem>
+          <para>In most cases, enabling Snappy or LZO by default is a good choice, because they have
+            a low performance overhead and provide space savings.</para>
+        </listitem>
+        <listitem>
+          <para>Before Snappy was released by Google in 2011, LZO was the default. Snappy has
+            similar qualities to LZO but has been shown to perform better.</para>
+        </listitem>
+      </itemizedlist>
+    </section>
+    <section xml:id="hadoop.native.lib">
+      <title>Making use of Hadoop Native Libraries in HBase</title>
+      <para>The Hadoop shared library provides a number of facilities, including
+        compression libraries and fast CRC checksumming. To make these facilities available
+        to HBase, do the following. HBase/Hadoop will fall back to alternatives
+        if it cannot find the native library versions -- or
+        fail outright if you ask for an explicit compressor and there is
+      no alternative available.</para>
+    <para>If you see the following in your HBase logs, you know that HBase was unable
+      to locate the Hadoop native libraries:
+      <programlisting>2014-08-07 09:26:20,139 WARN  [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable</programlisting>
+      If the libraries loaded successfully, the WARN message does not show.
+    </para>
+    <para>Let's presume your Hadoop ships with a native library that
+      suits the platform you are running HBase on.  To check whether the Hadoop
+      native library is available to HBase, run the following tool (available in 
+      Hadoop 2.1 and greater):
+      <programlisting>$ ./bin/hbase --config ~/conf_hbase org.apache.hadoop.util.NativeLibraryChecker
+2014-08-26 13:15:38,717 WARN  [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
+Native library checking:
+hadoop: false
+zlib:   false
+snappy: false
+lz4:    false
+bzip2:  false
+2014-08-26 13:15:38,863 INFO  [main] util.ExitUtil: Exiting with status 1</programlisting>
+The above shows that the native Hadoop library is not available in the HBase context.
+    </para>
+    <para>To fix the above, either copy the Hadoop native libraries locally, or symlink to
+      them if the Hadoop and HBase installs are adjacent in the filesystem. 
+      You could also point at their location by setting the <varname>LD_LIBRARY_PATH</varname> environment
+      variable.</para>
+    <para>Where the JVM looks for native libraries is "system dependent"
+      (see <classname>java.lang.System#loadLibrary(name)</classname>). On Linux, by default,
+      it looks in <filename>lib/native/PLATFORM</filename> where <varname>PLATFORM</varname>
+      is the label for the platform your HBase is installed on.
+      On a local Linux machine, this seems to be the concatenation of the Java properties
+      <varname>os.name</varname> and <varname>os.arch</varname>, followed by whether the JVM is 32- or 64-bit.
+      HBase on startup prints out all of the Java system properties, so find os.name and os.arch
+      in the log. For example:
+      <programlisting>....
+      2014-08-06 15:27:22,853 INFO  [main] zookeeper.ZooKeeper: Client environment:os.name=Linux
+      2014-08-06 15:27:22,853 INFO  [main] zookeeper.ZooKeeper: Client environment:os.arch=amd64
+      ...
+    </programlisting>
+     So in this case, the PLATFORM string is <varname>Linux-amd64-64</varname>.
+     Copying the Hadoop native libraries or symlinking at <filename>lib/native/Linux-amd64-64</filename>
+     will ensure they are found.  Check with the Hadoop <filename>NativeLibraryChecker</filename>. 
+    </para>
+
+    <para>Here is example of how to point at the Hadoop libs with <varname>LD_LIBRARY_PATH</varname>
+      environment variable:
+      <programlisting>$ LD_LIBRARY_PATH=~/hadoop-2.5.0-SNAPSHOT/lib/native ./bin/hbase --config ~/conf_hbase org.apache.hadoop.util.NativeLibraryChecker
+2014-08-26 13:42:49,332 INFO  [main] bzip2.Bzip2Factory: Successfully loaded &amp; initialized native-bzip2 library system-native
+2014-08-26 13:42:49,337 INFO  [main] zlib.ZlibFactory: Successfully loaded &amp; initialized native-zlib library
+Native library checking:
+hadoop: true /home/stack/hadoop-2.5.0-SNAPSHOT/lib/native/libhadoop.so.1.0.0
+zlib:   true /lib64/libz.so.1
+snappy: true /usr/lib64/libsnappy.so.1
+lz4:    true revision:99
+bzip2:  true /lib64/libbz2.so.1</programlisting>
+Set the <varname>LD_LIBRARY_PATH</varname> environment variable in <filename>hbase-env.sh</filename> when starting your HBase.
+    </para>
+    </section>
+
+    <section>
+      <title>Compressor Configuration, Installation, and Use</title>
+      <section
+        xml:id="compressor.install">
+        <title>Configure HBase For Compressors</title>
+        <para>Before HBase can use a given compressor, its libraries need to be available. Due to
+          licensing issues, only GZ compression is available to HBase (via native Java libraries) in
+          a default installation. Other compression libraries are available via the shared library
+          bundled with your Hadoop.  The Hadoop native library needs to be findable when HBase
+          starts.  See <xref linkend="hadoop.native.lib" />.</para>
+        <section>
+          <title>Compressor Support On the Master</title>
+          <para>A new configuration setting was introduced in HBase 0.95, to check the Master to
+            determine which data block encoders are installed and configured on it, and assume that
+            the entire cluster is configured the same. This option,
+              <code>hbase.master.check.compression</code>, defaults to <literal>true</literal>. This
+            prevents the situation described in <link
+              xlink:href="https://issues.apache.org/jira/browse/HBASE-6370">HBASE-6370</link>, where
+            a table is created or modified to support a codec that a region server does not support,
+            leading to failures that take a long time to occur and are difficult to debug. </para>
+          <para>If <code>hbase.master.check.compression</code> is enabled, libraries for all desired
+            compressors need to be installed and configured on the Master, even if the Master does
+            not run a region server.</para>
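+          <para>As a sketch, the setting lives in <filename>hbase-site.xml</filename>; the entry
+            below simply spells out the default value of <literal>true</literal>:</para>
+          <programlisting language="xml"><![CDATA[
+<property>
+  <name>hbase.master.check.compression</name>
+  <value>true</value>
+</property>]]></programlisting>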
+        </section>
+        <section>
+          <title>Install GZ Support Via Native Libraries</title>
+          <para>HBase uses Java's built-in GZip support unless the native Hadoop libraries are
+            available on the CLASSPATH. The recommended way to add libraries to the CLASSPATH is to
+            set the environment variable <envar>HBASE_LIBRARY_PATH</envar> for the user running
+            HBase. If native libraries are not available and Java's GZIP is used, <literal>Got
+              brand-new compressor</literal> reports will be present in the logs. See <xref
+              linkend="brand.new.compressor" />.</para>
+        </section>
+        <section
+          xml:id="lzo.compression">
+          <title>Install LZO Support</title>
+          <para>HBase cannot ship with LZO because of incompatibility between HBase, which uses an
+            Apache Software License (ASL), and LZO, which uses a GPL license. See the <link
+              xlink:href="http://wiki.apache.org/hadoop/UsingLzoCompression">Using LZO
+              Compression</link> wiki page for information on configuring LZO support for HBase. </para>
+          <para>If you depend upon LZO compression, consider configuring your RegionServers to fail
+            to start if LZO is not available. See <xref
+              linkend="hbase.regionserver.codecs" />.</para>
+        </section>
+        <section
+          xml:id="lz4.compression">
+          <title>Configure LZ4 Support</title>
+          <para>LZ4 support is bundled with Hadoop. Make sure the hadoop shared library
+            (libhadoop.so) is accessible when you start
+            HBase. After configuring your platform (see <xref
+              linkend="hbase.native.platform" />), you can make a symbolic link from HBase to the native Hadoop
+            libraries. This assumes the two software installs are colocated. For example, if your
+            'platform' is Linux-amd64-64:
+            <programlisting language="bourne">$ cd $HBASE_HOME
+$ mkdir lib/native
+$ ln -s $HADOOP_HOME/lib/native lib/native/Linux-amd64-64</programlisting>
+            Use the compression tool to check that LZ4 is installed on all nodes. Start up (or restart)
+            HBase. Afterward, you can create and alter tables to enable LZ4 as a
+            compression codec:
+            <screen>
+hbase(main):003:0> <userinput>alter 'TestTable', {NAME => 'info', COMPRESSION => 'LZ4'}</userinput>
+            </screen>
+          </para>
+        </section>
+        <section
+          xml:id="snappy.compression.installation">
+          <title>Install Snappy Support</title>
+          <para>HBase does not ship with Snappy support because of licensing issues. You can install
+            Snappy binaries (for instance, by using <command>yum install snappy</command> on CentOS)
+            or build Snappy from source. After installing Snappy, search for the shared library,
+            which will be called <filename>libsnappy.so.X</filename> where X is a number. If you
+            built from source, copy the shared library to a known location on your system, such as
+              <filename>/opt/snappy/lib/</filename>.</para>
+          <para>In addition to the Snappy library, HBase also needs access to the Hadoop shared
+            library, which will be called something like <filename>libhadoop.so.X.Y</filename>,
+            where X and Y are both numbers. Make note of the location of the Hadoop library, or copy
+            it to the same location as the Snappy library.</para>
+          <note>
+            <para>The Snappy and Hadoop libraries need to be available on each node of your cluster.
+              See <xref
+                linkend="compression.test" /> to find out how to test that this is the case.</para>
+            <para>See <xref
+                linkend="hbase.regionserver.codecs" /> to configure your RegionServers to fail to
+              start if a given compressor is not available.</para>
+          </note>
+          <para>Each of these library locations needs to be added to the environment variable
+              <envar>HBASE_LIBRARY_PATH</envar> for the operating system user that runs HBase. You
+            need to restart the RegionServer for the changes to take effect.</para>
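+          <para>For example, assuming you copied the libraries to <filename>/opt/snappy/lib/</filename>
+            (the path is only an example), you might add a line like the following to
+            <filename>hbase-env.sh</filename> for the user that runs HBase:</para>
+          <programlisting language="bourne"># Example only: expose the Snappy and Hadoop native libraries to HBase
+export HBASE_LIBRARY_PATH=/opt/snappy/lib:$HBASE_LIBRARY_PATH</programlisting>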
+        </section>
+
+
+        <section
+          xml:id="compression.test">
+          <title>CompressionTest</title>
+          <para>You can use the CompressionTest tool to verify that your compressor is available to
+            HBase:</para>
+          <screen language="bourne">
+ $ hbase org.apache.hadoop.hbase.util.CompressionTest hdfs://<replaceable>host/path/to/hbase</replaceable> snappy       
+          </screen>
+        </section>
+
+
+        <section
+          xml:id="hbase.regionserver.codecs">
+          <title>Enforce Compression Settings On a RegionServer</title>
+          <para>You can configure a RegionServer so that it will fail to restart if compression is
+            configured incorrectly, by adding the option hbase.regionserver.codecs to the
+              <filename>hbase-site.xml</filename>, and setting its value to a comma-separated list
+            of codecs that need to be available. For example, if you set this property to
+              <literal>lzo,gz</literal>, the RegionServer would fail to start if either compressor
+            were not available. This would prevent a new server from being added to the cluster
+            without having codecs configured properly.</para>
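+          <para>For example, a minimal <filename>hbase-site.xml</filename> entry matching the
+            description above (the codec list is illustrative) would be:</para>
+          <programlisting language="xml"><![CDATA[
+<property>
+  <name>hbase.regionserver.codecs</name>
+  <value>lzo,gz</value>
+</property>]]></programlisting>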
+        </section>
+      </section>
+
+      <section
+        xml:id="changing.compression">
+        <title>Enable Compression On a ColumnFamily</title>
+        <para>To enable compression for a ColumnFamily, use an <code>alter</code> command. You do
+          not need to re-create the table or copy data. If you are changing codecs, be sure the old
+          codec is still available until all the old StoreFiles have been compacted.</para>
+        <example>
+          <title>Enabling Compression on a ColumnFamily of an Existing Table using HBase
+            Shell</title>
+          <screen><![CDATA[
+hbase> disable 'test'
+hbase> alter 'test', {NAME => 'cf', COMPRESSION => 'GZ'}
+hbase> enable 'test']]>
+        </screen>
+        </example>
+        <example>
+          <title>Creating a New Table with Compression On a ColumnFamily</title>
+          <screen><![CDATA[
+hbase> create 'test2', { NAME => 'cf2', COMPRESSION => 'SNAPPY' }         
+          ]]></screen>
+        </example>
+        <example>
+          <title>Verifying a ColumnFamily's Compression Settings</title>
+          <screen><![CDATA[
+hbase> describe 'test'
+DESCRIPTION                                          ENABLED
+ 'test', {NAME => 'cf', DATA_BLOCK_ENCODING => 'NONE false
+ ', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0',
+ VERSIONS => '1', COMPRESSION => 'GZ', MIN_VERSIONS
+ => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'fa
+ lse', BLOCKSIZE => '65536', IN_MEMORY => 'false', B
+ LOCKCACHE => 'true'}
+1 row(s) in 0.1070 seconds
+          ]]></screen>
+        </example>
+      </section>
+
+      <section>
+        <title>Testing Compression Performance</title>
+        <para>HBase includes a tool called LoadTestTool which provides mechanisms to test your
+          compression performance. You must specify either <literal>-write</literal> or
+          <literal>-update-read</literal> as your first parameter, and if you do not specify another
+        parameter, usage advice is printed for each option.</para>
+        <example>
+          <title><command>LoadTestTool</command> Usage</title>
+          <screen language="bourne"><![CDATA[
+$ bin/hbase org.apache.hadoop.hbase.util.LoadTestTool -h            
+usage: bin/hbase org.apache.hadoop.hbase.util.LoadTestTool <options>
+Options:
+ -batchupdate                 Whether to use batch as opposed to separate
+                              updates for every column in a row
+ -bloom <arg>                 Bloom filter type, one of [NONE, ROW, ROWCOL]
+ -compression <arg>           Compression type, one of [LZO, GZ, NONE, SNAPPY,
+                              LZ4]
+ -data_block_encoding <arg>   Encoding algorithm (e.g. prefix compression) to
+                              use for data blocks in the test column family, one
+                              of [NONE, PREFIX, DIFF, FAST_DIFF, PREFIX_TREE].
+ -encryption <arg>            Enables transparent encryption on the test table,
+                              one of [AES]
+ -generator <arg>             The class which generates load for the tool. Any
+                              args for this class can be passed as colon
+                              separated after class name
+ -h,--help                    Show usage
+ -in_memory                   Tries to keep the HFiles of the CF inmemory as far
+                              as possible.  Not guaranteed that reads are always
+                              served from inmemory
+ -init_only                   Initialize the test table only, don't do any
+                              loading
+ -key_window <arg>            The 'key window' to maintain between reads and
+                              writes for concurrent write/read workload. The
+                              default is 0.
+ -max_read_errors <arg>       The maximum number of read errors to tolerate
+                              before terminating all reader threads. The default
+                              is 10.
+ -multiput                    Whether to use multi-puts as opposed to separate
+                              puts for every column in a row
+ -num_keys <arg>              The number of keys to read/write
+ -num_tables <arg>            A positive integer number. When a number n is
+                              speicfied, load test tool  will load n table
+                              parallely. -tn parameter value becomes table name
+                              prefix. Each table name is in format
+                              <tn>_1...<tn>_n
+ -read <arg>                  <verify_percent>[:<#threads=20>]
+ -regions_per_server <arg>    A positive integer number. When a number n is
+                              specified, load test tool will create the test
+                              table with n regions per server
+ -skip_init                   Skip the initialization; assume test table already
+                              exists
+ -start_key <arg>             The first key to read/write (a 0-based index). The
+                              default value is 0.
+ -tn <arg>                    The name of the table to read or write
+ -update <arg>                <update_percent>[:<#threads=20>][:<#whether to
+                              ignore nonce collisions=0>]
+ -write <arg>                 <avg_cols_per_key>:<avg_data_size>[:<#threads=20>]
+ -zk <arg>                    ZK quorum as comma-separated host names without
+                              port numbers
+ -zk_root <arg>               name of parent znode in zookeeper            
+          ]]></screen>
+        </example>
+        <example>
+          <title>Example Usage of LoadTestTool</title>
+          <screen language="bourne">
+$ hbase org.apache.hadoop.hbase.util.LoadTestTool -write 1:10:100 -num_keys 1000000
+          -read 100:30 -num_tables 1 -data_block_encoding NONE -tn load_test_tool_NONE
+          </screen>
+        </example>
+      </section>
+    </section>
+
+    <section xml:id="data.block.encoding.enable">
+      <title>Enable Data Block Encoding</title>
+      <para>Codecs are built into HBase so no extra configuration is needed. Codecs are enabled on a
+        table by setting the <code>DATA_BLOCK_ENCODING</code> property. Disable the table before
+        altering its DATA_BLOCK_ENCODING setting. Following is an example using HBase Shell:</para>
+      <example>
+        <title>Enable Data Block Encoding On a Table</title>
+        <screen><![CDATA[
+hbase>  disable 'test'
+hbase> alter 'test', { NAME => 'cf', DATA_BLOCK_ENCODING => 'FAST_DIFF' }
+Updating all regions with the new schema...
+0/1 regions updated.
+1/1 regions updated.
+Done.
+0 row(s) in 2.2820 seconds
+hbase> enable 'test'
+0 row(s) in 0.1580 seconds          
+          ]]></screen>
+      </example>
+      <example>
+        <title>Verifying a ColumnFamily's Data Block Encoding</title>
+        <screen><![CDATA[
+hbase> describe 'test'
+DESCRIPTION                                          ENABLED
+ 'test', {NAME => 'cf', DATA_BLOCK_ENCODING => 'FAST true
+ _DIFF', BLOOMFILTER => 'ROW', REPLICATION_SCOPE =>
+ '0', VERSIONS => '1', COMPRESSION => 'GZ', MIN_VERS
+ IONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS =
+ > 'false', BLOCKSIZE => '65536', IN_MEMORY => 'fals
+ e', BLOCKCACHE => 'true'}
+1 row(s) in 0.0650 seconds          
+        ]]></screen>
+      </example>
+    </section>
+  
+
+</appendix>

http://git-wip-us.apache.org/repos/asf/hbase/blob/a1fe1e09/src/main/docbkx/configuration.xml
----------------------------------------------------------------------
diff --git a/src/main/docbkx/configuration.xml b/src/main/docbkx/configuration.xml
index 74b8e52..a0b7d11 100644
--- a/src/main/docbkx/configuration.xml
+++ b/src/main/docbkx/configuration.xml
@@ -925,8 +925,8 @@ stopping hbase...............</screen>
       <!--presumes the pre-site target has put the hbase-default.xml at this location-->
       <xi:include
         xmlns:xi="http://www.w3.org/2001/XInclude"
-        href="../../../target/docbkx/hbase-default.xml">
-        <xi:fallback>
+        href="hbase-default.xml">
+	<!--<xi:fallback>
           <section
             xml:id="hbase_default_configurations">
             <title />
@@ -1007,7 +1007,7 @@ stopping hbase...............</screen>
               </section>
             </section>
           </section>
-        </xi:fallback>
+	</xi:fallback>-->
       </xi:include>
     </section>
 

http://git-wip-us.apache.org/repos/asf/hbase/blob/a1fe1e09/src/main/docbkx/customization-pdf.xsl
----------------------------------------------------------------------
diff --git a/src/main/docbkx/customization-pdf.xsl b/src/main/docbkx/customization-pdf.xsl
new file mode 100644
index 0000000..b21236f
--- /dev/null
+++ b/src/main/docbkx/customization-pdf.xsl
@@ -0,0 +1,129 @@
+<?xml version="1.0"?>
+<xsl:stylesheet
+  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
+  version="1.0">
+<!--
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+-->
+  <xsl:import href="urn:docbkx:stylesheet/docbook.xsl"/>
+  <xsl:import href="urn:docbkx:stylesheet/highlight.xsl"/>
+
+
+    <!--###################################################
+                  Paper & Page Size
+   ################################################### -->
+
+    <!-- Paper type, no headers on blank pages, no double sided printing -->
+    <xsl:param name="paper.type" select="'USletter'"/>
+    <xsl:param name="double.sided">0</xsl:param>
+    <xsl:param name="headers.on.blank.pages">0</xsl:param>
+    <xsl:param name="footers.on.blank.pages">0</xsl:param>
+
+    <!-- Space between paper border and content (chaotic stuff, don't touch) -->
+    <xsl:param name="page.margin.top">5mm</xsl:param>
+    <xsl:param name="region.before.extent">10mm</xsl:param>
+    <xsl:param name="body.margin.top">10mm</xsl:param>
+
+    <xsl:param name="body.margin.bottom">15mm</xsl:param>
+    <xsl:param name="region.after.extent">10mm</xsl:param>
+    <xsl:param name="page.margin.bottom">0mm</xsl:param>
+
+    <xsl:param name="page.margin.outer">18mm</xsl:param>
+    <xsl:param name="page.margin.inner">18mm</xsl:param>
+
+    <!-- No indentation of Titles -->
+    <xsl:param name="title.margin.left">0pc</xsl:param>
+
+    <!--###################################################
+                  Fonts & Styles
+   ################################################### -->
+
+    <!-- Justified text with hyphenation enabled -->
+    <xsl:param name="alignment">justify</xsl:param>
+    <xsl:param name="hyphenate">true</xsl:param>
+
+    <!-- Default Font size -->
+    <xsl:param name="body.font.master">11</xsl:param>
+    <xsl:param name="body.font.small">8</xsl:param>
+
+    <!-- Line height in body text -->
+    <xsl:param name="line-height">1.4</xsl:param>
+
+    <!-- Force line break in long URLs -->
+    <xsl:param name="ulink.hyphenate.chars">/&amp;?</xsl:param>
+	<xsl:param name="ulink.hyphenate">&#x200B;</xsl:param>
+
+    <!-- Monospaced fonts are smaller than regular text -->
+    <xsl:attribute-set name="monospace.properties">
+        <xsl:attribute name="font-family">
+            <xsl:value-of select="$monospace.font.family"/>
+        </xsl:attribute>
+        <xsl:attribute name="font-size">0.8em</xsl:attribute>
+        <xsl:attribute name="wrap-option">wrap</xsl:attribute>
+        <xsl:attribute name="hyphenate">true</xsl:attribute>
+    </xsl:attribute-set>
+
+
+	<!-- add page break after abstract block -->
+	<xsl:attribute-set name="abstract.properties">
+		<xsl:attribute name="break-after">page</xsl:attribute>
+	</xsl:attribute-set>
+
+	<!-- add page break after toc -->
+	<xsl:attribute-set name="toc.margin.properties">
+		<xsl:attribute name="break-after">page</xsl:attribute>
+	</xsl:attribute-set>
+
+	<!-- add page break after first level sections -->
+	<xsl:attribute-set name="section.level1.properties">
+		<xsl:attribute name="break-after">page</xsl:attribute>
+	</xsl:attribute-set>
+
+    <!-- Show only Sections up to level 3 in the TOCs -->
+    <xsl:param name="toc.section.depth">2</xsl:param>
+
+    <!-- Dot and Whitespace as separator in TOC between Label and Title-->
+    <xsl:param name="autotoc.label.separator" select="'.  '"/>
+
+	<!-- program listings / examples formatting -->
+	<xsl:attribute-set name="monospace.verbatim.properties">
+		<xsl:attribute name="font-family">Courier</xsl:attribute>
+		<xsl:attribute name="font-size">8pt</xsl:attribute>
+		<xsl:attribute name="keep-together.within-column">always</xsl:attribute>
+	</xsl:attribute-set>
+
+	<xsl:param name="shade.verbatim" select="1" />
+
+	<xsl:attribute-set name="shade.verbatim.style">
+		<xsl:attribute name="background-color">#E8E8E8</xsl:attribute>
+		<xsl:attribute name="border-width">0.5pt</xsl:attribute>
+		<xsl:attribute name="border-style">solid</xsl:attribute>
+		<xsl:attribute name="border-color">#575757</xsl:attribute>
+		<xsl:attribute name="padding">3pt</xsl:attribute>
+	</xsl:attribute-set>
+
+	<!-- callouts customization -->
+	<xsl:param name="callout.unicode" select="1" />
+	<xsl:param name="callout.graphics" select="0" />
+    <xsl:param name="callout.defaultcolumn">90</xsl:param>	
+
+    <!-- Syntax Highlighting -->
+
+
+</xsl:stylesheet>

http://git-wip-us.apache.org/repos/asf/hbase/blob/a1fe1e09/src/main/docbkx/datamodel.xml
----------------------------------------------------------------------
diff --git a/src/main/docbkx/datamodel.xml b/src/main/docbkx/datamodel.xml
new file mode 100644
index 0000000..bdf697d
--- /dev/null
+++ b/src/main/docbkx/datamodel.xml
@@ -0,0 +1,865 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<chapter
+    xml:id="datamodel"
+    version="5.0"
+    xmlns="http://docbook.org/ns/docbook"
+    xmlns:xlink="http://www.w3.org/1999/xlink"
+    xmlns:xi="http://www.w3.org/2001/XInclude"
+    xmlns:svg="http://www.w3.org/2000/svg"
+    xmlns:m="http://www.w3.org/1998/Math/MathML"
+    xmlns:html="http://www.w3.org/1999/xhtml"
+    xmlns:db="http://docbook.org/ns/docbook">
+    <!--/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+-->
+
+    <title>Data Model</title>
+    <para>In HBase, data is stored in tables, which have rows and columns. This terminology
+      overlaps with relational databases (RDBMSs), but the analogy is not a helpful one. Instead, it
+    can be helpful to think of an HBase table as a multi-dimensional map, as sketched below.</para>
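+    <para>The following minimal sketch (illustrative only, not an HBase API) shows that
+      multi-dimensional map view: a table maps row keys to column families, column families map
+      column qualifiers to versioned values, and each version is keyed by a timestamp.</para>
+    <programlisting language="java">// Conceptually: row key -> (column family -> (column qualifier -> (timestamp -> value)))
+SortedMap&lt;byte[],                          // row key
+    SortedMap&lt;byte[],                      // column family
+        SortedMap&lt;byte[],                  // column qualifier
+            SortedMap&lt;Long, byte[]&gt;&gt;&gt;&gt; table;  // timestamp -> cell value</programlisting>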
+    <variablelist>
+      <title>HBase Data Model Terminology</title>
+      <varlistentry>
+        <term>Table</term>
+        <listitem>
+          <para>An HBase table consists of multiple rows.</para>
+        </listitem>
+      </varlistentry>
+      <varlistentry>
+        <term>Row</term>
+        <listitem>
+          <para>A row in HBase consists of a row key and one or more columns with values associated
+            with them. Rows are sorted alphabetically by the row key as they are stored. For this
+            reason, the design of the row key is very important. The goal is to store data in such a
+            way that related rows are near each other. A common row key pattern is a website domain.
+            If your row keys are domains, you should probably store them in reverse (org.apache.www,
+            org.apache.mail, org.apache.jira). This way, all of the Apache domains are near each
+            other in the table, rather than being spread out based on the first letter of the
+            subdomain.</para>
+        </listitem>
+      </varlistentry>
+      <varlistentry>
+        <term>Column</term>
+        <listitem>
+          <para>A column in HBase consists of a column family and a column qualifier, which are
+            delimited by a <literal>:</literal> (colon) character.</para>
+        </listitem>
+      </varlistentry>
+      <varlistentry>
+        <term>Column Family</term>
+        <listitem>
+          <para>Column families physically colocate a set of columns and their values, often for
+            performance reasons. Each column family has a set of storage properties, such as whether
+            its values should be cached in memory, how its data is compressed or its row keys are
+            encoded, and others. Each row in a table has the same column
+            families, though a given row might not store anything in a given column family.</para>
+          <para>Column families are specified when you create your table, and influence the way your
+            data is stored in the underlying filesystem. Therefore, the column families should be
+            considered carefully during schema design.</para>
+        </listitem>
+      </varlistentry>
+      <varlistentry>
+        <term>Column Qualifier</term>
+        <listitem>
+          <para>A column qualifier is added to a column family to provide the index for a given
+            piece of data. Given a column family <literal>content</literal>, a column qualifier
+            might be <literal>content:html</literal>, and another might be
+            <literal>content:pdf</literal>. Though column families are fixed at table creation,
+            column qualifiers are mutable and may differ greatly between rows.</para>
+        </listitem>
+      </varlistentry>
+      <varlistentry>
+        <term>Cell</term>
+        <listitem>
+          <para>A cell is a combination of row, column family, and column qualifier, and contains a
+            value and a timestamp, which represents the value's version.</para>
+          <para>A cell's value is an uninterpreted array of bytes.</para>
+        </listitem>
+      </varlistentry>
+      <varlistentry>
+        <term>Timestamp</term>
+        <listitem>
+          <para>A timestamp is written alongside each value, and is the identifier for a given
+            version of a value. By default, the timestamp represents the time on the RegionServer
+            when the data was written, but you can specify a different timestamp value when you put
+            data into the cell.</para>
+          <caution>
+            <para>Direct manipulation of timestamps is an advanced feature which is only exposed for
+              special cases that are deeply integrated with HBase, and is discouraged in general.
+              Encoding a timestamp at the application level is the preferred pattern.</para>
+          </caution>
+          <para>You can specify the maximum number of versions of a value that HBase retains, per column
+            family. When the maximum number of versions is reached, the oldest versions are 
+            eventually deleted. By default, only the newest version is kept.</para>
+        </listitem>
+      </varlistentry>
+    </variablelist>
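+    <para>To tie the terms above together, the following client-side sketch (the table name, column
+      family, and qualifier are invented for illustration) writes and reads a single cell using the
+      HBase Java client:</para>
+    <programlisting language="java">try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
+     Table table = connection.getTable(TableName.valueOf("example_table"))) {
+  // Write one cell: row "row1", column family "cf", qualifier "greeting"
+  Put put = new Put(Bytes.toBytes("row1"));
+  put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("greeting"), Bytes.toBytes("hello"));
+  table.put(put);
+
+  // Read the same cell back; the newest version is returned by default
+  Get get = new Get(Bytes.toBytes("row1"));
+  Result result = table.get(get);
+  byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("greeting"));
+}</programlisting>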
+
+    <section
+      xml:id="conceptual.view">
+      <title>Conceptual View</title>
+      <para>You can read a very understandable explanation of the HBase data model in the blog post <link
+          xlink:href="http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable">Understanding
+          HBase and BigTable</link> by Jim R. Wilson. Another good explanation is available in the
+        PDF <link
+          xlink:href="http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/9353-login1210_khurana.pdf">Introduction
+          to Basic Schema Design</link> by Amandeep Khurana. It may help to read different
+        perspectives to get a solid understanding of HBase schema design. The linked articles cover
+        the same ground as the information in this section.</para>
+      <para> The following example is a slightly modified form of the one on page 2 of the <link
+          xlink:href="http://research.google.com/archive/bigtable.html">BigTable</link> paper. There
+        is a table called <varname>webtable</varname> that contains two rows
+        (<literal>com.cnn.www</literal>
+          and <literal>com.example.www</literal>), and three column families named
+          <varname>contents</varname>, <varname>anchor</varname>, and <varname>people</varname>. In
+          this example, for the first row (<literal>com.cnn.www</literal>), 
+          <varname>anchor</varname> contains two columns (<varname>anchor:cnnsi.com</varname>,
+          <varname>anchor:my.look.ca</varname>) and <varname>contents</varname> contains one column
+          (<varname>contents:html</varname>). This example contains 5 versions of the row with the
+        row key <literal>com.cnn.www</literal>, and one version of the row with the row key
+        <literal>com.example.www</literal>. The <varname>contents:html</varname> column qualifier contains the entire
+        HTML of a given website. Qualifiers of the <varname>anchor</varname> column family each
+        contain the external site which links to the site represented by the row, along with the
+        text it used in the anchor of its link. The <varname>people</varname> column family represents
+        people associated with the site.
+      </para>
+        <note>
+          <title>Column Names</title>
+        <para> By convention, a column name is made of its column family prefix and a
+            <emphasis>qualifier</emphasis>. For example, the column
+            <emphasis>contents:html</emphasis> is made up of the column family
+            <varname>contents</varname> and the <varname>html</varname> qualifier. The colon
+          character (<literal>:</literal>) delimits the column family from the column family
+            <emphasis>qualifier</emphasis>. </para>
+        </note>
+        <table
+          frame="all">
+          <title>Table <varname>webtable</varname></title>
+          <tgroup
+            cols="5"
+            align="left"
+            colsep="1"
+            rowsep="1">
+            <colspec
+              colname="c1" />
+            <colspec
+              colname="c2" />
+            <colspec
+              colname="c3" />
+            <colspec
+              colname="c4" />
+            <colspec
+              colname="c5" />
+            <thead>
+              <row>
+                <entry>Row Key</entry>
+                <entry>Time Stamp</entry>
+                <entry>ColumnFamily <varname>contents</varname></entry>
+                <entry>ColumnFamily <varname>anchor</varname></entry>
+                <entry>ColumnFamily <varname>people</varname></entry>
+              </row>
+            </thead>
+            <tbody>
+              <row>
+                <entry>"com.cnn.www"</entry>
+                <entry>t9</entry>
+                <entry />
+                <entry><varname>anchor:cnnsi.com</varname> = "CNN"</entry>
+                <entry />
+              </row>
+              <row>
+                <entry>"com.cnn.www"</entry>
+                <entry>t8</entry>
+                <entry />
+                <entry><varname>anchor:my.look.ca</varname> = "CNN.com"</entry>
+                <entry />
+              </row>
+              <row>
+                <entry>"com.cnn.www"</entry>
+                <entry>t6</entry>
+                <entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry>
+                <entry />
+                <entry />
+              </row>
+              <row>
+                <entry>"com.cnn.www"</entry>
+                <entry>t5</entry>
+                <entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry>
+                <entry />
+                <entry />
+              </row>
+              <row>
+                <entry>"com.cnn.www"</entry>
+                <entry>t3</entry>
+                <entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry>
+                <entry />
+                <entry />
+              </row>
+              <row>
+                <entry>"com.example.www"</entry>
+                <entry>t5</entry>
+                <entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry>
+                <entry></entry>
+                <entry>people:author = "John Doe"</entry>
+              </row>
+            </tbody>
+          </tgroup>
+        </table>
+      <para>Cells in this table that appear to be empty take no space, and in fact do not exist, in
+        HBase. This is what makes HBase "sparse." A tabular view is not the only possible way to
+        look at data in HBase, or even the most accurate. The following represents the same
+        information as a multi-dimensional map. This is only a mock-up for illustrative
+        purposes and may not be strictly accurate.</para>
+      <programlisting><![CDATA[
+{
+	"com.cnn.www": {
+		contents: {
+			t6: contents:html: "<html>..."
+			t5: contents:html: "<html>..."
+			t3: contents:html: "<html>..."
+		}
+		anchor: {
+			t9: anchor:cnnsi.com = "CNN"
+			t8: anchor:my.look.ca = "CNN.com"
+		}
+		people: {}
+	}
+	"com.example.www": {
+		contents: {
+			t5: contents:html: "<html>..."
+		}
+		anchor: {}
+		people: {
+			t5: people:author: "John Doe"
+		}
+	}
+}        
+        ]]></programlisting>
+
+    </section>
+    <section
+      xml:id="physical.view">
+      <title>Physical View</title>
+      <para> Although at a conceptual level tables may be viewed as a sparse set of rows, they are
+        physically stored by column family. A new column qualifier (column_family:column_qualifier)
+        can be added to an existing column family at any time.</para>
+      <table
+        frame="all">
+        <title>ColumnFamily <varname>anchor</varname></title>
+        <tgroup
+          cols="3"
+          align="left"
+          colsep="1"
+          rowsep="1">
+          <colspec
+            colname="c1" />
+          <colspec
+            colname="c2" />
+          <colspec
+            colname="c3" />
+          <thead>
+            <row>
+              <entry>Row Key</entry>
+              <entry>Time Stamp</entry>
+              <entry>Column Family <varname>anchor</varname></entry>
+            </row>
+          </thead>
+          <tbody>
+            <row>
+              <entry>"com.cnn.www"</entry>
+              <entry>t9</entry>
+              <entry><varname>anchor:cnnsi.com</varname> = "CNN"</entry>
+            </row>
+            <row>
+              <entry>"com.cnn.www"</entry>
+              <entry>t8</entry>
+              <entry><varname>anchor:my.look.ca</varname> = "CNN.com"</entry>
+            </row>
+          </tbody>
+        </tgroup>
+      </table>
+      <table
+        frame="all">
+        <title>ColumnFamily <varname>contents</varname></title>
+        <tgroup
+          cols="3"
+          align="left"
+          colsep="1"
+          rowsep="1">
+          <colspec
+            colname="c1" />
+          <colspec
+            colname="c2" />
+          <colspec
+            colname="c3" />
+          <thead>
+            <row>
+              <entry>Row Key</entry>
+              <entry>Time Stamp</entry>
+              <entry>ColumnFamily <varname>contents</varname></entry>
+            </row>
+          </thead>
+          <tbody>
+            <row>
+              <entry>"com.cnn.www"</entry>
+              <entry>t6</entry>
+              <entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry>
+            </row>
+            <row>
+              <entry>"com.cnn.www"</entry>
+              <entry>t5</entry>
+              <entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry>
+            </row>
+            <row>
+              <entry>"com.cnn.www"</entry>
+              <entry>t3</entry>
+              <entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry>
+            </row>
+          </tbody>
+        </tgroup>
+      </table>
+      <para>The empty cells shown in the
+        conceptual view are not stored at all.
+        Thus a request for the value of the <varname>contents:html</varname> column at time stamp
+          <literal>t8</literal> would return no value. Similarly, a request for an
+          <varname>anchor:my.look.ca</varname> value at time stamp <literal>t9</literal> would
+        return no value. However, if no timestamp is supplied, the most recent value for a
+        particular column would be returned. Given multiple versions, the most recent is also the
+        first one found,  since timestamps
+        are stored in descending order. Thus a request for the values of all columns in the row
+          <varname>com.cnn.www</varname> if no timestamp is specified would be: the value of
+          <varname>contents:html</varname> from timestamp <literal>t6</literal>, the value of
+          <varname>anchor:cnnsi.com</varname> from timestamp <literal>t9</literal>, the value of
+          <varname>anchor:my.look.ca</varname> from timestamp <literal>t8</literal>. </para>
+      <para>For more information about the internals of how Apache HBase stores data, see <xref
+          linkend="regions.arch" />. </para>
+    </section>
+
+    <section
+      xml:id="namespace">
+      <title>Namespace</title>
+      <para> A namespace is a logical grouping of tables analogous to a database in relational
+        database systems. This abstraction lays the groundwork for upcoming multi-tenancy related
+        features: <itemizedlist>
+          <listitem>
+            <para>Quota Management (HBASE-8410) - Restrict the amount of resources (i.e., regions,
+              tables) a namespace can consume.</para>
+          </listitem>
+          <listitem>
+            <para>Namespace Security Administration (HBASE-9206) - provide another level of security
+              administration for tenants.</para>
+          </listitem>
+          <listitem>
+            <para>Region server groups (HBASE-6721) - A namespace/table can be pinned onto a subset
+              of RegionServers, thus guaranteeing a coarse level of isolation.</para>
+          </listitem>
+        </itemizedlist>
+      </para>
+      <section
+        xml:id="namespace_creation">
+        <title>Namespace management</title>
+        <para> A namespace can be created, removed or altered. Namespace membership is determined
+          during table creation by specifying a fully-qualified table name of the form:</para>
+
+        <programlisting language="xml"><![CDATA[<table namespace>:<table qualifier>]]></programlisting>
+
+
+        <example>
+          <title>Examples</title>
+
+          <programlisting language="bourne">
+#Create a namespace
+create_namespace 'my_ns'
+            </programlisting>
+          <programlisting language="bourne">
+#create my_table in my_ns namespace
+create 'my_ns:my_table', 'fam'
+          </programlisting>
+          <programlisting language="bourne">
+#drop namespace
+drop_namespace 'my_ns'
+          </programlisting>
+          <programlisting language="bourne">
+#alter namespace
+alter_namespace 'my_ns', {METHOD => 'set', 'PROPERTY_NAME' => 'PROPERTY_VALUE'}
+        </programlisting>
+        </example>
+      </section>
+      <section
+        xml:id="namespace_special">
+        <title>Predefined namespaces</title>
+        <para> There are two predefined special namespaces: </para>
+        <itemizedlist>
+          <listitem>
+            <para>hbase - system namespace, used to contain hbase internal tables</para>
+          </listitem>
+          <listitem>
+            <para>default - tables with no explicitly specified namespace automatically fall into
+              this namespace.</para>
+          </listitem>
+        </itemizedlist>
+        <example>
+          <title>Examples</title>
+
+          <programlisting language="bourne">
+#namespace=foo and table qualifier=bar
+create 'foo:bar', 'fam'
+
+#namespace=default and table qualifier=bar
+create 'bar', 'fam'
+</programlisting>
+        </example>
+      </section>
+    </section>
+
+    <section
+      xml:id="table">
+      <title>Table</title>
+      <para> Tables are declared up front at schema definition time. </para>
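+      <para>The following is a minimal sketch (not from the source) of declaring a table with one
+        column family through the Java Admin API; the table name <code>my_table</code> and the
+        column family <code>cf</code> are example names only. The equivalent HBase Shell command is
+        <command>create 'my_table', 'cf'</command>.</para>
+      <programlisting language="java">
+Configuration conf = HBaseConfiguration.create();
+try (Connection connection = ConnectionFactory.createConnection(conf);
+     Admin admin = connection.getAdmin()) {
+  HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("my_table"));
+  desc.addFamily(new HColumnDescriptor("cf"));  // column families are declared up front here
+  admin.createTable(desc);
+}
+</programlisting>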
+    </section>
+
+    <section
+      xml:id="row">
+      <title>Row</title>
+      <para>Row keys are uninterpreted bytes. Rows are lexicographically sorted, with the lowest
+        order appearing first in a table. The empty byte array is used to denote both the start and
+        end of a table's namespace.</para>
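+      <para>As an illustration (the row keys here are hypothetical), lexicographic byte ordering
+        means that <literal>row-10</literal> sorts before <literal>row-2</literal>, so numeric
+        components of row keys should be zero-padded if numeric ordering is desired.</para>
+      <programlisting language="java">
+// Lexicographic comparison of two example row keys
+int cmp = Bytes.compareTo(Bytes.toBytes("row-10"), Bytes.toBytes("row-2"));
+// cmp &lt; 0: "row-10" appears before "row-2" in the table
+</programlisting>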
+    </section>
+
+    <section
+      xml:id="columnfamily">
+      <title>Column Family<indexterm><primary>Column Family</primary></indexterm></title>
+      <para> Columns in Apache HBase are grouped into <emphasis>column families</emphasis>. All
+        column members of a column family have the same prefix. For example, the columns
+          <emphasis>courses:history</emphasis> and <emphasis>courses:math</emphasis> are both
+        members of the <emphasis>courses</emphasis> column family. The colon character
+          (<literal>:</literal>) delimits the column family from the <indexterm><primary>column
+            family qualifier</primary><secondary>Column Family Qualifier</secondary></indexterm>.
+        The column family prefix must be composed of <emphasis>printable</emphasis> characters. The
+        qualifying tail, the column family <emphasis>qualifier</emphasis>, can be made of any
+        arbitrary bytes. Column families must be declared up front at schema definition time whereas
+        columns do not need to be defined at schema time but can be conjured on the fly while the
+        table is up and running.</para>
+      <para>Physically, all column family members are stored together on the filesystem. Because
+        tunings and storage specifications are done at the column family level, it is advised that
+        all column family members have the same general access pattern and size
+        characteristics.</para>
+
+    </section>
+    <section
+      xml:id="cells">
+      <title>Cells<indexterm><primary>Cells</primary></indexterm></title>
+      <para>A <emphasis>{row, column, version} </emphasis>tuple exactly specifies a
+          <literal>cell</literal> in HBase. Cell content is uninterpreted bytes.</para>
+    </section>
+    <section
+      xml:id="data_model_operations">
+      <title>Data Model Operations</title>
+      <para>The four primary data model operations are Get, Put, Scan, and Delete. Operations are
+        applied via <link
+          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html">Table</link>
+        instances.
+      </para>
+      <section
+        xml:id="get">
+        <title>Get</title>
+        <para><link
+            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html">Get</link>
+          returns attributes for a specified row. Gets are executed via <link
+            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#get(org.apache.hadoop.hbase.client.Get)">
+            Table.get</link>. </para>
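+        <para>The following is a minimal sketch of a single-row Get; the column family, qualifier,
+          and row key are example names only.</para>
+        <programlisting language="java">
+public static final byte[] CF = "cf".getBytes();
+public static final byte[] ATTR = "attr".getBytes();
+...
+Table table = ...      // instantiate a Table instance
+
+Get get = new Get(Bytes.toBytes("row1"));
+get.addColumn(CF, ATTR);        // optional: restrict the Get to a single column
+Result result = table.get(get);
+byte[] value = result.getValue(CF, ATTR);
+</programlisting>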
+      </section>
+      <section
+        xml:id="put">
+        <title>Put</title>
+        <para><link
+            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Put.html">Put</link>
+          either adds new rows to a table (if the key is new) or can update existing rows (if the
+          key already exists). Puts are executed via <link
+            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#put(org.apache.hadoop.hbase.client.Put)">
+            Table.put</link> (writeBuffer) or <link
+            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#batch(java.util.List, java.lang.Object[])">
+            Table.batch</link> (non-writeBuffer). </para>
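+        <para>The following is a minimal sketch of a single Put followed by a grouped submission
+          via Table.batch; the row keys and values are examples only, and <code>CF</code>,
+          <code>ATTR</code>, and <code>table</code> are as in the other examples in this
+          chapter.</para>
+        <programlisting language="java">
+Put put = new Put(Bytes.toBytes("row1"));
+put.add(CF, ATTR, Bytes.toBytes("value1"));
+table.put(put);
+
+// Several mutations can be grouped and submitted together with Table.batch
+List&lt;Row&gt; actions = new ArrayList&lt;Row&gt;();
+actions.add(new Put(Bytes.toBytes("row2")).add(CF, ATTR, Bytes.toBytes("value2")));
+actions.add(new Put(Bytes.toBytes("row3")).add(CF, ATTR, Bytes.toBytes("value3")));
+Object[] results = new Object[actions.size()];
+table.batch(actions, results);
+</programlisting>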
+      </section>
+      <section
+        xml:id="scan">
+        <title>Scans</title>
+        <para><link
+            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html">Scan</link>
+          allows iteration over multiple rows for specified attributes. </para>
+        <para>The following is an example of a Scan on a Table instance. Assume that a table is
+          populated with rows with keys "row1", "row2", "row3", and then another set of rows with
+          the keys "abc1", "abc2", and "abc3". The following example shows how to set a Scan
+          instance to return the rows beginning with "row".</para>
+<programlisting language="java">
+public static final byte[] CF = "cf".getBytes();
+public static final byte[] ATTR = "attr".getBytes();
+...
+
+Table table = ...      // instantiate a Table instance
+
+Scan scan = new Scan();
+scan.addColumn(CF, ATTR);
+scan.setRowPrefixFilter(Bytes.toBytes("row"));
+ResultScanner rs = table.getScanner(scan);
+try {
+  for (Result r = rs.next(); r != null; r = rs.next()) {
+    // process result...
+  }
+} finally {
+  rs.close();  // always close the ResultScanner!
+}
+</programlisting>
+        <para>Note that generally the easiest way to specify a specific stop point for a scan is by
+          using the <link
+            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/InclusiveStopFilter.html">InclusiveStopFilter</link>
+          class. </para>
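+        <para>The following is a minimal sketch of using InclusiveStopFilter with the example rows
+          above: it scans from "abc1" up to and including "abc3".</para>
+        <programlisting language="java">
+Scan scan = new Scan();
+scan.addColumn(CF, ATTR);
+scan.setStartRow(Bytes.toBytes("abc1"));
+scan.setFilter(new InclusiveStopFilter(Bytes.toBytes("abc3")));  // the stop row is included
+ResultScanner rs = table.getScanner(scan);
+try {
+  for (Result r = rs.next(); r != null; r = rs.next()) {
+    // process result...
+  }
+} finally {
+  rs.close();
+}
+</programlisting>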
+      </section>
+      <section
+        xml:id="delete">
+        <title>Delete</title>
+        <para><link
+            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Delete.html">Delete</link>
+          removes a row from a table. Deletes are executed via <link
+            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#delete(org.apache.hadoop.hbase.client.Delete)">
+            Table.delete</link>. </para>
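+        <para>The following is a minimal sketch of deleting an entire row; the row key is an
+          example only.</para>
+        <programlisting language="java">
+Delete delete = new Delete(Bytes.toBytes("row1"));
+table.delete(delete);
+</programlisting>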
+        <para>HBase does not modify data in place, and so deletes are handled by creating new
+          markers called <emphasis>tombstones</emphasis>. These tombstones, along with the dead
+          values, are cleaned up on major compactions. </para>
+        <para>See <xref
+            linkend="version.delete" /> for more information on deleting versions of columns, and
+          see <xref
+            linkend="compaction" /> for more information on compactions. </para>
+
+      </section>
+
+    </section>
+
+
+    <section
+      xml:id="versions">
+      <title>Versions<indexterm><primary>Versions</primary></indexterm></title>
+
+      <para>A <emphasis>{row, column, version} </emphasis>tuple exactly specifies a
+          <literal>cell</literal> in HBase. It's possible to have an unbounded number of cells where
+        the row and column are the same but the cell address differs only in its version
+        dimension.</para>
+
+      <para>While rows and column keys are expressed as bytes, the version is specified using a long
+        integer. Typically this long contains time instances such as those returned by
+          <code>java.util.Date.getTime()</code> or <code>System.currentTimeMillis()</code>, that is:
+          <quote>the difference, measured in milliseconds, between the current time and midnight,
+          January 1, 1970 UTC</quote>.</para>
+
+      <para>The HBase version dimension is stored in decreasing order, so that when reading from a
+        store file, the most recent values are found first.</para>
+
+      <para>There is a lot of confusion over the semantics of <literal>cell</literal> versions in
+        HBase. In particular:</para>
+      <itemizedlist>
+        <listitem>
+          <para>If multiple writes to a cell have the same version, only the last written is
+            fetchable.</para>
+        </listitem>
+
+        <listitem>
+          <para>It is OK to write cells in a non-increasing version order.</para>
+        </listitem>
+      </itemizedlist>
+
+      <para>Below we describe how the version dimension in HBase currently works. See <link
+              xlink:href="https://issues.apache.org/jira/browse/HBASE-2406">HBASE-2406</link> for
+            discussion of HBase versions. <link
+              xlink:href="http://outerthought.org/blog/417-ot.html">Bending time in HBase</link>
+            makes for a good read on the version, or time, dimension in HBase. It has more detail on
+            versioning than is provided here. As of this writing, the limitation
+              <emphasis>Overwriting values at existing timestamps</emphasis> mentioned in the
+            article no longer holds in HBase. This section is basically a synopsis of this article
+            by Bruno Dumon.</para>
+      
+      <section xml:id="specify.number.of.versions">
+        <title>Specifying the Number of Versions to Store</title>
+        <para>The maximum number of versions to store for a given column is part of the column
+          schema and is specified at table creation, or via an <command>alter</command> command, via
+            <code>HColumnDescriptor.DEFAULT_VERSIONS</code>. Prior to HBase 0.96, the default number
+          of versions kept was <literal>3</literal>, but in 0.96 and newer has been changed to
+            <literal>1</literal>.</para>
+        <example>
+          <title>Modify the Maximum Number of Versions for a Column</title>
+          <para>This example uses HBase Shell to keep a maximum of 5 versions of column
+              <code>f1</code>. You could also use <link
+              xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html"
+              >HColumnDescriptor</link>.</para>
+          <screen><![CDATA[hbase> alter 't1', NAME => 'f1', VERSIONS => 5]]></screen>
+        </example>
+        <example>
+          <title>Modify the Minimum Number of Versions for a Column</title>
+          <para>You can also specify the minimum number of versions to store. By default, this is
+            set to 0, which means the feature is disabled. The following example sets the minimum
+            number of versions on field <code>f1</code> to <literal>2</literal>, via HBase Shell.
+            You could also use <link
+              xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html"
+              >HColumnDescriptor</link>.</para>
+          <screen><![CDATA[hbase> alter 't1', NAME => 'f1', MIN_VERSIONS => 2]]></screen>
+        </example>
+        <para>Starting with HBase 0.98.2, you can specify a global default for the maximum number of
+          versions kept for all newly-created columns, by setting
+            <option>hbase.column.max.version</option> in <filename>hbase-site.xml</filename>. See
+            <xref linkend="hbase.column.max.version"/>.</para>
+      </section>
+
+      <section
+        xml:id="versions.ops">
+        <title>Versions and HBase Operations</title>
+
+        <para>In this section we look at the behavior of the version dimension for each of the core
+          HBase operations.</para>
+
+        <section>
+          <title>Get/Scan</title>
+
+          <para>Gets are implemented on top of Scans. The below discussion of <link
+              xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html">Get</link>
+            applies equally to <link
+              xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html">Scans</link>.</para>
+
+          <para>By default, i.e. if you specify no explicit version, when doing a
+              <literal>get</literal>, the cell whose version has the largest value is returned
+            (which may or may not be the latest one written, see later). The default behavior can be
+            modified in the following ways:</para>
+
+          <itemizedlist>
+            <listitem>
+              <para>to return more than one version, see <link
+                  xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html#setMaxVersions()">Get.setMaxVersions()</link></para>
+            </listitem>
+
+            <listitem>
+              <para>to return versions other than the latest, see <link
+                  xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html#setTimeRange(long, long)">Get.setTimeRange()</link></para>
+
+              <para>To retrieve the latest version that is less than or equal to a given value, thus
+                giving the 'latest' state of the record at a certain point in time, just use a range
+                from 0 to the desired version and set the max versions to 1, as shown in the sketch
+                after this list.</para>
+            </listitem>
+          </itemizedlist>
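+          <para>The following is a minimal sketch of such a point-in-time read; the row key and the
+            timestamp are examples only. Note that the upper bound of <code>setTimeRange</code> is
+            exclusive, so add 1 to include the target timestamp.</para>
+          <programlisting language="java">
+long ts = 555;  // just an example
+Get get = new Get(Bytes.toBytes("row1"));
+get.setTimeRange(0, ts + 1);  // the upper bound is exclusive, so add 1 to include ts
+get.setMaxVersions(1);        // only the newest version within the range
+Result r = table.get(get);
+</programlisting>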
+
+        </section>
+        <section
+          xml:id="default_get_example">
+          <title>Default Get Example</title>
+          <para>The following Get will retrieve only the current version of the row.</para>
+          <programlisting language="java">
+public static final byte[] CF = "cf".getBytes();
+public static final byte[] ATTR = "attr".getBytes();
+...
+Get get = new Get(Bytes.toBytes("row1"));
+Result r = table.get(get);
+byte[] b = r.getValue(CF, ATTR);  // returns current version of value
+</programlisting>
+        </section>
+        <section
+          xml:id="versioned_get_example">
+          <title>Versioned Get Example</title>
+          <para>The following Get will return the last 3 versions of the row.</para>
+          <programlisting language="java">
+public static final byte[] CF = "cf".getBytes();
+public static final byte[] ATTR = "attr".getBytes();
+...
+Get get = new Get(Bytes.toBytes("row1"));
+get.setMaxVersions(3);  // will return last 3 versions of row
+Result r = table.get(get);
+byte[] b = r.getValue(CF, ATTR);  // returns current version of value
+List&lt;KeyValue&gt; kv = r.getColumn(CF, ATTR);  // returns all versions of this column
+</programlisting>
+        </section>
+
+        <section>
+          <title>Put</title>
+
+          <para>Doing a put always creates a new version of a <literal>cell</literal>, at a certain
+            timestamp. By default the system uses the server's <literal>currentTimeMillis</literal>,
+            but you can specify the version (= the long integer) yourself, on a per-column level.
+            This means you could assign a time in the past or the future, or use the long value for
+            non-time purposes.</para>
+
+          <para>To overwrite an existing value, do a put at exactly the same row, column, and
+            version as that of the cell you would overshadow.</para>
+          <section
+            xml:id="implicit_version_example">
+            <title>Implicit Version Example</title>
+            <para>The following Put will be implicitly versioned by HBase with the current
+              time.</para>
+            <programlisting language="java">
+public static final byte[] CF = "cf".getBytes();
+public static final byte[] ATTR = "attr".getBytes();
+...
+Put put = new Put(Bytes.toBytes(row));
+put.add(CF, ATTR, Bytes.toBytes(data));
+table.put(put);
+</programlisting>
+          </section>
+          <section
+            xml:id="explicit_version_example">
+            <title>Explicit Version Example</title>
+            <para>The following Put has the version timestamp explicitly set.</para>
+            <programlisting language="java">
+public static final byte[] CF = "cf".getBytes();
+public static final byte[] ATTR = "attr".getBytes();
+...
+Put put = new Put(Bytes.toBytes(row));
+long explicitTimeInMs = 555;  // just an example
+put.add(CF, ATTR, explicitTimeInMs, Bytes.toBytes(data));
+table.put(put);
+</programlisting>
+            <para>Caution: the version timestamp is used internally by HBase for things like time-to-live
+              calculations. It's usually best to avoid setting this timestamp yourself. Prefer using
+              a separate timestamp attribute of the row, or have the timestamp a part of the rowkey,
+              or both. </para>
+          </section>
+
+        </section>
+
+        <section
+          xml:id="version.delete">
+          <title>Delete</title>
+
+          <para>There are three different types of internal delete markers. See Lars Hofhansl's blog
+            for discussion of his attempt adding another, <link
+              xlink:href="http://hadoop-hbase.blogspot.com/2012/01/scanning-in-hbase.html">Scanning
+              in HBase: Prefix Delete Marker</link>. </para>
+          <itemizedlist>
+            <listitem>
+              <para>Delete: for a specific version of a column.</para>
+            </listitem>
+            <listitem>
+              <para>Delete column: for all versions of a column.</para>
+            </listitem>
+            <listitem>
+              <para>Delete family: for all columns of a particular ColumnFamily</para>
+            </listitem>
+          </itemizedlist>
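+          <para>The following is a minimal sketch showing how each of the three marker types can be
+            created through the client API; the row key, column names, and timestamp are examples
+            only. These are the classic method names; newer client versions also expose equivalent
+            addColumn, addColumns, and addFamily methods.</para>
+          <programlisting language="java">
+long ts = 555;  // just an example
+
+Delete d1 = new Delete(Bytes.toBytes("row1"));
+d1.deleteColumn(CF, ATTR, ts);   // Delete: only the version of the column at ts
+
+Delete d2 = new Delete(Bytes.toBytes("row1"));
+d2.deleteColumns(CF, ATTR);      // Delete column: all versions of the column
+
+Delete d3 = new Delete(Bytes.toBytes("row1"));
+d3.deleteFamily(CF);             // Delete family: all columns of the family
+
+table.delete(d1);                // submit whichever marker is needed
+</programlisting>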
+          <para>When deleting an entire row, HBase will internally create a tombstone for each
+            ColumnFamily (i.e., not each individual column). </para>
+          <para>Deletes work by creating <emphasis>tombstone</emphasis> markers. For example, let's
+            suppose we want to delete a row. For this you can specify a version, or else by default
+            the <literal>currentTimeMillis</literal> is used. What this means is <quote>delete all
+              cells where the version is less than or equal to this version</quote>. HBase never
+            modifies data in place, so for example a delete will not immediately delete (or mark as
+            deleted) the entries in the storage file that correspond to the delete condition.
+            Rather, a so-called <emphasis>tombstone</emphasis> is written, which will mask the
+            deleted values. When HBase does a major compaction, the tombstones are processed to
+            actually remove the dead values, together with the tombstones themselves. If the version
+            you specified when deleting a row is larger than the version of any value in the row,
+            then you can consider the complete row to be deleted.</para>
+          <para>For an informative discussion on how deletes and versioning interact, see the thread <link
+              xlink:href="http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/28421">Put w/
+              timestamp -> Deleteall -> Put w/ timestamp fails</link> up on the user mailing
+            list.</para>
+          <para>Also see <xref
+              linkend="keyvalue" /> for more information on the internal KeyValue format. </para>
+          <para>Delete markers are purged during the next major compaction of the store, unless the
+              <option>KEEP_DELETED_CELLS</option> option is set in the column family. To keep the
+            deletes for a configurable amount of time, you can set the delete TTL via the
+              <option>hbase.hstore.time.to.purge.deletes</option> property in
+              <filename>hbase-site.xml</filename>. If
+              <option>hbase.hstore.time.to.purge.deletes</option> is not set, or set to 0, all
+            delete markers, including those with timestamps in the future, are purged during the
+            next major compaction. Otherwise, a delete marker with a timestamp in the future is kept
+            until the major compaction which occurs after the time represented by the marker's
+            timestamp plus the value of <option>hbase.hstore.time.to.purge.deletes</option>, in
+            milliseconds. </para>
+          <note>
+            <para>This behavior represents a fix for an unexpected change that was introduced in
+              HBase 0.94, and was fixed in <link
+                xlink:href="https://issues.apache.org/jira/browse/HBASE-10118">HBASE-10118</link>.
+              The change has been backported to HBase 0.94 and newer branches.</para>
+          </note>
+        </section>
+      </section>
+
+      <section>
+        <title>Current Limitations</title>
+
+        <section>
+          <title>Deletes mask Puts</title>
+
+          <para>Deletes mask puts, even puts that happened after the delete
+          was entered. See <link xlink:href="https://issues.apache.org/jira/browse/HBASE-2256"
+              >HBASE-2256</link>. Remember that a delete writes a tombstone, which only
+          disappears after the next major compaction has run. Suppose you do
+          a delete of everything &lt;= T. After this you do a new put with a
+          timestamp &lt;= T. This put, even if it happened after the delete,
+          will be masked by the delete tombstone. Performing the put will not
+          fail, but when you do a get you will notice the put had no
+          effect. It will start working again after the major compaction has
+          run. These issues should not be a problem if you use
+          always-increasing versions for new puts to a row. But they can occur
+          even if you do not care about time: just do delete and put
+          immediately after each other, and there is some chance they happen
+          within the same millisecond.</para>
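+          <para>The following is a minimal sketch of the scenario described above; the row key,
+            column names, and the timestamp <code>T</code> are examples only.</para>
+          <programlisting language="java">
+long T = 1000;  // just an example
+
+Delete delete = new Delete(Bytes.toBytes("row1"), T);  // tombstone everything &lt;= T
+table.delete(delete);
+
+Put put = new Put(Bytes.toBytes("row1"));
+put.add(CF, ATTR, T, Bytes.toBytes("data"));           // put at a timestamp &lt;= T
+table.put(put);
+
+Result r = table.get(new Get(Bytes.toBytes("row1")));
+// r contains no value for CF:ATTR until a major compaction removes the tombstone,
+// even though the put happened after the delete.
+</programlisting>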
+        </section>
+
+        <section
+          xml:id="major.compactions.change.query.results">
+          <title>Major compactions change query results</title>
+          
+          <para><quote>...create three cell versions at t1, t2 and t3, with a maximum-versions
+              setting of 2. So when getting all versions, only the values at t2 and t3 will be
+              returned. But if you delete the version at t2 or t3, the one at t1 will appear again.
+              Obviously, once a major compaction has run, such behavior will not be the case
+              anymore...</quote> (See <emphasis>Garbage Collection</emphasis> in <link
+              xlink:href="http://outerthought.org/blog/417-ot.html">Bending time in
+            HBase</link>.)</para>
+        </section>
+      </section>
+    </section>
+    <section xml:id="dm.sort">
+      <title>Sort Order</title>
+      <para>All data model operations in HBase return data in sorted order: first by row,
+      then by ColumnFamily, followed by column qualifier, and finally timestamp (sorted
+      in reverse, so newest records are returned first).
+      </para>
+    </section>
+    <section xml:id="dm.column.metadata">
+      <title>Column Metadata</title>
+      <para>There is no store of column metadata outside of the internal KeyValue instances for a ColumnFamily.
+      Thus, while HBase can support not only a large number of columns per row but also a heterogeneous set of columns
+      between rows, it is your responsibility to keep track of the column names.
+      </para>
+      <para>The only way to get a complete set of columns that exist for a ColumnFamily is to process all the rows.
+      For more information about how HBase stores data internally, see <xref linkend="keyvalue" />.
+	  </para>
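+      <para>The following is a minimal sketch (reusing the <code>CF</code> and <code>table</code>
+        variables from the earlier examples) that scans every row of a ColumnFamily and collects
+        the set of column qualifiers it encounters.</para>
+      <programlisting language="java">
+Set&lt;String&gt; qualifiers = new TreeSet&lt;String&gt;();
+Scan scan = new Scan();
+scan.addFamily(CF);
+ResultScanner rs = table.getScanner(scan);
+try {
+  for (Result r = rs.next(); r != null; r = rs.next()) {
+    for (byte[] qualifier : r.getFamilyMap(CF).keySet()) {
+      qualifiers.add(Bytes.toString(qualifier));
+    }
+  }
+} finally {
+  rs.close();
+}
+</programlisting>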
+    </section>
+    <section xml:id="joins"><title>Joins</title>
+      <para>Whether HBase supports joins is a common question on the dist-list, and there is a simple answer:  it doesn't,
+      at least not in the way that RDBMSs support them (e.g., with equi-joins or outer-joins in SQL).  As has been illustrated
+      in this chapter, the read data model operations in HBase are Get and Scan.
+      </para>
+      <para>However, that doesn't mean that equivalent join functionality can't be supported in your application;
+      you just have to do it yourself.  The two primary strategies are either denormalizing the data upon writing to HBase,
+      or maintaining lookup tables and doing the join between HBase tables in your application or MapReduce code (and as RDBMSs
+      demonstrate, there are several strategies for this depending on the size of the tables, e.g., nested loops vs.
+      hash-joins).  So which is the best approach?  It depends on what you are trying to do, and as such there isn't a single
+      answer that works for every use case.
+      </para>
+    </section>
+    <section xml:id="acid"><title>ACID</title>
+        <para>See <link xlink:href="http://hbase.apache.org/acid-semantics.html">ACID Semantics</link>.
+            Lars Hofhansl has also written a note on
+            <link xlink:href="http://hadoop-hbase.blogspot.com/2012/03/acid-in-hbase.html">ACID in HBase</link>.</para>
+    </section>
+  </chapter>


[2/8] hbase git commit: HBASE-12738 Chunk Ref Guide into file-per-chapter

Posted by mi...@apache.org.
http://git-wip-us.apache.org/repos/asf/hbase/blob/a1fe1e09/src/main/docbkx/hbase-default.xml
----------------------------------------------------------------------
diff --git a/src/main/docbkx/hbase-default.xml b/src/main/docbkx/hbase-default.xml
new file mode 100644
index 0000000..125e3d2
--- /dev/null
+++ b/src/main/docbkx/hbase-default.xml
@@ -0,0 +1,538 @@
+<?xml version="1.0" encoding="UTF-8"?><glossary xml:id="hbase_default_configurations" version="5.0" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:m="http://www.w3.org/1998/Math/MathML" xmlns:db="http://docbook.org/ns/docbook" xmlns:xi="http://www.w3.org/2001/XInclude" xmlns:svg="http://www.w3.org/2000/svg" xmlns:html="http://www.w3.org/1999/xhtml" xmlns="http://docbook.org/ns/docbook"><title>HBase Default Configuration</title><para>
+The documentation below is generated using the default hbase configuration file,
+<filename>hbase-default.xml</filename>, as source.
+</para><glossentry xml:id="hbase.tmp.dir"><glossterm><varname>hbase.tmp.dir</varname></glossterm><glossdef><para>Temporary directory on the local filesystem.
+    Change this setting to point to a location more permanent
+    than '/tmp', the usual resolve for java.io.tmpdir, as the
+    '/tmp' directory is cleared on machine restart.</para><formalpara><title>Default</title><para><varname>${java.io.tmpdir}/hbase-${user.name}</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.rootdir"><glossterm><varname>hbase.rootdir</varname></glossterm><glossdef><para>The directory shared by region servers and into
+    which HBase persists.  The URL should be 'fully-qualified'
+    to include the filesystem scheme.  For example, to specify the
+    HDFS directory '/hbase' where the HDFS instance's namenode is
+    running at namenode.example.org on port 9000, set this value to:
+    hdfs://namenode.example.org:9000/hbase.  By default, we write
+    to whatever ${hbase.tmp.dir} is set to -- usually /tmp --
+    so change this configuration or else all data will be lost on
+    machine restart.</para><formalpara><title>Default</title><para><varname>${hbase.tmp.dir}/hbase</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.cluster.distributed"><glossterm><varname>hbase.cluster.distributed</varname></glossterm><glossdef><para>The mode the cluster will be in. Possible values are
+      false for standalone mode and true for distributed mode.  If
+      false, startup will run all HBase and ZooKeeper daemons together
+      in the one JVM.</para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.zookeeper.quorum"><glossterm><varname>hbase.zookeeper.quorum</varname></glossterm><glossdef><para>Comma separated list of servers in the ZooKeeper ensemble
+    (This config. should have been named hbase.zookeeper.ensemble).
+    For example, "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com".
+    By default this is set to localhost for local and pseudo-distributed modes
+    of operation. For a fully-distributed setup, this should be set to a full
+    list of ZooKeeper ensemble servers. If HBASE_MANAGES_ZK is set in hbase-env.sh
+    this is the list of servers which hbase will start/stop ZooKeeper on as
+    part of cluster start/stop.  Client-side, we will take this list of
+    ensemble members and put it together with the hbase.zookeeper.clientPort
+    config. and pass it into zookeeper constructor as the connectString
+    parameter.</para><formalpara><title>Default</title><para><varname>localhost</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.local.dir"><glossterm><varname>hbase.local.dir</varname></glossterm><glossdef><para>Directory on the local filesystem to be used
+    as a local storage.</para><formalpara><title>Default</title><para><varname>${hbase.tmp.dir}/local/</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.master.info.port"><glossterm><varname>hbase.master.info.port</varname></glossterm><glossdef><para>The port for the HBase Master web UI.
+    Set to -1 if you do not want a UI instance run.</para><formalpara><title>Default</title><para><varname>16010</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.master.info.bindAddress"><glossterm><varname>hbase.master.info.bindAddress</varname></glossterm><glossdef><para>The bind address for the HBase Master web UI
+    </para><formalpara><title>Default</title><para><varname>0.0.0.0</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.master.logcleaner.plugins"><glossterm><varname>hbase.master.logcleaner.plugins</varname></glossterm><glossdef><para>A comma-separated list of BaseLogCleanerDelegate invoked by
+    the LogsCleaner service. These WAL cleaners are called in order,
+    so put the cleaner that prunes the most files in front. To
+    implement your own BaseLogCleanerDelegate, just put it in HBase's classpath
+    and add the fully qualified class name here. Always add the above
+    default log cleaners in the list.</para><formalpara><title>Default</title><para><varname>org.apache.hadoop.hbase.master.cleaner.TimeToLiveLogCleaner</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.master.logcleaner.ttl"><glossterm><varname>hbase.master.logcleaner.ttl</varname></glossterm><glossdef><para>Maximum time a WAL can stay in the .oldlogdir directory,
+    after which it will be cleaned by a Master thread.</para><formalpara><title>Default</title><para><varname>600000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.master.hfilecleaner.plugins"><glossterm><varname>hbase.master.hfilecleaner.plugins</varname></glossterm><glossdef><para>A comma-separated list of BaseHFileCleanerDelegate invoked by
+    the HFileCleaner service. These HFiles cleaners are called in order,
+    so put the cleaner that prunes the most files in front. To
+    implement your own BaseHFileCleanerDelegate, just put it in HBase's classpath
+    and add the fully qualified class name here. Always add the above
+    default log cleaners in the list as they will be overwritten in
+    hbase-site.xml.</para><formalpara><title>Default</title><para><varname>org.apache.hadoop.hbase.master.cleaner.TimeToLiveHFileCleaner</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.master.catalog.timeout"><glossterm><varname>hbase.master.catalog.timeout</varname></glossterm><glossdef><para>Timeout value for the Catalog Janitor from the master to
+    META.</para><formalpara><title>Default</title><para><varname>600000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.master.infoserver.redirect"><glossterm><varname>hbase.master.infoserver.redirect</varname></glossterm><glossdef><para>Whether or not the Master listens to the Master web
+      UI port (hbase.master.info.port) and redirects requests to the web
+      UI server shared by the Master and RegionServer.</para><formalpara><title>Default</title><para><varname>true</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.port"><glossterm><varname>hbase.regionserver.port</varname></glossterm><glossdef><para>The port the HBase RegionServer binds to.</para><formalpara><title>Default</title><para><varname>16020</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.info.port"><glossterm><varname>hbase.regionserver.info.port</varname></glossterm><glossdef><para>The port for the HBase RegionServer web UI
+    Set to -1 if you do not want the RegionServer UI to run.</para><formalpara><title>Default</title><para><varname>16030</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.info.bindAddress"><glossterm><varname>hbase.regionserver.info.bindAddress</varname></glossterm><glossdef><para>The address for the HBase RegionServer web UI</para><formalpara><title>Default</title><para><varname>0.0.0.0</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.info.port.auto"><glossterm><varname>hbase.regionserver.info.port.auto</varname></glossterm><glossdef><para>Whether or not the Master or RegionServer
+    UI should search for a port to bind to. Enables automatic port
+    search if hbase.regionserver.info.port is already in use.
+    Useful for testing, turned off by default.</para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.handler.count"><glossterm><varname>hbase.regionserver.handler.count</varname></glossterm><glossdef><para>Count of RPC Listener instances spun up on RegionServers.
+    Same property is used by the Master for count of master handlers.</para><formalpara><title>Default</title><para><varname>30</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.ipc.server.callqueue.handler.factor"><glossterm><varname>hbase.ipc.server.callqueue.handler.factor</varname></glossterm><glossdef><para>Factor to determine the number of call queues.
+      A value of 0 means a single queue shared between all the handlers.
+      A value of 1 means that each handler has its own queue.</para><formalpara><title>Default</title><para><varname>0.1</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.ipc.server.callqueue.read.ratio"><glossterm><varname>hbase.ipc.server.callqueue.read.ratio</varname></glossterm><glossdef><para>Split the call queues into read and write queues.
+      The specified interval (which should be between 0.0 and 1.0)
+      will be multiplied by the number of call queues.
+      A value of 0 indicates not to split the call queues, meaning that both read and write
+      requests will be pushed to the same set of queues.
+      A value lower than 0.5 means that there will be less read queues than write queues.
+      A value of 0.5 means there will be the same number of read and write queues.
+      A value greater than 0.5 means that there will be more read queues than write queues.
+      A value of 1.0 means that all the queues except one are used to dispatch read requests.
+
+      Example: Given the total number of call queues being 10
+      a read.ratio of 0 means that: the 10 queues will contain both read/write requests.
+      a read.ratio of 0.3 means that: 3 queues will contain only read requests
+      and 7 queues will contain only write requests.
+      a read.ratio of 0.5 means that: 5 queues will contain only read requests
+      and 5 queues will contain only write requests.
+      a read.ratio of 0.8 means that: 8 queues will contain only read requests
+      and 2 queues will contain only write requests.
+      a read.ratio of 1 means that: 9 queues will contain only read requests
+      and 1 queue will contain only write requests.
+    </para><formalpara><title>Default</title><para><varname>0</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.ipc.server.callqueue.scan.ratio"><glossterm><varname>hbase.ipc.server.callqueue.scan.ratio</varname></glossterm><glossdef><para>Given the number of read call queues, calculated from the total number
+      of call queues multiplied by the callqueue.read.ratio, the scan.ratio property
+      will split the read call queues into small-read and long-read queues.
+      A value lower than 0.5 means that there will be less long-read queues than short-read queues.
+      A value of 0.5 means that there will be the same number of short-read and long-read queues.
+      A value greater than 0.5 means that there will be more long-read queues than short-read queues.
+      A value of 0 or 1 indicates that the same set of queues is used for gets and scans.
+
+      Example: Given the total number of read call queues being 8
+      a scan.ratio of 0 or 1 means that: 8 queues will contain both long and short read requests.
+      a scan.ratio of 0.3 means that: 2 queues will contain only long-read requests
+      and 6 queues will contain only short-read requests.
+      a scan.ratio of 0.5 means that: 4 queues will contain only long-read requests
+      and 4 queues will contain only short-read requests.
+      a scan.ratio of 0.8 means that: 6 queues will contain only long-read requests
+      and 2 queues will contain only short-read requests.
+    </para><formalpara><title>Default</title><para><varname>0</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.msginterval"><glossterm><varname>hbase.regionserver.msginterval</varname></glossterm><glossdef><para>Interval between messages from the RegionServer to Master
+    in milliseconds.</para><formalpara><title>Default</title><para><varname>3000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.regionSplitLimit"><glossterm><varname>hbase.regionserver.regionSplitLimit</varname></glossterm><glossdef><para>Limit for the number of regions after which no more region
+    splitting should take place. This is not a hard limit for the number of
+    regions but acts as a guideline for the regionserver to stop splitting after
+    a certain limit. Default is MAX_INT; i.e. do not block splitting.</para><formalpara><title>Default</title><para><varname>2147483647</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.logroll.period"><glossterm><varname>hbase.regionserver.logroll.period</varname></glossterm><glossdef><para>Period at which we will roll the commit log regardless
+    of how many edits it has.</para><formalpara><title>Default</title><para><varname>3600000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.logroll.errors.tolerated"><glossterm><varname>hbase.regionserver.logroll.errors.tolerated</varname></glossterm><glossdef><para>The number of consecutive WAL close errors we will allow
+    before triggering a server abort.  A setting of 0 will cause the
+    region server to abort if closing the current WAL writer fails during
+    log rolling.  Even a small value (2 or 3) will allow a region server
+    to ride over transient HDFS errors.</para><formalpara><title>Default</title><para><varname>2</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.hlog.reader.impl"><glossterm><varname>hbase.regionserver.hlog.reader.impl</varname></glossterm><glossdef><para>The WAL file reader implementation.</para><formalpara><title>Default</title><para><varname>org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.hlog.writer.impl"><glossterm><varname>hbase.regionserver.hlog.writer.impl</varname></glossterm><glossdef><para>The WAL file writer implementation.</para><formalpara><title>Default</title><para><varname>org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.master.distributed.log.replay"><glossterm><varname>hbase.master.distributed.log.replay</varname></glossterm><glossd
 ef><para>Enable 'distributed log replay' as default engine splitting
+    WAL files on server crash.  This default is new in hbase 1.0.  To fall
+    back to the old mode 'distributed log splitter', set the value to
+    'false'.  'Distributed log replay' improves MTTR because it does not
+    write intermediate files.  'DLR' requires that 'hfile.format.version'
+    be set to version 3 or higher. 
+    </para><formalpara><title>Default</title><para><varname>true</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.global.memstore.size"><glossterm><varname>hbase.regionserver.global.memstore.size</varname></glossterm><glossdef><para>Maximum size of all memstores in a region server before new
+      updates are blocked and flushes are forced. Defaults to 40% of heap.
+      Updates are blocked and flushes are forced until size of all memstores
+      in a region server hits hbase.regionserver.global.memstore.size.lower.limit.</para><formalpara><title>Default</title><para><varname>0.4</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.global.memstore.size.lower.limit"><glossterm><varname>hbase.regionserver.global.memstore.size.lower.limit</varname></glossterm><glossdef><para>Maximum size of all memstores in a region server before flushes are forced.
+      Defaults to 95% of hbase.regionserver.global.memstore.size.
+      A 100% value for this value causes the minimum possible flushing to occur when updates are 
+      blocked due to memstore limiting.</para><formalpara><title>Default</title><para><varname>0.95</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.optionalcacheflushinterval"><glossterm><varname>hbase.regionserver.optionalcacheflushinterval</varname></glossterm><glossdef><para>
+    Maximum amount of time an edit lives in memory before being automatically flushed.
+    Default 1 hour. Set it to 0 to disable automatic flushing.</para><formalpara><title>Default</title><para><varname>3600000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.catalog.timeout"><glossterm><varname>hbase.regionserver.catalog.timeout</varname></glossterm><glossdef><para>Timeout value for the Catalog Janitor from the regionserver to META.</para><formalpara><title>Default</title><para><varname>600000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.dns.interface"><glossterm><varname>hbase.regionserver.dns.interface</varname></glossterm><glossdef><para>The name of the Network Interface from which a region server
+      should report its IP address.</para><formalpara><title>Default</title><para><varname>default</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.dns.nameserver"><glossterm><varname>hbase.regionserver.dns.nameserver</varname></glossterm><glossdef><para>The host name or IP address of the name server (DNS)
+      which a region server should use to determine the host name used by the
+      master for communication and display purposes.</para><formalpara><title>Default</title><para><varname>default</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.region.split.policy"><glossterm><varname>hbase.regionserver.region.split.policy</varname></glossterm><glossdef><para>
+      A split policy determines when a region should be split. The various other split policies that
+      are available currently are ConstantSizeRegionSplitPolicy, DisabledRegionSplitPolicy,
+      DelimitedKeyPrefixRegionSplitPolicy, KeyPrefixRegionSplitPolicy etc.
+    </para><formalpara><title>Default</title><para><varname>org.apache.hadoop.hbase.regionserver.IncreasingToUpperBoundRegionSplitPolicy</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="zookeeper.session.timeout"><glossterm><varname>zookeeper.session.timeout</varname></glossterm><glossdef><para>ZooKeeper session timeout in milliseconds. It is used in two different ways.
+      First, this value is used in the ZK client that HBase uses to connect to the ensemble.
+      It is also used by HBase when it starts a ZK server and it is passed as the 'maxSessionTimeout'. See
+      http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html#ch_zkSessions.
+      For example, if a HBase region server connects to a ZK ensemble that's also managed by HBase, then the
+      session timeout will be the one specified by this configuration. But, a region server that connects
+      to an ensemble managed with a different configuration will be subjected that ensemble's maxSessionTimeout. So,
+      even though HBase might propose using 90 seconds, the ensemble can have a max timeout lower than this and
+      it will take precedence. The current default that ZK ships with is 40 seconds, which is lower than HBase's.
+    </para><formalpara><title>Default</title><para><varname>90000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="zookeeper.znode.parent"><glossterm><varname>zookeeper.znode.parent</varname></glossterm><glossdef><para>Root ZNode for HBase in ZooKeeper. All of HBase's ZooKeeper
+      files that are configured with a relative path will go under this node.
+      By default, all of HBase's ZooKeeper file paths are configured with a
+      relative path, so they will all go under this directory unless changed.</para><formalpara><title>Default</title><para><varname>/hbase</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="zookeeper.znode.rootserver"><glossterm><varname>zookeeper.znode.rootserver</varname></glossterm><glossdef><para>Path to ZNode holding root region location. This is written by
+      the master and read by clients and region servers. If a relative path is
+      given, the parent folder will be ${zookeeper.znode.parent}. By default,
+      this means the root location is stored at /hbase/root-region-server.</para><formalpara><title>Default</title><para><varname>root-region-server</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="zookeeper.znode.acl.parent"><glossterm><varname>zookeeper.znode.acl.parent</varname></glossterm><glossdef><para>Root ZNode for access control lists.</para><formalpara><title>Default</title><para><varname>acl</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.zookeeper.dns.interface"><glossterm><varname>hbase.zookeeper.dns.interface</varname></glossterm><glossdef><para>The name of the Network Interface from which a ZooKeeper server
+      should report its IP address.</para><formalpara><title>Default</title><para><varname>default</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.zookeeper.dns.nameserver"><glossterm><varname>hbase.zookeeper.dns.nameserver</varname></glossterm><glossdef><para>The host name or IP address of the name server (DNS)
+      which a ZooKeeper server should use to determine the host name used by the
+      master for communication and display purposes.</para><formalpara><title>Default</title><para><varname>default</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.zookeeper.peerport"><glossterm><varname>hbase.zookeeper.peerport</varname></glossterm><glossdef><para>Port used by ZooKeeper peers to talk to each other.
+    See http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperStarted.html#sc_RunningReplicatedZooKeeper
+    for more information.</para><formalpara><title>Default</title><para><varname>2888</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.zookeeper.leaderport"><glossterm><varname>hbase.zookeeper.leaderport</varname></glossterm><glossdef><para>Port used by ZooKeeper for leader election.
+    See http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperStarted.html#sc_RunningReplicatedZooKeeper
+    for more information.</para><formalpara><title>Default</title><para><varname>3888</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.zookeeper.useMulti"><glossterm><varname>hbase.zookeeper.useMulti</varname></glossterm><glossdef><para>Instructs HBase to make use of ZooKeeper's multi-update functionality.
+    This allows certain ZooKeeper operations to complete more quickly and prevents some issues
+    with rare Replication failure scenarios (see the release note of HBASE-2611 for an example).
+    IMPORTANT: only set this to true if all ZooKeeper servers in the cluster are on version 3.4+
+    and will not be downgraded.  ZooKeeper versions before 3.4 do not support multi-update and
+    will not fail gracefully if multi-update is invoked (see ZOOKEEPER-1495).</para><formalpara><title>Default</title><para><varname>true</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.config.read.zookeeper.config"><glossterm><varname>hbase.config.read.zookeeper.config</varname></glossterm><glossdef><para>
+        Set to true to allow HBaseConfiguration to read the
+        zoo.cfg file for ZooKeeper properties. Switching this to true
+        is not recommended, since the functionality of reading ZK
+        properties from a zoo.cfg file has been deprecated.</para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.zookeeper.property.initLimit"><glossterm><varname>hbase.zookeeper.property.initLimit</varname></glossterm><glossdef><para>Property from ZooKeeper's config zoo.cfg.
+    The number of ticks that the initial synchronization phase can take.</para><formalpara><title>Default</title><para><varname>10</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.zookeeper.property.syncLimit"><glossterm><varname>hbase.zookeeper.property.syncLimit</varname></glossterm><glossdef><para>Property from ZooKeeper's config zoo.cfg.
+    The number of ticks that can pass between sending a request and getting an
+    acknowledgment.</para><formalpara><title>Default</title><para><varname>5</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.zookeeper.property.dataDir"><glossterm><varname>hbase.zookeeper.property.dataDir</varname></glossterm><glossdef><para>Property from ZooKeeper's config zoo.cfg.
+    The directory where the snapshot is stored.</para><formalpara><title>Default</title><para><varname>${hbase.tmp.dir}/zookeeper</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.zookeeper.property.clientPort"><glossterm><varname>hbase.zookeeper.property.clientPort</varname></glossterm><glossdef><para>Property from ZooKeeper's config zoo.cfg.
+    The port at which the clients will connect.</para><formalpara><title>Default</title><para><varname>2181</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.zookeeper.property.maxClientCnxns"><glossterm><varname>hbase.zookeeper.property.maxClientCnxns</varname></glossterm><glossdef><para>Property from ZooKeeper's config zoo.cfg.
+    Limit on number of concurrent connections (at the socket level) that a
+    single client, identified by IP address, may make to a single member of
+    the ZooKeeper ensemble. Set high to avoid zk connection issues running
+    standalone and pseudo-distributed.</para><formalpara><title>Default</title><para><varname>300</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.client.write.buffer"><glossterm><varname>hbase.client.write.buffer</varname></glossterm><glossdef><para>Default size of the HTable client write buffer in bytes.
+    A bigger buffer takes more memory -- on both the client and server
+    side since server instantiates the passed write buffer to process
+    it -- but a larger buffer size reduces the number of RPCs made.
+    For an estimate of server-side memory used, evaluate
+    hbase.client.write.buffer * hbase.regionserver.handler.count</para><formalpara><title>Default</title><para><varname>2097152</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.client.pause"><glossterm><varname>hbase.client.pause</varname></glossterm><glossdef><para>General client pause value.  Used mostly as value to wait
+    before running a retry of a failed get, region lookup, etc.
+    See hbase.client.retries.number for description of how we backoff from
+    this initial pause amount and how this pause works w/ retries.</para><formalpara><title>Default</title><para><varname>100</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.client.retries.number"><glossterm><varname>hbase.client.retries.number</varname></glossterm><glossdef><para>Maximum retries.  Used as maximum for all retryable
+    operations such as the getting of a cell's value, starting a row update,
+    etc.  Retry interval is a rough function based on hbase.client.pause.  At
+    first we retry at this interval but then with backoff, we pretty quickly reach
+    retrying every ten seconds.  See HConstants#RETRY_BACKOFF for how the backoff
+    ramps up.  Change this setting and hbase.client.pause to suit your workload.</para><formalpara><title>Default</title><para><varname>35</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.client.max.total.tasks"><glossterm><varname>hbase.client.max.total.tasks</varname></glossterm><glossdef><para>The maximum number of concurrent tasks a single HTable instance will
+    send to the cluster.</para><formalpara><title>Default</title><para><varname>100</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.client.max.perserver.tasks"><glossterm><varname>hbase.client.max.perserver.tasks</varname></glossterm><glossdef><para>The maximum number of concurrent tasks a single HTable instance will
+    send to a single region server.</para><formalpara><title>Default</title><para><varname>5</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.client.max.perregion.tasks"><glossterm><varname>hbase.client.max.perregion.tasks</varname></glossterm><glossdef><para>The maximum number of concurrent connections the client will
+    maintain to a single Region. That is, if there are already
+    hbase.client.max.perregion.tasks writes in progress for this region, new puts
+    won't be sent to this region until some writes finish.</para><formalpara><title>Default</title><para><varname>1</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.client.scanner.caching"><glossterm><varname>hbase.client.scanner.caching</varname></glossterm><glossdef><para>Number of rows that will be fetched when calling next
+    on a scanner if it is not served from (local, client) memory. Higher
+    caching values will enable faster scanners but will eat up more memory
+    and some calls of next may take longer and longer times when the cache is empty.
+    Do not set this value such that the time between invocations is greater
+    than the scanner timeout; i.e. hbase.client.scanner.timeout.period</para><formalpara><title>Default</title><para><varname>100</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.client.keyvalue.maxsize"><glossterm><varname>hbase.client.keyvalue.maxsize</varname></glossterm><glossdef><para>Specifies the combined maximum allowed size of a KeyValue
+    instance. This is to set an upper boundary for a single entry saved in a
+    storage file. Since individual entries cannot be split, this helps avoid a situation where a region
+    cannot be split any further because the data is too large. It seems wise
+    to set this to a fraction of the maximum region size. Setting it to zero
+    or less disables the check.</para><formalpara><title>Default</title><para><varname>10485760</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.client.scanner.timeout.period"><glossterm><varname>hbase.client.scanner.timeout.period</varname></glossterm><glossdef><para>Client scanner lease period in milliseconds.</para><formalpara><title>Default</title><para><varname>60000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.client.localityCheck.threadPoolSize"><glossterm><varname>hbase.client.localityCheck.threadPoolSize</varname></glossterm><glossdef><para/><formalpara><title>Default</title><para><varname>2</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.bulkload.retries.number"><glossterm><varname>hbase.bulkload.retries.number</varname></glossterm><glossdef><para>Maximum retries.  This is maximum number of iterations
+    that atomic bulk loads are attempted in the face of splitting operations. Setting this to
+    0 means never give up.</para><formalpara><title>Default</title><para><varname>10</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.balancer.period"><glossterm><varname>hbase.balancer.period</varname></glossterm><glossdef><para>Period at which the region balancer runs in the Master.</para><formalpara><title>Default</title><para><varname>300000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regions.slop"><glossterm><varname>hbase.regions.slop</varname></glossterm><glossdef><para>Rebalance if any regionserver has average + (average * slop) regions.</para><formalpara><title>Default</title><para><varname>0.2</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.server.thread.wakefrequency"><glossterm><varname>hbase.server.thread.wakefrequency</varname></glossterm><glossdef><para>Time to sleep in between searches for work (in milliseconds).
+    Used as sleep interval by service threads such as log roller.</para><formalpara><title>Default</title><para><varname>10000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.server.versionfile.writeattempts"><glossterm><varname>hbase.server.versionfile.writeattempts</varname></glossterm><glossdef><para>
+    How many times to retry attempting to write a version file
+    before just aborting. Each attempt is separated by
+    hbase.server.thread.wakefrequency milliseconds.</para><formalpara><title>Default</title><para><varname>3</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hregion.memstore.flush.size"><glossterm><varname>hbase.hregion.memstore.flush.size</varname></glossterm><glossdef><para>
+    Memstore will be flushed to disk if size of the memstore
+    exceeds this number of bytes.  Value is checked by a thread that runs
+    every hbase.server.thread.wakefrequency.</para><formalpara><title>Default</title><para><varname>134217728</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hregion.percolumnfamilyflush.size.lower.bound"><glossterm><varname>hbase.hregion.percolumnfamilyflush.size.lower.bound</varname></glossterm><glossdef><para>
+    If FlushLargeStoresPolicy is used, then every time that we hit the
+    total memstore limit, we find out all the column families whose memstores
+    exceed this value, and only flush them, while retaining the others whose
+    memstores are lower than this limit. If none of the families have their
+    memstore size more than this, all the memstores will be flushed
+    (just as usual). This value should be less than half of the total memstore
+    threshold (hbase.hregion.memstore.flush.size).
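+    For example, with the default hbase.hregion.memstore.flush.size of 134217728 (128 MB), a
+    lower bound of 33554432 (32 MB) satisfies the "less than half" guideline; a minimal sketch
+    (the 32 MB value is only an illustration):
+    <programlisting><![CDATA[
+<property>
+  <name>hbase.hregion.percolumnfamilyflush.size.lower.bound</name>
+  <!-- 32 MB, which is below half of the 128 MB flush size -->
+  <value>33554432</value>
+</property>
+]]></programlisting>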
+    </para><formalpara><title>Default</title><para><varname>16777216</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hregion.preclose.flush.size"><glossterm><varname>hbase.hregion.preclose.flush.size</varname></glossterm><glossdef><para>
+      If the memstores in a region are this size or larger when we go
+      to close, run a "pre-flush" to clear out memstores before we put up
+      the region closed flag and take the region offline.  On close,
+      a flush is run under the close flag to empty memory.  During
+      this time the region is offline and we are not taking on any writes.
+      If the memstore content is large, this flush could take a long time to
+      complete.  The preflush is meant to clean out the bulk of the memstore
+      before putting up the close flag and taking the region offline so the
+      flush that runs under the close flag has little to do.</para><formalpara><title>Default</title><para><varname>5242880</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hregion.memstore.block.multiplier"><glossterm><varname>hbase.hregion.memstore.block.multiplier</varname></glossterm><glossdef><para>
+    Block updates if memstore has hbase.hregion.memstore.block.multiplier
+    times hbase.hregion.memstore.flush.size bytes.  Useful for preventing
+    runaway memstore during spikes in update traffic.  Without an
+    upper-bound, memstore fills such that when it flushes the
+    resultant flush files take a long time to compact or split, or
+    worse, we OOME.</para><formalpara><title>Default</title><para><varname>4</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hregion.memstore.mslab.enabled"><glossterm><varname>hbase.hregion.memstore.mslab.enabled</varname></glossterm><glossdef><para>
+      Enables the MemStore-Local Allocation Buffer,
+      a feature which works to prevent heap fragmentation under
+      heavy write loads. This can reduce the frequency of stop-the-world
+      GC pauses on large heaps.</para><formalpara><title>Default</title><para><varname>true</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hregion.max.filesize"><glossterm><varname>hbase.hregion.max.filesize</varname></glossterm><glossdef><para>
+    Maximum HFile size. If the sum of the sizes of a region's HFiles has grown to exceed this 
+    value, the region is split in two.</para><formalpara><title>Default</title><para><varname>10737418240</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hregion.majorcompaction"><glossterm><varname>hbase.hregion.majorcompaction</varname></glossterm><glossdef><para>Time between major compactions, expressed in milliseconds. Set to 0 to disable
+      time-based automatic major compactions. User-requested and size-based major compactions will
+      still run. This value is multiplied by hbase.hregion.majorcompaction.jitter to cause
+      compaction to start at a somewhat-random time during a given window of time. The default value
+      is 7 days, expressed in milliseconds. If major compactions are causing disruption in your
+      environment, you can configure them to run at off-peak times for your deployment, or disable
+      time-based major compactions by setting this parameter to 0, and run major compactions in a
+      cron job or by another external mechanism.</para><formalpara><title>Default</title><para><varname>604800000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hregion.majorcompaction.jitter"><glossterm><varname>hbase.hregion.majorcompaction.jitter</varname></glossterm><glossdef><para>A multiplier applied to hbase.hregion.majorcompaction to cause compaction to occur
+      a given amount of time either side of hbase.hregion.majorcompaction. The smaller the number,
+      the closer the compactions will happen to the hbase.hregion.majorcompaction
+      interval.</para><formalpara><title>Default</title><para><varname>0.50</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hstore.compactionThreshold"><glossterm><varname>hbase.hstore.compactionThreshold</varname></glossterm><glossdef><para> If more than this number of StoreFiles exist in any one Store 
+      (one StoreFile is written per flush of MemStore), a compaction is run to rewrite all 
+      StoreFiles into a single StoreFile. Larger values delay compaction, but when compaction does
+      occur, it takes longer to complete.</para><formalpara><title>Default</title><para><varname>3</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hstore.flusher.count"><glossterm><varname>hbase.hstore.flusher.count</varname></glossterm><glossdef><para> The number of flush threads. With fewer threads, the MemStore flushes will be
+      queued. With more threads, the flushes will be executed in parallel, increasing the load on
+      HDFS, and potentially causing more compactions. </para><formalpara><title>Default</title><para><varname>2</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hstore.blockingStoreFiles"><glossterm><varname>hbase.hstore.blockingStoreFiles</varname></glossterm><glossdef><para> If more than this number of StoreFiles exist in any one Store (one StoreFile
+     is written per flush of MemStore), updates are blocked for this region until a compaction is
+      completed, or until hbase.hstore.blockingWaitTime has been exceeded.</para><formalpara><title>Default</title><para><varname>10</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hstore.blockingWaitTime"><glossterm><varname>hbase.hstore.blockingWaitTime</varname></glossterm><glossdef><para> The time for which a region will block updates after reaching the StoreFile limit
+    defined by hbase.hstore.blockingStoreFiles. After this time has elapsed, the region will stop 
+    blocking updates even if a compaction has not been completed.</para><formalpara><title>Default</title><para><varname>90000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hstore.compaction.min"><glossterm><varname>hbase.hstore.compaction.min</varname></glossterm><glossdef><para>The minimum number of StoreFiles which must be eligible for compaction before 
+      compaction can run. The goal of tuning hbase.hstore.compaction.min is to avoid ending up with 
+      too many tiny StoreFiles to compact. Setting this value to 2 would cause a minor compaction 
+      each time you have two StoreFiles in a Store, and this is probably not appropriate. If you
+      set this value too high, all the other values will need to be adjusted accordingly. For most 
+      cases, the default value is appropriate. In previous versions of HBase, the parameter
+      hbase.hstore.compaction.min was named hbase.hstore.compactionThreshold.</para><formalpara><title>Default</title><para><varname>3</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hstore.compaction.max"><glossterm><varname>hbase.hstore.compaction.max</varname></glossterm><glossdef><para>The maximum number of StoreFiles which will be selected for a single minor 
+      compaction, regardless of the number of eligible StoreFiles. Effectively, the value of
+      hbase.hstore.compaction.max controls the length of time it takes a single compaction to
+      complete. Setting it larger means that more StoreFiles are included in a compaction. For most
+      cases, the default value is appropriate.</para><formalpara><title>Default</title><para><varname>10</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hstore.compaction.min.size"><glossterm><varname>hbase.hstore.compaction.min.size</varname></glossterm><glossdef><para>A StoreFile smaller than this size will always be eligible for minor compaction. 
+      HFiles this size or larger are evaluated by hbase.hstore.compaction.ratio to determine if 
+      they are eligible. Because this limit represents the "automatic include" limit for all 
+      StoreFiles smaller than this value, this value may need to be reduced in write-heavy 
+      environments where many StoreFiles in the 1-2 MB range are being flushed, because every 
+      StoreFile will be targeted for compaction and the resulting StoreFiles may still be under the
+      minimum size and require further compaction. If this parameter is lowered, the ratio check is
+      triggered more quickly. This addressed some issues seen in earlier versions of HBase but 
+      changing this parameter is no longer necessary in most situations. Default: 128 MB expressed 
+      in bytes.</para><formalpara><title>Default</title><para><varname>134217728</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hstore.compaction.max.size"><glossterm><varname>hbase.hstore.compaction.max.size</varname></glossterm><glossdef><para>A StoreFile larger than this size will be excluded from compaction. The effect of 
+      raising hbase.hstore.compaction.max.size is fewer, larger StoreFiles that do not get 
+      compacted often. If you feel that compaction is happening too often without much benefit, you
+      can try raising this value. Default: the value of LONG.MAX_VALUE, expressed in bytes.</para><formalpara><title>Default</title><para><varname>9223372036854775807</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hstore.compaction.ratio"><glossterm><varname>hbase.hstore.compaction.ratio</varname></glossterm><glossdef><para>For minor compaction, this ratio is used to determine whether a given StoreFile 
+      which is larger than hbase.hstore.compaction.min.size is eligible for compaction. Its
+      effect is to limit compaction of large StoreFiles. The value of hbase.hstore.compaction.ratio
+      is expressed as a floating-point decimal. A large ratio, such as 10, will produce a single 
+      giant StoreFile. Conversely, a low value, such as .25, will produce behavior similar to the 
+      BigTable compaction algorithm, producing four StoreFiles. A moderate value of between 1.0 and
+      1.4 is recommended. When tuning this value, you are balancing write costs with read costs. 
+      Raising the value (to something like 1.4) will have more write costs, because you will 
+      compact larger StoreFiles. However, during reads, HBase will need to seek through fewer 
+      StoreFiles to accomplish the read. Consider this approach if you cannot take advantage of 
+      Bloom filters. Otherwise, you can lower this value to something like 1.0 to reduce the 
+      background cost of writes, and use Bloom filters to control the number of StoreFiles touched 
+      during reads. For most cases, the default value is appropriate.</para><formalpara><title>Default</title><para><varname>1.2F</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hstore.compaction.ratio.offpeak"><glossterm><varname>hbase.hstore.compaction.ratio.offpeak</varname></glossterm><glossdef><para>Allows you to set a different (by default, more aggressive) ratio for determining
+      whether larger StoreFiles are included in compactions during off-peak hours. Works in the 
+      same way as hbase.hstore.compaction.ratio. Only applies if hbase.offpeak.start.hour and 
+      hbase.offpeak.end.hour are also enabled.</para><formalpara><title>Default</title><para><varname>5.0F</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hstore.time.to.purge.deletes"><glossterm><varname>hbase.hstore.time.to.purge.deletes</varname></glossterm><glossdef><para>The amount of time to delay purging of delete markers with future timestamps. If 
+      unset, or set to 0, all delete markers, including those with future timestamps, are purged 
+      during the next major compaction. Otherwise, a delete marker is kept until the major compaction 
+      which occurs after the marker's timestamp plus the value of this setting, in milliseconds.
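+      For example, to keep delete markers with future timestamps for at least one day
+      (86400000 milliseconds), a minimal sketch:
+      <programlisting><![CDATA[
+<property>
+  <name>hbase.hstore.time.to.purge.deletes</name>
+  <!-- delay purging of future-timestamped delete markers by one day -->
+  <value>86400000</value>
+</property>
+]]></programlisting>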
+    </para><formalpara><title>Default</title><para><varname>0</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.offpeak.start.hour"><glossterm><varname>hbase.offpeak.start.hour</varname></glossterm><glossdef><para>The start of off-peak hours, expressed as an integer between 0 and 23, inclusive.
+      Set to -1 to disable off-peak.</para><formalpara><title>Default</title><para><varname>-1</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.offpeak.end.hour"><glossterm><varname>hbase.offpeak.end.hour</varname></glossterm><glossdef><para>The end of off-peak hours, expressed as an integer between 0 and 23, inclusive. Set
+      to -1 to disable off-peak.</para><formalpara><title>Default</title><para><varname>-1</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.thread.compaction.throttle"><glossterm><varname>hbase.regionserver.thread.compaction.throttle</varname></glossterm><glossdef><para>There are two different thread pools for compactions, one for large compactions and
+      the other for small compactions. This helps to keep compaction of lean tables (such as
+        hbase:meta) fast. If a compaction is larger than this threshold, it
+      goes into the large compaction pool. In most cases, the default value is appropriate. Default:
+      2 x hbase.hstore.compaction.max x hbase.hregion.memstore.flush.size (which defaults to 128MB).
+      The value field assumes that the value of hbase.hregion.memstore.flush.size is unchanged from
+      the default.</para><formalpara><title>Default</title><para><varname>2684354560</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hstore.compaction.kv.max"><glossterm><varname>hbase.hstore.compaction.kv.max</varname></glossterm><glossdef><para>The maximum number of KeyValues to read and then write in a batch when flushing or
+      compacting. Set this lower if you have big KeyValues and problems with Out Of Memory
+      Exceptions. Set this higher if you have wide, small rows.</para><formalpara><title>Default</title><para><varname>10</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.storescanner.parallel.seek.enable"><glossterm><varname>hbase.storescanner.parallel.seek.enable</varname></glossterm><glossdef><para>
+      Enables StoreFileScanner parallel-seeking in StoreScanner,
+      a feature which can reduce response latency under special conditions.</para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.storescanner.parallel.seek.threads"><glossterm><varname>hbase.storescanner.parallel.seek.threads</varname></glossterm><glossdef><para>
+      The default thread pool size if the parallel-seeking feature is enabled.</para><formalpara><title>Default</title><para><varname>10</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hfile.block.cache.size"><glossterm><varname>hfile.block.cache.size</varname></glossterm><glossdef><para>Percentage of maximum heap (-Xmx setting) to allocate to block cache
+        used by a StoreFile. Default of 0.4 means allocate 40%.
+        Set to 0 to disable but it's not recommended; you need at least
+        enough cache to hold the storefile indices.</para><formalpara><title>Default</title><para><varname>0.4</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hfile.block.index.cacheonwrite"><glossterm><varname>hfile.block.index.cacheonwrite</varname></glossterm><glossdef><para>This allows non-root multi-level index blocks to be put into the block
+          cache at the time the index is being written.</para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hfile.index.block.max.size"><glossterm><varname>hfile.index.block.max.size</varname></glossterm><glossdef><para>When the size of a leaf-level, intermediate-level, or root-level
+          index block in a multi-level block index grows to this size, the
+          block is written out and a new block is started.</para><formalpara><title>Default</title><para><varname>131072</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.bucketcache.ioengine"><glossterm><varname>hbase.bucketcache.ioengine</varname></glossterm><glossdef><para>Where to store the contents of the bucketcache. One of: onheap, 
+      offheap, or file. If a file, set it to file:PATH_TO_FILE. See https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/io/hfile/CacheConfig.html for more information.
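+      For example, a minimal sketch that backs the bucketcache with a local file (the path
+      /mnt/bucketcache.data is only an illustration):
+      <programlisting><![CDATA[
+<property>
+  <name>hbase.bucketcache.ioengine</name>
+  <!-- store bucketcache contents in a file; "onheap" and "offheap" are the other options -->
+  <value>file:/mnt/bucketcache.data</value>
+</property>
+]]></programlisting>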
+    </para><formalpara><title>Default</title><para><varname/></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.bucketcache.combinedcache.enabled"><glossterm><varname>hbase.bucketcache.combinedcache.enabled</varname></glossterm><glossdef><para>Whether or not the bucketcache is used in league with the LRU 
+      on-heap block cache. In this mode, indices and blooms are kept in the LRU 
+      blockcache and the data blocks are kept in the bucketcache.</para><formalpara><title>Default</title><para><varname>true</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.bucketcache.size"><glossterm><varname>hbase.bucketcache.size</varname></glossterm><glossdef><para>The size of the buckets for the bucketcache if you only use a single size. 
+      Defaults to the default blocksize, which is 64 * 1024.</para><formalpara><title>Default</title><para><varname>65536</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.bucketcache.sizes"><glossterm><varname>hbase.bucketcache.sizes</varname></glossterm><glossdef><para>A comma-separated list of sizes for buckets for the bucketcache 
+      if you use multiple sizes. Should be a list of block sizes in order from smallest 
+      to largest. The sizes you use will depend on your data access patterns.</para><formalpara><title>Default</title><para><varname/></para></formalpara></glossdef></glossentry><glossentry xml:id="hfile.format.version"><glossterm><varname>hfile.format.version</varname></glossterm><glossdef><para>The HFile format version to use for new files.
+      Version 3 adds support for tags in hfiles (See http://hbase.apache.org/book.html#hbase.tags).
+      Distributed Log Replay requires that tags are enabled. Also see the configuration
+      'hbase.replication.rpc.codec'. 
+      </para><formalpara><title>Default</title><para><varname>3</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hfile.block.bloom.cacheonwrite"><glossterm><varname>hfile.block.bloom.cacheonwrite</varname></glossterm><glossdef><para>Enables cache-on-write for inline blocks of a compound Bloom filter.</para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="io.storefile.bloom.block.size"><glossterm><varname>io.storefile.bloom.block.size</varname></glossterm><glossdef><para>The size in bytes of a single block ("chunk") of a compound Bloom
+          filter. This size is approximate, because Bloom blocks can only be
+          inserted at data block boundaries, and the number of keys per data
+          block varies.</para><formalpara><title>Default</title><para><varname>131072</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.rs.cacheblocksonwrite"><glossterm><varname>hbase.rs.cacheblocksonwrite</varname></glossterm><glossdef><para>Whether an HFile block should be added to the block cache when the
+          block is finished.</para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.rpc.timeout"><glossterm><varname>hbase.rpc.timeout</varname></glossterm><glossdef><para>This is for the RPC layer to define how long HBase client applications
+        wait for a remote call to time out. It uses pings to check connections
+        but will eventually throw a TimeoutException.</para><formalpara><title>Default</title><para><varname>60000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.rpc.shortoperation.timeout"><glossterm><varname>hbase.rpc.shortoperation.timeout</varname></glossterm><glossdef><para>This is another version of "hbase.rpc.timeout". For RPC operations
+        within the cluster, we rely on this configuration to set a short timeout
+        for short operations. For example, a short RPC timeout for a region server trying
+        to report to the active master can speed up the master failover process.</para><formalpara><title>Default</title><para><varname>10000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.ipc.client.tcpnodelay"><glossterm><varname>hbase.ipc.client.tcpnodelay</varname></glossterm><glossdef><para>Set no delay on rpc socket connections.  See
+    http://docs.oracle.com/javase/1.5.0/docs/api/java/net/Socket.html#getTcpNoDelay()</para><formalpara><title>Default</title><para><varname>true</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.master.keytab.file"><glossterm><varname>hbase.master.keytab.file</varname></glossterm><glossdef><para>Full path to the kerberos keytab file to use for logging in
+    the configured HMaster server principal.</para><formalpara><title>Default</title><para><varname/></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.master.kerberos.principal"><glossterm><varname>hbase.master.kerberos.principal</varname></glossterm><glossdef><para>Ex. "hbase/_HOST@EXAMPLE.COM".  The kerberos principal name
+    that should be used to run the HMaster process.  The principal name should
+    be in the form: user/hostname@DOMAIN.  If "_HOST" is used as the hostname
+    portion, it will be replaced with the actual hostname of the running
+    instance.</para><formalpara><title>Default</title><para><varname/></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.keytab.file"><glossterm><varname>hbase.regionserver.keytab.file</varname></glossterm><glossdef><para>Full path to the kerberos keytab file to use for logging in
+    the configured HRegionServer server principal.</para><formalpara><title>Default</title><para><varname/></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.kerberos.principal"><glossterm><varname>hbase.regionserver.kerberos.principal</varname></glossterm><glossdef><para>Ex. "hbase/_HOST@EXAMPLE.COM".  The kerberos principal name
+    that should be used to run the HRegionServer process.  The principal name
+    should be in the form: user/hostname@DOMAIN.  If "_HOST" is used as the
+    hostname portion, it will be replaced with the actual hostname of the
+    running instance.  An entry for this principal must exist in the file
+    specified in hbase.regionserver.keytab.file</para><formalpara><title>Default</title><para><varname/></para></formalpara></glossdef></glossentry><glossentry xml:id="hadoop.policy.file"><glossterm><varname>hadoop.policy.file</varname></glossterm><glossdef><para>The policy configuration file used by RPC servers to make
+      authorization decisions on client requests.  Only used when HBase
+      security is enabled.</para><formalpara><title>Default</title><para><varname>hbase-policy.xml</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.superuser"><glossterm><varname>hbase.superuser</varname></glossterm><glossdef><para>List of users or groups (comma-separated), who are allowed
+    full privileges, regardless of stored ACLs, across the cluster.
+    Only used when HBase security is enabled.</para><formalpara><title>Default</title><para><varname/></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.auth.key.update.interval"><glossterm><varname>hbase.auth.key.update.interval</varname></glossterm><glossdef><para>The update interval for master key for authentication tokens
+    in servers in milliseconds.  Only used when HBase security is enabled.</para><formalpara><title>Default</title><para><varname>86400000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.auth.token.max.lifetime"><glossterm><varname>hbase.auth.token.max.lifetime</varname></glossterm><glossdef><para>The maximum lifetime in milliseconds after which an
+    authentication token expires.  Only used when HBase security is enabled.</para><formalpara><title>Default</title><para><varname>604800000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.ipc.client.fallback-to-simple-auth-allowed"><glossterm><varname>hbase.ipc.client.fallback-to-simple-auth-allowed</varname></glossterm><glossdef><para>When a client is configured to attempt a secure connection, but attempts to
+      connect to an insecure server, that server may instruct the client to
+      switch to SASL SIMPLE (unsecure) authentication. This setting controls
+      whether or not the client will accept this instruction from the server.
+      When false (the default), the client will not allow the fallback to SIMPLE
+      authentication, and will abort the connection.</para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.display.keys"><glossterm><varname>hbase.display.keys</varname></glossterm><glossdef><para>When this is set to true the webUI and such will display all start/end keys
+                 as part of the table details, region names, etc. When this is set to false,
+                 the keys are hidden.</para><formalpara><title>Default</title><para><varname>true</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.coprocessor.region.classes"><glossterm><varname>hbase.coprocessor.region.classes</varname></glossterm><glossdef><para>A comma-separated list of Coprocessors that are loaded by
+    default on all tables. For any override coprocessor method, these classes
+    will be called in order. After implementing your own Coprocessor, just put
+    it in HBase's classpath and add the fully qualified class name here.
+    A coprocessor can also be loaded on demand by setting HTableDescriptor.</para><formalpara><title>Default</title><para><varname/></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.rest.port"><glossterm><varname>hbase.rest.port</varname></glossterm><glossdef><para>The port for the HBase REST server.</para><formalpara><title>Default</title><para><varname>8080</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.rest.readonly"><glossterm><varname>hbase.rest.readonly</varname></glossterm><glossdef><para>Defines the mode the REST server will be started in. Possible values are:
+    false: All HTTP methods are permitted - GET/PUT/POST/DELETE.
+    true: Only the GET method is permitted.</para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.rest.threads.max"><glossterm><varname>hbase.rest.threads.max</varname></glossterm><glossdef><para>The maximum number of threads of the REST server thread pool.
+        Threads in the pool are reused to process REST requests. This
+        controls the maximum number of requests processed concurrently.
+        It may help to control the memory used by the REST server to
+        avoid OOM issues. If the thread pool is full, incoming requests
+        will be queued up and wait for some free threads.</para><formalpara><title>Default</title><para><varname>100</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.rest.threads.min"><glossterm><varname>hbase.rest.threads.min</varname></glossterm><glossdef><para>The minimum number of threads of the REST server thread pool.
+        The thread pool always has at least this number of threads so
+        the REST server is ready to serve incoming requests.</para><formalpara><title>Default</title><para><varname>2</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.rest.support.proxyuser"><glossterm><varname>hbase.rest.support.proxyuser</varname></glossterm><glossdef><para>Enables running the REST server to support proxy-user mode.</para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.defaults.for.version.skip"><glossterm><varname>hbase.defaults.for.version.skip</varname></glossterm><glossdef><para>Set to true to skip the 'hbase.defaults.for.version' check.
+    Setting this to true can be useful in contexts other than
+    the other side of a maven generation; i.e. running in an
+    IDE.  You'll want to set this boolean to true to avoid
+    seeing the RuntimeException complaint: "hbase-default.xml file
+    seems to be for an old version of HBase (\${hbase.version}), this
+    version is X.X.X-SNAPSHOT"</para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.coprocessor.master.classes"><glossterm><varname>hbase.coprocessor.master.classes</varname></glossterm><glossdef><para>A comma-separated list of
+    org.apache.hadoop.hbase.coprocessor.MasterObserver coprocessors that are
+    loaded by default on the active HMaster process. For any implemented
+    coprocessor methods, the listed classes will be called in order. After
+    implementing your own MasterObserver, just put it in HBase's classpath
+    and add the fully qualified class name here.</para><formalpara><title>Default</title><para><varname/></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.coprocessor.abortonerror"><glossterm><varname>hbase.coprocessor.abortonerror</varname></glossterm><glossdef><para>Set to true to cause the hosting server (master or regionserver)
+      to abort if a coprocessor fails to load, fails to initialize, or throws an
+      unexpected Throwable object. Setting this to false will allow the server to
+      continue execution but the system wide state of the coprocessor in question
+      will become inconsistent as it will be properly executing in only a subset
+      of servers, so this is most useful for debugging only.</para><formalpara><title>Default</title><para><varname>true</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.online.schema.update.enable"><glossterm><varname>hbase.online.schema.update.enable</varname></glossterm><glossdef><para>Set true to enable online schema changes.</para><formalpara><title>Default</title><para><varname>true</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.table.lock.enable"><glossterm><varname>hbase.table.lock.enable</varname></glossterm><glossdef><para>Set to true to enable locking the table in zookeeper for schema change operations.
+    Table locking from the master prevents concurrent schema modifications from corrupting table
+    state.</para><formalpara><title>Default</title><para><varname>true</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.table.max.rowsize"><glossterm><varname>hbase.table.max.rowsize</varname></glossterm><glossdef><para>
+      Maximum size of a single row in bytes (default is 1 GB) for a Get or
+      Scan without the in-row scan flag set. If the row size exceeds this limit,
+      a RowTooBigException is thrown to the client.
+    </para><formalpara><title>Default</title><para><varname>1073741824</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.thrift.minWorkerThreads"><glossterm><varname>hbase.thrift.minWorkerThreads</varname></glossterm><glossdef><para>The "core size" of the thread pool. New threads are created on every
+    connection until this many threads are created.</para><formalpara><title>Default</title><para><varname>16</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.thrift.maxWorkerThreads"><glossterm><varname>hbase.thrift.maxWorkerThreads</varname></glossterm><glossdef><para>The maximum size of the thread pool. When the pending request queue
+    overflows, new threads are created until the thread count reaches this maximum.
+    After that, the server starts dropping connections.</para><formalpara><title>Default</title><para><varname>1000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.thrift.maxQueuedRequests"><glossterm><varname>hbase.thrift.maxQueuedRequests</varname></glossterm><glossdef><para>The maximum number of pending Thrift connections waiting in the queue. If
+     there are no idle threads in the pool, the server queues requests. Only
+     when the queue overflows, new threads are added, up to
+     hbase.thrift.maxQueuedRequests threads.</para><formalpara><title>Default</title><para><varname>1000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.thrift.htablepool.size.max"><glossterm><varname>hbase.thrift.htablepool.size.max</varname></glossterm><glossdef><para>The upper bound for the table pool used in the Thrift gateway server.
+      Since this is per table name, we assume a single table and so with 1000 default
+      worker threads max this is set to a matching number. For other workloads this number
+      can be adjusted as needed.
+    </para><formalpara><title>Default</title><para><varname>1000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.thrift.framed"><glossterm><varname>hbase.regionserver.thrift.framed</varname></glossterm><glossdef><para>Use Thrift TFramedTransport on the server side.
+      This is the recommended transport for thrift servers and requires a similar setting
+      on the client side. Changing this to false will select the default transport,
+      vulnerable to DoS when malformed requests are issued due to THRIFT-601.
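+      As a sketch, enabling the framed transport together with a larger frame size (both
+      properties are described in this list; the 16 MB value is only an illustration):
+      <programlisting><![CDATA[
+<property>
+  <name>hbase.regionserver.thrift.framed</name>
+  <value>true</value>
+</property>
+<property>
+  <name>hbase.regionserver.thrift.framed.max_frame_size_in_mb</name>
+  <!-- raise the default 2 MB frame limit -->
+  <value>16</value>
+</property>
+]]></programlisting>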
+    </para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.thrift.framed.max_frame_size_in_mb"><glossterm><varname>hbase.regionserver.thrift.framed.max_frame_size_in_mb</varname></glossterm><glossdef><para>Default frame size when using framed transport</para><formalpara><title>Default</title><para><varname>2</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.thrift.compact"><glossterm><varname>hbase.regionserver.thrift.compact</varname></glossterm><glossdef><para>Use Thrift TCompactProtocol binary serialization protocol.</para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.data.umask.enable"><glossterm><varname>hbase.data.umask.enable</varname></glossterm><glossdef><para>Enable, if true, that file permissions should be assigned
+      to the files written by the regionserver</para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.data.umask"><glossterm><varname>hbase.data.umask</varname></glossterm><glossdef><para>File permissions that should be used to write data
+      files when hbase.data.umask.enable is true</para><formalpara><title>Default</title><para><varname>000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.metrics.showTableName"><glossterm><varname>hbase.metrics.showTableName</varname></glossterm><glossdef><para>Whether to include the prefix "tbl.tablename" in per-column family metrics.
+	If true, for each metric M, per-cf metrics will be reported for tbl.T.cf.CF.M, if false,
+	per-cf metrics will be aggregated by column-family across tables, and reported for cf.CF.M.
+	In both cases, the aggregated metric M across tables and cfs will be reported.</para><formalpara><title>Default</title><para><varname>true</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.metrics.exposeOperationTimes"><glossterm><varname>hbase.metrics.exposeOperationTimes</varname></glossterm><glossdef><para>Whether to report metrics about time taken performing an
+      operation on the region server.  Get, Put, Delete, Increment, and Append can all
+      have their times exposed through Hadoop metrics per CF and per region.</para><formalpara><title>Default</title><para><varname>true</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.snapshot.enabled"><glossterm><varname>hbase.snapshot.enabled</varname></glossterm><glossdef><para>Set to true to allow snapshots to be taken / restored / cloned.</para><formalpara><title>Default</title><para><varname>true</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.snapshot.restore.take.failsafe.snapshot"><glossterm><varname>hbase.snapshot.restore.take.failsafe.snapshot</varname></glossterm><glossdef><para>Set to true to take a snapshot before the restore operation.
+      The snapshot taken will be used in case of failure, to restore the previous state.
+      At the end of the restore operation this snapshot will be deleted</para><formalpara><title>Default</title><para><varname>true</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.snapshot.restore.failsafe.name"><glossterm><varname>hbase.snapshot.restore.failsafe.name</varname></glossterm><glossdef><para>Name of the failsafe snapshot taken by the restore operation.
+      You can use the {snapshot.name}, {table.name} and {restore.timestamp} variables
+      to create a name based on what you are restoring.</para><formalpara><title>Default</title><para><varname>hbase-failsafe-{snapshot.name}-{restore.timestamp}</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.server.compactchecker.interval.multiplier"><glossterm><varname>hbase.server.compactchecker.interval.multiplier</varname></glossterm><glossdef><para>The number that determines how often we scan to see if compaction is necessary.
+        Normally, compactions are done after some events (such as memstore flush), but if
+        a region hasn't received many writes for some time, or due to different compaction
+        policies, it may be necessary to check it periodically. The interval between checks is
+        hbase.server.compactchecker.interval.multiplier multiplied by
+        hbase.server.thread.wakefrequency.</para><formalpara><title>Default</title><para><varname>1000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.lease.recovery.timeout"><glossterm><varname>hbase.lease.recovery.timeout</varname></glossterm><glossdef><para>How long we wait on dfs lease recovery in total before giving up.</para><formalpara><title>Default</title><para><varname>900000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.lease.recovery.dfs.timeout"><glossterm><varname>hbase.lease.recovery.dfs.timeout</varname></glossterm><glossdef><para>How long between dfs recover lease invocations. Should be larger than the sum of
+        the time it takes for the namenode to issue a block recovery command as part of
+        a datanode heartbeat (dfs.heartbeat.interval), and the time it takes for the primary
+        datanode, performing block recovery, to time out on a dead datanode; usually
+        dfs.client.socket-timeout. See the end of HBASE-8389 for more.</para><formalpara><title>Default</title><para><varname>64000</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.column.max.version"><glossterm><varname>hbase.column.max.version</varname></glossterm><glossdef><para>New column family descriptors will use this value as the default number of versions
+      to keep.</para><formalpara><title>Default</title><para><varname>1</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.dfs.client.read.shortcircuit.buffer.size"><glossterm><varname>hbase.dfs.client.read.shortcircuit.buffer.size</varname></glossterm><glossdef><para>If the DFSClient configuration
+    dfs.client.read.shortcircuit.buffer.size is unset, we will
+    use what is configured here as the short circuit read default
+    direct byte buffer size. DFSClient native default is 1MB; HBase
+    keeps its HDFS files open so number of file blocks * 1MB soon
+    starts to add up and threaten OOME because of a shortage of
+    direct memory.  So, we set it down from the default.  Make
+    it &gt; the default hbase block size set in the HColumnDescriptor
+    which is usually 64k.
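+    As a rough sketch of the arithmetic: with this 128 KB buffer and, say, 10,000 open
+    HDFS blocks, short-circuit reads would consume on the order of 1.3 GB of direct memory,
+    versus roughly 10 GB with the 1 MB DFSClient default. To change the buffer size:
+    <programlisting><![CDATA[
+<property>
+  <name>hbase.dfs.client.read.shortcircuit.buffer.size</name>
+  <!-- 128 KB; keep this above the HColumnDescriptor block size (usually 64 KB) -->
+  <value>131072</value>
+</property>
+]]></programlisting>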
+    </para><formalpara><title>Default</title><para><varname>131072</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.checksum.verify"><glossterm><varname>hbase.regionserver.checksum.verify</varname></glossterm><glossdef><para>
+        If set to true (the default), HBase verifies the checksums for hfile
+        blocks. HBase writes checksums inline with the data when it writes out
+        hfiles. HDFS (as of this writing) writes checksums to a separate file
+        from the data file, necessitating extra seeks.  Setting this flag saves
+        some on i/o.  Checksum verification by HDFS will be internally disabled
+        on hfile streams when this flag is set.  If the hbase-checksum verification
+        fails, we will switch back to using HDFS checksums (so do not disable HDFS
+        checksums!  And besides this feature applies to hfiles only, not to WALs).
+        If this parameter is set to false, then hbase will not verify any checksums,
+        instead it will depend on checksum verification being done in the HDFS client.  
+    </para><formalpara><title>Default</title><para><varname>true</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hstore.bytes.per.checksum"><glossterm><varname>hbase.hstore.bytes.per.checksum</varname></glossterm><glossdef><para>
+        Number of bytes in a newly created checksum chunk for HBase-level
+        checksums in hfile blocks.
+    </para><formalpara><title>Default</title><para><varname>16384</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.hstore.checksum.algorithm"><glossterm><varname>hbase.hstore.checksum.algorithm</varname></glossterm><glossdef><para>
+      Name of an algorithm that is used to compute checksums. Possible values
+      are NULL, CRC32, CRC32C.
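+      A minimal sketch combining the three checksum settings described above (the choice of
+      CRC32C here is only an illustration of one of the listed values):
+      <programlisting><![CDATA[
+<property>
+  <name>hbase.regionserver.checksum.verify</name>
+  <value>true</value>
+</property>
+<property>
+  <name>hbase.hstore.bytes.per.checksum</name>
+  <!-- checksum chunk size in bytes -->
+  <value>16384</value>
+</property>
+<property>
+  <name>hbase.hstore.checksum.algorithm</name>
+  <!-- NULL, CRC32, or CRC32C -->
+  <value>CRC32C</value>
+</property>
+]]></programlisting>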
+    </para><formalpara><title>Default</title><para><varname>CRC32</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.status.published"><glossterm><varname>hbase.status.published</varname></glossterm><glossdef><para>
+      This setting activates publication of region server status by the master.
+      When a region server dies and its recovery starts, the master will push this information
+      to the client applications, to let them cut the connection immediately instead of waiting
+      for a timeout.
+    </para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.status.publisher.class"><glossterm><varname>hbase.status.publisher.class</varname></glossterm><glossdef><para>
+      Implementation of the status publication with a multicast message.
+    </para><formalpara><title>Default</title><para><varname>org.apache.hadoop.hbase.master.ClusterStatusPublisher$MulticastPublisher</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.status.listener.class"><glossterm><varname>hbase.status.listener.class</varname></glossterm><glossdef><para>
+      Implementation of the status listener with a multicast message.
+    </para><formalpara><title>Default</title><para><varname>org.apache.hadoop.hbase.client.ClusterStatusListener$MulticastListener</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.status.multicast.address.ip"><glossterm><varname>hbase.status.multicast.address.ip</varname></glossterm><glossdef><para>
+      Multicast address to use for the status publication by multicast.
+    </para><formalpara><title>Default</title><para><varname>226.1.1.3</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.status.multicast.address.port"><glossterm><varname>hbase.status.multicast.address.port</varname></glossterm><glossdef><para>
+      Multicast port to use for the status publication by multicast.
+    </para><formalpara><title>Default</title><para><varname>16100</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.dynamic.jars.dir"><glossterm><varname>hbase.dynamic.jars.dir</varname></glossterm><glossdef><para>
+      The directory from which the custom filter/co-processor jars can be loaded
+      dynamically by the region server without the need to restart. However,
+      an already loaded filter/co-processor class will not be unloaded. See
+      HBASE-1936 for more details.
+    </para><formalpara><title>Default</title><para><varname>${hbase.rootdir}/lib</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.security.authentication"><glossterm><varname>hbase.security.authentication</varname></glossterm><glossdef><para>
+      Controls whether or not secure authentication is enabled for HBase.
+      Possible values are 'simple' (no authentication), and 'kerberos'.
+    </para><formalpara><title>Default</title><para><varname>simple</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.rest.filter.classes"><glossterm><varname>hbase.rest.filter.classes</varname></glossterm><glossdef><para>
+      Servlet filters for REST service.
+    </para><formalpara><title>Default</title><para><varname>org.apache.hadoop.hbase.rest.filter.GzipFilter</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.master.loadbalancer.class"><glossterm><varname>hbase.master.loadbalancer.class</varname></glossterm><glossdef><para>
+      Class used to execute region balancing when the balancing period occurs.
+      See the class comment for more on how it works:
+      http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/master/balancer/StochasticLoadBalancer.html
+      It replaces the DefaultLoadBalancer as the default (since renamed
+      the SimpleLoadBalancer).
+    </para><formalpara><title>Default</title><para><varname>org.apache.hadoop.hbase.master.balancer.StochasticLoadBalancer</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.security.exec.permission.checks"><glossterm><varname>hbase.security.exec.permission.checks</varname></glossterm><glossdef><para>
+      If this setting is enabled and ACL based access control is active (the
+      AccessController coprocessor is installed either as a system coprocessor
+      or on a table as a table coprocessor) then you must grant all relevant
+      users EXEC privilege if they require the ability to execute coprocessor
+      endpoint calls. EXEC privilege, like any other permission, can be
+      granted globally to a user, or to a user on a per table or per namespace
+      basis. For more information on coprocessor endpoints, see the coprocessor
+      section of the HBase online manual. For more information on granting or
+      revoking permissions using the AccessController, see the security
+      section of the HBase online manual.
+    </para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.procedure.regionserver.classes"><glossterm><varname>hbase.procedure.regionserver.classes</varname></glossterm><glossdef><para>A comma-separated list of 
+    org.apache.hadoop.hbase.procedure.RegionServerProcedureManager procedure managers that are 
+    loaded by default on the active HRegionServer process. The lifecycle methods (init/start/stop) 
+    will be called by the active HRegionServer process to perform the specific globally barriered 
+    procedure. After implementing your own RegionServerProcedureManager, just put it in 
+    HBase's classpath and add the fully qualified class name here.
+    </para><formalpara><title>Default</title><para><varname/></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.procedure.master.classes"><glossterm><varname>hbase.procedure.master.classes</varname></glossterm><glossdef><para>A comma-separated list of
+    org.apache.hadoop.hbase.procedure.MasterProcedureManager procedure managers that are
+    loaded by default on the active HMaster process. A procedure is identified by its signature and
+    users can use the signature and an instant name to trigger an execution of a globally barriered
+    procedure. After implementing your own MasterProcedureManager, just put it in HBase's classpath
+    and add the fully qualified class name here.</para><formalpara><title>Default</title><para><varname/></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.coordinated.state.manager.class"><glossterm><varname>hbase.coordinated.state.manager.class</varname></glossterm><glossdef><para>Fully qualified name of class implementing coordinated state manager.</para><formalpara><title>Default</title><para><varname>org.apache.hadoop.hbase.coordination.ZkCoordinatedStateManager</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.regionserver.storefile.refresh.period"><glossterm><varname>hbase.regionserver.storefile.refresh.period</varname></glossterm><glossdef><para>
+      The period (in milliseconds) for refreshing the store files for the secondary regions. 0
+      means this feature is disabled. Secondary regions see new files (from flushes and
+      compactions) from the primary once the secondary region refreshes the list of files in the
+      region (there is no notification mechanism). But too-frequent refreshes might cause
+      extra Namenode pressure. If the files cannot be refreshed for longer than the HFile TTL
+      (hbase.master.hfilecleaner.ttl), the requests are rejected. Configuring the HFile TTL to a larger
+      value is also recommended with this setting.
+    </para><formalpara><title>Default</title><para><varname>0</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.region.replica.replication.enabled"><glossterm><varname>hbase.region.replica.replication.enabled</varname></glossterm><glossdef><para>
+      Whether asynchronous WAL replication to the secondary region replicas is enabled or not.
+      If this is enabled, a replication peer named "region_replica_replication" will be created
+      which will tail the logs and replicate the mutations to region replicas for tables that
+      have region replication &gt; 1. Once this is enabled, disabling this replication also
+      requires disabling the replication peer using the shell or the ReplicationAdmin Java class.
+      Replication to secondary region replicas works over standard inter-cluster replication.
+      So if replication has been disabled explicitly, it must be re-enabled by setting "hbase.replication"
+      to true for this feature to work.
+    </para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.http.filter.initializers"><glossterm><varname>hbase.http.filter.initializers</varname></glossterm><glossdef><para>
+      A comma-separated list of class names. Each class in the list must extend
+      org.apache.hadoop.hbase.http.FilterInitializer. The corresponding Filter will
+      be initialized. Then, the Filter will be applied to all user-facing jsp
+      and servlet web pages.
+      The ordering of the list defines the ordering of the filters.
+      The default StaticUserWebFilter adds a user principal as defined by the
+      hbase.http.staticuser.user property.
+    </para><formalpara><title>Default</title><para><varname>org.apache.hadoop.hbase.http.lib.StaticUserWebFilter</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.security.visibility.mutations.checkauths"><glossterm><varname>hbase.security.visibility.mutations.checkauths</varname></glossterm><glossdef><para>
+      If this property is enabled, HBase will check whether the labels in the visibility expression
+      are associated with the user issuing the mutation.
+    </para><formalpara><title>Default</title><para><varname>false</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.http.max.threads"><glossterm><varname>hbase.http.max.threads</varname></glossterm><glossdef><para>
+      The maximum number of threads that the HTTP Server will create in its 
+      ThreadPool.
+    </para><formalpara><title>Default</title><para><varname>10</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.replication.rpc.codec"><glossterm><varname>hbase.replication.rpc.codec</varname></glossterm><glossdef><para>
+      The codec to use when replication is enabled, so that
+      tags are also replicated. This is used along with HFileV3, which
+      supports tags.  If tags are not used, or if the hfile version in use
+      is HFileV2, then KeyValueCodec can be used as the replication codec. Note that
+      using KeyValueCodecWithTags for replication when there are no tags causes no harm.
+  	</para><formalpara><title>Default</title><para><varname>org.apache.hadoop.hbase.codec.KeyValueCodecWithTags</varname></para></formalpara></glossdef></glossentry><glossentry xml:id="hbase.http.staticuser.user"><glossterm><varname>hbase.http.staticuser.user</varname></glossterm><glossdef><para>
+      The user name to filter as on static web filters
+      while rendering content. An example use is the HDFS
+      web UI (the user used for browsing files).
+    </para><formalpara><title>Default</title><para><varname>dr.stack</varname></para></formalpara></glossdef></glossentry></glossary>
\ No newline at end of file


[6/8] hbase git commit: HBASE-12738 Chunk Ref Guide into file-per-chapter

Posted by mi...@apache.org.
http://git-wip-us.apache.org/repos/asf/hbase/blob/a1fe1e09/src/main/docbkx/asf.xml
----------------------------------------------------------------------
diff --git a/src/main/docbkx/asf.xml b/src/main/docbkx/asf.xml
new file mode 100644
index 0000000..1455b4a
--- /dev/null
+++ b/src/main/docbkx/asf.xml
@@ -0,0 +1,44 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<appendix
+    xml:id="asf"
+    version="5.0"
+    xmlns="http://docbook.org/ns/docbook"
+    xmlns:xlink="http://www.w3.org/1999/xlink"
+    xmlns:xi="http://www.w3.org/2001/XInclude"
+    xmlns:svg="http://www.w3.org/2000/svg"
+    xmlns:m="http://www.w3.org/1998/Math/MathML"
+    xmlns:html="http://www.w3.org/1999/xhtml"
+    xmlns:db="http://docbook.org/ns/docbook">
+    <!--/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+-->
+    <title>HBase and the Apache Software Foundation</title>
+    <para>HBase is a project in the Apache Software Foundation and, as such, the project has responsibilities to the ASF to ensure
+        a healthy project.</para>
+    <section xml:id="asf.devprocess"><title>ASF Development Process</title>
+        <para>See the <link xlink:href="http://www.apache.org/dev/#committers">Apache Development Process page</link>
+            for all sorts of information on how the ASF is structured (e.g., PMC, committers, contributors), tips on contributing
+            and getting involved, and how open source works at the ASF.
+        </para>
+    </section>
+    <section xml:id="asf.reporting"><title>ASF Board Reporting</title>
+        <para>Once a quarter, each project in the ASF portfolio submits a report to the ASF board.  This is done by the HBase project
+            lead and the committers.  See <link xlink:href="http://www.apache.org/foundation/board/reporting">ASF board reporting</link> for more information.
+        </para>
+    </section>
+</appendix>


[3/8] hbase git commit: HBASE-12738 Chunk Ref Guide into file-per-chapter

Posted by mi...@apache.org.
http://git-wip-us.apache.org/repos/asf/hbase/blob/a1fe1e09/src/main/docbkx/faq.xml
----------------------------------------------------------------------
diff --git a/src/main/docbkx/faq.xml b/src/main/docbkx/faq.xml
new file mode 100644
index 0000000..d7bcb0c
--- /dev/null
+++ b/src/main/docbkx/faq.xml
@@ -0,0 +1,270 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<appendix
+    xml:id="faq"
+    version="5.0"
+    xmlns="http://docbook.org/ns/docbook"
+    xmlns:xlink="http://www.w3.org/1999/xlink"
+    xmlns:xi="http://www.w3.org/2001/XInclude"
+    xmlns:svg="http://www.w3.org/2000/svg"
+    xmlns:m="http://www.w3.org/1998/Math/MathML"
+    xmlns:html="http://www.w3.org/1999/xhtml"
+    xmlns:db="http://docbook.org/ns/docbook">
+    <!--/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+-->
+        <title>FAQ</title>
+        <qandaset defaultlabel='qanda'>
+            <qandadiv><title>General</title>
+                <qandaentry>
+                    <question><para>When should I use HBase?</para></question>
+                    <answer>
+                        <para>See the <xref linkend="arch.overview" /> in the Architecture chapter.
+                        </para>
+                    </answer>
+                </qandaentry>
+                <qandaentry>
+                    <question><para>Are there other HBase FAQs?</para></question>
+                    <answer>
+                        <para>
+                            See the FAQ that is up on the wiki, <link xlink:href="http://wiki.apache.org/hadoop/Hbase/FAQ">HBase Wiki FAQ</link>.
+                        </para>
+                    </answer>
+                </qandaentry>
+                <qandaentry xml:id="faq.sql">
+                    <question><para>Does HBase support SQL?</para></question>
+                    <answer>
+                        <para>
+                            Not really.  SQL-ish support for HBase via <link xlink:href="http://hive.apache.org/">Hive</link> is in development; however, Hive is based on MapReduce, which is not generally suitable for low-latency requests.
+                            See the <xref linkend="datamodel" /> section for examples on the HBase client.
+                        </para>
+                    </answer>
+                </qandaentry>
+                <qandaentry>
+                    <question><para>How can I find examples of NoSQL/HBase?</para></question>
+                    <answer>
+                        <para>See the link to the BigTable paper in <xref linkend="other.info" /> in the appendix, as
+                            well as the other papers.
+                        </para>
+                    </answer>
+                </qandaentry>
+                <qandaentry>
+                    <question><para>What is the history of HBase?</para></question>
+                    <answer>
+                        <para>See <xref linkend="hbase.history"/>.
+                        </para>
+                    </answer>
+                </qandaentry>
+            </qandadiv>
+            <qandadiv>
+                <title>Upgrading</title>
+                <qandaentry>
+                    <question>
+                        <para>How do I upgrade Maven-managed projects from HBase 0.94 to HBase 0.96+?</para>
+                    </question>
+                    <answer>
+                        <para>In HBase 0.96, the project moved to a modular structure. Adjust your project's
+                            dependencies to rely upon the <filename>hbase-client</filename> module or another
+                            module as appropriate, rather than a single JAR. You can model your Maven dependency
+                            after one of the following, depending on your targeted version of HBase. See <xref
+                                linkend="upgrade0.96"/> or <xref linkend="upgrade0.98"/> for more
+                            information.</para>
+                        <example>
+                            <title>Maven Dependency for HBase 0.98</title>
+                            <programlisting language="xml"><![CDATA[
+<dependency>
+	<groupId>org.apache.hbase</groupId>
+	<artifactId>hbase-client</artifactId>
+	<version>0.98.5-hadoop2</version>
+</dependency>                
+                ]]></programlisting>
+                        </example>
+                        <example>
+                            <title>Maven Dependency for HBase 0.96</title>
+                            <programlisting language="xml"><![CDATA[
+<dependency>
+	<groupId>org.apache.hbase</groupId>
+	<artifactId>hbase-client</artifactId>
+	<version>0.96.2-hadoop2</version>
+</dependency>             
+                ]]></programlisting>
+                        </example>
+                        <example>
+                            <title>Maven Dependency for HBase 0.94</title>
+                            <programlisting language="xml"><![CDATA[
+<dependency>
+	<groupId>org.apache.hbase</groupId>
+	<artifactId>hbase</artifactId>
+	<version>0.94.3</version>
+</dependency>            
+                ]]></programlisting>
+                        </example>
+                    </answer>
+                </qandaentry>
+            </qandadiv>
+            <qandadiv xml:id="faq.arch"><title>Architecture</title>
+                <qandaentry xml:id="faq.arch.regions">
+                    <question><para>How does HBase handle Region-RegionServer assignment and locality?</para></question>
+                    <answer>
+                        <para>
+                            See <xref linkend="regions.arch" />.
+                        </para>
+                    </answer>
+                </qandaentry>
+            </qandadiv>
+            <qandadiv xml:id="faq.config"><title>Configuration</title>
+                <qandaentry xml:id="faq.config.started">
+                    <question><para>How can I get started with my first cluster?</para></question>
+                    <answer>
+                        <para>
+                            See <xref linkend="quickstart" />.
+                        </para>
+                    </answer>
+                </qandaentry>
+                <qandaentry xml:id="faq.config.options">
+                    <question><para>Where can I learn about the rest of the configuration options?</para></question>
+                    <answer>
+                        <para>
+                            See <xref linkend="configuration" />.
+                        </para>
+                    </answer>
+                </qandaentry>
+            </qandadiv>
+            <qandadiv xml:id="faq.design"><title>Schema Design / Data Access</title>
+                <qandaentry xml:id="faq.design.schema">
+                    <question><para>How should I design my schema in HBase?</para></question>
+                    <answer>
+                        <para>
+                            See <xref linkend="datamodel" /> and <xref linkend="schema" />
+                        </para>
+                    </answer>
+                </qandaentry>
+                <qandaentry>
+                    <question><para>
+                        How can I store (fill in the blank) in HBase?
+                    </para></question>
+                    <answer>
+                        <para>
+                            See <xref linkend="supported.datatypes" />.
+                        </para>
+                    </answer>
+                </qandaentry>
+                <qandaentry xml:id="secondary.indices">
+                    <question><para>
+                        How can I handle secondary indexes in HBase?
+                    </para></question>
+                    <answer>
+                        <para>
+                            See <xref linkend="secondary.indexes" />
+                        </para>
+                    </answer>
+                </qandaentry>
+                <qandaentry xml:id="faq.changing.rowkeys">
+                    <question><para>Can I change a table's rowkeys?</para></question>
+                    <answer>
+                        <para> This is a very common question. You can't. See <xref
+                            linkend="changing.rowkeys"/>. </para>
+                    </answer>
+                </qandaentry>
+                <qandaentry xml:id="faq.apis">
+                    <question><para>What APIs does HBase support?</para></question>
+                    <answer>
+                        <para>
+                            See <xref linkend="datamodel" />, <xref linkend="client" /> and <xref linkend="nonjava.jvm"/>.
+                        </para>
+                    </answer>
+                </qandaentry>
+            </qandadiv>
+            <qandadiv xml:id="faq.mapreduce"><title>MapReduce</title>
+                <qandaentry xml:id="faq.mapreduce.use">
+                    <question><para>How can I use MapReduce with HBase?</para></question>
+                    <answer>
+                        <para>
+                            See <xref linkend="mapreduce" />
+                        </para>
+                    </answer>
+                </qandaentry>
+            </qandadiv>
+            <qandadiv><title>Performance and Troubleshooting</title>
+                <qandaentry>
+                    <question><para>
+                        How can I improve HBase cluster performance?
+                    </para></question>
+                    <answer>
+                        <para>
+                            See <xref linkend="performance" />.
+                        </para>
+                    </answer>
+                </qandaentry>
+                <qandaentry>
+                    <question><para>
+                        How can I troubleshoot my HBase cluster?
+                    </para></question>
+                    <answer>
+                        <para>
+                            See <xref linkend="trouble" />.
+                        </para>
+                    </answer>
+                </qandaentry>
+            </qandadiv>
+            <qandadiv xml:id="ec2"><title>Amazon EC2</title>
+                <qandaentry>
+                    <question><para>
+                        I am running HBase on Amazon EC2 and...
+                    </para></question>
+                    <answer>
+                        <para>
+                            EC2 issues are a special case.  See the Troubleshooting (<xref linkend="trouble.ec2" />) and Performance (<xref linkend="perf.ec2" />) sections.
+                        </para>
+                    </answer>
+                </qandaentry>
+            </qandadiv>
+            <qandadiv xml:id="faq.operations"><title>Operations</title>
+                <qandaentry>
+                    <question><para>
+                        How do I manage my HBase cluster?
+                    </para></question>
+                    <answer>
+                        <para>
+                            See <xref linkend="ops_mgt" />
+                        </para>
+                    </answer>
+                </qandaentry>
+                <qandaentry>
+                    <question><para>
+                        How do I back up my HBase cluster?
+                    </para></question>
+                    <answer>
+                        <para>
+                            See <xref linkend="ops.backup" />
+                        </para>
+                    </answer>
+                </qandaentry>
+            </qandadiv>
+            <qandadiv><title>HBase in Action</title>
+                <qandaentry>
+                    <question><para>Where can I find interesting videos and presentations on HBase?</para></question>
+                    <answer>
+                        <para>
+                            See <xref linkend="other.info" />
+                        </para>
+                    </answer>
+                </qandaentry>
+            </qandadiv>
+        </qandaset>
+    
+</appendix>


[5/8] hbase git commit: HBASE-12738 Chunk Ref Guide into file-per-chapter

Posted by mi...@apache.org.
http://git-wip-us.apache.org/repos/asf/hbase/blob/a1fe1e09/src/main/docbkx/book.xml
----------------------------------------------------------------------
diff --git a/src/main/docbkx/book.xml b/src/main/docbkx/book.xml
index ee2d7fb..3010055 100644
--- a/src/main/docbkx/book.xml
+++ b/src/main/docbkx/book.xml
@@ -1,4 +1,5 @@
 <?xml version="1.0" encoding="UTF-8"?>
+
 <!--
 /**
  *
@@ -80,4926 +81,16 @@
   </info>
 
   <!--XInclude some chapters-->
-  <xi:include
-    xmlns:xi="http://www.w3.org/2001/XInclude"
-    href="preface.xml" />
-  <xi:include
-    xmlns:xi="http://www.w3.org/2001/XInclude"
-    href="getting_started.xml" />
-  <xi:include
-    xmlns:xi="http://www.w3.org/2001/XInclude"
-    href="configuration.xml" />
-  <xi:include
-    xmlns:xi="http://www.w3.org/2001/XInclude"
-    href="upgrading.xml" />
-  <xi:include
-    xmlns:xi="http://www.w3.org/2001/XInclude"
-    href="shell.xml" />
-
-  <chapter
-    xml:id="datamodel">
-    <title>Data Model</title>
-    <para>In HBase, data is stored in tables, which have rows and columns. This is a terminology
-      overlap with relational databases (RDBMSs), but this is not a helpful analogy. Instead, it can
-    be helpful to think of an HBase table as a multi-dimensional map.</para>
-    <variablelist>
-      <title>HBase Data Model Terminology</title>
-      <varlistentry>
-        <term>Table</term>
-        <listitem>
-          <para>An HBase table consists of multiple rows.</para>
-        </listitem>
-      </varlistentry>
-      <varlistentry>
-        <term>Row</term>
-        <listitem>
-          <para>A row in HBase consists of a row key and one or more columns with values associated
-            with them. Rows are sorted alphabetically by the row key as they are stored. For this
-            reason, the design of the row key is very important. The goal is to store data in such a
-            way that related rows are near each other. A common row key pattern is a website domain.
-            If your row keys are domains, you should probably store them in reverse (org.apache.www,
-            org.apache.mail, org.apache.jira). This way, all of the Apache domains are near each
-            other in the table, rather than being spread out based on the first letter of the
-            subdomain.</para>
-        </listitem>
-      </varlistentry>
-      <varlistentry>
-        <term>Column</term>
-        <listitem>
-          <para>A column in HBase consists of a column family and a column qualifier, which are
-            delimited by a <literal>:</literal> (colon) character.</para>
-        </listitem>
-      </varlistentry>
-      <varlistentry>
-        <term>Column Family</term>
-        <listitem>
-          <para>Column families physically colocate a set of columns and their values, often for
-            performance reasons. Each column family has a set of storage properties, such as whether
-            its values should be cached in memory, how its data is compressed or its row keys are
-            encoded, and others. Each row in a table has the same column
-            families, though a given row might not store anything in a given column family.</para>
-          <para>Column families are specified when you create your table, and influence the way your
-            data is stored in the underlying filesystem. Therefore, the column families should be
-            considered carefully during schema design.</para>
-        </listitem>
-      </varlistentry>
-      <varlistentry>
-        <term>Column Qualifier</term>
-        <listitem>
-          <para>A column qualifier is added to a column family to provide the index for a given
-            piece of data. Given a column family <literal>content</literal>, a column qualifier
-            might be <literal>content:html</literal>, and another might be
-            <literal>content:pdf</literal>. Though column families are fixed at table creation,
-            column qualifiers are mutable and may differ greatly between rows.</para>
-        </listitem>
-      </varlistentry>
-      <varlistentry>
-        <term>Cell</term>
-        <listitem>
-          <para>A cell is a combination of row, column family, and column qualifier, and contains a
-            value and a timestamp, which represents the value's version.</para>
-          <para>A cell's value is an uninterpreted array of bytes.</para>
-        </listitem>
-      </varlistentry>
-      <varlistentry>
-        <term>Timestamp</term>
-        <listitem>
-          <para>A timestamp is written alongside each value, and is the identifier for a given
-            version of a value. By default, the timestamp represents the time on the RegionServer
-            when the data was written, but you can specify a different timestamp value when you put
-            data into the cell.</para>
-          <caution>
-            <para>Direct manipulation of timestamps is an advanced feature which is only exposed for
-              special cases that are deeply integrated with HBase, and is discouraged in general.
-              Encoding a timestamp at the application level is the preferred pattern.</para>
-          </caution>
-          <para>You can specify the maximum number of versions of a value that HBase retains, per column
-            family. When the maximum number of versions is reached, the oldest versions are 
-            eventually deleted. By default, only the newest version is kept.</para>
-        </listitem>
-      </varlistentry>
-    </variablelist>
-
-    <section
-      xml:id="conceptual.view">
-      <title>Conceptual View</title>
-      <para>You can read a very understandable explanation of the HBase data model in the blog post <link
-          xlink:href="http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable">Understanding
-          HBase and BigTable</link> by Jim R. Wilson. Another good explanation is available in the
-        PDF <link
-          xlink:href="http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/9353-login1210_khurana.pdf">Introduction
-          to Basic Schema Design</link> by Amandeep Khurana. It may help to read different
-        perspectives to get a solid understanding of HBase schema design. The linked articles cover
-        the same ground as the information in this section.</para>
-      <para> The following example is a slightly modified form of the one on page 2 of the <link
-          xlink:href="http://research.google.com/archive/bigtable.html">BigTable</link> paper. There
-        is a table called <varname>webtable</varname> that contains two rows
-        (<literal>com.cnn.www</literal>
-          and <literal>com.example.www</literal>), three column families named
-          <varname>contents</varname>, <varname>anchor</varname>, and <varname>people</varname>. In
-          this example, for the first row (<literal>com.cnn.www</literal>), 
-          <varname>anchor</varname> contains two columns (<varname>anchor:cssnsi.com</varname>,
-          <varname>anchor:my.look.ca</varname>) and <varname>contents</varname> contains one column
-          (<varname>contents:html</varname>). This example contains 5 versions of the row with the
-        row key <literal>com.cnn.www</literal>, and one version of the row with the row key
-        <literal>com.example.www</literal>. The <varname>contents:html</varname> column qualifier contains the entire
-        HTML of a given website. Qualifiers of the <varname>anchor</varname> column family each
-        contain the external site which links to the site represented by the row, along with the
-        text it used in the anchor of its link. The <varname>people</varname> column family represents
-        people associated with the site.
-      </para>
-        <note>
-          <title>Column Names</title>
-        <para> By convention, a column name is made of its column family prefix and a
-            <emphasis>qualifier</emphasis>. For example, the column
-            <emphasis>contents:html</emphasis> is made up of the column family
-            <varname>contents</varname> and the <varname>html</varname> qualifier. The colon
-          character (<literal>:</literal>) delimits the column family from the column family
-            <emphasis>qualifier</emphasis>. </para>
-        </note>
-        <table
-          frame="all">
-          <title>Table <varname>webtable</varname></title>
-          <tgroup
-            cols="5"
-            align="left"
-            colsep="1"
-            rowsep="1">
-            <colspec
-              colname="c1" />
-            <colspec
-              colname="c2" />
-            <colspec
-              colname="c3" />
-            <colspec
-              colname="c4" />
-            <colspec
-              colname="c5" />
-            <thead>
-              <row>
-                <entry>Row Key</entry>
-                <entry>Time Stamp</entry>
-                <entry>ColumnFamily <varname>contents</varname></entry>
-                <entry>ColumnFamily <varname>anchor</varname></entry>
-                <entry>ColumnFamily <varname>people</varname></entry>
-              </row>
-            </thead>
-            <tbody>
-              <row>
-                <entry>"com.cnn.www"</entry>
-                <entry>t9</entry>
-                <entry />
-                <entry><varname>anchor:cnnsi.com</varname> = "CNN"</entry>
-                <entry />
-              </row>
-              <row>
-                <entry>"com.cnn.www"</entry>
-                <entry>t8</entry>
-                <entry />
-                <entry><varname>anchor:my.look.ca</varname> = "CNN.com"</entry>
-                <entry />
-              </row>
-              <row>
-                <entry>"com.cnn.www"</entry>
-                <entry>t6</entry>
-                <entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry>
-                <entry />
-                <entry />
-              </row>
-              <row>
-                <entry>"com.cnn.www"</entry>
-                <entry>t5</entry>
-                <entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry>
-                <entry />
-                <entry />
-              </row>
-              <row>
-                <entry>"com.cnn.www"</entry>
-                <entry>t3</entry>
-                <entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry>
-                <entry />
-                <entry />
-              </row>
-              <row>
-                <entry>"com.example.www"</entry>
-                <entry>t5</entry>
-                <entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry>
-                <entry></entry>
-                <entry>people:author = "John Doe"</entry>
-              </row>
-            </tbody>
-          </tgroup>
-        </table>
-      <para>Cells in this table that appear to be empty do not take space, and in fact do not exist, in
-        HBase. This is what makes HBase "sparse." A tabular view is not the only possible way to
-        look at data in HBase, or even the most accurate. The following represents the same
-        information as a multi-dimensional map. This is only a mock-up for illustrative
-        purposes and may not be strictly accurate.</para>
-      <programlisting><![CDATA[
-{
-	"com.cnn.www": {
-		contents: {
-			t6: contents:html: "<html>..."
-			t5: contents:html: "<html>..."
-			t3: contents:html: "<html>..."
-		}
-		anchor: {
-			t9: anchor:cnnsi.com = "CNN"
-			t8: anchor:my.look.ca = "CNN.com"
-		}
-		people: {}
-	}
-	"com.example.www": {
-		contents: {
-			t5: contents:html: "<html>..."
-		}
-		anchor: {}
-		people: {
-			t5: people:author: "John Doe"
-		}
-	}
-}        
-        ]]></programlisting>
-
-    </section>
-    <section
-      xml:id="physical.view">
-      <title>Physical View</title>
-      <para> Although at a conceptual level tables may be viewed as a sparse set of rows, they are
-        physically stored by column family. A new column qualifier (column_family:column_qualifier)
-        can be added to an existing column family at any time.</para>
-      <table
-        frame="all">
-        <title>ColumnFamily <varname>anchor</varname></title>
-        <tgroup
-          cols="3"
-          align="left"
-          colsep="1"
-          rowsep="1">
-          <colspec
-            colname="c1" />
-          <colspec
-            colname="c2" />
-          <colspec
-            colname="c3" />
-          <thead>
-            <row>
-              <entry>Row Key</entry>
-              <entry>Time Stamp</entry>
-              <entry>Column Family <varname>anchor</varname></entry>
-            </row>
-          </thead>
-          <tbody>
-            <row>
-              <entry>"com.cnn.www"</entry>
-              <entry>t9</entry>
-              <entry><varname>anchor:cnnsi.com</varname> = "CNN"</entry>
-            </row>
-            <row>
-              <entry>"com.cnn.www"</entry>
-              <entry>t8</entry>
-              <entry><varname>anchor:my.look.ca</varname> = "CNN.com"</entry>
-            </row>
-          </tbody>
-        </tgroup>
-      </table>
-      <table
-        frame="all">
-        <title>ColumnFamily <varname>contents</varname></title>
-        <tgroup
-          cols="3"
-          align="left"
-          colsep="1"
-          rowsep="1">
-          <colspec
-            colname="c1" />
-          <colspec
-            colname="c2" />
-          <colspec
-            colname="c3" />
-          <thead>
-            <row>
-              <entry>Row Key</entry>
-              <entry>Time Stamp</entry>
-              <entry>ColumnFamily "contents:"</entry>
-            </row>
-          </thead>
-          <tbody>
-            <row>
-              <entry>"com.cnn.www"</entry>
-              <entry>t6</entry>
-              <entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry>
-            </row>
-            <row>
-              <entry>"com.cnn.www"</entry>
-              <entry>t5</entry>
-              <entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry>
-            </row>
-            <row>
-              <entry>"com.cnn.www"</entry>
-              <entry>t3</entry>
-              <entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry>
-            </row>
-          </tbody>
-        </tgroup>
-      </table>
-      <para>The empty cells shown in the
-        conceptual view are not stored at all.
-        Thus a request for the value of the <varname>contents:html</varname> column at time stamp
-          <literal>t8</literal> would return no value. Similarly, a request for an
-          <varname>anchor:my.look.ca</varname> value at time stamp <literal>t9</literal> would
-        return no value. However, if no timestamp is supplied, the most recent value for a
-        particular column would be returned. Given multiple versions, the most recent is also the
-        first one found,  since timestamps
-        are stored in descending order. Thus a request for the values of all columns in the row
-          <varname>com.cnn.www</varname> if no timestamp is specified would be: the value of
-          <varname>contents:html</varname> from timestamp <literal>t6</literal>, the value of
-          <varname>anchor:cnnsi.com</varname> from timestamp <literal>t9</literal>, the value of
-          <varname>anchor:my.look.ca</varname> from timestamp <literal>t8</literal>. </para>
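-      <para>To make this concrete, the following is a minimal sketch of a timestamped read,
-        assuming the <varname>webtable</varname> example above and using the literal value
-        <literal>8</literal> to stand in for the symbolic timestamp <literal>t8</literal>. A Get
-        restricted to that exact timestamp finds no <varname>contents:html</varname> cell, while
-        the same Get without a timestamp returns the value from <literal>t6</literal>.</para>
-<programlisting language="java">
-public static final byte[] CONTENTS = "contents".getBytes();
-public static final byte[] HTML = "html".getBytes();
-...
-Table table = ...                        // instantiate a Table instance
-
-Get get = new Get(Bytes.toBytes("com.cnn.www"));
-get.addColumn(CONTENTS, HTML);
-get.setTimeStamp(8);                     // ask for contents:html at exactly t8
-Result r = table.get(get);
-byte[] b = r.getValue(CONTENTS, HTML);   // null: no cell exists at that timestamp
-</programlisting>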
-      <para>For more information about the internals of how Apache HBase stores data, see <xref
-          linkend="regions.arch" />. </para>
-    </section>
-
-    <section
-      xml:id="namespace">
-      <title>Namespace</title>
-      <para> A namespace is a logical grouping of tables analogous to a database in relational
-        database systems. This abstraction lays the groundwork for upcoming multi-tenancy related
-        features: <itemizedlist>
-          <listitem>
-            <para>Quota Management (HBASE-8410) - Restrict the amount of resources (i.e., regions,
-              tables) a namespace can consume.</para>
-          </listitem>
-          <listitem>
-            <para>Namespace Security Administration (HBASE-9206) - provide another level of security
-              administration for tenants.</para>
-          </listitem>
-          <listitem>
-            <para>Region server groups (HBASE-6721) - A namespace/table can be pinned onto a subset
-              of regionservers, thus guaranteeing a coarse level of isolation.</para>
-          </listitem>
-        </itemizedlist>
-      </para>
-      <section
-        xml:id="namespace_creation">
-        <title>Namespace management</title>
-        <para> A namespace can be created, removed or altered. Namespace membership is determined
-          during table creation by specifying a fully-qualified table name of the form:</para>
-
-        <programlisting language="xml"><![CDATA[<table namespace>:<table qualifier>]]></programlisting>
-
-
-        <example>
-          <title>Examples</title>
-
-          <programlisting language="bourne">
-#Create a namespace
-create_namespace 'my_ns'
-            </programlisting>
-          <programlisting language="bourne">
-#create my_table in my_ns namespace
-create 'my_ns:my_table', 'fam'
-          </programlisting>
-          <programlisting language="bourne">
-#drop namespace
-drop_namespace 'my_ns'
-          </programlisting>
-          <programlisting language="bourne">
-#alter namespace
-alter_namespace 'my_ns', {METHOD => 'set', 'PROPERTY_NAME' => 'PROPERTY_VALUE'}
-        </programlisting>
-        </example>
-      </section>
-      <section
-        xml:id="namespace_special">
-        <title>Predefined namespaces</title>
-        <para> There are two predefined special namespaces: </para>
-        <itemizedlist>
-          <listitem>
-            <para>hbase - system namespace, used to contain hbase internal tables</para>
-          </listitem>
-          <listitem>
-            <para>default - tables with no explicitly specified namespace will automatically fall into
-              this namespace.</para>
-          </listitem>
-        </itemizedlist>
-        <example>
-          <title>Examples</title>
-
-          <programlisting language="bourne">
-#namespace=foo and table qualifier=bar
-create 'foo:bar', 'fam'
-
-#namespace=default and table qualifier=bar
-create 'bar', 'fam'
-</programlisting>
-        </example>
-      </section>
-    </section>
-
-    <section
-      xml:id="table">
-      <title>Table</title>
-      <para> Tables are declared up front at schema definition time. </para>
-    </section>
-
-    <section
-      xml:id="row">
-      <title>Row</title>
-      <para>Row keys are uninterpreted bytes. Rows are lexicographically sorted with the lowest
-        order appearing first in a table. The empty byte array is used to denote both the start and
-        end of a table's namespace.</para>
-    </section>
-
-    <section
-      xml:id="columnfamily">
-      <title>Column Family<indexterm><primary>Column Family</primary></indexterm></title>
-      <para> Columns in Apache HBase are grouped into <emphasis>column families</emphasis>. All
-        column members of a column family have the same prefix. For example, the columns
-          <emphasis>courses:history</emphasis> and <emphasis>courses:math</emphasis> are both
-        members of the <emphasis>courses</emphasis> column family. The colon character
-          (<literal>:</literal>) delimits the column family from the <indexterm><primary>column
-            family qualifier</primary><secondary>Column Family Qualifier</secondary></indexterm>.
-        The column family prefix must be composed of <emphasis>printable</emphasis> characters. The
-        qualifying tail, the column family <emphasis>qualifier</emphasis>, can be made of any
-        arbitrary bytes. Column families must be declared up front at schema definition time whereas
-        columns do not need to be defined at schema time but can be conjured on the fly while the
-        table is up and running.</para>
-      <para>Physically, all column family members are stored together on the filesystem. Because
-        tunings and storage specifications are done at the column family level, it is advised that
-        all column family members have the same general access pattern and size
-        characteristics.</para>
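-      <para>As a minimal sketch, the following shows how column families and their storage
-        settings are declared at table creation time through the Java admin API. The table name
-        <literal>t1</literal> and family name <literal>cf</literal> are hypothetical names used
-        only for illustration.</para>
-<programlisting language="java">
-Admin admin = ...   // obtained from a Connection
-
-HTableDescriptor tableDesc = new HTableDescriptor(TableName.valueOf("t1"));
-HColumnDescriptor family = new HColumnDescriptor("cf");
-family.setMaxVersions(3);          // storage settings are per column family
-tableDesc.addFamily(family);
-admin.createTable(tableDesc);
-</programlisting>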
-
-    </section>
-    <section
-      xml:id="cells">
-      <title>Cells<indexterm><primary>Cells</primary></indexterm></title>
-      <para>A <emphasis>{row, column, version} </emphasis>tuple exactly specifies a
-          <literal>cell</literal> in HBase. Cell content is uninterpreted bytes.</para>
-    </section>
-    <section
-      xml:id="data_model_operations">
-      <title>Data Model Operations</title>
-      <para>The four primary data model operations are Get, Put, Scan, and Delete. Operations are
-        applied via <link
-          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html">Table</link>
-        instances.
-      </para>
-      <section
-        xml:id="get">
-        <title>Get</title>
-        <para><link
-            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html">Get</link>
-          returns attributes for a specified row. Gets are executed via <link
-            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#get(org.apache.hadoop.hbase.client.Get)">
-            Table.get</link>. </para>
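-        <para>The following is a minimal sketch of a single-row Get, assuming a table with a
-          column family <literal>cf</literal> and qualifier <literal>attr</literal> (hypothetical
-          names used only for illustration).</para>
-<programlisting language="java">
-public static final byte[] CF = "cf".getBytes();
-public static final byte[] ATTR = "attr".getBytes();
-...
-Table table = ...      // instantiate a Table instance
-
-Get get = new Get(Bytes.toBytes("row1"));
-get.addColumn(CF, ATTR);          // restrict the Get to a single column
-Result r = table.get(get);
-byte[] b = r.getValue(CF, ATTR);  // null if the cell does not exist
-</programlisting>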
-      </section>
-      <section
-        xml:id="put">
-        <title>Put</title>
-        <para><link
-            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Put.html">Put</link>
-          either adds new rows to a table (if the key is new) or can update existing rows (if the
-          key already exists). Puts are executed via <link
-            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#put(org.apache.hadoop.hbase.client.Put)">
-            Table.put</link> (writeBuffer) or <link
-            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#batch(java.util.List, java.lang.Object[])">
-            Table.batch</link> (non-writeBuffer). </para>
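-        <para>The following is a minimal sketch of a single Put, using the same hypothetical
-          <literal>cf</literal>/<literal>attr</literal> names as the Get sketch above.</para>
-<programlisting language="java">
-public static final byte[] CF = "cf".getBytes();
-public static final byte[] ATTR = "attr".getBytes();
-...
-Table table = ...      // instantiate a Table instance
-
-Put put = new Put(Bytes.toBytes("row1"));
-put.add(CF, ATTR, Bytes.toBytes("some data"));  // add one cell to the Put
-table.put(put);
-</programlisting>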
-      </section>
-      <section
-        xml:id="scan">
-        <title>Scans</title>
-        <para><link
-            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html">Scan</link>
-          allow iteration over multiple rows for specified attributes. </para>
-        <para>The following is an example of a Scan on a Table instance. Assume that a table is
-          populated with rows with keys "row1", "row2", "row3", and then another set of rows with
-          the keys "abc1", "abc2", and "abc3". The following example shows how to set a Scan
-          instance to return the rows beginning with "row".</para>
-<programlisting language="java">
-public static final byte[] CF = "cf".getBytes();
-public static final byte[] ATTR = "attr".getBytes();
-...
-
-Table table = ...      // instantiate a Table instance
-
-Scan scan = new Scan();
-scan.addColumn(CF, ATTR);
-scan.setRowPrefixFilter(Bytes.toBytes("row"));
-ResultScanner rs = table.getScanner(scan);
-try {
-  for (Result r = rs.next(); r != null; r = rs.next()) {
-    // process result...
-  }
-} finally {
-  rs.close();  // always close the ResultScanner!
-}
-</programlisting>
-        <para>Note that generally the easiest way to specify a specific stop point for a scan is by
-          using the <link
-            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/InclusiveStopFilter.html">InclusiveStopFilter</link>
-          class. </para>
-      </section>
-      <section
-        xml:id="delete">
-        <title>Delete</title>
-        <para><link
-            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Delete.html">Delete</link>
-          removes a row from a table. Deletes are executed via <link
-            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html#delete(org.apache.hadoop.hbase.client.Delete)">
-            HTable.delete</link>. </para>
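-        <para>The following is a minimal sketch of a Delete, again using the hypothetical
-          <literal>cf</literal>/<literal>attr</literal> names from the sketches above.</para>
-<programlisting language="java">
-public static final byte[] CF = "cf".getBytes();
-public static final byte[] ATTR = "attr".getBytes();
-...
-Table table = ...      // instantiate a Table instance
-
-Delete delete = new Delete(Bytes.toBytes("row1"));
-delete.deleteColumns(CF, ATTR);  // delete all versions of one column; omit to delete the whole row
-table.delete(delete);
-</programlisting>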
-        <para>HBase does not modify data in place, and so deletes are handled by creating new
-          markers called <emphasis>tombstones</emphasis>. These tombstones, along with the dead
-          values, are cleaned up on major compactions. </para>
-        <para>See <xref
-            linkend="version.delete" /> for more information on deleting versions of columns, and
-          see <xref
-            linkend="compaction" /> for more information on compactions. </para>
-
-      </section>
-
-    </section>
-
-
-    <section
-      xml:id="versions">
-      <title>Versions<indexterm><primary>Versions</primary></indexterm></title>
-
-      <para>A <emphasis>{row, column, version} </emphasis>tuple exactly specifies a
-          <literal>cell</literal> in HBase. It's possible to have an unbounded number of cells where
-        the row and column are the same but the cell address differs only in its version
-        dimension.</para>
-
-      <para>While rows and column keys are expressed as bytes, the version is specified using a long
-        integer. Typically this long contains time instances such as those returned by
-          <code>java.util.Date.getTime()</code> or <code>System.currentTimeMillis()</code>, that is:
-          <quote>the difference, measured in milliseconds, between the current time and midnight,
-          January 1, 1970 UTC</quote>.</para>
-
-      <para>The HBase version dimension is stored in decreasing order, so that when reading from a
-        store file, the most recent values are found first.</para>
-
-      <para>There is a lot of confusion over the semantics of <literal>cell</literal> versions, in
-        HBase. In particular:</para>
-      <itemizedlist>
-        <listitem>
-          <para>If multiple writes to a cell have the same version, only the last written is
-            fetchable.</para>
-        </listitem>
-
-        <listitem>
-          <para>It is OK to write cells in a non-increasing version order.</para>
-        </listitem>
-      </itemizedlist>
-
-      <para>Below we describe how the version dimension in HBase currently works. See <link
-              xlink:href="https://issues.apache.org/jira/browse/HBASE-2406">HBASE-2406</link> for
-            discussion of HBase versions. <link
-              xlink:href="http://outerthought.org/blog/417-ot.html">Bending time in HBase</link>
-            makes for a good read on the version, or time, dimension in HBase. It has more detail on
-            versioning than is provided here. As of this writing, the limitation
-              <emphasis>Overwriting values at existing timestamps</emphasis> mentioned in the
-            article no longer holds in HBase. This section is basically a synopsis of this article
-            by Bruno Dumon.</para>
-      
-      <section xml:id="specify.number.of.versions">
-        <title>Specifying the Number of Versions to Store</title>
-        <para>The maximum number of versions to store for a given column is part of the column
-          schema and is specified at table creation, or via an <command>alter</command> command, via
-            <code>HColumnDescriptor.DEFAULT_VERSIONS</code>. Prior to HBase 0.96, the default number
-          of versions kept was <literal>3</literal>, but in 0.96 and newer it has been changed to
-            <literal>1</literal>.</para>
-        <example>
-          <title>Modify the Maximum Number of Versions for a Column</title>
-          <para>This example uses HBase Shell to keep a maximum of 5 versions of column
-              <code>f1</code>. You could also use <link
-              xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html"
-              >HColumnDescriptor</link>.</para>
-          <screen><![CDATA[hbase> alter 't1', NAME => 'f1', VERSIONS => 5]]></screen>
-        </example>
-        <example>
-          <title>Modify the Minimum Number of Versions for a Column</title>
-          <para>You can also specify the minimum number of versions to store. By default, this is
-            set to 0, which means the feature is disabled. The following example sets the minimum
-            number of versions on field <code>f1</code> to <literal>2</literal>, via HBase Shell.
-            You could also use <link
-              xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html"
-              >HColumnDescriptor</link>.</para>
-          <screen><![CDATA[hbase> alter 't1', NAME => 'f1', MIN_VERSIONS => 2]]></screen>
-        </example>
-        <para>Starting with HBase 0.98.2, you can specify a global default for the maximum number of
-          versions kept for all newly-created columns, by setting
-            <option>hbase.column.max.version</option> in <filename>hbase-site.xml</filename>. See
-            <xref linkend="hbase.column.max.version"/>.</para>
-      </section>
-
-      <section
-        xml:id="versions.ops">
-        <title>Versions and HBase Operations</title>
-
-        <para>In this section we look at the behavior of the version dimension for each of the core
-          HBase operations.</para>
-
-        <section>
-          <title>Get/Scan</title>
-
-          <para>Gets are implemented on top of Scans. The below discussion of <link
-              xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html">Get</link>
-            applies equally to <link
-              xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html">Scans</link>.</para>
-
-          <para>By default, i.e. if you specify no explicit version, when doing a
-              <literal>get</literal>, the cell whose version has the largest value is returned
-            (which may or may not be the latest one written, see later). The default behavior can be
-            modified in the following ways:</para>
-
-          <itemizedlist>
-            <listitem>
-              <para>to return more than one version, see <link
-                  xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html#setMaxVersions()">Get.setMaxVersions()</link></para>
-            </listitem>
-
-            <listitem>
-              <para>to return versions other than the latest, see <link
-                  xlink:href="???">Get.setTimeRange()</link></para>
-
-              <para>To retrieve the latest version that is less than or equal to a given value, thus
-                giving the 'latest' state of the record at a certain point in time, just use a range
-                from 0 to the desired version and set the max versions to 1, as sketched after this
-                list.</para>
-            </listitem>
-          </itemizedlist>
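-          <para>The following is a hedged sketch of that point-in-time pattern; the row, family,
-            and qualifier names are illustrative only.</para>
-          <programlisting language="java">
-public static final byte[] CF = "cf".getBytes();
-public static final byte[] ATTR = "attr".getBytes();
-...
-long pointInTime = 555;  // just an example
-Get get = new Get(Bytes.toBytes("row1"));
-get.setTimeRange(0, pointInTime + 1);  // the upper bound is exclusive, so add 1
-get.setMaxVersions(1);                 // return only the newest version within the range
-Result r = table.get(get);
-byte[] b = r.getValue(CF, ATTR);       // the value as of pointInTime
-</programlisting>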
-
-        </section>
-        <section
-          xml:id="default_get_example">
-          <title>Default Get Example</title>
-          <para>The following Get will retrieve only the current version of the row.</para>
-          <programlisting language="java">
-public static final byte[] CF = "cf".getBytes();
-public static final byte[] ATTR = "attr".getBytes();
-...
-Get get = new Get(Bytes.toBytes("row1"));
-Result r = table.get(get);
-byte[] b = r.getValue(CF, ATTR);  // returns current version of value
-</programlisting>
-        </section>
-        <section
-          xml:id="versioned_get_example">
-          <title>Versioned Get Example</title>
-          <para>The following Get will return the last 3 versions of the row.</para>
-          <programlisting language="java">
-public static final byte[] CF = "cf".getBytes();
-public static final byte[] ATTR = "attr".getBytes();
-...
-Get get = new Get(Bytes.toBytes("row1"));
-get.setMaxVersions(3);  // will return last 3 versions of row
-Result r = table.get(get);
-byte[] b = r.getValue(CF, ATTR);  // returns current version of value
-List&lt;KeyValue&gt; kv = r.getColumn(CF, ATTR);  // returns all versions of this column
-</programlisting>
-        </section>
-
-        <section>
-          <title>Put</title>
-
-          <para>Doing a put always creates a new version of a <literal>cell</literal>, at a certain
-            timestamp. By default the system uses the server's <literal>currentTimeMillis</literal>,
-            but you can specify the version (= the long integer) yourself, on a per-column level.
-            This means you could assign a time in the past or the future, or use the long value for
-            non-time purposes.</para>
-
-          <para>To overwrite an existing value, do a put at exactly the same row, column, and
-            version as that of the cell you would overshadow.</para>
-          <section
-            xml:id="implicit_version_example">
-            <title>Implicit Version Example</title>
-            <para>The following Put will be implicitly versioned by HBase with the current
-              time.</para>
-            <programlisting language="java">
-public static final byte[] CF = "cf".getBytes();
-public static final byte[] ATTR = "attr".getBytes();
-...
-Put put = new Put(Bytes.toBytes(row));
-put.add(CF, ATTR, Bytes.toBytes(data));
-table.put(put);
-</programlisting>
-          </section>
-          <section
-            xml:id="explicit_version_example">
-            <title>Explicit Version Example</title>
-            <para>The following Put has the version timestamp explicitly set.</para>
-            <programlisting language="java">
-public static final byte[] CF = "cf".getBytes();
-public static final byte[] ATTR = "attr".getBytes();
-...
-Put put = new Put(Bytes.toBytes(row));
-long explicitTimeInMs = 555;  // just an example
-put.add(CF, ATTR, explicitTimeInMs, Bytes.toBytes(data));
-table.put(put);
-</programlisting>
-            <para>Caution: the version timestamp is used internally by HBase for things like time-to-live
-              calculations. It's usually best to avoid setting this timestamp yourself. Prefer using
-              a separate timestamp attribute of the row, or having the timestamp be part of the rowkey,
-              or both. </para>
-          </section>
-
-        </section>
-
-        <section
-          xml:id="version.delete">
-          <title>Delete</title>
-
-          <para>There are three different types of internal delete markers. See Lars Hofhansl's blog
-            for a discussion of his attempt at adding another, <link
-              xlink:href="http://hadoop-hbase.blogspot.com/2012/01/scanning-in-hbase.html">Scanning
-              in HBase: Prefix Delete Marker</link>. </para>
-          <itemizedlist>
-            <listitem>
-              <para>Delete: for a specific version of a column.</para>
-            </listitem>
-            <listitem>
-              <para>Delete column: for all versions of a column.</para>
-            </listitem>
-            <listitem>
-              <para>Delete family: for all columns of a particular ColumnFamily.</para>
-            </listitem>
-          </itemizedlist>
-          <para>When deleting an entire row, HBase will internally create a tombstone for each
-            ColumnFamily (i.e., not each individual column). </para>
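-          <para>A hedged sketch of how the three marker types map onto the client API follows; the
-            row, family, qualifier, and timestamp names are illustrative, and older releases use
-            the equivalent <code>deleteColumn</code>, <code>deleteColumns</code>, and
-            <code>deleteFamily</code> methods.</para>
-          <programlisting language="java">
-Delete d = new Delete(Bytes.toBytes("row1"));
-d.addColumn(CF, ATTR, ts);  // Delete: one specific version of a column
-d.addColumns(CF, ATTR);     // Delete column: all versions of a column
-d.addFamily(CF);            // Delete family: all columns of the ColumnFamily
-table.delete(d);
-</programlisting>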
-          <para>Deletes work by creating <emphasis>tombstone</emphasis> markers. For example, let's
-            suppose we want to delete a row. For this you can specify a version, or else by default
-            the <literal>currentTimeMillis</literal> is used. What this means is <quote>delete all
-              cells where the version is less than or equal to this version</quote>. HBase never
-            modifies data in place, so for example a delete will not immediately delete (or mark as
-            deleted) the entries in the storage file that correspond to the delete condition.
-            Rather, a so-called <emphasis>tombstone</emphasis> is written, which will mask the
-            deleted values. When HBase does a major compaction, the tombstones are processed to
-            actually remove the dead values, together with the tombstones themselves. If the version
-            you specified when deleting a row is larger than the version of any value in the row,
-            then you can consider the complete row to be deleted.</para>
-          <para>For an informative discussion on how deletes and versioning interact, see the thread <link
-              xlink:href="http://comments.gmane.org/gmane.comp.java.hadoop.hbase.user/28421">Put w/
-              timestamp -> Deleteall -> Put w/ timestamp fails</link> up on the user mailing
-            list.</para>
-          <para>Also see <xref
-              linkend="keyvalue" /> for more information on the internal KeyValue format. </para>
-          <para>Delete markers are purged during the next major compaction of the store, unless the
-              <option>KEEP_DELETED_CELLS</option> option is set in the column family. To keep the
-            deletes for a configurable amount of time, you can set the delete TTL via the
-              <option>hbase.hstore.time.to.purge.deletes</option> property in
-              <filename>hbase-site.xml</filename>. If
-              <option>hbase.hstore.time.to.purge.deletes</option> is not set, or set to 0, all
-            delete markers, including those with timestamps in the future, are purged during the
-            next major compaction. Otherwise, a delete marker with a timestamp in the future is kept
-            until the major compaction which occurs after the time represented by the marker's
-            timestamp plus the value of <option>hbase.hstore.time.to.purge.deletes</option>, in
-            milliseconds. </para>
-          <note>
-            <para>This behavior represents a fix for an unexpected change that was introduced in
-              HBase 0.94, and was fixed in <link
-                xlink:href="https://issues.apache.org/jira/browse/HBASE-10118">HBASE-10118</link>.
-              The change has been backported to HBase 0.94 and newer branches.</para>
-          </note>
-        </section>
-      </section>
-
-      <section>
-        <title>Current Limitations</title>
-
-        <section>
-          <title>Deletes mask Puts</title>
-
-          <para>Deletes mask puts, even puts that happened after the delete
-          was entered. See <link xlink:href="https://issues.apache.org/jira/browse/HBASE-2256"
-              >HBASE-2256</link>. Remember that a delete writes a tombstone, which only
-          disappears after the next major compaction has run. Suppose you do
-          a delete of everything &lt;= T. After this you do a new put with a
-          timestamp &lt;= T. This put, even though it happened after the delete,
-          will be masked by the delete tombstone. Performing the put will not
-          fail, but when you do a get you will notice the put had no
-          effect. It will start working again after the major compaction has
-          run. These issues should not be a problem if you use
-          always-increasing versions for new puts to a row. But they can occur
-          even if you do not care about time: just issue a delete and a put
-          immediately after each other, and there is some chance they happen
-          within the same millisecond, as illustrated in the sketch below.</para>
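-          <para>A hedged sketch of the scenario (names, values, and timestamps are illustrative
-            only; method names as in the Delete sketch above):</para>
-          <programlisting language="java">
-long T = 1389911310000L;  // just an example
-Delete d = new Delete(Bytes.toBytes("row1"));
-d.addColumns(CF, ATTR, T);                              // tombstone everything &lt;= T
-table.delete(d);
-
-Put put = new Put(Bytes.toBytes("row1"));
-put.add(CF, ATTR, T - 1, Bytes.toBytes("late value"));  // older timestamp; the put succeeds...
-table.put(put);
-
-Result r = table.get(new Get(Bytes.toBytes("row1")));
-byte[] b = r.getValue(CF, ATTR);  // ...but this is null until a major compaction purges the tombstone
-</programlisting>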
-        </section>
-
-        <section
-          xml:id="major.compactions.change.query.results">
-          <title>Major compactions change query results</title>
-          
-          <para><quote>...create three cell versions at t1, t2 and t3, with a maximum-versions
-              setting of 2. So when getting all versions, only the values at t2 and t3 will be
-              returned. But if you delete the version at t2 or t3, the one at t1 will appear again.
-              Obviously, once a major compaction has run, such behavior will not be the case
-              anymore...</quote> (See <emphasis>Garbage Collection</emphasis> in <link
-              xlink:href="http://outerthought.org/blog/417-ot.html">Bending time in
-            HBase</link>.)</para>
-        </section>
-      </section>
-    </section>
-    <section xml:id="dm.sort">
-      <title>Sort Order</title>
-      <para>All data model operations in HBase return data in sorted order:  first by row,
-      then by ColumnFamily, followed by column qualifier, and finally timestamp (sorted
-      in reverse, so newest records are returned first).
-      </para>
-    </section>
-    <section xml:id="dm.column.metadata">
-      <title>Column Metadata</title>
-      <para>There is no store of column metadata outside of the internal KeyValue instances for a ColumnFamily.
-      Thus, while HBase supports not only a large number of columns per row but also a heterogeneous set of columns
-      between rows, it is your responsibility to keep track of the column names.
-      </para>
-      <para>The only way to get a complete set of columns that exist for a ColumnFamily is to process all the rows.
-      For more information about how HBase stores data internally, see <xref linkend="keyvalue" />.
-	  </para>
-    </section>
-    <section xml:id="joins"><title>Joins</title>
-      <para>Whether HBase supports joins is a common question on the dist-list, and there is a simple answer:  it doesn't,
-      at least not in the way that RDBMSs support them (e.g., with equi-joins or outer-joins in SQL).  As has been illustrated
-      in this chapter, the read data model operations in HBase are Get and Scan.
-      </para>
-      <para>However, that doesn't mean that equivalent join functionality can't be supported in your application; you
-      just have to do it yourself.  The two primary strategies are denormalizing the data when writing to HBase,
-      or having lookup tables and doing the join between HBase tables in your application or MapReduce code (and, as RDBMSs
-      demonstrate, there are several strategies for this depending on the size of the tables, e.g., nested loops vs.
-      hash-joins).  So which is the best approach?  It depends on what you are trying to do, and as such there isn't a single
-      answer that works for every use case.
-      </para>
-    </section>
-    <section xml:id="acid"><title>ACID</title>
-        <para>See <link xlink:href="http://hbase.apache.org/acid-semantics.html">ACID Semantics</link>.
-            Lars Hofhansl has also written a note on
-            <link xlink:href="http://hadoop-hbase.blogspot.com/2012/03/acid-in-hbase.html">ACID in HBase</link>.</para>
-    </section>
-  </chapter>  <!-- data model -->
-
-  <!--  schema design -->
+  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="preface.xml"/>
+  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="getting_started.xml"/>
+  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="configuration.xml"/>
+  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="upgrading.xml"/>
+  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="shell.xml"/>
+  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="datamodel.xml"/>
   <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="schema_design.xml"/>
-
-  <chapter
-    xml:id="mapreduce">
-    <title>HBase and MapReduce</title>
-    <para>Apache MapReduce is a software framework used to analyze large amounts of data, and is
-      the framework used most often with <link
-        xlink:href="http://hadoop.apache.org/">Apache Hadoop</link>. MapReduce itself is out of the
-      scope of this document. A good place to get started with MapReduce is <link
-        xlink:href="http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html" />. MapReduce version
-      2 (MR2) is now part of <link
-        xlink:href="http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/">YARN</link>. </para>
-
-    <para> This chapter discusses specific configuration steps you need to take to use MapReduce on
-      data within HBase. In addition, it discusses other interactions and issues between HBase and
-      MapReduce jobs.
-      <note> 
-      <title>mapred and mapreduce</title>
-      <para>There are two mapreduce packages in HBase as in MapReduce itself: <filename>org.apache.hadoop.hbase.mapred</filename>
-      and <filename>org.apache.hadoop.hbase.mapreduce</filename>. The former uses the old-style API and the latter
-      the new style.  The latter has more facilities, though you can usually find an equivalent in the older
-      package.  Pick the package that goes with your MapReduce deployment.  When in doubt or starting over, pick
-      <filename>org.apache.hadoop.hbase.mapreduce</filename>.  In the notes below, we refer to
-      o.a.h.h.mapreduce, but replace it with o.a.h.h.mapred if that is what you are using.
-      </para>
-      </note> 
-    </para>
-
-    <section
-      xml:id="hbase.mapreduce.classpath">
-      <title>HBase, MapReduce, and the CLASSPATH</title>
-      <para>By default, MapReduce jobs deployed to a MapReduce cluster do not have access to either
-        the HBase configuration under <envar>$HBASE_CONF_DIR</envar> or the HBase classes.</para>
-      <para>To give the MapReduce jobs the access they need, you could add
-          <filename>hbase-site.xml</filename> to the
-            <filename><replaceable>$HADOOP_HOME</replaceable>/conf/</filename> directory and add the
-        HBase JARs to the <filename><replaceable>$HADOOP_HOME</replaceable>/lib/</filename>
-        directory, then copy these changes across your cluster. Alternatively, you could edit
-          <filename><replaceable>$HADOOP_HOME</replaceable>/conf/hadoop-env.sh</filename> and add
-        the HBase JARs to the <envar>HADOOP_CLASSPATH</envar> variable. However, neither approach is
-        recommended because it will pollute your Hadoop install with HBase references. It also
-        requires you to restart the Hadoop cluster before Hadoop can use the HBase data.</para>
-      <para> Since HBase 0.90.x, HBase adds its dependency JARs to the job configuration itself. The
-        dependencies only need to be available on the local CLASSPATH. The following example runs
-        the bundled HBase <link
-          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html">RowCounter</link>
-        MapReduce job against a table named <systemitem>usertable</systemitem>. If you have not set
-        the environment variables expected in the command (the parts prefixed by a
-          <literal>$</literal> sign and curly braces), you can use the actual system paths instead.
-        Be sure to use the correct version of the HBase JAR for your system. The backticks
-          (<literal>`</literal> symbols) cause the shell to execute the sub-commands, setting the
-        CLASSPATH as part of the command. This example assumes you use a BASH-compatible shell. </para>
-      <screen language="bourne">$ <userinput>HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar rowcounter usertable</userinput></screen>
-      <para>When the command runs, the HBase JAR finds the dependencies it needs (ZooKeeper,
-        Guava, and so on) on the passed <envar>HADOOP_CLASSPATH</envar>
-        and adds the JARs to the MapReduce job configuration. See the source at
-        TableMapReduceUtil#addDependencyJars(org.apache.hadoop.mapreduce.Job) for how this is done. </para>
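-      <para>As a hedged illustration (this snippet is not from the original text), a job you
-        assemble yourself can request the same behavior explicitly; the
-        <classname>TableMapReduceUtil</classname> <code>initTable*Job</code> helpers call it for
-        you.</para>
-      <programlisting language="java">
-Job job = Job.getInstance(HBaseConfiguration.create(), "MyHBaseJob");
-// Ship the HBase dependency JARs (ZooKeeper, Guava, protobuf, and so on) with the job
-// rather than relying on them being installed on every node.
-TableMapReduceUtil.addDependencyJars(job);
-</programlisting>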
-      <note>
-        <para> The example may not work if you are running HBase from its build directory rather
-          than an installed location. You may see an error like the following:</para>
-        <screen>java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper</screen>
-        <para>If this occurs, try modifying the command as follows, so that it uses the HBase JARs
-          from the <filename>target/</filename> directory within the build environment.</para>
-        <screen language="bourne">$ <userinput>HADOOP_CLASSPATH=${HBASE_HOME}/hbase-server/target/hbase-server-VERSION-SNAPSHOT.jar:`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server/target/hbase-server-VERSION-SNAPSHOT.jar rowcounter usertable</userinput></screen>
-      </note>
-      <caution>
-        <title>Notice to MapReduce users of HBase 0.96.1 and above</title>
-        <para>Some mapreduce jobs that use HBase fail to launch. The symptom is an exception similar
-          to the following:</para>
-        <screen>
-Exception in thread "main" java.lang.IllegalAccessError: class
-    com.google.protobuf.ZeroCopyLiteralByteString cannot access its superclass
-    com.google.protobuf.LiteralByteString
-    at java.lang.ClassLoader.defineClass1(Native Method)
-    at java.lang.ClassLoader.defineClass(ClassLoader.java:792)
-    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
-    at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
-    at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
-    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
-    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
-    at java.security.AccessController.doPrivileged(Native Method)
-    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
-    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
-    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
-    at
-    org.apache.hadoop.hbase.protobuf.ProtobufUtil.toScan(ProtobufUtil.java:818)
-    at
-    org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.convertScanToString(TableMapReduceUtil.java:433)
-    at
-    org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:186)
-    at
-    org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:147)
-    at
-    org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:270)
-    at
-    org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:100)
-...
-</screen>
-        <para>This is caused by an optimization introduced in <link
-            xlink:href="https://issues.apache.org/jira/browse/HBASE-9867">HBASE-9867</link> that
-          inadvertently introduced a classloader dependency. </para>
-        <para>This affects both jobs using the <code>-libjars</code> option and "fat jar" jobs, those
-          which package their runtime dependencies in a nested <code>lib</code> folder.</para>
-        <para>In order to satisfy the new classloader requirements, hbase-protocol.jar must be
-          included in Hadoop's classpath. See <xref
-            linkend="hbase.mapreduce.classpath" /> for current recommendations for resolving
-          classpath errors. The following is included for historical purposes.</para>
-        <para>This can be resolved system-wide by including a reference to the hbase-protocol.jar in
-          hadoop's lib directory, via a symlink or by copying the jar into the new location.</para>
-        <para>This can also be achieved on a per-job launch basis by including it in the
-            <code>HADOOP_CLASSPATH</code> environment variable at job submission time. When
-          launching jobs that package their dependencies, all three of the following job launching
-          commands satisfy this requirement:</para>
-        <screen language="bourne">
-$ <userinput>HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass</userinput>
-$ <userinput>HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar MyJob.jar MyJobMainClass</userinput>
-$ <userinput>HADOOP_CLASSPATH=$(hbase classpath) hadoop jar MyJob.jar MyJobMainClass</userinput>
-        </screen>
-        <para>For jars that do not package their dependencies, the following command structure is
-          necessary:</para>
-        <screen language="bourne">
-$ <userinput>HADOOP_CLASSPATH=$(hbase mapredcp):/etc/hbase/conf hadoop jar MyApp.jar MyJobMainClass -libjars $(hbase mapredcp | tr ':' ',')</userinput> ...
-        </screen>
-        <para>See also <link
-            xlink:href="https://issues.apache.org/jira/browse/HBASE-10304">HBASE-10304</link> for
-          further discussion of this issue.</para>
-      </caution>
-    </section>
-
-    <section>
-      <title>MapReduce Scan Caching</title>
-      <para>TableMapReduceUtil now restores the option to set scanner caching (the number of rows
-        which are cached before returning the result to the client) on the Scan object that is
-        passed in. This functionality was lost due to a bug in HBase 0.95 (<link
-          xlink:href="https://issues.apache.org/jira/browse/HBASE-11558">HBASE-11558</link>), which
-        is fixed for HBase 0.98.5 and 0.96.3. The priority order for choosing the scanner caching is
-        as follows:</para>
-      <orderedlist>
-        <listitem>
-          <para>Caching settings which are set on the scan object.</para>
-        </listitem>
-        <listitem>
-          <para>Caching settings which are specified via the configuration option
-              <option>hbase.client.scanner.caching</option>, which can either be set manually in
-              <filename>hbase-site.xml</filename> or via the helper method
-              <code>TableMapReduceUtil.setScannerCaching()</code>.</para>
-        </listitem>
-        <listitem>
-          <para>The default value <code>HConstants.DEFAULT_HBASE_CLIENT_SCANNER_CACHING</code>, which is set to
-            <literal>100</literal>.</para>
-        </listitem>
-      </orderedlist>
-      <para>Optimizing the caching settings is a balance between the time the client waits for a
-        result and the number of sets of results the client needs to receive. If the caching setting
-        is too large, the client could end up waiting for a long time or the request could even time
-        out. If the setting is too small, the scan needs to return results in several pieces.
-        If you think of the scan as a shovel, a bigger cache setting is analogous to a bigger
-        shovel, and a smaller cache setting is equivalent to more shoveling in order to fill the
-        bucket.</para>
-      <para>The list of priorities mentioned above allows you to set a reasonable default, and
-        override it for specific operations.</para>
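-      <para>The following is a hedged sketch (the values are illustrative only) of a job-wide
-        default set via the helper method, overridden for a single Scan.</para>
-      <programlisting language="java">
-Job job = Job.getInstance(HBaseConfiguration.create(), "CachingExample");
-TableMapReduceUtil.setScannerCaching(job, 200);  // job-wide default
-
-Scan scan = new Scan();
-scan.setCaching(1000);  // a setting on the Scan object takes priority over the job-wide default
-</programlisting>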
-      <para>See the API documentation for <link
-          xlink:href="https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html"
-          >Scan</link> for more details.</para>
-    </section>
-
-    <section>
-      <title>Bundled HBase MapReduce Jobs</title>
-      <para>The HBase JAR also serves as a Driver for some bundled mapreduce jobs. To learn about
-        the bundled MapReduce jobs, run the following command.</para>
-
-      <screen language="bourne">$ <userinput>${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar</userinput>
-<computeroutput>An example program must be given as the first argument.
-Valid program names are:
-  copytable: Export a table from local cluster to peer cluster
-  completebulkload: Complete a bulk data load.
-  export: Write table data to HDFS.
-  import: Import data written by Export.
-  importtsv: Import data in TSV format.
-  rowcounter: Count rows in HBase table</computeroutput>
-    </screen>
-      <para>Each of the valid program names is a bundled MapReduce job. To run one of the jobs,
-        model your command after the following example.</para>
-      <screen language="bourne">$ <userinput>${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-server-VERSION.jar rowcounter myTable</userinput></screen>
-    </section>
-
-    <section>
-      <title>HBase as a MapReduce Job Data Source and Data Sink</title>
-      <para>HBase can be used as a data source, <link
-          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html">TableInputFormat</link>,
-        and data sink, <link
-          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html">TableOutputFormat</link>
-        or <link
-          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/MultiTableOutputFormat.html">MultiTableOutputFormat</link>,
-        for MapReduce jobs. When writing MapReduce jobs that read or write HBase, it is advisable to
-        subclass <link
-          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapper.html">TableMapper</link>
-        and/or <link
-          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableReducer.html">TableReducer</link>.
-        See the do-nothing pass-through classes <link
-          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/IdentityTableMapper.html">IdentityTableMapper</link>
-        and <link
-          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/IdentityTableReducer.html">IdentityTableReducer</link>
-        for basic usage. For a more involved example, see <link
-          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html">RowCounter</link>
-        or review the <code>org.apache.hadoop.hbase.mapreduce.TestTableMapReduce</code> unit test. </para>
-      <para>If you run MapReduce jobs that use HBase as a source or sink, you need to specify the
-        source and sink table and column names in your configuration.</para>
-
-      <para>When you read from HBase, the <code>TableInputFormat</code> requests the list of regions
-        from HBase and makes a map task per region, or <code>mapreduce.job.maps</code> map
-          tasks, whichever is smaller. If your job only has two maps,
-        raise <code>mapreduce.job.maps</code> to a number greater than the number of regions. Maps
-        will run on the adjacent TaskTracker if you are running a TaskTracker and RegionServer per
-        node. When writing to HBase, it may make sense to avoid the Reduce step and write back into
-        HBase from within your map. This approach works when your job does not need the sort and
-        collation that MapReduce does on the map-emitted data. On insert, HBase 'sorts' so there is
-        no point double-sorting (and shuffling data around your MapReduce cluster) unless you need
-        to. If you do not need the Reduce, your map might emit counts of records processed for
-        reporting at the end of the job, or set the number of Reduces to zero and use
-        TableOutputFormat. If running the Reduce step makes sense in your case, you should typically
-        use multiple reducers so that load is spread across the HBase cluster.</para>
-
-      <para>A new HBase partitioner, the <link
-          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/HRegionPartitioner.html">HRegionPartitioner</link>,
-        can run as many reducers as there are existing regions. The HRegionPartitioner is suitable
-        when your table is large and your upload will not greatly alter the number of existing
-        regions upon completion. Otherwise use the default partitioner. </para>
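-      <para>A hedged sketch of wiring in the HRegionPartitioner when setting up the reducer; the
-        names are reused from the examples later in this chapter.</para>
-      <programlisting language="java">
-TableMapReduceUtil.initTableReducerJob(
-  targetTable,               // output table
-  MyTableReducer.class,      // reducer class
-  job,
-  HRegionPartitioner.class); // partition reduce output along the table's existing regions
-</programlisting>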
-    </section>
-
-    <section>
-      <title>Writing HFiles Directly During Bulk Import</title>
-      <para>If you are importing into a new table, you can bypass the HBase API and write your
-        content directly to the filesystem, formatted into HBase data files (HFiles). Your import
-        will run faster, perhaps an order of magnitude faster. For more on how this mechanism works,
-        see <xref
-          linkend="arch.bulk.load" />.</para>
-    </section>
-
-    <section>
-      <title>RowCounter Example</title>
-      <para>The included <link
-        xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/RowCounter.html">RowCounter</link>
-        MapReduce job uses <code>TableInputFormat</code> and does a count of all rows in the specified
-        table. To run it, use the following command: </para>
-      <screen language="bourne">$ <userinput>./bin/hadoop jar hbase-X.X.X.jar</userinput></screen> 
-      <para>This will
-        invoke the HBase MapReduce Driver class. Select <literal>rowcounter</literal> from the choice of jobs
-        offered. This will print rowcounter usage advice to standard output. Specify the table name,
-        column to count, and output
-        directory. If you have classpath errors, see <xref linkend="hbase.mapreduce.classpath" />.</para>
-    </section>
-
-    <section
-      xml:id="splitter">
-      <title>Map-Task Splitting</title>
-      <section
-        xml:id="splitter.default">
-        <title>The Default HBase MapReduce Splitter</title>
-        <para>When <link
-            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html">TableInputFormat</link>
-          is used to source an HBase table in a MapReduce job, its splitter will make a map task for
-          each region of the table. Thus, if there are 100 regions in the table, there will be 100
-          map-tasks for the job - regardless of how many column families are selected in the
-          Scan.</para>
-      </section>
-      <section
-        xml:id="splitter.custom">
-        <title>Custom Splitters</title>
-        <para>For those interested in implementing custom splitters, see the method
-            <code>getSplits</code> in <link
-            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.html">TableInputFormatBase</link>.
-          That is where the logic for map-task assignment resides. </para>
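-        <para>A minimal, hedged sketch (the class and its behavior are illustrative only) of
-          overriding <code>getSplits</code> by subclassing
-          <classname>TableInputFormat</classname>:</para>
-        <programlisting language="java">
-public class MyTableInputFormat extends TableInputFormat {
-  @Override
-  public List&lt;InputSplit&gt; getSplits(JobContext context) throws IOException {
-    // Start from the default one-split-per-region logic...
-    List&lt;InputSplit&gt; splits = super.getSplits(context);
-    // ...then inspect, merge, or further divide the splits here as needed.
-    return splits;
-  }
-}
-</programlisting>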
-      </section>
-    </section>
-    <section
-      xml:id="mapreduce.example">
-      <title>HBase MapReduce Examples</title>
-      <section
-        xml:id="mapreduce.example.read">
-        <title>HBase MapReduce Read Example</title>
-        <para>The following is an example of using HBase as a MapReduce source in a read-only manner.
-          Specifically, there is a Mapper instance but no Reducer, and nothing is being emitted from
-          the Mapper. The job would be defined as follows...</para>
-        <programlisting language="java">
-Configuration config = HBaseConfiguration.create();
-Job job = new Job(config, "ExampleRead");
-job.setJarByClass(MyReadJob.class);     // class that contains mapper
-
-Scan scan = new Scan();
-scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
-scan.setCacheBlocks(false);  // don't set to true for MR jobs
-// set other scan attrs
-...
-
-TableMapReduceUtil.initTableMapperJob(
-  tableName,        // input HBase table name
-  scan,             // Scan instance to control CF and attribute selection
-  MyMapper.class,   // mapper
-  null,             // mapper output key
-  null,             // mapper output value
-  job);
-job.setOutputFormatClass(NullOutputFormat.class);   // because we aren't emitting anything from mapper
-
-boolean b = job.waitForCompletion(true);
-if (!b) {
-  throw new IOException("error with job!");
-}
-  </programlisting>
-        <para>...and the mapper instance would extend <link
-            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapper.html">TableMapper</link>...</para>
-        <programlisting language="java">
-public static class MyMapper extends TableMapper&lt;Text, Text&gt; {
-
-  public void map(ImmutableBytesWritable row, Result value, Context context) throws InterruptedException, IOException {
-    // process data for the row from the Result instance.
-   }
-}
-    </programlisting>
-      </section>
-      <section
-        xml:id="mapreduce.example.readwrite">
-        <title>HBase MapReduce Read/Write Example</title>
-        <para>The following is an example of using HBase both as a source and as a sink with
-          MapReduce. This example will simply copy data from one table to another.</para>
-        <programlisting language="java">
-Configuration config = HBaseConfiguration.create();
-Job job = new Job(config,"ExampleReadWrite");
-job.setJarByClass(MyReadWriteJob.class);    // class that contains mapper
-
-Scan scan = new Scan();
-scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
-scan.setCacheBlocks(false);  // don't set to true for MR jobs
-// set other scan attrs
-
-TableMapReduceUtil.initTableMapperJob(
-	sourceTable,      // input table
-	scan,	          // Scan instance to control CF and attribute selection
-	MyMapper.class,   // mapper class
-	null,	          // mapper output key
-	null,	          // mapper output value
-	job);
-TableMapReduceUtil.initTableReducerJob(
-	targetTable,      // output table
-	null,             // reducer class
-	job);
-job.setNumReduceTasks(0);
-
-boolean b = job.waitForCompletion(true);
-if (!b) {
-    throw new IOException("error with job!");
-}
-    </programlisting>
-        <para>An explanation is required of what <classname>TableMapReduceUtil</classname> is doing,
-          especially with the reducer. <link
-            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html">TableOutputFormat</link>
-          is being used as the outputFormat class, and several parameters are being set on the
-          config (e.g., TableOutputFormat.OUTPUT_TABLE), as well as setting the reducer output key
-          to <classname>ImmutableBytesWritable</classname> and reducer value to
-            <classname>Writable</classname>. These could be set by the programmer on the job and
-          conf, but <classname>TableMapReduceUtil</classname> tries to make things easier.</para>
-        <para>The following is the example mapper, which will create a <classname>Put</classname>
-          matching the input <classname>Result</classname> and emit it. Note: this is what the
-          CopyTable utility does. </para>
-        <programlisting language="java">
-public static class MyMapper extends TableMapper&lt;ImmutableBytesWritable, Put&gt;  {
-
-	public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
-		// this example is just copying the data from the source table...
-   		context.write(row, resultToPut(row,value));
-   	}
-
-  	private static Put resultToPut(ImmutableBytesWritable key, Result result) throws IOException {
-  		Put put = new Put(key.get());
- 		for (KeyValue kv : result.raw()) {
-			put.add(kv);
-		}
-		return put;
-   	}
-}
-    </programlisting>
-        <para>There isn't actually a reducer step, so <classname>TableOutputFormat</classname> takes
-          care of sending the <classname>Put</classname> to the target table. </para>
-        <para>This is just an example, developers could choose not to use
-            <classname>TableOutputFormat</classname> and connect to the target table themselves.
-        </para>
-      </section>
-      <section
-        xml:id="mapreduce.example.readwrite.multi">
-        <title>HBase MapReduce Read/Write Example With Multi-Table Output</title>
-        <para>TODO: example for <classname>MultiTableOutputFormat</classname>. </para>
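-        <para>In the meantime, the following is a hedged sketch (the table names, row key, and
-          column names are illustrative only). With
-          <classname>MultiTableOutputFormat</classname>, the key passed to
-          <code>context.write</code> names the destination table.</para>
-        <programlisting language="java">
-job.setOutputFormatClass(MultiTableOutputFormat.class);
-...
-// Inside a mapper or reducer:
-Put put = new Put(rowKey);
-put.add(CF, COUNT, Bytes.toBytes(count));
-// The ImmutableBytesWritable key selects which table receives the mutation.
-context.write(new ImmutableBytesWritable(Bytes.toBytes("summary1")), put);
-context.write(new ImmutableBytesWritable(Bytes.toBytes("summary2")), put);
-</programlisting>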
-      </section>
-      <section
-        xml:id="mapreduce.example.summary">
-        <title>HBase MapReduce Summary to HBase Example</title>
-        <para>The following example uses HBase as a MapReduce source and sink with a summarization
-          step. This example will count the number of distinct instances of a value in a table and
-          write those summarized counts in another table.
-          <programlisting language="java">
-Configuration config = HBaseConfiguration.create();
-Job job = new Job(config,"ExampleSummary");
-job.setJarByClass(MySummaryJob.class);     // class that contains mapper and reducer
-
-Scan scan = new Scan();
-scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
-scan.setCacheBlocks(false);  // don't set to true for MR jobs
-// set other scan attrs
-
-TableMapReduceUtil.initTableMapperJob(
-	sourceTable,        // input table
-	scan,               // Scan instance to control CF and attribute selection
-	MyMapper.class,     // mapper class
-	Text.class,         // mapper output key
-	IntWritable.class,  // mapper output value
-	job);
-TableMapReduceUtil.initTableReducerJob(
-	targetTable,        // output table
-	MyTableReducer.class,    // reducer class
-	job);
-job.setNumReduceTasks(1);   // at least one, adjust as required
-
-boolean b = job.waitForCompletion(true);
-if (!b) {
-	throw new IOException("error with job!");
-}
-    </programlisting>
-          In this example mapper a column with a String-value is chosen as the value to summarize
-          upon. This value is used as the key to emit from the mapper, and an
-            <classname>IntWritable</classname> represents an instance counter.
-          <programlisting language="java">
-public static class MyMapper extends TableMapper&lt;Text, IntWritable&gt;  {
-	public static final byte[] CF = "cf".getBytes();
-	public static final byte[] ATTR1 = "attr1".getBytes();
-
-	private final IntWritable ONE = new IntWritable(1);
-   	private Text text = new Text();
-
-   	public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
-        	String val = new String(value.getValue(CF, ATTR1));
-          	text.set(val);     // we can only emit Writables...
-
-        	context.write(text, ONE);
-   	}
-}
-    </programlisting>
-          In the reducer, the "ones" are counted (just like in any other MR example that does this),
-          and then a <classname>Put</classname> is emitted.
-          <programlisting language="java">
-public static class MyTableReducer extends TableReducer&lt;Text, IntWritable, ImmutableBytesWritable&gt;  {
-	public static final byte[] CF = "cf".getBytes();
-	public static final byte[] COUNT = "count".getBytes();
-
- 	public void reduce(Text key, Iterable&lt;IntWritable&gt; values, Context context) throws IOException, InterruptedException {
-    		int i = 0;
-    		for (IntWritable val : values) {
-    			i += val.get();
-    		}
-    		Put put = new Put(Bytes.toBytes(key.toString()));
-    		put.add(CF, COUNT, Bytes.toBytes(i));
-
-    		context.write(null, put);
-   	}
-}
-    </programlisting>
-        </para>
-      </section>
-      <section
-        xml:id="mapreduce.example.summary.file">
-        <title>HBase MapReduce Summary to File Example</title>
-        <para>This is very similar to the summary example above, with the exception that it uses
-          HBase as a MapReduce source but HDFS as the sink. The differences are in the job setup and
-          in the reducer. The mapper remains the same. </para>
-        <programlisting language="java">
-Configuration config = HBaseConfiguration.create();
-Job job = new Job(config,"ExampleSummaryToFile");
-job.setJarByClass(MySummaryFileJob.class);     // class that contains mapper and reducer
-
-Scan scan = new Scan();
-scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
-scan.setCacheBlocks(false);  // don't set to true for MR jobs
-// set other scan attrs
-
-TableMapReduceUtil.initTableMapperJob(
-	sourceTable,        // input table
-	scan,               // Scan instance to control CF and attribute selection
-	MyMapper.class,     // mapper class
-	Text.class,         // mapper output key
-	IntWritable.class,  // mapper output value
-	job);
-job.setReducerClass(MyReducer.class);    // reducer class
-job.setNumReduceTasks(1);    // at least one, adjust as required
-FileOutputFormat.setOutputPath(job, new Path("/tmp/mr/mySummaryFile"));  // adjust directories as required
-
-boolean b = job.waitForCompletion(true);
-if (!b) {
-	throw new IOException("error with job!");
-}
-    </programlisting>
-        <para>As stated above, the previous Mapper can run unchanged with this example. As for the
-          Reducer, it is a "generic" Reducer instead of extending TableReducer and emitting
-          Puts.</para>
-        <programlisting language="java">
- public static class MyReducer extends Reducer&lt;Text, IntWritable, Text, IntWritable&gt;  {
-
-	public void reduce(Text key, Iterable&lt;IntWritable&gt; values, Context context) throws IOException, InterruptedException {
-		int i = 0;
-		for (IntWritable val : values) {
-			i += val.get();
-		}
-		context.write(key, new IntWritable(i));
-	}
-}
-    </programlisting>
-      </section>
-      <section
-        xml:id="mapreduce.example.summary.noreducer">
-        <title>HBase MapReduce Summary to HBase Without Reducer</title>
-        <para>It is also possible to perform summaries without a reducer - if you use HBase as the
-          reducer. </para>
-        <para>An HBase target table would need to exist for the job summary. The Table method
-            <code>incrementColumnValue</code> would be used to atomically increment values. From a
-          performance perspective, it might make sense to keep a Map of values with their counts to
-          be incremented for each map-task, and make one update per key during the <code>
-            cleanup</code> method of the mapper, as in the sketch below. However, your mileage may vary
-          depending on the number of rows to be processed and the number of unique keys. </para>
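-        <para>The following is a hedged sketch of that pattern; the class and field names are
-          illustrative only, and obtaining the target Table from a Connection is not shown.</para>
-        <programlisting language="java">
-public static class MyIncrementingMapper extends TableMapper&lt;ImmutableBytesWritable, Put&gt; {
-  public static final byte[] CF = "cf".getBytes();
-  public static final byte[] ATTR1 = "attr1".getBytes();
-  public static final byte[] COUNT = "count".getBytes();
-
-  private Table summaryTable;                                 // obtained in setup(), not shown
-  private Map&lt;String, Long&gt; counts = new HashMap&lt;String, Long&gt;();
-
-  public void map(ImmutableBytesWritable row, Result value, Context context) {
-    String key = new String(value.getValue(CF, ATTR1));
-    Long current = counts.get(key);
-    counts.put(key, current == null ? 1L : current + 1);      // buffer counts per map task
-  }
-
-  public void cleanup(Context context) throws IOException {
-    // One atomic increment per distinct key, instead of one per input row.
-    for (Map.Entry&lt;String, Long&gt; e : counts.entrySet()) {
-      summaryTable.incrementColumnValue(Bytes.toBytes(e.getKey()), CF, COUNT, e.getValue());
-    }
-  }
-}
-</programlisting>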
-        <para>In the end, the summary results are in HBase. </para>
-      </section>
-      <section
-        xml:id="mapreduce.example.summary.rdbms">
-        <title>HBase MapReduce Summary to RDBMS</title>
-        <para>Sometimes it is more appropriate to generate summaries to an RDBMS. For these cases,
-          it is possible to generate summaries directly to an RDBMS via a custom reducer. The
-            <code>setup</code> method can connect to an RDBMS (the connection information can be
-          passed via custom parameters in the context) and the cleanup method can close the
-          connection. </para>
-        <para>It is critical to understand that the number of reducers for the job affects the
-          summarization implementation, and you'll have to design this into your reducer.
-          Specifically, decide whether it is designed to run as a singleton (one reducer) or as multiple
-          reducers. Neither is right or wrong; it depends on your use-case. Recognize that the more
-          reducers that are assigned to the job, the more simultaneous connections to the RDBMS will
-          be created - this will scale, but only to a point. </para>
-        <programlisting language="java">
- public static class MyRdbmsReducer extends Reducer&lt;Text, IntWritable, Text, IntWritable&gt;  {
-
-	private Connection c = null;
-
-	public void setup(Context context) {
-  		// create DB connection...
-  	}
-
-	public void reduce(Text key, Iterable&lt;IntWritable&gt; values, Context context) throws IOException, InterruptedException {
-		// do summarization
-		// in this example the keys are Text, but this is just an example
-	}
-
-	public void cleanup(Context context) {
-  		// close db connection
-  	}
-
-}
-    </programlisting>
-        <para>In the end, the summary results are written to your RDBMS table or tables. </para>
-      </section>
-
-    </section>
-    <!--  mr examples -->
-    <section
-      xml:id="mapreduce.htable.access">
-      <title>Accessing Other HBase Tables in a MapReduce Job</title>
-      <para>Although the framework currently allows one HBase table as input to a MapReduce job,
-        other HBase tables can be accessed as lookup tables, etc., in a MapReduce job by creating
-        a Table instance in the setup method of the Mapper.
-        <programlisting language="java">public class MyMapper extends TableMapper&lt;Text, LongWritable&gt; {
-  private Table myOtherTable;
-
-  public void setup(Context context) throws IOException {
-    // Create a Connection to the cluster here and save it, or reuse the Connection
-    // from the existing table, then obtain the lookup table from it.
-    myOtherTable = connection.getTable(TableName.valueOf("myOtherTable"));
-  }
-
-  public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
-    // process Result...
-    // use 'myOtherTable' for lookups
-  }
-}
-  </programlisting>
-      </para>
-    </section>
-    <section
-      xml:id="mapreduce.specex">
-      <title>Speculative Execution</title>
-      <para>It is generally advisable to turn off speculative execution for MapReduce jobs that use
-        HBase as a source. This can either be done on a per-job basis through properties, or for the
-        entire cluster. Especially for longer-running jobs, speculative execution will create
-        duplicate map-tasks which will double-write your data to HBase; this is probably not what
-        you want. </para>
-      <para>See <xref
-          linkend="spec.ex" /> for more information. </para>
-    </section>
-  </chapter>  <!--  mapreduce -->
-
+  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="mapreduce.xml" />
   <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="security.xml" />
-
-  <chapter xml:id="architecture">
-    <title>Architecture</title>
-	<section xml:id="arch.overview">
-	<title>Overview</title>
-	  <section xml:id="arch.overview.nosql">
-	  <title>NoSQL?</title>
-	  <para>HBase is a type of "NoSQL" database.  "NoSQL" is a general term meaning that the database isn't an RDBMS which
-	  supports SQL as its primary access language, but there are many types of NoSQL databases:  BerkeleyDB is an
-	  example of a local NoSQL database, whereas HBase is very much a distributed database.  Technically speaking,
-	  HBase is really more a "Data Store" than "Data Base" because it lacks many of the features you find in an RDBMS,
-	  such as typed columns, secondary indexes, triggers, and advanced query languages, etc.
-	  </para>
-	  <para>However, HBase has many features which support both linear and modular scaling.  HBase clusters expand
-	  by adding RegionServers that are hosted on commodity-class servers. If a cluster expands from 10 to 20
-	  RegionServers, for example, it doubles in terms of both storage and processing capacity.
-	  An RDBMS can scale well, but only up to a point - specifically, the size of a single database server - and for the best
-	  performance it requires specialized hardware and storage devices.  HBase features of note are:
-	        <itemizedlist>
-              <listitem><para>Strongly consistent reads/writes:  HBase is not an "eventually consistent" DataStore.  This
-              makes it very suitable for tasks such as high-speed counter aggregation.</para>  </listitem>
-              <listitem><para>Automatic sharding:  HBase tables are distributed on the cluster via regions, and regions are
-              automatically split and re-distributed as your data grows.</para></listitem>
-              <listitem><para>Automatic RegionServer failover</para></listitem>
-              <listitem><para>Hadoop/HDFS Integration:  HBase supports HDFS out of the box as its distributed file system.</para></listitem>
-              <listitem><para>MapReduce:  HBase supports massively parallelized processing via MapReduce for using HBase as both
-              source and sink.</para></listitem>
-              <listitem><para>Java Client API:  HBase supports an easy to use Java API for programmatic access.</para></listitem>
-              <listitem><para>Thrift/REST API:  HBase also supports Thrift and REST for non-Java front-ends.</para></listitem>
-              <listitem><para>Block Cache and Bloom Filters:  HBase supports a Block Cache and Bloom Filters for high volume query optimization.</para></listitem>
-              <listitem><para>Operational Management:  HBase provides built-in web pages for operational insight as well as JMX metrics.</para></listitem>
-            </itemizedlist>
-	  </para>
-      </section>
-
-	  <section xml:id="arch.overview.when">
-	    <title>When Should I Use HBase?</title>
-	    	  <para>HBase isn't suitable for every problem.</para>
-	          <para>First, make sure you have enough data.  If you have hundreds of millions or billions of rows, then
-	            HBase is a good candidate.  If you only have a few thousand or a few million rows, then using a traditional RDBMS
-	            might be a better choice, because all of your data might wind up on a single node (or two) and
-	            the rest of the cluster may sit idle.
-	          </para>
-	          <para>Second, make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns,
-	          secondary indexes, transactions, advanced query languages, etc.)  An application built against an RDBMS cannot be
-	          "ported" to HBase by simply changing a JDBC driver, for example.  Consider moving from an RDBMS to HBase as a
-	          complete redesign as opposed to a port.
-              </para>
-	          <para>Third, make sure you have enough hardware.  Even HDFS doesn't do well with anything less than
-                5 DataNodes (due to things such as HDFS block replication which has a default of 3), plus a NameNode.
-                </para>
-                <para>HBase can run quite well stand-alone on a laptop - but this should be considered a development
-                configuration only.
-                </para>
-      </section>
-      <section xml:id="arch.overview.hbasehdfs">
-        <title>What Is The Difference Between HBase and Hadoop/HDFS?</title>
-          <para><link xlink:href="http://hadoop.apache.org/hdfs/">HDFS</link> is a distributed file system that is well suited for the storage of large files.
-          Its documentation states that it is not, however, a general purpose file system, and does not provide fast individual record lookups in files.
-          HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables.
-          This can sometimes be a point of conceptual confusion.  HBase internally puts your data in indexed "StoreFiles" that exist
-          on HDFS for high-speed lookups.  See the <xref linkend="datamodel" /> and the rest of this chapter for more information on how HBase achieves its goals.
-         </para>
-      </section>
-	</section>
-
-    <section
-      xml:id="arch.catalog">
-      <title>Catalog Tables</title>
-      <para>The catalog table <code>hbase:meta</code> exists as an HBase table and is filtered out of the HBase
-        shell's <code>list</code> command, but is in fact a table just like any other. </para>
-      <section
-        xml:id="arch.catalog.root">
-        <title>-ROOT-</title>
-        <note>
-          <para>The <code>-ROOT-</code> table was removed in HBase 0.96.0. Information here should
-            be considered historical.</para>
-        </note>
-        <para>The <code>-ROOT-</code> table kept track of the location of the
-            <code>.META.</code> table (the previous name for the table now called <code>hbase:meta</code>) prior to HBase
-          0.96. The <code>-ROOT-</code> table structure was as follows: </para>
-        <itemizedlist>
-          <title>Key</title>
-          <listitem>
-            <para>.META. region key (<code>.META.,,1</code>)</para>
-          </listitem>
-        </itemizedlist>
-
-        <itemizedlist>
-          <title>Values</title>
-          <listitem>
-            <para><code>info:regioninfo</code> (serialized <link
-                xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HRegionInfo.html">HRegionInfo</link>
-              instance of hbase:meta)</para>
-          </listitem>
-          <listitem>
-            <para><code>info:server</code> (server:port of the RegionServer holding
-              hbase:meta)</para>
-          </listitem>
-          <listitem>
-            <para><code>info:serverstartcode</code> (start-time of the RegionServer process holding
-              hbase:meta)</para>
-          </listitem>
-        </itemizedlist>
-      </section>
-      <section
-        xml:id="arch.catalog.meta">
-        <title>hbase:meta</title>
-        <para>The <code>hbase:meta</code> table (previously called <code>.META.</code>) keeps a list
-          of all regions in the system. The location of <code>hbase:meta</code> was previously
-          tracked within the <code>-ROOT-</code> table, but is now stored in Zookeeper.</para>
-        <para>The <code>hbase:meta</code> table structure is as follows: </para>
-        <itemizedlist>
-          <title>Key</title>
-          <listitem>
-            <para>Region key of the format (<code>[table],[region start key],[region
-              id]</code>)</para>
-          </listitem>
-        </itemizedlist>
-        <itemizedlist>
-          <title>Values</title>
-          <listitem>
-            <para><code>info:regioninfo</code> (serialized <link
-                xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HRegionInfo.html">
-                HRegionInfo</link> instance for this region)</para>
-          </listitem>
-          <listitem>
-            <para><code>info:server</code> (server:port of the RegionServer containing this
-              region)</para>
-          </listitem>
-          <listitem>
-            <para><code>info:serverstartcode</code> (start-time of the RegionServer process
-              containing this region)</para>
-          </listitem>
-        </itemizedlist>
-        <para>When a table is in the process of splitting, two other columns will be created, called
-            <code>info:splitA</code> and <code>info:splitB</code>. These columns represent the two
-          daughter regions. The values for these columns are also serialized HRegionInfo instances.
-          After the region has been split, eventually this row will be deleted. </para>
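-        <para>To eyeball the catalog contents, you can scan <code>hbase:meta</code> from the HBase
-          Shell like any other table (a hedged example; the <literal>LIMIT</literal> is only there
-          to keep the output short):</para>
-        <screen><![CDATA[hbase> scan 'hbase:meta', {LIMIT => 3}]]></screen>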
-        <note>
-          <title>Note on HRegionInfo</title>
-          <para>The empty key is used to denote table start and table end. A region with an empty
-            start key is the first region in a table. If a region has both an empty start and an
-            empty end key, it is the only region in the table. </para>
-        </note>
-        <para>In the (hopefully unlikely) event that programmatic processing of catalog metadata is
-          required, see the <link
-            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/util/Writables.html#getHRegionInfo%28byte[]%29">Writables</link>
-          utility. </para>
-      </section>
-      <section
-        xml:id="arch.catalog.startup">
-        <title>Startup Sequencing</title>
-        <para>First, the location of <code>hbase:meta</code> is looked up in Zookeeper. Next,
-          <code>hbase:meta</code> is updated with server and startcode values.</para>  
-        <para>For information on region-RegionServer assignment, see <xref
-            linkend="regions.arch.assignment" />. </para>
-      </section>
-    </section>  <!--  catalog -->
-
-    <section
-      xml:id="client">
-      <title>Client</title>
-      <para>The HBase client finds the RegionServers that are serving the particular row range of
-        interest. It does this by querying the <code>hbase:meta</code> table. See <xref
-          linkend="arch.catalog.meta" /> for details. After locating the required region(s), the
-        client contacts the RegionServer serving that region, rather than going through the master,
-        and issues the read or write request. This information is cached in the client so that
-        subsequent requests need not go through the lookup process. Should a region be reassigned
-        either by the master load balancer or because a RegionServer has died, the client will
-        requery the catalog tables to determine the new location of the user region. </para>
-
-      <para>See <xref
-          linkend="master.runtime" /> for more information about the impact of the Master on HBase
-        Client communication. </para>
-      <para>Administrative functions are done via an instance of <link
-          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Admin.html">Admin</link>
-      </para>
-
-      <section
-        xml:id="client.connections">
-        <title>Cluster Connections</title>
-        <para>The API changed in HBase 1.0. Its been cleaned up and users are returned
-          Interfaces to work against rather than particular types. In HBase 1.0,
-          obtain a cluster Connection from ConnectionFactory and thereafter, get from it
-          instances of Table, Admin, and RegionLocator on an as-need basis. When done, close
-          obtained instances.  Finally, be sure to cleanup your Connection instance before
-          exiting.  Connections are heavyweigh

<TRUNCATED>

[7/8] hbase git commit: HBASE-12738 Chunk Ref Guide into file-per-chapter

Posted by mi...@apache.org.
http://git-wip-us.apache.org/repos/asf/hbase/blob/a1fe1e09/src/main/docbkx/architecture.xml
----------------------------------------------------------------------
diff --git a/src/main/docbkx/architecture.xml b/src/main/docbkx/architecture.xml
new file mode 100644
index 0000000..16b298a
--- /dev/null
+++ b/src/main/docbkx/architecture.xml
@@ -0,0 +1,3489 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<chapter
+    xml:id="architecture"
+    version="5.0"
+    xmlns="http://docbook.org/ns/docbook"
+    xmlns:xlink="http://www.w3.org/1999/xlink"
+    xmlns:xi="http://www.w3.org/2001/XInclude"
+    xmlns:svg="http://www.w3.org/2000/svg"
+    xmlns:m="http://www.w3.org/1998/Math/MathML"
+    xmlns:html="http://www.w3.org/1999/xhtml"
+    xmlns:db="http://docbook.org/ns/docbook">
+    <!--/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+-->
+
+    <title>Architecture</title>
+	<section xml:id="arch.overview">
+	<title>Overview</title>
+	  <section xml:id="arch.overview.nosql">
+	  <title>NoSQL?</title>
+	  <para>HBase is a type of "NoSQL" database.  "NoSQL" is a general term meaning that the database isn't an RDBMS that
+	  supports SQL as its primary access language. There are many types of NoSQL databases:  BerkeleyDB is an
+	  example of a local NoSQL database, whereas HBase is very much a distributed database.  Technically speaking,
+	  HBase is really more a "Data Store" than a "Data Base" because it lacks many of the features you find in an RDBMS,
+	  such as typed columns, secondary indexes, triggers, and advanced query languages.
+	  </para>
+	  <para>However, HBase has many features which support both linear and modular scaling.  HBase clusters expand
+	  by adding RegionServers that are hosted on commodity-class servers. If a cluster expands from 10 to 20
+	  RegionServers, for example, it doubles in terms of both storage and processing capacity.
+	  An RDBMS can scale well, but only up to a point - specifically, the size of a single database server - and for the best
+	  performance it requires specialized hardware and storage devices.  HBase features of note are:
+	        <itemizedlist>
+              <listitem><para>Strongly consistent reads/writes:  HBase is not an "eventually consistent" DataStore.  This
+              makes it very suitable for tasks such as high-speed counter aggregation.</para>  </listitem>
+              <listitem><para>Automatic sharding:  HBase tables are distributed on the cluster via regions, and regions are
+              automatically split and re-distributed as your data grows.</para></listitem>
+              <listitem><para>Automatic RegionServer failover</para></listitem>
+              <listitem><para>Hadoop/HDFS Integration:  HBase supports HDFS out of the box as its distributed file system.</para></listitem>
+              <listitem><para>MapReduce:  HBase supports massively parallelized processing via MapReduce for using HBase as both
+              source and sink.</para></listitem>
+              <listitem><para>Java Client API:  HBase supports an easy to use Java API for programmatic access.</para></listitem>
+              <listitem><para>Thrift/REST API:  HBase also supports Thrift and REST for non-Java front-ends.</para></listitem>
+              <listitem><para>Block Cache and Bloom Filters:  HBase supports a Block Cache and Bloom Filters for high volume query optimization.</para></listitem>
+              <listitem><para>Operational Management:  HBase provides built-in web pages for operational insight as well as JMX metrics.</para></listitem>
+            </itemizedlist>
+	  </para>
+      </section>
+
+	  <section xml:id="arch.overview.when">
+	    <title>When Should I Use HBase?</title>
+	    	  <para>HBase isn't suitable for every problem.</para>
+	          <para>First, make sure you have enough data.  If you have hundreds of millions or billions of rows, then
+	            HBase is a good candidate.  If you only have a few thousand/million rows, then using a traditional RDBMS
+	            might be a better choice because all of your data might wind up on a single node (or two) while
+	            the rest of the cluster may be sitting idle.
+	          </para>
+	          <para>Second, make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns,
+	          secondary indexes, transactions, advanced query languages, etc.).  An application built against an RDBMS cannot be
+	          "ported" to HBase by simply changing a JDBC driver, for example.  Consider moving from an RDBMS to HBase as a
+	          complete redesign as opposed to a port.
+              </para>
+	          <para>Third, make sure you have enough hardware.  Even HDFS doesn't do well with anything less than
+                5 DataNodes (due to things such as HDFS block replication which has a default of 3), plus a NameNode.
+                </para>
+                <para>HBase can run quite well stand-alone on a laptop - but this should be considered a development
+                configuration only.
+                </para>
+      </section>
+      <section xml:id="arch.overview.hbasehdfs">
+        <title>What Is The Difference Between HBase and Hadoop/HDFS?</title>
+          <para><link xlink:href="http://hadoop.apache.org/hdfs/">HDFS</link> is a distributed file system that is well suited for the storage of large files.
+          Its documentation states that it is not, however, a general purpose file system, and does not provide fast individual record lookups in files.
+          HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables.
+          This can sometimes be a point of conceptual confusion.  HBase internally puts your data in indexed "StoreFiles" that exist
+          on HDFS for high-speed lookups.  See the <xref linkend="datamodel" /> and the rest of this chapter for more information on how HBase achieves its goals.
+         </para>
+      </section>
+	</section>
+
+    <section
+      xml:id="arch.catalog">
+      <title>Catalog Tables</title>
+      <para>The catalog table <code>hbase:meta</code> exists as an HBase table and is filtered out of the HBase
+        shell's <code>list</code> command, but is in fact a table just like any other. </para>
+      <section
+        xml:id="arch.catalog.root">
+        <title>-ROOT-</title>
+        <note>
+          <para>The <code>-ROOT-</code> table was removed in HBase 0.96.0. Information here should
+            be considered historical.</para>
+        </note>
+        <para>The <code>-ROOT-</code> table kept track of the location of the
+            <code>.META.</code> table (the previous name for the table now called <code>hbase:meta</code>) prior to HBase
+          0.96. The <code>-ROOT-</code> table structure was as follows: </para>
+        <itemizedlist>
+          <title>Key</title>
+          <listitem>
+            <para>.META. region key (<code>.META.,,1</code>)</para>
+          </listitem>
+        </itemizedlist>
+
+        <itemizedlist>
+          <title>Values</title>
+          <listitem>
+            <para><code>info:regioninfo</code> (serialized <link
+                xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HRegionInfo.html">HRegionInfo</link>
+              instance of hbase:meta)</para>
+          </listitem>
+          <listitem>
+            <para><code>info:server</code> (server:port of the RegionServer holding
+              hbase:meta)</para>
+          </listitem>
+          <listitem>
+            <para><code>info:serverstartcode</code> (start-time of the RegionServer process holding
+              hbase:meta)</para>
+          </listitem>
+        </itemizedlist>
+      </section>
+      <section
+        xml:id="arch.catalog.meta">
+        <title>hbase:meta</title>
+        <para>The <code>hbase:meta</code> table (previously called <code>.META.</code>) keeps a list
+          of all regions in the system. The location of <code>hbase:meta</code> was previously
+          tracked within the <code>-ROOT-</code> table, but is now stored in ZooKeeper.</para>
+        <para>The <code>hbase:meta</code> table structure is as follows: </para>
+        <itemizedlist>
+          <title>Key</title>
+          <listitem>
+            <para>Region key of the format (<code>[table],[region start key],[region
+              id]</code>)</para>
+          </listitem>
+        </itemizedlist>
+        <itemizedlist>
+          <title>Values</title>
+          <listitem>
+            <para><code>info:regioninfo</code> (serialized <link
+                xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HRegionInfo.html">
+                HRegionInfo</link> instance for this region)</para>
+          </listitem>
+          <listitem>
+            <para><code>info:server</code> (server:port of the RegionServer containing this
+              region)</para>
+          </listitem>
+          <listitem>
+            <para><code>info:serverstartcode</code> (start-time of the RegionServer process
+              containing this region)</para>
+          </listitem>
+        </itemizedlist>
+        <para>When a region is in the process of splitting, two other columns will be created, called
+            <code>info:splitA</code> and <code>info:splitB</code>. These columns represent the two
+          daughter regions. The values for these columns are also serialized HRegionInfo instances.
+          After the region has been split, eventually this row will be deleted. </para>
+        <note>
+          <title>Note on HRegionInfo</title>
+          <para>The empty key is used to denote table start and table end. A region with an empty
+            start key is the first region in a table. If a region has both an empty start and an
+            empty end key, it is the only region in the table. </para>
+        </note>
+        <para>In the (hopefully unlikely) event that programmatic processing of catalog metadata is
+          required, see the <link
+            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/util/Writables.html#getHRegionInfo%28byte[]%29">Writables</link>
+          utility. </para>
+      </section>
+      <section
+        xml:id="arch.catalog.startup">
+        <title>Startup Sequencing</title>
+        <para>First, the location of <code>hbase:meta</code> is looked up in ZooKeeper. Next,
+          <code>hbase:meta</code> is updated with server and startcode values.</para>  
+        <para>For information on region-RegionServer assignment, see <xref
+            linkend="regions.arch.assignment" />. </para>
+      </section>
+    </section>  <!--  catalog -->
+
+    <section
+      xml:id="client">
+      <title>Client</title>
+      <para>The HBase client finds the RegionServers that are serving the particular row range of
+        interest. It does this by querying the <code>hbase:meta</code> table. See <xref
+          linkend="arch.catalog.meta" /> for details. After locating the required region(s), the
+        client contacts the RegionServer serving that region, rather than going through the master,
+        and issues the read or write request. This information is cached in the client so that
+        subsequent requests need not go through the lookup process. Should a region be reassigned
+        either by the master load balancer or because a RegionServer has died, the client will
+        requery the catalog tables to determine the new location of the user region. </para>
+
+      <para>See <xref
+          linkend="master.runtime" /> for more information about the impact of the Master on HBase
+        Client communication. </para>
+      <para>Administrative functions are performed via an instance of <link
+          xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Admin.html">Admin</link>.
+      </para>
+
+      <section
+        xml:id="client.connections">
+        <title>Cluster Connections</title>
+        <para>The API changed in HBase 1.0. It has been cleaned up, and users are returned
+          Interfaces to work against rather than particular types. In HBase 1.0,
+          obtain a cluster Connection from ConnectionFactory and thereafter get from it
+          instances of Table, Admin, and RegionLocator on an as-needed basis. When done, close
+          the obtained instances.  Finally, be sure to clean up your Connection instance before
+          exiting.  Connections are heavyweight objects. Create once and keep an instance around.
+          Table, Admin and RegionLocator instances are lightweight. Create as you go and then
+          let go as soon as you are done by closing them. See the
+          <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/package-summary.html">Client Package Javadoc Description</link> for example usage of the new HBase 1.0 API.</para>
+
+        <para>For connection configuration information, see <xref linkend="client_dependencies" />. </para>
+
+        <para><emphasis><link
+              xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Table.html">Table</link>
+            instances are not thread-safe</emphasis>. Only one thread can use an instance of Table at
+          any given time. When creating Table instances, it is advisable to use the same <link
+            xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HBaseConfiguration">HBaseConfiguration</link>
+          instance. This will ensure sharing of ZooKeeper and socket instances to the RegionServers
+          which is usually what you want. For example, this is preferred:</para>
+          <programlisting language="java">HBaseConfiguration conf = HBaseConfiguration.create();
+HTable table1 = new HTable(conf, "myTable");
+HTable table2 = new HTable(conf, "myTable");</programlisting>
+          <para>as opposed to this:</para>
+          <programlisting language="java">HBaseConfiguration conf1 = HBaseConfiguration.create();
+HTable table1 = new HTable(conf1, "myTable");
+HBaseConfiguration conf2 = HBaseConfiguration.create();
+HTable table2 = new HTable(conf2, "myTable");</programlisting>
+
+        <para>For more information about how connections are handled in the HBase client,
+        see <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HConnectionManager.html">HConnectionManager</link>.
+          </para>
+          <section xml:id="client.connection.pooling"><title>Connection Pooling</title>
+            <para>For applications which require high-end multithreaded access (e.g., web-servers or application servers that may serve many application threads
+            in a single JVM), you can pre-create an <classname>HConnection</classname>, as shown in
+              the following example:</para>
+            <example>
+              <title>Pre-Creating an <code>HConnection</code></title>
+              <programlisting language="java">// Create a connection to the cluster.
+HConnection connection = HConnectionManager.createConnection(Configuration);
+HTableInterface table = connection.getTable("myTable");
+// use table as needed, the table returned is lightweight
+table.close();
+// use the connection for other access to the cluster
+connection.close();</programlisting>
+            </example>
+          <para>Constructing an HTableInterface implementation is very lightweight, and resources are
+            controlled.</para>
+            <warning>
+              <title><code>HTablePool</code> is Deprecated</title>
+              <para>Previous versions of this guide discussed <code>HTablePool</code>, which was
+                deprecated in HBase 0.94, 0.95, and 0.96, and removed in 0.98.1, by <link
+                  xlink:href="https://issues.apache.org/jira/browse/HBASE-6580">HBASE-6500</link>.
+                Please use <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HConnection.html"><code>HConnection</code></link> instead.</para>
+            </warning>
+          </section>
+   	  </section>
+	   <section xml:id="client.writebuffer"><title>WriteBuffer and Batch Methods</title>
+           <para>If <xref linkend="perf.hbase.client.autoflush" /> is turned off on
+               <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html">HTable</link>,
+               <classname>Put</classname>s are sent to RegionServers when the writebuffer
+               is filled.  The writebuffer is 2MB by default.  Before an HTable instance is
+               discarded, either <methodname>close()</methodname> or
+               <methodname>flushCommits()</methodname> should be invoked so Puts
+               will not be lost.
+	      </para>
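+       <para>As a sketch of the pattern above (assuming an existing <code>Configuration</code>
+           named <code>conf</code> and a hypothetical table <code>myTable</code>):</para>
+       <programlisting language="java">HTable table = new HTable(conf, "myTable");
+table.setAutoFlush(false);      // defer sending Puts until the writebuffer fills
+Put put = new Put(Bytes.toBytes("row1"));
+put.add(Bytes.toBytes("cf"), Bytes.toBytes("qual"), Bytes.toBytes("value"));
+table.put(put);                 // buffered client-side, not yet sent to the RegionServer
+table.flushCommits();           // explicitly push buffered Puts
+table.close();                  // close() also flushes remaining buffered Puts</programlisting>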
+	      <para>Note: <code>htable.delete(Delete);</code> does not go in the writebuffer!  This only applies to Puts.
+	      </para>
+	      <para>For additional information on write durability, review the <link xlink:href="../acid-semantics.html">ACID semantics</link> page.
+	      </para>
+       <para>For fine-grained control of batching of
+           <classname>Put</classname>s or <classname>Delete</classname>s,
+           see the <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#batch%28java.util.List%29">batch</link> methods on HTable.
+	   </para>
+	   </section>
+	   <section xml:id="client.external"><title>External Clients</title>
+           <para>Information on non-Java clients and custom protocols is covered in <xref linkend="external_apis" />
+           </para>
+		</section>
+	</section>
+
+    <section xml:id="client.filter"><title>Client Request Filters</title>
+      <para><link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html">Get</link> and <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html">Scan</link> instances can be
+       optionally configured with <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/Filter.html">filters</link> which are applied on the RegionServer.
+      </para>
+      <para>Filters can be confusing because there are many different types, and it is best to approach them by understanding the groups
+      of Filter functionality.
+      </para>
+      <section xml:id="client.filter.structural"><title>Structural</title>
+        <para>Structural Filters contain other Filters.</para>
+        <section xml:id="client.filter.structural.fl"><title>FilterList</title>
+          <para><link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FilterList.html">FilterList</link>
+          represents a list of Filters with a relationship of <code>FilterList.Operator.MUST_PASS_ALL</code> or
+          <code>FilterList.Operator.MUST_PASS_ONE</code> between the Filters.  The following example shows an 'or' between two
+          Filters (checking for either 'my value' or 'my other value' on the same attribute).</para>
+<programlisting language="java">
+FilterList list = new FilterList(FilterList.Operator.MUST_PASS_ONE);
+SingleColumnValueFilter filter1 = new SingleColumnValueFilter(
+	cf,
+	column,
+	CompareOp.EQUAL,
+	Bytes.toBytes("my value")
+	);
+list.addFilter(filter1);
+SingleColumnValueFilter filter2 = new SingleColumnValueFilter(
+	cf,
+	column,
+	CompareOp.EQUAL,
+	Bytes.toBytes("my other value")
+	);
+list.addFilter(filter2);
+scan.setFilter(list);
+</programlisting>
+        </section>
+      </section>
+      <section
+        xml:id="client.filter.cv">
+        <title>Column Value</title>
+        <section
+          xml:id="client.filter.cv.scvf">
+          <title>SingleColumnValueFilter</title>
+          <para><link
+              xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/SingleColumnValueFilter.html">SingleColumnValueFilter</link>
+            can be used to test column values for equivalence (<code><link
+                xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/CompareFilter.CompareOp.html">CompareOp.EQUAL</link>
+            </code>), inequality (<code>CompareOp.NOT_EQUAL</code>), or ranges (e.g.,
+              <code>CompareOp.GREATER</code>). The following is an example of testing a
+            column for equivalence to the String value "my value"...</para>
+          <programlisting language="java">
+SingleColumnValueFilter filter = new SingleColumnValueFilter(
+	cf,
+	column,
+	CompareOp.EQUAL,
+	Bytes.toBytes("my value")
+	);
+scan.setFilter(filter);
+</programlisting>
+        </section>
+      </section>
+      <section
+        xml:id="client.filter.cvp">
+        <title>Column Value Comparators</title>
+        <para>There are several Comparator classes in the Filter package that deserve special
+          mention. These Comparators are used in concert with other Filters, such as <xref
+            linkend="client.filter.cv.scvf" />. </para>
+        <section
+          xml:id="client.filter.cvp.rcs">
+          <title>RegexStringComparator</title>
+          <para><link
+              xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/RegexStringComparator.html">RegexStringComparator</link>
+            supports regular expressions for value comparisons.</para>
+          <programlisting language="java">
+RegexStringComparator comp = new RegexStringComparator("my.");   // any value that starts with 'my'
+SingleColumnValueFilter filter = new SingleColumnValueFilter(
+	cf,
+	column,
+	CompareOp.EQUAL,
+	comp
+	);
+scan.setFilter(filter);
+</programlisting>
+          <para>See the Oracle JavaDoc for <link
+              xlink:href="http://download.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html">supported
+              RegEx patterns in Java</link>. </para>
+        </section>
+        <section
+          xml:id="client.filter.cvp.SubStringComparator">
+          <title>SubstringComparator</title>
+          <para><link
+              xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/SubstringComparator.html">SubstringComparator</link>
+            can be used to determine if a given substring exists in a value. The comparison is
+            case-insensitive. </para>
+          <programlisting language="java">
+SubstringComparator comp = new SubstringComparator("y val");   // looking for 'my value'
+SingleColumnValueFilter filter = new SingleColumnValueFilter(
+	cf,
+	column,
+	CompareOp.EQUAL,
+	comp
+	);
+scan.setFilter(filter);
+</programlisting>
+        </section>
+        <section
+          xml:id="client.filter.cvp.bfp">
+          <title>BinaryPrefixComparator</title>
+          <para>See <link
+              xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/BinaryPrefixComparator.html">BinaryPrefixComparator</link>.</para>
+        </section>
+        <section
+          xml:id="client.filter.cvp.bc">
+          <title>BinaryComparator</title>
+          <para>See <link
+              xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/BinaryComparator.html">BinaryComparator</link>.</para>
+        </section>
+      </section>
+      <section
+        xml:id="client.filter.kvm">
+        <title>KeyValue Metadata</title>
+        <para>As HBase stores data internally as KeyValue pairs, KeyValue Metadata Filters evaluate
+          the existence of keys (i.e., ColumnFamily:Column qualifiers) for a row, as opposed to
+          values, as in the previous section. </para>
+        <section
+          xml:id="client.filter.kvm.ff">
+          <title>FamilyFilter</title>
+          <para><link
+              xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FamilyFilter.html">FamilyFilter</link>
+            can be used to filter on the ColumnFamily. It is generally a better idea to select
+            ColumnFamilies in the Scan than to do it with a Filter.</para>
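+          <para>If a filter is nonetheless required, a minimal sketch (with a hypothetical family
+            name) might look like the following; in most cases, prefer
+            <code>scan.addFamily(family)</code> instead:</para>
+          <programlisting language="java">
+Filter f = new FamilyFilter(CompareOp.EQUAL, new BinaryComparator(Bytes.toBytes("cf")));
+scan.setFilter(f);
+</programlisting>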
+        </section>
+        <section
+          xml:id="client.filter.kvm.qf">
+          <title>QualifierFilter</title>
+          <para><link
+              xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/QualifierFilter.html">QualifierFilter</link>
+            can be used to filter based on Column (aka Qualifier) name. </para>
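+          <para>A minimal sketch, assuming a hypothetical qualifier named <code>myqual</code>:</para>
+          <programlisting language="java">
+Filter f = new QualifierFilter(CompareOp.EQUAL, new BinaryComparator(Bytes.toBytes("myqual")));
+scan.setFilter(f);
+</programlisting>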
+        </section>
+        <section
+          xml:id="client.filter.kvm.cpf">
+          <title>ColumnPrefixFilter</title>
+          <para><link
+              xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/ColumnPrefixFilter.html">ColumnPrefixFilter</link>
+            can be used to filter based on the lead portion of Column (aka Qualifier) names. </para>
+          <para>A ColumnPrefixFilter seeks ahead to the first column matching the prefix in each row
+            and for each involved column family. It can be used to efficiently get a subset of the
+            columns in very wide rows. </para>
+          <para>Note: The same column qualifier can be used in different column families. This
+            filter returns all matching columns. </para>
+          <para>Example: Find all columns in a row and family that start with "abc"</para>
+          <programlisting language="java">
+HTableInterface t = ...;
+byte[] row = ...;
+byte[] family = ...;
+byte[] prefix = Bytes.toBytes("abc");
+Scan scan = new Scan(row, row); // (optional) limit to one row
+scan.addFamily(family); // (optional) limit to one family
+Filter f = new ColumnPrefixFilter(prefix);
+scan.setFilter(f);
+scan.setBatch(10); // set this if there could be many columns returned
+ResultScanner rs = t.getScanner(scan);
+for (Result r = rs.next(); r != null; r = rs.next()) {
+  for (KeyValue kv : r.raw()) {
+    // each kv represents a column
+  }
+}
+rs.close();
+</programlisting>
+        </section>
+        <section
+          xml:id="client.filter.kvm.mcpf">
+          <title>MultipleColumnPrefixFilter</title>
+          <para><link
+              xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/MultipleColumnPrefixFilter.html">MultipleColumnPrefixFilter</link>
+            behaves like ColumnPrefixFilter but allows specifying multiple prefixes. </para>
+          <para>Like ColumnPrefixFilter, MultipleColumnPrefixFilter efficiently seeks ahead to the
+            first column matching the lowest prefix and also seeks past ranges of columns between
+            prefixes. It can be used to efficiently get discontinuous sets of columns from very wide
+            rows. </para>
+          <para>Example: Find all columns in a row and family that start with "abc" or "xyz"</para>
+          <programlisting language="java">
+HTableInterface t = ...;
+byte[] row = ...;
+byte[] family = ...;
+byte[][] prefixes = new byte[][] {Bytes.toBytes("abc"), Bytes.toBytes("xyz")};
+Scan scan = new Scan(row, row); // (optional) limit to one row
+scan.addFamily(family); // (optional) limit to one family
+Filter f = new MultipleColumnPrefixFilter(prefixes);
+scan.setFilter(f);
+scan.setBatch(10); // set this if there could be many columns returned
+ResultScanner rs = t.getScanner(scan);
+for (Result r = rs.next(); r != null; r = rs.next()) {
+  for (KeyValue kv : r.raw()) {
+    // each kv represents a column
+  }
+}
+rs.close();
+</programlisting>
+        </section>
+        <section
+          xml:id="client.filter.kvm.crf ">
+          <title>ColumnRangeFilter</title>
+          <para>A <link
+              xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/ColumnRangeFilter.html">ColumnRangeFilter</link>
+            allows efficient intra row scanning. </para>
+          <para>A ColumnRangeFilter can seek ahead to the first matching column for each involved
+            column family. It can be used to efficiently get a 'slice' of the columns of a very wide
+            row. For example, you may have a million columns in a row but only want to look at columns
+            bbbb-bbdd. </para>
+          <para>Note: The same column qualifier can be used in different column families. This
+            filter returns all matching columns. </para>
+          <para>Example: Find all columns in a row and family between "bbbb" (inclusive) and "bbdd"
+            (inclusive)</para>
+          <programlisting language="java">
+HTableInterface t = ...;
+byte[] row = ...;
+byte[] family = ...;
+byte[] startColumn = Bytes.toBytes("bbbb");
+byte[] endColumn = Bytes.toBytes("bbdd");
+Scan scan = new Scan(row, row); // (optional) limit to one row
+scan.addFamily(family); // (optional) limit to one family
+Filter f = new ColumnRangeFilter(startColumn, true, endColumn, true);
+scan.setFilter(f);
+scan.setBatch(10); // set this if there could be many columns returned
+ResultScanner rs = t.getScanner(scan);
+for (Result r = rs.next(); r != null; r = rs.next()) {
+  for (KeyValue kv : r.raw()) {
+    // each kv represents a column
+  }
+}
+rs.close();
+</programlisting>
+            <para>Note:  Introduced in HBase 0.92</para>
+        </section>
+      </section>
+      <section xml:id="client.filter.row"><title>RowKey</title>
+        <section xml:id="client.filter.row.rf"><title>RowFilter</title>
+          <para>It is generally a better idea to use the startRow/stopRow methods on Scan for row selection; however,
+          <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/RowFilter.html">RowFilter</link> can also be used.</para>
+        </section>
+      </section>
+      <section xml:id="client.filter.utility"><title>Utility</title>
+        <section xml:id="client.filter.utility.fkof"><title>FirstKeyOnlyFilter</title>
+          <para>This is primarily used for rowcount jobs.
+          See <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FirstKeyOnlyFilter.html">FirstKeyOnlyFilter</link>.</para>
+        </section>
+      </section>
+	</section>  <!--  client.filter -->
+
+    <section xml:id="master"><title>Master</title>
+      <para><code>HMaster</code> is the implementation of the Master Server. The Master server is
+        responsible for monitoring all RegionServer instances in the cluster, and is the interface
+        for all metadata changes. In a distributed cluster, the Master typically runs on the <xref
+          linkend="arch.hdfs.nn"/>. J Mohamed Zahoor goes into some more detail on the Master
+        Architecture in this blog posting, <link
+          xlink:href="http://blog.zahoor.in/2012/08/hbase-hmaster-architecture/">HBase HMaster
+          Architecture </link>.</para>
+       <section xml:id="master.startup"><title>Startup Behavior</title>
+         <para>If run in a multi-Master environment, all Masters compete to run the cluster.  If the active
+         Master loses its lease in ZooKeeper (or the Master shuts down), then the remaining Masters jostle to
+         take over the Master role.
+         </para>
+       </section>
+      <section
+        xml:id="master.runtime">
+        <title>Runtime Impact</title>
+        <para>A common dist-list question involves what happens to an HBase cluster when the Master
+          goes down. Because the HBase client talks directly to the RegionServers, the cluster can
+          still function in a "steady state." Additionally, per <xref
+            linkend="arch.catalog" />, <code>hbase:meta</code> exists as an HBase table and is not
+          resident in the Master. However, the Master controls critical functions such as
+          RegionServer failover and completing region splits. So while the cluster can still run for
+          a short time without the Master, the Master should be restarted as soon as possible.
+        </para>
+      </section>
+       <section xml:id="master.api"><title>Interface</title>
+         <para>The methods exposed by <code>HMasterInterface</code> are primarily metadata-oriented methods:
+         <itemizedlist>
+            <listitem><para>Table (createTable, modifyTable, removeTable, enable, disable)
+            </para></listitem>
+            <listitem><para>ColumnFamily (addColumn, modifyColumn, removeColumn)
+            </para></listitem>
+            <listitem><para>Region (move, assign, unassign)
+            </para></listitem>
+         </itemizedlist>
+         For example, when the <code>HBaseAdmin</code> method <code>disableTable</code> is invoked, it is serviced by the Master server.
+         </para>
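+         <para>As a client-side sketch using the HBase 1.0 API (the <code>conf</code> instance and
+           table name are hypothetical), the call below is serviced by the Master rather than by
+           the client or the RegionServers:</para>
+         <programlisting language="java">
+try (Connection connection = ConnectionFactory.createConnection(conf);
+     Admin admin = connection.getAdmin()) {
+  admin.disableTable(TableName.valueOf("myTable"));  // carried out by the Master
+}
+</programlisting>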
+       </section>
+       <section xml:id="master.processes"><title>Processes</title>
+         <para>The Master runs several background threads:
+         </para>
+         <section xml:id="master.processes.loadbalancer"><title>LoadBalancer</title>
+           <para>Periodically, and when there are no regions in transition,
+             a load balancer will run and move regions around to balance the cluster's load.
+             See <xref linkend="balancer_config" /> for configuring this property.</para>
+             <para>See <xref linkend="regions.arch.assignment"/> for more information on region assignment.
+             </para>
+         </section>
+         <section xml:id="master.processes.catalog"><title>CatalogJanitor</title>
+           <para>Periodically checks and cleans up the hbase:meta table.  See <xref linkend="arch.catalog.meta" /> for more information on META.</para>
+         </section>
+       </section>
+
+     </section>
+    <section
+      xml:id="regionserver.arch">
+      <title>RegionServer</title>
+      <para><code>HRegionServer</code> is the RegionServer implementation. It is responsible for
+        serving and managing regions. In a distributed cluster, a RegionServer runs on a <xref
+          linkend="arch.hdfs.dn" />. </para>
+      <section
+        xml:id="regionserver.arch.api">
+        <title>Interface</title>
+        <para>The methods exposed by <code>HRegionInterface</code> contain both data-oriented
+          and region-maintenance methods: <itemizedlist>
+            <listitem>
+              <para>Data (get, put, delete, next, etc.)</para>
+            </listitem>
+            <listitem>
+              <para>Region (splitRegion, compactRegion, etc.)</para>
+            </listitem>
+          </itemizedlist> For example, when the <code>HBaseAdmin</code> method
+            <code>majorCompact</code> is invoked on a table, the client is actually iterating
+          through all regions for the specified table and requesting a major compaction directly to
+          each region. </para>
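+        <para>A client-side sketch of such a request, using the HBase 1.0 API with a hypothetical
+          <code>conf</code> and table name; behind this call the client works region by region:</para>
+        <programlisting language="java">
+try (Connection connection = ConnectionFactory.createConnection(conf);
+     Admin admin = connection.getAdmin()) {
+  admin.majorCompact(TableName.valueOf("myTable"));  // major compaction requested for each region
+}
+</programlisting>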
+      </section>
+      <section
+        xml:id="regionserver.arch.processes">
+        <title>Processes</title>
+        <para>The RegionServer runs a variety of background threads:</para>
+        <section
+          xml:id="regionserver.arch.processes.compactsplit">
+          <title>CompactSplitThread</title>
+          <para>Checks for splits and handles minor compactions.</para>
+        </section>
+        <section
+          xml:id="regionserver.arch.processes.majorcompact">
+          <title>MajorCompactionChecker</title>
+          <para>Checks for major compactions.</para>
+        </section>
+        <section
+          xml:id="regionserver.arch.processes.memstore">
+          <title>MemStoreFlusher</title>
+          <para>Periodically flushes in-memory writes in the MemStore to StoreFiles.</para>
+        </section>
+        <section
+          xml:id="regionserver.arch.processes.log">
+          <title>LogRoller</title>
+          <para>Periodically checks the RegionServer's WAL.</para>
+        </section>
+      </section>
+
+      <section
+        xml:id="coprocessors">
+        <title>Coprocessors</title>
+        <para>Coprocessors were added in 0.92. There is a thorough <link
+            xlink:href="https://blogs.apache.org/hbase/entry/coprocessor_introduction">Blog Overview
+            of CoProcessors</link> posted. Documentation will eventually move to this reference
+          guide, but the blog is the most current information available at this time. </para>
+      </section>
+
+      <section
+        xml:id="block.cache">
+        <title>Block Cache</title>
+
+        <para>HBase provides two different BlockCache implementations: the default onheap
+          LruBlockCache and BucketCache, which is (usually) offheap. This section
+          discusses benefits and drawbacks of each implementation, how to choose the appropriate
+          option, and configuration options for each.</para>
+
+      <note><title>Block Cache Reporting: UI</title>
+      <para>See the RegionServer UI for details on the caching deploy.  Since HBase 0.98.4, the
+          Block Cache detail has been significantly extended to show configurations,
+          sizings, current usage, time-in-the-cache, and even detail on block counts and types.</para>
+  </note>
+
+        <section>
+
+          <title>Cache Choices</title>
+          <para><classname>LruBlockCache</classname> is the original implementation, and is
+              entirely within the Java heap. <classname>BucketCache</classname> is mainly
+              intended for keeping blockcache data offheap, although BucketCache can also
+              keep data onheap and serve from a file-backed cache.
+              <note><title>BucketCache is production ready as of hbase-0.98.6</title>
+                <para>To run with BucketCache, you need HBASE-11678. This was included in
+                  hbase-0.98.6.
+                </para>
+              </note>
+          </para>
+
+          <para>Fetching from BucketCache will always be slower than fetching from the native
+              onheap LruBlockCache. However, latencies tend to be
+              less erratic across time, because there is less garbage collection when you use
+              BucketCache since it is managing BlockCache allocations, not the GC. If the
+              BucketCache is deployed in offheap mode, this memory is not managed by the
+              GC at all. This is why you would use BucketCache: to keep latencies less erratic and to mitigate GC pauses
+              and heap fragmentation.  See Nick Dimiduk's <link
+              xlink:href="http://www.n10k.com/blog/blockcache-101/">BlockCache 101</link> for
+            comparisons running onheap vs offheap tests. Also see
+            <link xlink:href="http://people.apache.org/~stack/bc/">Comparing BlockCache Deploys</link>,
+            which finds that if your dataset fits inside your LruBlockCache deploy, you should use it;
+            otherwise, if you are experiencing cache churn (or you want your cache to exist beyond the
+            vagaries of Java GC), use BucketCache.
+              </para>
+
+              <para>When you enable BucketCache, you are enabling a two tier caching
+              system, an L1 cache which is implemented by an instance of LruBlockCache and
+              an offheap L2 cache which is implemented by BucketCache.  Management of these
+              two tiers and the policy that dictates how blocks move between them is done by
+              <classname>CombinedBlockCache</classname>. It keeps all DATA blocks in the L2
+              BucketCache and meta blocks -- INDEX and BLOOM blocks --
+              onheap in the L1 <classname>LruBlockCache</classname>.
+              See <xref linkend="offheap.blockcache" /> for more detail on going offheap.</para>
+        </section>
+
+        <section xml:id="cache.configurations">
+            <title>General Cache Configurations</title>
+          <para>Apart from the cache implementation itself, you can set some general configuration
+            options to control how the cache performs. See <link
+              xlink:href="http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/io/hfile/CacheConfig.html"
+            />. After setting any of these options, restart or rolling restart your cluster for the
+            configuration to take effect. Check logs for errors or unexpected behavior.</para>
+          <para>See also <xref linkend="blockcache.prefetch"/>, which discusses a new option
+            introduced in <link xlink:href="https://issues.apache.org/jira/browse/HBASE-9857"
+              >HBASE-9857</link>.</para>
+      </section>
+
+        <section
+          xml:id="block.cache.design">
+          <title>LruBlockCache Design</title>
+          <para>The LruBlockCache is an LRU cache that contains three levels of block priority to
+            allow for scan-resistance and in-memory ColumnFamilies: </para>
+          <itemizedlist>
+            <listitem>
+              <para>Single access priority: The first time a block is loaded from HDFS it normally
+                has this priority and it will be part of the first group to be considered during
+                evictions. The advantage is that scanned blocks are more likely to get evicted than
+                blocks that are getting more usage.</para>
+            </listitem>
+            <listitem>
+              <para>Multi access priority: If a block in the previous priority group is accessed
+                again, it upgrades to this priority. It is thus part of the second group considered
+                during evictions.</para>
+            </listitem>
+            <listitem xml:id="hbase.cache.inmemory">
+              <para>In-memory access priority: If the block's family was configured to be
+                "in-memory", it will be part of this priority disregarding the number of times it
+                was accessed. Catalog tables are configured like this. This group is the last one
+                considered during evictions.</para>
+            <para>To mark a column family as in-memory, call
+                <programlisting language="java">HColumnDescriptor.setInMemory(true);</programlisting> if creating a table from java,
+                or set <command>IN_MEMORY => true</command> when creating or altering a table in
+                the shell: e.g.  <programlisting>hbase(main):003:0> create  't', {NAME => 'f', IN_MEMORY => 'true'}</programlisting></para>
+            </listitem>
+          </itemizedlist>
+          <para> For more information, see the <link
+              xlink:href="http://hbase.apache.org/xref/org/apache/hadoop/hbase/io/hfile/LruBlockCache.html">LruBlockCache
+              source</link>.
+          </para>
+        </section>
+        <section
+          xml:id="block.cache.usage">
+          <title>LruBlockCache Usage</title>
+          <para>Block caching is enabled by default for all the user tables which means that any
+            read operation will load the LRU cache. This might be good for a large number of use
+            cases, but further tunings are usually required in order to achieve better performance.
+            An important concept is the <link
+              xlink:href="http://en.wikipedia.org/wiki/Working_set_size">working set size</link>, or
+            WSS, which is: "the amount of memory needed to compute the answer to a problem". For a
+            website, this would be the data that's needed to answer the queries over a short amount
+            of time. </para>
+          <para>The way to calculate how much memory is available in HBase for caching is: </para>
+          <programlisting>
+            number of region servers * heap size * hfile.block.cache.size * 0.99
+        </programlisting>
+          <para>The default value for the block cache is 0.25, which represents 25% of the available
+            heap. The last value (99%) is the default acceptable loading factor in the LRU cache
+            after which eviction is started. It is included in this equation because it would be
+            unrealistic to use 100% of the available memory; doing so would cause the process to
+            block at the point where it loads new blocks.
+            Here are some examples: </para>
+          <itemizedlist>
+            <listitem>
+              <para>One region server with the default heap size (1 GB) and the default block cache
+                size will have 253 MB of block cache available.</para>
+            </listitem>
+            <listitem>
+              <para>20 region servers with the heap size set to 8 GB and a default block cache size
+                will have 39.6 GB of block cache.</para>
+            </listitem>
+            <listitem>
+              <para>100 region servers with the heap size set to 24 GB and a block cache size of 0.5
+                will have about 1.16 TB of block cache.</para>
+            </listitem>
+        </itemizedlist>
+        <para>Your data is not the only resident of the block cache. Here are others that you may have to take into account:
+        </para>
+          <variablelist>
+            <varlistentry>
+              <term>Catalog Tables</term>
+              <listitem>
+                <para>The <code>-ROOT-</code> (prior to HBase 0.96. See <xref
+                    linkend="arch.catalog.root" />) and <code>hbase:meta</code> tables are forced
+                  into the block cache and have the in-memory priority which means that they are
+                  harder to evict. The former never uses more than a few hundred bytes while the
+                  latter can occupy a few MBs (depending on the number of regions).</para>
+              </listitem>
+            </varlistentry>
+            <varlistentry>
+              <term>HFiles Indexes</term>
+              <listitem>
+                <para>An <firstterm>hfile</firstterm> is the file format that HBase uses to store
+                  data in HDFS. It contains a multi-layered index which allows HBase to seek to the
+                  data without having to read the whole file. The size of those indexes is a factor
+                  of the block size (64KB by default), the size of your keys and the amount of data
+                  you are storing. For big data sets it's not unusual to see numbers around 1GB per
+                  region server, although not all of it will be in cache because the LRU will evict
+                  indexes that aren't used.</para>
+              </listitem>
+            </varlistentry>
+            <varlistentry>
+              <term>Keys</term>
+              <listitem>
+                <para>The values that are stored are only half the picture, since each value is
+                  stored along with its keys (row key, family qualifier, and timestamp). See <xref
+                    linkend="keysize" />.</para>
+              </listitem>
+            </varlistentry>
+            <varlistentry>
+              <term>Bloom Filters</term>
+              <listitem>
+                <para>Just like the HFile indexes, those data structures (when enabled) are stored
+                  in the LRU.</para>
+              </listitem>
+            </varlistentry>
+          </variablelist>
+          <para>Currently the recommended way to measure HFile index and bloom filter sizes is to
+            look at the region server web UI and check out the relevant metrics. For keys, sampling
+            can be done by using the HFile command line tool and looking for the average key size
+            metric. Since HBase 0.98.3, you can view detail on BlockCache stats and metrics
+            in a special Block Cache section in the UI.</para>
+          <para>It's generally bad to use block caching when the WSS doesn't fit in memory. This is
+            the case when you have for example 40GB available across all your region servers' block
+            caches but you need to process 1TB of data. One of the reasons is that the churn
+            generated by the evictions will trigger more garbage collections unnecessarily. Here are
+            two use cases: </para>
+        <itemizedlist>
+            <listitem>
+              <para>Fully random reading pattern: This is a case where you almost never access the
+                same row twice within a short amount of time such that the chance of hitting a
+                cached block is close to 0. Setting block caching on such a table is a waste of
+                memory and CPU cycles; worse, it will generate more garbage for the JVM
+                to pick up. For more information on monitoring GC, see <xref
+                  linkend="trouble.log.gc" />.</para>
+            </listitem>
+            <listitem>
+              <para>Mapping a table: In a typical MapReduce job that takes a table as input, every
+                row will be read only once, so there's no need to put them into the block cache. The
+                Scan object has the option of turning this off via the setCacheBlocks method (set it to
+                false). You can still keep block caching turned on for this table if you need fast
+                random read access. An example would be counting the number of rows in a table that
+                serves live traffic: caching every block of that table would create massive churn
+                and would surely evict data that's currently in use. </para>
+            </listitem>
+          </itemizedlist>
+          <section xml:id="data.blocks.in.fscache">
+            <title>Caching META blocks only (DATA blocks in fscache)</title>
+            <para>An interesting setup is one where we cache META blocks only and we read DATA
+              blocks in on each access. If the DATA blocks fit inside fscache, this alternative
+              may make sense when access is completely random across a very large dataset.
+              To enable this setup, alter your table and for each column family
+              set <varname>BLOCKCACHE => 'false'</varname>.  You are 'disabling' the
+              BlockCache for this column family only; you can never disable the caching of
+              META blocks. Since
+              <link xlink:href="https://issues.apache.org/jira/browse/HBASE-4683">HBASE-4683 Always cache index and bloom blocks</link>,
+              we will cache META blocks even if the BlockCache is disabled.
+            </para>
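+            <para>A sketch of the shell command (table and column family names are
+              hypothetical):</para>
+            <programlisting>hbase(main):004:0> alter 't', {NAME => 'f', BLOCKCACHE => 'false'}</programlisting>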
+          </section>
+        </section>
+        <section
+          xml:id="offheap.blockcache">
+          <title>Offheap Block Cache</title>
+          <section xml:id="enable.bucketcache">
+            <title>How to Enable BucketCache</title>
+                <para>The usual deploy of BucketCache is via a managing class that sets up two caching tiers: an L1 onheap cache
+                    implemented by LruBlockCache and a second L2 cache implemented with BucketCache. The managing class is <link
+                        xlink:href="http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/io/hfile/CombinedBlockCache.html">CombinedBlockCache</link> by default.
+            The just-previous link describes the caching 'policy' implemented by CombinedBlockCache. In short, it works
+            by keeping meta blocks -- INDEX and BLOOM in the L1, onheap LruBlockCache tier -- and DATA
+            blocks are kept in the L2, BucketCache tier. Since HBase 1.0, it is possible to amend this behavior
+            and ask that a column family have both its meta and DATA blocks hosted onheap in the L1 tier by
+            setting <varname>cacheDataInL1</varname> via
+                  <code>HColumnDescriptor.setCacheDataInL1(true)</code>
+            or, in the shell, by creating or amending column families with <varname>CACHE_DATA_IN_L1</varname>
+            set to true: e.g. <programlisting>hbase(main):003:0> create 't', {NAME => 't', CONFIGURATION => {CACHE_DATA_IN_L1 => 'true'}}</programlisting></para>
+
+        <para>The BucketCache Block Cache can be deployed onheap, offheap, or file based.
+            You set which via the
+            <varname>hbase.bucketcache.ioengine</varname> setting.  Setting it to
+            <varname>heap</varname> will have BucketCache deployed inside the 
+            allocated java heap. Setting it to <varname>offheap</varname> will have
+            BucketCache make its allocations offheap,
+            and an ioengine setting of <varname>file:PATH_TO_FILE</varname> will direct
+            BucketCache to use file-based caching (useful in particular if you have fast I/O attached to the box, such
+            as SSDs).
+        </para>
+        <para xml:id="raw.l1.l2">It is possible to deploy an L1+L2 setup where we bypass the CombinedBlockCache
+            policy and have BucketCache working as a strict L2 cache to the L1
+              LruBlockCache. For such a setup, set <varname>CacheConfig.BUCKET_CACHE_COMBINED_KEY</varname> to
+              <literal>false</literal>. In this mode, on eviction from L1, blocks go to L2.
+              When a block is cached, it is cached first in L1. When we go to look for a cached block,
+              we look first in L1 and if none found, then search L2.  Let us call this deploy format,
+              <emphasis><indexterm><primary>Raw L1+L2</primary></indexterm></emphasis>.</para>
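+          <para>As an illustrative sketch only, such a deploy would be configured in the
+              RegionServer's <filename>hbase-site.xml</filename>. The property name shown below is
+              assumed to be the one behind <varname>CacheConfig.BUCKET_CACHE_COMBINED_KEY</varname>;
+              verify it against the <classname>CacheConfig</classname> source for your version.</para>
+          <programlisting language="xml">
+<![CDATA[<!-- Assumed property name; check CacheConfig.BUCKET_CACHE_COMBINED_KEY in your HBase version -->
+<property>
+  <name>hbase.bucketcache.combinedcache.enabled</name>
+  <value>false</value>
+</property>]]>
+          </programlisting>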
+          <para>Other BucketCache configs include: specifying a location to persist cache to across
+              restarts, how many threads to use writing the cache, etc.  See the
+              <link xlink:href="https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/io/hfile/CacheConfig.html">CacheConfig.html</link>
+              class for configuration options and descriptions.</para>
+
+            <procedure>
+              <title>BucketCache Example Configuration</title>
+              <para>This sample provides a configuration for a 4 GB offheap BucketCache with a 1 GB
+                  onheap cache. Configuration is performed on the RegionServer.  Setting
+                  <varname>hbase.bucketcache.ioengine</varname> and 
+                  <varname>hbase.bucketcache.size</varname> &gt; 0 enables CombinedBlockCache.
+                  Let us presume that the RegionServer has been set to run with a 5G heap:
+                  i.e. HBASE_HEAPSIZE=5g.
+              </para>
+              <step>
+                <para>First, edit the RegionServer's <filename>hbase-env.sh</filename> and set
+                  <varname>HBASE_OFFHEAPSIZE</varname> to a value greater than the offheap size wanted, in
+                  this case, 4 GB (expressed as 4G).  Let's set it to 5G.  That'll be 4G
+                  for our offheap cache and 1G for any other uses of offheap memory (there are
+                  users of offheap memory other than the BlockCache; e.g. the DFSClient
+                  in the RegionServer can make use of offheap memory). See <xref linkend="direct.memory" />.</para>
+                <programlisting>HBASE_OFFHEAPSIZE=5G</programlisting>
+              </step>
+              <step>
+                <para>Next, add the following configuration to the RegionServer's
+                    <filename>hbase-site.xml</filename>.</para>
+                <programlisting language="xml">
+<![CDATA[<property>
+  <name>hbase.bucketcache.ioengine</name>
+  <value>offheap</value>
+</property>
+<property>
+  <name>hfile.block.cache.size</name>
+  <value>0.2</value>
+</property>
+<property>
+  <name>hbase.bucketcache.size</name>
+  <value>4096</value>
+</property>]]>
+          </programlisting>
+              </step>
+              <step>
+                <para>Restart or rolling restart your cluster, and check the logs for any
+                  issues.</para>
+              </step>
+            </procedure>
+            <para>In the above, we set the BucketCache to be 4G.  We configured the onheap
+                LruBlockCache to have 0.2 of the RegionServer's heap size (0.2 * 5G = 1G).
+                In other words, configure the L1 LruBlockCache just as you normally would,
+                as if there were no L2 BucketCache present.
+            </para>
+            <para><link xlink:href="https://issues.apache.org/jira/browse/HBASE-10641"
+                >HBASE-10641</link> introduced the ability to configure multiple sizes for the
+              buckets of the bucketcache, in HBase 0.98 and newer. To configure multiple bucket
+              sizes, set the property <option>hbase.bucketcache.bucket.sizes</option>
+              to a comma-separated list of block sizes,
+              ordered from smallest to largest, with no spaces. The goal is to optimize the bucket
+              sizes based on your data access patterns. The following example configures buckets of
+              size 4096 and 8192.</para>
+            <screen language="xml"><![CDATA[
+<property>
+  <name>hbase.bucketcache.bucket.sizes</name>
+  <value>4096,8192</value>
+</property>
+              ]]></screen>
+            <note xml:id="direct.memory">
+                <title>Direct Memory Usage In HBase</title>
+                <para>The default maximum direct memory varies by JVM.  Traditionally it is 64M,
+                    some relation to the allocated heap size (-Xmx), or no limit at all (apparently the case in JDK 7).
+                    HBase servers use direct memory; in particular, with short-circuit reading enabled, the hosted DFSClient will
+                    allocate direct memory buffers.  If you do offheap block caching, you'll
+                    be making use of direct memory.  When starting your JVM, make sure
+                    the <varname>-XX:MaxDirectMemorySize</varname> setting in
+                    <filename>conf/hbase-env.sh</filename> is set to some value that is
+                    higher than what you have allocated to your offheap blockcache
+                    (<varname>hbase.bucketcache.size</varname>).  It should be larger than your offheap block
+                    cache and then some for DFSClient usage. How much the DFSClient uses is not
+                    easy to quantify; it is roughly the number of open hfiles * <varname>hbase.dfs.client.read.shortcircuit.buffer.size</varname>,
+                    where hbase.dfs.client.read.shortcircuit.buffer.size is set to 128k in HBase -- see <filename>hbase-default.xml</filename>
+                    default configurations. For example, 1,000 open hfiles at 128k apiece comes to roughly 125 MB of direct memory.
+                        Direct memory is part of the Java process footprint, but it is separate from the object
+                        heap allocated by -Xmx. The value allocated by MaxDirectMemorySize must not exceed
+                        physical RAM, and is likely to be less than the total available RAM due to other
+                        memory requirements and system constraints.
+                </para>
+              <para>You can see how much memory -- onheap and offheap/direct -- a RegionServer is
+                configured to use and how much it is using at any one time by looking at the
+                  <emphasis>Server Metrics: Memory</emphasis> tab in the UI. It can also be gotten
+                via JMX. In particular the direct memory currently used by the server can be found
+                on the <varname>java.nio.type=BufferPool,name=direct</varname> bean. Terracotta has
+                a <link
+                  xlink:href="http://terracotta.org/documentation/4.0/bigmemorygo/configuration/storage-options"
+                  >good write up</link> on using offheap memory in java. It is for their product
+                BigMemory but a lot of the issues noted apply in general to any attempt at going
+                offheap. Check it out.</para>
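+              <para>If you prefer to check direct memory programmatically rather than through the UI
+                or a JMX console, the <varname>direct</varname> buffer pool mentioned above can be
+                read with plain Java. The following is a minimal, self-contained sketch using the
+                JDK's <classname>BufferPoolMXBean</classname> (Java 7 and later); it is not
+                HBase-specific.</para>
+              <programlisting language="java"><![CDATA[
+import java.lang.management.BufferPoolMXBean;
+import java.lang.management.ManagementFactory;
+import java.util.List;
+
+public class DirectMemoryProbe {
+  public static void main(String[] args) {
+    // The "direct" pool backs ByteBuffer.allocateDirect allocations, which is what
+    // an offheap BucketCache and the DFSClient use.
+    List<BufferPoolMXBean> pools = ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
+    for (BufferPoolMXBean pool : pools) {
+      System.out.println(pool.getName() + ": used=" + pool.getMemoryUsed()
+          + " bytes, capacity=" + pool.getTotalCapacity() + " bytes");
+    }
+  }
+}
+]]></programlisting>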
+            </note>
+              <note xml:id="hbase.bucketcache.percentage.in.combinedcache"><title>hbase.bucketcache.percentage.in.combinedcache</title>
+                  <para>This is a pre-HBase 1.0 configuration removed because it
+                      was confusing. It was a float that you would set to some value
+                      between 0.0 and 1.0.  Its default was 0.9. If the deploy was using
+                      CombinedBlockCache, then the LruBlockCache L1 size was calculated to
+                      be (1 - <varname>hbase.bucketcache.percentage.in.combinedcache</varname>) * <varname>size-of-bucketcache</varname> 
+                      and the BucketCache size was <varname>hbase.bucketcache.percentage.in.combinedcache</varname> * size-of-bucket-cache,
+                      where size-of-bucket-cache itself is EITHER the value of the configuration hbase.bucketcache.size
+                      IF it was specified in megabytes OR <varname>hbase.bucketcache.size</varname> * <varname>-XX:MaxDirectMemorySize</varname> if
+                      <varname>hbase.bucketcache.size</varname> is between 0 and 1.0. For example, with the default of 0.9
+                      and a 4G bucket cache, the L1 LruBlockCache would be 0.1 * 4G = 0.4G and the BucketCache 3.6G.
+                  </para>
+                  <para>In 1.0, it is more straightforward. The L1 LruBlockCache size
+                      is set as a fraction of the java heap using the hfile.block.cache.size setting
+                      (not the best name) and L2 is set as above, either in absolute
+                      megabytes or as a fraction of allocated maximum direct memory.
+                  </para>
+              </note>
+          </section>
+        </section>
+        <section>
+          <title>Compressed BlockCache</title>
+          <para><link xlink:href="https://issues.apache.org/jira/browse/HBASE-11331"
+              >HBASE-11331</link> introduced lazy blockcache decompression, more simply referred to
+            as compressed blockcache. When compressed blockcache is enabled. data and encoded data
+            as compressed blockcache. When compressed blockcache is enabled, data and encoded data
+            decompressed and decrypted before caching.</para>
+          <para xlink:href="https://issues.apache.org/jira/browse/HBASE-11331">For a RegionServer
+            hosting more data than can fit into cache, enabling this feature with SNAPPY compression
+            has been shown to result in a 50% increase in throughput and a 30% improvement in mean
+            latency, while increasing garbage collection by 80% and increasing overall CPU load by
+            2%. See HBASE-11331 for more details about how performance was measured and achieved.
+            For a RegionServer hosting data that can comfortably fit into cache, or if your workload
+            is sensitive to extra CPU or garbage-collection load, you may receive less
+            benefit.</para>
+          <para>Compressed blockcache is disabled by default. To enable it, set
+              <code>hbase.block.data.cachecompressed</code> to <code>true</code> in
+              <filename>hbase-site.xml</filename> on all RegionServers.</para>
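+          <para>If you want to experiment with this setting before changing
+              <filename>hbase-site.xml</filename> on a live cluster, one option is a mini-cluster
+              test. The following is a minimal sketch, assuming the HBase test jars
+              (<classname>HBaseTestingUtility</classname>) are on the classpath; it is illustrative
+              only.</para>
+          <programlisting language="java"><![CDATA[
+import org.apache.hadoop.hbase.HBaseTestingUtility;
+
+public class CompressedBlockCacheSketch {
+  public static void main(String[] args) throws Exception {
+    HBaseTestingUtility util = new HBaseTestingUtility();
+    // Cache DATA blocks in their on-disk (compressed, possibly encrypted) form.
+    util.getConfiguration().setBoolean("hbase.block.data.cachecompressed", true);
+    util.startMiniCluster();
+    try {
+      // ... run a read-heavy workload and compare cache metrics ...
+    } finally {
+      util.shutdownMiniCluster();
+    }
+  }
+}
+]]></programlisting>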
+        </section>
+      </section>
+
+      <section
+        xml:id="wal">
+        <title>Write Ahead Log (WAL)</title>
+
+        <section
+          xml:id="purpose.wal">
+          <title>Purpose</title>
+          <para>The <firstterm>Write Ahead Log (WAL)</firstterm> records all changes to data in
+            HBase, to file-based storage. Under normal operations, the WAL is not needed because
+            data changes move from the MemStore to StoreFiles. However, if a RegionServer crashes or
+          becomes unavailable before the MemStore is flushed, the WAL ensures that the changes to
+          the data can be replayed. If writing to the WAL fails, the entire operation to modify the
+          data fails.</para>
+          <para>
+            HBase uses an implementation of the <link xlink:href=
+            "http://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/wal/WAL.html"
+            >WAL</link> interface. Usually, there is only one instance of a WAL per RegionServer.
+            The RegionServer records Puts and Deletes to it, before recording them to the <xref
+              linkend="store.memstore" /> for the affected <xref
+              linkend="store" />.
+          </para>
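+          <para>To make the write path concrete, the following is a minimal client-side sketch. It
+            assumes a reachable cluster and an existing table <code>mytable</code> with a column
+            family <code>cf</code>; the names are illustrative. The durability call is included only
+            to show that recording the edit in the WAL is the default and can be tuned per
+            mutation.</para>
+          <programlisting language="java"><![CDATA[
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hbase.HBaseConfiguration;
+import org.apache.hadoop.hbase.TableName;
+import org.apache.hadoop.hbase.client.Connection;
+import org.apache.hadoop.hbase.client.ConnectionFactory;
+import org.apache.hadoop.hbase.client.Durability;
+import org.apache.hadoop.hbase.client.Put;
+import org.apache.hadoop.hbase.client.Table;
+import org.apache.hadoop.hbase.util.Bytes;
+
+public class WalDefaultWriteExample {
+  public static void main(String[] args) throws Exception {
+    Configuration conf = HBaseConfiguration.create();
+    try (Connection connection = ConnectionFactory.createConnection(conf);
+         Table table = connection.getTable(TableName.valueOf("mytable"))) {
+      Put put = new Put(Bytes.toBytes("row1"));
+      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("value"));
+      // The default durability records the edit in the RegionServer's WAL before it is
+      // applied to the MemStore. Only choose SKIP_WAL if you can afford to lose the edit
+      // on a RegionServer crash.
+      put.setDurability(Durability.USE_DEFAULT);
+      table.put(put);
+    }
+  }
+}
+]]></programlisting>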
+          <note>
+            <title>The HLog</title>
+            <para>
+              Prior to 2.0, the interface for WALs in HBase was named <classname>HLog</classname>.
+              In 0.94, HLog was the name of the implementation of the WAL. You will likely find
+              references to the HLog in documentation tailored to these older versions.
+            </para>
+          </note>
+          <para>The WAL resides in HDFS in the <filename>/hbase/WALs/</filename> directory (prior to
+            HBase 0.94, they were stored in <filename>/hbase/.logs/</filename>), with subdirectories per
+            region.</para>
+          <para> For more general information about the concept of write ahead logs, see the
+            Wikipedia <link
+              xlink:href="http://en.wikipedia.org/wiki/Write-ahead_logging">Write-Ahead Log</link>
+            article. </para>
+        </section>
+        <section
+          xml:id="wal_flush">
+          <title>WAL Flushing</title>
+          <para>TODO (describe). </para>
+        </section>
+
+        <section
+          xml:id="wal_splitting">
+          <title>WAL Splitting</title>
+
+          <para>A RegionServer serves many regions. All of the regions in a region server share the
+            same active WAL file. Each edit in the WAL file includes information about which region
+            it belongs to. When a region is opened, the edits in the WAL file which belong to that
+            region need to be replayed. Therefore, edits in the WAL file must be grouped by region
+            so that particular sets can be replayed to regenerate the data in a particular region.
+            The process of grouping the WAL edits by region is called <firstterm>log
+              splitting</firstterm>. It is a critical process for recovering data if a region server
+            fails.</para>
+          <para>Log splitting is done by the HMaster during cluster start-up or by the ServerShutdownHandler
+            as a region server shuts down. To guarantee consistency, affected regions
+            are unavailable until data is restored: all WAL edits need to be recovered and replayed
+            before a given region can become available again, so regions affected by
+            log splitting remain unavailable until the process completes.</para>
+          <procedure xml:id="log.splitting.step.by.step">
+            <title>Log Splitting, Step by Step</title>
+            <step>
+              <title>The <filename>/hbase/WALs/&lt;host>,&lt;port>,&lt;startcode></filename> directory is renamed.</title>
+              <para>Renaming the directory is important because a RegionServer may still be up and
+                accepting requests even if the HMaster thinks it is down. If the RegionServer does
+                not respond immediately and does not heartbeat its ZooKeeper session, the HMaster
+                may interpret this as a RegionServer failure. Renaming the logs directory ensures
+                that existing, valid WAL files which are still in use by an active but busy
+                RegionServer are not written to by accident.</para>
+              <para>The new directory is named according to the following pattern:</para>
+              <screen><![CDATA[/hbase/WALs/<host>,<port>,<startcode>-splitting]]></screen>
+              <para>An example of such a renamed directory might look like the following:</para>
+              <screen>/hbase/WALs/srv.example.com,60020,1254173957298-splitting</screen>
+            </step>
+            <step>
+              <title>Each log file is split, one at a time.</title>
+              <para>The log splitter reads the log file one edit entry at a time and puts each edit
+                entry into the buffer corresponding to the edit’s region. At the same time, the
+                splitter starts several writer threads. Writer threads pick up a corresponding
+                buffer and write the edit entries in the buffer to a temporary recovered edit
+                file. The temporary edit file is stored to disk with the following naming pattern:</para>
+              <screen><![CDATA[/hbase/<table_name>/<region_id>/recovered.edits/.temp]]></screen>
+              <para>This file is used to store all the edits in the WAL log for this region. After
+                log splitting completes, the <filename>.temp</filename> file is renamed to the
+                sequence ID of the first log written to the file.</para>
+              <para>To determine whether all edits have been written, the sequence ID is compared to
+                the sequence of the last edit that was written to the HFile. If the sequence of the
+                last edit is greater than or equal to the sequence ID included in the file name, it
+                is clear that all writes from the edit file have been completed.</para>
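+              <para>The check can be thought of as a simple comparison. The following toy sketch
+                illustrates the rule described above; the names are hypothetical and this is not
+                HBase-internal code.</para>
+              <programlisting language="java"><![CDATA[
+public final class RecoveredEditsCheck {
+
+  /**
+   * Returns true when the edits recorded in a recovered.edits file are already
+   * reflected in the store's HFiles, so replay of that file can be skipped.
+   */
+  static boolean alreadyPersisted(long lastEditSequenceInHFiles, long sequenceIdInFileName) {
+    return lastEditSequenceInHFiles >= sequenceIdInFileName;
+  }
+
+  public static void main(String[] args) {
+    System.out.println(alreadyPersisted(2050L, 2000L)); // true: replay can be skipped
+    System.out.println(alreadyPersisted(1500L, 2000L)); // false: edits must be replayed
+  }
+}
+]]></programlisting>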
+            </step>
+            <step>
+              <title>After log splitting is complete, each affected region is assigned to a
+                RegionServer.</title>
+              <para> When the region is opened, the <filename>recovered.edits</filename> folder is checked for recovered
+                edits files. If any such files are present, they are replayed by reading the edits
+                and saving them to the MemStore. After all edit files are replayed, the contents of
+                the MemStore are written to disk (HFile) and the edit files are deleted.</para>
+            </step>
+          </procedure>
+  
+          <section>
+            <title>Handling of Errors During Log Splitting</title>
+
+            <para>If you set the <varname>hbase.hlog.split.skip.errors</varname> option to
+                <constant>true</constant>, errors are treated as follows:</para>
+            <itemizedlist>
+              <listitem>
+                <para>Any error encountered during splitting will be logged.</para>
+              </listitem>
+              <listitem>
+                <para>The problematic WAL log will be moved into the <filename>.corrupt</filename>
+                  directory under the hbase <varname>rootdir</varname>.</para>
+              </listitem>
+              <listitem>
+                <para>Processing of the WAL will continue.</para>
+              </listitem>
+            </itemizedlist>
+            <para>If the <varname>hbase.hlog.split.skip.errors</varname> option is set to
+                <literal>false</literal>, the default, the exception will be propagated and the
+              split will be logged as failed. See <link
+                    xlink:href="https://issues.apache.org/jira/browse/HBASE-2958">HBASE-2958 When
+                    hbase.hlog.split.skip.errors is set to false, we fail the split but thats
+                    it</link>. We need to do more than just fail split if this flag is set.</para>
+            
+            <section>
+              <title>How EOFExceptions are treated when splitting a crashed RegionServer's
+                WALs</title>
+
+              <para>If an EOFException occurs while splitting logs, the split proceeds even when
+                  <varname>hbase.hlog.split.skip.errors</varname> is set to
+                <literal>false</literal>. An EOFException while reading the last log in the set of
+                files to split is likely, because the RegionServer is likely to be in the process of
+                writing a record at the time of a crash. For background, see <link
+                      xlink:href="https://issues.apache.org/jira/browse/HBASE-2643">HBASE-2643
+                      Figure how to deal with eof splitting logs</link>.</para>
+            </section>
+          </section>
+          
+          <section>
+            <title>Performance Improvements during Log Splitting</title>
+            <para>
+              WAL log splitting and recovery can be resource intensive and take a long time,
+              depending on the number of RegionServers involved in the crash and the size of the
+              regions. <xref linkend="distributed.log.splitting" /> and <xref
+                linkend="distributed.log.replay" /> were developed to improve
+              performance during log splitting.
+            </para>
+            <section xml:id="distributed.log.splitting">
+              <title>Distributed Log Splitting</title>
+              <para><firstterm>Distributed Log Splitting</firstterm> was added in HBase version 0.92
+                (<link xlink:href="https://issues.apache.org/jira/browse/HBASE-1364">HBASE-1364</link>) 
+                by Prakash Khemani from Facebook. It reduces the time to complete log splitting
+                dramatically, improving the availability of regions and tables. For
+                example, recovering a crashed cluster took around 9 hours with single-threaded log
+                splitting, but only about 6 minutes with distributed log splitting.</para>
+              <para>The information in this section is sourced from Jimmy Xiang's blog post at <link
+              xlink:href="http://blog.cloudera.com/blog/2012/07/hbase-log-splitting/" />.</para>
+              
+              <formalpara>
+                <title>Enabling or Disabling Distributed Log Splitting</title>
+                <para>Distributed log processing is enabled by default since HBase 0.92. The setting
+                  is controlled by the <property>hbase.master.distributed.log.splitting</property>
+                  property, which can be set to <literal>true</literal> or <literal>false</literal>,
+                  but defaults to <literal>true</literal>. </para>
+              </formalpara>
+              <procedure>
+                <title>Distributed Log Splitting, Step by Step</title>
+                <para>After configuring distributed log splitting, the HMaster controls the process.
+                  The HMaster enrolls each RegionServer in the log splitting process, and the actual
+                  work of splitting the logs is done by the RegionServers. The general process for
+                  log splitting, as described in <xref
+                    linkend="log.splitting.step.by.step" /> still applies here.</para>
+                <step>
+                  <para>If distributed log processing is enabled, the HMaster creates a
+                    <firstterm>split log manager</firstterm> instance when the cluster is started.
+                    The split log manager manages all log files which need
+                    to be scanned and split. The split log manager places all the logs into the
+                    ZooKeeper splitlog node (<filename>/hbase/splitlog</filename>) as tasks. You can
+                  view the contents of the splitlog by issuing the following
+                    <command>zkcli</command> command. Example output is shown.</para>
+                  <screen language="bourne">ls /hbase/splitlog
+[hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost8.sample.com%2C57020%2C1340474893275-splitting%2Fhost8.sample.com%253A57020.1340474893900, 
+hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost3.sample.com%2C57020%2C1340474893299-splitting%2Fhost3.sample.com%253A57020.1340474893931, 
+hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost4.sample.com%2C57020%2C1340474893287-splitting%2Fhost4.sample.com%253A57020.1340474893946]                  
+                  </screen>
+                  <para>The output contains URL-escaped file names. When decoded, they look much
+                    simpler:</para>
+                  <screen>
+[hdfs://host2.sample.com:56020/hbase/.logs
+/host8.sample.com,57020,1340474893275-splitting
+/host8.sample.com%3A57020.1340474893900, 
+hdfs://host2.sample.com:56020/hbase/.logs
+/host3.sample.com,57020,1340474893299-splitting
+/host3.sample.com%3A57020.1340474893931, 
+hdfs://host2.sample.com:56020/hbase/.logs
+/host4.sample.com,57020,1340474893287-splitting
+/host4.sample.com%3A57020.1340474893946]                    
+                  </screen>
+                  <para>The listing represents WAL file names to be scanned and split, which is a
+                    list of log splitting tasks.</para>
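+                  <para>The encoded entries are percent-encoded WAL paths, so they can be decoded
+                    with standard Java if you want to inspect them in a tool or script. A minimal
+                    sketch, using one of the sample entries above:</para>
+                  <programlisting language="java"><![CDATA[
+import java.net.URLDecoder;
+
+public class SplitLogTaskNameDecoder {
+  public static void main(String[] args) throws Exception {
+    String encoded =
+        "hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2F"
+        + "host8.sample.com%2C57020%2C1340474893275-splitting%2F"
+        + "host8.sample.com%253A57020.1340474893900";
+    // Each task node name is a percent-encoded WAL file path.
+    System.out.println(URLDecoder.decode(encoded, "UTF-8"));
+    // Prints: hdfs://host2.sample.com:56020/hbase/.logs/
+    //   host8.sample.com,57020,1340474893275-splitting/host8.sample.com%3A57020.1340474893900
+  }
+}
+]]></programlisting>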
+                </step>
+                <step>
+                  <title>The split log manager monitors the log-splitting tasks and workers.</title>
+                  <para>The split log manager is responsible for the following ongoing tasks:</para>
+                  <itemizedlist>
+                    <listitem>
+                      <para>Once the split log manager publishes all the tasks to the splitlog
+                        znode, it monitors these task nodes and waits for them to be
+                        processed.</para>
+                    </listitem>
+                    <listitem>
+                      <para>Checks to see if there are any dead split log
+                        workers queued up. If it finds tasks claimed by unresponsive workers, it
+                        will resubmit those tasks. If the resubmit fails due to some ZooKeeper
+                        exception, the dead worker is queued up again for retry.</para>
+                    </listitem>
+                    <listitem>
+                      <para>Checks to see if there are any unassigned
+                        tasks. If it finds any, it creates an ephemeral rescan node so that each
+                        split log worker is notified to re-scan unassigned tasks via the
+                          <code>nodeChildrenChanged</code> ZooKeeper event.</para>
+                    </listitem>
+                    <listitem>
+                      <para>Checks for tasks which are assigned but expired. If any are found, they
+                        are moved back to <code>TASK_UNASSIGNED</code> state again so that they can
+                        be retried. It is possible that these tasks are assigned to slow workers, or
+                        they may already be finished. This is not a problem, because log splitting
+                        tasks have the property of idempotence. In other words, the same log
+                        splitting task can be processed many times without causing any
+                        problem.</para>
+                    </listitem>
+                    <listitem>
+                      <para>The split log manager watches the HBase split log znodes constantly. If
+                        any split log task node data is changed, the split log manager retrieves the
+                        node data. The
+                        node data contains the current state of the task. You can use the
+                        <command>zkcli</command> <command>get</command> command to retrieve the
+                        current state of a task. In the example output below, the first line of the
+                        output shows that the task is currently unassigned.</para>
+                      <screen>
+<userinput>get /hbase/splitlog/hdfs%3A%2F%2Fhost2.sample.com%3A56020%2Fhbase%2F.logs%2Fhost6.sample.com%2C57020%2C1340474893287-splitting%2Fhost6.sample.com%253A57020.1340474893945
+</userinput> 
+<computeroutput>unassigned host2.sample.com:57000
+cZxid = 0x7115
+ctime = Sat Jun 23 11:13:40 PDT 2012
+...</computeroutput>  
+                      </screen>
+                      <para>Based on the state of the task whose data is changed, the split log
+                        manager does one of the following:</para>
+
+                      <itemizedlist>
+                        <listitem>
+                          <para>Resubmit the task if it is unassigned</para>
+                        </listitem>
+                        <listitem>
+                          <para>Heartbeat the task if it is assigned</para>
+                        </listitem>
+                        <listitem>
+                          <para>Resubmit or fail the task if it is resigned (see <xref
+                            linkend="distributed.log.replay.failure.reasons" />)</para>
+                        </listitem>
+                        <listitem>
+                          <para>Resubmit or fail the task if it is completed with errors (see <xref
+                            linkend="distributed.log.replay.failure.reasons" />)</para>
+                        </listitem>
+                        <listitem>
+                          <para>Resubmit or fail the task if it could not complete due to
+                            errors (see <xref
+                            linkend="distributed.log.replay.failure.reasons" />)</para>
+                        </listitem>
+                        <listitem>
+                          <para>Delete the task if it is successfully completed or failed</para>
+                        </listitem>
+                      </itemizedlist>
+                      <itemizedlist xml:id="distributed.log.replay.failure.reasons">
+                        <title>Reasons a Task Will Fail</title>
+                        <listitem><para>The task has been deleted.</para></listitem>
+                        <listitem><para>The node no longer exists.</para></listitem>
+                        <listitem><para>The log status manager failed to move the state of the task
+                          to TASK_UNASSIGNED.</para></listitem>
+                        <listitem><para>The number of resubmits is over the resubmit
+                          threshold.</para></listitem>
+                      </itemizedlist>
+                    </listitem>
+                  </itemizedlist>
+                </step>
+                <step>
+                  <title>Each RegionServer's split log worker performs the log-splitting tasks.</title>
+                  <para>Each RegionServer runs a daemon thread called the <firstterm>split log
+                      worker</firstterm>, which does the work to split the logs. The daemon thread
+                    starts when the RegionServer starts, and registers itself to watch HBase znodes.
+                    If any splitlog znode children change, it notifies a sleeping worker thread to
+                    wake up and grab more tasks. If a worker's current task’s node data is
+                    changed, the worker checks to see if the task has been taken by another worker.
+                    If so, the worker thread stops work on the current task.</para>
+                  <para>The worker monitors
+                    the splitlog znode constantly. When a new task appears, the split log worker
+                    retrieves  the task paths and checks each one until it finds an unclaimed task,
+                    which it attempts to claim. If the claim was successful, it attempts to perform
+                    the task and updates the task's <property>state</property> property based on the
+                    splitting outcome. At this point, the split log worker scans for another
+                    unclaimed task.</para>
+                  <itemizedlist>
+                    <title>How the Split Log Worker Approaches a Task</title>
+
+                    <listitem>
+                      <para>It queries the task state and only takes action if the task is in
+                          <literal>TASK_UNASSIGNED</literal> state.</para>
+                    </listitem>
+                    <listitem>
+                      <para>If the task is in <literal>TASK_UNASSIGNED</literal> state, the
+                        worker attempts to set the state to <literal>TASK_OWNED</literal> by itself.
+                        If it fails to set the state, another worker will try to grab it. The split
+                        log manager will also ask all workers to rescan later if the task remains
+                        unassigned.</para>
+                    </listitem>
+                    <listitem>
+                      <para>If the worker succeeds in taking ownership of the task, it tries to get
+                        the task state again to make sure it really gets it asynchronously. In the
+                        meantime, it starts a split task executor to do the actual work: </para>
+                      <itemizedlist>
+                        <listitem>
+                          <para>Get the HBase root folder, create a temp folder under the root, and
+                            split the log file to the temp folder.</para>
+                        </listitem>
+                        <listitem>
+                          <para>If the split was successful, the task executor sets the task to
+                            state <literal>TASK_DONE</literal>.</para>
+                        </listitem>
+                        <listitem>
+                          <para>If the worker catches an unexpected IOException, the task is set to
+                            state <literal>TASK_ERR</literal>.</para>
+                        </listitem>
+                        <listitem>
+                          <para>If the worker is shutting down, set the task to state
+                              <literal>TASK_RESIGNED</literal>.</para>
+                        </listitem>
+                        <listitem>
+                          <para>If the task is taken by another worker, just log it.</para>
+                        </listitem>
+                      </itemizedlist>
+                    </listitem>
+                  </itemizedlist>
+                </step>
+                <step>
+                  <title>The split log manager monitors for uncompleted tasks.</title>
+                  <para>The split log manager returns when all tasks are completed successfully. If
+                    all tasks are completed with some failures, the split log manager throws an
+                    exception so that the log splitting can be retried. Due to an asynchronous
+                    implementation, in very rare cases, the split log manager loses track of some
+                    completed tasks. For that reason, it periodically checks for remaining
+                    uncompleted task in its task map or ZooKeeper. If none are found, it throws an
+                    uncompleted tasks in its task map or ZooKeeper. If none are found, it throws an
+                    there waiting for something that won’t happen.</para>
+                </step>
+              </procedure>
+            </section>
+            <section xml:id="distributed.log.replay">
+              <title>Distributed Log Replay</title>
+              <para>After a RegionServer fails, its regions are assigned to another
+                RegionServer and are marked as "recovering" in ZooKeeper. A split log worker directly
+                replays edits from the WAL of the failed RegionServer to the regions at their new
+                location. When a region is in "recovering" state, it can accept writes but no reads
+                (including Append and Increment), region splits, or merges.</para>
+              <para>Distributed Log Replay extends the <xref linkend="distributed.log.splitting" /> framework. It works by
+                directly replaying WAL edits to another RegionServer instead of creating
+                  <filename>recovered.edits</filename> files. It provides the following advantages
+                over distributed log splitting alone:</para>
+              <itemizedlist>
+                <listitem><para>It eliminates the overhead of writing and reading a large number of
+                  <filename>recovered.edits</filename> files. It is not unusual for thousands of
+                  <filename>recovered.edits</filename> files to be created and written concurrently
+                  during a RegionServer recovery. Many small random writes can degrade overall
+                  system performance.</para></listitem>
+                <listitem><para>It allows writes even when a region is in recovering state. It only takes seconds for a recovering region to accept writes again. 
+</para></listitem>
+              </itemizedlist>
+              <formalpara>
+                <title>Enabling Distributed Log Replay</title>
+                <para>To enable distributed log replay, set <varname>hbase.master.distributed.log.replay</varname> to
+                  true. This will be the default for HBase 0.99 (<link
+                    xlink:href="https://issues.apache.org/jira/browse/HBASE-10888">HBASE-10888</link>).</para>
+              </formalpara>
+              <para>You must also enable HFile version 3 (which is the default HFile format starting
+                in HBase 0.99. See <link
+                  xlink:href="https://issues.apache.org/jira/browse/HBASE-10855">HBASE-1085

<TRUNCATED>

[8/8] hbase git commit: HBASE-12738 Chunk Ref Guide into file-per-chapter

Posted by mi...@apache.org.
HBASE-12738 Chunk Ref Guide into file-per-chapter


Project: http://git-wip-us.apache.org/repos/asf/hbase/repo
Commit: http://git-wip-us.apache.org/repos/asf/hbase/commit/a1fe1e09
Tree: http://git-wip-us.apache.org/repos/asf/hbase/tree/a1fe1e09
Diff: http://git-wip-us.apache.org/repos/asf/hbase/diff/a1fe1e09

Branch: refs/heads/master
Commit: a1fe1e09642355aa8165c11da3f759d621da1421
Parents: d9f25e3
Author: Misty Stanley-Jones <ms...@cloudera.com>
Authored: Mon Dec 22 15:26:59 2014 +1000
Committer: Misty Stanley-Jones <ms...@cloudera.com>
Committed: Mon Dec 22 15:46:49 2014 +1000

----------------------------------------------------------------------
 src/main/docbkx/architecture.xml      | 3489 ++++++++++++++++
 src/main/docbkx/asf.xml               |   44 +
 src/main/docbkx/book.xml              | 6021 +---------------------------
 src/main/docbkx/compression.xml       |  535 +++
 src/main/docbkx/configuration.xml     |    6 +-
 src/main/docbkx/customization-pdf.xsl |  129 +
 src/main/docbkx/datamodel.xml         |  865 ++++
 src/main/docbkx/faq.xml               |  270 ++
 src/main/docbkx/hbase-default.xml     |  538 +++
 src/main/docbkx/hbase_history.xml     |   41 +
 src/main/docbkx/hbck_in_depth.xml     |  237 ++
 src/main/docbkx/mapreduce.xml         |  630 +++
 src/main/docbkx/orca.xml              |   47 +
 src/main/docbkx/other_info.xml        |   83 +
 src/main/docbkx/performance.xml       |    2 +-
 src/main/docbkx/sql.xml               |   40 +
 src/main/docbkx/upgrading.xml         |    2 +-
 src/main/docbkx/ycsb.xml              |   36 +
 18 files changed, 7008 insertions(+), 6007 deletions(-)
----------------------------------------------------------------------