Posted to commits@hbase.apache.org by en...@apache.org on 2014/12/03 06:53:26 UTC
[3/9] hbase git commit: Blanket update of src/main/docbkx from master
http://git-wip-us.apache.org/repos/asf/hbase/blob/48d9d27d/src/main/docbkx/performance.xml
----------------------------------------------------------------------
diff --git a/src/main/docbkx/performance.xml b/src/main/docbkx/performance.xml
index 689b26f..1757d3f 100644
--- a/src/main/docbkx/performance.xml
+++ b/src/main/docbkx/performance.xml
@@ -182,6 +182,8 @@
save a bit of YGC churn and allocate in the old gen directly. </para>
<para>For more information about GC logs, see <xref
linkend="trouble.log.gc" />. </para>
+ <para>Consider also enabling the offheap Block Cache. This has been shown to mitigate
+ GC pause times. See <xref linkend="block.cache" />.</para>
</section>
</section>
</section>
@@ -627,7 +629,7 @@ hbase> <userinput>create 'mytable',{NAME => 'colfam1', BLOOMFILTER => 'ROWCOL'}<
<title>Constants</title>
<para>When people get started with HBase they have a tendency to write code that looks like
this:</para>
- <programlisting>
+ <programlisting language="java">
Get get = new Get(rowkey);
Result r = htable.get(get);
byte[] b = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr")); // returns current version of value
@@ -635,7 +637,7 @@ byte[] b = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr")); // returns c
<para>But especially when inside loops (and MapReduce jobs), converting the columnFamily and
column-names to byte-arrays repeatedly is surprisingly expensive. It's better to use
constants for the byte-arrays, like this:</para>
- <programlisting>
+ <programlisting language="java">
public static final byte[] CF = "cf".getBytes();
public static final byte[] ATTR = "attr".getBytes();
...
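The allocation cost is easy to demonstrate without any HBase classes at all: `String.getBytes()` returns a freshly allocated array on every call, which is exactly what the constant avoids. A minimal standalone sketch (the class name is hypothetical, not part of HBase):

```java
import java.util.Arrays;

public class ConstantBytesDemo {
    // Converted exactly once, at class-load time.
    public static final byte[] CF = "cf".getBytes();

    public static void main(String[] args) {
        // Each call to getBytes() allocates a brand-new array...
        byte[] first = "cf".getBytes();
        byte[] second = "cf".getBytes();
        System.out.println(first == second);          // false: two distinct allocations
        // ...even though the contents are identical.
        System.out.println(Arrays.equals(first, CF)); // true: same bytes
    }
}
```

Inside a tight loop or a MapReduce task, that per-call allocation is the "surprisingly expensive" part; the constant pays the conversion cost once.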
@@ -669,14 +671,14 @@ byte[] b = r.getValue(CF, ATTR); // returns current version of value
<para>There are two different approaches to pre-creating splits. The first approach is to rely
on the default <code>HBaseAdmin</code> strategy (which is implemented in
<code>Bytes.split</code>)... </para>
- <programlisting>
-byte[] startKey = ...; // your lowest keuy
+ <programlisting language="java">
+byte[] startKey = ...; // your lowest key
byte[] endKey = ...; // your highest key
int numberOfRegions = ...; // # of regions to create
admin.createTable(table, startKey, endKey, numberOfRegions);
</programlisting>
<para>And the other approach is to define the splits yourself... </para>
- <programlisting>
+ <programlisting language="java">
byte[][] splits = ...; // create your own splits
admin.createTable(table, splits);
</programlisting>
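For illustration only, the first approach's key-range interpolation can be sketched with `BigInteger` arithmetic. The `split` method below is a hypothetical stand-in for what `Bytes.split` does, not HBase's implementation; note that N regions need N-1 split keys:

```java
import java.math.BigInteger;
import java.util.Arrays;

public class SplitSketch {
    /** Hypothetical stand-in for Bytes.split: linear interpolation over [start, end]. */
    static byte[][] split(byte[] startKey, byte[] endKey, int numRegions) {
        BigInteger lo = new BigInteger(1, startKey);   // treat keys as unsigned integers
        BigInteger hi = new BigInteger(1, endKey);
        BigInteger range = hi.subtract(lo);
        byte[][] splits = new byte[numRegions - 1][];  // N regions => N-1 split points
        for (int i = 1; i < numRegions; i++) {
            BigInteger key = lo.add(range.multiply(BigInteger.valueOf(i))
                                         .divide(BigInteger.valueOf(numRegions)));
            splits[i - 1] = key.toByteArray();
        }
        return splits;
    }

    public static void main(String[] args) {
        // Splitting the range 0x00..0x10 into 4 regions yields keys 0x04, 0x08, 0x0c.
        for (byte[] s : split(new byte[] {0x00}, new byte[] {0x10}, 4)) {
            System.out.println(Arrays.toString(s));
        }
    }
}
```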
@@ -829,7 +831,7 @@ admin.createTable(table, splits);
<code>Scan.HINT_LOOKAHEAD</code> can be set on the Scan object. The following code
instructs the RegionServer to attempt two iterations of next before a seek is
scheduled:</para>
- <programlisting>
+ <programlisting language="java">
Scan scan = new Scan();
scan.addColumn(...);
scan.setAttribute(Scan.HINT_LOOKAHEAD, Bytes.toBytes(2));
@@ -854,7 +856,7 @@ table.getScanner(scan);
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/ResultScanner.html">ResultScanners</link>
you can cause problems on the RegionServers. Always have ResultScanner processing enclosed
in try/catch blocks...</para>
- <programlisting>
+ <programlisting language="java">
Scan scan = new Scan();
// set attrs...
ResultScanner rs = htable.getScanner(scan);
@@ -878,6 +880,8 @@ htable.close();
<methodname>setCacheBlocks</methodname> method. For input Scans to MapReduce jobs, this
should be <varname>false</varname>. For frequently accessed rows, it is advisable to use the
block cache.</para>
+
+ <para>Cache more data by moving your Block Cache offheap. See <xref linkend="offheap.blockcache" />.</para>
</section>
<section
xml:id="perf.hbase.client.rowkeyonly">
@@ -984,6 +988,58 @@ htable.close();
</section>
</section>
<!-- bloom -->
+ <section>
+ <title>Hedged Reads</title>
+ <para>Hedged reads are a feature of HDFS, introduced in <link
+ xlink:href="https://issues.apache.org/jira/browse/HDFS-5776">HDFS-5776</link>. Normally, a
+ single thread is spawned for each read request. However, if hedged reads are enabled, the
+ client waits some configurable amount of time, and if the read does not return, the client
+ spawns a second read request, against a different block replica of the same data. Whichever
+ read returns first is used, and the other read request is discarded. Hedged reads can be
+ helpful when a rare slow read is caused by a transient error such as a failing
+ disk or flaky network connection.</para>
+ <para>Because an HBase RegionServer is an HDFS client, you can enable hedged reads in HBase by
+ adding the following properties to the RegionServer's hbase-site.xml and tuning the values
+ to suit your environment.</para>
+ <itemizedlist>
+ <title>Configuration for Hedged Reads</title>
+ <listitem>
+ <para><code>dfs.client.hedged.read.threadpool.size</code> - the number of threads
+ dedicated to servicing hedged reads. If this is set to 0 (the default), hedged reads are
+ disabled.</para>
+ </listitem>
+ <listitem>
+ <para><code>dfs.client.hedged.read.threshold.millis</code> - the number of milliseconds to
+ wait before spawning a second read thread.</para>
+ </listitem>
+ </itemizedlist>
+ <example>
+ <title>Hedged Reads Configuration Example</title>
+ <screen><![CDATA[<property>
+ <name>dfs.client.hedged.read.threadpool.size</name>
+ <value>20</value> <!-- 20 threads -->
+</property>
+<property>
+ <name>dfs.client.hedged.read.threshold.millis</name>
+ <value>10</value> <!-- 10 milliseconds -->
+</property>]]></screen>
+ </example>
+ <para>Use the following metrics to tune the settings for hedged reads on
+ your cluster. See <xref linkend="hbase_metrics"/> for more information.</para>
+ <itemizedlist>
+ <title>Metrics for Hedged Reads</title>
+ <listitem>
+ <para>hedgedReadOps - the number of times hedged read threads have been triggered. This
+ could indicate that read requests are often slow, or that hedged reads are triggered too
+ quickly.</para>
+ </listitem>
+ <listitem>
+ <para>hedgedReadOpsWin - the number of times the hedged read thread was faster than the
+ original thread. This could indicate that a given RegionServer is having trouble
+ servicing requests.</para>
+ </listitem>
+ </itemizedlist>
+ </section>
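The hedged-request pattern the properties above configure can be sketched with plain `java.util.concurrent` primitives. This is an illustrative model of the behavior, not the HDFS client code; `hedged`, `thresholdMillis`, and the replica callables are all hypothetical names:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class HedgedReadSketch {
    /**
     * Start a primary read, wait up to thresholdMillis, and if it has not
     * completed, hedge with a second attempt against another replica.
     * Whichever attempt finishes first supplies the result.
     */
    static <T> T hedged(Callable<T> primary, Callable<T> backup,
                        long thresholdMillis, ExecutorService pool) throws Exception {
        CompletionService<T> cs = new ExecutorCompletionService<>(pool);
        cs.submit(primary);
        Future<T> done = cs.poll(thresholdMillis, TimeUnit.MILLISECONDS);
        if (done == null) {   // primary is slow: spawn the hedge
            cs.submit(backup);
            done = cs.take(); // first attempt to complete wins
        }
        return done.get();
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        // A slow "replica" (500 ms) versus a fast one: the hedge wins.
        String winner = hedged(
            () -> { Thread.sleep(500); return "slow-replica"; },
            () -> "fast-replica",
            10, pool);
        System.out.println(winner);
        pool.shutdownNow();
    }
}
```

In this simplified sketch the losing attempt is merely abandoned rather than cancelled; the real HDFS client also discards the slower read, which is why over-eager thresholds show up as wasted work in the hedgedReadOps metric.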
</section>
<!-- reading -->
@@ -1052,7 +1108,7 @@ htable.close();
shortcircuit reads configuration page</link> for how to enable the latter, better version
of shortcircuit. For example, here is a minimal config. enabling short-circuit reads added
to <filename>hbase-site.xml</filename>: </para>
- <programlisting><![CDATA[<property>
+ <programlisting language="xml"><![CDATA[<property>
<name>dfs.client.read.shortcircuit</name>
<value>true</value>
<description>
http://git-wip-us.apache.org/repos/asf/hbase/blob/48d9d27d/src/main/docbkx/preface.xml
----------------------------------------------------------------------
diff --git a/src/main/docbkx/preface.xml b/src/main/docbkx/preface.xml
index ff8efb9..a8f6895 100644
--- a/src/main/docbkx/preface.xml
+++ b/src/main/docbkx/preface.xml
@@ -39,15 +39,29 @@
xlink:href="http://wiki.apache.org/hadoop/Hbase">wiki</link> where the pertinent
information can be found.</para>
- <para>This reference guide is a work in progress. The source for this guide can be found at
- <filename>src/main/docbkx</filename> in a checkout of the hbase project. This reference
- guide is marked up using <link
- xlink:href="http://www.docbook.com/">DocBook</link> from which the the finished guide is
- generated as part of the 'site' build target. Run <programlisting>mvn site</programlisting>
- to generate this documentation. Amendments and improvements to the documentation are
- welcomed. Add a patch to an issue up in the HBase <link
- xlink:href="https://issues.apache.org/jira/browse/HBASE">JIRA</link>.</para>
-
+ <formalpara>
+ <title>About This Guide</title>
+ <para>This reference guide is a work in progress. The source for this guide can be found in
+ the <filename>src/main/docbkx</filename> directory of the HBase source. This reference
+ guide is marked up using <link xlink:href="http://www.docbook.org/">DocBook</link> from
+ which the finished guide is generated as part of the 'site' build target. Run
+ <programlisting language="bourne">mvn site</programlisting> to generate this documentation. Amendments and
+ improvements to the documentation are welcomed. Click <link
+ xlink:href="https://issues.apache.org/jira/secure/CreateIssueDetails!init.jspa?pid=12310753&issuetype=1&components=12312132&summary=SHORT+DESCRIPTION"
+ >this link</link> to file a new documentation bug against Apache HBase with some
+ values pre-selected.</para>
+ </formalpara>
+ <formalpara>
+ <title>Contributing to the Documentation</title>
+ <para>For an overview of Docbook and suggestions to get started contributing to the documentation, see <xref linkend="appendix_contributing_to_documentation" />.</para>
+ </formalpara>
+ <formalpara>
+ <title>Providing Feedback</title>
+ <para>This guide allows you to leave comments or questions on any page using Disqus. Look
+ for the Comments area at the bottom of the page. Answering these questions is a
+ volunteer effort and may be delayed.</para>
+ </formalpara>
+
<note
xml:id="headsup">
<title>Heads-up if this is your first foray into the world of distributed
http://git-wip-us.apache.org/repos/asf/hbase/blob/48d9d27d/src/main/docbkx/schema_design.xml
----------------------------------------------------------------------
diff --git a/src/main/docbkx/schema_design.xml b/src/main/docbkx/schema_design.xml
index 614dab7..65e64b0 100644
--- a/src/main/docbkx/schema_design.xml
+++ b/src/main/docbkx/schema_design.xml
@@ -44,7 +44,7 @@
xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html">HBaseAdmin</link>
in the Java API. </para>
<para>Tables must be disabled when making ColumnFamily modifications, for example:</para>
- <programlisting>
+ <programlisting language="java">
Configuration config = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
String table = "myTable";
@@ -280,7 +280,7 @@ d-foo0002
in those eight bytes. If you stored this number as a String -- presuming a byte per
character -- you need nearly 3x the bytes. </para>
<para>Not convinced? Below is some sample code that you can run on your own.</para>
- <programlisting>
+ <programlisting language="java">
// long
//
long l = 1234567890L;
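The comparison the sample starts can be reduced to a self-contained check (the class name is hypothetical; nothing here depends on HBase):

```java
public class SerializedSizeDemo {
    public static void main(String[] args) {
        long l = 1234567890L;

        // Stored as a long: always 8 bytes, regardless of the value.
        int asLong = Long.BYTES;

        // Stored as a String, one byte per ASCII digit: grows with the value.
        int asString = String.valueOf(l).getBytes().length;

        System.out.println("long: " + asLong + " bytes, String: " + asString + " bytes");
    }
}
```

For this ten-digit value the String form already costs 10 bytes versus 8; at the full range of a long (up to 19 digits plus a sign) the "nearly 3x" figure in the text applies.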
@@ -403,7 +403,7 @@ COLUMN CELL
are accessible in the keyspace. </para>
<para>To conclude this example, the following shows how appropriate splits can be
pre-created for hex keys: </para>
- <programlisting><![CDATA[public static boolean createTable(HBaseAdmin admin, HTableDescriptor table, byte[][] splits)
+ <programlisting language="java"><![CDATA[public static boolean createTable(HBaseAdmin admin, HTableDescriptor table, byte[][] splits)
throws IOException {
try {
admin.createTable( table, splits );
@@ -439,18 +439,15 @@ public static byte[][] getHexSplits(String startKey, String endKey, int numRegio
xml:id="schema.versions.max">
<title>Maximum Number of Versions</title>
<para>The maximum number of row versions to store is configured per column family via <link
- xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html"
- >HColumnDescriptor</link>. The default for max versions is 3 prior to HBase 0.96.x, and 1
- in newer versions. This is an important parameter because as described in <xref
- linkend="datamodel"/> section HBase does <emphasis>not</emphasis> overwrite row values,
+ xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html">HColumnDescriptor</link>.
+ The default for max versions is 1. This is an important parameter because, as described in the <xref
+ linkend="datamodel" /> section, HBase does <emphasis>not</emphasis> overwrite row values,
but rather stores different values per row by time (and qualifier). Excess versions are
removed during major compactions. The number of max versions may need to be increased or
decreased depending on application needs. </para>
<para>It is not recommended to set the number of max versions to an exceedingly high level
(e.g., hundreds or more) unless those old values are very dear to you, because this will
greatly increase StoreFile size. </para>
- <para>See <xref linkend="specify.number.of.versions"/> for examples for setting the maximum
- number of versions on a given column or globally.</para>
</section>
<section
xml:id="schema.minversions">
@@ -465,8 +462,6 @@ public static byte[][] getHexSplits(String startKey, String endKey, int numRegio
around</emphasis>" (where M is the value for minimum number of row versions, M<N). This
parameter should only be set when time-to-live is enabled for a column family and must be
less than the number of row versions. </para>
- <para>See <xref linkend="specify.number.of.versions"/> for examples for setting the minimum
- number of versions on a given column.</para>
</section>
</section>
<section
@@ -700,7 +695,7 @@ HColumnDescriptor.setKeepDeletedCells(true);
timestamps, by performing a mod operation on the timestamp. If time-oriented scans are
important, this could be a useful approach. Attention must be paid to the number of
buckets, because this will require the same number of scans to return results.</para>
- <programlisting>
+ <programlisting language="java">
long bucket = timestamp % numBuckets;
</programlisting>
<para>… to construct:</para>
@@ -1161,13 +1156,13 @@ long bucket = timestamp % numBuckets;
]]></programlisting>
<para>The other option we had was to do this entirely using:</para>
- <programlisting><![CDATA[
+ <programlisting language="xml"><![CDATA[
<FixedWidthUserName><FixedWidthPageNum0>:<FixedWidthLength><FixedIdNextPageNum><ValueId1><ValueId2><ValueId3>...
<FixedWidthUserName><FixedWidthPageNum1>:<FixedWidthLength><FixedIdNextPageNum><ValueId1><ValueId2><ValueId3>...
]]></programlisting>
<para> where each row would contain multiple values. So in one case reading the first thirty
values would be: </para>
- <programlisting><![CDATA[
+ <programlisting language="java"><![CDATA[
scan { STARTROW => 'FixedWidthUsername' LIMIT => 30}
]]></programlisting>
<para>And in the second case it would be </para>