Posted to commits@hbase.apache.org by jm...@apache.org on 2014/07/18 22:49:38 UTC
git commit: HBASE-11400 [docs] edit, consolidate, and update compression and data encoding docs (Misty Stanley-Jones)
Repository: hbase
Updated Branches:
refs/heads/master a030b17ba -> 209dd6dcf
HBASE-11400 [docs] edit, consolidate, and update compression and data encoding docs (Misty Stanley-Jones)
Project: http://git-wip-us.apache.org/repos/asf/hbase/repo
Commit: http://git-wip-us.apache.org/repos/asf/hbase/commit/209dd6dc
Tree: http://git-wip-us.apache.org/repos/asf/hbase/tree/209dd6dc
Diff: http://git-wip-us.apache.org/repos/asf/hbase/diff/209dd6dc
Branch: refs/heads/master
Commit: 209dd6dcfeb249060df091d651fc2d579aa729b5
Parents: a030b17
Author: Jonathan M Hsieh <jm...@apache.org>
Authored: Fri Jul 18 13:45:57 2014 -0700
Committer: Jonathan M Hsieh <jm...@apache.org>
Committed: Fri Jul 18 13:45:57 2014 -0700
----------------------------------------------------------------------
src/main/docbkx/book.xml | 627 +++++++++++++------
.../images/data_block_diff_encoding.png | Bin 0 -> 54479 bytes
.../resources/images/data_block_no_encoding.png | Bin 0 -> 46836 bytes
.../images/data_block_prefix_encoding.png | Bin 0 -> 35271 bytes
4 files changed, 424 insertions(+), 203 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/hbase/blob/209dd6dc/src/main/docbkx/book.xml
----------------------------------------------------------------------
diff --git a/src/main/docbkx/book.xml b/src/main/docbkx/book.xml
index 92c372e..4c06dc6 100644
--- a/src/main/docbkx/book.xml
+++ b/src/main/docbkx/book.xml
@@ -4387,230 +4387,451 @@ This option should not normally be used, and it is not in <code>-fixAll</code>.
</section>
</appendix>
- <appendix xml:id="compression">
+ <appendix
+ xml:id="compression">
- <title >Compression In HBase<indexterm><primary>Compression</primary></indexterm></title>
+ <title>Compression and Data Block Encoding In
+ HBase<indexterm><primary>Compression</primary><secondary>Data Block
+ Encoding</secondary><seealso>codecs</seealso></indexterm></title>
<note>
- <para>Codecs mentioned in this section are for encoding and decoding data blocks. For
- information about replication codecs, see <xref
+ <para>Codecs mentioned in this section are for encoding and decoding data blocks or row keys.
+ For information about replication codecs, see <xref
linkend="cluster.replication.preserving.tags" />.</para>
</note>
- <para>There are a bunch of compression options in HBase. Some codecs come with java --
- e.g. gzip -- and so require no additional installations. Others require native
- libraries. The native libraries may be available in your hadoop as is the case
- with lz4 and it is just a matter of making sure the hadoop native .so is available
- to HBase. You may have to do extra work to make the codec accessible; for example,
- if the codec has an apache-incompatible license that makes it so hadoop cannot bundle
- the library.</para>
- <para>Below we
- discuss what is necessary for the common codecs. Whatever codec you use, be sure
- to test it is installed properly and is available on all nodes that make up your cluster.
- Add any necessary operational step that will ensure checking the codec present when you
- happen to add new nodes to your cluster. The <xref linkend="compression.test" />
- discussed below can help check the codec is properly install.</para>
- <para>As to which codec to use, there is some helpful discussion
- to be found in <link xlink:href="http://search-hadoop.com/m/lL12B1PFVhp1">Documenting Guidance on compression and codecs</link>.
- </para>
+ <para>Some of the information in this section is pulled from a <link
+ xlink:href="http://search-hadoop.com/m/lL12B1PFVhp1/v=threaded">discussion</link> on the
+ HBase Development mailing list.</para>
+ <para>HBase supports several different compression algorithms which can be enabled on a
+ ColumnFamily. Data block encoding attempts to limit duplication of information in keys, taking
+ advantage of some of the fundamental designs and patterns of HBase, such as sorted row keys
+ and the schema of a given table. Compressors reduce the size of large, opaque byte arrays in
+ cells, and can significantly reduce the storage space that would otherwise be needed to
+ store that data uncompressed.</para>
+ <para>Compressors and data block encoding can be used together on the same ColumnFamily.</para>
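+ <para>For example, the following HBase Shell command is a minimal sketch (the table and
+ ColumnFamily names are hypothetical) of enabling a compressor and a data block encoder
+ together on the same ColumnFamily:</para>
+ <screen><![CDATA[
+hbase> alter 'mytable', {NAME => 'mycf', COMPRESSION => 'SNAPPY', DATA_BLOCK_ENCODING => 'FAST_DIFF'}
+ ]]></screen>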
+
+ <formalpara>
+ <title>Changes Take Effect Upon Compaction</title>
+ <para>If you change compression or encoding for a ColumnFamily, the changes take effect during
+ compaction.</para>
+ </formalpara>
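+ <para>If you want a changed codec to be applied to existing StoreFiles right away, you can
+ trigger a major compaction from HBase Shell (a sketch; 'test' is a hypothetical table
+ name):</para>
+ <screen>hbase> major_compact 'test'</screen>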
+
+ <para>Some codecs take advantage of capabilities built into Java, such as GZip compression.
+ Others rely on native libraries. Native libraries may be available as part of Hadoop, such as
+ LZ4. In this case, HBase only needs access to the appropriate shared library. Other codecs,
+ such as Google Snappy, need to be installed first. Some codecs are licensed in ways that
+ conflict with HBase's license and cannot be shipped as part of HBase.</para>
+
+ <para>This section discusses common codecs that are used and tested with HBase. No matter what
+ codec you use, be sure to test that it is installed correctly and is available on all nodes in
+ your cluster. Extra operational steps may be necessary to be sure that codecs are available on
+ newly-deployed nodes. You can use the <xref
+ linkend="compression.test" /> utility to check that a given codec is correctly
+ installed.</para>
+
+ <para>To configure HBase to use a compressor, see <xref
+ linkend="compressor.install" />. To enable a compressor for a ColumnFamily, see <xref
+ linkend="changing.compression" />. To enable data block encoding for a ColumnFamily, see
+ <xref linkend="data.block.encoding.enable" />.</para>
+ <itemizedlist>
+ <title>Block Compressors</title>
+ <listitem>
+ <para>none</para>
+ </listitem>
+ <listitem>
+ <para>Snappy</para>
+ </listitem>
+ <listitem>
+ <para>LZO</para>
+ </listitem>
+ <listitem>
+ <para>LZ4</para>
+ </listitem>
+ <listitem>
+ <para>GZ</para>
+ </listitem>
+ </itemizedlist>
- <section xml:id="compression.test">
- <title>CompressionTest Tool</title>
- <para>
- HBase includes a tool to test compression is set up properly.
- To run it, type <code>/bin/hbase org.apache.hadoop.hbase.util.CompressionTest</code>.
- This will emit usage on how to run the tool.
- </para>
- <note><title>You need to restart regionserver for it to pick up changes!</title>
- <para>Be aware that the regionserver caches the result of the compression check it runs
- ahead of each region open. This means that you will have to restart the regionserver
- for it to notice that you have fixed any codec issues; e.g. changed symlinks or
- moved lib locations under HBase.</para>
- </note>
- <note xml:id="hbase.native.platform"><title>On the location of native libraries</title>
- <para>Hadoop looks in <filename>lib/native</filename> for .so files. HBase looks in
- <filename>lib/native/PLATFORM</filename>. See the <command>bin/hbase</command>.
- View the file and look for <varname>native</varname>. See how we
- do the work to find out what platform we are running on running a little java program
- <classname>org.apache.hadoop.util.PlatformName</classname> to figure it out.
- We'll then add <filename>./lib/native/PLATFORM</filename> to the
- <varname>LD_LIBRARY_PATH</varname> environment for when the JVM starts.
- The JVM will look in here (as well as in any other dirs specified on LD_LIBRARY_PATH)
- for codec native libs. If you are unable to figure your 'platform', do:
- <programlisting>$ ./bin/hbase org.apache.hadoop.util.PlatformName</programlisting>.
- An example platform would be <varname>Linux-amd64-64</varname>.
- </para>
- </note>
- </section>
- <section xml:id="hbase.regionserver.codecs">
- <title>
- <varname>
- hbase.regionserver.codecs
- </varname>
- </title>
- <para>
- To have a RegionServer test a set of codecs and fail-to-start if any
- code is missing or misinstalled, add the configuration
- <varname>
- hbase.regionserver.codecs
- </varname>
- to your <filename>hbase-site.xml</filename> with a value of
- codecs to test on startup. For example if the
- <varname>
- hbase.regionserver.codecs
- </varname> value is <code>lzo,gz</code> and if lzo is not present
- or improperly installed, the misconfigured RegionServer will fail
- to start.
- </para>
- <para>
- Administrators might make use of this facility to guard against
- the case where a new server is added to cluster but the cluster
- requires install of a particular coded.
- </para>
- </section>
+ <itemizedlist>
+ <title>Data Block Encoding Types</title>
+ <listitem>
+ <para>Prefix - Often, keys are very similar. Specifically, keys often share a common prefix
+ and only differ near the end. For instance, one key might be
+ <literal>RowKey:Family:Qualifier0</literal> and the next key might be
+ <literal>RowKey:Family:Qualifier1</literal>. In Prefix encoding, an extra column is
+ added which holds the length of the prefix shared between the current key and the previous
+ key. Assuming the first key here is totally different from the key before, its prefix
+ length is 0. The second key's prefix length is <literal>23</literal>, since they have the
+ first 23 characters in common.</para>
+ <para>Obviously, if the keys tend to have nothing in common, Prefix will not provide much
+ benefit.</para>
+ <para>The following image shows a hypothetical ColumnFamily with no data block encoding.</para>
+ <figure>
+ <title>ColumnFamily with No Encoding</title>
+ <mediaobject>
+ <imageobject>
+ <imagedata fileref="data_block_no_encoding.png" width="800"/>
+ </imageobject>
+ <textobject><para>A ColumnFamily's keys and values, shown with no data block
+ encoding applied.</para>
+ </textobject>
+ </mediaobject>
+ </figure>
+ <para>Here is the same data with prefix data encoding.</para>
+ <figure>
+ <title>ColumnFamily with Prefix Encoding</title>
+ <mediaobject>
+ <imageobject>
+ <imagedata fileref="data_block_prefix_encoding.png" width="800"/>
+ </imageobject>
+ <textobject><para>The same data, with Prefix data block encoding: shared key
+ prefixes are replaced by a prefix length.</para>
+ </textobject>
+ </mediaobject>
+ </figure>
+ </listitem>
+ <listitem>
+ <para>Diff - Diff encoding expands upon Prefix encoding. Instead of considering the key
+ sequentially as a monolithic series of bytes, each key field is split so that each part of
+ the key can be compressed more efficiently. Two new fields are added: timestamp and type.
+ If the ColumnFamily is the same as the previous row, it is omitted from the current row.
+ If the key length, value length, or type is the same as the previous row, the field is
+ omitted. In addition, for increased compression, the timestamp is stored as a Diff from
+ the previous row's timestamp, rather than being stored in full. Given the two row keys in
+ the Prefix example, and given an exact match on timestamp and the same type, neither
+ the value length nor the type needs to be stored for the second row, and the timestamp
+ value for the second row is just 0, rather than a full timestamp.</para>
+ <para>Diff encoding is disabled by default because writing and scanning are slower,
+ although more data can be cached.</para>
+ <para>This image shows the same ColumnFamily from the previous images, with Diff encoding.</para>
+ <figure>
+ <title>ColumnFamily with Diff Encoding</title>
+ <mediaobject>
+ <imageobject>
+ <imagedata fileref="data_block_diff_encoding.png" width="800"/>
+ </imageobject>
+ <textobject><para>The same data, with Diff data block encoding: fields that match
+ the previous row are omitted, and the timestamp is stored as a delta.</para>
+ </textobject>
+ </mediaobject>
+ </figure>
+ </listitem>
+ <listitem>
+ <para>Fast Diff - Fast Diff works similarly to Diff, but uses a faster implementation. It also
+ adds another field which stores a single bit to track whether the data itself is the same
+ as the previous row. If it is, the data is not stored again. Fast Diff is the recommended
+ codec to use if you have long keys or many columns. The data format is nearly identical to
+ Diff encoding, so there is not an image to illustrate it.</para>
+ </listitem>
+ <listitem>
+ <para>Prefix Tree encoding was introduced as an experimental feature in HBase 0.96. It
+ provides similar memory savings to the Prefix, Diff, and Fast Diff encoders, but provides
+ faster random access at a cost of slower encoding speed. Prefix Tree may be appropriate
+ for applications that have high block cache hit ratios. It introduces new 'tree' fields
+ for the row and column. The row tree field contains a list of offsets/references
+ corresponding to the cells in that row. This allows for a good deal of compression. For
+ more details about Prefix Tree encoding, see <link
+ xlink:href="https://issues.apache.org/jira/browse/HBASE-4676">HBASE-4676</link>. It is
+ difficult to graphically illustrate a prefix tree, so no image is included. See the
+ Wikipedia article for <link
+ xlink:href="http://en.wikipedia.org/wiki/Trie">Trie</link> for more general information
+ about this data structure.</para>
+ </listitem>
+ </itemizedlist>
- <section xml:id="gzip.compression">
- <title>
- GZIP
- </title>
- <para>
- GZIP will generally compress better than LZO but it will run slower.
- For some setups, better compression may be preferred ('cold' data).
- Java will use java's GZIP unless the native Hadoop libs are
- available on the CLASSPATH; in this case it will use native
- compressors instead (If the native libs are NOT present,
- you will see lots of <emphasis>Got brand-new compressor</emphasis>
- reports in your logs; see <xref linkend="brand.new.compressor" />).
- </para>
+ <section>
+ <title>Which Compressor or Data Block Encoder To Use</title>
+ <para>The compression or codec type to use depends on the characteristics of your data.
+ Choosing the wrong type could cause your data to take more space rather than less, and can
+ have performance implications. In general, you need to weigh your options between smaller
+ size and faster compression/decompression. Following are some general guidelines, expanded from a discussion at <link xlink:href="http://search-hadoop.com/m/lL12B1PFVhp1">Documenting Guidance on compression and codecs</link>. </para>
+ <itemizedlist>
+ <listitem>
+ <para>If you have long keys (compared to the values) or many columns, use a prefix
+ encoder. FAST_DIFF is recommended, as more testing is needed for Prefix Tree
+ encoding.</para>
+ </listitem>
+ <listitem>
+ <para>If the values are large (and not precompressed, such as images), use a data block
+ compressor.</para>
+ </listitem>
+ <listitem>
+ <para>Use GZIP for <firstterm>cold data</firstterm>, which is accessed infrequently. GZIP
+ compression uses more CPU resources than Snappy or LZO, but provides a higher
+ compression ratio.</para>
+ </listitem>
+ <listitem>
+ <para>Use Snappy or LZO for <firstterm>hot data</firstterm>, which is accessed
+ frequently. Snappy and LZO use fewer CPU resources than GZIP, but do not provide as high
+ a compression ratio.</para>
+ </listitem>
+ <listitem>
+ <para>In most cases, enabling Snappy or LZO by default is a good choice, because they have
+ a low performance overhead and provide space savings.</para>
+ </listitem>
+ <listitem>
+ <para>Before Google made Snappy available in 2011, LZO was the default. Snappy has
+ similar qualities to LZO but has been shown to perform better.</para>
+ </listitem>
+ </itemizedlist>
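+ <para>As a quick sketch of the hot/cold guidance above (the table and ColumnFamily names
+ are hypothetical), you might compress a frequently-accessed family with Snappy and an
+ archival family with GZ within the same table:</para>
+ <screen><![CDATA[
+hbase> create 'mixed', {NAME => 'hot', COMPRESSION => 'SNAPPY'}, {NAME => 'cold', COMPRESSION => 'GZ'}
+ ]]></screen>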
</section>
- <section xml:id="lz4.compression">
- <title>
- LZ4
- </title>
- <para>
- LZ4 is bundled with Hadoop. Make sure the hadoop .so is
- accessible when you start HBase. One means of doing this is after figuring your
- platform, see <xref linkend="hbase.native.platform" />, make a symlink from HBase
- to the native Hadoop libraries presuming the two software installs are colocated.
- For example, if my 'platform' is Linux-amd64-64:
- <programlisting>$ cd $HBASE_HOME
+ <section>
+ <title>Compressor Configuration, Installation, and Use</title>
+ <section
+ xml:id="compressor.install">
+ <title>Configure HBase For Compressors</title>
+ <para>Before HBase can use a given compressor, its libraries need to be available. Due to
+ licensing issues, only GZ compression is available to HBase (via Java's built-in
+ libraries) in a default installation.</para>
+ <section>
+ <title>Compressor Support On the Master</title>
+ <para>HBase 0.95 introduced a configuration setting that causes the Master to check
+ which data block encoders are installed and configured on it, and to assume that the
+ entire cluster is configured the same way. This option,
+ <code>hbase.master.check.compression</code>, defaults to <literal>true</literal>. This
+ prevents the situation described in <link
+ xlink:href="https://issues.apache.org/jira/browse/HBASE-6370">HBASE-6370</link>, where
+ a table is created or modified to support a codec that a region server does not support,
+ leading to failures that take a long time to occur and are difficult to debug. </para>
+ <para>If <code>hbase.master.check.compression</code> is enabled, libraries for all desired
+ compressors need to be installed and configured on the Master, even if the Master does
+ not run a region server.</para>
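+ <para>If you need to disable this check, you can set the property in
+ <filename>hbase-site.xml</filename>. The following is a sketch; the default value is
+ <literal>true</literal>:</para>
+ <programlisting><![CDATA[
+<property>
+  <name>hbase.master.check.compression</name>
+  <value>false</value>
+</property>
+]]></programlisting>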
+ </section>
+ <section>
+ <title>Install GZ Support Via Native Libraries</title>
+ <para>HBase uses Java's built-in GZip support unless the native Hadoop libraries are
+ available on the CLASSPATH. The recommended way to add libraries to the CLASSPATH is to
+ set the environment variable <envar>HBASE_LIBRARY_PATH</envar> for the user running
+ HBase. If native libraries are not available and Java's GZIP is used, <literal>Got
+ brand-new compressor</literal> reports will be present in the logs. See <xref
+ linkend="brand.new.compressor" />).</para>
+ </section>
+ <section
+ xml:id="lzo.compression">
+ <title>Install LZO Support</title>
+ <para>HBase cannot ship with LZO because of incompatibility between HBase, which uses an
+ Apache Software License (ASL), and LZO, which uses a GPL license. See the <link
+ xlink:href="http://wiki.apache.org/hadoop/UsingLzoCompression">Using LZO
+ Compression</link> wiki page for information on configuring LZO support for HBase. </para>
+ <para>If you depend upon LZO compression, consider configuring your RegionServers to fail
+ to start if LZO is not available. See <xref
+ linkend="hbase.regionserver.codecs" />.</para>
+ </section>
+ <section
+ xml:id="lz4.compression">
+ <title>Configure LZ4 Support</title>
+ <para>LZ4 support is bundled with Hadoop. Make sure the Hadoop shared library
+ (libhadoop.so) is accessible when you start
+ HBase. After configuring your platform (see <xref
+ linkend="hbase.native.platform" />), you can make a symbolic link from HBase to the native Hadoop
+ libraries. This assumes the two software installs are colocated. For example, if your
+ 'platform' is Linux-amd64-64:
+ <programlisting>$ cd $HBASE_HOME
$ mkdir lib/native
$ ln -s $HADOOP_HOME/lib/native lib/native/Linux-amd64-64</programlisting>
- Use the compression tool to check lz4 installed on all nodes.
- Start up (or restart) hbase. From here on out you will be able to create
- and alter tables to enable LZ4 as a compression codec. E.g.:
- <programlisting>hbase(main):003:0> alter 'TestTable', {NAME => 'info', COMPRESSION => 'LZ4'}</programlisting>
- </para>
- </section>
-
- <section xml:id="lzo.compression">
- <title>
- LZO
- </title>
- <para>Unfortunately, HBase cannot ship with LZO because of
- the licensing issues; HBase is Apache-licensed, LZO is GPL.
- Therefore LZO install is to be done post-HBase install.
- See the <link xlink:href="http://wiki.apache.org/hadoop/UsingLzoCompression">Using LZO Compression</link>
- wiki page for how to make LZO work with HBase.
- </para>
- <para>A common problem users run into when using LZO is that while initial
- setup of the cluster runs smooth, a month goes by and some sysadmin goes to
- add a machine to the cluster only they'll have forgotten to do the LZO
- fixup on the new machine. In versions since HBase 0.90.0, we should
- fail in a way that makes it plain what the problem is, but maybe not. </para>
- <para>See <xref linkend="hbase.regionserver.codecs" />
- for a feature to help protect against failed LZO install.</para>
- </section>
+ Use the compression tool to check that LZ4 is installed on all nodes. Start up (or restart)
+ HBase. Afterward, you can create and alter tables to enable LZ4 as a
+ compression codec:
+ <screen>
+hbase(main):003:0> <userinput>alter 'TestTable', {NAME => 'info', COMPRESSION => 'LZ4'}</userinput>
+ </screen>
+ </para>
+ </section>
+ <section
+ xml:id="snappy.compression.installation">
+ <title>Install Snappy Support</title>
+ <para>HBase does not ship with Snappy support because of licensing issues. You can install
+ Snappy binaries (for instance, by using <command>yum install snappy</command> on CentOS)
+ or build Snappy from source. After installing Snappy, search for the shared library,
+ which will be called <filename>libsnappy.so.X</filename> where X is a number. If you
+ built from source, copy the shared library to a known location on your system, such as
+ <filename>/opt/snappy/lib/</filename>.</para>
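+ <para>A sketch of locating the Snappy shared library after installation (the exact path
+ varies by platform and version):</para>
+ <screen>$ find / -name "libsnappy.so*" 2>/dev/null</screen>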
+ <para>In addition to the Snappy library, HBase also needs access to the Hadoop shared
+ library, which will be called something like <filename>libhadoop.so.X.Y</filename>,
+ where X and Y are both numbers. Make note of the location of the Hadoop library, or copy
+ it to the same location as the Snappy library.</para>
+ <note>
+ <para>The Snappy and Hadoop libraries need to be available on each node of your cluster.
+ See <xref
+ linkend="compression.test" /> to find out how to test that this is the case.</para>
+ <para>See <xref
+ linkend="hbase.regionserver.codecs" /> to configure your RegionServers to fail to
+ start if a given compressor is not available.</para>
+ </note>
+ <para>Each of these library locations needs to be added to the environment variable
+ <envar>HBASE_LIBRARY_PATH</envar> for the operating system user that runs HBase. You
+ need to restart the RegionServer for the changes to take effect.</para>
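+ <para>For example (a sketch; substitute the directories that actually contain your
+ Snappy and Hadoop shared libraries), add the following to
+ <filename>hbase-env.sh</filename>:</para>
+ <programlisting>export HBASE_LIBRARY_PATH=/opt/snappy/lib:/pathtoyourhadoop/lib/native/Linux-amd64-64</programlisting>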
+ </section>
- <section xml:id="snappy.compression">
- <title>
- SNAPPY
- </title>
- <para>
- If snappy is installed, HBase can make use of it (courtesy of
- <link xlink:href="http://code.google.com/p/hadoop-snappy/">hadoop-snappy</link>
- <footnote><para>See <link xlink:href="http://search-hadoop.com/m/Ds8d51c263B1/%2522Hadoop-Snappy+in+synch+with+Hadoop+trunk%2522&subj=Hadoop+Snappy+in+synch+with+Hadoop+trunk">Alejandro's note</link> up on the list on difference between Snappy in Hadoop
- and Snappy in HBase</para></footnote>).
- <orderedlist>
- <listitem>
- <para>
- Build and install <link xlink:href="http://code.google.com/p/snappy/">snappy</link> on all nodes
- of your cluster (see below). HBase nor Hadoop cannot include snappy because of licensing issues (The
- hadoop libhadoop.so under its native dir does not include snappy; of note, the shipped .so
- may be for 32-bit architectures -- this fact has tripped up folks in the past with them thinking
- it 64-bit). The notes below are about installing snappy for HBase use. You may want snappy
- available in your hadoop context also. That is not covered here.
- HBase and Hadoop find the snappy .so in different locations currently: Hadoop picks those files in
- <filename>./lib</filename> while HBase finds the .so in <filename>./lib/[PLATFORM]</filename>.
- </para>
- </listitem>
- <listitem>
- <para>
- Use CompressionTest to verify snappy support is enabled and the libs can be loaded ON ALL NODES of your cluster:
- <programlisting>$ hbase org.apache.hadoop.hbase.util.CompressionTest hdfs://host/path/to/hbase snappy</programlisting>
- </para>
- </listitem>
- <listitem>
- <para>
- Create a column family with snappy compression and verify it in the hbase shell:
- <programlisting>$ hbase> create 't1', { NAME => 'cf1', COMPRESSION => 'SNAPPY' }
-hbase> describe 't1'</programlisting>
- In the output of the "describe" command, you need to ensure it lists "COMPRESSION => 'SNAPPY'"
- </para>
- </listitem>
+ <section
+ xml:id="compression.test">
+ <title>CompressionTest</title>
+ <para>You can use the CompressionTest tool to verify that your compressor is available to
+ HBase:</para>
+ <screen>
+ $ hbase org.apache.hadoop.hbase.util.CompressionTest hdfs://<replaceable>host/path/to/hbase</replaceable> snappy
+ </screen>
+ </section>
- </orderedlist>
- </para>
- <section xml:id="snappy.compression.installation">
- <title>
- Installation
- </title>
- <para>Snappy is used by hbase to compress HFiles on flush and when compacting.
- </para>
- <para>
- You will find the snappy library file under the .libs directory from your Snappy build (For example
- /home/hbase/snappy-1.0.5/.libs/). The file is called libsnappy.so.1.x.x where 1.x.x is the version of the snappy
- code you are building. You can either copy this file into your hbase lib directory -- under lib/native/PLATFORM --
- naming the file as libsnappy.so,
- or simply create a symbolic link to it (See ./bin/hbase for how it does library path for native libs).
- </para>
+ <section
+ xml:id="hbase.regionserver.codecs">
+ <title>Enforce Compression Settings On a RegionServer</title>
+ <para>You can configure a RegionServer so that it will fail to start if compression is
+ configured incorrectly, by adding the option <code>hbase.regionserver.codecs</code> to
+ <filename>hbase-site.xml</filename> and setting its value to a comma-separated list
+ of codecs that need to be available. For example, if you set this property to
+ <literal>lzo,gz</literal>, the RegionServer would fail to start if either of those
+ compressors were unavailable. This prevents a new server from being added to the
+ cluster without its codecs configured properly.</para>
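+ <para>A sketch of the corresponding <filename>hbase-site.xml</filename> entry:</para>
+ <programlisting><![CDATA[
+<property>
+  <name>hbase.regionserver.codecs</name>
+  <value>lzo,gz</value>
+</property>
+]]></programlisting>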
+ </section>
+ </section>
- <para>
- The second file you need is the hadoop native library. You will find this file in your hadoop installation directory
- under lib/native/Linux-amd64-64/ or lib/native/Linux-i386-32/. The file you are looking for is libhadoop.so.1.x.x.
- Again, you can simply copy this file or link to it from under hbase in lib/native/PLATFORM (e.g. Linux-amd64-64, etc.),
- using the name libhadoop.so.
- </para>
+ <section
+ xml:id="changing.compression">
+ <title>Enable Compression On a ColumnFamily</title>
+ <para>To enable compression for a ColumnFamily, use an <code>alter</code> command. You do
+ not need to re-create the table or copy data. If you are changing codecs, be sure the old
+ codec is still available until all the old StoreFiles have been compacted.</para>
+ <example>
+ <title>Enabling Compression on a ColumnFamily of an Existing Table using HBase
+ Shell</title>
+ <screen><![CDATA[
+hbase> disable 'test'
+hbase> alter 'test', {NAME => 'cf', COMPRESSION => 'GZ'}
+hbase> enable 'test']]>
+ </screen>
+ </example>
+ <example>
+ <title>Creating a New Table with Compression On a ColumnFamily</title>
+ <screen><![CDATA[
+hbase> create 'test2', { NAME => 'cf2', COMPRESSION => 'SNAPPY' }
+ ]]></screen>
+ </example>
+ <example>
+ <title>Verifying a ColumnFamily's Compression Settings</title>
+ <screen><![CDATA[
+hbase> describe 'test'
+DESCRIPTION ENABLED
+ 'test', {NAME => 'cf', DATA_BLOCK_ENCODING => 'NONE false
+ ', BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0',
+ VERSIONS => '1', COMPRESSION => 'GZ', MIN_VERSIONS
+ => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS => 'fa
+ lse', BLOCKSIZE => '65536', IN_MEMORY => 'false', B
+ LOCKCACHE => 'true'}
+1 row(s) in 0.1070 seconds
+ ]]></screen>
+ </example>
+ </section>
- <para>
- At the end of the installation, you should have both libsnappy.so and libhadoop.so links or files present into
- lib/native/Linux-amd64-64 or into lib/native/Linux-i386-32 (where the last part of the directory path is the
- PLATFORM you built and rare running the native lib on)
- </para>
- <para>To point hbase at snappy support, in hbase-env.sh set
- <programlisting>export HBASE_LIBRARY_PATH=/pathtoyourhadoop/lib/native/Linux-amd64-64</programlisting>
- In <filename>/pathtoyourhadoop/lib/native/Linux-amd64-64</filename> you should have something like:
- <programlisting>
- libsnappy.a
- libsnappy.so
- libsnappy.so.1
- libsnappy.so.1.1.2
- </programlisting>
- </para>
- </section>
+ <section>
+ <title>Testing Compression Performance</title>
+ <para>HBase includes a tool called LoadTestTool which provides mechanisms to test your
+ compression performance. You must specify one of <literal>-write</literal>,
+ <literal>-update</literal>, or <literal>-read</literal> as your first parameter, and if you
+ do not specify another parameter, usage advice is printed for each option.</para>
+ <example>
+ <title><command>LoadTestTool</command> Usage</title>
+ <screen><![CDATA[
+$ bin/hbase org.apache.hadoop.hbase.util.LoadTestTool -h
+usage: bin/hbase org.apache.hadoop.hbase.util.LoadTestTool <options>
+Options:
+ -batchupdate Whether to use batch as opposed to separate
+ updates for every column in a row
+ -bloom <arg> Bloom filter type, one of [NONE, ROW, ROWCOL]
+ -compression <arg> Compression type, one of [LZO, GZ, NONE, SNAPPY,
+ LZ4]
+ -data_block_encoding <arg> Encoding algorithm (e.g. prefix compression) to
+ use for data blocks in the test column family, one
+ of [NONE, PREFIX, DIFF, FAST_DIFF, PREFIX_TREE].
+ -encryption <arg> Enables transparent encryption on the test table,
+ one of [AES]
+ -generator <arg> The class which generates load for the tool. Any
+ args for this class can be passed as colon
+ separated after class name
+ -h,--help Show usage
+ -in_memory Tries to keep the HFiles of the CF inmemory as far
+ as possible. Not guaranteed that reads are always
+ served from inmemory
+ -init_only Initialize the test table only, don't do any
+ loading
+ -key_window <arg> The 'key window' to maintain between reads and
+ writes for concurrent write/read workload. The
+ default is 0.
+ -max_read_errors <arg> The maximum number of read errors to tolerate
+ before terminating all reader threads. The default
+ is 10.
+ -multiput Whether to use multi-puts as opposed to separate
+ puts for every column in a row
+ -num_keys <arg> The number of keys to read/write
+ -num_tables <arg> A positive integer number. When a number n is
+ speicfied, load test tool will load n table
+ parallely. -tn parameter value becomes table name
+ prefix. Each table name is in format
+ <tn>_1...<tn>_n
+ -read <arg> <verify_percent>[:<#threads=20>]
+ -regions_per_server <arg> A positive integer number. When a number n is
+ specified, load test tool will create the test
+ table with n regions per server
+ -skip_init Skip the initialization; assume test table already
+ exists
+ -start_key <arg> The first key to read/write (a 0-based index). The
+ default value is 0.
+ -tn <arg> The name of the table to read or write
+ -update <arg> <update_percent>[:<#threads=20>][:<#whether to
+ ignore nonce collisions=0>]
+ -write <arg> <avg_cols_per_key>:<avg_data_size>[:<#threads=20>]
+ -zk <arg> ZK quorum as comma-separated host names without
+ port numbers
+ -zk_root <arg> name of parent znode in zookeeper
+ ]]></screen>
+ </example>
+ <example>
+ <title>Example Usage of LoadTestTool</title>
+ <screen>
+$ hbase org.apache.hadoop.hbase.util.LoadTestTool -write 1:10:100 -num_keys 1000000 \
+  -read 100:30 -num_tables 1 -data_block_encoding NONE -tn load_test_tool_NONE
+ </screen>
+ </example>
+ </section>
</section>
- <section xml:id="changing.compression">
- <title>Changing Compression Schemes</title>
- <para>A frequent question on the dist-list is how to change compression schemes for ColumnFamilies. This is actually quite simple,
- and can be done via an alter command. Because the compression scheme is encoded at the block-level in StoreFiles, the table does
- <emphasis>not</emphasis> need to be re-created and the data does <emphasis>not</emphasis> copied somewhere else. Just make sure
- the old codec is still available until you are sure that all of the old StoreFiles have been compacted.
- </para>
+
+ <section xml:id="data.block.encoding.enable">
+ <title>Enable Data Block Encoding</title>
+ <para>Data block encoding codecs are built into HBase, so no extra configuration is needed.
+ Codecs are enabled on a table by setting the <code>DATA_BLOCK_ENCODING</code> property.
+ Disable the table before altering its DATA_BLOCK_ENCODING setting. The following examples
+ use HBase Shell:</para>
+ <example>
+ <title>Enable Data Block Encoding On a Table</title>
+ <screen><![CDATA[
+hbase> disable 'test'
+hbase> alter 'test', { NAME => 'cf', DATA_BLOCK_ENCODING => 'FAST_DIFF' }
+Updating all regions with the new schema...
+0/1 regions updated.
+1/1 regions updated.
+Done.
+0 row(s) in 2.2820 seconds
+hbase> enable 'test'
+0 row(s) in 0.1580 seconds
+ ]]></screen>
+ </example>
+ <example>
+ <title>Verifying a ColumnFamily's Data Block Encoding</title>
+ <screen><![CDATA[
+hbase> describe 'test'
+DESCRIPTION ENABLED
+ 'test', {NAME => 'cf', DATA_BLOCK_ENCODING => 'FAST true
+ _DIFF', BLOOMFILTER => 'ROW', REPLICATION_SCOPE =>
+ '0', VERSIONS => '1', COMPRESSION => 'GZ', MIN_VERS
+ IONS => '0', TTL => 'FOREVER', KEEP_DELETED_CELLS =
+ > 'false', BLOCKSIZE => '65536', IN_MEMORY => 'fals
+ e', BLOCKCACHE => 'true'}
+1 row(s) in 0.0650 seconds
+ ]]></screen>
+ </example>
</section>
</appendix>
+
<appendix>
<title xml:id="ycsb"><link xlink:href="https://github.com/brianfrankcooper/YCSB/">YCSB: The Yahoo! Cloud Serving Benchmark</link> and HBase</title>
<para>TODO: Describe how YCSB is poor for putting up a decent cluster load.</para>
http://git-wip-us.apache.org/repos/asf/hbase/blob/209dd6dc/src/main/site/resources/images/data_block_diff_encoding.png
----------------------------------------------------------------------
diff --git a/src/main/site/resources/images/data_block_diff_encoding.png b/src/main/site/resources/images/data_block_diff_encoding.png
new file mode 100644
index 0000000..0bd03a4
Binary files /dev/null and b/src/main/site/resources/images/data_block_diff_encoding.png differ
http://git-wip-us.apache.org/repos/asf/hbase/blob/209dd6dc/src/main/site/resources/images/data_block_no_encoding.png
----------------------------------------------------------------------
diff --git a/src/main/site/resources/images/data_block_no_encoding.png b/src/main/site/resources/images/data_block_no_encoding.png
new file mode 100644
index 0000000..56498b4
Binary files /dev/null and b/src/main/site/resources/images/data_block_no_encoding.png differ
http://git-wip-us.apache.org/repos/asf/hbase/blob/209dd6dc/src/main/site/resources/images/data_block_prefix_encoding.png
----------------------------------------------------------------------
diff --git a/src/main/site/resources/images/data_block_prefix_encoding.png b/src/main/site/resources/images/data_block_prefix_encoding.png
new file mode 100644
index 0000000..4271847
Binary files /dev/null and b/src/main/site/resources/images/data_block_prefix_encoding.png differ