Posted to commits@hbase.apache.org by dm...@apache.org on 2011/10/10 19:41:54 UTC
svn commit: r1181091 - in /hbase/trunk/src/docbkx: book.xml ops_mgt.xml

Author: dmeil
Date: Mon Oct 10 17:41:53 2011
New Revision: 1181091

URL: http://svn.apache.org/viewvc?rev=1181091&view=rev
Log:
HBASE-4566 book.xml,ops_mgt.xml - KeyValue documentation

Modified:
    hbase/trunk/src/docbkx/book.xml
    hbase/trunk/src/docbkx/ops_mgt.xml

Modified: hbase/trunk/src/docbkx/book.xml
URL: http://svn.apache.org/viewvc/hbase/trunk/src/docbkx/book.xml?rev=1181091&r1=1181090&r2=1181091&view=diff
==============================================================================
--- hbase/trunk/src/docbkx/book.xml (original)
+++ hbase/trunk/src/docbkx/book.xml Mon Oct 10 17:41:53 2011
@@ -312,7 +312,7 @@ public static class MyReducer extends Ta
       <para>A good general introduction on the strength and weaknesses modelling on
           the various non-rdbms datastores is Ian Varleys' Master thesis,
           <link xlink:href="http://ianvarley.com/UT/MR/Varley_MastersReport_Full_2009-08-07.pdf">No Relation: The Mixed Blessings of Non-Relational Databases</link>.
-          Recommended.
+          Recommended.  Also, read <xref linkend="keyvalue"/> for how HBase stores data internally.
       </para>
   <section xml:id="schema.creation">
   <title>
@@ -400,7 +400,7 @@ admin.enableTable(table);               
        </para>
        <para>Most of the time small inefficiencies don't matter all that much.  Unfortunately,
          this is a case where they do.  Whatever patterns are selected for ColumnFamilies, attributes, and rowkeys they could be repeated
-       several billion times in your data</para>
+       several billion times in your data.  See <xref linkend="keyvalue"/> for more information on how HBase stores data internally.</para>
        <section xml:id="keysize.cf"><title>Column Families</title>
          <para>Try to keep the ColumnFamily names as small as possible, preferably one character (e.g. "d" for data/default).
          </para> 
@@ -1615,6 +1615,8 @@ scan.setFilter(filter);
               Schubert Zhang's blog post on <link xlink:ref="http://cloudepr.blogspot.com/2009/09/hfile-block-indexed-file-format-to.html">HFile: A Block-Indexed File Format to Store Sorted Key-Value Pairs</link> makes for a thorough introduction to HBase's hfile.  Matteo Bertozzi has also put up a
               helpful description, <link xlink:href="http://th30z.blogspot.com/2011/02/hbase-io-hfile.html?spref=tw">HBase I/O: HFile</link>.
           </para>
+          <para>For more information, see the <link xlink:href="http://hbase.apache.org/xref/org/apache/hadoop/hbase/io/hfile/HFile.html">HFile source code</link>.
+          </para>
       </section>
 
       <section xml:id="hfile_tool">
@@ -1631,6 +1633,40 @@ scan.setFilter(filter);
         tool.</para>
       </section>
       </section>
+      <section xml:id="hfile.blocks">
+        <title>Blocks</title>
+        <para>StoreFiles are composed of blocks.  The blocksize is configured on a per-ColumnFamily basis.
+        </para>
+        <para>For more information, see the <link xlink:href="http://hbase.apache.org/xref/org/apache/hadoop/hbase/io/hfile/HFileBlock.html">HFileBlock source code</link>.
+        </para>
+      </section>
+      <section xml:id="keyvalue">
+        <title>KeyValue</title>
+        <para>The KeyValue class is the heart of data storage in HBase.  KeyValue wraps a byte array and takes an offset and a length into the passed array
+         that mark where to start interpreting the content as a KeyValue.
+        </para>
+        <para>The KeyValue format inside a byte array is:
+           <itemizedlist>
+             <listitem>keylength</listitem>
+             <listitem>valuelength</listitem>
+             <listitem>key</listitem>
+             <listitem>value</listitem>
+           </itemizedlist>
+        </para>
+        <para>The Key is further decomposed as:
+           <itemizedlist>
+             <listitem>rowlength</listitem>
+             <listitem>row (i.e., the rowkey)</listitem>
+             <listitem>columnfamilylength</listitem>
+             <listitem>columnfamily</listitem>
+             <listitem>columnqualifier</listitem>
+             <listitem>timestamp</listitem>
+             <listitem>keytype (e.g., Put, Delete)</listitem>
+           </itemizedlist>
+        </para>
+        <para>For more information, see the <link xlink:href="http://hbase.apache.org/xref/org/apache/hadoop/hbase/KeyValue.html">KeyValue source code</link>.
+        </para>
+      </section>
       <section xml:id="compaction">
         <title>Compaction</title>
         <para>There are two types of compactions:  minor and major.  Minor compactions will usually pick up a couple of the smaller adjacent

Modified: hbase/trunk/src/docbkx/ops_mgt.xml
URL: http://svn.apache.org/viewvc/hbase/trunk/src/docbkx/ops_mgt.xml?rev=1181091&r1=1181090&r2=1181091&view=diff
==============================================================================
--- hbase/trunk/src/docbkx/ops_mgt.xml (original)
+++ hbase/trunk/src/docbkx/ops_mgt.xml Mon Oct 10 17:41:53 2011
@@ -301,6 +301,32 @@ false
       <para>Since the cluster is up, there is a risk that edits could be missed in the export process.
       </para>
     </section>
+  </section>  <!--  backup -->
+  <section xml:id="ops.capacity"><title>Capacity Planning</title>
+    <section xml:id="ops.capacity.storage"><title>Storage</title>
+      <para>A common question for HBase administrators is estimating how much storage will be required for an HBase cluster.
+      There are several aspects to consider, the most important of which is what data will be loaded into the cluster.  Start
+      with a solid understanding of how HBase handles data internally (KeyValue).
+      </para>
+      <section xml:id="ops.capacity.storage.kv"><title>KeyValue</title>
+        <para>HBase storage will be dominated by KeyValues.  See <xref linkend="keyvalue" /> and <xref linkend="keysize" /> for 
+        how HBase stores data internally.  
+        </para>
+        <para>It is critical to understand that there is a KeyValue instance for every attribute stored in a row, and the 
+        rowkey-length, ColumnFamily name-length and attribute lengths will drive the size of the database more than any other
+        factor.
+        </para>
+      </section>
+      <section xml:id="ops.capacity.storage.sf"><title>StoreFiles and Blocks</title>
+        <para>KeyValue instances are aggregated into blocks, and the blocksize is configurable on a per-ColumnFamily basis.
+        Blocks are aggregated into StoreFiles.  See <xref linkend="regions.arch" />.
+        </para>
+      </section>
+      <section xml:id="ops.capacity.storage.hdfs"><title>HDFS Block Replication</title>
+        <para>Because HBase runs on top of HDFS, factor HDFS block replication into storage calculations.
+        </para>
+      </section>
+    </section>
   </section>
 
 </chapter>
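
Illustration (not part of the commit): the KeyValue layout documented above can be sketched in plain Java to show why rowkey, ColumnFamily, and attribute lengths dominate storage. The field widths used here (4-byte key and value lengths, 2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte key type) are not stated in the text above and should be checked against the KeyValue source. The class and method names, the key type code, the sample cell, and the replication factor of 3 are all hypothetical choices made for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Sketch of the documented KeyValue byte layout; not the HBase class itself. */
public class KeyValueLayoutSketch {
    // Assumed field widths in bytes (verify against the KeyValue source).
    static final int KEY_LENGTH_SIZE = 4;
    static final int VALUE_LENGTH_SIZE = 4;
    static final int ROW_LENGTH_SIZE = 2;
    static final int FAMILY_LENGTH_SIZE = 1;
    static final int TIMESTAMP_SIZE = 8;
    static final int KEY_TYPE_SIZE = 1;

    /** Size of the key: rowlength, row, columnfamilylength, columnfamily, columnqualifier, timestamp, keytype. */
    public static int keyLength(int rowLen, int familyLen, int qualifierLen) {
        return ROW_LENGTH_SIZE + rowLen
             + FAMILY_LENGTH_SIZE + familyLen
             + qualifierLen
             + TIMESTAMP_SIZE + KEY_TYPE_SIZE;
    }

    /** Size of the whole cell: keylength, valuelength, key, value. */
    public static int encodedLength(int rowLen, int familyLen, int qualifierLen, int valueLen) {
        return KEY_LENGTH_SIZE + VALUE_LENGTH_SIZE
             + keyLength(rowLen, familyLen, qualifierLen) + valueLen;
    }

    /** Writes one cell in the order listed in the KeyValue section. */
    public static byte[] encode(byte[] row, byte[] family, byte[] qualifier,
                                long timestamp, byte keyType, byte[] value) {
        int keyLen = keyLength(row.length, family.length, qualifier.length);
        ByteBuffer buf = ByteBuffer.allocate(
            encodedLength(row.length, family.length, qualifier.length, value.length));
        buf.putInt(keyLen);                // keylength
        buf.putInt(value.length);          // valuelength
        buf.putShort((short) row.length);  // rowlength
        buf.put(row);                      // row (the rowkey)
        buf.put((byte) family.length);     // columnfamilylength
        buf.put(family);                   // columnfamily
        buf.put(qualifier);                // columnqualifier
        buf.putLong(timestamp);            // timestamp
        buf.put(keyType);                  // keytype
        buf.put(value);                    // value
        return buf.array();
    }

    /** Rough raw footprint: ignores compression, block indexes, and other file overhead. */
    public static long rawBytesOnHdfs(long cells, int bytesPerCell, int replication) {
        return cells * bytesPerCell * replication;
    }

    public static void main(String[] args) {
        byte[] row = "row-0001".getBytes(StandardCharsets.UTF_8);
        byte[] family = "d".getBytes(StandardCharsets.UTF_8);
        byte[] qualifier = "name".getBytes(StandardCharsets.UTF_8);
        byte[] value = "alice".getBytes(StandardCharsets.UTF_8);
        byte putType = 4; // placeholder type code, assumed for illustration

        byte[] cell = encode(row, family, qualifier, 1318268513000L, putType, value);
        System.out.println("bytes per cell: " + cell.length);
        System.out.println("value bytes:    " + value.length);
        System.out.println("raw bytes for 1e9 such cells at replication 3: "
            + rawBytesOnHdfs(1_000_000_000L, cell.length, 3));
    }
}
```

Under these assumed widths a 5-byte value costs 38 bytes per cell, which is why short ColumnFamily names and compact rowkeys matter when the same pattern repeats billions of times.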