Posted to commits@hbase.apache.org by jm...@apache.org on 2014/08/13 23:58:45 UTC

git commit: HBASE-11476 Expand 'Conceptual View' section of Data Model chapter (Misty Stanley-Jones)

Repository: hbase
Updated Branches:
  refs/heads/master 2153a92fa -> 92c3b877c


HBASE-11476 Expand 'Conceptual View' section of Data Model chapter (Misty Stanley-Jones)


Project: http://git-wip-us.apache.org/repos/asf/hbase/repo
Commit: http://git-wip-us.apache.org/repos/asf/hbase/commit/92c3b877
Tree: http://git-wip-us.apache.org/repos/asf/hbase/tree/92c3b877
Diff: http://git-wip-us.apache.org/repos/asf/hbase/diff/92c3b877

Branch: refs/heads/master
Commit: 92c3b877c0a2f1ca0fa6c791e41fbcb889f220ad
Parents: 2153a92
Author: Jonathan M Hsieh <jm...@apache.org>
Authored: Wed Aug 13 14:57:16 2014 -0700
Committer: Jonathan M Hsieh <jm...@apache.org>
Committed: Wed Aug 13 14:57:16 2014 -0700

----------------------------------------------------------------------
 src/main/docbkx/book.xml | 342 +++++++++++++++++++++++++++++-------------
 1 file changed, 240 insertions(+), 102 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/hbase/blob/92c3b877/src/main/docbkx/book.xml
----------------------------------------------------------------------
diff --git a/src/main/docbkx/book.xml b/src/main/docbkx/book.xml
index d37537f..603839c 100644
--- a/src/main/docbkx/book.xml
+++ b/src/main/docbkx/book.xml
@@ -91,38 +91,129 @@
   <chapter
     xml:id="datamodel">
     <title>Data Model</title>
-    <para>In short, applications store data into an HBase table. Tables are made of rows and
-      columns. All columns in HBase belong to a particular column family. Table cells -- the
-      intersection of row and column coordinates -- are versioned. A cell’s content is an
-      uninterpreted array of bytes. </para>
-    <para>Table row keys are also byte arrays so almost anything can serve as a row key from strings
-      to binary representations of longs or even serialized data structures. Rows in HBase tables
-      are sorted by row key. The sort is byte-ordered. All table accesses are via the table row key
-      -- its primary key. </para>
+    <para>In HBase, data is stored in tables, which have rows and columns. This terminology
+      overlaps with that of relational databases (RDBMSs), but the analogy is not a helpful one.
+    Instead, it can be helpful to think of an HBase table as a multi-dimensional map.</para>
+    <variablelist>
+      <title>HBase Data Model Terminology</title>
+      <varlistentry>
+        <term>Table</term>
+        <listitem>
+          <para>An HBase table consists of multiple rows.</para>
+        </listitem>
+      </varlistentry>
+      <varlistentry>
+        <term>Row</term>
+        <listitem>
+          <para>A row in HBase consists of a row key and one or more columns with values associated
+            with them. Rows are sorted lexicographically by row key, in byte order, as they are stored. For this
+            reason, the design of the row key is very important. The goal is to store data in such a
+            way that related rows are near each other. A common row key pattern is a website domain.
+            If your row keys are domains, you should probably store them in reverse (org.apache.www,
+            org.apache.mail, org.apache.jira). This way, all of the Apache domains are near each
+            other in the table, rather than being spread out based on the first letter of the
+            subdomain.</para>
+        </listitem>
+      </varlistentry>
+      <varlistentry>
+        <term>Column</term>
+        <listitem>
+          <para>A column in HBase consists of a column family and a column qualifier, which are
+            delimited by a <literal>:</literal> (colon) character.</para>
+        </listitem>
+      </varlistentry>
+      <varlistentry>
+        <term>Column Family</term>
+        <listitem>
+          <para>Column families physically colocate a set of columns and their values, often for
+            performance reasons. Each column family has a set of storage properties, such as whether
+            its values should be cached in memory, how its data is compressed or its row keys are
+            encoded, and others. Each row in a table has the same column
+            families, though a given row might not store anything in a given column family.</para>
+          <para>Column families are specified when you create your table, and influence the way your
+            data is stored in the underlying filesystem. Therefore, the column families should be
+            considered carefully during schema design.</para>
+        </listitem>
+      </varlistentry>
+      <varlistentry>
+        <term>Column Qualifier</term>
+        <listitem>
+          <para>A column qualifier is added to a column family to provide the index for a given
+            piece of data. Given a column family <literal>content</literal>, the qualifiers
+            <literal>html</literal> and <literal>pdf</literal> give the columns
+            <literal>content:html</literal> and <literal>content:pdf</literal>. Column families are fixed
+            at table creation, but column qualifiers are mutable and may differ greatly between rows.</para>
+        </listitem>
+      </varlistentry>
+      <varlistentry>
+        <term>Cell</term>
+        <listitem>
+          <para>A cell is a combination of row, column family, and column qualifier, and contains a
+            value and a timestamp, which represents the value's version.</para>
+          <para>A cell's value is an uninterpreted array of bytes.</para>
+        </listitem>
+      </varlistentry>
+      <varlistentry>
+        <term>Timestamp</term>
+        <listitem>
+          <para>A timestamp is written alongside each value, and is the identifier for a given
+            version of a value. By default, the timestamp represents the time on the RegionServer
+            when the data was written, but you can specify a different timestamp value when you put
+            data into the cell.</para>
+          <caution>
+            <para>Direct manipulation of timestamps is an advanced feature which is only exposed for
+              special cases that are deeply integrated with HBase, and is discouraged in general.
+              Encoding a timestamp at the application level is the preferred pattern.</para>
+          </caution>
+          <para>You can specify the maximum number of versions of a value that HBase retains, per column
+            family. When the maximum number of versions is reached, the oldest versions are 
+            eventually deleted. By default, only the newest version is kept.</para>
+        </listitem>
+      </varlistentry>
+    </variablelist>
 
     <section
       xml:id="conceptual.view">
       <title>Conceptual View</title>
+      <para>For an accessible explanation of the HBase data model, see the blog post <link
+          xlink:href="http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable">Understanding
+          HBase and BigTable</link> by Jim R. Wilson. Another good explanation is available in the
+        PDF <link
+          xlink:href="http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/9353-login1210_khurana.pdf">Introduction
+          to Basic Schema Design</link> by Amandeep Khurana. It may help to read different
+        perspectives to get a solid understanding of HBase schema design. The linked articles cover
+        the same ground as the information in this section.</para>
       <para> The following example is a slightly modified form of the one on page 2 of the <link
           xlink:href="http://research.google.com/archive/bigtable.html">BigTable</link> paper. There
-        is a table called <varname>webtable</varname> that contains two column families named
-          <varname>contents</varname> and <varname>anchor</varname>. In this example,
+        is a table called <varname>webtable</varname> that contains two rows
+        (<literal>com.cnn.www</literal>
+          and <literal>com.example.www</literal>) and three column families named
+          <varname>contents</varname>, <varname>anchor</varname>, and <varname>people</varname>. In
+          this example, for the first row (<literal>com.cnn.www</literal>),
           <varname>anchor</varname> contains two columns (<varname>anchor:cnnsi.com</varname>,
           <varname>anchor:my.look.ca</varname>) and <varname>contents</varname> contains one column
-          (<varname>contents:html</varname>). <note>
+          (<varname>contents:html</varname>). This example contains 5 versions of the row with the
+        row key <literal>com.cnn.www</literal>, and one version of the row with the row key
+        <literal>com.example.www</literal>. The <varname>contents:html</varname> column qualifier contains the entire
+        HTML of a given website. Qualifiers of the <varname>anchor</varname> column family each
+        contain the external site which links to the site represented by the row, along with the
+        text it used in the anchor of its link. The <varname>people</varname> column family represents
+        people associated with the site.
+      </para>
+        <note>
           <title>Column Names</title>
-          <para> By convention, a column name is made of its column family prefix and a
-              <emphasis>qualifier</emphasis>. For example, the column
-              <emphasis>contents:html</emphasis> is made up of the column family
-              <varname>contents</varname> and <varname>html</varname> qualifier. The colon character
-              (<literal>:</literal>) delimits the column family from the column family
-              <emphasis>qualifier</emphasis>. </para>
+        <para> By convention, a column name is made of its column family prefix and a
+            <emphasis>qualifier</emphasis>. For example, the column
+            <emphasis>contents:html</emphasis> is made up of the column family
+            <varname>contents</varname> and the <varname>html</varname> qualifier. The colon
+          character (<literal>:</literal>) delimits the column family from the column family
+            <emphasis>qualifier</emphasis>. </para>
         </note>
         <table
           frame="all">
           <title>Table <varname>webtable</varname></title>
           <tgroup
-            cols="4"
+            cols="5"
             align="left"
             colsep="1"
             rowsep="1">
@@ -134,12 +225,15 @@
               colname="c3" />
             <colspec
               colname="c4" />
+            <colspec
+              colname="c5" />
             <thead>
               <row>
                 <entry>Row Key</entry>
                 <entry>Time Stamp</entry>
                 <entry>ColumnFamily <varname>contents</varname></entry>
                 <entry>ColumnFamily <varname>anchor</varname></entry>
+                <entry>ColumnFamily <varname>people</varname></entry>
               </row>
             </thead>
             <tbody>
@@ -148,128 +242,172 @@
                 <entry>t9</entry>
                 <entry />
                 <entry><varname>anchor:cnnsi.com</varname> = "CNN"</entry>
+                <entry />
               </row>
               <row>
                 <entry>"com.cnn.www"</entry>
                 <entry>t8</entry>
                 <entry />
                 <entry><varname>anchor:my.look.ca</varname> = "CNN.com"</entry>
+                <entry />
               </row>
               <row>
                 <entry>"com.cnn.www"</entry>
                 <entry>t6</entry>
                 <entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry>
                 <entry />
+                <entry />
               </row>
               <row>
                 <entry>"com.cnn.www"</entry>
                 <entry>t5</entry>
                 <entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry>
                 <entry />
+                <entry />
               </row>
               <row>
                 <entry>"com.cnn.www"</entry>
                 <entry>t3</entry>
                 <entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry>
                 <entry />
+                <entry />
+              </row>
+              <row>
+                <entry>"com.example.www"</entry>
+                <entry>t5</entry>
+                <entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry>
+                <entry></entry>
+                <entry><varname>people:author</varname> = "John Doe"</entry>
               </row>
             </tbody>
           </tgroup>
         </table>
-      </para>
+      <para>Cells in this table that appear to be empty do not take space, or in fact exist, in
+        HBase. This is what makes HBase "sparse." A tabular view is not the only possible way to
+        look at data in HBase, or even the most accurate. The following represents the same
+        information as a multi-dimensional map. This is only a mock-up for illustrative
+        purposes and may not be strictly accurate.</para>
+      <programlisting><![CDATA[
+{
+	"com.cnn.www": {
+		contents: {
+			t6: contents:html: "<html>..."
+			t5: contents:html: "<html>..."
+			t3: contents:html: "<html>..."
+		}
+		anchor: {
+			t9: anchor:cnnsi.com: "CNN"
+			t8: anchor:my.look.ca: "CNN.com"
+		}
+		people: {}
+	}
+	"com.example.www": {
+		contents: {
+			t5: contents:html: "<html>..."
+		}
+		anchor: {}
+		people: {
+			t5: people:author: "John Doe"
+		}
+	}
+}        
+        ]]></programlisting>
+
     </section>
     <section
       xml:id="physical.view">
       <title>Physical View</title>
-      <para> Although at a conceptual level tables may be viewed as a sparse set of rows. Physically
-        they are stored on a per-column family basis. New columns (i.e.,
-          <varname>columnfamily:column</varname>) can be added to any column family without
-        pre-announcing them. <table
-          frame="all">
-          <title>ColumnFamily <varname>anchor</varname></title>
-          <tgroup
-            cols="3"
-            align="left"
-            colsep="1"
-            rowsep="1">
-            <colspec
-              colname="c1" />
-            <colspec
-              colname="c2" />
-            <colspec
-              colname="c3" />
-            <thead>
-              <row>
-                <entry>Row Key</entry>
-                <entry>Time Stamp</entry>
-                <entry>Column Family <varname>anchor</varname></entry>
-              </row>
-            </thead>
-            <tbody>
-              <row>
-                <entry>"com.cnn.www"</entry>
-                <entry>t9</entry>
-                <entry><varname>anchor:cnnsi.com</varname> = "CNN"</entry>
-              </row>
-              <row>
-                <entry>"com.cnn.www"</entry>
-                <entry>t8</entry>
-                <entry><varname>anchor:my.look.ca</varname> = "CNN.com"</entry>
-              </row>
-            </tbody>
-          </tgroup>
-        </table>
-        <table
-          frame="all">
-          <title>ColumnFamily <varname>contents</varname></title>
-          <tgroup
-            cols="3"
-            align="left"
-            colsep="1"
-            rowsep="1">
-            <colspec
-              colname="c1" />
-            <colspec
-              colname="c2" />
-            <colspec
-              colname="c3" />
-            <thead>
-              <row>
-                <entry>Row Key</entry>
-                <entry>Time Stamp</entry>
-                <entry>ColumnFamily "contents:"</entry>
-              </row>
-            </thead>
-            <tbody>
-              <row>
-                <entry>"com.cnn.www"</entry>
-                <entry>t6</entry>
-                <entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry>
-              </row>
-              <row>
-                <entry>"com.cnn.www"</entry>
-                <entry>t5</entry>
-                <entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry>
-              </row>
-              <row>
-                <entry>"com.cnn.www"</entry>
-                <entry>t3</entry>
-                <entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry>
-              </row>
-            </tbody>
-          </tgroup>
-        </table> It is important to note in the diagram above that the empty cells shown in the
-        conceptual view are not stored since they need not be in a column-oriented storage format.
+      <para> Although at a conceptual level tables may be viewed as a sparse set of rows, they are
+        physically stored by column family. A new column qualifier (column_family:column_qualifier)
+        can be added to an existing column family at any time.</para>
+      <table
+        frame="all">
+        <title>ColumnFamily <varname>anchor</varname></title>
+        <tgroup
+          cols="3"
+          align="left"
+          colsep="1"
+          rowsep="1">
+          <colspec
+            colname="c1" />
+          <colspec
+            colname="c2" />
+          <colspec
+            colname="c3" />
+          <thead>
+            <row>
+              <entry>Row Key</entry>
+              <entry>Time Stamp</entry>
+              <entry>ColumnFamily <varname>anchor</varname></entry>
+            </row>
+          </thead>
+          <tbody>
+            <row>
+              <entry>"com.cnn.www"</entry>
+              <entry>t9</entry>
+              <entry><varname>anchor:cnnsi.com</varname> = "CNN"</entry>
+            </row>
+            <row>
+              <entry>"com.cnn.www"</entry>
+              <entry>t8</entry>
+              <entry><varname>anchor:my.look.ca</varname> = "CNN.com"</entry>
+            </row>
+          </tbody>
+        </tgroup>
+      </table>
+      <table
+        frame="all">
+        <title>ColumnFamily <varname>contents</varname></title>
+        <tgroup
+          cols="3"
+          align="left"
+          colsep="1"
+          rowsep="1">
+          <colspec
+            colname="c1" />
+          <colspec
+            colname="c2" />
+          <colspec
+            colname="c3" />
+          <thead>
+            <row>
+              <entry>Row Key</entry>
+              <entry>Time Stamp</entry>
+              <entry>ColumnFamily <varname>contents</varname></entry>
+            </row>
+          </thead>
+          <tbody>
+            <row>
+              <entry>"com.cnn.www"</entry>
+              <entry>t6</entry>
+              <entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry>
+            </row>
+            <row>
+              <entry>"com.cnn.www"</entry>
+              <entry>t5</entry>
+              <entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry>
+            </row>
+            <row>
+              <entry>"com.cnn.www"</entry>
+              <entry>t3</entry>
+              <entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry>
+            </row>
+          </tbody>
+        </tgroup>
+      </table>
+      <para>The empty cells shown in the
+        conceptual view are not stored at all.
         Thus a request for the value of the <varname>contents:html</varname> column at time stamp
           <literal>t8</literal> would return no value. Similarly, a request for an
           <varname>anchor:my.look.ca</varname> value at time stamp <literal>t9</literal> would
         return no value. However, if no timestamp is supplied, the most recent value for a
-        particular column would be returned and would also be the first one found since timestamps
+        particular column would be returned. Given multiple versions, the most recent is also the
+        first one found, since timestamps
         are stored in descending order. Thus a request for the values of all columns in the row
           <varname>com.cnn.www</varname> if no timestamp is specified would be: the value of
-          <varname>contents:html</varname> from time stamp <literal>t6</literal>, the value of
-          <varname>anchor:cnnsi.com</varname> from time stamp <literal>t9</literal>, the value of
-          <varname>anchor:my.look.ca</varname> from time stamp <literal>t8</literal>. </para>
+          <varname>contents:html</varname> from timestamp <literal>t6</literal>, the value of
+          <varname>anchor:cnnsi.com</varname> from timestamp <literal>t9</literal>, the value of
+          <varname>anchor:my.look.ca</varname> from timestamp <literal>t8</literal>. </para>
       <para>For more information about the internals of how Apache HBase stores data, see <xref
           linkend="regions.arch" />. </para>
     </section>
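
Illustration (not part of the committed change): the multi-dimensional map and the
timestamp lookups described in the Physical View text can be sketched as a small
in-memory model. This is a toy under stated assumptions, not the HBase client API:
the nested-dict layout, the `get` helper, and the use of integers 3..9 in place of
t3..t9 are all hypothetical names and choices made for this example. An exact-match
rule is used for explicit timestamps because that is the behavior the text describes.

```python
# Toy in-memory model of the webtable example. NOT the HBase client API.
# Assumed layout: row key -> column family -> qualifier -> {timestamp: value}.
# Integer timestamps stand in for t3..t9 from the tables (an assumption).
webtable = {
    "com.cnn.www": {
        "contents": {"html": {6: "<html>...", 5: "<html>...", 3: "<html>..."}},
        "anchor": {
            "cnnsi.com": {9: "CNN"},
            "my.look.ca": {8: "CNN.com"},
        },
        # No "people" entry: empty cells are simply absent (the table is sparse).
    },
    "com.example.www": {
        "contents": {"html": {5: "<html>..."}},
        "people": {"author": {5: "John Doe"}},
    },
}


def get(table, row, column, timestamp=None):
    """Return (timestamp, value) for a cell, or None if nothing is stored.

    `column` uses the family:qualifier form. With no timestamp, the newest
    version is returned; with a timestamp, only an exact match is returned.
    """
    family, qualifier = column.split(":", 1)
    versions = table.get(row, {}).get(family, {}).get(qualifier, {})
    if not versions:
        return None
    if timestamp is None:
        newest = max(versions)  # highest timestamp is the most recent version
        return newest, versions[newest]
    if timestamp in versions:
        return timestamp, versions[timestamp]
    return None
```

For example, `get(webtable, "com.cnn.www", "contents:html")` returns the t6 version,
while the same request with `timestamp=8` returns `None`, because no value was
written to that column at t8.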