You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hbase.apache.org by dm...@apache.org on 2011/10/15 23:31:49 UTC

svn commit: r1183727 [2/2] - /hbase/trunk/src/docbkx/book.xml

Modified: hbase/trunk/src/docbkx/book.xml
URL: http://svn.apache.org/viewvc/hbase/trunk/src/docbkx/book.xml?rev=1183727&r1=1183726&r2=1183727&view=diff
==============================================================================
--- hbase/trunk/src/docbkx/book.xml (original)
+++ hbase/trunk/src/docbkx/book.xml Sat Oct 15 21:31:49 2011
@@ -64,1002 +64,998 @@
   <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="upgrading.xml" />
   <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="shell.xml" />
 
+  <chapter xml:id="datamodel">
+    <title>Data Model</title>
+    <para>In short, applications store data into an HBase table.
+        Tables are made of rows and columns.
+      All columns in HBase belong to a particular column family.
+      Table cells -- the intersection of row and column
+      coordinates -- are versioned.
+      A cell’s content is an uninterpreted array of bytes.
+  </para>
+      <para>Table row keys are also byte arrays, so almost anything can
+      serve as a row key, from strings to binary representations of longs or
+      even serialized data structures. Rows in HBase tables
+      are sorted by row key. The sort is byte-ordered. All table accesses are
+      via the table row key -- its primary key.
+</para>
 
+    <section xml:id="conceptual.view"><title>Conceptual View</title>
+	<para>
+        The following example is a slightly modified form of the one on page
+        2 of the <link xlink:href="http://labs.google.com/papers/bigtable.html">BigTable</link> paper.
+    There is a table called <varname>webtable</varname> that contains two column families named
+    <varname>contents</varname> and <varname>anchor</varname>.
+    In this example, <varname>anchor</varname> contains two
+    columns (<varname>anchor:cnnsi.com</varname>, <varname>anchor:my.look.ca</varname>)
+    and <varname>contents</varname> contains one column (<varname>contents:html</varname>).
+    <note>
+        <title>Column Names</title>
+      <para>
+      By convention, a column name is made of its column family prefix and a
+      <emphasis>qualifier</emphasis>. For example, the column
+      <emphasis>contents:html</emphasis> is of the column family <varname>contents</varname>.
+          The colon character (<literal
+          moreinfo="none">:</literal>) delimits the column family from the
+          column family <emphasis>qualifier</emphasis>.
+    </para>
+    </note>
+    <table frame='all'><title>Table <varname>webtable</varname></title>
+	<tgroup cols='4' align='left' colsep='1' rowsep='1'>
+	<colspec colname='c1'/>
+	<colspec colname='c2'/>
+	<colspec colname='c3'/>
+	<colspec colname='c4'/>
+	<thead>
+        <row><entry>Row Key</entry><entry>Time Stamp</entry><entry>ColumnFamily <varname>contents</varname></entry><entry>ColumnFamily <varname>anchor</varname></entry></row>
+	</thead>
+	<tbody>
+        <row><entry>"com.cnn.www"</entry><entry>t9</entry><entry></entry><entry><varname>anchor:cnnsi.com</varname> = "CNN"</entry></row>
+        <row><entry>"com.cnn.www"</entry><entry>t8</entry><entry></entry><entry><varname>anchor:my.look.ca</varname> = "CNN.com"</entry></row>
+        <row><entry>"com.cnn.www"</entry><entry>t6</entry><entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry><entry></entry></row>
+        <row><entry>"com.cnn.www"</entry><entry>t5</entry><entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry><entry></entry></row>
+        <row><entry>"com.cnn.www"</entry><entry>t3</entry><entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry><entry></entry></row>
+	</tbody>
+	</tgroup>
+	</table>
+	</para>
+	</section>
+    <section xml:id="physical.view"><title>Physical View</title>
+	<para>
+        Although at a conceptual level tables may be viewed as a sparse set of rows,
+        physically they are stored on a per-column family basis.  New columns
+        (i.e., <varname>columnfamily:column</varname>) can be added to any
+        column family without pre-announcing them. 
+        <table frame='all'><title>ColumnFamily <varname>anchor</varname></title>
+	<tgroup cols='3' align='left' colsep='1' rowsep='1'>
+	<colspec colname='c1'/>
+	<colspec colname='c2'/>
+	<colspec colname='c3'/>
+	<thead>
+        <row><entry>Row Key</entry><entry>Time Stamp</entry><entry>ColumnFamily <varname>anchor</varname></entry></row>
+	</thead>
+	<tbody>
+        <row><entry>"com.cnn.www"</entry><entry>t9</entry><entry><varname>anchor:cnnsi.com</varname> = "CNN"</entry></row>
+        <row><entry>"com.cnn.www"</entry><entry>t8</entry><entry><varname>anchor:my.look.ca</varname> = "CNN.com"</entry></row>
+	</tbody>
+	</tgroup>
+	</table>
+    <table frame='all'><title>ColumnFamily <varname>contents</varname></title>
+	<tgroup cols='3' align='left' colsep='1' rowsep='1'>
+	<colspec colname='c1'/>
+	<colspec colname='c2'/>
+	<colspec colname='c3'/>
+	<thead>
+	<row><entry>Row Key</entry><entry>Time Stamp</entry><entry>ColumnFamily <varname>contents</varname></entry></row>
+	</thead>
+	<tbody>
+        <row><entry>"com.cnn.www"</entry><entry>t6</entry><entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry></row>
+        <row><entry>"com.cnn.www"</entry><entry>t5</entry><entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry></row>
+        <row><entry>"com.cnn.www"</entry><entry>t3</entry><entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry></row>
+	</tbody>
+	</tgroup>
+	</table>
+    It is important to note in the tables above that the empty cells shown in the
+    conceptual view are not stored, since a column-oriented storage format
+    has no need to store them. Thus a request for the value of the <varname>contents:html</varname>
+    column at time stamp <literal>t8</literal> would return no value. Similarly, a
+    request for an <varname>anchor:my.look.ca</varname> value at time stamp
+    <literal>t9</literal> would return no value.  However, if no timestamp is
+    supplied, the most recent value for a particular column is returned,
+    and it is also the first one found since timestamps are stored in
+    descending order. Thus a request for the values of all columns in the row
+    <varname>com.cnn.www</varname>, if no timestamp is specified, would return:
+    the value of <varname>contents:html</varname> from time stamp
+    <literal>t6</literal>, the value of <varname>anchor:cnnsi.com</varname>
+    from time stamp <literal>t9</literal>, and the value of
+    <varname>anchor:my.look.ca</varname> from time stamp <literal>t8</literal>.
+	</para>
+	</section>
 
-  <chapter xml:id="mapreduce">
-  <title>HBase and MapReduce</title>
-  <para>See <link xlink:href="http://hbase.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description">HBase and MapReduce</link> up in javadocs.
-  Start there.  Below is some additional help.</para>
-  <section xml:id="splitter">
-    <title>Map-Task Spitting</title>
-    <section xml:id="splitter.default">
-    <title>The Default HBase MapReduce Splitter</title>
-    <para>When <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html">TableInputFormat</link>
-    is used to source an HBase table in a MapReduce job,
-    its splitter will make a map task for each region of the table.
-    Thus, if there are 100 regions in the table, there will be
-    100 map-tasks for the job - regardless of how many column families are selected in the Scan.</para>
+    <section xml:id="table">
+      <title>Table</title>
+      <para>
+      Tables are declared up front at schema definition time.
+      </para>
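+      <para>For example, a table with one column family might be declared via
+      <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html">HBaseAdmin</link>.
+      This is a minimal sketch; the table and column family names are illustrative.
+<programlisting>
+Configuration conf = HBaseConfiguration.create();
+HBaseAdmin admin = new HBaseAdmin(conf);
+
+HTableDescriptor desc = new HTableDescriptor("myTable");   // declare the table...
+desc.addFamily(new HColumnDescriptor("cf"));               // ...and its column family up front
+admin.createTable(desc);
+</programlisting>
+      </para>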
     </section>
-    <section xml:id="splitter.custom">
-    <title>Custom Splitters</title>
-    <para>For those interested in implementing custom splitters, see the method <code>getSplits</code> in 
-    <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormatBase.html">TableInputFormatBase</link>.
-    That is where the logic for map-task assignment resides.  
-    </para>
+
+    <section xml:id="row">
+      <title>Row</title>
+      <para>Row keys are uninterpreted bytes. Rows are
+      lexicographically sorted with the lowest order appearing first
+      in a table.  The empty byte array is used to denote both the
+      start and end of a table's namespace.</para>
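+      <para>Because the sort is byte-ordered rather than numeric, row keys that
+      encode numbers should use fixed-width, big-endian representations such as
+      those produced by <code>Bytes.toBytes(long)</code>; string-encoded numbers
+      will not sort numerically. A minimal sketch (for non-negative values):
+<programlisting>
+// String-encoded numbers sort lexicographically: "10" comes before "9"
+int cmp1 = Bytes.compareTo(Bytes.toBytes("10"), Bytes.toBytes("9"));   // negative
+
+// Fixed-width big-endian longs sort numerically: 9 comes before 10
+int cmp2 = Bytes.compareTo(Bytes.toBytes(9L), Bytes.toBytes(10L));     // negative
+</programlisting>
+      </para>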
     </section>
-  </section>
-  <section xml:id="mapreduce.example">
-    <title>HBase MapReduce Examples</title>
-    <section xml:id="mapreduce.example.read">
-    <title>HBase MapReduce Read Example</title>
-    <para>The following is an example of using HBase as a MapReduce source in read-only manner.  Specifically,
-    there is a Mapper instance but no Reducer, and nothing is being emitted from the Mapper.  There job would be defined
-    as follows...
-	<programlisting>
-Configuration config = HBaseConfiguration.create();
-Job job = new Job(config, "ExampleRead");
-job.setJarByClass(MyReadJob.class);     // class that contains mapper
-	
-Scan scan = new Scan();
-scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
-scan.setCacheBlocks(false);  // don't set to true for MR jobs
-// set other scan attrs
-...
-  
-TableMapReduceUtil.initTableMapperJob(
-  tableName,        // input HBase table name
-  scan,             // Scan instance to control CF and attribute selection
-  MyMapper.class,   // mapper
-  null,             // mapper output key 
-  null,             // mapper output value
-  job);
-job.setOutputFormatClass(NullOutputFormat.class);   // because we aren't emitting anything from mapper
-	    
-boolean b = job.waitForCompletion(true);
-if (!b) {
-  throw new IOException("error with job!");
-}
-  </programlisting>
-  ...and the mapper instance would extend <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapper.html">TableMapper</link>...
-	<programlisting>
-public static class MyMapper extends TableMapper&lt;Text, Text&gt; {
 
-  public void map(ImmutableBytesWritable row, Result value, Context context) throws InterruptedException, IOException {
-    // process data for the row from the Result instance.
-   }
-}    
-    </programlisting>
-  	  </para>
-  	 </section>
-    <section xml:id="mapreduce.example.readwrite">
-    <title>HBase MapReduce Read/Write Example</title>
-    <para>The following is an example of using HBase both as a source and as a sink with MapReduce. 
-    This example will simply copy data from one table to another.
-    <programlisting>
-Configuration config = HBaseConfiguration.create();
-Job job = new Job(config,"ExampleReadWrite");
-job.setJarByClass(MyReadWriteJob.class);    // class that contains mapper
-	        	        
-Scan scan = new Scan();
-scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
-scan.setCacheBlocks(false);  // don't set to true for MR jobs
-// set other scan attrs
-	        
-TableMapReduceUtil.initTableMapperJob(
-	sourceTable,      // input table
-	scan,	          // Scan instance to control CF and attribute selection
-	MyMapper.class,   // mapper class
-	null,	          // mapper output key
-	null,	          // mapper output value
-	job);
-TableMapReduceUtil.initTableReducerJob(
-	targetTable,      // output table
-	null,             // reducer class
-	job);
-job.setNumReduceTasks(0);
-	        
-boolean b = job.waitForCompletion(true);
-if (!b) {
-    throw new IOException("error with job!");
-}
-    </programlisting>
-	An explanation is required of what <classname>TableMapReduceUtil</classname> is doing, especially with the reducer.  
-	<link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableOutputFormat.html">TableOutputFormat</link> is being used
-	as the outputFormat class, and several parameters are being set on the config (e.g., TableOutputFormat.OUTPUT_TABLE), as
-	well as setting the reducer output key to <classname>ImmutableBytesWritable</classname> and reducer value to <classname>Writable</classname>.
-	These could be set by the programmer on the job and conf, but <classname>TableMapReduceUtil</classname> tries to make things easier.    
-	<para>The following is the example mapper, which will create a <classname>Put</classname> and matching the input <classname>Result</classname>
-	and emit it.  Note:  this is what the CopyTable utility does.
-	</para>
-    <programlisting>
-public static class MyMapper extends TableMapper&lt;ImmutableBytesWritable, Put&gt;  {
+    <section xml:id="columnfamily">
+      <title>Column Family<indexterm><primary>Column Family</primary></indexterm></title>
+        <para>
+      Columns in HBase are grouped into <emphasis>column families</emphasis>.
+      All column members of a column family have the same prefix.  For example, the
+      columns <emphasis>courses:history</emphasis> and
+      <emphasis>courses:math</emphasis> are both members of the
+      <emphasis>courses</emphasis> column family.
+          The colon character (<literal
+          moreinfo="none">:</literal>) delimits the column family from the
+      column family <emphasis>qualifier</emphasis><indexterm><primary>Column Family Qualifier</primary></indexterm>.
+        The column family prefix must be composed of
+      <emphasis>printable</emphasis> characters. The qualifying tail, the
+      column family <emphasis>qualifier</emphasis>, can be made of any
+      arbitrary bytes. Column families must be declared up front
+      at schema definition time whereas columns do not need to be
+      defined at schema time but can be conjured on the fly while
+      the table is up and running.</para>
+      <para>Physically, all column family members are stored together on the
+      filesystem.  Because tunings and
+      storage specifications are done at the column family level, it is
+      advised that all column family members have the same general access
+      pattern and size characteristics.</para>
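+      <para>For example, once the family <varname>cf</varname> has been declared,
+      a brand new qualifier under it can be written with no schema change. A
+      minimal sketch, against a hypothetical <classname>HTable</classname>
+      instance with illustrative names:
+<programlisting>
+Put put = new Put(Bytes.toBytes("row1"));
+put.add(Bytes.toBytes("cf"), Bytes.toBytes("newQualifier"), Bytes.toBytes("value"));  // qualifier conjured on the fly
+htable.put(put);
+</programlisting>
+      </para>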
 
-	public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
-		// this example is just copying the data from the source table...
-   		context.write(row, resultToPut(row,value));
-   	}
-        
-  	private static Put resultToPut(ImmutableBytesWritable key, Result result) throws IOException {
-  		Put put = new Put(key.get());
- 		for (KeyValue kv : result.raw()) {
-			put.add(kv);
-		}
-		return put;
-   	}
-}
-    </programlisting>
-    <para>There isn't actually a reducer step, so <classname>TableOutputFormat</classname> takes care of sending the <classname>Put</classname>
-    to the target table. 
-    </para>
-    <para>This is just an example, developers could choose not to use <classname>TableOutputFormat</classname> and connect to the 
-    target table themselves.
-    </para>
-    </para>
     </section>
-    <section xml:id="mapreduce.example.summary">
-    <title>HBase MapReduce Summary Example</title>
-    <para>The following example uses HBase as a MapReduce source and sink with a summarization step.  This example will 
-    count the number of distinct instances of a value in a table and write those summarized counts in another table.
-    <programlisting>
-Configuration config = HBaseConfiguration.create();
-Job job = new Job(config,"ExampleSummary");
-job.setJarByClass(MySummaryJob.class);     // class that contains mapper and reducer
-	        
+    <section xml:id="cells">
+      <title>Cells<indexterm><primary>Cells</primary></indexterm></title>
+      <para>A <emphasis>{row, column, version} </emphasis>tuple exactly
+      specifies a <literal>cell</literal> in HBase.
+      Cell content is uninterpreted bytes.</para>
+    </section>
+    <section xml:id="data_model_operations">
+       <title>Data Model Operations</title>
+       <para>The four primary data model operations are Get, Put, Scan, and Delete.  Operations are applied via 
+       <link xlink:href="http://hbase.apache.org/docs/current/api/org/apache/hadoop/hbase/client/HTable.html">HTable</link> instances.
+       </para>
+      <section xml:id="get">
+        <title>Get</title>
+        <para><link xlink:href="http://hbase.apache.org/docs/current/api/org/apache/hadoop/hbase/client/Get.html">Get</link> returns
+        attributes for a specified row.  Gets are executed via 
+        <link xlink:href="http://hbase.apache.org/docs/current/api/org/apache/hadoop/hbase/client/HTable.html#get%28org.apache.hadoop.hbase.client.Get%29">
+        HTable.get</link>.
+        </para>
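+        <para>A minimal sketch, assuming a populated <classname>HTable</classname>
+        instance and illustrative family and qualifier names:
+<programlisting>
+Get get = new Get(Bytes.toBytes("row1"));
+get.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("attr"));  // optionally restrict to one column
+Result r = htable.get(get);
+byte[] b = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr"));
+</programlisting>
+        </para>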
+      </section>
+      <section xml:id="put">
+        <title>Put</title>
+        <para><link xlink:href="http://hbase.apache.org/docs/current/api/org/apache/hadoop/hbase/client/Put.html">Put</link> either 
+        adds new rows to a table (if the key is new) or can update existing rows (if the key already exists).  Puts are executed via
+        <link xlink:href="http://hbase.apache.org/docs/current/api/org/apache/hadoop/hbase/client/HTable.html#put%28org.apache.hadoop.hbase.client.Put%29">
+        HTable.put</link> (writeBuffer) or <link xlink:href="http://hbase.apache.org/docs/current/api/org/apache/hadoop/hbase/client/HTable.html#batch%28java.util.List%29">
+        HTable.batch</link> (non-writeBuffer).  
+        </para>
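+        <para>A minimal sketch, assuming a populated <classname>HTable</classname>
+        instance and illustrative names:
+<programlisting>
+Put put = new Put(Bytes.toBytes("row1"));
+put.add(Bytes.toBytes("cf"), Bytes.toBytes("attr"), Bytes.toBytes("value"));
+htable.put(put);
+</programlisting>
+        </para>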
+      </section>
+      <section xml:id="scan">
+          <title>Scans</title>
+          <para><link xlink:href="http://hbase.apache.org/docs/current/api/org/apache/hadoop/hbase/client/Scan.html">Scan</link> allows
+          iteration over multiple rows for specified attributes.
+          </para>
+          <para>The following is an example of a Scan
+           on an HTable instance.  Assume that a table is populated with rows with keys "row1", "row2", "row3", 
+           and then another set of rows with the keys "abc1", "abc2", and "abc3".  The following example shows how startRow and stopRow 
+           can be applied to a Scan instance to return the rows beginning with "row".        
+<programlisting>
+HTable htable = ...      // instantiate HTable
+    
 Scan scan = new Scan();
-scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
-scan.setCacheBlocks(false);  // don't set to true for MR jobs
-// set other scan attrs
-	        
-TableMapReduceUtil.initTableMapperJob(
-	sourceTable,        // input table
-	scan,               // Scan instance to control CF and attribute selection
-	MyMapper.class,     // mapper class
-	Text.class,         // mapper output key
-	IntWritable.class,  // mapper output value
-	job);
-TableMapReduceUtil.initTableReducerJob(
-	targetTable,        // output table
-	MyReducer.class,    // reducer class
-	job);
-job.setNumReduceTasks(1);   // at least one, adjust as required
-	    
-boolean b = job.waitForCompletion(true);
-if (!b) {
-	throw new IOException("error with job!");
-}    
-    </programlisting>
-    In this example mapper a column with a String-value is chosen as the value to summarize upon.  
-    This value is used as the key to emit from the mapper, and an <classname>IntWritable</classname> represents an instance counter.
-    <programlisting>
-public static class MyMapper extends TableMapper&lt;Text, IntWritable&gt;  {
-
-	private final IntWritable ONE = new IntWritable(1);
-   	private Text text = new Text();
-    	
-   	public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
-        	String val = new String(value.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr1")));
-          	text.set(val);     // we can only emit Writables...
-
-        	context.write(text, ONE);
-   	}
-}
-    </programlisting>
-    In the reducer, the "ones" are counted (just like any other MR example that does this), and then emits a <classname>Put</classname>.
-    <programlisting>
-public static class MyReducer extends TableReducer&lt;Text, IntWritable, ImmutableBytesWritable&gt;  {
-        
- 	public void reduce(Text key, Iterable&lt;IntWritable&gt; values, Context context) throws IOException, InterruptedException {
-    		int i = 0;
-    		for (IntWritable val : values) {
-    			i += val.get();
-    		}
-    		Put put = new Put(Bytes.toBytes(key.toString()));
-    		put.add(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes(i));
-
-    		context.write(null, put);
-   	}
+scan.addColumn(Bytes.toBytes("cf"),Bytes.toBytes("attr"));
+scan.setStartRow( Bytes.toBytes("row"));
+scan.setStopRow(Bytes.toBytes("rox"));  // exclusive stop row: the prefix "row" with its last byte incremented
+for(Result result : htable.getScanner(scan)) {
+  // process Result instance
 }
-    </programlisting>
-    
-    </para>
+</programlisting>
+         </para>        
+        </section>
+      <section xml:id="delete">
+        <title>Delete</title>
+        <para><link xlink:href="http://hbase.apache.org/docs/current/api/org/apache/hadoop/hbase/client/Delete.html">Delete</link> removes
+        a row from a table.  Deletes are executed via 
+        <link xlink:href="http://hbase.apache.org/docs/current/api/org/apache/hadoop/hbase/client/HTable.html#delete%28org.apache.hadoop.hbase.client.Delete%29">
+        HTable.delete</link>.
+        </para>
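+        <para>A minimal sketch, assuming a populated <classname>HTable</classname>
+        instance and an illustrative row key:
+<programlisting>
+Delete delete = new Delete(Bytes.toBytes("row1"));
+htable.delete(delete);  // removes all cells in the row
+</programlisting>
+        </para>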
+      </section>
+            
     </section>
-   </section>
-   <section xml:id="mapreduce.htable.access">
-   <title>Accessing Other HBase Tables in a MapReduce Job</title>
-	<para>Although the framework currently allows one HBase table as input to a
-    MapReduce job, other HBase tables can 
-	be accessed as lookup tables, etc., in a
-    MapReduce job via creating an HTable instance in the setup method of the Mapper.
-	<programlisting>public class MyMapper extends TableMapper&lt;Text, LongWritable&gt; {
-  private HTable myOtherTable;
 
-  public void setup(Context context) {
-    myOtherTable = new HTable("myOtherTable");
-  }
-  
-  public void map(ImmutableBytesWritable row, Result value, Context context) throws IOException, InterruptedException {
-	// process Result...
-	// use 'myOtherTable' for lookups
-  }
-  
-  </programlisting>
-   </para>
-    </section>
-  <section xml:id="mapreduce.specex">
-  <title>Speculative Execution</title>
-  <para>It is generally advisable to turn off speculative execution for
-      MapReduce jobs that use HBase as a source.  This can either be done on a
-      per-Job basis through properties, on on the entire cluster.  Especially
-      for longer running jobs, speculative execution will create duplicate
-      map-tasks which will double-write your data to HBase; this is probably
-      not what you want.
-  </para>
-  </section>
-  </chapter>
 
-  <chapter xml:id="schema">
-  <title>HBase and Schema Design</title>
-      <para>A good general introduction on the strength and weaknesses modelling on
-          the various non-rdbms datastores is Ian Varleys' Master thesis,
-          <link xlink:href="http://ianvarley.com/UT/MR/Varley_MastersReport_Full_2009-08-07.pdf">No Relation: The Mixed Blessings of Non-Relational Databases</link>.
-          Recommended.  Also, read <xref linkend="keyvalue"/> for how HBase stores data internally.
-      </para>
-  <section xml:id="schema.creation">
-  <title>
-      Schema Creation
-  </title>
-  <para>HBase schemas can be created or updated with <xref linkend="shell" />
-      or by using <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html">HBaseAdmin</link> in the Java API.
-      </para>
-      <para>Tables must be disabled when making ColumnFamily modifications, for example..
-      <programlisting>
-Configuration config = HBaseConfiguration.create();  
-HBaseAdmin admin = new HBaseAdmin(conf);    
-String table = "myTable";
+    <section xml:id="versions">
+      <title>Versions<indexterm><primary>Versions</primary></indexterm></title>
 
-admin.disableTable(table);           
+      <para>A <emphasis>{row, column, version} </emphasis>tuple exactly
+      specifies a <literal>cell</literal> in HBase. It's possible to have an
+      unbounded number of cells where the row and column are the same but the
+      cell address differs only in its version dimension.</para>
 
-HColumnDescriptor cf1 = ...;
-admin.addColumn(table, cf1  );      // adding new ColumnFamily
-HColumnDescriptor cf2 = ...;
-admin.modifyColumn(table, cf2 );    // modifying existing ColumnFamily
+      <para>While rows and column keys are expressed as bytes, the version is
+      specified using a long integer. Typically this long contains time
+      instances such as those returned by
+      <code>java.util.Date.getTime()</code> or
+      <code>System.currentTimeMillis()</code>, that is: <quote>the difference,
+      measured in milliseconds, between the current time and midnight, January
+      1, 1970 UTC</quote>.</para>
 
-admin.enableTable(table);                
-      </programlisting>
-      </para>See <xref linkend="client_dependencies"/> for more information about configuring client connections.
-      <para>
-      </para>
-  </section>   
-  <section xml:id="number.of.cfs">
-  <title>
-      On the number of column families
-  </title>
-  <para>
-      HBase currently does not do well with anything above two or three column families so keep the number
-      of column families in your schema low.  Currently, flushing and compactions are done on a per Region basis so
-      if one column family is carrying the bulk of the data bringing on flushes, the adjacent families
-      will also be flushed though the amount of data they carry is small.  Compaction is currently triggered
-      by the total number of files under a column family.  Its not size based.  When many column families the
-      flushing and compaction interaction can make for a bunch of needless i/o loading (To be addressed by
-      changing flushing and compaction to work on a per column family basis).
-    </para>
-    <para>Try to make do with one column family if you can in your schemas.  Only introduce a
-        second and third column family in the case where data access is usually column scoped;
-        i.e. you query one column family or the other but usually not both at the one time.
-    </para>
-  </section>
-  <section xml:id="rowkey.design"><title>Rowkey Design</title>
-    <section xml:id="timeseries">
-    <title>
-    Monotonically Increasing Row Keys/Timeseries Data
-    </title>
-    <para>
-      In the HBase chapter of Tom White's book <link xlink:url="http://oreilly.com/catalog/9780596521981">Hadoop: The Definitive Guide</link> (O'Reilly) there is a an optimization note on watching out for a phenomenon where an import process walks in lock-step with all clients in concert pounding one of the table's regions (and thus, a single node), then moving onto the next region, etc.  With monotonically increasing row-keys (i.e., using a timestamp), this will happen.  See this comic by IKai Lan on why monotonically increasing row keys are problematic in BigTable-like datastores:
-      <link xlink:href="http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/">monotonically increasing values are bad</link>.  The pile-up on a single region brought on
-      by monotonically increasing keys can be mitigated by randomizing the input records to not be in sorted order, but in general its best to avoid using a timestamp or a sequence (e.g. 1, 2, 3) as the row-key. 
-    </para>
+      <para>The HBase version dimension is stored in decreasing order, so that
+      when reading from a store file, the most recent values are found
+      first.</para>
 
+      <para>There is a lot of confusion over the semantics of
+      <literal>cell</literal> versions in HBase. In particular, a couple of
+      questions that often come up are:<itemizedlist>
+          <listitem>
+            <para>If multiple writes to a cell have the same version, are all
+            versions maintained or just the last?<footnote>
+                <para>Currently, only the last written is fetchable.</para>
+              </footnote></para>
+          </listitem>
 
-    <para>If you do need to upload time series data into HBase, you should
-    study <link xlink:href="http://opentsdb.net/">OpenTSDB</link> as a
-    successful example.  It has a page describing the <link xlink:href=" http://opentsdb.net/schema.html">schema</link> it uses in
-    HBase.  The key format in OpenTSDB is effectively [metric_type][event_timestamp], which would appear at first glance to contradict the previous advice about not using a timestamp as the key.  However, the difference is that the timestamp is not in the <emphasis>lead</emphasis> position of the key, and the design assumption is that there are dozens or hundreds (or more) of different metric types.  Thus, even with a continual stream of input data with a mix of metric types, the Puts are distributed across various points of regions in the table.
-   </para>
-  </section>
-  <section xml:id="keysize">
-      <title>Try to minimize row and column sizes</title>
-      <subtitle>Or why are my StoreFile indices large?</subtitle>
-      <para>In HBase, values are always freighted with their coordinates; as a
-          cell value passes through the system, it'll be accompanied by its
-          row, column name, and timestamp - always.  If your rows and column names
-          are large, especially compared to the size of the cell value, then
-          you may run up against some interesting scenarios.  One such is
-          the case described by Marc Limotte at the tail of
-          <link xlink:url="https://issues.apache.org/jira/browse/HBASE-3551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=13005272#comment-13005272">HBASE-3551</link>
-          (recommended!).
-          Therein, the indices that are kept on HBase storefiles (<xref linkend="hfile" />)
-                  to facilitate random access may end up occupyng large chunks of the HBase
-                  allotted RAM because the cell value coordinates are large.
-                  Mark in the above cited comment suggests upping the block size so
-                  entries in the store file index happen at a larger interval or
-                  modify the table schema so it makes for smaller rows and column
-                  names.
-                  Compression will also make for larger indices.  See
-                  the thread <link xlink:href="http://search-hadoop.com/m/hemBv1LiN4Q1/a+question+storefileIndexSize&amp;subj=a+question+storefileIndexSize">a question storefileIndexSize</link>
-                  up on the user mailing list.
-       </para>
-       <para>Most of the time small inefficiencies don't matter all that much.  Unfortunately,
-         this is a case where they do.  Whatever patterns are selected for ColumnFamilies, attributes, and rowkeys they could be repeated
-       several billion times in your data.  See <xref linkend="keyvalue"/> for more information on HBase stores data internally.</para>
-       <section xml:id="keysize.cf"><title>Column Families</title>
-         <para>Try to keep the ColumnFamily names as small as possible, preferably one character (e.g. "d" for data/default).
-         </para> 
-       </section>
-       <section xml:id="keysize.atttributes"><title>Attributes</title>
-         <para>Although verbose attribute names (e.g., "myVeryImportantAttribute") are easier to read, prefer shorter attribute names (e.g., "via")
-         to store in HBase.
-         </para> 
-       </section>
-       <section xml:id="keysize.row"><title>Rowkey Length</title>
-         <para>Keep them as short as is reasonable such that they can still be useful for required data access (e.g., Get vs. Scan). 
-         A short key that is useless for data access is not better than a longer key with better get/scan properties.  Expect tradeoffs
-         when designing rowkeys.
-         </para> 
-       </section>
-    </section>
-    <section xml:id="reverse.timestamp"><title>Reverse Timestamps</title>
-    <para>A common problem in database processing is quickly finding the most recent version of a value.  A technique using reverse timestamps
-    as a part of the key can help greatly with a special case of this problem.  Also found in the HBase chapter of Tom White's book Hadoop:  The Definitive Guide (O'Reilly), 
-    the technique involves appending (<code>Long.MAX_VALUE - timestamp</code>) to the end of any key, e.g., [key][reverse_timestamp].
-    </para>
-    <para>The most recent value for [key] in a table can be found by performing a Scan for [key] and obtaining the first record.  Since HBase keys
-    are in sorted order, this key sorts before any older row-keys for [key] and thus is first.
-    </para>
-    <para>This technique would be used instead of using <xref linkend="schema.versions">HBase Versioning</xref> where the intent is to hold onto all versions
-    "forever" (or a very long time) and at the same time quickly obtain access to any other version by using the same Scan technique.
-    </para>
-    </section>
-    <section xml:id="rowkey.scope">
-    <title>Rowkeys and ColumnFamilies</title>
-    <para>Rowkeys are scoped to ColumnFamilies.  Thus, the same rowkey could exist in each ColumnFamily that exists in a table without collision.
-    </para>
-    </section>
-    <section xml:id="changing.rowkeys"><title>Immutability of Rowkeys</title>
-    <para>Rowkeys cannot be changed.  The only way they can be "changed" in a table is if the row is deleted and then re-inserted.
-    This is a fairly common question on the HBase dist-list so it pays to get the rowkeys right the first time (and/or before you've 
-    inserted a lot of data).
-    </para>
-    </section>
-    </section>  <!--  rowkey design -->  
-    <section xml:id="schema.versions">
-  <title>
-  Number of Versions
-  </title>
-     <section xml:id="schema.versions.max"><title>Maximum Number of Versions</title>
-      <para>The maximum number of row versions to store is configured per column
-      family via <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html">HColumnDescriptor</link>.
-      The default for max versions is 3.
-      This is an important parameter because as described in <xref linkend="datamodel" />
-      section HBase does <emphasis>not</emphasis> overwrite row values, but rather
-      stores different values per row by time (and qualifier).  Excess versions are removed during major
-      compactions.  The number of max versions may need to be increased or decreased depending on application needs.
-      </para>
-      <para>It is not recommended setting the number of max versions to an exceedingly high level (e.g., hundreds or more) unless those old values are 
-      very dear to you because this will greatly increase StoreFile size. 
-      </para>
-     </section>
-    <section xml:id="schema.minversions">
-    <title>
-    Minimum Number of Versions
-    </title>
-    <para>Like number of max row versions, the minimum number of row versions to keep is configured per column
-      family via <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html">HColumnDescriptor</link>.
-      The default for min versions is 0, which means the feature is disabled.
-      The minimum number of row versions parameter is used together with the time-to-live parameter and can be combined with the
-      number of row versions parameter to allow configurations such as
-      "keep the last T minutes worth of data, at most N versions, <emphasis>but keep at least M versions around</emphasis>"
-      (where M is the value for minimum number of row versions, M&lt;=N).
-      This parameter should only be set when time-to-live is enabled for a column family and must be less or equal to the 
-      number of row versions.
-    </para>
-    </section>
-  </section>
-  <section xml:id="supported.datatypes">
-  <title>
-  Supported Datatypes
-  </title>
-  <para>HBase supports a "bytes-in/bytes-out" interface via <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Put.html">Put</link> and
-  <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Result.html">Result</link>, so anything that can be
-  converted to an array of bytes can be stored as a value.  Input could be strings, numbers, complex objects, or even images as long as they can rendered as bytes.  
-  </para>
-  <para>There are practical limits to the size of values (e.g., storing 10-50MB objects in HBase would probably be too much to ask);
-  search the mailling list for conversations on this topic. All rows in HBase conform to the <xref linkend="datamodel">datamodel</xref>, and 
-  that includes versioning.  Take that into consideration when making your design, as well as block size for the ColumnFamily.  
-  </para>
-    <section xml:id="counters">
-      <title>Counters</title>
-      <para>
-      One supported datatype that deserves special mention are "counters" (i.e., the ability to do atomic increments of numbers).  See 
-      <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#increment%28org.apache.hadoop.hbase.client.Increment%29">Increment</link> in HTable.
-      </para>
-      <para>Synchronization on counters are done on the RegionServer, not in the client.
-      </para>
-    </section> 
-  </section>
-  <section xml:id="cf.in.memory">
-  <title>
-  In-Memory ColumnFamilies
-  </title>
-  <para>ColumnFamilies can optionally be defined as in-memory.  Data is still persisted to disk, just like any other ColumnFamily.  
-  In-memory blocks have the highest priority in the <xref linkend="block.cache" />, but it is not a guarantee that the entire table
-  will be in memory.
-  </para>
-  <para>See <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html">HColumnDescriptor</link> for more information.
-  </para>
-  </section>
-  <section xml:id="ttl">
-  <title>Time To Live (TTL)</title>
-  <para>ColumnFamilies can set a TTL length in seconds, and HBase will automatically delete rows once the expiration time is reached.
-  This applies to <emphasis>all</emphasis> versions of a row - even the current one.  The TTL time encoded in the HBase for the row is specified in UTC.
-  </para>
-  <para>See <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html">HColumnDescriptor</link> for more information.
-  </para>
-  </section>
-  <section xml:id="schema.bloom">
-  <title>Bloom Filters</title>
-  <para>Bloom Filters can be enabled per-ColumnFamily.
-        Use <code>HColumnDescriptor.setBloomFilterType(NONE | ROW |
-        ROWCOL)</code> to enable blooms per Column Family. Default =
-        <varname>NONE</varname> for no bloom filters. If
-        <varname>ROW</varname>, the hash of the row will be added to the bloom
-        on each insert. If <varname>ROWCOL</varname>, the hash of the row +
-        column family + column family qualifier will be added to the bloom on
-        each key insert.</para>
-  <para>See <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html">HColumnDescriptor</link> and 
-  <xref linkend="blooms"/> for more information.
-  </para>
-  </section>
-  <section xml:id="secondary.indexes">
-  <title>
-  Secondary Indexes and Alternate Query Paths
-  </title>
-  <para>This section could also be titled "what if my table rowkey looks like <emphasis>this</emphasis> but I also want to query my table like <emphasis>that</emphasis>."
-  A common example on the dist-list is where a row-key is of the format "user-timestamp" but there are are reporting requirements on activity across users for certain 
-  time ranges.  Thus, selecting by user is easy because it is in the lead position of the key, but time is not.
-  </para>
-  <para>There is no single answer on the best way to handle this because it depends on...
-   <itemizedlist>
-       <listitem>Number of users</listitem>  
-       <listitem>Data size and data arrival rate</listitem>
-       <listitem>Flexibility of reporting requirements (e.g., completely ad-hoc date selection vs. pre-configured ranges) </listitem>  
-       <listitem>Desired execution speed of query (e.g., 90 seconds may be reasonable to some for an ad-hoc report, whereas it may be too long for others) </listitem>  
-   </itemizedlist>
-   ... and solutions are also influenced by the size of the cluster and how much processing power you have to throw at the solution.  
-   Common techniques are in sub-sections below.  This is a comprehensive, but not exhaustive, list of approaches.   
-  </para>
-  <para>It should not be a surprise that secondary indexes require additional cluster space and processing.  
-  This is precisely what happens in an RDBMS because the act of creating an alternate index requires both space and processing cycles to update.  RBDMS products
-  are more advanced in this regard to handle alternative index management out of the box.  However, HBase scales better at larger data volumes, so this is a feature trade-off. 
-  </para>
-  <para>Pay attention to <xref linkend="performance"/> when implementing any of these approaches.</para>
-  <para>Additionally, see the David Butler response in this dist-list thread <link xlink:href="http://search-hadoop.com/m/nvbiBp2TDP/Stargate%252Bhbase&amp;subj=Stargate+hbase">HBase, mail # user - Stargate+hbase</link>
-   </para>
-    <section xml:id="secondary.indexes.filter">
-      <title>
-       Filter Query
-      </title>
-      <para>Depending on the case, it may be appropriate to use <xref linkend="client.filter"/>.  In this case, no secondary index is created.
-      However, don't try a full-scan on a large table like this from an application (i.e., single-threaded client).
-      </para>
-    </section>
-    <section xml:id="secondary.indexes.periodic">
-      <title>
-       Periodic-Update Secondary Index
-      </title>
-      <para>A secondary index could be created in an other table which is periodically updated via a MapReduce job.  The job could be executed intra-day, but depending on 
-      load-strategy it could still potentially be out of sync with the main data table.</para>
-      <para>See <xref linkend="mapreduce.example.readwrite"/> for more information.</para>
-    </section>
-    <section xml:id="secondary.indexes.dualwrite">
-      <title>
-       Dual-Write Secondary Index
-      </title>
-      <para>Another strategy is to build the secondary index while publishing data to the cluster (e.g., write to data table, write to index table). 
-      If this is approach is taken after a data table already exists, then bootstrapping will be needed for the secondary index with a MapReduce job (see <xref linkend="secondary.indexes.periodic"/>).</para>
-    </section>
-    <section xml:id="secondary.indexes.summary">
-      <title>
-       Summary Tables
-      </title>
-      <para>Where time-ranges are very wide (e.g., year-long report) and where the data is voluminous, summary tables are a common approach.
-      These would be generated with MapReduce jobs into another table.</para>
-      <para>See <xref linkend="mapreduce.example.summary"/> for more information.</para>
-    </section>
-    <section xml:id="secondary.indexes.coproc">
-      <title>
-       Coprocessor Secondary Index
-      </title>
-      <para>Coprocessors act like RDBMS triggers.  These are currently on TRUNK.
-      </para>
-    </section>
-  </section>
-  <section xml:id="schema.smackdown"><title>Schema Design Smackdown</title>
-    <para>This section will describe common schema design questions that appear on the dist-list.  These are 
-    general guidelines and not laws - each application must consider it's own needs.  
-    </para>
-    <section xml:id="schema.smackdown.rowsversions"><title>Rows vs. Versions</title>
-      <para>A common question is whether one should prefer rows or HBase's built-in-versioning.  The context is typically where there are
-      "a lot" of versions of a row to be retained (e.g., where it is significantly above the HBase default of 3 max versions).  The 
-      rows-approach would require storing a timstamp in some portion of the rowkey so that they would not overwite with each successive update.
-      </para>
-      <para>Preference:  Rows (generally speaking).
-      </para>
+          <listitem>
+            <para>Is it OK to write cells in a non-increasing version
+            order?<footnote>
+                <para>Yes</para>
+              </footnote></para>
+          </listitem>
+        </itemizedlist></para>
+
+      <para>Below we describe how the version dimension in HBase currently
+      works<footnote>
+          <para>See <link
+          xlink:href="https://issues.apache.org/jira/browse/HBASE-2406">HBASE-2406</link>
+          for discussion of HBase versions. <link
+          xlink:href="http://outerthought.org/blog/417-ot.html">Bending time
+          in HBase</link> makes for a good read on the version, or time,
+          dimension in HBase. It has more detail on versioning than is
+          provided here. As of this writing, the limitation
+          <emphasis>Overwriting values at existing timestamps</emphasis>
+          mentioned in the article no longer holds in HBase. This section is
+          basically a synopsis of this article by Bruno Dumon.</para>
+        </footnote>.</para>
+
+      <section xml:id="versions.ops">
+        <title>Versions and HBase Operations</title>
+
+        <para>In this section we look at the behavior of the version dimension
+        for each of the core HBase operations.</para>
+
+        <section>
+          <title>Get/Scan</title>
+
+          <para>Gets are implemented on top of Scans. The below discussion of
+            <link xlink:href="http://hbase.apache.org/docs/current/api/org/apache/hadoop/hbase/client/Get.html">Get</link> applies equally to <link
+            xlink:href="http://hbase.apache.org/docs/current/api/org/apache/hadoop/hbase/client/Scan.html">Scans</link>.</para>
+
+          <para>By default, i.e. if you specify no explicit version, when
+          doing a <literal>get</literal>, the cell whose version has the
+          largest value is returned (which may or may not be the latest one
+          written, see later). The default behavior can be modified in the
+          following ways:</para>
+
+          <itemizedlist>
+            <listitem>
+              <para>to return more than one version, see <link
+              xlink:href="http://hbase.apache.org/docs/current/api/org/apache/hadoop/hbase/client/Get.html#setMaxVersions()">Get.setMaxVersions()</link></para>
+            </listitem>
+
+            <listitem>
+              <para>to return versions other than the latest, see <link
+              xlink:href="http://hbase.apache.org/docs/current/api/org/apache/hadoop/hbase/client/Get.html#setTimeRange%28long,%20long%29">Get.setTimeRange()</link></para>
+
+              <para>To retrieve the latest version that is less than or equal
+              to a given value, thus giving the 'latest' state of the record
+              at a certain point in time, use a range from 0 to the
+              desired version and set the max versions to 1 (see the sketch
+              following this list).</para>
+            </listitem>
+          </itemizedlist>
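+          <para>A minimal sketch of the time-range technique described in the
+          list above; <code>desiredVersion</code> is a hypothetical long
+          timestamp, and the table, family, and qualifier names are illustrative.
+<programlisting>
+Get get = new Get(Bytes.toBytes("row1"));
+get.setTimeRange(0, desiredVersion + 1);  // the maximum stamp is exclusive, so add 1
+get.setMaxVersions(1);                    // only the newest cell in that range
+Result r = htable.get(get);
+byte[] b = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr"));
+</programlisting>
+          </para>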
+
+        </section>
+        <section xml:id="default_get_example">
+        <title>Default Get Example</title>
+        <para>The following Get will only retrieve the current version of the row.
+<programlisting>
+Get get = new Get(Bytes.toBytes("row1"));
+Result r = htable.get(get);
+byte[] b = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr"));  // returns current version of value          
+</programlisting>
+        </para>
+        </section>
+        <section xml:id="versioned_get_example">
+        <title>Versioned Get Example</title>
+        <para>The following Get will return the last 3 versions of the row.
+<programlisting>
+Get get = new Get(Bytes.toBytes("row1"));
+get.setMaxVersions(3);  // will return last 3 versions of row
+Result r = htable.get(get);
+byte[] b = r.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr"));  // returns current version of value
+List&lt;KeyValue&gt; kv = r.getColumn(Bytes.toBytes("cf"), Bytes.toBytes("attr"));  // returns all versions of this column       
+</programlisting>
+        </para>
+        </section>
+     
+        <section>
+          <title>Put</title>
+
+          <para>Doing a put always creates a new version of a
+          <literal>cell</literal>, at a certain timestamp. By default the
+          system uses the server's <literal>currentTimeMillis</literal>, but
+          you can specify the version (= the long integer) yourself, on a
+          per-column level. This means you could assign a time in the past or
+          the future, or use the long value for non-time purposes.</para>
+
+          <para>To overwrite an existing value, do a put at exactly the same
+          row, column, and version as that of the cell you would
+          overshadow.</para>
+          <section xml:id="implicit_version_example">
+          <title>Implicit Version Example</title>
+          <para>The following Put will be implicitly versioned by HBase with the current time.
+<programlisting>
+Put put = new Put(Bytes.toBytes(row));
+put.add(Bytes.toBytes("cf"), Bytes.toBytes("attr1"), Bytes.toBytes( data));
+htable.put(put);
+</programlisting>
+          </para>
+          </section>
+          <section xml:id="explicit_version_example">
+          <title>Explicit Version Example</title>
+          <para>The following Put has the version timestamp explicitly set.
+<programlisting>
+Put put = new Put(Bytes.toBytes(row));
+long explicitTimeInMs = 555;  // just an example
+put.add(Bytes.toBytes("cf"), Bytes.toBytes("attr1"), explicitTimeInMs, Bytes.toBytes(data));
+htable.put(put);
+</programlisting>
+          Caution:  the version timestamp is used internally by HBase for things like time-to-live calculations.
+          It's usually best to avoid setting the timestamp yourself.
+          </para>
+          </section>
+          
+        </section>
+
+        <section>
+          <title>Delete</title>
+
+          <para>When performing a delete operation in HBase, there are two
+          ways to specify the versions to be deleted:</para>
+
+          <itemizedlist>
+            <listitem>
+              <para>Delete all versions older than a certain timestamp</para>
+            </listitem>
+
+            <listitem>
+              <para>Delete the version at a specific timestamp</para>
+            </listitem>
+          </itemizedlist>
+
+          <para>A delete can apply to a complete row, a complete column
+          family, or to just one column. It is only in the last case that you
+          can delete explicit versions. For the deletion of a row or all the
+          columns within a family, it always works by deleting all cells older
+          than a certain version.</para>
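+          <para>A minimal sketch of both cases, with illustrative names and a
+          hypothetical long timestamp <code>ts</code>:
+<programlisting>
+Delete delete = new Delete(Bytes.toBytes("row1"));
+delete.deleteColumns(Bytes.toBytes("cf"), Bytes.toBytes("attr1"), ts);  // all versions at or older than ts
+delete.deleteColumn(Bytes.toBytes("cf"), Bytes.toBytes("attr2"), ts);   // only the version at exactly ts
+htable.delete(delete);
+</programlisting>
+          </para>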
+
+          <para>Deletes work by creating <emphasis>tombstone</emphasis>
+          markers. For example, let's suppose we want to delete a row. For
+          this you can specify a version, or else by default the
+          <literal>currentTimeMillis</literal> is used. What this means is
+          <quote>delete all cells where the version is less than or equal to
+          this version</quote>. HBase never modifies data in place, so for
+          example a delete will not immediately delete (or mark as deleted)
+          the entries in the storage file that correspond to the delete
+          condition. Rather, a so-called <emphasis>tombstone</emphasis> is
+          written, which will mask the deleted values<footnote>
+              <para>When HBase does a major compaction, the tombstones are
+              processed to actually remove the dead values, together with the
+              tombstones themselves.</para>
+            </footnote>. If the version you specified when deleting a row is
+          larger than the version of any value in the row, then you can
+          consider the complete row to be deleted.</para>
+        </section>
+      </section>
+
+      <section>
+        <title>Current Limitations</title>
+
+        <para>There are still some bugs (or at least 'undecided behavior')
+        with the version dimension that will be addressed by later HBase
+        releases.</para>
+
+        <section>
+          <title>Deletes mask Puts</title>
+
+          <para>Deletes mask puts, even puts that happened after the delete
+          was entered<footnote>
+              <para><link
+              xlink:href="https://issues.apache.org/jira/browse/HBASE-2256">HBASE-2256</link></para>
+            </footnote>. Remember that a delete writes a tombstone, which only
+          disappears after the next major compaction has run. Suppose you do
+          a delete of everything &lt;= T. After this you do a new put with a
+          timestamp &lt;= T. This put, even if it happened after the delete,
+          will be masked by the delete tombstone. Performing the put will not
+          fail, but when you do a get you will notice the put had no
+          effect. It will start working again after the major compaction has
+          run. These issues should not be a problem if you use
+          always-increasing versions for new puts to a row. But they can occur
+          even if you do not care about time: just do a delete and a put
+          immediately after each other, and there is some chance they happen
+          within the same millisecond.</para>
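+          <para>A minimal sketch of the masking behavior, assuming a populated
+          <classname>HTable</classname> instance and illustrative names:
+<programlisting>
+long t = System.currentTimeMillis();
+Delete delete = new Delete(Bytes.toBytes("row1"));
+delete.deleteColumns(Bytes.toBytes("cf"), Bytes.toBytes("attr"), t);  // tombstone everything at or before t
+htable.delete(delete);
+
+Put put = new Put(Bytes.toBytes("row1"));
+put.add(Bytes.toBytes("cf"), Bytes.toBytes("attr"), t, Bytes.toBytes("value"));  // same timestamp t
+htable.put(put);  // succeeds, but the tombstone masks it
+
+Result r = htable.get(new Get(Bytes.toBytes("row1")));  // empty until a major compaction removes the tombstone
+</programlisting>
+          </para>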
+        </section>
+
+        <section>
+          <title>Major compactions change query results</title>
+
+          <para><quote>...create three cell versions at t1, t2 and t3, with a
+          maximum-versions setting of 2. So when getting all versions, only
+          the values at t2 and t3 will be returned. But if you delete the
+          version at t2 or t3, the one at t1 will appear again. Obviously,
+          once a major compaction has run, such behavior will not be the case
+          anymore...<footnote>
+              <para>See <emphasis>Garbage Collection</emphasis> in <link
+              xlink:href="http://outerthought.org/blog/417-ot.html">Bending
+              time in HBase</link> </para>
+            </footnote></quote></para>
+        </section>
+      </section>
     </section>
-    <section xml:id="schema.smackdown.rowscols"><title>Rows vs. Columns</title>
-      <para>Another common question is whether one should prefer rows or columns.  The context is typically in extreme cases of wide
-      tables, such as having 1 row with 1 million attributes, or 1 million rows with 1 columns apiece.  
+  </chapter>  <!-- data model -->
+
+ <chapter xml:id="schema">
+  <title>HBase and Schema Design</title>
+      <para>A good general introduction to the strengths and weaknesses of
+          data modelling in the various non-RDBMS datastores is Ian Varley's Master's thesis,
+          <link xlink:href="http://ianvarley.com/UT/MR/Varley_MastersReport_Full_2009-08-07.pdf">No Relation: The Mixed Blessings of Non-Relational Databases</link>.
+          Recommended.  Also, read <xref linkend="keyvalue"/> for how HBase stores data internally.
       </para>
-      <para>Preference:  Rows (generally speaking).  To be clear, this guideline is in the context is in extremely wide cases, not in the 
-      standard use-case where one needs to store a few dozen or hundred columns.
+  <section xml:id="schema.creation">
+  <title>
+      Schema Creation
+  </title>
+  <para>HBase schemas can be created or updated with <xref linkend="shell" />
+      or by using <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html">HBaseAdmin</link> in the Java API.
       </para>
-    </section>
-  </section>
+      <para>Tables must be disabled when making ColumnFamily modifications, for example:
+      <programlisting>
+Configuration config = HBaseConfiguration.create();  
+HBaseAdmin admin = new HBaseAdmin(config);    
+String table = "myTable";
 
-  </chapter>
+admin.disableTable(table);           
 
-  
-  <chapter xml:id="datamodel">
-    <title>Data Model</title>
-    <para>In short, applications store data into an HBase table.
-        Tables are made of rows and columns.
-      All columns in HBase belong to a particular column family.
-      Table cells -- the intersection of row and column
-      coordinates -- are versioned.
-      A cell’s content is an uninterpreted array of bytes.
-  </para>
-      <para>Table row keys are also byte arrays so almost anything can
-      serve as a row key from strings to binary representations of longs or
-      even serialized data structures. Rows in HBase tables
-      are sorted by row key. The sort is byte-ordered. All table accesses are
-      via the table row key -- its primary key.
-</para>
+HColumnDescriptor cf1 = new HColumnDescriptor("cf1");   // illustrative new ColumnFamily
+admin.addColumn(table, cf1);        // adding new ColumnFamily
+HColumnDescriptor cf2 = new HColumnDescriptor("cf2");   // illustrative existing ColumnFamily
+admin.modifyColumn(table, cf2);     // modifying existing ColumnFamily
 
-    <section xml:id="conceptual.view"><title>Conceptual View</title>
-	<para>
-        The following example is a slightly modified form of the one on page
-        2 of the <link xlink:href="http://labs.google.com/papers/bigtable.html">BigTable</link> paper.
-    There is a table called <varname>webtable</varname> that contains two column families named
-    <varname>contents</varname> and <varname>anchor</varname>.
-    In this example, <varname>anchor</varname> contains two
-    columns (<varname>anchor:cssnsi.com</varname>, <varname>anchor:my.look.ca</varname>)
-    and <varname>contents</varname> contains one column (<varname>contents:html</varname>).
-    <note>
-        <title>Column Names</title>
+admin.enableTable(table);                
+      </programlisting>
+      See <xref linkend="client_dependencies"/> for more information about configuring client connections.
+      </para>
       <para>
-      By convention, a column name is made of its column family prefix and a
-      <emphasis>qualifier</emphasis>. For example, the
-      column
-      <emphasis>contents:html</emphasis> is of the column family <varname>contents</varname>
-          The colon character (<literal
-          moreinfo="none">:</literal>) delimits the column family from the
-          column family <emphasis>qualifier</emphasis>.
+      </para>
+  </section>   
+  <section xml:id="number.of.cfs">
+  <title>
+      On the number of column families
+  </title>
+  <para>
+      HBase currently does not do well with anything above two or three column families, so keep the number
+      of column families in your schema low.  Currently, flushing and compactions are done on a per-Region basis, so
+      if one column family is carrying the bulk of the data and bringing on flushes, the adjacent families
+      will also be flushed even though the amount of data they carry is small.  Compaction is currently triggered
+      by the total number of files under a column family; it is not size-based.  When there are many column families, this
+      flushing and compaction interaction can make for a lot of needless I/O (to be addressed by
+      changing flushing and compaction to work on a per-column-family basis).
+    </para>
+    <para>Try to make do with one column family if you can in your schemas.  Only introduce a
+        second and third column family in the case where data access is usually column scoped;
+        i.e., you query one column family or the other, but usually not both at the same time.
+    </para>
+  </section>
+  <section xml:id="rowkey.design"><title>Rowkey Design</title>
+    <section xml:id="timeseries">
+    <title>
+    Monotonically Increasing Row Keys/Timeseries Data
+    </title>
+    <para>
+      In the HBase chapter of Tom White's book <link xlink:href="http://oreilly.com/catalog/9780596521981">Hadoop: The Definitive Guide</link> (O'Reilly) there is an optimization note on watching out for a phenomenon where an import process walks in lock-step, with all clients in concert pounding one of the table's regions (and thus, a single node), then moving on to the next region, etc.  With monotonically increasing row-keys (i.e., using a timestamp), this will happen.  See this comic by IKai Lan on why monotonically increasing row keys are problematic in BigTable-like datastores:
+      <link xlink:href="http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/">monotonically increasing values are bad</link>.  The pile-up on a single region brought on
+      by monotonically increasing keys can be mitigated by randomizing the input records so they are not in sorted order, but in general it's best to avoid using a timestamp or a sequence (e.g. 1, 2, 3) as the row-key.
     </para>
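+    <para>
+      For illustration, one common mitigation is to prefix an otherwise
+      monotonically increasing key with a salt so that writes are spread across
+      regions.  The following is a minimal sketch, not an HBase API; the bucket
+      count and key layout are assumptions that would be tuned per application.
+      <programlisting>
+// Hypothetical salting sketch: spread sequential timestamps over N buckets.
+// Requires org.apache.hadoop.hbase.util.Bytes.
+final int NUM_BUCKETS = 16;                    // assumption: tuned per cluster
+long timestamp = System.currentTimeMillis();
+byte salt = (byte) (timestamp % NUM_BUCKETS);  // simple bucket assignment
+byte[] rowKey = Bytes.add(new byte[] { salt }, Bytes.toBytes(timestamp));
+      </programlisting>
+      Note that reads must then fan out across all buckets, so this trades scan
+      simplicity for write distribution.
+    </para>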
-    </note>
-    <table frame='all'><title>Table <varname>webtable</varname></title>
-	<tgroup cols='4' align='left' colsep='1' rowsep='1'>
-	<colspec colname='c1'/>
-	<colspec colname='c2'/>
-	<colspec colname='c3'/>
-	<colspec colname='c4'/>
-	<thead>
-        <row><entry>Row Key</entry><entry>Time Stamp</entry><entry>ColumnFamily <varname>contents</varname></entry><entry>ColumnFamily <varname>anchor</varname></entry></row>
-	</thead>
-	<tbody>
-        <row><entry>"com.cnn.www"</entry><entry>t9</entry><entry></entry><entry><varname>anchor:cnnsi.com</varname> = "CNN"</entry></row>
-        <row><entry>"com.cnn.www"</entry><entry>t8</entry><entry></entry><entry><varname>anchor:my.look.ca</varname> = "CNN.com"</entry></row>
-        <row><entry>"com.cnn.www"</entry><entry>t6</entry><entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry><entry></entry></row>
-        <row><entry>"com.cnn.www"</entry><entry>t5</entry><entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry><entry></entry></row>
-        <row><entry>"com.cnn.www"</entry><entry>t3</entry><entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry><entry></entry></row>
-	</tbody>
-	</tgroup>
-	</table>
-	</para>
-	</section>
-    <section xml:id="physical.view"><title>Physical View</title>
-	<para>
-        Although at a conceptual level tables may be viewed as a sparse set of rows.
-        Physically they are stored on a per-column family basis.  New columns
-        (i.e., <varname>columnfamily:column</varname>) can be added to any
-        column family without pre-announcing them. 
-        <table frame='all'><title>ColumnFamily <varname>anchor</varname></title>
-	<tgroup cols='3' align='left' colsep='1' rowsep='1'>
-	<colspec colname='c1'/>
-	<colspec colname='c2'/>
-	<colspec colname='c3'/>
-	<thead>
-        <row><entry>Row Key</entry><entry>Time Stamp</entry><entry>Column Family <varname>anchor</varname></entry></row>
-	</thead>
-	<tbody>
-        <row><entry>"com.cnn.www"</entry><entry>t9</entry><entry><varname>anchor:cnnsi.com</varname> = "CNN"</entry></row>
-        <row><entry>"com.cnn.www"</entry><entry>t8</entry><entry><varname>anchor:my.look.ca</varname> = "CNN.com"</entry></row>
-	</tbody>
-	</tgroup>
-	</table>
-    <table frame='all'><title>ColumnFamily <varname>contents</varname></title>
-	<tgroup cols='3' align='left' colsep='1' rowsep='1'>
-	<colspec colname='c1'/>
-	<colspec colname='c2'/>
-	<colspec colname='c3'/>
-	<thead>
-	<row><entry>Row Key</entry><entry>Time Stamp</entry><entry>ColumnFamily "contents:"</entry></row>
-	</thead>
-	<tbody>
-        <row><entry>"com.cnn.www"</entry><entry>t6</entry><entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry></row>
-        <row><entry>"com.cnn.www"</entry><entry>t5</entry><entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry></row>
-        <row><entry>"com.cnn.www"</entry><entry>t3</entry><entry><varname>contents:html</varname> = "&lt;html&gt;..."</entry></row>
-	</tbody>
-	</tgroup>
-	</table>
-    It is important to note in the diagram above that the empty cells shown in the
-    conceptual view are not stored since they need not be in a column-oriented
-    storage format. Thus a request for the value of the <varname>contents:html</varname>
-    column at time stamp <literal>t8</literal> would return no value. Similarly, a
-    request for an <varname>anchor:my.look.ca</varname> value at time stamp
-    <literal>t9</literal> would return no value.  However, if no timestamp is
-    supplied, the most recent value for a particular column would be returned
-    and would also be the first one found since timestamps are stored in
-    descending order. Thus a request for the values of all columns in the row
-    <varname>com.cnn.www</varname> if no timestamp is specified would be:
-    the value of <varname>contents:html</varname> from time stamp
-    <literal>t6</literal>, the value of <varname>anchor:cnnsi.com</varname>
-    from time stamp <literal>t9</literal>, the value of 
-    <varname>anchor:my.look.ca</varname> from time stamp <literal>t8</literal>.
-	</para>
-	</section>
 
-    <section xml:id="table">
-      <title>Table</title>
-      <para>
-      Tables are declared up front at schema definition time.
-      </para>
+
+    <para>If you do need to upload time series data into HBase, you should
+    study <link xlink:href="http://opentsdb.net/">OpenTSDB</link> as a
+    successful example.  It has a page describing the <link xlink:href="http://opentsdb.net/schema.html">schema</link> it uses in
+    HBase.  The key format in OpenTSDB is effectively [metric_type][event_timestamp], which would appear at first glance to contradict the previous advice about not using a timestamp as the key.  However, the difference is that the timestamp is not in the <emphasis>lead</emphasis> position of the key, and the design assumption is that there are dozens or hundreds (or more) of different metric types.  Thus, even with a continual stream of input data with a mix of metric types, the Puts are distributed across many regions in the table.
+   </para>
+  </section>
+  <section xml:id="keysize">
+      <title>Try to minimize row and column sizes</title>
+      <subtitle>Or why are my StoreFile indices large?</subtitle>
+      <para>In HBase, values are always freighted with their coordinates; as a
+          cell value passes through the system, it'll be accompanied by its
+          row, column name, and timestamp - always.  If your rows and column names
+          are large, especially compared to the size of the cell value, then
+          you may run up against some interesting scenarios.  One such is
+          the case described by Marc Limotte at the tail of
+          <link xlink:href="https://issues.apache.org/jira/browse/HBASE-3551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&amp;focusedCommentId=13005272#comment-13005272">HBASE-3551</link>
+          (recommended!).
+          Therein, the indices that are kept on HBase storefiles (<xref linkend="hfile" />)
+          to facilitate random access may end up occupying large chunks of the HBase
+          allotted RAM because the cell value coordinates are large.
+          Marc in the above-cited comment suggests upping the block size so
+          entries in the store file index happen at a larger interval, or
+          modifying the table schema so it makes for smaller rows and column
+          names.
+          Compression will also make for larger indices.  See
+          the thread <link xlink:href="http://search-hadoop.com/m/hemBv1LiN4Q1/a+question+storefileIndexSize&amp;subj=a+question+storefileIndexSize">a question storefileIndexSize</link>
+          up on the user mailing list.
+       </para>
+       <para>Most of the time small inefficiencies don't matter all that much.  Unfortunately,
+         this is a case where they do.  Whatever patterns are selected for ColumnFamilies, attributes, and rowkeys, they could be repeated
+       several billion times in your data.  See <xref linkend="keyvalue"/> for more information on how HBase stores data internally.</para>
+       <section xml:id="keysize.cf"><title>Column Families</title>
+         <para>Try to keep the ColumnFamily names as small as possible, preferably one character (e.g. "d" for data/default).
+         </para> 
+       </section>
+       <section xml:id="keysize.atttributes"><title>Attributes</title>
+         <para>Although verbose attribute names (e.g., "myVeryImportantAttribute") are easier to read, prefer shorter attribute names (e.g., "via")
+         to store in HBase.
+         </para> 
+       </section>
+       <section xml:id="keysize.row"><title>Rowkey Length</title>
+         <para>Keep them as short as is reasonable such that they can still be useful for required data access (e.g., Get vs. Scan). 
+         A short key that is useless for data access is not better than a longer key with better get/scan properties.  Expect tradeoffs
+         when designing rowkeys.
+         </para> 
+       </section>
     </section>
-
-    <section xml:id="row">
-      <title>Row</title>
-      <para>Row keys are uninterrpreted bytes. Rows are
-      lexicographically sorted with the lowest order appearing first
-      in a table.  The empty byte array is used to denote both the
-      start and end of a tables' namespace.</para>
+    <section xml:id="reverse.timestamp"><title>Reverse Timestamps</title>
+    <para>A common problem in database processing is quickly finding the most recent version of a value.  A technique using reverse timestamps
+    as a part of the key can help greatly with a special case of this problem.  Also found in the HBase chapter of Tom White's book Hadoop:  The Definitive Guide (O'Reilly), 
+    the technique involves appending (<code>Long.MAX_VALUE - timestamp</code>) to the end of any key, e.g., [key][reverse_timestamp].
+    </para>
+    <para>The most recent value for [key] in a table can be found by performing a Scan for [key] and obtaining the first record.  Since HBase keys
+    are in sorted order, this key sorts before any older row-keys for [key] and thus is first.
+    </para>
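+    <para>
+      A minimal sketch of the technique follows; the table, key, family, and
+      qualifier names are illustrative assumptions, not part of the HBase API.
+      <programlisting>
+HTable htable = ...      // instantiate HTable
+
+// Write with a [key][reverse_timestamp] rowkey.
+long reverseTs = Long.MAX_VALUE - System.currentTimeMillis();
+byte[] rowKey = Bytes.add(Bytes.toBytes("mykey"), Bytes.toBytes(reverseTs));
+Put put = new Put(rowKey);
+put.add(Bytes.toBytes("cf"), Bytes.toBytes("attr"), Bytes.toBytes("value"));
+htable.put(put);
+
+// Read: a scan starting at [key] returns the most recent version first.
+Scan scan = new Scan(Bytes.toBytes("mykey"));
+Result first = htable.getScanner(scan).next();
+      </programlisting>
+    </para>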
+    <para>This technique would be used instead of using <link linkend="schema.versions">HBase Versioning</link> where the intent is to hold onto all versions
+    "forever" (or a very long time) and at the same time quickly obtain access to any other version by using the same Scan technique.
+    </para>
     </section>
-
-    <section xml:id="columnfamily">
-      <title>Column Family<indexterm><primary>Column Family</primary></indexterm></title>
-        <para>
-      Columns in HBase are grouped into <emphasis>column families</emphasis>.
-      All column members of a column family have the same prefix.  For example, the
-      columns <emphasis>courses:history</emphasis> and
-      <emphasis>courses:math</emphasis> are both members of the
-      <emphasis>courses</emphasis> column family.
-          The colon character (<literal
-          moreinfo="none">:</literal>) delimits the column family from the
-      <indexterm>column family <emphasis>qualifier</emphasis><primary>Column Family Qualifier</primary></indexterm>.
-        The column family prefix must be composed of
-      <emphasis>printable</emphasis> characters. The qualifying tail, the
-      column family <emphasis>qualifier</emphasis>, can be made of any
-      arbitrary bytes. Column families must be declared up front
-      at schema definition time whereas columns do not need to be
-      defined at schema time but can be conjured on the fly while
-      the table is up an running.</para>
-      <para>Physically, all column family members are stored together on the
-      filesystem.  Because tunings and
-      storage specifications are done at the column family level, it is
-      advised that all column family members have the same general access
-      pattern and size characteristics.</para>
-
-      <para></para>
+    <section xml:id="rowkey.scope">
+    <title>Rowkeys and ColumnFamilies</title>
+    <para>Rowkeys are scoped to ColumnFamilies.  Thus, the same rowkey could exist in each ColumnFamily that exists in a table without collision.
+    </para>
     </section>
-    <section xml:id="cells">
-      <title>Cells<indexterm><primary>Cells</primary></indexterm></title>
-      <para>A <emphasis>{row, column, version} </emphasis>tuple exactly
-      specifies a <literal>cell</literal> in HBase. 
-      Cell content is uninterrpreted bytes</para>
+    <section xml:id="changing.rowkeys"><title>Immutability of Rowkeys</title>
+    <para>Rowkeys cannot be changed.  The only way they can be "changed" in a table is if the row is deleted and then re-inserted.
+    This is a fairly common question on the HBase dist-list so it pays to get the rowkeys right the first time (and/or before you've 
+    inserted a lot of data).
+    </para>
     </section>
-    <section xml:id="data_model_operations">
-       <title>Data Model Operations</title>
-       <para>The four primary data model operations are Get, Put, Scan, and Delete.  Operations are applied via 
-       <link xlink:href="http://hbase.apache.org/docs/current/api/org/apache/hadoop/hbase/client/HTable.html">HTable</link> instances.
-       </para>
-      <section xml:id="get">
-        <title>Get</title>
-        <para><link xlink:href="http://hbase.apache.org/docs/current/api/org/apache/hadoop/hbase/client/Get.html">Get</link> returns
-        attributes for a specified row.  Gets are executed via 
-        <link xlink:href="http://hbase.apache.org/docs/current/api/org/apache/hadoop/hbase/client/HTable.html#get%28org.apache.hadoop.hbase.client.Get%29">
-        HTable.get</link>.
-        </para>
-      </section>
-      <section xml:id="put">
-        <title>Put</title>
-        <para><link xlink:href="http://hbase.apache.org/docs/current/api/org/apache/hadoop/hbase/client/Put.html">Put</link> either 
-        adds new rows to a table (if the key is new) or can update existing rows (if the key already exists).  Puts are executed via
-        <link xlink:href="http://hbase.apache.org/docs/current/api/org/apache/hadoop/hbase/client/HTable.html#put%28org.apache.hadoop.hbase.client.Put%29">
-        HTable.put</link> (writeBuffer) or <link xlink:href="http://hbase.apache.org/docs/current/api/org/apache/hadoop/hbase/client/HTable.html#batch%28java.util.List%29">
-        HTable.batch</link> (non-writeBuffer).  
-        </para>
-      </section>
-      <section xml:id="scan">
-          <title>Scans</title>
-          <para><link xlink:href="http://hbase.apache.org/docs/current/api/org/apache/hadoop/hbase/client/Scan.html">Scan</link> allow
-          iteration over multiple rows for specified attributes.
-          </para>
-          <para>The following is an example of a 
-           on an HTable table instance.  Assume that a table is populated with rows with keys "row1", "row2", "row3", 
-           and then another set of rows with the keys "abc1", "abc2", and "abc3".  The following example shows how startRow and stopRow 
-           can be applied to a Scan instance to return the rows beginning with "row".        
-<programlisting>
-HTable htable = ...      // instantiate HTable
-    
-Scan scan = new Scan();
-scan.addColumn(Bytes.toBytes("cf"),Bytes.toBytes("attr"));
-scan.setStartRow( Bytes.toBytes("row"));
-scan.setStopRow( Bytes.toBytes("row" +  new byte[] {0}));  // note: stop key != start key
-for(Result result : htable.getScanner(scan)) {
-  // process Result instance
-}
-</programlisting>
-         </para>        
-        </section>
-      <section xml:id="delete">
-        <title>Delete</title>
-        <para><link xlink:href="http://hbase.apache.org/docs/current/api/org/apache/hadoop/hbase/client/Delete.html">Delete</link> removes
-        a row from a table.  Deletes are executed via 
-        <link xlink:href="http://hbase.apache.org/docs/current/api/org/apache/hadoop/hbase/client/HTable.html#delete%28org.apache.hadoop.hbase.client.Delete%29">
-        HTable.delete</link>.
-        </para>
-      </section>
-            
+    </section>  <!--  rowkey design -->  
+    <section xml:id="schema.versions">
+  <title>
+  Number of Versions
+  </title>
+     <section xml:id="schema.versions.max"><title>Maximum Number of Versions</title>
+      <para>The maximum number of row versions to store is configured per column
+      family via <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html">HColumnDescriptor</link>.
+      The default for max versions is 3.
+      This is an important parameter because, as described in the <xref linkend="datamodel" />
+      chapter, HBase does <emphasis>not</emphasis> overwrite row values, but rather
+      stores different values per row by time (and qualifier).  Excess versions are removed during major
+      compactions.  The number of max versions may need to be increased or decreased depending on application needs.
+      </para>
+      <para>It is not recommended to set the number of max versions to an exceedingly high level (e.g., hundreds or more) unless those old values are 
+      very dear to you, because this will greatly increase StoreFile size. 
+      </para>
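+      <para>As a brief illustration (the family name is illustrative):
+      <programlisting>
+HColumnDescriptor cf = new HColumnDescriptor("cf");
+cf.setMaxVersions(5);    // keep up to five versions instead of the default 3
+      </programlisting>
+      </para>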
+     </section>
+    <section xml:id="schema.minversions">
+    <title>
+    Minimum Number of Versions
+    </title>
+    <para>Like the maximum number of row versions, the minimum number of row versions to keep is configured per column
+      family via <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html">HColumnDescriptor</link>.
+      The default for min versions is 0, which means the feature is disabled.
+      The minimum number of row versions parameter is used together with the time-to-live parameter and can be combined with the
+      number of row versions parameter to allow configurations such as
+      "keep the last T minutes worth of data, at most N versions, <emphasis>but keep at least M versions around</emphasis>"
+      (where M is the value for minimum number of row versions, M&lt;=N).
+      This parameter should only be set when time-to-live is enabled for a column family and must be less than or equal to the 
+      maximum number of row versions.
+    </para>
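+    <para>A sketch of such a configuration, assuming the
+    <code>setMinVersions</code> and <code>setTimeToLive</code> setters on
+    <code>HColumnDescriptor</code> (family name illustrative):
+    <programlisting>
+HColumnDescriptor cf = new HColumnDescriptor("cf");
+cf.setTimeToLive(24 * 60 * 60);  // expire values older than one day...
+cf.setMinVersions(1);            // ...but always keep at least one version
+    </programlisting>
+    </para>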
+    </section>
+  </section>
+  <section xml:id="supported.datatypes">
+  <title>
+  Supported Datatypes
+  </title>
+  <para>HBase supports a "bytes-in/bytes-out" interface via <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Put.html">Put</link> and
+  <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Result.html">Result</link>, so anything that can be
+  converted to an array of bytes can be stored as a value.  Input could be strings, numbers, complex objects, or even images, as long as they can be rendered as bytes.  
+  </para>
+  <para>There are practical limits to the size of values (e.g., storing 10-50MB objects in HBase would probably be too much to ask);
+  search the mailing list for conversations on this topic. All rows in HBase conform to the <link linkend="datamodel">datamodel</link>, and 
+  that includes versioning.  Take that into consideration when making your design, as well as block size for the ColumnFamily.  
+  </para>
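+  <para>For illustration, a minimal round trip through the bytes-in/bytes-out
+  interface; the table, family, and qualifier names are illustrative:
+  <programlisting>
+HTable htable = ...      // instantiate HTable
+
+// Write: any value reduced to bytes, here a long.
+Put put = new Put(Bytes.toBytes("row1"));
+put.add(Bytes.toBytes("cf"), Bytes.toBytes("attr"), Bytes.toBytes(12345L));
+htable.put(put);
+
+// Read: the client interprets the returned bytes.
+Result result = htable.get(new Get(Bytes.toBytes("row1")));
+long value = Bytes.toLong(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("attr")));
+  </programlisting>
+  </para>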
+    <section xml:id="counters">
+      <title>Counters</title>
+      <para>
+      One supported datatype that deserves special mention is "counters" (i.e., the ability to do atomic increments of numbers).  See 
+      <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#increment%28org.apache.hadoop.hbase.client.Increment%29">Increment</link> in HTable.
+      </para>
+      <para>Synchronization on counters is done on the RegionServer, not in the client.
+      </para>
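+      <para>A short sketch of an atomic increment (row, family, and qualifier
+      names are illustrative):
+      <programlisting>
+HTable htable = ...      // instantiate HTable
+
+// Atomically add 1 to the counter cell and return the new value.
+long count = htable.incrementColumnValue(Bytes.toBytes("row1"),
+    Bytes.toBytes("cf"), Bytes.toBytes("hits"), 1L);
+      </programlisting>
+      </para>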
+    </section> 
+  </section>
+  <section xml:id="cf.in.memory">
+  <title>
+  In-Memory ColumnFamilies
+  </title>
+  <para>ColumnFamilies can optionally be defined as in-memory.  Data is still persisted to disk, just like any other ColumnFamily.  
+  In-memory blocks have the highest priority in the <xref linkend="block.cache" />, but there is no guarantee that the entire table
+  will be in memory.
+  </para>
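+  <para>A minimal sketch (family name illustrative):
+  <programlisting>
+HColumnDescriptor cf = new HColumnDescriptor("cf");
+cf.setInMemory(true);    // prioritize this family's blocks in the block cache
+  </programlisting>
+  </para>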
+  <para>See <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html">HColumnDescriptor</link> for more information.
+  </para>
+  </section>
+  <section xml:id="ttl">
+  <title>Time To Live (TTL)</title>
+  <para>ColumnFamilies can set a TTL length in seconds, and HBase will automatically delete rows once the expiration time is reached.
+  This applies to <emphasis>all</emphasis> versions of a row - even the current one.  The TTL time encoded in HBase for the row is specified in UTC.
+  </para>
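+  <para>A minimal sketch (family name and TTL value illustrative):
+  <programlisting>
+HColumnDescriptor cf = new HColumnDescriptor("cf");
+cf.setTimeToLive(7 * 24 * 60 * 60);  // expire values after one week
+  </programlisting>
+  </para>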
+  <para>See <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html">HColumnDescriptor</link> for more information.
+  </para>
+  </section>
+  <section xml:id="schema.bloom">
+  <title>Bloom Filters</title>
+  <para>Bloom Filters can be enabled per-ColumnFamily.
+        Use <code>HColumnDescriptor.setBloomFilterType(NONE | ROW |
+        ROWCOL)</code> to enable blooms per Column Family. Default =
+        <varname>NONE</varname> for no bloom filters. If
+        <varname>ROW</varname>, the hash of the row will be added to the bloom
+        on each insert. If <varname>ROWCOL</varname>, the hash of the row +
+        column family + column family qualifier will be added to the bloom on
+        each key insert.</para>
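+  <para>For example, assuming the <code>StoreFile.BloomType</code> enum taken by
+  this setter (a sketch; the family name is illustrative):
+  <programlisting>
+HColumnDescriptor cf = new HColumnDescriptor("cf");
+cf.setBloomFilterType(StoreFile.BloomType.ROW);  // bloom on row key only
+  </programlisting>
+  </para>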
+  <para>See <link xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html">HColumnDescriptor</link> and 
+  <xref linkend="blooms"/> for more information.
+  </para>
+  </section>
+  <section xml:id="secondary.indexes">
+  <title>
+  Secondary Indexes and Alternate Query Paths
+  </title>
+  <para>This section could also be titled "what if my table rowkey looks like <emphasis>this</emphasis> but I also want to query my table like <emphasis>that</emphasis>."
+  A common example on the dist-list is where a row-key is of the format "user-timestamp" but there are reporting requirements on activity across users for certain 
+  time ranges.  Thus, selecting by user is easy because it is in the lead position of the key, but time is not.
+  </para>
+  <para>There is no single answer on the best way to handle this because it depends on...
+   <itemizedlist>
+       <listitem><para>Number of users</para></listitem>
+       <listitem><para>Data size and data arrival rate</para></listitem>
+       <listitem><para>Flexibility of reporting requirements (e.g., completely ad-hoc date selection vs. pre-configured ranges)</para></listitem>
+       <listitem><para>Desired execution speed of query (e.g., 90 seconds may be reasonable to some for an ad-hoc report, whereas it may be too long for others)</para></listitem>
+   </itemizedlist>
+   ... and solutions are also influenced by the size of the cluster and how much processing power you have to throw at the solution.  
+   Common techniques are in sub-sections below.  This is a representative, but not exhaustive, list of approaches.   
+  </para>
+  <para>It should not be a surprise that secondary indexes require additional cluster space and processing.  
+  This is precisely what happens in an RDBMS because the act of creating an alternate index requires both space and processing cycles to update.  RDBMS products
+  are more advanced in this regard to handle alternative index management out of the box.  However, HBase scales better at larger data volumes, so this is a feature trade-off. 
+  </para>
+  <para>Pay attention to <xref linkend="performance"/> when implementing any of these approaches.</para>
+  <para>Additionally, see the David Butler response in this dist-list thread <link xlink:href="http://search-hadoop.com/m/nvbiBp2TDP/Stargate%252Bhbase&amp;subj=Stargate+hbase">HBase, mail # user - Stargate+hbase</link>
+   </para>
+    <section xml:id="secondary.indexes.filter">
+      <title>
+       Filter Query
+      </title>
+      <para>Depending on the case, it may be appropriate to use <xref linkend="client.filter"/>.  In this case, no secondary index is created.
+      However, don't try a full-scan on a large table like this from an application (i.e., single-threaded client).
+      </para>
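+      <para>A sketch of a filtered scan, using illustrative family, qualifier,
+      and value names:
+      <programlisting>
+HTable htable = ...      // instantiate HTable
+
+Scan scan = new Scan();
+scan.setFilter(new SingleColumnValueFilter(Bytes.toBytes("cf"),
+    Bytes.toBytes("attr"), CompareFilter.CompareOp.EQUAL,
+    Bytes.toBytes("value")));
+for (Result result : htable.getScanner(scan)) {
+  // process Result instance
+}
+      </programlisting>
+      </para>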
     </section>
+    <section xml:id="secondary.indexes.periodic">
+      <title>
+       Periodic-Update Secondary Index
+      </title>
+      <para>A secondary index could be created in another table which is periodically updated via a MapReduce job.  The job could be executed intra-day, but depending on 
+      load strategy it could still potentially be out of sync with the main data table.</para>
+      <para>See <xref linkend="mapreduce.example.readwrite"/> for more information.</para>
+    </section>
+    <section xml:id="secondary.indexes.dualwrite">
+      <title>
+       Dual-Write Secondary Index
+      </title>
+      <para>Another strategy is to build the secondary index while publishing data to the cluster (e.g., write to data table, write to index table). 
+      If this approach is taken after a data table already exists, then bootstrapping will be needed for the secondary index with a MapReduce job (see <xref linkend="secondary.indexes.periodic"/>).</para>
+    </section>
+    <section xml:id="secondary.indexes.summary">
+      <title>
+       Summary Tables
+      </title>
+      <para>Where time-ranges are very wide (e.g., year-long report) and where the data is voluminous, summary tables are a common approach.
+      These would be generated with MapReduce jobs into another table.</para>
+      <para>See <xref linkend="mapreduce.example.summary"/> for more information.</para>
+    </section>
+    <section xml:id="secondary.indexes.coproc">
+      <title>
+       Coprocessor Secondary Index
+      </title>
+      <para>Coprocessors act like RDBMS triggers.  These are currently on TRUNK.
+      </para>
+    </section>
+  </section>
+  <section xml:id="schema.smackdown"><title>Schema Design Smackdown</title>
+    <para>This section will describe common schema design questions that appear on the dist-list.  These are 
+    general guidelines and not laws - each application must consider its own needs.  
+    </para>
+    <section xml:id="schema.smackdown.rowsversions"><title>Rows vs. Versions</title>
+      <para>A common question is whether one should prefer rows or HBase's built-in-versioning.  The context is typically where there are
+      "a lot" of versions of a row to be retained (e.g., where it is significantly above the HBase default of 3 max versions).  The 
+      rows-approach would require storing a timstamp in some portion of the rowkey so that they would not overwite with each successive update.
+      </para>
+      <para>Preference:  Rows (generally speaking).
+      </para>
+    </section>
+    <section xml:id="schema.smackdown.rowscols"><title>Rows vs. Columns</title>
+      <para>Another common question is whether one should prefer rows or columns.  The context is typically in extreme cases of wide
+      tables, such as having 1 row with 1 million attributes, or 1 million rows with 1 column apiece.  
+      </para>
+      <para>Preference:  Rows (generally speaking).  To be clear, this guideline is in the context of extremely wide cases, not in the 
+      standard use-case where one needs to store a few dozen or hundred columns.
+      </para>
+    </section>
+  </section>
 
+  </chapter>   <!--  schema design -->
 
-    <section xml:id="versions">
-      <title>Versions<indexterm><primary>Versions</primary></indexterm></title>
-
-      <para>A <emphasis>{row, column, version} </emphasis>tuple exactly
-      specifies a <literal>cell</literal> in HBase. Its possible to have an
-      unbounded number of cells where the row and column are the same but the
-      cell address differs only in its version dimension.</para>
-
-      <para>While rows and column keys are expressed as bytes, the version is
-      specified using a long integer. Typically this long contains time
-      instances such as those returned by
-      <code>java.util.Date.getTime()</code> or
-      <code>System.currentTimeMillis()</code>, that is: <quote>the difference,
-      measured in milliseconds, between the current time and midnight, January
-      1, 1970 UTC</quote>.</para>
-
-      <para>The HBase version dimension is stored in decreasing order, so that
-      when reading from a store file, the most recent values are found
-      first.</para>
-
-      <para>There is a lot of confusion over the semantics of
-      <literal>cell</literal> versions, in HBase. In particular, a couple
-      questions that often come up are:<itemizedlist>
-          <listitem>
-            <para>If multiple writes to a cell have the same version, are all
-            versions maintained or just the last?<footnote>
-                <para>Currently, only the last written is fetchable.</para>
-              </footnote></para>
-          </listitem>
-
-          <listitem>
-            <para>Is it OK to write cells in a non-increasing version
-            order?<footnote>
-                <para>Yes</para>
-              </footnote></para>
-          </listitem>
-        </itemizedlist></para>
-
-      <para>Below we describe how the version dimension in HBase currently
-      works<footnote>
-          <para>See <link
-          xlink:href="https://issues.apache.org/jira/browse/HBASE-2406">HBASE-2406</link>
-          for discussion of HBase versions. <link
-          xlink:href="http://outerthought.org/blog/417-ot.html">Bending time
-          in HBase</link> makes for a good read on the version, or time,

[... 440 lines stripped ...]