Posted to commits@hbase.apache.org by nd...@apache.org on 2015/08/19 00:35:58 UTC
[06/15] hbase git commit: HBASE-14066 clean out old docbook docs from
branch-1.
http://git-wip-us.apache.org/repos/asf/hbase/blob/0acbff24/src/main/docbkx/rpc.xml
----------------------------------------------------------------------
diff --git a/src/main/docbkx/rpc.xml b/src/main/docbkx/rpc.xml
deleted file mode 100644
index 2e5dd5f..0000000
--- a/src/main/docbkx/rpc.xml
+++ /dev/null
@@ -1,301 +0,0 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<appendix
- xml:id="hbase.rpc"
- version="5.0"
- xmlns="http://docbook.org/ns/docbook"
- xmlns:xlink="http://www.w3.org/1999/xlink"
- xmlns:xi="http://www.w3.org/2001/XInclude"
- xmlns:svg="http://www.w3.org/2000/svg"
- xmlns:m="http://www.w3.org/1998/Math/MathML"
- xmlns:html="http://www.w3.org/1999/xhtml"
- xmlns:db="http://docbook.org/ns/docbook">
- <!--/**
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
--->
-
- <title>0.95 RPC Specification</title>
- <para>In 0.95, all client/server communication is done with <link
- xlink:href="https://code.google.com/p/protobuf/">protobuf’ed</link> Messages rather than
- with <link
- xlink:href="http://hadoop.apache.org/docs/current/api/org/apache/hadoop/io/Writable.html">Hadoop
- Writables</link>. Our RPC wire format therefore changes. This document describes the
- client/server request/response protocol and our new RPC wire-format.</para>
- <para />
- <para>For what RPC is like in 0.94 and previous, see Benoît/Tsuna’s <link
- xlink:href="https://github.com/OpenTSDB/asynchbase/blob/master/src/HBaseRpc.java#L164">Unofficial
- Hadoop / HBase RPC protocol documentation</link>. For more background on how we arrived
- at this spec., see <link
- xlink:href="https://docs.google.com/document/d/1WCKwgaLDqBw2vpux0jPsAu2WPTRISob7HGCO8YhfDTA/edit#">HBase
- RPC: WIP</link></para>
- <para />
- <section>
- <title>Goals</title>
- <para>
- <orderedlist>
- <listitem>
- <para>A wire-format we can evolve</para>
- </listitem>
- <listitem>
-                        <para>A format that does not require rewriting the server core or radically
-                            changing its current architecture (for later).</para>
- </listitem>
- </orderedlist>
- </para>
- </section>
- <section>
- <title>TODO</title>
- <para>
- <orderedlist>
- <listitem>
- <para>List of problems with currently specified format and where we would like
- to go in a version2, etc. For example, what would we have to change if
- anything to move server async or to support streaming/chunking?</para>
- </listitem>
- <listitem>
- <para>Diagram on how it works</para>
- </listitem>
- <listitem>
- <para>A grammar that succinctly describes the wire-format. Currently we have
- these words and the content of the rpc protobuf idl but a grammar for the
-                        back and forth would help with grokking rpc. Also, a little state machine on
- client/server interactions would help with understanding (and ensuring
- correct implementation).</para>
- </listitem>
- </orderedlist>
- </para>
- </section>
- <section>
- <title>RPC</title>
-            <para>The client will send setup information on connection establishment. Thereafter, the client
- invokes methods against the remote server sending a protobuf Message and receiving a
- protobuf Message in response. Communication is synchronous. All back and forth is
- preceded by an int that has the total length of the request/response. Optionally,
- Cells(KeyValues) can be passed outside of protobufs in follow-behind Cell blocks
- (because <link
- xlink:href="https://docs.google.com/document/d/1WEtrq-JTIUhlnlnvA0oYRLp0F8MKpEBeBSCFcQiacdw/edit#">we
- can’t protobuf megabytes of KeyValues</link> or Cells). These CellBlocks are encoded
- and optionally compressed.</para>
- <para />
- <para>For more detail on the protobufs involved, see the <link
- xlink:href="http://svn.apache.org/viewvc/hbase/trunk/hbase-protocol/src/main/protobuf/RPC.proto?view=markup">RPC.proto</link>
- file in trunk.</para>
-
- <section>
- <title>Connection Setup</title>
- <para>Client initiates connection.</para>
- <section>
- <title>Client</title>
- <para>On connection setup, client sends a preamble followed by a connection header. </para>
-
- <section>
- <title><preamble></title>
- <programlisting><MAGIC 4 byte integer> <1 byte RPC Format Version> <1 byte auth type></programlisting>
-                    <para>The auth method is specified here, up front, because the connection header that follows is encoded when authentication is enabled.</para>
-                    <para>For example: HBas0x000x50 -- the 4 MAGIC bytes ‘HBas’, plus one byte of
-                        version (0 in this case), and one byte of auth type (0x50, i.e.
-                        SIMPLE).</para>
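-                    <para>Below is a minimal sketch, not taken from the HBase client source, of what
-                        writing this preamble might look like in Java; the class and variable names
-                        are illustrative only.</para>
-                    <programlisting language="java">
-// Sketch only: write the 6-byte connection preamble described above.
-java.io.DataOutputStream out =
-    new java.io.DataOutputStream(socket.getOutputStream()); // socket: an open connection
-out.write(new byte[] { 'H', 'B', 'a', 's' }); // 4 bytes of MAGIC
-out.writeByte(0);    // 1 byte RPC format version (0 in this example)
-out.writeByte(0x50); // 1 byte auth type; 0x50 is the SIMPLE example from above
-out.flush();
-                    </programlisting>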
- </section>
-
- <section>
- <title><Protobuf ConnectionHeader Message></title>
- <para>Has user info, and “protocol”, as well as the encoders and compression the
- client will use sending CellBlocks. CellBlock encoders and compressors are
- for the life of the connection. CellBlock encoders implement
- org.apache.hadoop.hbase.codec.Codec. CellBlocks may then also be compressed.
- Compressors implement org.apache.hadoop.io.compress.CompressionCodec. This
-                        protobuf is written using writeDelimited, so it is prefaced by a pb varint holding
-                        its serialized length.</para>
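-                    <para>As a sketch only (the generated ConnectionHeader class comes from
-                        RPC.proto and is not reproduced here), sending the header boils down to
-                        protobuf's length-delimited write:</para>
-                    <programlisting language="java">
-// Sketch: writeDelimitedTo prefixes the message with a varint holding its length.
-com.google.protobuf.Message connectionHeader = ...; // user info, service, codec, compressor
-connectionHeader.writeDelimitedTo(socket.getOutputStream());
-                    </programlisting>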
- </section>
- </section>
- <!--Client-->
-
- <section>
- <title>Server</title>
-                <para>After the client sends the preamble and connection header, the server does NOT
-                    respond if connection setup succeeded. No response means the server is READY to
-                    accept requests and to give out responses. If the version or authentication in the
-                    preamble is not agreeable or the server has trouble parsing the preamble, it
-                    will throw a org.apache.hadoop.hbase.ipc.FatalConnectionException explaining the
-                    error and will then disconnect. If the client in the connection header -- i.e.
-                    the protobuf’d Message that comes after the connection preamble -- asks for
-                    a Service the server does not support or a codec the server does not have, again
-                    we throw a FatalConnectionException with explanation.</para>
- </section>
- </section>
-
- <section>
- <title>Request</title>
- <para>After a Connection has been set up, client makes requests. Server responds.</para>
- <para>A request is made up of a protobuf RequestHeader followed by a protobuf Message
- parameter. The header includes the method name and optionally, metadata on the
- optional CellBlock that may be following. The parameter type suits the method being
- invoked: i.e. if we are doing a getRegionInfo request, the protobuf Message param
- will be an instance of GetRegionInfoRequest. The response will be a
-                GetRegionInfoResponse. The CellBlock is optionally used for ferrying the bulk of the RPC
-                data: i.e. Cells/KeyValues.</para>
- <section>
- <title>Request Parts</title>
- <section>
- <title><Total Length></title>
- <para>The request is prefaced by an int that holds the total length of what
- follows.</para>
- </section>
- <section>
- <title><Protobuf RequestHeader Message></title>
-                    <para>Will have call.id, trace.id, and method name, etc., including optional
-                        metadata on the CellBlock IFF one is following. Data is protobuf’d inline
-                        in this pb Message or optionally comes in the following CellBlock.</para>
- </section>
- <section>
- <title><Protobuf Param Message></title>
- <para>If the method being invoked is getRegionInfo, if you study the Service
- descriptor for the client to regionserver protocol, you will find that the
- request sends a GetRegionInfoRequest protobuf Message param in this
- position.</para>
- </section>
- <section>
- <title><CellBlock></title>
- <para>An encoded and optionally compressed Cell block.</para>
- </section>
- </section>
- <!--Request parts-->
- </section>
- <!--Request-->
-
- <section>
- <title>Response</title>
- <para>Same as Request, it is a protobuf ResponseHeader followed by a protobuf Message
- response where the Message response type suits the method invoked. Bulk of the data
- may come in a following CellBlock.</para>
- <section>
- <title>Response Parts</title>
- <section>
- <title><Total Length></title>
- <para>The response is prefaced by an int that holds the total length of what
- follows.</para>
- </section>
- <section>
- <title><Protobuf ResponseHeader Message></title>
-                    <para>Will have call.id, etc. Will include an exception if processing failed.
-                        Optionally includes metadata on the CellBlock, IFF there is one
-                        following.</para>
- </section>
-
- <section>
- <title><Protobuf Response Message></title>
-                    <para>The return value, which may be nothing if an exception was thrown. If the method being invoked is
- getRegionInfo, if you study the Service descriptor for the client to
- regionserver protocol, you will find that the response sends a
- GetRegionInfoResponse protobuf Message param in this position.</para>
- </section>
- <section>
- <title><CellBlock></title>
- <para>An encoded and optionally compressed Cell block.</para>
- </section>
- </section>
- <!--Parts-->
- </section>
- <!--Response-->
-
- <section>
- <title>Exceptions</title>
-                <para>There are two distinct types. The first is a request failure, which is encapsulated
-                    inside the response header for that response; the connection stays open to receive
-                    new requests. The second type, the FatalConnectionException, kills the
-                    connection.</para>
- <para>Exceptions can carry extra information. See the ExceptionResponse protobuf type.
- It has a flag to indicate do-no-retry as well as other miscellaneous payload to help
- improve client responsiveness.</para>
- </section>
- <section>
- <title>CellBlocks</title>
-                <para>These are not versioned. The server can either do the codec or it cannot. If a new
-                    version of a codec is needed -- say, one with tighter encoding -- give it a new class
-                    name. Codecs will live on the server for all time so old clients can connect.</para>
- </section>
- </section>
-
-
- <section>
- <title>Notes</title>
- <section>
- <title>Constraints</title>
-                <para>In part, the current wire-format -- i.e. all requests and responses preceded by
-                    a length -- has been dictated by the current non-async server architecture.</para>
- </section>
- <section>
- <title>One fat pb request or header+param</title>
- <para>We went with pb header followed by pb param making a request and a pb header
- followed by pb response for now. Doing header+param rather than a single protobuf
- Message with both header and param content:</para>
- <para>
- <orderedlist>
- <listitem>
- <para>Is closer to what we currently have</para>
- </listitem>
- <listitem>
-                        <para>Having a single fat pb requires extra copying to put the already pb’d
-                            param into the body of the fat request pb (and the same when making the
-                            result)</para>
- </listitem>
- <listitem>
- <para>We can decide whether to accept the request or not before we read the
- param; for example, the request might be low priority. As is, we read
- header+param in one go as server is currently implemented so this is a
- TODO.</para>
- </listitem>
- </orderedlist>
- </para>
-                <para>The advantages are minor. If, later, the fat request shows a clear advantage, we can
-                    roll out a v2.</para>
- </section>
- <section
- xml:id="rpc.configs">
- <title>RPC Configurations</title>
- <section>
- <title>CellBlock Codecs</title>
- <para>To enable a codec other than the default <classname>KeyValueCodec</classname>,
- set <varname>hbase.client.rpc.codec</varname> to the name of the Codec class to
-                        use. The Codec must implement HBase's <classname>Codec</classname> interface. After
-                        connection setup, all passed cellblocks will be sent with this codec. The server
-                        will return cellblocks using this same codec as long as the codec is on the
-                        server's CLASSPATH (else you will get
-                        <classname>UnsupportedCellCodecException</classname>).</para>
- <para>To change the default codec, set
- <varname>hbase.client.default.rpc.codec</varname>. </para>
- <para>To disable cellblocks completely and to go pure protobuf, set the default to
- the empty String and do not specify a codec in your Configuration. So, set
- <varname>hbase.client.default.rpc.codec</varname> to the empty string and do
- not set <varname>hbase.client.rpc.codec</varname>. This will cause the client to
- connect to the server with no codec specified. If a server sees no codec, it
- will return all responses in pure protobuf. Running pure protobuf all the time
- will be slower than running with cellblocks. </para>
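-                    <para>A minimal sketch of the pure-protobuf setup just described (configuration
-                        keys as above; no codec class is named):</para>
-                    <programlisting language="java">
-// Sketch: disable cellblocks entirely so all responses come back as pure protobuf.
-Configuration conf = HBaseConfiguration.create();
-conf.set("hbase.client.default.rpc.codec", ""); // empty default codec
-// ...and do not set hbase.client.rpc.codec at all
-                    </programlisting>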
- </section>
- <section>
- <title>Compression</title>
-                    <para>Uses Hadoop's compression codecs. To enable compressing of passed CellBlocks,
-                        set <varname>hbase.client.rpc.compressor</varname> to the name of the Compressor
-                        to use. The Compressor must implement Hadoop's CompressionCodec interface. After
- connection setup, all passed cellblocks will be sent compressed. The server will
- return cellblocks compressed using this same compressor as long as the
- compressor is on its CLASSPATH (else you will get
- <classname>UnsupportedCompressionCodecException</classname>).</para>
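-                    <para>A sketch of enabling cellblock compression; GzipCodec is used here purely
-                        as an illustrative Hadoop CompressionCodec implementation.</para>
-                    <programlisting language="java">
-// Sketch: compress cellblocks with a Hadoop CompressionCodec implementation.
-Configuration conf = HBaseConfiguration.create();
-conf.set("hbase.client.rpc.compressor",
-    "org.apache.hadoop.io.compress.GzipCodec");
-                    </programlisting>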
- </section>
- </section>
- </section>
-</appendix>
http://git-wip-us.apache.org/repos/asf/hbase/blob/0acbff24/src/main/docbkx/schema_design.xml
----------------------------------------------------------------------
diff --git a/src/main/docbkx/schema_design.xml b/src/main/docbkx/schema_design.xml
deleted file mode 100644
index 65e64b0..0000000
--- a/src/main/docbkx/schema_design.xml
+++ /dev/null
@@ -1,1247 +0,0 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<chapter
- version="5.0"
- xml:id="schema"
- xmlns="http://docbook.org/ns/docbook"
- xmlns:xlink="http://www.w3.org/1999/xlink"
- xmlns:xi="http://www.w3.org/2001/XInclude"
- xmlns:svg="http://www.w3.org/2000/svg"
- xmlns:m="http://www.w3.org/1998/Math/MathML"
- xmlns:html="http://www.w3.org/1999/xhtml"
- xmlns:db="http://docbook.org/ns/docbook">
- <!--
-/**
- *
- * Licensed to the Apache Software Foundation (ASF) under one
- * or more contributor license agreements. See the NOTICE file
- * distributed with this work for additional information
- * regarding copyright ownership. The ASF licenses this file
- * to you under the Apache License, Version 2.0 (the
- * "License"); you may not use this file except in compliance
- * with the License. You may obtain a copy of the License at
- *
- * http://www.apache.org/licenses/LICENSE-2.0
- *
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
- */
--->
- <title>HBase and Schema Design</title>
-  <para>A good general introduction to the strengths and weaknesses of modelling on the various
-    non-RDBMS datastores is Ian Varley's Master's thesis, <link
- xlink:href="http://ianvarley.com/UT/MR/Varley_MastersReport_Full_2009-08-07.pdf">No Relation:
- The Mixed Blessings of Non-Relational Databases</link>. Recommended. Also, read <xref
- linkend="keyvalue" /> for how HBase stores data internally, and the section on <xref
- linkend="schema.casestudies" />. </para>
- <section
- xml:id="schema.creation">
- <title> Schema Creation </title>
- <para>HBase schemas can be created or updated with <xref
- linkend="shell" /> or by using <link
- xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HBaseAdmin.html">HBaseAdmin</link>
- in the Java API. </para>
- <para>Tables must be disabled when making ColumnFamily modifications, for example:</para>
- <programlisting language="java">
-Configuration config = HBaseConfiguration.create();
-HBaseAdmin admin = new HBaseAdmin(config);
-String table = "myTable";
-
-admin.disableTable(table);
-
-HColumnDescriptor cf1 = ...;
-admin.addColumn(table, cf1); // adding new ColumnFamily
-HColumnDescriptor cf2 = ...;
-admin.modifyColumn(table, cf2); // modifying existing ColumnFamily
-
-admin.enableTable(table);
- </programlisting>
- <para>See <xref
- linkend="client_dependencies" /> for more information about configuring client
- connections.</para>
- <para>Note: online schema changes are supported in the 0.92.x codebase, but the 0.90.x codebase
- requires the table to be disabled. </para>
- <section
- xml:id="schema.updates">
- <title>Schema Updates</title>
- <para>When changes are made to either Tables or ColumnFamilies (e.g., region size, block
- size), these changes take effect the next time there is a major compaction and the
- StoreFiles get re-written. </para>
- <para>See <xref
- linkend="store" /> for more information on StoreFiles. </para>
- </section>
- </section>
- <section
- xml:id="number.of.cfs">
- <title> On the number of column families </title>
-    <para> HBase currently does not do well with anything above two or three column families, so keep
-      the number of column families in your schema low. Currently, flushing and compactions are done
-      on a per-Region basis, so if one column family is carrying the bulk of the data and bringing on
-      flushes, the adjacent families will also be flushed even though the amount of data they carry is
-      small. When there are many column families, this flushing and compaction interaction can make for
-      a bunch of needless i/o (to be addressed by changing flushing and compaction to work on a
-      per-column-family basis). For more information on compactions, see <xref
-      linkend="compaction" />. </para>
- <para>Try to make do with one column family if you can in your schemas. Only introduce a second
- and third column family in the case where data access is usually column scoped; i.e. you query
- one column family or the other but usually not both at the one time. </para>
- <section
- xml:id="number.of.cfs.card">
- <title>Cardinality of ColumnFamilies</title>
- <para>Where multiple ColumnFamilies exist in a single table, be aware of the cardinality
- (i.e., number of rows). If ColumnFamilyA has 1 million rows and ColumnFamilyB has 1 billion
- rows, ColumnFamilyA's data will likely be spread across many, many regions (and
- RegionServers). This makes mass scans for ColumnFamilyA less efficient. </para>
- </section>
- </section>
- <section
- xml:id="rowkey.design">
- <title>Rowkey Design</title>
- <section>
- <title>Hotspotting</title>
- <para>Rows in HBase are sorted lexicographically by row key. This design optimizes for scans,
- allowing you to store related rows, or rows that will be read together, near each other.
- However, poorly designed row keys are a common source of <firstterm>hotspotting</firstterm>.
- Hotspotting occurs when a large amount of client traffic is directed at one node, or only a
- few nodes, of a cluster. This traffic may represent reads, writes, or other operations. The
- traffic overwhelms the single machine responsible for hosting that region, causing
- performance degradation and potentially leading to region unavailability. This can also have
- adverse effects on other regions hosted by the same region server as that host is unable to
- service the requested load. It is important to design data access patterns such that the
- cluster is fully and evenly utilized.</para>
- <para>To prevent hotspotting on writes, design your row keys such that rows that truly do need
- to be in the same region are, but in the bigger picture, data is being written to multiple
- regions across the cluster, rather than one at a time. Some common techniques for avoiding
- hotspotting are described below, along with some of their advantages and drawbacks.</para>
- <formalpara>
- <title>Salting</title>
- <para>Salting in this sense has nothing to do with cryptography, but refers to adding random
- data to the start of a row key. In this case, salting refers to adding a randomly-assigned
- prefix to the row key to cause it to sort differently than it otherwise would. The number
-          of possible prefixes corresponds to the number of regions you want to spread the data
- across. Salting can be helpful if you have a few "hot" row key patterns which come up over
- and over amongst other more evenly-distributed rows. Consider the following example, which
- shows that salting can spread write load across multiple regionservers, and illustrates
- some of the negative implications for reads.</para>
- </formalpara>
- <example>
- <title>Salting Example</title>
- <para>Suppose you have the following list of row keys, and your table is split such that
- there is one region for each letter of the alphabet. Prefix 'a' is one region, prefix 'b'
- is another. In this table, all rows starting with 'f' are in the same region. This example
- focuses on rows with keys like the following:</para>
- <screen>
-foo0001
-foo0002
-foo0003
-foo0004
- </screen>
- <para>Now, imagine that you would like to spread these across four different regions. You
- decide to use four different salts: <literal>a</literal>, <literal>b</literal>,
- <literal>c</literal>, and <literal>d</literal>. In this scenario, each of these letter
- prefixes will be on a different region. After applying the salts, you have the following
- rowkeys instead. Since you can now write to four separate regions, you theoretically have
- four times the throughput when writing that you would have if all the writes were going to
- the same region.</para>
- <screen>
-a-foo0003
-b-foo0001
-c-foo0004
-d-foo0002
- </screen>
- <para>Then, if you add another row, it will randomly be assigned one of the four possible
- salt values and end up near one of the existing rows.</para>
- <screen>
-a-foo0003
-b-foo0001
-<emphasis>c-foo0003</emphasis>
-c-foo0004
-d-foo0002
- </screen>
- <para>Since this assignment will be random, you will need to do more work if you want to
- retrieve the rows in lexicographic order. In this way, salting attempts to increase
- throughput on writes, but has a cost during reads.</para>
- </example>
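-      <para>A minimal sketch of the salting approach above; the names and values below are
-        illustrative and not part of HBase.</para>
-      <programlisting language="java">
-// Sketch: prepend one of a fixed set of salts, chosen at random, to each rowkey.
-String[] salts = { "a", "b", "c", "d" };
-String salt = salts[new java.util.Random().nextInt(salts.length)];
-byte[] saltedKey = Bytes.toBytes(salt + "-" + "foo0003"); // e.g. "c-foo0003"
-      </programlisting>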
- <para></para>
- <formalpara>
- <title>Hashing</title>
- <para>Instead of a random assignment, you could use a one-way <firstterm>hash</firstterm>
- that would cause a given row to always be "salted" with the same prefix, in a way that
- would spread the load across the regionservers, but allow for predictability during reads.
- Using a deterministic hash allows the client to reconstruct the complete rowkey and use a
- Get operation to retrieve that row as normal.</para>
- </formalpara>
- <example>
- <title>Hashing Example</title>
- <para>Given the same situation in the salting example above, you could instead apply a
- one-way hash that would cause the row with key <literal>foo0003</literal> to always, and
- predictably, receive the <literal>a</literal> prefix. Then, to retrieve that row, you
- would already know the key. You could also optimize things so that certain pairs of keys
- were always in the same region, for instance.</para>
- </example>
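-      <para>A sketch of the deterministic alternative: derive the prefix from the key itself so
-        that the same key always lands in the same bucket and can be re-derived on read. The
-        bucket count and helper logic are illustrative only.</para>
-      <programlisting language="java">
-// Sketch: a one-way, deterministic "salt" computed from the key itself.
-String key = "foo0003";
-int bucket = Math.floorMod(key.hashCode(), 4);        // 4 buckets, as in the example above
-byte[] hashedKey = Bytes.toBytes(bucket + "-" + key); // same key, same prefix, every time
-      </programlisting>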
- <formalpara>
- <title>Reversing the Key</title>
- <para>A third common trick for preventing hotspotting is to reverse a fixed-width or numeric
- row key so that the part that changes the most often (the least significant digit) is first.
- This effectively randomizes row keys, but sacrifices row ordering properties.</para>
- </formalpara>
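-      <para>A quick sketch of key reversal for a fixed-width key (illustrative values):</para>
-      <programlisting language="java">
-// Sketch: reverse the key so the fastest-changing digits lead; ordering by key is lost.
-String key = "0000012345";
-String reversedKey = new StringBuilder(key).reverse().toString(); // "5432100000"
-      </programlisting>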
- <para>See <link
- xlink:href="https://communities.intel.com/community/itpeernetwork/datastack/blog/2013/11/10/discussion-on-designing-hbase-tables"
- >https://communities.intel.com/community/itpeernetwork/datastack/blog/2013/11/10/discussion-on-designing-hbase-tables</link>,
- and <link xlink:href="http://phoenix.apache.org/salted.html">article on Salted Tables</link>
- from the Phoenix project, and the discussion in the comments of <link
- xlink:href="https://issues.apache.org/jira/browse/HBASE-11682">HBASE-11682</link> for more
- information about avoiding hotspotting.</para>
- </section>
- <section
- xml:id="timeseries">
- <title> Monotonically Increasing Row Keys/Timeseries Data </title>
- <para> In the HBase chapter of Tom White's book <link
- xlink:href="http://oreilly.com/catalog/9780596521981">Hadoop: The Definitive Guide</link>
-        (O'Reilly) there is an optimization note on watching out for a phenomenon where an import
- process walks in lock-step with all clients in concert pounding one of the table's regions
- (and thus, a single node), then moving onto the next region, etc. With monotonically
- increasing row-keys (i.e., using a timestamp), this will happen. See this comic by IKai Lan
- on why monotonically increasing row keys are problematic in BigTable-like datastores: <link
- xlink:href="http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/">monotonically
- increasing values are bad</link>. The pile-up on a single region brought on by
- monotonically increasing keys can be mitigated by randomizing the input records to not be in
- sorted order, but in general it's best to avoid using a timestamp or a sequence (e.g. 1, 2,
- 3) as the row-key. </para>
- <para>If you do need to upload time series data into HBase, you should study <link
- xlink:href="http://opentsdb.net/">OpenTSDB</link> as a successful example. It has a page
- describing the <link
-          xlink:href="http://opentsdb.net/schema.html">schema</link> it uses in HBase. The key
- format in OpenTSDB is effectively [metric_type][event_timestamp], which would appear at
- first glance to contradict the previous advice about not using a timestamp as the key.
- However, the difference is that the timestamp is not in the <emphasis>lead</emphasis>
- position of the key, and the design assumption is that there are dozens or hundreds (or
- more) of different metric types. Thus, even with a continual stream of input data with a mix
- of metric types, the Puts are distributed across various points of regions in the table. </para>
- <para>See <xref
- linkend="schema.casestudies" /> for some rowkey design examples. </para>
- </section>
- <section
- xml:id="keysize">
- <title>Try to minimize row and column sizes</title>
- <subtitle>Or why are my StoreFile indices large?</subtitle>
- <para>In HBase, values are always freighted with their coordinates; as a cell value passes
- through the system, it'll be accompanied by its row, column name, and timestamp - always. If
- your rows and column names are large, especially compared to the size of the cell value,
- then you may run up against some interesting scenarios. One such is the case described by
- Marc Limotte at the tail of <link
- xlink:href="https://issues.apache.org/jira/browse/HBASE-3551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005272#comment-13005272">HBASE-3551</link>
- (recommended!). Therein, the indices that are kept on HBase storefiles (<xref
-        linkend="hfile" />) to facilitate random access may end up occupying large chunks of the
-      HBase allotted RAM because the cell value coordinates are large. Marc in the above-cited
-      comment suggests upping the block size so entries in the store file index happen at a larger
-      interval, or modifying the table schema so it makes for smaller rows and column names.
- Compression will also make for larger indices. See the thread <link
- xlink:href="http://search-hadoop.com/m/hemBv1LiN4Q1/a+question+storefileIndexSize&subj=a+question+storefileIndexSize">a
- question storefileIndexSize</link> up on the user mailing list. </para>
- <para>Most of the time small inefficiencies don't matter all that much. Unfortunately, this is
-      a case where they do. Whatever patterns are selected for ColumnFamilies, attributes, and
-      rowkeys, they could be repeated several billion times in your data. </para>
-    <para>See <xref
-        linkend="keyvalue" /> for more information on how HBase stores data internally, and to see why this
-      is important.</para>
- <section
- xml:id="keysize.cf">
- <title>Column Families</title>
- <para>Try to keep the ColumnFamily names as small as possible, preferably one character
- (e.g. "d" for data/default). </para>
-      <para>See <xref
-          linkend="keyvalue" /> for more information on how HBase stores data internally, and to see why
-        this is important.</para>
- </section>
- <section
- xml:id="keysize.attributes">
- <title>Attributes</title>
- <para>Although verbose attribute names (e.g., "myVeryImportantAttribute") are easier to
- read, prefer shorter attribute names (e.g., "via") to store in HBase. </para>
-      <para>See <xref
-          linkend="keyvalue" /> for more information on how HBase stores data internally, and to see why
-        this is important.</para>
- </section>
- <section
- xml:id="keysize.row">
- <title>Rowkey Length</title>
- <para>Keep them as short as is reasonable such that they can still be useful for required
- data access (e.g., Get vs. Scan). A short key that is useless for data access is not
- better than a longer key with better get/scan properties. Expect tradeoffs when designing
- rowkeys. </para>
- </section>
- <section
- xml:id="keysize.patterns">
- <title>Byte Patterns</title>
- <para>A long is 8 bytes. You can store an unsigned number up to 18,446,744,073,709,551,615
- in those eight bytes. If you stored this number as a String -- presuming a byte per
- character -- you need nearly 3x the bytes. </para>
- <para>Not convinced? Below is some sample code that you can run on your own.</para>
- <programlisting language="java">
-// long
-//
-long l = 1234567890L;
-byte[] lb = Bytes.toBytes(l);
-System.out.println("long bytes length: " + lb.length); // returns 8
-
-String s = "" + l;
-byte[] sb = Bytes.toBytes(s);
-System.out.println("long as string length: " + sb.length); // returns 10
-
-// hash
-//
-MessageDigest md = MessageDigest.getInstance("MD5");
-byte[] digest = md.digest(Bytes.toBytes(s));
-System.out.println("md5 digest bytes length: " + digest.length); // returns 16
-
-String sDigest = new String(digest);
-byte[] sbDigest = Bytes.toBytes(sDigest);
-System.out.println("md5 digest as string length: " + sbDigest.length); // returns 26
- </programlisting>
- <para>Unfortunately, using a binary representation of a type will make your data harder to
- read outside of your code. For example, this is what you will see in the shell when you
- increment a value:</para>
- <programlisting>
-hbase(main):001:0> incr 't', 'r', 'f:q', 1
-COUNTER VALUE = 1
-
-hbase(main):002:0> get 't', 'r'
-COLUMN CELL
- f:q timestamp=1369163040570, value=\x00\x00\x00\x00\x00\x00\x00\x01
-1 row(s) in 0.0310 seconds
- </programlisting>
-      <para>The shell makes a best effort to print a string, and in this case it decided to just
- print the hex. The same will happen to your row keys inside the region names. It can be
- okay if you know what's being stored, but it might also be unreadable if arbitrary data
- can be put in the same cells. This is the main trade-off. </para>
- </section>
-
- </section>
- <section
- xml:id="reverse.timestamp">
- <title>Reverse Timestamps</title>
- <note>
- <title>Reverse Scan API</title>
- <para>
- <link
- xlink:href="https://issues.apache.org/jira/browse/HBASE-4811">HBASE-4811</link>
- implements an API to scan a table or a range within a table in reverse, reducing the need
- to optimize your schema for forward or reverse scanning. This feature is available in
- HBase 0.98 and later. See <link
- xlink:href="https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setReversed%28boolean" />
- for more information. </para>
- </note>
-
- <para>A common problem in database processing is quickly finding the most recent version of a
- value. A technique using reverse timestamps as a part of the key can help greatly with a
- special case of this problem. Also found in the HBase chapter of Tom White's book Hadoop:
- The Definitive Guide (O'Reilly), the technique involves appending (<code>Long.MAX_VALUE -
- timestamp</code>) to the end of any key, e.g., [key][reverse_timestamp]. </para>
- <para>The most recent value for [key] in a table can be found by performing a Scan for [key]
- and obtaining the first record. Since HBase keys are in sorted order, this key sorts before
- any older row-keys for [key] and thus is first. </para>
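-    <para>A sketch of building such a key; <code>key</code> and <code>timestamp</code> are
-      placeholders.</para>
-    <programlisting language="java">
-// Sketch: [key][reverse_timestamp], so the newest version of "key" sorts first.
-byte[] rowkey = Bytes.add(
-    Bytes.toBytes(key),
-    Bytes.toBytes(Long.MAX_VALUE - timestamp));
-    </programlisting>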
- <para>This technique would be used instead of using <xref
- linkend="schema.versions" /> where the intent is to hold onto all versions "forever" (or a
- very long time) and at the same time quickly obtain access to any other version by using the
- same Scan technique. </para>
- </section>
- <section
- xml:id="rowkey.scope">
- <title>Rowkeys and ColumnFamilies</title>
- <para>Rowkeys are scoped to ColumnFamilies. Thus, the same rowkey could exist in each
- ColumnFamily that exists in a table without collision. </para>
- </section>
- <section
- xml:id="changing.rowkeys">
- <title>Immutability of Rowkeys</title>
- <para>Rowkeys cannot be changed. The only way they can be "changed" in a table is if the row
- is deleted and then re-inserted. This is a fairly common question on the HBase dist-list so
- it pays to get the rowkeys right the first time (and/or before you've inserted a lot of
- data). </para>
- </section>
- <section
- xml:id="rowkey.regionsplits">
- <title>Relationship Between RowKeys and Region Splits</title>
- <para>If you pre-split your table, it is <emphasis>critical</emphasis> to understand how your
- rowkey will be distributed across the region boundaries. As an example of why this is
- important, consider the example of using displayable hex characters as the lead position of
- the key (e.g., "0000000000000000" to "ffffffffffffffff"). Running those key ranges through
- <code>Bytes.split</code> (which is the split strategy used when creating regions in
-        <code>HBaseAdmin.createTable(byte[] startKey, byte[] endKey, numRegions)</code>) for 10
-      regions will generate the following splits...</para>
- <screen>
-48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 // 0
-54 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 // 6
-61 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -68 // =
-68 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -126 // D
-75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 72 // K
-82 18 18 18 18 18 18 18 18 18 18 18 18 18 18 14 // R
-88 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -44 // X
-95 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -102 // _
-102 102 102 102 102 102 102 102 102 102 102 102 102 102 102 102 // f
- </screen>
- <para>... (note: the lead byte is listed to the right as a comment.) Given that the first
- split is a '0' and the last split is an 'f', everything is great, right? Not so fast. </para>
- <para>The problem is that all the data is going to pile up in the first 2 regions and the last
- region thus creating a "lumpy" (and possibly "hot") region problem. To understand why, refer
- to an <link
- xlink:href="http://www.asciitable.com">ASCII Table</link>. '0' is byte 48, and 'f' is byte
- 102, but there is a huge gap in byte values (bytes 58 to 96) that will <emphasis>never
- appear in this keyspace</emphasis> because the only values are [0-9] and [a-f]. Thus, the
-      middle regions will never be used. To make pre-splitting work with this example
-      keyspace, a custom definition of splits (i.e., not relying on the built-in split method)
- is required. </para>
- <para>Lesson #1: Pre-splitting tables is generally a best practice, but you need to pre-split
- them in such a way that all the regions are accessible in the keyspace. While this example
- demonstrated the problem with a hex-key keyspace, the same problem can happen with
- <emphasis>any</emphasis> keyspace. Know your data. </para>
- <para>Lesson #2: While generally not advisable, using hex-keys (and more generally,
- displayable data) can still work with pre-split tables as long as all the created regions
- are accessible in the keyspace. </para>
- <para>To conclude this example, the following is an example of how appropriate splits can be
-      pre-created for hex-keys: </para>
- <programlisting language="java"><![CDATA[public static boolean createTable(HBaseAdmin admin, HTableDescriptor table, byte[][] splits)
-throws IOException {
- try {
- admin.createTable( table, splits );
- return true;
- } catch (TableExistsException e) {
- logger.info("table " + table.getNameAsString() + " already exists");
- // the table already exists...
- return false;
- }
-}
-
-public static byte[][] getHexSplits(String startKey, String endKey, int numRegions) {
- byte[][] splits = new byte[numRegions-1][];
- BigInteger lowestKey = new BigInteger(startKey, 16);
- BigInteger highestKey = new BigInteger(endKey, 16);
- BigInteger range = highestKey.subtract(lowestKey);
- BigInteger regionIncrement = range.divide(BigInteger.valueOf(numRegions));
- lowestKey = lowestKey.add(regionIncrement);
- for(int i=0; i < numRegions-1;i++) {
- BigInteger key = lowestKey.add(regionIncrement.multiply(BigInteger.valueOf(i)));
- byte[] b = String.format("%016x", key).getBytes();
- splits[i] = b;
- }
- return splits;
-}]]></programlisting>
- </section>
- </section>
- <!-- rowkey design -->
- <section
- xml:id="schema.versions">
- <title> Number of Versions </title>
- <section
- xml:id="schema.versions.max">
- <title>Maximum Number of Versions</title>
- <para>The maximum number of row versions to store is configured per column family via <link
- xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html">HColumnDescriptor</link>.
-        The default for max versions is 1. This is an important parameter because, as described in the <xref
-          linkend="datamodel" /> section, HBase does <emphasis>not</emphasis> overwrite row values,
- but rather stores different values per row by time (and qualifier). Excess versions are
- removed during major compactions. The number of max versions may need to be increased or
- decreased depending on application needs. </para>
-      <para>It is not recommended to set the number of max versions to an exceedingly high level
-        (e.g., hundreds or more) unless those old values are very dear to you, because this will
-        greatly increase StoreFile size. </para>
- </section>
- <section
- xml:id="schema.minversions">
- <title> Minimum Number of Versions </title>
- <para>Like maximum number of row versions, the minimum number of row versions to keep is
- configured per column family via <link
- xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html">HColumnDescriptor</link>.
- The default for min versions is 0, which means the feature is disabled. The minimum number
- of row versions parameter is used together with the time-to-live parameter and can be
- combined with the number of row versions parameter to allow configurations such as "keep the
- last T minutes worth of data, at most N versions, <emphasis>but keep at least M versions
- around</emphasis>" (where M is the value for minimum number of row versions, M<N). This
- parameter should only be set when time-to-live is enabled for a column family and must be
- less than the number of row versions. </para>
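-      <para>A sketch of the "keep the last T minutes worth of data, at most N versions, but at
-        least M versions" configuration described above, using the HColumnDescriptor setters;
-        the family name and values are illustrative.</para>
-      <programlisting language="java">
-HColumnDescriptor cf = new HColumnDescriptor("d");
-cf.setTimeToLive(24 * 60 * 60); // T: keep a day's worth of data (seconds)
-cf.setMaxVersions(5);           // N: at most 5 versions
-cf.setMinVersions(1);           // M: but always keep at least 1 version
-      </programlisting>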
- </section>
- </section>
- <section
- xml:id="supported.datatypes">
- <title> Supported Datatypes </title>
- <para>HBase supports a "bytes-in/bytes-out" interface via <link
- xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Put.html">Put</link>
- and <link
- xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Result.html">Result</link>,
- so anything that can be converted to an array of bytes can be stored as a value. Input could
-    be strings, numbers, complex objects, or even images, as long as they can be rendered as bytes. </para>
- <para>There are practical limits to the size of values (e.g., storing 10-50MB objects in HBase
-    would probably be too much to ask); search the mailing list for conversations on this topic.
- All rows in HBase conform to the <xref
- linkend="datamodel" />, and that includes versioning. Take that into consideration when
- making your design, as well as block size for the ColumnFamily. </para>
-
- <section
- xml:id="counters">
- <title>Counters</title>
-      <para> One supported datatype that deserves special mention is "counters" (i.e., the ability
- to do atomic increments of numbers). See <link
- xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#increment%28org.apache.hadoop.hbase.client.Increment%29">Increment</link>
- in HTable. </para>
-      <para>Synchronization on counters is done on the RegionServer, not in the client. </para>
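-      <para>A brief sketch of an increment; the table, family, and qualifier names here are
-        illustrative.</para>
-      <programlisting language="java">
-// Sketch: atomically add 1 to a counter cell and get the new value back.
-Configuration config = HBaseConfiguration.create();
-HTable table = new HTable(config, "counters");
-long hits = table.incrementColumnValue(
-    Bytes.toBytes("row1"), Bytes.toBytes("d"), Bytes.toBytes("hits"), 1L);
-      </programlisting>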
- </section>
- </section>
- <section
- xml:id="schema.joins">
- <title>Joins</title>
- <para>If you have multiple tables, don't forget to factor in the potential for <xref
- linkend="joins" /> into the schema design. </para>
- </section>
- <section
- xml:id="ttl">
- <title>Time To Live (TTL)</title>
- <para> ColumnFamilies can set a TTL length in seconds, and HBase will automatically delete rows
- once the expiration time is reached. This applies to <emphasis>all</emphasis> versions of a
-    row - even the current one. The TTL time encoded in HBase for the row is specified in UTC.
- </para>
-  <para> Store files which contain only expired rows are deleted on minor compaction. Setting
- <varname>hbase.store.delete.expired.storefile</varname> to <code>false</code> disables this
- feature. Setting <link linkend="schema.minversions">minimum number of versions</link> to other
- than 0 also disables this.</para>
- <para> See <link
- xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html"
- >HColumnDescriptor</link> for more information. </para>
- </section>
- <section
- xml:id="cf.keep.deleted">
- <title> Keeping Deleted Cells </title>
- <para>By default, delete markers extend back to the beginning of time. Therefore, <link
- xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html">Get</link>
- or <link
- xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html">Scan</link>
- operations will not see a deleted cell (row or column), even when the Get or Scan operation
- indicates a time range
- before the delete marker was placed.</para>
- <para>ColumnFamilies can optionally keep deleted cells. In this case, deleted cells can still be
- retrieved, as long as these operations specify a time range that ends before the timestamp of
- any delete that would affect the cells. This allows for point-in-time queries even in the
- presence of deletes. </para>
- <para> Deleted cells are still subject to TTL and there will never be more than "maximum number
-    of versions" deleted cells. A new "raw" scan option returns all deleted rows and the delete
- markers. </para>
- <example>
- <title>Change the Value of <code>KEEP_DELETED_CELLS</code> Using HBase Shell</title>
-    <screen>hbase> alter 't1', NAME => 'f1', KEEP_DELETED_CELLS => true</screen>
- </example>
- <example>
- <title>Change the Value of <code>KEEP_DELETED_CELLS</code> Using the API</title>
- <programlisting language="java">...
-HColumnDescriptor cf = new HColumnDescriptor("f1");
-cf.setKeepDeletedCells(true);
-...
- </programlisting>
- </example>
- <para>See the API documentation for <link
- xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/HColumnDescriptor.html#KEEP_DELETED_CELLS"
- >KEEP_DELETED_CELLS</link> for more information. </para>
- </section>
- <section
- xml:id="secondary.indexes">
- <title> Secondary Indexes and Alternate Query Paths </title>
- <para>This section could also be titled "what if my table rowkey looks like
- <emphasis>this</emphasis> but I also want to query my table like <emphasis>that</emphasis>."
- A common example on the dist-list is where a row-key is of the format "user-timestamp" but
- there are reporting requirements on activity across users for certain time ranges. Thus,
- selecting by user is easy because it is in the lead position of the key, but time is not. </para>
- <para>There is no single answer on the best way to handle this because it depends on... </para>
- <itemizedlist>
- <listitem>
- <para>Number of users</para>
- </listitem>
- <listitem>
- <para>Data size and data arrival rate</para>
- </listitem>
- <listitem>
- <para>Flexibility of reporting requirements (e.g., completely ad-hoc date selection vs.
- pre-configured ranges) </para>
- </listitem>
- <listitem>
- <para>Desired execution speed of query (e.g., 90 seconds may be reasonable to some for an
- ad-hoc report, whereas it may be too long for others) </para>
- </listitem>
- </itemizedlist>
- <para>... and solutions are also influenced by the size of the cluster and how much processing
-    power you have to throw at the solution. Common techniques are in sub-sections below. This is
-    a representative, but not exhaustive, list of approaches. </para>
- <para>It should not be a surprise that secondary indexes require additional cluster space and
- processing. This is precisely what happens in an RDBMS because the act of creating an
-    alternate index requires both space and processing cycles to update. RDBMS products are more
-    advanced in this regard, handling alternative index management out of the box. However, HBase
- scales better at larger data volumes, so this is a feature trade-off. </para>
- <para>Pay attention to <xref
- linkend="performance" /> when implementing any of these approaches.</para>
- <para>Additionally, see the David Butler response in this dist-list thread <link
- xlink:href="http://search-hadoop.com/m/nvbiBp2TDP/Stargate%252Bhbase&subj=Stargate+hbase">HBase,
- mail # user - Stargate+hbase</link>
- </para>
- <section
- xml:id="secondary.indexes.filter">
- <title> Filter Query </title>
- <para>Depending on the case, it may be appropriate to use <xref
- linkend="client.filter" />. In this case, no secondary index is created. However, don't
- try a full-scan on a large table like this from an application (i.e., single-threaded
- client). </para>
- </section>
- <section
- xml:id="secondary.indexes.periodic">
- <title> Periodic-Update Secondary Index </title>
-      <para>A secondary index could be created in another table which is periodically updated via a
- MapReduce job. The job could be executed intra-day, but depending on load-strategy it could
- still potentially be out of sync with the main data table.</para>
- <para>See <xref
- linkend="mapreduce.example.readwrite" /> for more information.</para>
- </section>
- <section
- xml:id="secondary.indexes.dualwrite">
- <title> Dual-Write Secondary Index </title>
- <para>Another strategy is to build the secondary index while publishing data to the cluster
-        (e.g., write to data table, write to index table). If this approach is taken after a data
-        table already exists, then the secondary index will need to be bootstrapped with a
-        MapReduce job (see <xref
- linkend="secondary.indexes.periodic" />).</para>
- </section>
- <section
- xml:id="secondary.indexes.summary">
- <title> Summary Tables </title>
- <para>Where time-ranges are very wide (e.g., year-long report) and where the data is
- voluminous, summary tables are a common approach. These would be generated with MapReduce
- jobs into another table.</para>
- <para>See <xref
- linkend="mapreduce.example.summary" /> for more information.</para>
- </section>
- <section
- xml:id="secondary.indexes.coproc">
- <title> Coprocessor Secondary Index </title>
- <para>Coprocessors act like RDBMS triggers. These were added in 0.92. For more information,
- see <xref
- linkend="coprocessors" />
- </para>
- </section>
- </section>
- <section
- xml:id="constraints">
- <title>Constraints</title>
- <para>HBase currently supports 'constraints' in traditional (SQL) database parlance. The advised
- usage for Constraints is in enforcing business rules for attributes in the table (eg. make
- sure values are in the range 1-10). Constraints could also be used to enforce referential
- integrity, but this is strongly discouraged as it will dramatically decrease the write
- throughput of the tables where integrity checking is enabled. Extensive documentation on using
- Constraints can be found at: <link
- xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/constraint">Constraint</link>
- since version 0.94. </para>
- </section>
- <section
- xml:id="schema.casestudies">
- <title>Schema Design Case Studies</title>
- <para>The following will describe some typical data ingestion use-cases with HBase, and how the
- rowkey design and construction can be approached. Note: this is just an illustration of
- potential approaches, not an exhaustive list. Know your data, and know your processing
- requirements. </para>
- <para>It is highly recommended that you read the rest of the <xref
- linkend="schema" /> first, before reading these case studies. </para>
- <para>The following case studies are described: </para>
- <itemizedlist>
- <listitem>
- <para>Log Data / Timeseries Data</para>
- </listitem>
- <listitem>
- <para>Log Data / Timeseries on Steroids</para>
- </listitem>
- <listitem>
- <para>Customer/Order</para>
- </listitem>
- <listitem>
- <para>Tall/Wide/Middle Schema Design</para>
- </listitem>
- <listitem>
- <para>List Data</para>
- </listitem>
- </itemizedlist>
- <section
- xml:id="schema.casestudies.log-timeseries">
- <title>Case Study - Log Data and Timeseries Data</title>
- <para>Assume that the following data elements are being collected. </para>
- <itemizedlist>
- <listitem>
- <para>Hostname</para>
- </listitem>
- <listitem>
- <para>Timestamp</para>
- </listitem>
- <listitem>
- <para>Log event</para>
- </listitem>
- <listitem>
- <para>Value/message</para>
- </listitem>
- </itemizedlist>
- <para>We can store them in an HBase table called LOG_DATA, but what will the rowkey be? From
- these attributes the rowkey will be some combination of hostname, timestamp, and log-event -
- but what specifically? </para>
- <section
- xml:id="schema.casestudies.log-timeseries.tslead">
- <title>Timestamp In The Rowkey Lead Position</title>
- <para>The rowkey <code>[timestamp][hostname][log-event]</code> suffers from the
- monotonically increasing rowkey problem described in <xref
- linkend="timeseries" />. </para>
- <para>There is another pattern frequently mentioned in the dist-lists about “bucketing”
- timestamps, by performing a mod operation on the timestamp. If time-oriented scans are
- important, this could be a useful approach. Attention must be paid to the number of
- buckets, because this will require the same number of scans to return results.</para>
- <programlisting language="java">
-long bucket = timestamp % numBuckets;
- </programlisting>
- <para>… to construct:</para>
- <programlisting>
-[bucket][timestamp][hostname][log-event]
- </programlisting>
- <para>As stated above, to select data for a particular timerange, a Scan will need to be
- performed for each bucket. 100 buckets, for example, will provide a wide distribution in
- the keyspace but it will require 100 Scans to obtain data for a single timestamp, so there
- are trade-offs. </para>
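-        <para>A sketch of assembling the bucketed rowkey described above; the variable names are
-          illustrative.</para>
-        <programlisting language="java">
-// Sketch: [bucket][timestamp][hostname][log-event]
-long bucket = timestamp % numBuckets;
-byte[] rowkey = Bytes.add(
-    Bytes.toBytes(bucket),
-    Bytes.toBytes(timestamp),
-    Bytes.add(Bytes.toBytes(hostname), Bytes.toBytes(logEvent)));
-        </programlisting>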
- </section>
- <!-- ts lead -->
- <section
- xml:id="schema.casestudies.log-timeseries.hostlead">
- <title>Host In The Rowkey Lead Position</title>
- <para>The rowkey <code>[hostname][log-event][timestamp]</code> is a candidate if there is a
- large-ish number of hosts to spread the writes and reads across the keyspace. This
- approach would be useful if scanning by hostname was a priority. </para>
- </section>
- <!-- host lead -->
- <section
- xml:id="schema.casestudies.log-timeseries.revts">
- <title>Timestamp, or Reverse Timestamp?</title>
-        <para>If the most important access path is to pull the most recent events, then storing the
-          timestamps as reverse-timestamps (e.g., <code>timestamp = Long.MAX_VALUE -
-            timestamp</code>) will create the property of being able to do a Scan on
-            <code>[hostname][log-event]</code> to quickly obtain the most recently
-          captured events. </para>
- <para>Neither approach is wrong, it just depends on what is most appropriate for the
- situation. </para>
- <note>
- <title>Reverse Scan API</title>
- <para>
- <link
- xlink:href="https://issues.apache.org/jira/browse/HBASE-4811">HBASE-4811</link>
- implements an API to scan a table or a range within a table in reverse, reducing the
- need to optimize your schema for forward or reverse scanning. This feature is available
- in HBase 0.98 and later. See <link
- xlink:href="https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setReversed%28boolean" />
- for more information. </para>
- </note>
- </section>
- <!-- revts -->
- <section
- xml:id="schema.casestudies.log-timeseries.varkeys">
-        <title>Variable Length or Fixed Length Rowkeys?</title>
- <para>It is critical to remember that rowkeys are stamped on every column in HBase. If the
- hostname is “a” and the event type is “e1” then the resulting rowkey would be quite small.
- However, what if the ingested hostname is “myserver1.mycompany.com” and the event type is
- “com.package1.subpackage2.subsubpackage3.ImportantService”? </para>
- <para>It might make sense to use some substitution in the rowkey. There are at least two
- approaches: hashed and numeric. In the Hostname In The Rowkey Lead Position example, it
- might look like this: </para>
- <para>Composite Rowkey With Hashes:</para>
- <itemizedlist>
- <listitem>
- <para>[MD5 hash of hostname] = 16 bytes</para>
- </listitem>
- <listitem>
- <para>[MD5 hash of event-type] = 16 bytes</para>
- </listitem>
- <listitem>
- <para>[timestamp] = 8 bytes</para>
- </listitem>
- </itemizedlist>
- <para>Composite Rowkey With Numeric Substitution: </para>
- <para>For this approach another lookup table would be needed in addition to LOG_DATA, called
- LOG_TYPES. The rowkey of LOG_TYPES would be: </para>
- <itemizedlist>
- <listitem>
- <para>[type] (e.g., byte indicating hostname vs. event-type)</para>
- </listitem>
- <listitem>
- <para>[bytes] variable length bytes for raw hostname or event-type.</para>
- </listitem>
- </itemizedlist>
- <para>A column for this rowkey could be a long with an assigned number, which could be
- obtained by using an <link
- xlink:href="http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/HTable.html#incrementColumnValue%28byte[],%20byte[],%20byte[],%20long%29">HBase
- counter</link>. </para>
- <para>So the resulting composite rowkey would be: </para>
- <itemizedlist>
- <listitem>
- <para>[substituted long for hostname] = 8 bytes</para>
- </listitem>
- <listitem>
- <para>[substituted long for event type] = 8 bytes</para>
- </listitem>
- <listitem>
- <para>[timestamp] = 8 bytes</para>
- </listitem>
- </itemizedlist>
- <para>In either the Hash or Numeric substitution approach, the raw values for hostname and
- event-type can be stored as columns. </para>
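-        <para>A sketch of the hashed composite rowkey (16 + 16 + 8 bytes) laid out above;
-          <code>hostname</code>, <code>eventType</code>, and <code>timestamp</code> are
-          placeholders.</para>
-        <programlisting language="java">
-// Sketch: [MD5(hostname)][MD5(event-type)][timestamp]
-MessageDigest md = MessageDigest.getInstance("MD5");
-byte[] rowkey = Bytes.add(
-    md.digest(Bytes.toBytes(hostname)),   // 16 bytes
-    md.digest(Bytes.toBytes(eventType)),  // 16 bytes
-    Bytes.toBytes(timestamp));            // 8 bytes
-        </programlisting>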
- </section>
- <!-- varkeys -->
- </section>
- <!-- log data and timeseries -->
- <section
- xml:id="schema.casestudies.log-steroids">
- <title>Case Study - Log Data and Timeseries Data on Steroids</title>
- <para>This effectively is the OpenTSDB approach. What OpenTSDB does is re-write data and pack
- rows into columns for certain time-periods. For a detailed explanation, see: <link
- xlink:href="http://opentsdb.net/schema.html">http://opentsdb.net/schema.html</link>, and <link
- xlink:href="http://www.cloudera.com/content/cloudera/en/resources/library/hbasecon/video-hbasecon-2012-lessons-learned-from-opentsdb.html">Lessons
- Learned from OpenTSDB</link> from HBaseCon2012. </para>
- <para>But this is how the general concept works: data is ingested, for example, in this
- manner…</para>
- <screen>
-[hostname][log-event][timestamp1]
-[hostname][log-event][timestamp2]
-[hostname][log-event][timestamp3]
- </screen>
- <para>… with separate rowkeys for each detailed event, but is re-written like this… </para>
- <screen>[hostname][log-event][timerange]</screen>
-      <para>… and each of the above events is converted into columns stored with a time-offset
- relative to the beginning timerange (e.g., every 5 minutes). This is obviously a very
- advanced processing technique, but HBase makes this possible. </para>
- </section>
- <!-- log data timeseries steroids -->
-
- <section
- xml:id="schema.casestudies.custorder">
- <title>Case Study - Customer/Order</title>
- <para>Assume that HBase is used to store customer and order information. There are two core
-      record-types being ingested: a Customer record type, and an Order record type. </para>
- <para>The Customer record type would include all the things that you’d typically expect: </para>
- <itemizedlist>
- <listitem>
- <para>Customer number</para>
- </listitem>
- <listitem>
- <para>Customer name</para>
- </listitem>
- <listitem>
- <para>Address (e.g., city, state, zip)</para>
- </listitem>
- <listitem>
- <para>Phone numbers, etc.</para>
- </listitem>
- </itemizedlist>
- <para>The Order record type would include things like: </para>
- <itemizedlist>
- <listitem>
- <para>Customer number</para>
- </listitem>
- <listitem>
- <para>Order number</para>
- </listitem>
- <listitem>
- <para>Sales date</para>
- </listitem>
- <listitem>
- <para>A series of nested objects for shipping locations and line-items (see <xref
- linkend="schema.casestudies.custorder.obj" /> for details)</para>
- </listitem>
- </itemizedlist>
- <para>Assuming that the combination of customer number and sales order uniquely identifies an
- order, these two attributes will compose the rowkey, and specifically a composite key such
- as: </para>
- <screen>[customer number][order number]</screen>
- <para>… for an ORDER table. However, there are more design decisions to make: are the
- <emphasis>raw</emphasis> values the best choices for rowkeys? </para>
- <para>The same design questions that arose in the Log Data use-case confront us here: what is
- the keyspace of the customer number, and what is the format (e.g., numeric? alphanumeric?)?
- Because it is advantageous to use fixed-length keys in HBase, as well as keys that can support
- a reasonable spread in the keyspace, similar options appear: </para>
- <para>Composite Rowkey With Hashes: </para>
- <itemizedlist>
- <listitem>
- <para>[MD5 of customer number] = 16 bytes</para>
- </listitem>
- <listitem>
- <para>[MD5 of order number] = 16 bytes</para>
- </listitem>
- </itemizedlist>
- <para>Composite Numeric/Hash Combo Rowkey: </para>
- <itemizedlist>
- <listitem>
- <para>[substituted long for customer number] = 8 bytes</para>
- </listitem>
- <listitem>
- <para>[MD5 of order number] = 16 bytes</para>
- </listitem>
- </itemizedlist>
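- <para>For illustration, the numeric/hash combo rowkey might be built as in the sketch below;
- the substituted customer id is assumed to come from a lookup table such as the one described
- in the Log Data case study, and the helper name is invented. </para>
- <programlisting language="java"><![CDATA[
-import java.security.MessageDigest;
-import org.apache.hadoop.hbase.util.Bytes;
-
-public class OrderRowkeys {
-  // Sketch: [substituted long for customer number] (8 bytes) + [MD5 of order number] (16 bytes)
-  static byte[] comboOrderKey(long customerId, String orderNumber) throws Exception {
-    byte[] orderHash = MessageDigest.getInstance("MD5").digest(Bytes.toBytes(orderNumber));
-    return Bytes.add(Bytes.toBytes(customerId), orderHash);
-  }
-}
-]]></programlisting>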
- <section
- xml:id="schema.casestudies.custorder.tables">
- <title>Single Table? Multiple Tables?</title>
- <para>A traditional design approach would have separate tables for CUSTOMER and SALES.
- Another option is to pack multiple record types into a single table (e.g., CUSTOMER++). </para>
- <para>Customer Record Type Rowkey: </para>
- <itemizedlist>
- <listitem>
- <para>[customer-id]</para>
- </listitem>
- <listitem>
- <para>[type] = type indicating ‘1’ for customer record type</para>
- </listitem>
- </itemizedlist>
- <para>Order Record Type Rowkey: </para>
- <itemizedlist>
- <listitem>
- <para>[customer-id]</para>
- </listitem>
- <listitem>
- <para>[type] = type indicating ‘2’ for order record type</para>
- </listitem>
- <listitem>
- <para>[order]</para>
- </listitem>
- </itemizedlist>
- <para>The advantage of this particular CUSTOMER++ approach is that it organizes many different
- record-types by customer-id (e.g., a single scan could get you everything about that
- customer). The disadvantage is that it’s not as easy to scan for a particular record-type.
- </para>
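- <para>A sketch of that single-scan advantage with the Java client follows; the table handle
- and the idea of reading the record-type indicator out of the rowkey are assumptions for the
- example. </para>
- <programlisting language="java"><![CDATA[
-import java.io.IOException;
-import org.apache.hadoop.hbase.client.Result;
-import org.apache.hadoop.hbase.client.ResultScanner;
-import org.apache.hadoop.hbase.client.Scan;
-import org.apache.hadoop.hbase.client.Table;
-import org.apache.hadoop.hbase.filter.PrefixFilter;
-
-public class CustomerScan {
-  // Sketch: return every record type (customer, orders, ...) stored under one customer-id.
-  static void scanCustomer(Table customerTable, byte[] customerId) throws IOException {
-    Scan scan = new Scan(customerId);              // start at the customer-id prefix
-    scan.setFilter(new PrefixFilter(customerId));  // stop once rows no longer share the prefix
-    try (ResultScanner scanner = customerTable.getScanner(scan)) {
-      for (Result r : scanner) {
-        // The byte(s) after the customer-id in r.getRow() indicate the record type.
-      }
-    }
-  }
-}
-]]></programlisting>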
- </section>
- <section
- xml:id="schema.casestudies.custorder.obj">
- <title>Order Object Design</title>
- <para>Now we need to address how to model the Order object. Assume that the class structure
- is as follows:</para>
- <variablelist>
- <varlistentry>
- <term>Order</term>
- <listitem>
- <para>(an Order can have multiple ShippingLocations)</para>
- </listitem>
- </varlistentry>
- <varlistentry>
- <term>LineItem</term>
- <listitem>
- <para>(a ShippingLocation can have multiple LineItems)</para>
- </listitem>
- </varlistentry>
- </variablelist>
- <para>... there are multiple options for storing this data. </para>
- <section
- xml:id="schema.casestudies.custorder.obj.norm">
- <title>Completely Normalized</title>
- <para>With this approach, there would be separate tables for ORDER, SHIPPING_LOCATION, and
- LINE_ITEM. </para>
- <para>The ORDER table's rowkey was described above: <xref
- linkend="schema.casestudies.custorder" />
- </para>
- <para>The SHIPPING_LOCATION's composite rowkey would be something like this: </para>
- <itemizedlist>
- <listitem>
- <para>[order-rowkey]</para>
- </listitem>
- <listitem>
- <para>[shipping location number] (e.g., 1st location, 2nd, etc.)</para>
- </listitem>
- </itemizedlist>
- <para>The LINE_ITEM table's composite rowkey would be something like this: </para>
- <itemizedlist>
- <listitem>
- <para>[order-rowkey]</para>
- </listitem>
- <listitem>
- <para>[shipping location number] (e.g., 1st location, 2nd, etc.)</para>
- </listitem>
- <listitem>
- <para>[line item number] (e.g., 1st lineitem, 2nd, etc.)</para>
- </listitem>
- </itemizedlist>
- <para>Such a normalized model is likely to be the approach with an RDBMS, but that's not
- your only option with HBase. The cons of such an approach are that, to retrieve
- information about any Order, you will need: </para>
- <itemizedlist>
- <listitem>
- <para>Get on the ORDER table for the Order</para>
- </listitem>
- <listitem>
- <para>Scan on the SHIPPING_LOCATION table for that order to get the ShippingLocation
- instances</para>
- </listitem>
- <listitem>
- <para>Scan on the LINE_ITEM for each ShippingLocation</para>
- </listitem>
- </itemizedlist>
- <para>... granted, this is what an RDBMS would do under the covers anyway, but since there
- are no joins in HBase you're just more aware of this fact. </para>
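- <para>A sketch of that read path with the Java client is shown below; the table handles are
- assumed to be open, and because both child tables lead with the order-rowkey, a single prefix
- scan per table covers all of the order's ShippingLocations and LineItems. </para>
- <programlisting language="java"><![CDATA[
-import java.io.IOException;
-import org.apache.hadoop.hbase.client.Get;
-import org.apache.hadoop.hbase.client.Result;
-import org.apache.hadoop.hbase.client.ResultScanner;
-import org.apache.hadoop.hbase.client.Scan;
-import org.apache.hadoop.hbase.client.Table;
-import org.apache.hadoop.hbase.filter.PrefixFilter;
-
-public class NormalizedOrderRead {
-  // Sketch: read one Order plus its ShippingLocations and LineItems from three tables.
-  static void readOrder(Table order, Table shippingLocation, Table lineItem,
-      byte[] orderRowkey) throws IOException {
-    Result orderRow = order.get(new Get(orderRowkey));
-
-    Scan locations = new Scan(orderRowkey);
-    locations.setFilter(new PrefixFilter(orderRowkey));
-    try (ResultScanner rs = shippingLocation.getScanner(locations)) {
-      for (Result location : rs) { /* one ShippingLocation per row */ }
-    }
-
-    Scan items = new Scan(orderRowkey);
-    items.setFilter(new PrefixFilter(orderRowkey));
-    try (ResultScanner rs = lineItem.getScanner(items)) {
-      for (Result item : rs) { /* one LineItem per row */ }
-    }
-  }
-}
-]]></programlisting>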
- </section>
- <section
- xml:id="schema.casestudies.custorder.obj.rectype">
- <title>Single Table With Record Types</title>
- <para>With this approach, there would exist a single table ORDER that would contain the
- Order, ShippingLocation, and LineItem record types. </para>
- <para>The Order rowkey was described above: <xref
- linkend="schema.casestudies.custorder" /></para>
- <itemizedlist>
- <listitem>
- <para>[order-rowkey]</para>
- </listitem>
- <listitem>
- <para>[ORDER record type]</para>
- </listitem>
- </itemizedlist>
- <para>The ShippingLocation composite rowkey would be something like this: </para>
- <itemizedlist>
- <listitem>
- <para>[order-rowkey]</para>
- </listitem>
- <listitem>
- <para>[SHIPPING record type]</para>
- </listitem>
- <listitem>
- <para>[shipping location number] (e.g., 1st location, 2nd, etc.)</para>
- </listitem>
- </itemizedlist>
- <para>The LineItem composite rowkey would be something like this: </para>
- <itemizedlist>
- <listitem>
- <para>[order-rowkey]</para>
- </listitem>
- <listitem>
- <para>[LINE record type]</para>
- </listitem>
- <listitem>
- <para>[shipping location number] (e.g., 1st location, 2nd, etc.)</para>
- </listitem>
- <listitem>
- <para>[line item number] (e.g., 1st lineitem, 2nd, etc.)</para>
- </listitem>
- </itemizedlist>
- </section>
- <section
- xml:id="schema.casestudies.custorder.obj.denorm">
- <title>Denormalized</title>
- <para>A variant of the Single Table With Record Types approach is to denormalize and
- flatten some of the object hierarchy, such as collapsing the ShippingLocation attributes
- onto each LineItem instance. </para>
- <para>The LineItem composite rowkey would be something like this: </para>
- <itemizedlist>
- <listitem>
- <para>[order-rowkey]</para>
- </listitem>
- <listitem>
- <para>[LINE record type]</para>
- </listitem>
- <listitem>
- <para>[line item number] (e.g., 1st lineitem, 2nd, etc. - care must be taken that
- they are unique across the entire order)</para>
- </listitem>
- </itemizedlist>
- <para>... and the LineItem columns would be something like this: </para>
- <itemizedlist>
- <listitem>
- <para>itemNumber</para>
- </listitem>
- <listitem>
- <para>quantity</para>
- </listitem>
- <listitem>
- <para>price</para>
- </listitem>
- <listitem>
- <para>shipToLine1 (denormalized from ShippingLocation)</para>
- </listitem>
- <listitem>
- <para>shipToLine2 (denormalized from ShippingLocation)</para>
- </listitem>
- <listitem>
- <para>shipToCity (denormalized from ShippingLocation)</para>
- </listitem>
- <listitem>
- <para>shipToState (denormalized from ShippingLocation)</para>
- </listitem>
- <listitem>
- <para>shipToZip (denormalized from ShippingLocation)</para>
- </listitem>
- </itemizedlist>
- <para>The pros of this approach include a less complex object hierarchy, but one of the
- cons is that updating gets more complicated if any of this information changes.
- </para>
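- <para>A sketch of writing one denormalized LineItem follows; the column family name "cf", the
- value types, and the helper name are assumptions made for the example. </para>
- <programlisting language="java"><![CDATA[
-import org.apache.hadoop.hbase.client.Put;
-import org.apache.hadoop.hbase.util.Bytes;
-
-public class DenormalizedLineItem {
-  private static final byte[] CF = Bytes.toBytes("cf");  // invented family name
-
-  // Sketch: the ShippingLocation attributes ride along on every LineItem row.
-  static Put lineItemPut(byte[] lineItemRowkey, String itemNumber, long quantity,
-      long priceCents, String shipToLine1, String shipToLine2, String shipToCity,
-      String shipToState, String shipToZip) {
-    Put put = new Put(lineItemRowkey);
-    put.addColumn(CF, Bytes.toBytes("itemNumber"), Bytes.toBytes(itemNumber));
-    put.addColumn(CF, Bytes.toBytes("quantity"), Bytes.toBytes(quantity));
-    put.addColumn(CF, Bytes.toBytes("price"), Bytes.toBytes(priceCents));
-    put.addColumn(CF, Bytes.toBytes("shipToLine1"), Bytes.toBytes(shipToLine1));
-    put.addColumn(CF, Bytes.toBytes("shipToLine2"), Bytes.toBytes(shipToLine2));
-    put.addColumn(CF, Bytes.toBytes("shipToCity"), Bytes.toBytes(shipToCity));
-    put.addColumn(CF, Bytes.toBytes("shipToState"), Bytes.toBytes(shipToState));
-    put.addColumn(CF, Bytes.toBytes("shipToZip"), Bytes.toBytes(shipToZip));
-    return put;
-  }
-}
-]]></programlisting>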
- </section>
- <section
- xml:id="schema.casestudies.custorder.obj.singleobj">
- <title>Object BLOB</title>
- <para>With this approach, the entire Order object graph is treated, in one way or another,
- as a BLOB. For example, the ORDER table's rowkey was described above: <xref
- linkend="schema.casestudies.custorder" />, and a single column called "order" would
- contain an object that could be deserialized into the containing Order, its
- ShippingLocations, and its LineItems. </para>
- <para>There are many options here: JSON, XML, Java Serialization, Avro, Hadoop Writables,
- etc. All of them are variants of the same approach: encode the object graph to a
- byte-array. Care should be taken with this approach to ensure backward compatibility so
- that, if the object model changes, older persisted structures can still be read back out
- of HBase. </para>
- <para>Pros are being able to manage complex object graphs with minimal I/O (e.g., a single
- HBase Get per Order in this example), but the cons include the aforementioned warning
- about backward compatibility of serialization, language dependencies of serialization
- (e.g., Java Serialization only works with Java clients), the fact that you have to
- deserialize the entire object to get any piece of information inside the BLOB, and the
- difficulty in getting frameworks like Hive to work with custom objects like this.
- </para>
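- <para>As an illustrative sketch using Java Serialization (any of the encodings above would do
- equally well), the whole Order graph is written into a single "order" cell; the family name
- and helper are invented, and the Order class is assumed to be Serializable. </para>
- <programlisting language="java"><![CDATA[
-import java.io.ByteArrayOutputStream;
-import java.io.IOException;
-import java.io.ObjectOutputStream;
-import java.io.Serializable;
-import org.apache.hadoop.hbase.client.Put;
-import org.apache.hadoop.hbase.util.Bytes;
-
-public class OrderBlob {
-  // Sketch: serialize the whole object graph into the single "order" column.
-  static Put orderBlobPut(byte[] orderRowkey, Serializable orderGraph) throws IOException {
-    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
-    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
-      out.writeObject(orderGraph);
-    }
-    Put put = new Put(orderRowkey);
-    put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("order"), bytes.toByteArray());
-    return put;
-  }
-}
-]]></programlisting>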
- </section>
- </section>
- <!-- cust/order order object -->
- </section>
- <!-- cust/order -->
-
- <section
- xml:id="schema.smackdown">
- <title>Case Study - "Tall/Wide/Middle" Schema Design Smackdown</title>
- <para>This section will describe additional schema design questions that appear on the
- dist-list, specifically about tall and wide tables. These are general guidelines and not
- laws - each application must consider its own needs. </para>
- <section
- xml:id="schema.smackdown.rowsversions">
- <title>Rows vs. Versions</title>
- <para>A common question is whether one should prefer rows or HBase's built-in versioning.
- The context is typically where there are "a lot" of versions of a row to be retained
- (e.g., where the count is significantly above the HBase default of 1 max version). The
- rows approach would require storing a timestamp in some portion of the rowkey so that
- successive updates do not overwrite one another. </para>
- <para>Preference: Rows (generally speaking). </para>
- </section>
- <section
- xml:id="schema.smackdown.rowscols">
- <title>Rows vs. Columns</title>
- <para>Another common question is whether one should prefer rows or columns. The context is
- typically in extreme cases of wide tables, such as having 1 row with 1 million attributes,
- or 1 million rows with 1 column apiece. </para>
- <para>Preference: Rows (generally speaking). To be clear, this guideline applies to the
- extremely wide cases, not to the standard use-case where one needs to store a few
- dozen or hundred columns. But there is also a middle path between these two options, and
- that is "Rows as Columns." </para>
- </section>
- <section
- xml:id="schema.smackdown.rowsascols">
- <title>Rows as Columns</title>
- <para>The middle path between Rows vs. Columns is packing data that would otherwise be
- separate rows into the columns of a single row, for certain rowkeys. OpenTSDB is the best
- example of this case, where a single row represents a defined time-range and discrete
- events are treated as columns. This approach is often more complex, and may require the
- additional complexity of re-writing your data, but has the advantage of being I/O efficient.
- For an overview of this approach,
- see <xref
- linkend="schema.casestudies.log-steroids" />. </para>
- </section>
- </section>
- <!-- note: the following id is not consistent with the others because it was formerly in the Case Studies chapter,
- but I didn't want to break backward compatibility of the link. But future entries should look like the above case-study
- links (schema.casestudies. ...) -->
- <section
- xml:id="casestudies.schema.listdata">
- <title>Case Study - List Data</title>
- <para>The following is an exchange from the user dist-list regarding a fairly common question:
- how to handle per-user list data in Apache HBase. </para>
- <para>*** QUESTION ***</para>
- <para> We're looking at how to store a large amount of (per-user) list data in HBase, and we
- were trying to figure out what kind of access pattern made the most sense. One option is to
- store the majority of the data in a key, so we could have something like: </para>
-
- <programlisting><![CDATA[
-<FixedWidthUserName><FixedWidthValueId1>:"" (no value)
-<FixedWidthUserName><FixedWidthValueId2>:"" (no value)
-<FixedWidthUserName><FixedWidthValueId3>:"" (no value)
-]]></programlisting>
-
- <para>The other option we had was to do this entirely using:</para>
- <programlisting language="xml"><![CDATA[
-<FixedWidthUserName><FixedWidthPageNum0>:<FixedWidthLength><FixedIdNextPageNum><ValueId1><ValueId2><ValueId3>...
-<FixedWidthUserName><FixedWidthPageNum1>:<FixedWidthLength><FixedIdNextPageNum><ValueId1><ValueId2><ValueId3>...
- ]]></programlisting>
- <para> where each row would contain multiple values. So in one case reading the first thirty
- values would be: </para>
- <programlisting><![CDATA[
-scan { STARTROW => 'FixedWidthUsername' LIMIT => 30}
- ]]></programlisting>
- <para>And in the second case it would be </para>
- <programlisting>
-get 'FixedWidthUserName\x00\x00\x00\x00'
- </programlisting>
- <para> The general usage pattern would be to read only the first 30 values of these lists,
- with infrequent access reading deeper into the lists. Some users would have &lt;= 30 total
- values in these lists, and some users would have millions (i.e., a power-law distribution). </para>
- <para> The single-value format seems like it would take up more space on HBase, but would
- offer some improved retrieval / pagination flexibility. Would there be any significant
- performance advantages to be able to paginate via gets vs paginating with scans? </para>
- <para> My initial understanding was that doing a scan should be faster if our paging size is
- unknown (and caching is set appropriately), but that gets should be faster if we'll always
- need the same page size. I've ended up hearing different people tell me opposite things
- about performance. I assume the page sizes would be relatively consistent, so for most use
- cases we could guarantee that we only wanted one page of data in the fixed-page-length case.
- I would also assume that we would have infrequent updates, but may have inserts into the
- middle of these lists (meaning we'd need to update all subsequent rows). </para>
- <para> Thanks for help / suggestions / follow-up questions. </para>
- <para>*** ANSWER ***</para>
- <para> If I understand you correctly, you're ultimately trying to store triples in the form
- "user, valueid, value", right? E.g., something like: </para>
- <programlisting>
-"user123, firstname, Paul",
-"user234, lastname, Smith"
- </programlisting>
- <para> (But the usernames are fixed width, and the valueids are fixed width). </para>
- <para> And, your access pattern is along the lines of: "for user X, list the next 30 values,
- starting with valueid Y". Is that right? And these values should be returned sorted by
- valueid? </para>
- <para> The tl;dr version is that you should probably go with one row per user+value, and not
- build a complicated intra-row pagination scheme on your own unless you're really sure it is
- needed. </para>
- <para> Your two options mirror a common question people have when designing HBase schemas:
- should I go "tall" or "wide"? Your first schema is "tall": each row represents one value for
- one user, and so there are many rows in the table for each user; the row key is user +
- valueid, and there would be (presumably) a single column qualifier that means "the value".
- This is great if you want to scan over rows in sorted order by row key (thus my question
- above, about whether these ids are sorted correctly). You can start a scan at any
- user+valueid, read the next 30, and be done. What you're giving up is the ability to have
- transactional guarantees around all the rows for one user, but it doesn't sound like you
- need that. Doing it this way is generally recommended (see here <link
- xlink:href="http://hbase.apache.org/book.html#schema.smackdown">http://hbase.apache.org/book.html#schema.smackdown</link>). </para>
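- <para> (For illustration, a sketch of that tall-table read with the Java client; the table
- handle and the fixed-width key layout are assumed, and the helper name is invented.) </para>
- <programlisting language="java"><![CDATA[
-import java.io.IOException;
-import org.apache.hadoop.hbase.client.Result;
-import org.apache.hadoop.hbase.client.ResultScanner;
-import org.apache.hadoop.hbase.client.Scan;
-import org.apache.hadoop.hbase.client.Table;
-import org.apache.hadoop.hbase.util.Bytes;
-
-public class TallListRead {
-  // Sketch: "for user X, list the next 30 values, starting with valueid Y".
-  static void nextThirty(Table table, byte[] fixedWidthUser, byte[] startValueId)
-      throws IOException {
-    Scan scan = new Scan(Bytes.add(fixedWidthUser, startValueId));
-    scan.setCaching(30);  // fetch the whole page in one round-trip in the common case
-    try (ResultScanner scanner = table.getScanner(scan)) {
-      int count = 0;
-      for (Result r : scanner) {
-        if (!Bytes.startsWith(r.getRow(), fixedWidthUser) || ++count > 30) {
-          break;  // ran past this user's rows, or the page is full
-        }
-        // use r ...
-      }
-    }
-  }
-}
-]]></programlisting>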
- <para> Your second option is "wide": you store a bunch of values in one row, using different
- qualifiers (where the qualifier is the valueid). The simple way to do that would be to just
- store ALL values for one user in a single row. I'm guessing you jumped to the "paginated"
- version because you're assuming that storing millions of columns in a single row would be
- bad for performance, which may or may not be true; as long as you're not trying to do too
- much in a single request, or do things like scanning over and returning all of the cells in
- the row, it shouldn't be fundamentally worse. The client has methods that allow you to get
- specific slices of columns. </para>
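- <para> (As a sketch of such a column slice with the Java client: ColumnPaginationFilter can
- fetch 30 qualifiers at a time from the single wide row; the family name and helper are
- invented for the example.) </para>
- <programlisting language="java"><![CDATA[
-import java.io.IOException;
-import org.apache.hadoop.hbase.client.Get;
-import org.apache.hadoop.hbase.client.Result;
-import org.apache.hadoop.hbase.client.Table;
-import org.apache.hadoop.hbase.filter.ColumnPaginationFilter;
-import org.apache.hadoop.hbase.util.Bytes;
-
-public class WideListRead {
-  // Sketch: read a page of 30 value-id qualifiers from one wide row, skipping 'offset' columns.
-  static Result pageOfColumns(Table table, byte[] fixedWidthUser, int offset)
-      throws IOException {
-    Get get = new Get(fixedWidthUser);
-    get.addFamily(Bytes.toBytes("v"));                      // invented family name
-    get.setFilter(new ColumnPaginationFilter(30, offset));  // limit 30, skip 'offset' columns
-    return table.get(get);
-  }
-}
-]]></programlisting>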
- <para> Note that neither case fundamentally uses more disk space than the other; you're just
- "shifting" part of the identifying information for a value either to the left (into the row
- key, in option one) or to the right (into the column qualifiers in option 2). Under the
- covers, every key/value still stores the whole row key, and column family name. (If this is
- a bit confusing, take an hour and watch Lars George's excellent video about understanding
- HBase schema design: <link
- xlink:href="http://www.youtube.com/watch?v=_HLoH_PgrLk">http://www.youtube.com/watch?v=_HLoH_PgrLk</link>.) </para>
- <para> A manually paginated version has lots more complexities, as you note, like having to
- keep track of how many things are in each page, re-shuffling if new values are inserted,
- etc. That seems significantly more complex. It might have some slight speed advantages (or
- disadvantages!) at extremely high throughput, and the only way to really know that would be
- to try it out. If you don't have time to build it both ways and compare, my advice would be
- to start with the simplest option (one row per user+value). Start simple and iterate! :)
- </para>
- </section>
- <!-- listdata -->
-
- </section>
- <!-- schema design cases -->
- <section
- xml:id="schema.ops">
- <title>Operational and Performance Configuration Options</title>
- <para>See the Performance section <xref
- linkend="perf.schema" /> for more information on operational and performance schema design
- options, such as Bloom Filters, table-configured region sizes, compression, and blocksizes.
- </para>
- </section>
-
-</chapter>
-<!-- schema design -->