You are viewing a plain text version of this content. The canonical link for it is here.
Posted to common-commits@hadoop.apache.org by Apache Wiki <wi...@apache.org> on 2011/04/02 20:51:50 UTC
[Hadoop Wiki] Update of "Hbase/HbaseArchitecture" by DougMeil

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change notification.

The "Hbase/HbaseArchitecture" page has been changed by DougMeil.
The comment on this change is: Per stack, this page is now directing readers to the HBase book..
http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture?action=diff&rev1=77&rev2=78

--------------------------------------------------

- This document is stale.  It describes HBase circa 0.18.x.  It is badly in need of an update.
+ HBase architecture information can now can be found in the HBase book at [[http://hbase.apache.org/book.html#architecture]]
  
- = Table of Contents =
- 
-  * [[#intro|Introduction]]
-  * [[#datamodel|Data Model]]
-   * [[#conceptual|Conceptual View]]
-   * [[#physical|Physical Storage View]]
-  * [[#arch|Architecture and Implementation]]
-   * [[#master|HBaseMaster]]
-   * [[#hregion|HRegionServer]]
-   * [[#client|HBase Client]]
- 
- <<Anchor(intro)>>
- = Introduction =
- 
- In order to better understand this document, it is highly recommended that the Google [[http://labs.google.com/papers/bigtable.html|Bigtable paper]] be read first.
- 
- HBase is an [[http://apache.org/|Apache]] open source project whose goal is to provide Bigtable-like storage for the Hadoop Distributed Computing Environment. Just as Google's [[http://labs.google.com/papers/bigtable.html|Bigtable]] leverages the distributed data storage provided by the [[http://labs.google.com/papers/gfs.html|Google Distributed File System (GFS)]], HBase provides Bigtable-like capabilities on top of the [[http://hadoop.apache.org/core/docs/current/hdfs_design.html|Hadoop Distributed File System (HDFS)]].
- 
- Data is logically organized into tables, rows and columns.  An iterator-like interface is available for scanning through a row range and, of course, there is the ability to retrieve a column value for a specific row key. Any particular column may have multiple versions for the same row key.
- 
- <<Anchor(datamodel)>>
- = Data Model =
- 
- HBase uses a data model very similar to that of Bigtable. Applications store data rows in labeled tables. A data row has a sortable row key and an arbitrary number of columns. The table is stored sparsely, so that rows in the same table can have widely varying numbers of columns. 
- 
- A column name has the form ''"<family>:<label>"'' where <family> and <label> can be arbitrary byte arrays. A table enforces its set of <family>s (called ''"column families"''). Adjusting the set of families is done by performing administrative operations on the table. However, new <label>s can be used in any write operation without pre-announcing it. HBase stores column families physically close on disk, so the items in a given column family should have roughly the same read/write characteristics and contain similar data.
- 
- Only a single row at a time may be locked by default. Row writes are always atomic, but it is also possible to lock a single row and perform both read and write operations on that row atomically.
- 
- An extension was added recently to allow multi-row locking, but this is not the default behavior and must be explicitly enabled.
- 
- <<Anchor(conceptual)>>
- == Conceptual View ==
- 
- Conceptually a table may be thought of a collection of rows that
- are located by a row key (and optional timestamp) and where any column
- may not have a value for a particular row key (sparse). The following example is a slightly modified form of the one on page 2 of the [[http://labs.google.com/papers/bigtable.html|Bigtable Paper]] (adds a new column family ''"mime:"'').
- 
- <<Anchor(datamodelexample)>>
- ||<:> '''Row Key''' ||<:> '''Time Stamp''' ||<:> '''Column''' ''"contents:"'' ||||<:> '''Column''' ''"anchor:"'' ||<:> '''Column''' ''"mime:"'' ||
- ||<^|5> "com.cnn.www" ||<:> t9 || ||<)> "anchor:cnnsi.com" ||<:> "CNN" || ||
- ||<:> t8 || ||<)> "anchor:my.look.ca" ||<:> "CNN.com" || ||
- ||<:> t6 ||<:> "<html>..." || || ||<:> "text/html" ||
- ||<:> t5 ||<:> "<html>..." || || || ||
- ||<:> t3 ||<:> "<html>..." || || || ||
- 
- <<Anchor(physical)>>
- == Physical Storage View ==
- 
- Although at a conceptual level, tables may be viewed as a sparse set of rows, physically they are stored on a per-column family basis. This is an important consideration for schema and application designers to keep in mind.
- 
- Pictorially, the table shown in the [[#datamodelexample|conceptual view]] above would be stored as follows:
- 
- ||<:> '''Row Key''' ||<:> '''Time Stamp''' ||<:> '''Column''' ''"contents:"'' ||
- ||<^|3> "com.cnn.www" ||<:> t6 ||<:> "<html>..." ||
- ||<:> t5 ||<:> "<html>..." ||
- ||<:> t3 ||<:> "<html>..." ||
- 
- <<BR>>
- 
- ||<:> '''Row Key''' ||<:> '''Time Stamp''' |||| '''Column''' ''"anchor:"'' ||
- ||<^|2> "com.cnn.www" ||<:> t9 ||<)> "anchor:cnnsi.com" ||<:> "CNN" ||
- ||<:> t8 ||<)> "anchor:my.look.ca" ||<:> "CNN.com" ||
- 
- <<BR>>
- 
- ||<:> '''Row Key''' ||<:> '''Time Stamp''' ||<:> '''Column''' ''"mime:"'' ||
- || "com.cnn.www" ||<:> t6 ||<:> "text/html" ||
- 
- <<BR>>
- 
- It is important to note in the diagram above that the empty cells shown in the conceptual view are not stored since they need not be in a column-oriented storage format. Thus a request for the value of the ''"contents:"'' column at time stamp t8 would return no value. Similarly, a request for an ''"anchor:my.look.ca"'' value at time stamp t9 would return no value.
- 
- However, if no timestamp is supplied, the most recent value for a particular column would be returned and would also be the first one found since timestamps are stored in descending order. Thus a request for the values of all columns in the row "com.cnn.www" if no timestamp is specified would be: the value of ''"contents:"'' from time stamp t6, the value of ''"anchor:cnnsi.com"'' from time stamp t9, the value of ''"anchor:my.look.ca"'' from time stamp t8 and the value of ''"mime:"'' from time stamp t6.
- 
- 
- === Row Ranges: Regions ===
- 
- To an application, a table appears to be a list of tuples sorted by row key ascending, column name ascending and timestamp descending.  Physically, tables are broken up into row ranges called ''regions'' (equivalent Bigtable term is ''tablet''). Each row range contains rows from start-key (inclusive) to end-key (exclusive). A set of regions, sorted appropriately, forms an entire table. Unlike Bigtable which identifies a row range by the table name and end-key, HBase identifies a row range by the table name and start-key.
- 
- Each column family in a region is managed by an ''HStore''. Each HStore may have one or more ''!MapFiles'' (a Hadoop HDFS file type) that is very similar to a Google ''SSTable''. Like SSTables, !MapFiles are immutable once closed. !MapFiles are stored in the Hadoop HDFS. Other details are the same, except:
-  * !MapFiles cannot currently be mapped into memory.
-  * !MapFiles maintain the sparse index in a separate file rather than at the end of the file as SSTable does.
-  * HBase extends !MapFile so that a bloom filter can be employed to enhance negative lookup performance. The hash function employed is one developed by Bob Jenkins.
- 
- <<Anchor(arch)>>
- = Architecture and Implementation =
- 
- There are three major components of the HBase architecture:
-  1. The H!BaseMaster (analogous to the Bigtable master server)
-  2. The H!RegionServer (analogous to the Bigtable tablet server)
-  3. The HBase client, defined by org.apache.hadoop.hbase.client.HTable
- 
- Each will be discussed in the following sections.
- 
- <<Anchor(master)>>
- == HBaseMaster ==
- 
- The H!BaseMaster is responsible for assigning regions to H!RegionServers. The first region to be assigned is the ''ROOT region'' which locates all the META regions to be assigned. Each ''META region'' maps a number of user regions which comprise the multiple tables that a particular HBase instance serves. Once all the META regions have been assigned, the master will then assign user regions to the H!RegionServers, attempting to balance the number of regions served by each H!RegionServer.
- 
- It also holds a pointer to the H!RegionServer that is hosting the ROOT region.
- 
- The H!BaseMaster also monitors the health of each H!RegionServer, and if it detects a H!RegionServer is no longer reachable, it will split the H!RegionServer's write-ahead log so that there is now one write-ahead log for each region that the H!RegionServer was serving. After it has accomplished this, it will reassign the regions that were being served by the unreachable H!RegionServer.
- 
- In addition, the H!BaseMaster is also responsible for handling table administrative functions such as on/off-lining of tables, changes to the table schema (adding and removing column families), etc.
- 
- Unlike Bigtable, currently, when the H!BaseMaster dies, the cluster will shut down. In Bigtable, a Tabletserver can still serve Tablets after its connection to the Master has died. We tie them together, because we do not currently use an external lock-management system like Bigtable. The Bigtable Master allocates tablets and a lock manager (''Chubby'') guarantees atomic access by Tabletservers to tablets. HBase uses just a single central point for all H!RegionServers to access: the H!BaseMaster.
- 
- === The META Table ===
- 
- The META table stores information about every user region in HBase which includes a H!RegionInfo object containing information such as the start and end row keys, whether the region is on-line or off-line, etc. and the address of the H!RegionServer that is currently serving the region. The META table can grow as the number of user regions grows.
- 
- === The ROOT Table ===
- 
- The ROOT table is confined to a single region and maps all the regions in the META table. Like the META table, it contains a H!RegionInfo object for each META region and the location of the H!RegionServer that is serving that META region.
- 
- Each row in the ROOT and META tables is approximately 1KB in size. At the default region size of 256MB, this means that the ROOT region can map 2.6 x 10^5^ META regions, which in turn map a total 6.9 x 10^10^ user regions, meaning that approximately 1.8 x 10^19^ (2^64^) bytes of user data.
- 
- <<Anchor(hregion)>>
- == HRegionServer ==
- 
- The H!RegionServer is responsible for handling client read and write requests. It communicates with the H!BaseMaster to get a list of regions to serve and to tell the master that it is alive. Region assignments and other instructions from the master "piggy back" on the heart beat messages.
- 
- === Write Requests ===
- 
- When a write request is received, it is first written to a write-ahead log called a ''HLog''. All write requests for every region the region server is serving are written to the same log. Once the request has been written to the HLog, it is stored in an in-memory cache called the ''Memcache''. There is one Memcache for each HStore.
- 
- === Read Requests ===
- 
- Reads are handled by first checking the Memcache and if the requested data is not found, the !MapFiles are searched for results.
- 
- === Cache Flushes ===
- 
- When the Memcache reaches a configurable size, it is flushed to disk, creating a new !MapFile and a marker is written to the HLog, so that when it is replayed, log entries before the last flush can be skipped. A flush may also be triggered to relieve memory pressure on the region server.
- 
- Cache flushes happen concurrently with the region server processing read and write requests. Just before the new !MapFile is moved into place, reads and writes are suspended until the !MapFile has been added to the list of active !MapFiles for the HStore.
- 
- === Compactions ===
- 
- When the number of !MapFiles exceeds a configurable threshold, a minor compaction is performed which consolidates the most recently written !MapFiles. A major compaction is performed periodically which consolidates all the !MapFiles into a single !MapFile. The reason for not always performing a major compaction is that the oldest !MapFile can be quite large and reading and merging it with the latest !MapFiles, which are much smaller, can be very time consuming due to the amount of I/O involved in reading merging and writing the contents of the largest !MapFile.
- 
- Compactions happen concurrently with the region server processing read and write requests. Just before the new !MapFile is moved into place, reads and writes are suspended until the !MapFile has been added to the list of active !MapFiles for the HStore and the !MapFiles that were merged to create the new !MapFile have been removed.
- 
- === Region Splits ===
- 
- When the aggregate size of the !MapFiles for an HStore reaches a configurable size (currently 256MB), a region split is requested. Region splits divide the row range of the parent region in half and happen very quickly because the child regions read from the parent's !MapFile. 
- 
- The parent region is taken off-line, the region server records the new child regions in the META region and the master is informed that a split has taken place so that it can assign the children to region servers. Should the split message be lost, the master will discover the split has occurred since it periodically scans the META regions for unassigned regions.
- 
- Once the parent region is closed, read and write requests for the region are suspended. The client has a mechanism for detecting a region split and will wait and retry the request when the new children are on-line.
- 
- When a compaction is triggered in a child, the data from the parent is copied to the child. When both children have performed a compaction, the parent region is garbage collected.
- 
- <<Anchor(client)>>
- == HBase Client ==
- 
- The HBase client is responsible for finding H!RegionServers that are serving the particular row range of interest. On instantiation, the HBase client communicates with the H!BaseMaster to find the location of the ROOT region. This is the only communication between the client and the master.
- 
- Once the ROOT region is located, the client contacts that region server and scans the ROOT region to find the META region that will contain the location of the user region that contains the desired row range. It then contacts the region server that is serving that META region and scans that META region to determine the location of the user region.
- 
- After locating the user region, the client contacts the region server serving that region and issues the read or write request.
- 
- This information is cached in the client so that subsequent requests need not go through this process. 
- 
- Should a region be reassigned either by the master for load balancing or because a region server has died, the client will rescan the META table to determine the new location of the user region. If the META region has been reassigned, the client will rescan the ROOT region to determine the new location of the META region. If the ROOT region has been reassigned, the client will contact the master to determine the new ROOT region location and will locate the user region by repeating the original process described above.
- 
- === Client API ===
- 
- See the Javadoc for [[http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/client/HTable.html|HTable]] and [[http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/client/HBaseAdmin.html|HBaseAdmin]]
- 
- 
- ==== Scanner API ====
- 
- To obtain a scanner, a Cursor-like row 'iterator' that must be closed, [[http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/client/HTable.html#HTable(org.apache.hadoop.hbase.HBaseConfiguration,%20java.lang.String)|instantiate an HTable]], and then invoke ''getScanner''.  This method returns an [[http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/client/Scanner.html|Scanner]] against which you call [[http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/client/Scanner.html#next()|next]] and ultimately [[http://hadoop.apache.org/hbase/docs/current/api/org/apache/hadoop/hbase/client/Scanner.html#close()|close]].
-