You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hbase.apache.org by mi...@apache.org on 2015/02/17 21:58:28 UTC

hbase git commit: HBASE-12425 Document the region split process

Repository: hbase
Updated Branches:
  refs/heads/master b92ff24b0 -> f1f8c0d43


HBASE-12425 Document the region split process


Project: http://git-wip-us.apache.org/repos/asf/hbase/repo
Commit: http://git-wip-us.apache.org/repos/asf/hbase/commit/f1f8c0d4
Tree: http://git-wip-us.apache.org/repos/asf/hbase/tree/f1f8c0d4
Diff: http://git-wip-us.apache.org/repos/asf/hbase/diff/f1f8c0d4

Branch: refs/heads/master
Commit: f1f8c0d43000e69b8cddaf84b73c9645ef257ba9
Parents: b92ff24
Author: Misty Stanley-Jones <ms...@cloudera.com>
Authored: Thu Feb 12 14:48:05 2015 +1000
Committer: Misty Stanley-Jones <ms...@cloudera.com>
Committed: Wed Feb 18 06:58:15 2015 +1000

----------------------------------------------------------------------
 src/main/asciidoc/_chapters/architecture.adoc   |  25 +++++++++++++++++++
 .../resources/images/region_split_process.png   | Bin 0 -> 338255 bytes
 2 files changed, 25 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/hbase/blob/f1f8c0d4/src/main/asciidoc/_chapters/architecture.adoc
----------------------------------------------------------------------
diff --git a/src/main/asciidoc/_chapters/architecture.adoc b/src/main/asciidoc/_chapters/architecture.adoc
index bae4a23..6de7208 100644
--- a/src/main/asciidoc/_chapters/architecture.adoc
+++ b/src/main/asciidoc/_chapters/architecture.adoc
@@ -858,6 +858,31 @@ For a RegionServer hosting data that can comfortably fit into cache, or if your
 
 The compressed BlockCache is disabled by default. To enable it, set `hbase.block.data.cachecompressed` to `true` in _hbase-site.xml_ on all RegionServers.
 
+[[regionserver_splitting_implementation]]
+=== RegionServer Splitting Implementation
+
+As write requests are handled by the region server, they accumulate in an in-memory storage system called the _memstore_. Once the memstore fills, its content are written to disk as additional store files. This event is called a _memstore flush_. As store files accumulate, the RegionServer will <<compaction,compact>> them into fewer, larger files. After each flush or compaction finishes, the amount of data stored in the region has changed. The RegionServer consults the region split policy to determine if the region has grown too large or should be split for another policy-specific reason. A region split request is enqueued if the policy recommends it.
+
+Logically, the process of splitting a region is simple. We find a suitable point in the keyspace of the region where we should divide the region in half, then split the region's data into two new regions at that point. The details of the process however are not simple.  When a split happens, the newly created _daughter regions_ do not rewrite all the data into new files immediately. Instead, they create small files similar to symbolic link files, named link:http://www.google.com/url?q=http%3A%2F%2Fhbase.apache.org%2Fapidocs%2Forg%2Fapache%2Fhadoop%2Fhbase%2Fio%2FReference.html&sa=D&sntz=1&usg=AFQjCNEkCbADZ3CgKHTtGYI8bJVwp663CA[Reference files], which point to either the top or bottom part of the parent store file according to the split point. The reference file is used just like a regular data file, but only half of the records are considered. The region can only be split if there are no more references to the immutable data files of the parent region. Those reference files are clea
 ned gradually by compactions, so that the region will stop referring to its parents files, and can be split further.
+
+Although splitting the region is a local decision made by the RegionServer, the split process itself must coordinate with many actors. The RegionServer notifies the Master before and after the split, updates the `.META.` table so that clients can discover the new daughter regions, and rearranges the directory structure and data files in HDFS. Splitting is a multi-task process. To enable rollback in case of an error, the RegionServer keeps an in-memory journal about the execution state. The steps taken by the RegionServer to execute the split are illustrated in <<regionserver_split_process_image>>. Each step is labeled with its step number. Actions from RegionServers or Master are shown in red, while actions from the clients are show in green.
+
+[[regionserver_split_process_image]]
+.RegionServer Split Process
+image::region_split_process.png[Region Split Process]
+
+. The RegionServer decides locally to split the region, and prepares the split. *THE SPLIT TRANSACTION IS STARTED.* As a first step, the RegionServer acquires a shared read lock on the table to prevent schema modifications during the splitting process. Then it creates a znode in zookeeper under `/hbase/region-in-transition/region-name`, and sets the znode's state to `SPLITTING`.
+. The Master learns about this znode, since it has a watcher for the parent `region-in-transition` znode.
+. The RegionServer creates a sub-directory named `.splits` under the parent’s `region` directory in HDFS.
+. The RegionServer closes the parent region and marks the region as offline in its local data structures. *THE SPLITTING REGION IS NOW OFFLINE.* At this point, client requests coming to the parent region will throw `NotServingRegionException`. The client will retry with some backoff. The closing region is flushed.
+. The  RegionServer creates region directories under the `.splits` directory, for daughter regions A and B, and creates necessary data structures. Then it splits the store files, in the sense that it creates two link:http://www.google.com/url?q=http%3A%2F%2Fhbase.apache.org%2Fapidocs%2Forg%2Fapache%2Fhadoop%2Fhbase%2Fio%2FReference.html&sa=D&sntz=1&usg=AFQjCNEkCbADZ3CgKHTtGYI8bJVwp663CA[Reference] files per store file in the parent region. Those reference files will point to the parent regions'files.
+. The RegionServer creates the actual region directory in HDFS, and moves the reference files for each daughter.
+. The RegionServer sends a `Put` request to the `.META.` table, to set the parent as offline in the `.META.` table and add information about daughter regions. At this point, there won’t be individual entries in `.META.` for the daughters. Clients will see that the parent region is split if they scan `.META.`, but won’t know about the daughters until they appear in `.META.`. Also, if this `Put` to `.META`. succeeds, the parent will be effectively split. If the RegionServer fails before this RPC succeeds, Master and the next Region Server opening the region will clean dirty state about the region split. After the `.META.` update, though, the region split will be rolled-forward by Master.
+. The RegionServer opens daughters A and B in parallel.
+. The RegionServer adds the daughters A and B to `.META.`, together with information that it hosts the regions. *THE SPLIT REGIONS (DAUGHTERS WITH REFERENCES TO PARENT) ARE NOW ONLINE.* After this point, clients can discover the new regions and issue requests to them. Clients cache the `.META.` entries locally, but when they make requests to the RegionServer or `.META.`, their caches will be invalidated, and they will learn about the new regions from `.META.`.
+. The RegionServer updates znode `/hbase/region-in-transition/region-name` in ZooKeeper to state `SPLIT`, so that the master can learn about it. The balancer can freely re-assign the daughter regions to other region servers if necessary. *THE SPLIT TRANSACTION IS NOW FINISHED.*
+. After the split, `.META.` and HDFS will still contain references to the parent region. Those references will be removed when compactions in daughter regions rewrite the data files. Garbage collection tasks in the master periodically check whether the daughter regions still refer to the parent region's files. If not, the parent region will be removed.
+
 [[wal]]
 === Write Ahead Log (WAL)
 

http://git-wip-us.apache.org/repos/asf/hbase/blob/f1f8c0d4/src/main/site/resources/images/region_split_process.png
----------------------------------------------------------------------
diff --git a/src/main/site/resources/images/region_split_process.png b/src/main/site/resources/images/region_split_process.png
new file mode 100644
index 0000000..2717617
Binary files /dev/null and b/src/main/site/resources/images/region_split_process.png differ