Posted to commits@hbase.apache.org by st...@apache.org on 2018/04/19 23:20:04 UTC

hbase git commit: HBASE-20059 Make sure documentation is updated for the offheap Bucket cache usage

Repository: hbase
Updated Branches:
  refs/heads/master 7fc6e33be -> 70377babd


HBASE-20059 Make sure documentation is updated for the offheap Bucket cache usage


Project: http://git-wip-us.apache.org/repos/asf/hbase/repo
Commit: http://git-wip-us.apache.org/repos/asf/hbase/commit/70377bab
Tree: http://git-wip-us.apache.org/repos/asf/hbase/tree/70377bab
Diff: http://git-wip-us.apache.org/repos/asf/hbase/diff/70377bab

Branch: refs/heads/master
Commit: 70377babd0baaf88da153d1c24dba9cff0682825
Parents: 7fc6e33
Author: Michael Stack <st...@apache.org>
Authored: Thu Apr 19 16:19:53 2018 -0700
Committer: Michael Stack <st...@apache.org>
Committed: Thu Apr 19 16:19:53 2018 -0700

----------------------------------------------------------------------
 conf/hbase-env.sh                             |   3 +-
 src/main/asciidoc/_chapters/architecture.adoc | 140 +++++++++++++++------
 2 files changed, 105 insertions(+), 38 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/hbase/blob/70377bab/conf/hbase-env.sh
----------------------------------------------------------------------
diff --git a/conf/hbase-env.sh b/conf/hbase-env.sh
index 1ac93cc..c2bf09c 100644
--- a/conf/hbase-env.sh
+++ b/conf/hbase-env.sh
@@ -34,7 +34,8 @@
 # export HBASE_HEAPSIZE=1G
 
 # Uncomment below if you intend to use off heap cache. For example, to allocate 8G of 
-# offheap, set the value to "8G".
+# offheap, set the value to "8G". See http://hbase.apache.org/book.html#direct.memory
+# in the refguide for guidance on setting this config.
 # export HBASE_OFFHEAPSIZE=1G
 
 # Extra Java runtime options.

http://git-wip-us.apache.org/repos/asf/hbase/blob/70377bab/src/main/asciidoc/_chapters/architecture.adoc
----------------------------------------------------------------------
diff --git a/src/main/asciidoc/_chapters/architecture.adoc b/src/main/asciidoc/_chapters/architecture.adoc
index 8d0a5b0..d5117db 100644
--- a/src/main/asciidoc/_chapters/architecture.adoc
+++ b/src/main/asciidoc/_chapters/architecture.adoc
@@ -643,44 +643,34 @@ Documentation will eventually move to this reference guide, but the blog is the
 [[block.cache]]
 === Block Cache
 
-HBase provides two different BlockCache implementations: the default on-heap `LruBlockCache` and the `BucketCache`, which is (usually) off-heap.
-This section discusses benefits and drawbacks of each implementation, how to choose the appropriate option, and configuration options for each.
+HBase provides two different BlockCache implementations to cache data read from HDFS:
+the default on-heap `LruBlockCache` and the `BucketCache`, which is (usually) off-heap.
+This section discusses benefits and drawbacks of each implementation, how to choose the
+appropriate option, and configuration options for each.
 
 .Block Cache Reporting: UI
 [NOTE]
 ====
 See the RegionServer UI for detail on caching deploy.
-Since HBase 0.98.4, the Block Cache detail has been significantly extended showing configurations, sizings, current usage, time-in-the-cache, and even detail on block counts and types.
+It shows configurations, sizings, current usage, time-in-the-cache, and even detail on block counts and types.
 ====
 
 ==== Cache Choices
 
-`LruBlockCache` is the original implementation, and is entirely within the Java heap. `BucketCache` is mainly intended for keeping block cache data off-heap, although `BucketCache` can also keep data on-heap and serve from a file-backed cache.
+`LruBlockCache` is the original implementation, and is entirely within the Java heap.
+`BucketCache` is optional and mainly intended for keeping block cache data off-heap, although `BucketCache` can also be a file-backed cache.
 
-.BucketCache is production ready as of HBase 0.98.6
-[NOTE]
-====
-To run with BucketCache, you need HBASE-11678.
-This was included in 0.98.6.
-====
-
-Fetching will always be slower when fetching from BucketCache, as compared to the native on-heap LruBlockCache.
-However, latencies tend to be less erratic across time, because there is less garbage collection when you use BucketCache since it is managing BlockCache allocations, not the GC.
-If the BucketCache is deployed in off-heap mode, this memory is not managed by the GC at all.
-This is why you'd use BucketCache, so your latencies are less erratic and to mitigate GCs and heap fragmentation.
-See Nick Dimiduk's link:http://www.n10k.com/blog/blockcache-101/[BlockCache 101] for comparisons running on-heap vs off-heap tests.
-Also see link:https://people.apache.org/~stack/bc/[Comparing BlockCache Deploys] which finds that if your dataset fits inside your LruBlockCache deploy, use it otherwise if you are experiencing cache churn (or you want your cache to exist beyond the vagaries of java GC), use BucketCache.
-
-When you enable BucketCache, you are enabling a two tier caching system, an L1 cache which is implemented by an instance of LruBlockCache and an off-heap L2 cache which is implemented by BucketCache.
+When you enable BucketCache, you are enabling a two-tier caching system. We used to describe the
+tiers as "L1" and "L2" but have deprecated this terminology as of hbase-2.0.0. The "L1" cache referred to an
+instance of LruBlockCache and "L2" to an off-heap BucketCache. Instead, when BucketCache is enabled,
+all DATA blocks are kept in the BucketCache tier and meta blocks -- INDEX and BLOOM blocks -- are on-heap in the `LruBlockCache`.
 Management of these two tiers and the policy that dictates how blocks move between them is done by `CombinedBlockCache`.
-It keeps all DATA blocks in the L2 BucketCache and meta blocks -- INDEX and BLOOM blocks -- on-heap in the L1 `LruBlockCache`.
-See <<offheap.blockcache>> for more detail on going off-heap.
 
 [[cache.configurations]]
 ==== General Cache Configurations
 
 Apart from the cache implementation itself, you can set some general configuration options to control how the cache performs.
-See https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/io/hfile/CacheConfig.html.
+See link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/io/hfile/CacheConfig.html[CacheConfig].
 After setting any of these options, restart or rolling restart your cluster for the configuration to take effect.
 Check logs for errors or unexpected behavior.
 
@@ -729,13 +719,13 @@ The way to calculate how much memory is available in HBase for caching is:
 number of region servers * heap size * hfile.block.cache.size * 0.99
 ----
 
-The default value for the block cache is 0.25 which represents 25% of the available heap.
+The default value for the block cache is 0.4, which represents 40% of the available heap.
 The last value (99%) is the default acceptable loading factor in the LRU cache after which eviction is started.
 The reason it is included in this equation is that it would be unrealistic to say that it is possible to use 100% of the available memory since this would make the process blocking from the point where it loads new blocks.
 Here are some examples:
 
-* One region server with the heap size set to 1 GB and the default block cache size will have 253 MB of block cache available.
-* 20 region servers with the heap size set to 8 GB and a default block cache size will have 39.6 of block cache.
+* One region server with the heap size set to 1 GB and the default block cache size will have 405 MB of block cache available.
+* 20 region servers with the heap size set to 8 GB and a default block cache size will have 63.3 GB of block cache.
 * 100 region servers with the heap size set to 24 GB and a block cache size of 0.5 will have about 1.16 TB of block cache.
 
 Your data is not the only resident of the block cache.
@@ -789,32 +779,59 @@ Since link:https://issues.apache.org/jira/browse/HBASE-4683[HBASE-4683 Always ca
 [[enable.bucketcache]]
 ===== How to Enable BucketCache
 
-The usual deploy of BucketCache is via a managing class that sets up two caching tiers: an L1 on-heap cache implemented by LruBlockCache and a second L2 cache implemented with BucketCache.
+The usual deploy of BucketCache is via a managing class that sets up two caching tiers:
+an on-heap cache implemented by LruBlockCache and a second cache implemented with BucketCache.
 The managing class is link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/io/hfile/CombinedBlockCache.html[CombinedBlockCache] by default.
 The previous link describes the caching 'policy' implemented by CombinedBlockCache.
-In short, it works by keeping meta blocks -- INDEX and BLOOM in the L1, on-heap LruBlockCache tier -- and DATA blocks are kept in the L2, BucketCache tier.
-It is possible to amend this behavior in HBase since version 1.0 and ask that a column family have both its meta and DATA blocks hosted on-heap in the L1 tier by setting `cacheDataInL1` via `(HColumnDescriptor.setCacheDataInL1(true)` or in the shell, creating or amending column families setting `CACHE_DATA_IN_L1` to true: e.g.
+In short, it works by keeping meta blocks -- INDEX and BLOOM blocks -- in the on-heap LruBlockCache tier, and DATA blocks in the BucketCache tier.
+
+====
+Pre-hbase-2.0.0 versions::
+Fetching will always be slower when fetching from BucketCache in pre-hbase-2.0.0,
+as compared to the native on-heap LruBlockCache. However, latencies tend to be less
+erratic across time, because there is less garbage collection when you use BucketCache since it is managing BlockCache allocations, not the GC.
+If the BucketCache is deployed in off-heap mode, this memory is not managed by the GC at all.
+This is why you'd use BucketCache in pre-2.0.0, so your latencies are less erratic,
+to mitigate GCs and heap fragmentation, and so you can safely use more memory.
+See Nick Dimiduk's link:http://www.n10k.com/blog/blockcache-101/[BlockCache 101] for comparisons running on-heap vs off-heap tests.
+Also see link:https://people.apache.org/~stack/bc/[Comparing BlockCache Deploys], which finds that if your dataset fits inside your LruBlockCache deploy, use it; otherwise, if you are experiencing cache churn (or you want your cache to exist beyond the vagaries of Java GC), use BucketCache.
++
+In pre-2.0.0,
+one can configure the BucketCache so it receives the `victim` of an LruBlockCache eviction.
+All DATA and index blocks are cached in L1 first. When eviction happens from L1, the blocks (or `victims`) will get moved to L2.
+Set `cacheDataInL1` via `HColumnDescriptor.setCacheDataInL1(true)` or, in the shell, by creating or amending column families and setting `CACHE_DATA_IN_L1` to true, e.g.:
 [source]
 ----
 hbase(main):003:0> create 't', {NAME => 't', CONFIGURATION => {CACHE_DATA_IN_L1 => 'true'}}
 ----
 
-The BucketCache Block Cache can be deployed on-heap, off-heap, or file based.
+hbase-2.0.0+ versions::
+HBASE-11425 changed the HBase read path so it could hold the read data off-heap, avoiding copying of cached data onto the Java heap.
+See <<regionserver.offheap.readpath>>. In hbase-2.0.0, off-heap latencies approach those of on-heap cache latencies with the added
+benefit of NOT provoking GC.
++
+From HBase 2.0.0 onwards, the notions of L1 and L2 have been deprecated. When BucketCache is turned on, the DATA blocks will always go to BucketCache and INDEX/BLOOM blocks go to the on-heap LruBlockCache. `cacheDataInL1` support has been removed.
+====
+
+The BucketCache Block Cache can be deployed in _offheap_, _file_, or _mmapped_ file mode.
+
 You set which via the `hbase.bucketcache.ioengine` setting.
-Setting it to `heap` will have BucketCache deployed inside the allocated Java heap.
-Setting it to `offheap` will have BucketCache make its allocations off-heap, and an ioengine setting of `file:PATH_TO_FILE` will direct BucketCache to use a file caching (Useful in particular if you have some fast I/O attached to the box such as SSDs).
+Setting it to `offheap` will have BucketCache make its allocations off-heap, and an ioengine setting of `file:PATH_TO_FILE` will direct BucketCache to use file caching (useful in particular if you have some fast I/O attached to the box, such as SSDs). From 2.0.0, it is possible to have more than one file backing the BucketCache. This is particularly useful when the cache size requirement is high. For multiple backing files, configure the ioengine as `files:PATH_TO_FILE1,PATH_TO_FILE2,PATH_TO_FILE3`. BucketCache can also be configured to use an mmapped file. Configure the ioengine as `mmap:PATH_TO_FILE` for this.
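+
+As a hedged sketch of a multi-file deploy (the paths and size are hypothetical, for illustration only):
+[source,xml]
+----
+<!-- Back the BucketCache with two files on (hypothetical) SSD mounts. -->
+<property>
+  <name>hbase.bucketcache.ioengine</name>
+  <value>files:/mnt/ssd1/bucketcache,/mnt/ssd2/bucketcache</value>
+</property>
+<!-- Total cache capacity; a plain number is read as megabytes. -->
+<property>
+  <name>hbase.bucketcache.size</name>
+  <value>8192</value>
+</property>
+----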
 
-It is possible to deploy an L1+L2 setup where we bypass the CombinedBlockCache policy and have BucketCache working as a strict L2 cache to the L1 LruBlockCache.
-For such a setup, set `CacheConfig.BUCKET_CACHE_COMBINED_KEY` to `false`.
+It is possible to deploy a tiered setup where we bypass the CombinedBlockCache policy and have BucketCache working as a strict L2 cache to the L1 LruBlockCache.
+For such a setup, set `hbase.bucketcache.combinedcache.enabled` to `false`.
 In this mode, on eviction from L1, blocks go to L2.
 When a block is cached, it is cached first in L1.
 When we go to look for a cached block, we look first in L1 and if none found, then search L2.
 Let us call this deploy format, _Raw L1+L2_.
+
+NOTE: This L1+L2 mode is removed in hbase-2.0.0. When BucketCache is used, it will be strictly the DATA cache and the LruBlockCache will cache INDEX/META blocks.
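+
+As a sketch for pre-2.0.0 deploys only (the mode is removed in 2.0.0, per the note above), the _Raw L1+L2_ setup comes down to one _hbase-site.xml_ property:
+[source,xml]
+----
+<!-- Bypass the CombinedBlockCache policy; BucketCache then acts as a strict L2 victim cache. -->
+<property>
+  <name>hbase.bucketcache.combinedcache.enabled</name>
+  <value>false</value>
+</property>
+----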
 
 Other BucketCache configs include: specifying a location to persist cache to across restarts, how many threads to use writing the cache, etc.
 See the link:https://hbase.apache.org/devapidocs/org/apache/hadoop/hbase/io/hfile/CacheConfig.html[CacheConfig.html] class for configuration options and descriptions.
 
-
+To check that it is enabled, look for the log line describing cache setup; it will detail how BucketCache has been deployed.
+Also see the UI. It will detail the cache tiering and its configuration.
 
 ====== BucketCache Example Configuration
 This sample provides a configuration for a 4 GB off-heap BucketCache with a 1 GB on-heap cache.
@@ -876,9 +893,10 @@ The following example configures buckets of size 4096 and 8192.
 [NOTE]
 ====
 The default maximum direct memory varies by JVM.
-Traditionally it is 64M or some relation to allocated heap size (-Xmx) or no limit at all (JDK7 apparently). HBase servers use direct memory, in particular short-circuit reading, the hosted DFSClient will allocate direct memory buffers.
+Traditionally it is 64M or some relation to allocated heap size (-Xmx) or no limit at all (JDK7 apparently). HBase servers use direct memory, in particular for short-circuit reading (see <<perf.hdfs.configs.localread>>); the hosted DFSClient will allocate direct memory buffers. How much the DFSClient uses is not easy to quantify; it is the number of open HFiles * `hbase.dfs.client.read.shortcircuit.buffer.size`, where `hbase.dfs.client.read.shortcircuit.buffer.size` is set to 128k in HBase -- see _hbase-default.xml_ default configurations.
 If you do off-heap block caching, you'll be making use of direct memory.
-Starting your JVM, make sure the `-XX:MaxDirectMemorySize` setting in _conf/hbase-env.sh_ is set to some value that is higher than what you have allocated to your off-heap BlockCache (`hbase.bucketcache.size`). It should be larger than your off-heap block cache and then some for DFSClient usage (How much the DFSClient uses is not easy to quantify; it is the number of open HFiles * `hbase.dfs.client.read.shortcircuit.buffer.size` where `hbase.dfs.client.read.shortcircuit.buffer.size` is set to 128k in HBase -- see _hbase-default.xml_ default configurations). Direct memory, which is part of the Java process heap, is separate from the object heap allocated by -Xmx.
+The RPCServer uses a ByteBuffer pool. From 2.0.0, these buffers are off-heap ByteBuffers.
+Starting your JVM, make sure the `-XX:MaxDirectMemorySize` setting in _conf/hbase-env.sh_ considers the off-heap BlockCache (`hbase.bucketcache.size`), DFSClient usage, and the RPC-side ByteBufferPool maximum size. It has to be a bit higher than the sum of the off-heap BlockCache size and the maximum ByteBufferPool size. Allocating an extra 1-2 GB for the max direct memory size has worked in tests. Direct memory is part of the Java process memory but is separate from the object heap allocated by -Xmx.
 The value allocated by `MaxDirectMemorySize` must not exceed physical RAM, and is likely to be less than the total available RAM due to other memory requirements and system constraints.
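+
+As a worked sketch of the arithmetic only (the sizes are illustrative assumptions, not recommendations): with a 4 GB off-heap BucketCache and roughly 2 GB budgeted for DFSClient and RPC-side ByteBufferPool direct buffers, plus the 1-2 GB of slack mentioned above, the direct memory ceiling set in _conf/hbase-env.sh_ (via `HBASE_OFFHEAPSIZE`, which the stock scripts typically translate into `-XX:MaxDirectMemorySize`) would land around:
+[source]
+----
+# Example values only: 4G BucketCache + ~2G DFSClient/ByteBufferPool + ~2G slack
+export HBASE_OFFHEAPSIZE=8G
+----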
 
 You can see how much memory -- on-heap and off-heap/direct -- a RegionServer is configured to use and how much it is using at any one time by looking at the _Server Metrics: Memory_ tab in the UI.
@@ -898,7 +916,7 @@ If the deploy was using CombinedBlockCache, then the LruBlockCache L1 size was c
 where size-of-bucket-cache itself is EITHER the value of the configuration `hbase.bucketcache.size` IF it was specified as Megabytes OR `hbase.bucketcache.size` * `-XX:MaxDirectMemorySize` if `hbase.bucketcache.size` is between 0 and 1.0.
 
 In 1.0, it should be more straight-forward.
-L1 LruBlockCache size is set as a fraction of java heap using `hfile.block.cache.size setting` (not the best name) and L2 is set as above either in absolute Megabytes or as a fraction of allocated maximum direct memory.
+Onheap LruBlockCache size is set as a fraction of the Java heap using the `hfile.block.cache.size` setting (not the best name) and BucketCache is set as above, in absolute megabytes.
 ====
 
 ==== Compressed BlockCache
@@ -911,6 +929,54 @@ For a RegionServer hosting data that can comfortably fit into cache, or if your
 
 The compressed BlockCache is disabled by default. To enable it, set `hbase.block.data.cachecompressed` to `true` in _hbase-site.xml_ on all RegionServers.
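+
+As a one-property sketch of enabling it in _hbase-site.xml_ (hedged; shown for illustration):
+[source,xml]
+----
+<property>
+  <name>hbase.block.data.cachecompressed</name>
+  <value>true</value>
+</property>
+----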
 
+[[regionserver.offheap]]
+=== RegionServer Offheap Read/Write Path
+
+[[regionserver.offheap.readpath]]
+==== Offheap read-path
+In hbase-2.0.0, link:https://issues.apache.org/jira/browse/HBASE-11425[HBASE-11425] changed the HBase read path so it
+could hold the read data off-heap, avoiding copying of cached data onto the Java heap.
+This reduces GC pauses given there is less garbage made and so less to clear. The off-heap read path has performance
+that is similar to, or better than, that of the on-heap LRU cache.
+If the BucketCache is in `file` mode, fetching will always be slower compared to the native on-heap LruBlockCache.
+Refer to the blogs below for more details and test results on the off-heap read path:
+link:https://blogs.apache.org/hbase/entry/offheaping_the_read_path_in[Offheaping the Read Path in Apache HBase: Part 1 of 2]
+and link:https://blogs.apache.org/hbase/entry/offheap-read-path-in-production[Offheap Read-Path in Production - The Alibaba story].
+
+For an end-to-end off-heap read path, first of all there should be an off-heap backed <<offheap.blockcache>> (BC). Configure `hbase.bucketcache.ioengine` to `offheap` in
+_hbase-site.xml_. Also specify the total capacity of the BC using the `hbase.bucketcache.size` config. Remember to adjust the value of `HBASE_OFFHEAPSIZE` in
+_hbase-env.sh_. This is how we specify the maximum possible off-heap memory allocation for the
+RegionServer Java process. It should be bigger than the off-heap BC size. Keep in mind that there is no default for `hbase.bucketcache.ioengine`,
+which means the BC is turned OFF by default (see <<direct.memory>>).
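+
+As a hedged sketch of the block cache part of such a deploy (the 4096 MB / 5G figures are illustrative assumptions, not recommendations):
+[source,xml]
+----
+<!-- hbase-site.xml: off-heap BucketCache of 4096 MB (example value) -->
+<property>
+  <name>hbase.bucketcache.ioengine</name>
+  <value>offheap</value>
+</property>
+<property>
+  <name>hbase.bucketcache.size</name>
+  <value>4096</value>
+</property>
+----
+In _hbase-env.sh_, `HBASE_OFFHEAPSIZE` would then be set somewhat larger than this (for example `5G`) so the RegionServer process has room for the cache plus the other direct-memory users described below.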
+
+The next thing to tune is the ByteBuffer pool on the RPC server side.
+The buffers from this pool will be used to accumulate the cell bytes and create a result cell block to send back to the client side.
+`hbase.ipc.server.reservoir.enabled` can be used to turn this pool ON or OFF. By default this pool is ON and available. HBase will create off-heap ByteBuffers
+and pool them. Please make sure not to turn this OFF if you want an end-to-end off-heap read path.
+If this pool is turned off, the server will create temp buffers on heap to accumulate the cell bytes and make a result cell block. This can impact the GC on a highly read-loaded server.
+The user can tune this pool with respect to how many buffers are in the pool and the size of each ByteBuffer.
+Use the config `hbase.ipc.server.reservoir.initial.buffer.size` to tune each of the buffer sizes. The default is 64 KB.
+
+When the read pattern is a random row read load and each row is smaller than this 64 KB, try reducing this.
+When the result size is larger than one ByteBuffer size, the server will try to grab more than one buffer and make a result cell block out of these. When the pool is running out of buffers, the server will end up creating temporary on-heap buffers.
+
+The maximum number of ByteBuffers in the pool can be tuned using the config `hbase.ipc.server.reservoir.initial.max`. Its value defaults to 64 * the number of region server handlers configured (see the config `hbase.regionserver.handler.count`). The math is such that by default we consider 2 MB as the result cell block size per read result and each handler will be handling a read. For 2 MB size, we need 32 buffers, each of size 64 KB (see the default buffer size in the pool). So per handler, 32 ByteBuffers (BB). We allocate twice this size as the max BB count so that one handler can be creating the response and handing it to the RPC Responder thread, then handling a new request and creating a new response cell block (using pooled buffers). Even if the responder could not send back the first TCP reply immediately, our count should ensure we still have enough buffers in our pool without having to make temporary buffers on the heap. Again, for smaller sized random row reads, tune this max count. These are lazily created buffers and the count is the max count to be pooled.
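+
+As a hedged sketch of these two knobs for a small-row, random-read workload (the values are illustrative assumptions, not defaults or recommendations):
+[source,xml]
+----
+<!-- Shrink each pooled ByteBuffer from the 64 KB default when result rows are much smaller. -->
+<property>
+  <name>hbase.ipc.server.reservoir.initial.buffer.size</name>
+  <value>16384</value>
+</property>
+<!-- Cap the number of pooled buffers; by default it works out to 2 MB / 64 KB * 2 = 64 per handler. -->
+<property>
+  <name>hbase.ipc.server.reservoir.initial.max</name>
+  <value>1024</value>
+</property>
+----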
+
+If you still see GC issues even after making the end-to-end read path off-heap, look for issues in the appropriate buffer pool. Check for the below RegionServer log line at INFO level:
+[source]
+----
+Pool already reached its max capacity : XXX and no free buffers now. Consider increasing the value for 'hbase.ipc.server.reservoir.initial.max' ?
+----
+
+The setting for _HBASE_OFFHEAPSIZE_ in _hbase-env.sh_ should consider this off-heap buffer pool on the RPC side also. Configure this maximum off-heap size for the RegionServer to be a bit higher than the sum of this max pool size and the off-heap cache size. The TCP layer will also need to create direct ByteBuffers for TCP communication. Also, the DFS client will need some off-heap memory for its workings, especially if short-circuit reads are configured. Allocating an extra 1-2 GB for the max direct memory size has worked in tests.
+
+If you are using coprocessors and reference the Cells in the read results, DO NOT store references to these Cells outside the scope of the CP hook methods. Sometimes the CPs need to store info about a cell (like its row key) for consideration in a later CP hook call, etc. For such cases, clone the required fields of the Cell, or the entire Cell, as the use case requires. [ See the CellUtil#cloneXXX(Cell) APIs ]
+
+[[regionserver.offheap.writepath]]
+==== Offheap write-path
+
+TODO
+
 [[regionserver_splitting_implementation]]
 === RegionServer Splitting Implementation