Posted to issues@hbase.apache.org by "Allan Yang (JIRA)" <ji...@apache.org> on 2018/11/05 11:45:00 UTC

[jira] [Commented] (HBASE-21436) Getting OOM frequently if hold many regions

    [ https://issues.apache.org/jira/browse/HBASE-21436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16675011#comment-16675011 ] 

Allan Yang commented on HBASE-21436:
------------------------------------

[~gzh1992n] is a colleague of mine. We found that some of our RS instances often abort because of OOM, and after a careful examination, we found that the design of Chunk in MSLAB may not be friendly to small-memory instances. The default size of a Chunk is 2MB, which means that once a region takes even a single write, its memstore occupies at least 2MB. If one RS hosts too many regions, it quickly reaches the heap size limit. Moreover, when reclaiming memory, we only count the size of the data the memstores hold, so none of the regions will be flushed before we reach OOM.
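To make the accounting gap concrete, here is a minimal sketch (plain Java; the 2MB constant matches the default chunk size described above, the other numbers are illustrative assumptions, not measurements from the cluster):

    // Sketch: per-region chunk overhead vs. the data size that the
    // flush accounting actually tracks. Numbers are illustrative.
    public class ChunkOverheadSketch {
        static final long CHUNK_SIZE = 2L * 1024 * 1024; // default 2MB chunk

        public static void main(String[] args) {
            int regionsWithWrites = 1700; // regions that each took a small write
            long bytesPerRegion = 128;    // tiny amount of real data per region

            long trackedDataSize = (long) regionsWithWrites * bytesPerRegion;
            long chunkBackedSize = (long) regionsWithWrites * CHUNK_SIZE;

            // tracked: ~0.2MB -- the size the flush check sees today
            System.out.println("tracked data size = " + trackedDataSize + " bytes");
            // actual : ~3.4GB -- what the heap really pays for the chunks
            System.out.println("chunk-backed size = " + chunkBackedSize + " bytes");
        }
    }

With these numbers the tracked size never gets anywhere near the high water mark, while the chunk-backed size alone nearly fills a 4G heap, which matches the OOM described in the issue below.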
I think we need to reconsider the check in isAboveHighWaterMark(): we should take the actual size of those chunks into account. Several GB of memory overhead may not be a big deal for servers with big memory, but for small-memory machines, this hurts.
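As a rough sketch of the direction only (the name and signature here are hypothetical, not the actual HBase API; the real change would live in the region server's flush accounting):

    // Hypothetical sketch: base the high-water-mark check on the memory
    // actually held by allocated chunks, not only on memstore data size.
    boolean isAboveHighWaterMark(long memstoreDataSize,
                                 long allocatedChunkBytes,
                                 long highWaterMarkBytes) {
        // Chunks are allocated in 2MB units, so on small heaps the
        // chunk-backed size can far exceed the data size; use whichever
        // view of memory usage is larger.
        long effectiveSize = Math.max(memstoreDataSize, allocatedChunkBytes);
        return effectiveSize >= highWaterMarkBytes;
    }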

>  Getting OOM frequently if hold many regions
> --------------------------------------------
>
>                 Key: HBASE-21436
>                 URL: https://issues.apache.org/jira/browse/HBASE-21436
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver
>    Affects Versions: 3.0.0, 1.4.8, 2.0.2
>            Reporter: Zephyr Guo
>            Priority: Major
>         Attachments: HBASE-21436-UT.patch
>
>
> Recently, some feedback reached me from a customer complaining about NotServingRegionException thrown at intervals. I examined his cluster and found quite a lot of OOM logs there, while throughput stayed at quite a low level. In this customer's case, each RS hosts 3k regions with a heap size of 4G. I dumped the heap when OOM took place, and found that a lot of Chunk objects (as many as 1700) were there.
> Eventually, piecing all this evidence together, I came to the conclusion that:
>  * The root cause is that the global flush is triggered by the size of all memstores, rather than the size of all chunks.
>  * A chunk is always allocated for each region, even if we only write a small amount of data to the region.
> And in this case, a total of 3.4G of memory was consumed by 1700 chunks, although throughput was very low.
>  Although 3k regions is too many for an RS with 4G of memory, it is still wise to improve RS stability in such a scenario (in fact, most customers buy a small-size HBase on the cloud side).
>   
>  I provide a patch (containing only a UT) that reproduces this case (it just sends a batch).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)