Posted to issues@hbase.apache.org by "Marc Limotte (JIRA)" <ji...@apache.org> on 2011/03/10 20:28:59 UTC

[jira] Commented: (HBASE-3551) Loaded hfile indexes occupy a good chunk of heap; look into shrinking the amount used and/or evicting unused indices

    [ https://issues.apache.org/jira/browse/HBASE-3551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005272#comment-13005272 ] 

Marc Limotte commented on HBASE-3551:
-------------------------------------

I understand this better now.  I did some poking around with the HFile tool.  Average key length does seem to be around 150 bytes, as I estimated.
 
For one hfile /hbase/foo/fb820ae7002fc96f78165802a0b05e63/metrics/14129209576094096, metadata is:

avgKeyLen=159, avgValueLen=7, entries=49285512, length=615516343
fileinfoOffset=592314718, dataIndexOffset=592315104, dataIndexCount=131869, metaIndexOffset=0, metaIndexCount=0, totalBytes=8653853680, entryCount=49285512, version=1

Size of index = length - dataIndexOffset = 615516343 - 592315104 = 23201239 bytes, or about 22 MB

Index data per Region Server = 22 MB * 180 regions = almost 4 GB.  Adding the other column family, this does seem to account for the 5 to 6 GB of heap we are seeing.

# of entries per dataindex entry = 49285512 / 131869 = 374
Multiplied by the average key size (159 bytes for this file), that is about 59k of key data per index entry, which is close to the 64k block size.  So the numbers seem to make sense.
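
For anyone who wants to repeat this estimate on their own files, here is a minimal Python sketch of the same arithmetic.  The numbers are copied from the metadata above; the variable names are my own, and the 180 regions figure is the count used in the estimate, not something read from the file.  Note that length - dataIndexOffset also includes the small file trailer, so it slightly overstates the index itself.

```python
# Figures copied from the HFile metadata quoted above.
length = 615_516_343
data_index_offset = 592_315_104
data_index_count = 131_869
entries = 49_285_512
avg_key_len = 159

# Region count used in the estimate above (not part of the file metadata).
REGIONS_PER_SERVER = 180

MIB = 1024 * 1024
GIB = 1024 * MIB

# Everything from the start of the data index to the end of the file.
index_bytes = length - data_index_offset
index_mib = index_bytes / MIB

# Heap needed if every region holds one file with an index of this size.
per_server_gib = index_bytes * REGIONS_PER_SERVER / GIB

# How many entries each index entry (one per block) covers, and how many
# bytes of key data that represents.
entries_per_block = entries / data_index_count
key_bytes_per_block = entries_per_block * avg_key_len

print(f"index size:          {index_mib:.1f} MiB")
print(f"per region server:   {per_server_gib:.2f} GiB")
print(f"entries per block:   {entries_per_block:.0f}")
print(f"key bytes per block: {key_bytes_per_block:.0f}")
```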

I also looked at the keyvalue pairs using the HFile tool (a section of output is below).

We have a few billion rows (2 - 4 billion).  I haven't done a full row count.

What I didn't understand previously is that it's not 374 rows, but 374 "entries".  An entry means a single column entry and the key is repeated for each column value.  Given our fairly large key, that would add up quickly.
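
To make the "entries, not rows" point concrete, here is a rough Python sketch of how key bytes accumulate for a single row.  It assumes the key layout as I understand it for this HBase generation (2-byte row length, row, 1-byte family length, family, qualifier, 8-byte timestamp, 1-byte type); treat that layout as an assumption to check against your version.  The row key and qualifiers are made up for illustration, and the family name is what the path above appears to show.

```python
def key_length(row: bytes, family: bytes, qualifier: bytes) -> int:
    """Bytes in the key part of one entry, under the assumed layout."""
    return (
        2 + len(row)        # 2-byte row length, then the row key
        + 1 + len(family)   # 1-byte family length, then the family name
        + len(qualifier)    # column qualifier
        + 8                 # timestamp
        + 1                 # key type
    )

# Hypothetical row key and columns, for illustration only.
row = b"some/long/string/row/key" * 5   # 120 bytes
family = b"metrics"
qualifiers = [b"clicks", b"impressions", b"cost"]

# One entry per column, and each entry carries the full row key again.
per_entry = [key_length(row, family, q) for q in qualifiers]
total_key_bytes = sum(per_entry)

print(per_entry, total_key_bytes)
```

With three columns the row key is stored three times, so the key bytes for the row are roughly three times the key size of a single entry.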

Solutions
1) Increase the HBase block size (I did this and it resolved our situation for now)
2) Modify our schema to use smaller keys - perhaps IDs instead of string names.
3) Modify our schema to have fewer columns - we could combine several related columns into one compound value.
4) Add an LRU cache for storefile indexes
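
For option 1, a rough Python sketch of how the index for this file might shrink as the block size grows.  It rests on two assumptions of mine: that totalBytes is the uncompressed data size and blocks are cut at roughly the block size of uncompressed data, and that the bytes per index entry stay about what they are in this file.  It is a back-of-the-envelope model, not a description of how HBase computes anything.

```python
# Figures from the metadata of the file examined above.
total_bytes = 8_653_853_680
index_bytes = 615_516_343 - 592_315_104
data_index_count = 131_869

# Observed cost of one index entry in this file (key plus bookkeeping).
bytes_per_index_entry = index_bytes / data_index_count


def projected_index_bytes(block_size: int) -> float:
    """Estimate the index size if the file were written with block_size.

    Assumes one index entry per block and roughly
    total_bytes / block_size blocks.
    """
    blocks = total_bytes / block_size
    return blocks * bytes_per_index_entry


for kib in (64, 128, 256, 512):
    size_mib = projected_index_bytes(kib * 1024) / (1024 * 1024)
    print(f"{kib:>4}k blocks -> about {size_mib:.1f} MiB of index")
```

At 64k the model lands within about 1% of the index size actually observed for this file, which suggests the assumptions are reasonable here.  The trade-off is that larger blocks mean more data read and scanned per random get.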

Given the other options, #4 may not be warranted, so I think we can close this issue.


> Loaded hfile indexes occupy a good chunk of heap; look into shrinking the amount used and/or evicting unused indices
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-3551
>                 URL: https://issues.apache.org/jira/browse/HBASE-3551
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: stack
>
> I hung with a user Marc and we were looking over configs and his cluster profile up on ec2.  One thing we noticed was that his 100+ 1G regions of two families had ~2.5G of heap resident.  We did a bit of math and couldn't get to 2.5G so that needs looking into.  Even still, 2.5G is a bunch of heap to give over to indices (He actually OOME'd when he had his RS heap set to just 3G; we shouldn't OOME, we should just run slower).  It sounds like he needs the indices loaded but still, for some cases we should drop indices for unaccessed files.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira