Posted to issues@hbase.apache.org by "Saad Mufti (JIRA)" <ji...@apache.org> on 2018/03/11 17:52:00 UTC

[jira] [Comment Edited] (HBASE-20045) When running compaction, cache recent blocks.

    [ https://issues.apache.org/jira/browse/HBASE-20045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16394587#comment-16394587 ] 

Saad Mufti edited comment on HBASE-20045 at 3/11/18 5:51 PM:
-------------------------------------------------------------

I hope I'm not interrupting the whole discussion, but I would like to describe a current use case for which this would be incredibly useful. We are running HBase on AWS EMR with all of the actual storage on S3, via AWS's proprietary EMRFS filesystem. We have configured our bucket cache to be more than large enough to hold all of our data, but S3 buys us failure recoverability: AWS EMR does not yet support more than a single master, so loss of the master means loss of the entire cluster, and having our data in S3 lets us survive a cluster failure cleanly.

We have a heavy read and write load, and performance is more than good enough when everything is served from the bucket cache (we have set the prefetch-on-open flag to true in the schema of every relevant column family, except for one that we write to heavily but never read from in the HBase cluster).
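
For reference, flipping that flag from the Java client looks roughly like the sketch below (HBase 1.x API; the table name "ids" and family "d" are placeholders, not our real schema):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.util.Bytes;

    public class EnablePrefetchOnOpen {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
          TableName table = TableName.valueOf("ids");                // placeholder
          HTableDescriptor desc = admin.getTableDescriptor(table);
          HColumnDescriptor cf = desc.getFamily(Bytes.toBytes("d")); // placeholder
          cf.setPrefetchBlocksOnOpen(true); // prefetch blocks into the cache when store files open
          admin.modifyColumn(table, cf);    // apply the schema change
        }
      }
    }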

Now we come to compaction, both minor and major. We have tuned the minor compaction min and max file sizes very low to avoid as much minor compaction as practical, and we run a homegrown tool that major-compacts the whole cluster in batches, each batch being one region per region server. In past HBase clusters where storage was on HDFS this served our needs well, and we'd like to keep the tool for its operational flexibility and because it lets us compact without taking any downtime. The problem, of course, is that compaction evicts from the bucket cache the blocks of the files being compacted away. Under ongoing traffic that does a lot of checkAndPut, the affected reads then go to S3 and are slow, causing the write lock for checkAndPut to be held for a long time, which in turn causes timeouts in other operations trying to take the same row lock. Our overall performance also suffers while a batch is running, because requests build up in the IPC queues. Mean response time and the other percentiles trace a sawtooth pattern; each batch of compactions lasts roughly 10 minutes, and the sawtooth has the same periodicity.
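
Stripped to its core, the tool does something like this (heavily simplified sketch against the HBase 1.x client API; the real tool also polls compaction state and waits between batches):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HRegionLocation;
    import org.apache.hadoop.hbase.ServerName;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.RegionLocator;

    public class BatchMajorCompact {
      public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin();
             RegionLocator locator = conn.getRegionLocator(TableName.valueOf("ids"))) {
          // Group the table's regions by the server currently hosting them.
          Map<ServerName, List<HRegionLocation>> byServer = new HashMap<>();
          for (HRegionLocation loc : locator.getAllRegionLocations()) {
            byServer.computeIfAbsent(loc.getServerName(), s -> new ArrayList<>()).add(loc);
          }
          // Each batch kicks off at most one major compaction per region server.
          boolean more = true;
          for (int batch = 0; more; batch++) {
            more = false;
            for (List<HRegionLocation> regions : byServer.values()) {
              if (batch < regions.size()) {
                admin.majorCompactRegion(regions.get(batch).getRegionInfo().getRegionName());
                more = true;
              }
            }
            // Real tool: poll admin.getCompactionStateForRegion(...) here and
            // wait until the whole batch finishes before starting the next one.
          }
        }
      }
    }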

For now we have worked around this by a) using the new setting in HBase 1.4.0 in our client to keep one slow region server from blocking all client ops to other region servers, and b) accepting some timeouts as a cost of doing business and requeuing the failed operations on a special Kafka retry topic for our upstream system to reprocess. The retries also limit traffic to the slow region server, letting it drain its backed-up IPC queues instead of being hammered with traffic that never lets it recover. And since we run the tool once a day and it finishes in 8-10 hours, performance is great for the rest of the day.
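
For the curious, the 1.4.0 setting I mean is, if I remember the key correctly, hbase.client.perserver.requests.threshold (from HBASE-16388). It caps how many concurrent requests the client keeps in flight to any single region server:

    // Client-side cap on concurrent requests per region server, so one slow
    // server can't tie up every client thread. The value 100 is illustrative.
    Configuration conf = HBaseConfiguration.create();
    conf.setInt("hbase.client.perserver.requests.threshold", 100);
    Connection conn = ConnectionFactory.createConnection(conf);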

But if we had a setting that let us tell HBase to always cache blocks even for newly compacted files, our performance hit would go away entirely. I see the argument above about not bothering to cache a block if all its cells are weeks old. In our case the data is advertising identifiers and reads can arrive unpredictably, and like I said we have a big enough bucket cache anyway, so why not just cache everything? The old blocks from the compacted-away files get evicted anyway, so we should never run out of bucket cache as long as it is sized much larger than our entire data set.
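
To be concrete, I'm imagining a knob along these lines (purely hypothetical sketch; the property key below is made up and does not exist in HBase today):

    // Hypothetical table-level flag; "CACHE_BLOCKS_ON_COMPACTION" is an
    // invented key, shown only to illustrate the shape of the configuration.
    TableName table = TableName.valueOf("ids");
    HTableDescriptor desc = admin.getTableDescriptor(table);
    desc.setValue("CACHE_BLOCKS_ON_COMPACTION", "true");
    admin.modifyTable(table, desc);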

And of course the default behavior would stay as it is; any new configuration around this could come with heavy warnings about the potential consequences, be marked as advanced-users-only, and so on.

Again, I apologize for the long description if it distracts from the current state of the discussion. Even a setting that only caches newly compacted blocks when they contain "new" cells would still be hugely beneficial to us.

Cheers.

 



> When running compaction, cache recent blocks.
> ---------------------------------------------
>
>                 Key: HBASE-20045
>                 URL: https://issues.apache.org/jira/browse/HBASE-20045
>             Project: HBase
>          Issue Type: New Feature
>          Components: BlockCache, Compaction
>    Affects Versions: 2.0.0-beta-1
>            Reporter: Jean-Marc Spaggiari
>            Priority: Major
>
> HBase already allows caching blocks on flush. This is very useful for use cases where most queries are against recent data. However, as soon as there is a compaction, those blocks are evicted. It would be interesting to have a table-level parameter to say "when compacting, cache blocks less than 24 hours old". That way, when compaction runs, every block holding some data less than 24 hours old would be cached automatically.
>  
> Very useful for table designs where there is a TS in the key but a long history (like a year of sensor data).
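
A minimal sketch of the per-block check that proposal implies (illustrative only; this helper and its names are made up, not HBase code):

    // Cache a compaction-written block only if its newest cell falls inside
    // the configured window. Hypothetical helper, not an actual HBase method.
    static boolean shouldCacheCompactedBlock(long newestCellTsInBlock, long maxAgeMillis) {
      return System.currentTimeMillis() - newestCellTsInBlock <= maxAgeMillis;
    }
    // e.g. maxAgeMillis = 24L * 60 * 60 * 1000 for "less than 24 hours old"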



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)