Posted to issues@hbase.apache.org by "Adrien Mogenet (JIRA)" <ji...@apache.org> on 2013/08/18 14:50:47 UTC
[jira] [Commented] (HBASE-9260) Timestamp Compactions
[ https://issues.apache.org/jira/browse/HBASE-9260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13743194#comment-13743194 ]
Adrien Mogenet commented on HBASE-9260:
---------------------------------------
As I wrote in the ticket, this is currently just a draft to gather comments and advice :)
I have started writing some code and running some tests to see whether it is relevant (or not).
> Timestamp Compactions
> ---------------------
>
> Key: HBASE-9260
> URL: https://issues.apache.org/jira/browse/HBASE-9260
> Project: HBase
> Issue Type: New Feature
> Components: Compaction
> Affects Versions: 0.94.10
> Reporter: Adrien Mogenet
> Priority: Minor
> Labels: features, performance
>
> h1.TSCompactions
> h2.The issue
> One of the biggest issues I currently deal with is compacting big
> stores, i.e. when an HBase cluster is 80% full on 4 TB nodes (let's say
> with a single big table), compactions might take several hours (from
> 15 to 20 in my case).
> In 'time series' workloads, we could avoid compacting everything
> every time. Think about OpenTSDB-like systems, or write-heavy,
> TTL-based workloads where you want to free space every day by deleting
> the oldest data, and you're not concerned about read latency (i.e. about
> reading from a single bigger StoreFile).
> > Note: in this draft, I currently consider that we get free space from
> > the TTL behavior only, not really from the Delete operations.
> h2.Proposal and benefits
> For such cases, StoreFiles could be organized and managed in a way
> that would compact:
> * recent StoreFiles with recent data
> * oldest StoreFiles that are subject to TTL eviction
> As a side benefit, this would also help when scanning with a timestamp criterion.
> h2.Configuration
> * {{hbase.hstore.compaction.sortByTS}} (boolean, default=false)
> This indicates whether the new behavior is enabled. Set it to
> {{false}} and compactions will remain the same as the current ones.
> * {{hbase.hstore.compaction.ts.bucketSize}} (integer)
> If `sortByTS` is enabled, this tells HBase the target size of
> buckets. The lower the value, the more StoreFiles you'll get, but you
> should save more I/O. Higher values will generate fewer StoreFiles, but
> these will be bigger and thus compactions could generate more
> I/O.
> h2.Examples
> Here is what a common store could look like after some flushes and
> perhaps some minor compactions:
> {noformat}
> ,---,  ,---,         ,---,
> |   |  |   |  ,---,  |   |
> |   |  |   |  |   |  |   |
> `---'  `---'  `---'  `---'
>  SF1    SF2    SF3    SF4
>  \__________ __________/
>             V
>  for all of these StoreFiles,
>  let's say minimum TS is 01/01/2013
>  and maximum TS is 31/03/2013
> {noformat}
> Set the bucket size to 1 month, and this is what we get after
> compaction:
> {noformat}
>        ,---,  ,---,
>        |   |  |   |
> ,---,  |   |  |   |
> |   |  |   |  |   |
> `---'  `---'  `---'
>  SF1    SF2    SF3
>       ,-----------------------------,
>       |  minimum TS  |  maximum TS  |
> ,-----------------------------------'
> | SF1 |  03/03/2013  |  31/03/2013  | most recent, growing
> | SF2 |  31/01/2013  |  02/03/2013  | old data, "sealed"
> | SF3 |  01/01/2013  |  30/01/2013  | oldest data, "sealed"
> '-----------------------------------'
> {noformat}
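
A minimal sketch of how timestamps could be mapped to fixed-width buckets, purely for illustration. The class `TsBuckets` and its methods are hypothetical and not part of HBase. The sketch assumes buckets are aligned on the epoch and measured in milliseconds, which the draft does not specify, so the bucket boundaries it produces need not match the dates in the table above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical helper, not an HBase class.
final class TsBuckets {
    private TsBuckets() {}

    /** Index of the epoch-aligned, fixed-width bucket that contains timestampMs. */
    static long bucketIndex(long timestampMs, long bucketSizeMs) {
        if (bucketSizeMs <= 0) {
            throw new IllegalArgumentException("bucketSizeMs must be positive");
        }
        // floorDiv keeps pre-epoch (negative) timestamps in the correct bucket.
        return Math.floorDiv(timestampMs, bucketSizeMs);
    }

    /** Groups timestamps by bucket index, most recent bucket first. */
    static SortedMap<Long, List<Long>> groupByBucket(List<Long> timestampsMs, long bucketSizeMs) {
        SortedMap<Long, List<Long>> buckets = new TreeMap<>(Comparator.<Long>reverseOrder());
        for (long ts : timestampsMs) {
            buckets.computeIfAbsent(bucketIndex(ts, bucketSizeMs), k -> new ArrayList<>()).add(ts);
        }
        return buckets;
    }
}
```

With such a grouping, only the first (most recent) bucket would keep receiving writes in a pure time-series workload; the others correspond to the "sealed" files.
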
> h2.StoreFile selection
> * for minor compactions, the current algorithm should already do the
> right job. Pick up the `n` oldest files that are small enough, and
> write a bigger file. Remember, TSCompactions are designed for time
> series, so this 'minor selection' should leave "sealed" big old
> files as they are.
> * for major compactions, when all the StoreFiles have been selected,
> apply the TTL first. StoreFiles that are entirely past the TTL simply
> don't need to be rewritten. They'll be deleted in one go,
> avoiding lots of I/O.
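
The "apply the TTL first" step for major compactions could be sketched as follows. This is only an illustration under stated assumptions: `FileInfo` is a hypothetical stand-in for a StoreFile's timestamp metadata (not an HBase class), and a file is treated as fully expired when its maximum timestamp is older than `now - ttl`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the HBase compaction API.
final class TtlFirstSelection {
    private TtlFirstSelection() {}

    /** Stand-in for the timestamp range tracked for one StoreFile. */
    static final class FileInfo {
        final String name;
        final long minTs;
        final long maxTs;

        FileInfo(String name, long minTs, long maxTs) {
            this.name = name;
            this.minTs = minTs;
            this.maxTs = maxTs;
        }
    }

    /** Result of the split: files to drop as a whole, files still to rewrite. */
    static final class Selection {
        final List<FileInfo> toDelete = new ArrayList<>();
        final List<FileInfo> toRewrite = new ArrayList<>();
    }

    static Selection select(List<FileInfo> files, long nowMs, long ttlMs) {
        long cutoff = nowMs - ttlMs; // anything older than this has expired (assumption)
        Selection selection = new Selection();
        for (FileInfo file : files) {
            if (file.maxTs < cutoff) {
                // Even the newest cell has expired: delete the file without reading it.
                selection.toDelete.add(file);
            } else {
                selection.toRewrite.add(file);
            }
        }
        return selection;
    }
}
```

Only the files left in `toRewrite` would then go through the usual read-and-rewrite path.
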
> h2.New issues and trade-offs
> 1. In that case ({{bucketSize=1 month}}), after 1+ year, we'll have lots
> of StoreFiles (and more generally after `n * bucketSize` seconds) if
> there is no TTL eviction. In any case, a clever threshold should be
> implemented to limit the maximum number of StoreFiles.
> 2. If we later add old data that matches the time range of a StoreFile
> which has already been compacted, this could generate lots of I/O
> to reconstruct a single StoreFile for this time bucket, perhaps just
> to merge a few rows.
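
To make the first trade-off concrete, here is a small sketch of the arithmetic: without TTL eviction, the number of buckets (and so roughly the number of "sealed" StoreFiles) grows with the elapsed time divided by the bucket size. `BucketGrowth` and `maxStoreFiles` are hypothetical names; the draft only says that some threshold should exist.

```java
// Hypothetical helper, not an HBase class.
final class BucketGrowth {
    private BucketGrowth() {}

    /** Number of buckets needed to cover elapsedSeconds, rounded up. */
    static long bucketsSpanned(long elapsedSeconds, long bucketSizeSeconds) {
        if (elapsedSeconds < 0 || bucketSizeSeconds <= 0) {
            throw new IllegalArgumentException("elapsed must be >= 0 and bucket size > 0");
        }
        long full = elapsedSeconds / bucketSizeSeconds;
        return (elapsedSeconds % bucketSizeSeconds == 0) ? full : full + 1;
    }

    /** True when the bucket count alone would exceed the configured limit. */
    static boolean exceedsLimit(long elapsedSeconds, long bucketSizeSeconds, long maxStoreFiles) {
        return bucketsSpanned(elapsedSeconds, bucketSizeSeconds) > maxStoreFiles;
    }
}
```

For example, with a 30-day bucket, one year (365 days) of data spans 13 buckets.
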