Posted to issues@hbase.apache.org by "Vladimir Rodionov (JIRA)" <ji...@apache.org> on 2015/10/01 23:28:26 UTC

[jira] [Updated] (HBASE-14477) Compaction improvements: Date tiered compaction policy

     [ https://issues.apache.org/jira/browse/HBASE-14477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vladimir Rodionov updated HBASE-14477:
--------------------------------------
    Summary: Compaction improvements: Date tiered compaction policy  (was: Compaction improvements: Generational compaction policy)

> Compaction improvements: Date tiered compaction policy
> ------------------------------------------------------
>
>                 Key: HBASE-14477
>                 URL: https://issues.apache.org/jira/browse/HBASE-14477
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Vladimir Rodionov
>            Assignee: Vladimir Rodionov
>             Fix For: 2.0.0
>
>
> For immutable and mostly immutable data, the current SizeTiered-based compaction policy is not efficient.
> # There is no need to compact all files into one, because the data is (mostly) immutable and there is no garbage to collect. (The performance implications are discussed below.)
> # Size-tiered compaction is not suitable for applications where the most recent data is the most important, and it prevents efficient caching of that data.
> The idea of the generational compaction policy is similar to DateTieredCompaction in Cassandra:
> # Memstore flushes create files of Gen0.
> # Only store files of the same generation can be compacted together.
> # Once the number of files in GenK reaches N (default: 5), they are compacted into one file of Gen(K+1).
> # Compaction stops at a predefined generation M (default: 3).
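
The four rules above can be sketched as a small simulation. This is only an illustration under stated assumptions, not the policy implemented in HBase: the names `flush_and_compact` and `simulate` are hypothetical, all files of a generation are assumed to be merged at once when the count reaches N, and because the data is treated as immutable the merged file is assumed to be exactly as large as its inputs.

```python
from typing import List


def flush_and_compact(generations: List[List[int]], flush_mb: int,
                      files_per_compaction: int = 5) -> None:
    """Add one flushed file to Gen0, then cascade compactions upward.

    generations[k] holds the sizes (in MB) of the files currently in GenK.
    The last list is the final generation M, which is never compacted.
    """
    generations[0].append(flush_mb)
    # Only generations below the last one are eligible for compaction.
    for k in range(len(generations) - 1):
        if len(generations[k]) >= files_per_compaction:
            # Assumption: immutable data, so merged size = sum of inputs.
            merged = sum(generations[k])
            generations[k].clear()
            generations[k + 1].append(merged)


def simulate(num_flushes: int, flush_mb: int = 30,
             files_per_compaction: int = 5,
             max_generation: int = 3) -> List[List[int]]:
    """Replay num_flushes memstore flushes and return the file sizes per generation."""
    generations: List[List[int]] = [[] for _ in range(max_generation + 1)]
    for _ in range(num_flushes):
        flush_and_compact(generations, flush_mb, files_per_compaction)
    return generations
```

For example, `simulate(7)` leaves two 30MB files in Gen0 and one 150MB file in Gen1, and files that reach Gen3 are never rewritten again.
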
> Some simple math. For simplicity, assume a flush size of 30MB.
> Gen0: 4*30MB = 120MB
> Gen1: 4*120MB = 480MB
> Gen2: 4*480MB = 1.92GB
> Gen3: R * 1.92GB (Gen3 by default is not compacted)
> With 3-4 files in Gen3 we get a total region size of 10-12GB, of which 10-20% (Gen0, Gen1 and most of Gen2) can be kept in the block cache.
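
The arithmetic above can be reproduced with a short sketch. It assumes one reading of the figures: each generation below the last holds 4 files, each file is 4 times the size of a file in the previous generation, and the last generation holds R files. The multiplier 4 is taken from the arithmetic itself rather than from the default N = 5, and the function name `region_layout` and its parameters are hypothetical.

```python
from typing import List, Tuple


def region_layout(flush_mb: int = 30, files_per_tier: int = 4,
                  max_generation: int = 3,
                  last_tier_files: int = 4) -> List[Tuple[int, int, int]]:
    """Return (file_count, file_size_mb, tier_total_mb) for Gen0..GenM."""
    layout = []
    file_size = flush_mb
    for gen in range(max_generation + 1):
        # The last generation is not compacted, so it holds R files.
        files = last_tier_files if gen == max_generation else files_per_tier
        layout.append((files, file_size, files * file_size))
        # A file in the next generation is built from files_per_tier files.
        file_size *= files_per_tier
    return layout


if __name__ == "__main__":
    for gen, (files, size_mb, total_mb) in enumerate(region_layout()):
        print(f"Gen{gen}: {files} files x {size_mb}MB = {total_mb}MB")
    print("region total (MB):", sum(total for _, _, total in region_layout()))
```

Under this reading, four Gen3 files give 16 files and about 10.2GB in total, while three Gen3 files give about 8.3GB, so the 10-12GB figure is best taken as a rounded estimate.
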
> Generational compaction does not limit region size. One can use 100GB or even more, because the total compaction IO per region can be bounded and, generally speaking, does not depend explicitly on region size (as it does in the size-tiered compaction policy).
> Now, about performance implications:
> SSD-based servers will benefit from this policy because they provide more than adequate random IO, but even HDD-based systems can use it. Again, simple math: with a region size of ~10GB we will have ~16 files, of which 10-12 can be cached in the block cache. Even if a request touches all the files (spans the entire time range), it will need to access only the 4-6 uncached files. How to always keep recent data in the block cache is a totally separate topic (JIRA).
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)