Posted to issues@hbase.apache.org by "Todd Lipcon (Created) (JIRA)" <ji...@apache.org> on 2011/11/01 18:43:32 UTC

[jira] [Created] (HBASE-4717) More efficient age-off of old data during major compaction

More efficient age-off of old data during major compaction
----------------------------------------------------------

                 Key: HBASE-4717
                 URL: https://issues.apache.org/jira/browse/HBASE-4717
             Project: HBase
          Issue Type: Improvement
          Components: regionserver
    Affects Versions: 0.94.0
            Reporter: Todd Lipcon


Many applications need to implement efficient age-off of old data. Currently, age-off happens only during major compaction, which scans through all of the KVs. Instead, we could implement the following:
- Set hbase.hstore.compaction.max.size reasonably small. That way, each older store file covers only a small, bounded range of time.
- Periodically run an "age-off compaction". This compaction would scan the current list of storefiles. Any store file that falls entirely outside the TTL time range would be dropped. Store files completely within the time range would be left unaltered. Those crossing the time-range boundary could either be left alone or compacted using the existing compaction code.

I don't have a design in mind for how exactly this would be implemented, but hope to generate some discussion.
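
As a starting point for discussion, the three-way decision in the second bullet could be sketched roughly as below. This is only an illustration: AgeOffSketch, FileRange, and Action are hypothetical names, not HBase classes, and the sketch assumes each store file exposes its minimum and maximum cell timestamp and that a cell counts as expired when its timestamp is older than now minus the TTL.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the proposed age-off pass; none of these names are HBase APIs. */
final class AgeOffSketch {

    enum Action { DROP, KEEP, STRADDLES }

    /** Stand-in for a store file's inclusive timestamp range, in milliseconds. */
    static final class FileRange {
        final String name;
        final long minTs;
        final long maxTs;

        FileRange(String name, long minTs, long maxTs) {
            this.name = name;
            this.minTs = minTs;
            this.maxTs = maxTs;
        }
    }

    /** Assumes a cell is expired when its timestamp is older than now - ttl. */
    static Action classify(FileRange f, long nowMs, long ttlMs) {
        long cutoff = nowMs - ttlMs;
        if (f.maxTs < cutoff) {
            return Action.DROP;      // even the newest cell is expired
        }
        if (f.minTs >= cutoff) {
            return Action.KEEP;      // even the oldest cell is still live
        }
        return Action.STRADDLES;     // mixed: leave alone or compact as usual
    }

    /** Names of the files that could be deleted without reading any cells. */
    static List<String> filesToDrop(List<FileRange> files, long nowMs, long ttlMs) {
        List<String> drop = new ArrayList<>();
        for (FileRange f : files) {
            if (classify(f, nowMs, ttlMs) == Action.DROP) {
                drop.add(f.name);
            }
        }
        return drop;
    }
}
```

The point of the sketch is that the decision needs only per-file metadata, so fully expired files can be dropped without scanning any KVs.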

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4717) More efficient age-off of old data during major compaction

Posted by "Jonathan Gray (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141445#comment-13141445 ] 

Jonathan Gray commented on HBASE-4717:
--------------------------------------

+1 on this general direction.

We've long talked of special compaction heuristics that would bucketize by time in some way (and you could really take advantage of the TimeRangeTracker file selection stuff for read perf).  We did as you describe and set a small max.size, so once a file reached a certain size, it would never be compacted again.  This allowed us to "age out" the data by keeping old stuff separate from new stuff in files.

We were not trying to actually wipe out the data, only separate it, because this was mostly a read-modify-write workload that needed access to recent data but the old data still needed to be available for user read queries.  It would probably be simple to add a check during compaction time of the time range of each file and if the max is expired, just to wipe out that file.
                
> More efficient age-off of old data during major compaction
> ----------------------------------------------------------
>
>                 Key: HBASE-4717
>                 URL: https://issues.apache.org/jira/browse/HBASE-4717
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver
>    Affects Versions: 0.94.0
>            Reporter: Todd Lipcon
>
> Many applications need to implement efficient age-off of old data. We currently only perform age-off during major compaction by scanning through all of the KVs. Instead, we could implement the following:
> - Set hbase.hstore.compaction.max.size reasonably small. Thus, older store files contain only smaller finite ranges of time.
> - Periodically run an "age-off compaction". This compaction would scan the current list of storefiles. Any store file that falls entirely out of the TTL time range would be dropped. Store files completely within the time range would be un-altered. Those crossing the time-range boundary could either be left alone or compacted using the existing compaction code.
> I don't have a design in mind for how exactly this would be implemented, but hope to generate some discussion.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4717) More efficient age-off of old data during major compaction

Posted by "Todd Lipcon (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141851#comment-13141851 ] 

Todd Lipcon commented on HBASE-4717:
------------------------------------

Nice idea; that's probably useful outside of this use case, too. Another idea is to maintain time-range histograms for each storefile to estimate whether a "filtration" is worth doing.
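
One possible shape for such a histogram, purely as a sketch: bucket cell timestamps into fixed-width intervals while the file is written, then sum the buckets that lie entirely below the TTL cutoff. TimeHistogramSketch and its methods are hypothetical names, the fixed bucket width is an assumption, and nothing here reflects an existing HBase structure.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical per-storefile timestamp histogram; not an HBase class. */
final class TimeHistogramSketch {
    private final long bucketWidthMs;
    // bucket start timestamp -> number of cells whose timestamp falls in that bucket
    private final TreeMap<Long, Long> counts = new TreeMap<>();
    private long total;

    TimeHistogramSketch(long bucketWidthMs) {
        if (bucketWidthMs <= 0) {
            throw new IllegalArgumentException("bucketWidthMs must be positive");
        }
        this.bucketWidthMs = bucketWidthMs;
    }

    /** Record one cell timestamp. */
    void add(long ts) {
        long bucketStart = Math.floorDiv(ts, bucketWidthMs) * bucketWidthMs;
        counts.merge(bucketStart, 1L, Long::sum);
        total++;
    }

    /**
     * Lower-bound estimate of the fraction of cells older than cutoffMs.
     * Only buckets that end at or before the cutoff are counted, so cells in
     * the bucket that contains the cutoff are treated as live.
     */
    double expiredFraction(long cutoffMs) {
        if (total == 0) {
            return 0.0;
        }
        long expired = 0;
        for (Map.Entry<Long, Long> e : counts.headMap(cutoffMs, false).entrySet()) {
            if (e.getKey() + bucketWidthMs <= cutoffMs) {
                expired += e.getValue();
            }
        }
        return (double) expired / total;
    }
}
```

Counting only whole buckets keeps the estimate conservative: it may understate how much data is expired, but it never claims live cells are reclaimable.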
                
> More efficient age-off of old data during major compaction
> ----------------------------------------------------------
>
>                 Key: HBASE-4717
>                 URL: https://issues.apache.org/jira/browse/HBASE-4717
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver
>    Affects Versions: 0.94.0
>            Reporter: Todd Lipcon
>
> Many applications need to implement efficient age-off of old data. We currently only perform age-off during major compaction by scanning through all of the KVs. Instead, we could implement the following:
> - Set hbase.hstore.compaction.max.size reasonably small. Thus, older store files contain only smaller finite ranges of time.
> - Periodically run an "age-off compaction". This compaction would scan the current list of storefiles. Any store file that falls entirely out of the TTL time range would be dropped. Store files completely within the time range would be un-altered. Those crossing the time-range boundary could either be left alone or compacted using the existing compaction code.
> I don't have a design in mind for how exactly this would be implemented, but hope to generate some discussion.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4717) More efficient age-off of old data during major compaction

Posted by "Todd Lipcon (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141681#comment-13141681 ] 

Todd Lipcon commented on HBASE-4717:
------------------------------------

> It would probably be simple to add a check during compaction time of the time range of each file and if the max is expired, just to wipe out that file.

That's one optimization, but it only saves reading the now-expired file. We still have to read/rewrite all of the rest of the data periodically to do the age-off.

The new idea above is to introduce something more like a "filtration" than a "compaction" -- you would only rewrite files that have a significant amount of data to be aged.
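
To make the "filtration" idea concrete, here is a rough sketch of the selection step only. It assumes some per-file estimate of expired bytes is available (how to obtain it is left open), and FiltrationSketch, FileStats, and the minExpiredFraction threshold are all hypothetical names for illustration, not existing HBase code or configuration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical selection step for a "filtration" pass; not HBase code. */
final class FiltrationSketch {

    /** Stand-in for per-file statistics; expiredBytes is assumed to be an estimate. */
    static final class FileStats {
        final String name;
        final long totalBytes;
        final long expiredBytes;

        FileStats(String name, long totalBytes, long expiredBytes) {
            this.name = name;
            this.totalBytes = totalBytes;
            this.expiredBytes = expiredBytes;
        }
    }

    /**
     * Pick only the files whose estimated expired share reaches the threshold.
     * Every other file is left untouched, so it is neither read nor rewritten.
     */
    static List<String> selectForRewrite(List<FileStats> files, double minExpiredFraction) {
        List<String> selected = new ArrayList<>();
        for (FileStats f : files) {
            if (f.totalBytes > 0 && f.expiredBytes >= minExpiredFraction * f.totalBytes) {
                selected.add(f.name);
            }
        }
        return selected;
    }
}
```

With a threshold of 0.5, for example, a file that is 80% expired would be rewritten while one that is 10% expired would be skipped until more of it ages out.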
                
> More efficient age-off of old data during major compaction
> ----------------------------------------------------------
>
>                 Key: HBASE-4717
>                 URL: https://issues.apache.org/jira/browse/HBASE-4717
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver
>    Affects Versions: 0.94.0
>            Reporter: Todd Lipcon
>
> Many applications need to implement efficient age-off of old data. We currently only perform age-off during major compaction by scanning through all of the KVs. Instead, we could implement the following:
> - Set hbase.hstore.compaction.max.size reasonably small. Thus, older store files contain only smaller finite ranges of time.
> - Periodically run an "age-off compaction". This compaction would scan the current list of storefiles. Any store file that falls entirely out of the TTL time range would be dropped. Store files completely within the time range would be un-altered. Those crossing the time-range boundary could either be left alone or compacted using the existing compaction code.
> I don't have a design in mind for how exactly this would be implemented, but hope to generate some discussion.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HBASE-4717) More efficient age-off of old data during major compaction

Posted by "dhruba borthakur (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-4717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141801#comment-13141801 ] 

dhruba borthakur commented on HBASE-4717:
-----------------------------------------

Is it possible to look at the blooms (which are mostly in the block cache) for two HFiles, estimate how much overlap there is between their KVs, and then decide whether to compact/merge those two files?
                
> More efficient age-off of old data during major compaction
> ----------------------------------------------------------
>
>                 Key: HBASE-4717
>                 URL: https://issues.apache.org/jira/browse/HBASE-4717
>             Project: HBase
>          Issue Type: Improvement
>          Components: regionserver
>    Affects Versions: 0.94.0
>            Reporter: Todd Lipcon
>
> Many applications need to implement efficient age-off of old data. We currently only perform age-off during major compaction by scanning through all of the KVs. Instead, we could implement the following:
> - Set hbase.hstore.compaction.max.size reasonably small. Thus, older store files contain only smaller finite ranges of time.
> - Periodically run an "age-off compaction". This compaction would scan the current list of storefiles. Any store file that falls entirely out of the TTL time range would be dropped. Store files completely within the time range would be un-altered. Those crossing the time-range boundary could either be left alone or compacted using the existing compaction code.
> I don't have a design in mind for how exactly this would be implemented, but hope to generate some discussion.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira