Posted to issues@hbase.apache.org by "stack (JIRA)" <ji...@apache.org> on 2011/07/22 18:43:57 UTC

[jira] [Updated] (HBASE-4125) WALInputFormat

     [ https://issues.apache.org/jira/browse/HBASE-4125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-4125:
-------------------------

    Description: 
A coworker suggested doing an HBase backup based on the write-ahead logs (WALs).  The details still need to be worked out, but here are a couple of notes:

+ Backup would not require running an additional process, with its attendant CPU burn and I/O load, on the cluster being backed up.  The WALs have already been written.
+ WALs are currently not compressed.  We could keep them compressed in the backup store (this input format should accept both compressed and uncompressed WALs).
+ The hard part is figuring out some global sequenceid, or set of sequenceids, from which to start replaying edits.  I'd imagine you'd want to replay a backup from some particular point (it could be a 'date' for a first version, but that is a little sloppy, especially around counters).
+ MapReduce jobs replaying WALs could be scoped to a table or even to a region (though we'd be looking at lots of edits if replaying all WALs from a cluster; perhaps we need to dump some metadata when we close WALs, e.g. the regions that have edits in a particular WAL).
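
To make the sequenceid and scoping notes concrete, here is a rough sketch of the selection step a replay job might apply to each edit.  Everything in it is an assumption for illustration: the Edit class is a hypothetical stand-in rather than the real HBase WAL key/edit types, and the per-region map of starting sequenceids is just one possible reading of "set of sequenceids".  Regions missing from the map are replayed from the beginning.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class WalReplayFilter {
    // Hypothetical stand-in for one WAL entry; NOT the real HBase classes.
    public static final class Edit {
        public final String table;
        public final String region;
        public final long sequenceId;
        public final String payload;

        public Edit(String table, String region, long sequenceId, String payload) {
            this.table = table;
            this.region = region;
            this.sequenceId = sequenceId;
            this.payload = payload;
        }
    }

    /**
     * Keeps the edits that belong to the given table and whose sequence id is
     * at or past the starting point recorded for their region. A region with
     * no recorded starting point is replayed in full (start defaults to 0).
     */
    public static List<Edit> select(List<Edit> edits, String table,
                                    Map<String, Long> startSeqIdByRegion) {
        List<Edit> selected = new ArrayList<>();
        for (Edit e : edits) {
            if (!e.table.equals(table)) {
                continue; // outside the table this replay is scoped to
            }
            long start = startSeqIdByRegion.getOrDefault(e.region, 0L);
            if (e.sequenceId >= start) {
                selected.add(e);
            }
        }
        return selected;
    }
}
```

In a MapReduce job this check would sit in the mapper (or in the record reader), so that edits outside the requested scope are dropped before anything is written back.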

This input format is needed whether or not we do backup: it would also serve for replaying logs that were moved aside in an emergency while getting a cluster off the ground again.

We should also have a script that can use this input format to replay a single-digit number of WALs without resorting to MapReduce.
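
As a rough idea of what the non-MapReduce path could look like, the sketch below walks a short list of files in order in a single process and hands each edit to a callback.  It rests on loud assumptions: real WALs are a binary format, not text, so "one line per edit" is purely a placeholder; gzip is assumed as the backup-store compression only because it can be detected from its two magic bytes; and the class and method names are made up for this example.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.function.Consumer;
import java.util.zip.GZIPInputStream;

public class SmallWalReplay {
    /** Opens a file, unwrapping gzip when the magic bytes 0x1f 0x8b lead the file. */
    public static InputStream open(Path file) throws IOException {
        BufferedInputStream in = new BufferedInputStream(Files.newInputStream(file));
        in.mark(2);
        int b1 = in.read();
        int b2 = in.read();
        in.reset(); // rewind so the caller sees the file from byte 0
        boolean gzip = (b1 == 0x1f && b2 == 0x8b);
        return gzip ? new GZIPInputStream(in) : in;
    }

    /**
     * Replays the files one after another in the order given.
     * ASSUMPTION: each line is one edit (placeholder for a real WAL reader).
     * Returns the number of edits handed to the sink.
     */
    public static long replay(List<Path> wals, Consumer<String> sink) throws IOException {
        long count = 0;
        for (Path wal : wals) {
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(open(wal), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    sink.accept(line);
                    count++;
                }
            }
        }
        return count;
    }
}
```

Processing the files sequentially keeps the edits in file order, which is acceptable for a handful of WALs; anything larger is where the MapReduce input format earns its keep.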

> WALInputFormat
> --------------
>
>                 Key: HBASE-4125
>                 URL: https://issues.apache.org/jira/browse/HBASE-4125
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: stack

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira