Posted to issues@hbase.apache.org by "Jesse Yates (JIRA)" <ji...@apache.org> on 2012/06/07 01:01:23 UTC

[jira] [Commented] (HBASE-6180) [brainstorm] Timestamp based snapshots in HBase 0.96

    [ https://issues.apache.org/jira/browse/HBASE-6180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13290554#comment-13290554 ] 

Jesse Yates commented on HBASE-6180:
------------------------------------

Here is what I've been thinking about for doing timestamp based snapshotting, as an extension to the work I've been doing for HBASE-6055.

Timestamp based snapshots are a zero-downtime/non-blocking version of taking a snapshot across a table in HBase. They should be considered 'fuzzy' because you don't get a global view, but only as close to globally consistent as we can get with timestamps on the region servers (the fuzziness is the NTP drift between RSs, which defaults to a max skew of 60 sec). I'm going to mingle a bit of theory and implementation here, but feel free to ask questions about anything that doesn't make sense.

All the infrastructure from point-in-time snapshots (HBASE-50) is still going to be used here: SnapshotManager on the Master, the RegionSnapshotHandler on the RS, etc. The only change is what actually happens on each of the regions when taking the snapshot and how the snapshot is managed on the Master. Also, on a lower level, the time constraints are much looser on taking the snapshot.

Let's walk through some of the changes to the actual implementation.

From a high level, we still tell all the RSs to start the snapshot. They will then dump a meta edit into the WAL with the memstore timestamp (not clear if this is necessary, but it could be useful for completing snapshots on failed RSs). They will then post back to ZK that they are starting the snapshot. Each RS can then go about its business adding references to all the files on the FS for the regions involved in the snapshot. It gets a little tricky when we try to capture the in-memory state of each RS.

The key here though is that we can use the Memstore's built-in snapshot functionality to avoid doing any work with the WALs and just keep track of HFiles. When flushing, the Memstore takes a "snapshot" by blocking for just a moment to switch two pointers between the current and the new memstore. All writes before the switch go into the old memstore; all new writes go into the new memstore. The old memstore can then be flushed to disk asynchronously, and on scan we just merge in the results from the old version as well as the on-disk files. The benefit of this is that we take essentially no downtime to flush the memstore (except for corner cases where there are already too many HFiles on disk, but we can ignore that as it's part of the overall HBase design).
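To make the pointer swap concrete, here's a toy sketch of the idea (illustration only, not the actual MemStore code - the real thing tracks KVs, MVCC read points, heap sizes, etc.):

{code:java}
import java.util.NavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

/** Toy illustration of the MemStore pointer-swap snapshot (not HBase's real MemStore). */
public class ToyMemStore {
  // Active map receiving new writes; keyed by row for simplicity.
  private volatile NavigableMap<String, String> active = new ConcurrentSkipListMap<>();
  // Frozen map being flushed to disk; empty when no flush is in progress.
  private volatile NavigableMap<String, String> snapshot = new ConcurrentSkipListMap<>();

  public void put(String row, String value) {
    active.put(row, value);
  }

  /** Swap pointers: all prior writes are frozen in 'snapshot', new writes go to a fresh map. */
  public synchronized void takeSnapshot() {
    snapshot = active;
    active = new ConcurrentSkipListMap<>();
  }

  /** Reads merge the frozen snapshot and the active map (active wins for the same row). */
  public String get(String row) {
    String v = active.get(row);
    return v != null ? v : snapshot.get(row);
  }

  /** After the frozen map is flushed to an HFile, it can be dropped. */
  public synchronized void clearSnapshot() {
    snapshot = new ConcurrentSkipListMap<>();
  }
}
{code}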

We can leverage the same mechanism but instead just make the swap time-based.

When the RS gets the update to take the snapshot, it also has a timespan through which it should split writes between the memstores. For example, say we get a snapshot start notice at 10:15:00 and a prepare phase length of 80 seconds (the max skew in the cluster +20 seconds for safety - just an example). 

For those 80 seconds, each HRegion will then time-flush the memstore. We take a regular memstore snapshot. Just like a regular flush, this ensures that all the outstanding writes to the memstore get committed (waiting on the read point to roll forward). However, instead of immediately writing to the new store, we split writes by timestamp between the old and the new memstore. This management is handled by the Store, which just does some simple checking on the edits coming through to see which memstore it should direct each write to (admittedly, hand-waving away some of the complexity here).
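In code, the extra routing the Store would do during the window is basically a single comparison per edit. A toy sketch (the class and field names here are made up for illustration; the real Store keys on KeyValues, not raw timestamps):

{code:java}
import java.util.NavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

/** Illustration only: route writes between two memstores by cell timestamp during the snapshot window. */
public class TimeSplitStore {
  private final long snapshotBoundaryTs;          // the timestamp the snapshot was started at
  private final NavigableMap<Long, String> oldMemstore = new ConcurrentSkipListMap<>();
  private final NavigableMap<Long, String> newMemstore = new ConcurrentSkipListMap<>();

  public TimeSplitStore(long snapshotBoundaryTs) {
    this.snapshotBoundaryTs = snapshotBoundaryTs;
  }

  /** The only extra cost per write during the window: one timestamp comparison. */
  public void put(long cellTimestamp, String value) {
    if (cellTimestamp <= snapshotBoundaryTs) {
      oldMemstore.put(cellTimestamp, value);      // belongs to the snapshot
    } else {
      newMemstore.put(cellTimestamp, value);      // after the snapshot point
    }
  }

  /** Once the window closes, the old memstore is flushed to an HFile and referenced by the snapshot. */
  public NavigableMap<Long, String> drainOldMemstore() {
    return oldMemstore;
  }
}
{code}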

Conceptually, this is like taking a snapshot, but instead of the snapshot just being the immutable state (less the rollbacks made), we pass that KV set into a new MemStore that acts just like a regular memstore. Since all the high-level edits still go through the MVCC, we can keep track of the ordering of writes, and the rollback mechanism on the original MemStore keeps both its own state and the state of the snapshot-based MemStore correct.

At this point, we can update the master (via ZK) that we have joined the snapshot. This is not strictly necessary, but it is nice since we can then track the progress of a snapshot. For instance, if a RS hasn't responded within a certain window, we can immediately fail the snapshot and assume the RS has become inoperable. Since we are using the internal flushing mechanisms to remain mostly non-blocking, we could actually skip this update and just notify the master when we have done the write.
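For the 'joined' notification itself, something like the following would do with the plain ZooKeeper client (the znode layout here is just for illustration - the real layout comes from the HBASE-6055 work):

{code:java}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

/** Sketch: a region server announces it has joined the snapshot barrier. */
public class SnapshotJoinNotifier {
  private final ZooKeeper zk;

  public SnapshotJoinNotifier(ZooKeeper zk) {
    this.zk = zk;
  }

  public void joinSnapshot(String snapshotName, String serverName)
      throws KeeperException, InterruptedException {
    // Hypothetical layout: /hbase/snapshot/<name>/joined/<server>.
    String path = "/hbase/snapshot/" + snapshotName + "/joined/" + serverName;
    // Ephemeral node: if the RS dies mid-snapshot, the node vanishes and the master can fail the snapshot.
    zk.create(path, new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
  }
}
{code}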

An alternative implementation here is to do what Jon has suggested and do a set of meta writes for the beginning and end of a snapshot. Then all you have to do is keep track of the WALs for the snapshot and replay them at the right time. However, that adds some complexity to how you restore a snapshot and may require rolling the WAL after the snapshot has been taken - a worrisome amount of complexity for something that should be entirely immutable. Since the flush can be done asynchronously and we don't block writes while waiting, it doesn't seem like a major issue to wait a little longer to complete a snapshot.

Back to the dual-pointer memstore snapshot implementation, once we pass the 80 seconds, we then flush the old store to disk, add a reference to the new HFile, and then just direct all writes to the new store. Conceptually, this all seems to hang together, but the implementation is probably going to take a little more work.

There is a slight overhead to writes during the snapshot window: we need to check the timestamp of every write going into the memstore to figure out which store it should be written into. However, that is just a simple timestamp comparison and shouldn't be overly burdensome to write throughput (especially if you can take a snapshot during a low-write period).

After this snapshot window, the state of the memstore will have been snapshotted and a flush will have been started. Now we can just flush this old memstore to disk as another HFile and add a reference to it for the snapshot. It's completely fine if this process takes a while because the server proceeds happily, taking reads and writes like nothing is amiss, since the semantics are the same as a regular flush. Once the file hits disk (and we have added references for each of the other files), we can consider the snapshot completed on that HRegion. Once that process completes for all involved HRegions on the HRegionServer, we can consider the snapshot complete on that server.
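A rough sketch of the "add a reference" step, using the plain Hadoop FileSystem API (the paths and the empty-marker-file convention are assumptions for illustration; the real reference format is whatever HBASE-6055 settles on):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;

/** Sketch: after the old memstore is flushed, record the resulting HFile in the snapshot's temp dir. */
public class SnapshotRegionCompleter {
  private final FileSystem fs;
  private final Path snapshotTmpDir;   // e.g. <hbase root>/.tmp/<snapshotName>

  public SnapshotRegionCompleter(Configuration conf, Path snapshotTmpDir) throws IOException {
    this.fs = FileSystem.get(conf);
    this.snapshotTmpDir = snapshotTmpDir;
  }

  /** Write an empty marker ("reference") file named after the HFile we want the snapshot to retain. */
  public void addReference(Path hfile) throws IOException {
    Path ref = new Path(snapshotTmpDir, hfile.getName());
    fs.create(ref, false).close();   // the reference carries no data, only the name
  }
}
{code}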

Note that since the in memory state is all written to disk, we don't actually need to keep track of any of the HLogs. There is probably some re-jiggering here around failed Puts and the optimized write path there, but that is an implementation detail.

Once all the HRegionServers have taken the snapshot (passing up the notification by joining the barrier), the Master considers the snapshot completed and can move the snapshot from the .tmp to the .snapshot directory. The complete barrier is then just a barrier for the master, rather than for the region servers, since no coordination is necessary except to determine if a snapshot failed because a RS couldn't complete (which only the master needs to keep track of, to determine whether a snapshot is valid or not).
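The final master-side step is then just a directory move, roughly like this (the exact layout under the hbase root dir is an assumption here):

{code:java}
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;

/** Sketch: promote a completed snapshot from the working dir to the final snapshot dir. */
public class SnapshotFinalizer {
  public static void complete(FileSystem fs, Path hbaseRoot, String snapshotName) throws IOException {
    Path working = new Path(hbaseRoot, ".tmp/" + snapshotName);
    Path completed = new Path(hbaseRoot, ".snapshot/" + snapshotName);
    // A single rename makes the snapshot visible all at once (rename is atomic on HDFS).
    if (!fs.rename(working, completed)) {
      throw new IOException("Failed to finalize snapshot " + snapshotName);
    }
  }
}
{code}

On HDFS the rename is atomic, so a snapshot either shows up complete under .snapshot or not at all.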

There are some gotchas with snapshotting with timestamps.

Suppose you are putting writes into the future. On a regular table, doing a timestamp-based Scan will still not find those future writes; the same will be true of the snapshotted table - those writes will be directed to the new store and not found in the snapshot.

The only weirdness that occurs with this form of snapshots is with future/past writes - essentially any time you start messing with the timestamps. Let's look at an example. At 10:15:00 you take a snapshot of a table. However, on the same table, you make a Put - 'row', 'cf', 10:20:00, 'value' - at 10:10:00: a put in the future, but made _before_ you take the snapshot. The snapshot then proceeds as expected. At some point later, you revive the snapshot and do a scan of the table with a timestamp of 10:15:00; you won't find that earlier put ('row', 'cf', 10:20:00, 'value'). However, if you just do a scan for the latest version, you *will* find that put!
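In client terms (using the modern client API purely for readability - the table name 't', the column names, and the epoch values standing in for 10:15:00/10:20:00 are all just illustrative):

{code:java}
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class FutureTimestampExample {
  public static void main(String[] args) throws Exception {
    long snapshotTs = 1339031700000L;  // stands in for "10:15:00" in the example above
    long futureTs   = 1339032000000L;  // stands in for "10:20:00"

    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("t"))) {

      // A put made (by the wall clock) before the snapshot, but stamped in the future.
      Put p = new Put(Bytes.toBytes("row"));
      p.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), futureTs, Bytes.toBytes("value"));
      table.put(p);

      // A scan bounded at the snapshot timestamp does NOT see the future-stamped cell...
      Scan bounded = new Scan();
      bounded.setTimeRange(0, snapshotTs);   // max timestamp is exclusive
      try (ResultScanner rs = table.getScanner(bounded)) {
        for (Result r : rs) System.out.println("bounded scan sees: " + r);  // expect nothing
      }

      // ...while a plain latest-version scan DOES see it.
      try (ResultScanner rs = table.getScanner(new Scan())) {
        for (Result r : rs) System.out.println("latest scan sees: " + r);   // expect the future put
      }
    }
  }
}
{code}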

It gets even odder if, instead of making that future put before the snapshot was taken, you make it _while_ the snapshot is being taken. In this case, the revived snapshot will give you different semantics. The scan of the snapshot at 10:15:00 will still give you the same answer as before, but the latest-version scan _will not find_ the future Put ('row', 'cf', 10:20:00, 'value').

Unfortunately, these are the semantics of using timestamps over global consistency. I (and many others) feel that if you are messing with timestamps, then it's buyer beware.

That said, there is a way to get global consistency if you do mess with timestamps. If you have some centralized timestamp oracle, then it can give out strictly increasing timestamps with a lease for each timestamp. (I've got a long flight next week where I'm hoping to pump out a basic implementation of this for HBase - no ticket, just a little something on github.) Since you know that the timestamps will expire after a given period, you just set the expiration time + fudge as the timespan to split the memstore writes. After the expiration period you know that a timestamp is the oldest timestamp, so you can then comfortably flush the old memstore to disk, knowing that you have all the edits from that timestamp back in time. Note that you don't have the same problem as above, since you only do scans in terms of the timestamps from the oracle, so future and past are really globally relative - there are no puts too far into the future or past that are visible, because all scans need to be as of a timestamp.
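For what it's worth, a minimal sketch of what such an oracle could look like (names are made up, no relation to any existing code):

{code:java}
import java.util.concurrent.atomic.AtomicLong;

/** Toy timestamp oracle: hands out strictly increasing timestamps with a lease/expiry. */
public class TimestampOracle {
  private final AtomicLong lastIssued = new AtomicLong(0);
  private final long leaseMillis;          // how long a handed-out timestamp may be used for writes

  public TimestampOracle(long leaseMillis) {
    this.leaseMillis = leaseMillis;
  }

  /** Strictly increasing: never hands out the same value twice, even if the wall clock stalls. */
  public long nextTimestamp() {
    while (true) {
      long prev = lastIssued.get();
      long candidate = Math.max(System.currentTimeMillis(), prev + 1);
      if (lastIssued.compareAndSet(prev, candidate)) {
        return candidate;
      }
    }
  }

  /** Once a timestamp's lease has expired, no new writes may carry it or anything older. */
  public boolean isExpired(long ts) {
    return System.currentTimeMillis() > ts + leaseMillis;
  }

  /** The memstore split window can simply be the lease plus a fudge factor. */
  public long snapshotSplitWindowMillis(long fudgeMillis) {
    return leaseMillis + fudgeMillis;
  }
}
{code}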
                
> [brainstorm] Timestamp based snapshots in HBase 0.96
> ----------------------------------------------------
>
>                 Key: HBASE-6180
>                 URL: https://issues.apache.org/jira/browse/HBASE-6180
>             Project: HBase
>          Issue Type: Brainstorming
>            Reporter: Jesse Yates
>             Fix For: 0.96.0
>
>
> Discussion ticket around doing timestamp based snapshots in HBase as an extension/follow-on work for HBASE-6055. The implementation in HBASE-6055 (as originally defined) is not sufficient for real-time clusters because it requires downtime to take the snapshot. 
> Time-stamp based snapshots should not require downtime at the cost of achieving global consistency. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira