Posted to issues@hbase.apache.org by "Enis Soztutar (JIRA)" <ji...@apache.org> on 2013/06/07 22:26:20 UTC

[jira] [Commented] (HBASE-8701) distributedLogReplay need to apply wal edits in the receiving order of those edits

    [ https://issues.apache.org/jira/browse/HBASE-8701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13678395#comment-13678395 ] 

Enis Soztutar commented on HBASE-8701:
--------------------------------------

I've put the following comment on the parent jira HBASE-7006. I guess the discussions should continue here, so I am repeating my comments: 
Here is a proposed scheme that can solve this problem:
The region will be opened for replaying as before. Normal writes go to the memstore, and the memstore is flushed as usual. The region servers that read the WALs and send edits to the replaying RS stay the same, except that the edits are now sent with their seq_ids.
On the replaying RS there is a single buffer for all regions in replaying state. All edits are appended to this buffer without any sorting. The buffer is accounted for as a memstore and has the memstore flush size as its max size. Once that size is reached, or we are asked to flush due to global memstore pressure, we sort the buffer and spill it to disk. The buffer keeps <kv, seq> pairs and sorts by <kv, seq>. If there is no memory pressure and the buffer does not fill up, we never need to spill to disk.
Once replaying is finished and the master asks the region server to open the region for reading, we do a final merge sort over the in-memory sorted buffer and all on-disk spilled runs and create an hfile, discarding kv's that have the same kv but a smaller seq_id. The result is a single hfile that corresponds to a flush, with a seq_id obtained from the wal edits. Then we add this hfile to the store and open the region as usual. Keeping an unsorted buffer, sorting it with qsort on spill, and doing a final on-disk merge sort might even be faster, since otherwise each edit would be inserted into the memstore, which amounts to an insertion sort.
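To make the scheme above concrete, here is a minimal sketch of the proposed buffer: O(1) unsorted appends, sorted spill runs under memory pressure, and a final merge that keeps only the largest seq_id per cell. All class, field, and method names (`ReplayBufferSketch`, `Edit`, `mergeForFlush`) are illustrative assumptions, not HBase APIs, and kvs are simplified to row keys:

```java
import java.util.*;

public class ReplayBufferSketch {
    // A recovered WAL edit: (row key, value, original WAL sequence id).
    record Edit(String row, String value, long seqId) {}

    private final List<Edit> buffer = new ArrayList<>();      // unsorted append buffer
    private final List<List<Edit>> spilledRuns = new ArrayList<>();
    private final int maxBufferedEdits;                       // stands in for memstore flush size

    ReplayBufferSketch(int maxBufferedEdits) { this.maxBufferedEdits = maxBufferedEdits; }

    // Append is O(1); no insertion sort as in a normal memstore.
    void append(String row, String value, long seqId) {
        buffer.add(new Edit(row, value, seqId));
        if (buffer.size() >= maxBufferedEdits) spill();
    }

    // Sort by <kv, seq> and "spill" the run (to disk, in the real design).
    private void spill() {
        List<Edit> run = new ArrayList<>(buffer);
        run.sort(Comparator.comparing(Edit::row).thenComparingLong(Edit::seqId));
        spilledRuns.add(run);
        buffer.clear();
    }

    // Final merge sort over all runs: for edits of the same cell, the larger
    // seq_id wins, as if a single flushed hfile were being written.
    Map<String, String> mergeForFlush() {
        spill();
        Map<String, Edit> winners = new TreeMap<>();
        for (List<Edit> run : spilledRuns) {
            for (Edit e : run) {
                Edit cur = winners.get(e.row());
                if (cur == null || e.seqId() > cur.seqId()) winners.put(e.row(), e);
            }
        }
        Map<String, String> result = new LinkedHashMap<>();
        winners.forEach((row, e) -> result.put(row, e.value()));
        return result;
    }
}
```

Even if the same row arrives in different spill runs in arbitrary order, the merge keeps the edit with the highest original seq_id, which is the property the final hfile needs.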
The other thing we need to change is that replayed edits will not go into the wal again; instead, we keep track of the recovering state for the region server and re-do the work if there is a subsequent failure.
In short, this will be close to BigTable's in-memory sort per WAL file, except that we gather the edits for a region from all WAL files via the replay RPC and do the sort per region. The end result is a flushed hfile, as if the region had flushed just before the crash.
                
> distributedLogReplay need to apply wal edits in the receiving order of those edits
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-8701
>                 URL: https://issues.apache.org/jira/browse/HBASE-8701
>             Project: HBase
>          Issue Type: Bug
>          Components: MTTR
>            Reporter: Jeffrey Zhong
>            Assignee: Jeffrey Zhong
>             Fix For: 0.98.0, 0.95.2
>
>
> This issue happens in distributedLogReplay mode when recovering multiple puts of the same key + version (timestamp). After replay, the final value of the key is nondeterministic.
> h5. The original concern situation raised from [~eclark]:
> For all edits the rowkey is the same.
> There's a log with: [ A (ts = 0), B (ts = 0) ]
> Replay the first half of the log.
> A user puts in C (ts = 0)
> Memstore has to flush
> A new Hfile will be created with [ C, A ] and MaxSequenceId = C's seqid.
> Replay the rest of the Log.
> Flush
> The issue will happen in any similar situation, e.g. Put(key, t=T) in WAL1 and Put(key, t=T) in WAL2.
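The scenario above can be modeled in a few lines. This is a hypothetical sketch, not HBase code: `naiveReplay` stands for "last edit applied wins" (so the result depends on which WAL is replayed first), while `seqAwareReplay` resolves the same key + timestamp by the original WAL sequence id, which is order-independent:

```java
import java.util.*;

public class ReplayOrderSketch {
    // A put with the same rowkey and timestamp may exist in two WALs,
    // distinguished only by its original WAL sequence id.
    record Put(String key, long ts, String value, long seqId) {}

    // Naive replay: whichever edit is applied last wins (order-dependent).
    static String naiveReplay(List<Put> order) {
        Map<String, String> cell = new HashMap<>();
        for (Put p : order) cell.put(p.key() + "@" + p.ts(), p.value());
        return cell.get("k@0");
    }

    // Seq-aware replay: the edit with the larger original seq_id wins,
    // regardless of arrival order.
    static String seqAwareReplay(List<Put> order) {
        Map<String, Put> cell = new HashMap<>();
        for (Put p : order) {
            cell.merge(p.key() + "@" + p.ts(), p,
                (existing, incoming) ->
                    existing.seqId() >= incoming.seqId() ? existing : incoming);
        }
        return cell.get("k@0").value();
    }
}
```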
> h5. Below is the option I'd like to use:
> a) During replay, we pass the wal file name hash in each replay batch and the original wal sequence id of each edit to the receiving RS
> b) Once a wal is fully recovered, the replaying RS sends a signal to the receiving RS so the receiving RS can flush
> c) On the receiving RS, edits from different WAL files of a region go into different memstores. (At a high level, this can be seen as sending the changes to a new region object named (original region name + wal name hash), using the original sequence ids.)
> d) Writes from normal traffic (writes are allowed during recovery) go into the normal memstores as today and flush normally with new sequence ids.
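Steps (a)-(d) above amount to a routing decision per edit. The sketch below is an illustrative model only (names like `RecoveryRoutingSketch`, `replayEdit`, and `walRecovered` are invented for this example): recovered edits go into a per-WAL memstore keyed by the wal name hash and keep their original seq ids, live writes take fresh seq ids in the normal memstore, and the "wal recovered" signal flushes just that wal's memstore:

```java
import java.util.*;

public class RecoveryRoutingSketch {
    // One memstore per source WAL, keyed by the wal file name hash.
    final Map<String, List<String>> replayMemstores = new HashMap<>();
    // The region's normal memstore for live traffic during recovery.
    final List<String> normalMemstore = new ArrayList<>();
    long nextSeqId = 1000;  // fresh sequence ids for live writes

    // Recovered edit: routed by its source WAL, keeping the original seq id.
    void replayEdit(String walHash, String edit, long originalSeqId) {
        replayMemstores.computeIfAbsent(walHash, h -> new ArrayList<>())
                       .add(edit + "@seq" + originalSeqId);
    }

    // Live write during recovery: normal memstore, new seq id.
    void liveWrite(String edit) {
        normalMemstore.add(edit + "@seq" + nextSeqId++);
    }

    // Signal that a WAL is fully replayed: flush (here, drain) its memstore.
    List<String> walRecovered(String walHash) {
        return replayMemstores.remove(walHash);
    }
}
```

Because each per-WAL memstore flushes on its own signal with its original seq ids, a flush of one WAL's edits can no longer mask later edits from another WAL, which is the failure in the scenario above.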
> h5. The other alternative options are listed below for references:
> Option one
> a) disallow writes during recovery
> b) during replay, we pass original wal sequence ids
> c) hold flushes until all wals of a recovering region are replayed. The memstore should be able to hold them, because we only recover unflushed wal edits. For edits with the same key + version, whichever has the larger sequence id wins.
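A minimal model of option one, under the same illustrative-naming caveat as the earlier sketches: the region counts outstanding WALs and refuses to flush until all of them are replayed, while the in-memory resolution for the same key + version is simply "larger original sequence id wins":

```java
import java.util.*;

public class FlushGateSketch {
    private int walsRemaining;                               // WALs not yet fully replayed
    private final Map<String, Long> seqIds = new HashMap<>(); // key@ts -> winning seq id
    private final Map<String, String> values = new HashMap<>();

    FlushGateSketch(int walsToReplay) { this.walsRemaining = walsToReplay; }

    // Replayed edit: for the same key + version, the larger seq id wins.
    void replay(String keyAndTs, String value, long seqId) {
        Long cur = seqIds.get(keyAndTs);
        if (cur == null || seqId > cur) {
            seqIds.put(keyAndTs, seqId);
            values.put(keyAndTs, value);
        }
    }

    void walReplayed() { walsRemaining--; }

    // The flush gate: nothing may flush while any WAL is outstanding.
    boolean canFlush() { return walsRemaining == 0; }

    String get(String keyAndTs) { return values.get(keyAndTs); }
}
```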
> Option two
> a) During replay, we pass original wal sequence ids
> b) for each wal edit, we store each edit's original sequence id along with its key. 
> c) during scanning, we use the original sequence id if it's present, otherwise the store file's sequence id
> d) compaction can just leave put with max sequence id
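Option two can be sketched as a cell that carries an optional original seq id, with scans and compactions comparing on the "effective" sequence id. `Cell`, `effectiveSeqId`, and `pickNewest` are hypothetical names for this illustration, not HBase types:

```java
import java.util.*;

public class OriginalSeqSketch {
    // originalSeqId is null for cells written through the normal path;
    // replayed cells carry their original WAL sequence id.
    record Cell(String key, long ts, String value, long fileSeqId, Long originalSeqId) {
        // Use the original seq id if present, else the store file's seq id.
        long effectiveSeqId() { return originalSeqId != null ? originalSeqId : fileSeqId; }
    }

    // During a scan or a compaction, the put with the max effective
    // sequence id wins for a given key + version.
    static Cell pickNewest(List<Cell> sameKeyVersion) {
        return sameKeyVersion.stream()
            .max(Comparator.comparingLong(Cell::effectiveSeqId))
            .orElseThrow();
    }
}
```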
> Please let me know if you have better ideas.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira