Posted to issues@hbase.apache.org by "Rushabh Shah (Jira)" <ji...@apache.org> on 2021/08/12 18:47:00 UTC

[jira] [Commented] (HBASE-26195) Data is present in replicated cluster but not present in primary cluster.

    [ https://issues.apache.org/jira/browse/HBASE-26195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17398232#comment-17398232 ] 

Rushabh Shah commented on HBASE-26195:
--------------------------------------

We discussed the following solutions at my day job. The ideas actually came from Bharath, Andrew and Geoffrey, so all the credit goes to them.

*Approach 1*: +Add a rollback marker to the WAL after we roll back from the memstore.+
After we roll the mutation back out of the memstore, we should add a rollback marker to the WAL with the same timestamp as the original append. Setting the rollback marker's timestamp to the original timestamp is the key here; otherwise we would undo edits that happened between the two timestamps.
This can cause some correctness issues until we have HBASE-25975. Without HBASE-25975, when timestamps are equal, deletes take precedence over put mutations.
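
To make the timestamp point concrete, here is a minimal sketch (not existing HBase code; the class and helper are made up for illustration, using only the public client API) of pinning the rollback for a failed cell to its original append timestamp. If we used the current time instead, the rollback would also wipe out edits written between the two timestamps; and even with the original timestamp, the pre-HBASE-25975 semantics mean the delete still wins over a put at the exact same timestamp, which is the caveat above.
{code:java}
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Delete;

/**
 * Illustrative sketch only, not an existing HBase API.
 * Shows why the rollback must reuse the ORIGINAL append timestamp.
 */
public final class RollbackMarkerSketch {

  /** Build a rollback for a single cell whose WAL sync timed out. */
  static Delete rollbackFor(Cell cell) {
    Delete d = new Delete(CellUtil.cloneRow(cell));
    // Key point of Approach 1: target only the original cell version.
    // Using "now" here would also undo later edits to the same column.
    d.addColumn(CellUtil.cloneFamily(cell),
                CellUtil.cloneQualifier(cell),
                cell.getTimestamp());
    return d;
  }

  private RollbackMarkerSketch() { }
}
{code}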


*Approach 2*: +Do not replicate entries (for the currently active WAL) until WALKey#log_sequence_number <= mvcc#readPoint.+
Today ReplicationSourceWALReaderThread doesn't have access to the HRegion's MVCC object, but that can be made available.
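
A minimal sketch of the gating check, assuming the region's MultiVersionConcurrencyControl handle could be plumbed into the WAL reader (it isn't today, as noted above) and assuming the 2.x-style WALKey#getSequenceId accessor:
{code:java}
import org.apache.hadoop.hbase.regionserver.MultiVersionConcurrencyControl;
import org.apache.hadoop.hbase.wal.WAL;

/**
 * Illustrative sketch only: the check Approach 2 would add for entries
 * read from the currently active WAL. How the mvcc handle reaches the
 * reader thread is an open question, not existing code.
 */
final class ReplicationGateSketch {

  /** Ship the entry only once its sequence id is visible via MVCC. */
  static boolean readyToReplicate(WAL.Entry entry,
                                  MultiVersionConcurrencyControl mvcc) {
    // Hold back edits whose sync may still fail and be rolled back:
    // an entry is safe to ship only after the region's read point has
    // advanced past its log sequence number.
    return entry.getKey().getSequenceId() <= mvcc.getReadPoint();
  }

  private ReplicationGateSketch() { }
}
{code}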


Both of the above approaches suffer from the same inconsistency if the RS crashes.
Consider the following sequence of events.

Approach 1
1 syncFuture timed out
2 client notified of failure
3 sync succeeded in the background
4 memstore rollback
5 apply rollback marker in the WAL and replicate
 
If the RS crashes between 4 and 5, the row is present in the peer cluster but not in the source cluster.
 

Approach 2
1 syncFuture timed out
2 client notified of failure
3 sync succeeded in the background
4 replication waiting until logSeq <= mvcc#readPoint
5 memstore rollback
6 RS crashes

Now the replication of this WAL is handled by another RS, which has no MVCC state to rely on, so the best it can do is replicate the entry, and we are back to the same problem.


*Approach 3:* +Just abort the region server instead of doing a memstore rollback.+
In this approach, the client will be notified of the failure, but on the server side the mutation succeeds and will also be replicated to the peer cluster.
This way both the source and peer clusters will be in a consistent state.
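
A minimal sketch of what the failure path boils down to under this approach; the method name and the way the Abortable handle reaches the sync-timeout path are assumptions for illustration, not existing code:
{code:java}
import java.io.IOException;
import org.apache.hadoop.hbase.Abortable;

/**
 * Illustrative sketch only: Approach 3 replaces the memstore rollback
 * with a region server abort. "server" stands for the RS's Abortable
 * handle; the surrounding sync/rollback plumbing is elided.
 */
final class AbortOnSyncTimeoutSketch {

  static void onWalSyncTimeout(Abortable server, IOException cause) {
    // Instead of rolling the mutation back out of the memstore (which
    // can diverge from an eventually-successful WAL sync), take the
    // whole region server down. WAL replay on region reassignment then
    // keeps the source and peer clusters consistent with each other.
    server.abort("WAL sync timed out; aborting instead of memstore rollback",
        cause);
  }

  private AbortOnSyncTimeoutSketch() { }
}
{code}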

Cons of Approach 3: If we start aborting RSs for _not-very-grave_ HDFS issues, then we will lose HBase capacity quickly and the cluster will become unavailable.

Given that we don't have an atomic append-and-sync operation in HDFS, there will be edge cases whichever solution we choose.

[~bharathv] [~apurtell] [~gjacoby] Please correct me if I missed anything.

> Data is present in replicated cluster but not present in primary cluster.
> -------------------------------------------------------------------------
>
>                 Key: HBASE-26195
>                 URL: https://issues.apache.org/jira/browse/HBASE-26195
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication, wal
>    Affects Versions: 3.0.0-alpha-1, 1.7.0, 2.5.0
>            Reporter: Rushabh Shah
>            Assignee: Rushabh Shah
>            Priority: Major
>
> We encountered a case where we are seeing some rows (via Phoenix) in the replicated cluster but they are not present in the source/active cluster.
> Triaging further, we found memstore rollback logs on a few of the region servers.
> {noformat}
> 2021-07-28 14:17:59,353 DEBUG [3,queue=3,port=60020] regionserver.HRegion - rollbackMemstore rolled back 23
> 2021-07-28 14:17:59,353 DEBUG [,queue=25,port=60020] regionserver.HRegion - rollbackMemstore rolled back 23
> 2021-07-28 14:17:59,354 DEBUG [3,queue=3,port=60020] regionserver.HRegion - rollbackMemstore rolled back 23
> 2021-07-28 14:17:59,354 DEBUG [,queue=25,port=60020] regionserver.HRegion - rollbackMemstore rolled back 23
> 2021-07-28 14:17:59,354 DEBUG [3,queue=3,port=60020] regionserver.HRegion - rollbackMemstore rolled back 23
> 2021-07-28 14:17:59,354 DEBUG [,queue=25,port=60020] regionserver.HRegion - rollbackMemstore rolled back 23
> 2021-07-28 14:17:59,355 DEBUG [3,queue=3,port=60020] regionserver.HRegion - rollbackMemstore rolled back 23
> 2021-07-28 14:17:59,355 DEBUG [,queue=25,port=60020] regionserver.HRegion - rollbackMemstore rolled back 23
> 2021-07-28 14:17:59,356 DEBUG [,queue=25,port=60020] regionserver.HRegion - rollbackMemstore rolled back 23
> {noformat}
> Looking further into the logs, we found that there were some HDFS-layer issues syncing the WAL to HDFS.
> It was taking around 6 minutes to sync the WAL. Logs below:
> {noformat}
> 2021-07-28 14:19:30,511 WARN  [sync.0] hdfs.DataStreamer - Slow waitForAckedSeqno took 391210ms (threshold=30000ms). File being written: /hbase/WALs/<rs-node>,60020,1626191371499/<rs-node>%2C60020%2C1626191371499.1627480615620, block: BP-958889176-<namenode>-1567030695029:blk_1689647875_616028364, Write pipeline datanodes: [DatanodeInfoWithStorage[<dn-1>:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK], DatanodeInfoWithStorage[<dn-2>:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK], DatanodeInfoWithStorage[<dn-3>:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]].
> 2021-07-28 14:19:30,589 WARN  [sync.1] hdfs.DataStreamer - Slow waitForAckedSeqno took 391148ms (threshold=30000ms). File being written: /hbase/WALs/<rs-node>,60020,1626191371499/<rs-node>%2C60020%2C1626191371499.1627480615620, block: BP-958889176-<namenode>-1567030695029:blk_1689647875_616028364, Write pipeline datanodes: [DatanodeInfoWithStorage[<dn-1>:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK], DatanodeInfoWithStorage[<dn-2>:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK], DatanodeInfoWithStorage[<dn-3>:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]].
> 2021-07-28 14:19:30,589 WARN  [sync.2] hdfs.DataStreamer - Slow waitForAckedSeqno took 391147ms (threshold=30000ms). File being written: /hbase/WALs/<rs-node>,60020,1626191371499/<rs-node>%2C60020%2C1626191371499.1627480615620, block: BP-958889176-<namenode>-1567030695029:blk_1689647875_616028364, Write pipeline datanodes: [DatanodeInfoWithStorage[<dn-1>:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK], DatanodeInfoWithStorage[<dn-2>:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK], DatanodeInfoWithStorage[<dn-3>:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]].
> 2021-07-28 14:19:30,591 INFO  [sync.0] wal.FSHLog - Slow sync cost: 391289 ms, current pipeline: [DatanodeInfoWithStorage[<dn-1>:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK], DatanodeInfoWithStorage[<dn-2>:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK], DatanodeInfoWithStorage[<dn-3>:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]]
> 2021-07-28 14:19:30,591 INFO  [sync.1] wal.FSHLog - Slow sync cost: 391227 ms, current pipeline: [DatanodeInfoWithStorage[<dn-1>:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK], DatanodeInfoWithStorage[<dn-2>:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK], DatanodeInfoWithStorage[<dn-3>:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]]
> 2021-07-28 14:19:30,591 WARN  [sync.1] wal.FSHLog - Requesting log roll because we exceeded slow sync threshold; time=391227 ms, threshold=10000 ms, current pipeline: [DatanodeInfoWithStorage[<dn-1>:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK], DatanodeInfoWithStorage[<dn-2>:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK], DatanodeInfoWithStorage[<dn-3>:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]]
> 2021-07-28 14:19:30,591 INFO  [sync.2] wal.FSHLog - Slow sync cost: 391227 ms, current pipeline: [DatanodeInfoWithStorage[<dn-1>:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK], DatanodeInfoWithStorage[<dn-2>:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK], DatanodeInfoWithStorage[<dn-3>:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]]
> 2021-07-28 14:19:30,591 WARN  [sync.2] wal.FSHLog - Requesting log roll because we exceeded slow sync threshold; time=391227 ms, threshold=10000 ms, current pipeline: [DatanodeInfoWithStorage[<dn-1>:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK], DatanodeInfoWithStorage[<dn-2>:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK], DatanodeInfoWithStorage[<dn-3>:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]]
> 2021-07-28 14:19:30,591 WARN  [sync.3] hdfs.DataStreamer - Slow waitForAckedSeqno took 391217ms (threshold=30000ms). File being written: /hbase/WALs/<rs-node>,60020,1626191371499/<rs-node>%2C60020%2C1626191371499.1627480615620, block: BP-958889176-<namenode>-1567030695029:blk_1689647875_616028364, Write pipeline datanodes: [DatanodeInfoWithStorage[<dn-1>:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK], DatanodeInfoWithStorage[<dn-2>:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK], DatanodeInfoWithStorage[<dn-3>:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]].
> 2021-07-28 14:19:30,591 INFO  [sync.3] wal.FSHLog - Slow sync cost: 391217 ms, current pipeline: [DatanodeInfoWithStorage[<dn-1>:50010,DS-b5747702-8ab9-4a5e-916e-5fae6e305738,DISK], DatanodeInfoWithStorage[<dn-2>:50010,DS-505dabb0-0fd6-42d9-b25d-f25e249fe504,DISK], DatanodeInfoWithStorage[<dn-3>:50010,DS-6c585673-d4d0-4ec6-bafe-ad4cd861fb4b,DISK]]
> {noformat}
> Just to add more details, we also found there was a namenode failover. It might be a contributing factor to why the syncs were slow, but we are not sure.
> {noformat}
> 2021-07-28 14:19:30,390 WARN  [<rs-host>:60020-3] retry.RetryInvocationHandler - A failover has occurred since the start of call #8491727 Client
> NamenodeProtocolTranslatorPB.mkdirs over <namenode-hostname>/<namenode-ip>:8020
> 2021-07-28 14:19:30,390 WARN  [<rs-host>:60020-7] retry.RetryInvocationHandler - A failover has occurred since the start of call #8491726 ClientNamenodeProtocolTranslatorPB.mkdirs over <namenode-hostname>/<namenode-ip>:8020
> 2021-07-28 14:19:30,391 WARN  [20%2C1626191371499,1] retry.RetryInvocationHandler - A failover has occurred since the start of call #8491723 ClientNamenodeProtocolTranslatorPB.getBlockLocations over <namenode-hostname>/<namenode-ip>:8020
> 2021-07-28 14:19:30,391 WARN  [0.40:60020.logRoller] retry.RetryInvocationHandler - A failover has occurred since the start of call #8491729 ClientNamenodeProtocolTranslatorPB.getFileInfo over <namenode-hostname>/<namenode-ip>:8020
> 2021-07-28 14:19:30,391 WARN  [MemStoreFlusher.1] retry.RetryInvocationHandler - A failover has occurred since the start of call #8491730 ClientNamenodeProtocolTranslatorPB.getFileInfo over <namenode-hostname>/<namenode-ip>:8020
> 2021-07-28 14:19:30,391 WARN  [371499.1627480615620] retry.RetryInvocationHandler - A failover has occurred since the start of call #8491731 ClientNamenodeProtocolTranslatorPB.addBlock over <namenode-hostname>/<namenode-ip>:8020
> 2021-07-28 14:19:30,391 WARN  [c@prod] retry.RetryInvocationHandler - A failover has occurred since the start of call #8491725 ClientNamenodeProtocolTranslatorPB.renewLease over <namenode-hostname>/<namenode-ip>:8020
> 2021-07-28 14:19:30,391 WARN  [91371499@<Replicatonsource-thread>] retry.RetryInvocationHandler - A failover has occurred since the start of call #8491724 ClientNamenodeProtocolTranslatorPB.getBlockLocations over <namenode-hostname>/<namenode-ip>:8020
> {noformat}
> We use the default value for hbase.regionserver.wal.sync.timeout, which is 5 minutes.
> Since the sync took more than 6 minutes (which is more than hbase.regionserver.wal.sync.timeout), the client received an exception and we rolled back the mutations in the memstore on the primary cluster.
> But the HDFS sync +eventually+ succeeded, so the row was present in the WAL and was eligible for replication.
> [~bharathv] [~andrew.purtell@gmail.com] [~gjacoby] helped brainstorm possible solutions. I will update the possible approaches in the comments.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)