Posted to issues@hbase.apache.org by GitBox <gi...@apache.org> on 2022/07/22 09:48:30 UTC

[GitHub] [hbase] sunhelly commented on a diff in pull request #4641: HBASE-27230 RegionServer should be aborted when WAL.sync throws Timeo…

sunhelly commented on code in PR #4641:
URL: https://github.com/apache/hbase/pull/4641#discussion_r927487734


##########
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java:
##########
@@ -8037,16 +8040,35 @@ private WriteEntry doWALAppend(WALEdit walEdit, BatchOperation<?> batchOp,
     try {
       long txid = this.wal.appendData(this.getRegionInfo(), walKey, walEdit);
       WriteEntry writeEntry = walKey.getWriteEntry();
-      this.attachRegionReplicationInWALAppend(batchOp, miniBatchOp, walKey, walEdit, writeEntry);
       // Call sync on our edit.
       if (txid != 0) {
         sync(txid, batchOp.durability);
       }
+      /**
+       * If above {@link HRegion#sync} throws Exception, the RegionServer should be aborted and
+       * following {@link BatchOperation#writeMiniBatchOperationsToMemStore} will not be executed,
+       * so there is no need to replicate to secondary replica, for this reason here we attach the
+       * region replication action after the {@link HRegion#sync} is successful.
+       */
+      this.attachRegionReplicationInWALAppend(batchOp, miniBatchOp, walKey, walEdit, writeEntry);
       return writeEntry;
     } catch (IOException ioe) {
       if (walKey.getWriteEntry() != null) {
         mvcc.complete(walKey.getWriteEntry());
       }
+
+      /**
+       * If {@link WAL#sync} get a timeout exception, the only correct way is to abort the region
+       * server, as the design of {@link WAL#sync}, is to succeed or die, there is no 'failure'. It

Review Comment:
   Actually, we should avoid aborting the RS when a WAL sync is slow or times out. Following the design of HBASE-22301, we should instead try to roll the slow WAL so that the new writer connects and writes to faster DNs. If you abort the RS to recover WAL flushing, you pay for a full MTTR cycle, yet the only part of the restart that actually helps with the problem is the new WAL created by the newly started RS.
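
   To illustrate the alternative, here is a minimal sketch of "roll instead of abort". Every name in it (`SlowSyncRollSketch`, `Writer`, `WriterFactory`, `RollingWal`, `maxRolls`) is hypothetical and none of it is the HBase WAL API. It also skips what a real roll must do, such as making sure unsynced edits reach the new writer.

```java
import java.util.concurrent.TimeoutException;

// Hypothetical sketch only: these types are NOT HBase classes.
final class SlowSyncRollSketch {

  /** Stand-in for a WAL writer bound to one DataNode pipeline. */
  interface Writer {
    void sync(long txid) throws TimeoutException;
  }

  /** Stand-in for whatever opens a fresh writer on a new pipeline. */
  interface WriterFactory {
    Writer create();
  }

  static final class RollingWal {
    private final WriterFactory factory;
    private final int maxRolls; // assumed limit on rolls per sync call
    private Writer current;
    int rolls = 0;              // total rolls performed so far

    RollingWal(WriterFactory factory, int maxRolls) {
      this.factory = factory;
      this.maxRolls = maxRolls;
      this.current = factory.create();
    }

    /**
     * Tries to sync; on a timeout, rolls to a new writer and retries.
     * Returns false only after maxRolls rolls have also timed out,
     * leaving the abort decision to the caller as a last resort.
     */
    boolean sync(long txid) {
      for (int attempt = 0; ; attempt++) {
        try {
          current.sync(txid);
          return true;
        } catch (TimeoutException e) {
          if (attempt >= maxRolls) {
            return false;
          }
          // Roll: replace the slow writer, hoping for faster DNs.
          current = factory.create();
          rolls++;
        }
      }
    }
  }
}
```

   With a factory whose first writer times out and whose second succeeds, `sync` returns true after a single roll and the RS keeps serving. Only when every rolled writer also times out does the caller have to consider aborting.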


