Posted to issues@hbase.apache.org by "Guanghao Zhang (Jira)" <ji...@apache.org> on 2020/08/14 08:31:00 UTC
[jira] [Commented] (HBASE-23984) [Flakey Tests] TestMasterAbortAndRSGotKilled fails in teardown

    [ https://issues.apache.org/jira/browse/HBASE-23984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17177601#comment-17177601 ] 

Guanghao Zhang commented on HBASE-23984:
----------------------------------------

I hit this problem on branch-2.2 too. It happens because of the DelayCloseCP. The order of events is:
 # The region close starts. Because of the DelayCloseCP, the close only completes after 10 seconds.
 # The unit test finishes and shuts down the cluster.
 # The master shuts down.
 # The RS shuts down and calls waitOnAllRegionsToClose. At this point abortRequested is still false.
 # The region close fails because the master is down and the close cannot be reported to it. The RegionServer then aborts and sets abortRequested to true.
 # waitOnAllRegionsToClose hangs because the set of online regions never becomes empty.

 

waitOnAllRegionsToClose(final boolean abort) already handles the abort case, but the problem is that abortRequested is still false when the method is called. I think the fix should be to keep checking abortRequested inside the waitOnAllRegionsToClose method.
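
To illustrate the idea, here is a minimal, self-contained sketch, not the actual HRegionServer code. The class WaitOnRegionsSketch, its fields, the pollMillis parameter and the boolean return value are all made up for this example; only the names waitOnAllRegionsToClose and abortRequested come from the discussion above. The point is that the wait loop re-reads the live abort flag on every pass instead of relying only on the value that was passed in when the method was called.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for the region server shutdown wait; not the HBase implementation. */
final class WaitOnRegionsSketch {
  // Stand-in for the server's abortRequested flag (assumption: modelled as an AtomicBoolean).
  private final AtomicBoolean abortRequested = new AtomicBoolean(false);
  // Stand-in for the online regions, keyed by a plain String name for simplicity.
  private final Set<String> onlineRegions = ConcurrentHashMap.newKeySet();

  void addRegion(String name) { onlineRegions.add(name); }

  void closeRegion(String name) { onlineRegions.remove(name); }

  void abort() { abortRequested.set(true); }

  /**
   * Waits until no region is online.
   *
   * @param abort value of the abort flag sampled by the caller before the call
   * @return true if every region closed, false if the wait gave up
   */
  boolean waitOnAllRegionsToClose(boolean abort, long pollMillis) {
    while (!onlineRegions.isEmpty()) {
      // Check the live flag on every pass: an abort requested after this
      // method was entered is still noticed and ends the wait.
      if (abort || abortRequested.get()) {
        return false;
      }
      try {
        Thread.sleep(pollMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt and stop waiting
        return false;
      }
    }
    return true;
  }
}
```

With this shape, a close that fails and triggers an abort while the shutdown thread is already waiting (steps 5 and 6 above) ends the loop instead of leaving it spinning on a region that will never go offline.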

 

> [Flakey Tests] TestMasterAbortAndRSGotKilled fails in teardown
> --------------------------------------------------------------
>
>                 Key: HBASE-23984
>                 URL: https://issues.apache.org/jira/browse/HBASE-23984
>             Project: HBase
>          Issue Type: Test
>    Affects Versions: 2.3.0, 2.4.0
>            Reporter: Michael Stack
>            Assignee: Michael Stack
>            Priority: Major
>             Fix For: 3.0.0-alpha-1, 2.3.0
>
>         Attachments: 0001-HBASE-23984-Flakey-Tests-TestMasterAbortAndRSGotKill.patch, Screen Shot 2020-03-17 at 9.46.49 PM.png
>
>
> It's failing with decent frequency of late in shutdown of cluster. Seems basic. There is an unassign/move going on. Test just checks Master can come back up after being killed. Does not check move is done. If, on subsequent cluster shutdown, the move can't report to the Master because it's shutting down, then the move fails, we abort the server, and then we get a wonky loop where we can't close because server is aborting.
> At the root, there is a misaccounting when the unassign close fails where we don't clean up references in the regionserver local RIT accounting. Deeper than this, close code is duplicated in three places that I can see: in RegionServer, in CloseRegionHandler, and in UnassignRegionHandler.
> Let me fix this issue and the code dupe.
> Details:
> From https://builds.apache.org/job/HBase-Flaky-Tests/job/branch-2/5733/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.master.TestMasterAbortAndRSGotKilled-output.txt
> Here is the unassign handler failing because master went down earlier (it's probably trying to talk to the old Master location):
> {code}
> ***** ABORTING region server asf905.gq1.ygridcore.net,32989,1584000644108: Failed to close region ede67f9f661acc1241faf468b081d548 and can not recover *****
> Cause:
> java.io.IOException: Failed to report close to master: ede67f9f661acc1241faf468b081d548
> 	at org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler.process(UnassignRegionHandler.java:125)
> 	at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
> {code}
> ... then the cluster shutdown tries to close the same Region... but fails because we are aborting because of above.... 
> {code}
> 2020-03-12 08:11:16,600 ERROR [RS_CLOSE_REGION-regionserver/asf905:0-0] helpers.MarkerIgnoringBase(159): ***** ABORTING region server asf905.gq1.ygridcore.net,32989,1584000644108: Unrecoverable exception while closing region hbase:namespace,,1584000652744.78f4ae5beda711a9bebad0b6b8376cc9., still finishing close *****
> java.io.IOException: Aborting flush because server is aborted...
> 	at org.apache.hadoop.hbase.regionserver.HRegion.internalPrepareFlushCache(HRegion.java:2545)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2530)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2504)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2495)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1650)
> 	at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1552)
> 	at org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:110)
> 	at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
> {code}
> ....
> And the RS keeps looping trying to close the Region even though we're aborted and there is handling in RS close Regions to deal with abort.
> Trouble seems to be because when UnassignRegionHandler fails its region close, it does not unregister the Region with rs.getRegionsInTransitionInRS().remove(encodedNameBytes, Boolean.FALSE);
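
The accounting problem described at the end of the quoted issue can be sketched as follows. This is a simplified illustration, not the UnassignRegionHandler code: the class RegionsInTransitionSketch, the RegionCloser interface and the String keys are assumptions made for the example (the call quoted in the issue passes encodedNameBytes). It relies only on ConcurrentMap.remove(key, value), which removes the entry only when it is still mapped to the given value, and it assumes that Boolean.FALSE marks a close in progress, as the remove call in the issue suggests.

```java
import java.io.IOException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical model of the region server's local regions-in-transition bookkeeping. */
final class RegionsInTransitionSketch {
  // Encoded region name -> marker; FALSE is assumed to mean "close in progress".
  private final ConcurrentMap<String, Boolean> regionsInTransition = new ConcurrentHashMap<>();

  /** Hypothetical callback that performs the close and reports it to the master. */
  interface RegionCloser {
    void close(String encodedName) throws IOException;
  }

  /** Returns true if the region was closed, false if it was busy or the close failed. */
  boolean unassign(String encodedName, RegionCloser closer) {
    if (regionsInTransition.putIfAbsent(encodedName, Boolean.FALSE) != null) {
      return false; // some other open or close is already in flight for this region
    }
    try {
      closer.close(encodedName);
      return true;
    } catch (IOException e) {
      return false; // e.g. "Failed to report close to master"
    } finally {
      // Clear the marker on success and on failure alike, so a failed close
      // does not leave a stale in-transition entry behind.
      regionsInTransition.remove(encodedName, Boolean.FALSE);
    }
  }

  boolean isInTransition(String encodedName) {
    return regionsInTransition.containsKey(encodedName);
  }
}
```

Putting the cleanup in a finally block is one way to make sure a later close attempt, such as the one issued during cluster shutdown, is not blocked by an entry left over from the failed unassign.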



--
This message was sent by Atlassian Jira
(v8.3.4#803005)