Posted to issues@hbase.apache.org by "Michael Stack (Jira)" <ji...@apache.org> on 2020/12/05 22:28:00 UTC
[jira] [Commented] (HBASE-25353) [Flakey Tests] branch-2 TestShutdownBackupMaster

    [ https://issues.apache.org/jira/browse/HBASE-25353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17244592#comment-17244592 ] 

Michael Stack commented on HBASE-25353:
---------------------------------------

Oh, thanks for the helpful review [~huaxiangsun]

> [Flakey Tests] branch-2 TestShutdownBackupMaster
> ------------------------------------------------
>
>                 Key: HBASE-25353
>                 URL: https://issues.apache.org/jira/browse/HBASE-25353
>             Project: HBase
>          Issue Type: Sub-task
>          Components: flakies
>    Affects Versions: 2.4.0
>            Reporter: Michael Stack
>            Assignee: Michael Stack
>            Priority: Major
>             Fix For: 2.3.4, 2.5.0, 2.4.1
>
>
> Making this a sub-issue of the parent issue, which fails in a way similar to how we are failing now.
> Currently, I see that the TestShutdownBackupMaster test usually passes, but it is warped in how it completes. It will do all retries just before the test times out at 13 minutes max...: e.g. you'll see this...
> 2020-12-02 22:07:34,200 DEBUG [master/stack:0:becomeActiveMaster] client.ConnectionImplementation(1009): locateRegionInMeta parentTable='hbase:meta', attempt=44 of 46 failed; retrying after sleep of 46
> ... so we'll do all the retries and then complete so the test looks like it 'succeeded' but it actually ran for Total time: 12:41 min... and the log is full of thread dumps because the cluster won't go down (The time is spent in the test shutdown).
> Often though, we won't complete the retries in time and the test fails. It is in the flakey list.
> Rather, we are supposed to fail out fast when we are shutting down. Below is the type of retry we see.
>  
> {code:java}
> 2020-12-02 10:53:35,540 INFO [Listener at localhost/61609] util.JVMClusterUtil(348): Shutdown of 2 master(s) and 2 regionserver(s) complete
>  2020-12-02 10:53:35,548 DEBUG [master/stack:0:becomeActiveMaster] client.ConnectionImplementation(1009): locateRegionInMeta parentTable='hbase:meta', attempt=2 of 46 failed; retrying after sleep of 46
>  org.apache.hadoop.hbase.DoNotRetryIOException: hconnection-0x1afa7f5b closed
>  at org.apache.hadoop.hbase.client.ConnectionImplementation.checkClosed(ConnectionImplementation.java:630)
>  at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:815)
>  at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.locateRegion(ConnectionUtils.java:138)
>  at org.apache.hadoop.hbase.client.ConnectionImplementation.relocateRegion(ConnectionImplementation.java:803)
>  at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.relocateRegion(ConnectionUtils.java:138)
>  at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegionInMeta(ConnectionImplementation.java:933)
>  at org.apache.hadoop.hbase.client.ConnectionImplementation.locateRegion(ConnectionImplementation.java:823)
>  at org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.locateRegion(ConnectionUtils.java:138)
>  at org.apache.hadoop.hbase.client.HRegionLocator.getRegionLocation(HRegionLocator.java:64)
>  at org.apache.hadoop.hbase.client.RegionLocator.getRegionLocation(RegionLocator.java:70)
>  at org.apache.hadoop.hbase.client.RegionLocator.getRegionLocation(RegionLocator.java:59)
>  at org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:223)
>  at org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:105)
>  at org.apache.hadoop.hbase.client.HTable.get(HTable.java:383)
>  at org.apache.hadoop.hbase.client.HTable.get(HTable.java:357)
>  at org.apache.hadoop.hbase.master.TableNamespaceManager.get(TableNamespaceManager.java:141)
>  at org.apache.hadoop.hbase.master.TableNamespaceManager.isTableAvailableAndInitialized(TableNamespaceManager.java:278)
>  at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:103)
>  at org.apache.hadoop.hbase.master.ClusterSchemaServiceImpl.doStart(ClusterSchemaServiceImpl.java:63)
>  at org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.startAsync(AbstractService.java:249)
>  at org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1224)
>  at org.apache.hadoop.hbase.master.TestShutdownBackupMaster$MockHMaster.initClusterSchemaService(TestShutdownBackupMaster.java:68)
>  at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:1021)
>  at org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2082)
>  at org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:506){code}
> See how a master is trying to become active and it won't relent trying to become active master even though this cluster is shutting down? See how we retry but the check for close of the connection is coming back with a DoNotRetryIOException? The exception is being swallowed. We keep going.
> Fix looks simple enough.
>  
>  
>  
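
For illustration only, here is a minimal, self-contained sketch of the fail-fast behaviour the description says is expected: a retry loop that stops immediately on a "do not retry" exception instead of swallowing it and sleeping. It is not the HBase code and not the actual patch. FailFastRetry, DoNotRetryException (a stand-in for org.apache.hadoop.hbase.DoNotRetryIOException) and callWithRetries are hypothetical names, and the 46 attempts and 46 ms sleep are simply borrowed from the log line above.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical sketch; none of these names come from HBase.
final class FailFastRetry {

    /** Stand-in for HBase's DoNotRetryIOException (assumption for illustration). */
    static class DoNotRetryException extends IOException {
        DoNotRetryException(String message) {
            super(message);
        }
    }

    /**
     * Runs op up to maxAttempts times, sleeping sleepMs between attempts.
     * A DoNotRetryException is rethrown at once; other IOExceptions are retried.
     */
    static <T> T callWithRetries(Callable<T> op, int maxAttempts, long sleepMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (DoNotRetryException e) {
                // Fail fast: e.g. the connection is closed because the cluster is shutting down.
                throw e;
            } catch (IOException e) {
                // Possibly transient: remember it and retry after a pause.
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(sleepMs);
                }
            }
        }
        // All attempts failed with retryable errors; surface the last one.
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        try {
            callWithRetries(() -> {
                calls[0]++;
                throw new DoNotRetryException("connection closed");
            }, 46, 46L);
        } catch (DoNotRetryException e) {
            System.out.println("gave up after " + calls[0] + " attempt(s): " + e.getMessage());
        }
    }
}
```

In this sketch the closed-connection case is attempted once and then propagates to the caller, rather than burning through every retry while the test waits to time out.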



--
This message was sent by Atlassian Jira
(v8.3.4#803005)