Posted to dev@hbase.apache.org by "Josh Elser (JIRA)" <ji...@apache.org> on 2017/08/16 21:52:00 UTC

[jira] [Created] (HBASE-18613) Race condition between master restart and test code when restoring distributed cluster after integration test

Josh Elser created HBASE-18613:
----------------------------------

             Summary: Race condition between master restart and test code when restoring distributed cluster after integration test
                 Key: HBASE-18613
                 URL: https://issues.apache.org/jira/browse/HBASE-18613
             Project: HBase
          Issue Type: Bug
          Components: integration tests
            Reporter: Josh Elser
            Assignee: Josh Elser
            Priority: Minor
             Fix For: 2.0.0, 1.4.0, 1.3.2, 1.2.7, 1.1.13


Noticed the following in some internal testing (line numbers are likely skewed):

{noformat}
2017-08-16 21:20:25,557| 2017-08-16 21:20:25,553 WARN  [main] client.ConnectionManager$HConnectionImplementation: Checking master connection
2017-08-16 21:20:25,557| com.google.protobuf.ServiceException: org.apache.hadoop.hbase.exceptions.ConnectionClosingException: Call to master1.domain.com/10.0.2.131:16000 failed on local exception: org.apache.hadoop.hbase.exceptions.ConnectionClosingException: Connection to master1.domain.com/10.0.2.131:16000 is closing. Call id=581, waitTime=1
2017-08-16 21:20:25,557| at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:223)
2017-08-16 21:20:25,558| at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
2017-08-16 21:20:25,560| at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$BlockingStub.isMasterRunning(MasterProtos.java:62739)
2017-08-16 21:20:25,560| at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$MasterServiceState.isMasterRunning(ConnectionManager.java:1448)
2017-08-16 21:20:25,561| at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.isKeepAliveMasterConnectedAndRunning(ConnectionManager.java:2124)
2017-08-16 21:20:25,561| at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getKeepAliveMasterService(ConnectionManager.java:1712)
2017-08-16 21:20:25,562| at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getMaster(ConnectionManager.java:1701)
2017-08-16 21:20:25,562| at org.apache.hadoop.hbase.DistributedHBaseCluster.getMasterAdminService(DistributedHBaseCluster.java:153)
2017-08-16 21:20:25,563| at org.apache.hadoop.hbase.DistributedHBaseCluster.waitForActiveAndReadyMaster(DistributedHBaseCluster.java:184)
2017-08-16 21:20:25,563| at org.apache.hadoop.hbase.HBaseCluster.waitForActiveAndReadyMaster(HBaseCluster.java:204)
2017-08-16 21:20:25,563| at org.apache.hadoop.hbase.DistributedHBaseCluster.restoreMasters(DistributedHBaseCluster.java:278)
2017-08-16 21:20:25,563| at org.apache.hadoop.hbase.DistributedHBaseCluster.restoreClusterStatus(DistributedHBaseCluster.java:239)
2017-08-16 21:20:25,563| at org.apache.hadoop.hbase.HBaseCluster.restoreInitialStatus(HBaseCluster.java:235)
2017-08-16 21:20:25,564| at org.apache.hadoop.hbase.IntegrationTestingUtility.restoreCluster(IntegrationTestingUtility.java:99)
2017-08-16 21:20:25,564| at org.apache.hadoop.hbase.IntegrationTestBase.cleanUpCluster(IntegrationTestBase.java:200)
2017-08-16 21:20:25,564| at org.apache.hadoop.hbase.IntegrationTestDDLMasterFailover.cleanUpCluster(IntegrationTestDDLMasterFailover.java:146)
2017-08-16 21:20:25,564| at org.apache.hadoop.hbase.IntegrationTestBase.cleanUp(IntegrationTestBase.java:140)
2017-08-16 21:20:25,564| at org.apache.hadoop.hbase.IntegrationTestBase.doWork(IntegrationTestBase.java:125)
2017-08-16 21:20:25,565| at org.apache.hadoop.hbase.util.AbstractHBaseTool.run(AbstractHBaseTool.java:112)
2017-08-16 21:20:25,565| at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
2017-08-16 21:20:25,565| at org.apache.hadoop.hbase.IntegrationTestDDLMasterFailover.main(IntegrationTestDDLMasterFailover.java:832)
2017-08-16 21:20:25,566| Caused by: org.apache.hadoop.hbase.exceptions.ConnectionClosingException: Call to master1.domain.com/10.0.2.131:16000 failed on local exception: org.apache.hadoop.hbase.exceptions.ConnectionClosingException: Connection to master1.domain.com/10.0.2.131:16000 is closing. Call id=581, waitTime=1
2017-08-16 21:20:25,566| at org.apache.hadoop.hbase.ipc.RpcClientImpl.wrapException(RpcClientImpl.java:1258)
2017-08-16 21:20:25,566| at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1229)
2017-08-16 21:20:25,566| at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
2017-08-16 21:20:25,566| ... 20 more
2017-08-16 21:20:25,566| Caused by: org.apache.hadoop.hbase.exceptions.ConnectionClosingException: Connection to master1.domain.com/10.0.2.131:16000 is closing. Call id=581, waitTime=1
2017-08-16 21:20:25,567| at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.cleanupCalls(RpcClientImpl.java:1047)
2017-08-16 21:20:25,567| at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.close(RpcClientImpl.java:846)
2017-08-16 21:20:25,567| at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.run(RpcClientImpl.java:574)
{noformat}

This happens while the IntegrationTest harness is resetting the state of the distributed cluster. When dealing with "slow" nodes, the restart of the previously active master can be delayed, which causes the test code to see a ConnectionClosingException (wrapped in a ServiceException).

I think we want to just consume this Exception, same as MasterNotRunningException and ZooKeeperConnectionException, in {{DistributedHBaseCluster#waitForActiveAndReadyMaster(long)}}.
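
A rough sketch of the idea, not the actual patch: the exception classes and the MasterProbe interface below are hypothetical stand-ins defined locally so the example compiles without HBase on the classpath, and the method signature is simplified. It only illustrates treating a ServiceException whose cause is a ConnectionClosingException as "master not ready yet, retry", the same way MasterNotRunningException would be treated.

```java
import java.io.IOException;

public class WaitForMasterSketch {
    // Hypothetical stand-ins for the real protobuf/HBase exception types.
    public static class ServiceException extends Exception {
        public ServiceException(Throwable cause) { super(cause); }
    }
    public static class ConnectionClosingException extends IOException {
        public ConnectionClosingException(String msg) { super(msg); }
    }
    public static class MasterNotRunningException extends IOException {
        public MasterNotRunningException(String msg) { super(msg); }
    }

    // Hypothetical abstraction over the "is the master running?" RPC.
    public interface MasterProbe {
        boolean isMasterRunning() throws ServiceException, IOException;
    }

    // Polls the probe until it reports a running master or the timeout expires.
    // Returns true if the master became ready, false on timeout.
    public static boolean waitForActiveAndReadyMaster(
            MasterProbe probe, long timeoutMs, long sleepMs)
            throws IOException, InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            try {
                if (probe.isMasterRunning()) {
                    return true;
                }
            } catch (MasterNotRunningException e) {
                // Master is not up yet: fall through and retry.
            } catch (ServiceException e) {
                if (!(e.getCause() instanceof ConnectionClosingException)) {
                    // Anything else is unexpected: surface it to the caller.
                    throw new IOException(e);
                }
                // Connection closed mid-restart: fall through and retry.
            }
            Thread.sleep(sleepMs);
        }
        return false;
    }
}
```

With a probe that fails a couple of times with a wrapped ConnectionClosingException before succeeding, the loop keeps polling instead of failing the cluster restore. Any other wrapped cause is still rethrown, so unrelated failures are not silently swallowed.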



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)