Posted to dev@hbase.apache.org by "zhangduo (JIRA)" <ji...@apache.org> on 2015/03/07 10:38:38 UTC

[jira] [Created] (HBASE-13172) TestDistributedLogSplitting.testThreeRSAbort fails several times on branch-1

zhangduo created HBASE-13172:
--------------------------------

             Summary: TestDistributedLogSplitting.testThreeRSAbort fails several times on branch-1
                 Key: HBASE-13172
                 URL: https://issues.apache.org/jira/browse/HBASE-13172
             Project: HBase
          Issue Type: Bug
          Components: test
    Affects Versions: 1.1.0
            Reporter: zhangduo


The direct cause is that we are stuck in ServerManager.isServerReachable.

https://builds.apache.org/job/HBase-1.1/253/testReport/org.apache.hadoop.hbase.master/TestDistributedLogSplitting/testThreeRSAbort/

{noformat}
2015-03-06 04:06:19,430 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=0 of 10
2015-03-06 04:07:10,545 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=9 of 10
{noformat}
The interval between the first and last retry log entries is about 1 minute, and the test only waits 1 minute, so it times out.
I still do not know why this happens.

In the end, the log contains many entries like this:
{noformat}
2015-03-06 04:07:21,529 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=9 of 10
org.apache.hadoop.hbase.ipc.StoppedRpcClientException
	at org.apache.hadoop.hbase.ipc.RpcClientImpl.getConnection(RpcClientImpl.java:1261)
	at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1146)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
	at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getServerInfo(AdminProtos.java:22031)
	at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1797)
	at org.apache.hadoop.hbase.master.ServerManager.isServerReachable(ServerManager.java:850)
	at org.apache.hadoop.hbase.master.RegionStates.isServerDeadAndNotProcessed(RegionStates.java:843)
	at org.apache.hadoop.hbase.master.AssignmentManager.forceRegionStateToOffline(AssignmentManager.java:1969)
	at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1576)
	at org.apache.hadoop.hbase.master.AssignCallable.call(AssignCallable.java:48)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)
{noformat}
I think the problem is here:
{code:title=ServerManager.java}
    while (retryCounter.shouldRetry()) {
        ...
        try {
          retryCounter.sleepUntilNextRetry();
        } catch(InterruptedException ie) {
          Thread.currentThread().interrupt();
        }
        ...
    }
{code}
We need to break out of the while loop when we catch InterruptedException, not just mark the current thread as interrupted.
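
A minimal, self-contained sketch of that idea follows. It is not the actual ServerManager code or a proposed patch: the class RetryLoopSketch, the method retryUntilReachable, and its parameters are hypothetical names used only for illustration, and a plain Thread.sleep stands in for retryCounter.sleepUntilNextRetry(). The point is only the control flow: on InterruptedException the loop restores the interrupt flag and exits instead of continuing to retry.

```java
import java.util.concurrent.Callable;

// Hypothetical illustration only; not HBase code.
final class RetryLoopSketch {

    /**
     * Calls probe up to maxAttempts times, sleeping sleepMillis between failed attempts.
     * Returns true on the first successful probe. Returns false if all attempts fail
     * or if the calling thread is interrupted while probing or sleeping.
     */
    static boolean retryUntilReachable(Callable<Boolean> probe, int maxAttempts, long sleepMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                if (probe.call()) {
                    return true;
                }
            } catch (InterruptedException ie) {
                // Restore the flag for callers, then stop retrying.
                Thread.currentThread().interrupt();
                break;
            } catch (Exception e) {
                // Any other failure is treated as "not reachable yet"; fall through to the sleep.
            }
            if (attempt == maxAttempts - 1) {
                break; // no point sleeping after the final attempt
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                // Restore the flag AND leave the loop, rather than only marking the thread.
                Thread.currentThread().interrupt();
                break;
            }
        }
        return false;
    }
}
```

With this shape, a thread that is interrupted during the sleep gives up right away instead of working through the remaining retries, while the restored interrupt flag still lets the caller see that an interrupt occurred.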



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)