Posted to dev@hbase.apache.org by "zhangduo (JIRA)" <ji...@apache.org> on 2015/03/07 10:38:38 UTC
[jira] [Created] (HBASE-13172) TestDistributedLogSplitting.testThreeRSAbort fails several times on branch-1
zhangduo created HBASE-13172:
--------------------------------
Summary: TestDistributedLogSplitting.testThreeRSAbort fails several times on branch-1
Key: HBASE-13172
URL: https://issues.apache.org/jira/browse/HBASE-13172
Project: HBase
Issue Type: Bug
Components: test
Affects Versions: 1.1.0
Reporter: zhangduo
The direct cause is that we are stuck in ServerManager.isServerReachable.
https://builds.apache.org/job/HBase-1.1/253/testReport/org.apache.hadoop.hbase.master/TestDistributedLogSplitting/testThreeRSAbort/
{noformat}
2015-03-06 04:06:19,430 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=0 of 10
2015-03-06 04:07:10,545 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=9 of 10
{noformat}
The interval between the first and last retry log entries is about 1 minute, and we only wait 1 minute, so the test times out.
I still do not know why this happens.
At the end of the log there are many entries like this:
{noformat}
2015-03-06 04:07:21,529 DEBUG [AM.-pool300-t1] master.ServerManager(855): Couldn't reach asf906.gq1.ygridcore.net,59366,1425614770146, try=9 of 10
org.apache.hadoop.hbase.ipc.StoppedRpcClientException
at org.apache.hadoop.hbase.ipc.RpcClientImpl.getConnection(RpcClientImpl.java:1261)
at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1146)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.getServerInfo(AdminProtos.java:22031)
at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getServerInfo(ProtobufUtil.java:1797)
at org.apache.hadoop.hbase.master.ServerManager.isServerReachable(ServerManager.java:850)
at org.apache.hadoop.hbase.master.RegionStates.isServerDeadAndNotProcessed(RegionStates.java:843)
at org.apache.hadoop.hbase.master.AssignmentManager.forceRegionStateToOffline(AssignmentManager.java:1969)
at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1576)
at org.apache.hadoop.hbase.master.AssignCallable.call(AssignCallable.java:48)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
{noformat}
I think the problem is here:
{code:title=ServerManager.java}
while (retryCounter.shouldRetry()) {
  ...
  try {
    retryCounter.sleepUntilNextRetry();
  } catch (InterruptedException ie) {
    Thread.currentThread().interrupt();
  }
  ...
}
{code}
We need to break out of the while loop when we catch InterruptedException, not just mark the current thread as interrupted.
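A minimal, self-contained sketch of the proposed behavior follows. It is not the ServerManager code: the class InterruptAwareRetry, the method retryUntilReachable, and the BooleanSupplier probe are hypothetical stand-ins, and Thread.sleep is used in place of RetryCounter. It only illustrates restoring the interrupt flag and then leaving the loop.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Hypothetical illustration only; none of these names exist in HBase.
class InterruptAwareRetry {

    /**
     * Calls probe up to maxAttempts times, sleeping sleepMillis after each
     * failed attempt. Returns true as soon as the probe succeeds, and false
     * if the attempts are exhausted or the thread is interrupted.
     */
    static boolean retryUntilReachable(BooleanSupplier probe, int maxAttempts, long sleepMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (probe.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                // Restore the flag for callers, then stop retrying.
                Thread.currentThread().interrupt();
                break;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        AtomicInteger attempts = new AtomicInteger();
        // Simulate an interrupt that arrives before the retry loop starts.
        Thread.currentThread().interrupt();
        boolean reachable = retryUntilReachable(() -> {
            attempts.incrementAndGet();
            return false; // the probe never succeeds in this demo
        }, 10, 1000L);
        // The loop exits after the first failed attempt instead of running all 10.
        System.out.println("reachable=" + reachable + ", attempts=" + attempts.get());
        Thread.interrupted(); // clear the flag before exiting
    }
}
```

With the break in place, an interrupted caller gives up after the current attempt. Without it, the loop would keep going until the retry budget is used up, because the catch block only re-sets the flag.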