Posted to issues@hbase.apache.org by "Josh Elser (JIRA)" <ji...@apache.org> on 2016/12/13 20:47:59 UTC
[jira] [Commented] (HBASE-17306) IntegrationTestRSGroup#testRegionMove may fail due to region server not online

    [ https://issues.apache.org/jira/browse/HBASE-17306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15746222#comment-15746222 ] 

Josh Elser commented on HBASE-17306:
------------------------------------

bq. Shortly before the test failure, the server was shutdown:

Was this shutdown/restart due to ChaosMonkey? My worry is that your fix would just retry very quickly and fail 3 times, leaving us with the same problem. It looks like about 5 minutes went by before the RS was restarted.

I'm not familiar enough with the RSGroups feature: are groups defined by hostname or the actual ServerName (hostname+port+timestamp)?
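To make the distinction concrete, here is a minimal sketch using the two server strings from the logs in the description. `ServerIdentity` is a hypothetical stand-in written for this comment, not HBase's own ServerName class, and it assumes the comma-separated `host,port,startcode` layout seen in those logs. It only shows that a restarted RegionServer keeps its host and port but gets a new start code, so whether the group membership survives a restart depends on which of the two the group is keyed by.

```java
import java.util.Objects;

/** Hypothetical stand-in for a server identity of the form host,port,startcode. */
final class ServerIdentity {
    final String host;
    final int port;
    final long startCode;

    ServerIdentity(String host, int port, long startCode) {
        this.host = host;
        this.port = port;
        this.startCode = startCode;
    }

    /** Parses "host,port,startcode" as it appears in the master log lines. */
    static ServerIdentity parse(String s) {
        String[] parts = s.split(",");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected host,port,startcode: " + s);
        }
        return new ServerIdentity(
            parts[0].trim(),
            Integer.parseInt(parts[1].trim()),
            Long.parseLong(parts[2].trim()));
    }

    /** host:port, the form used in the ConstraintException message. */
    String address() {
        return host + ":" + port;
    }

    /** True if both identities refer to the same host and port, ignoring the start code. */
    boolean sameAddress(ServerIdentity other) {
        return host.equals(other.host) && port == other.port;
    }

    /** Full equality includes the start code, so a restarted server is a different identity. */
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ServerIdentity)) {
            return false;
        }
        ServerIdentity other = (ServerIdentity) o;
        return sameAddress(other) && startCode == other.startCode;
    }

    @Override
    public int hashCode() {
        return Objects.hash(host, port, startCode);
    }
}
```

With the values from the description, the instance that was shut down (start code 1481606309159) and the one that registered later (start code 1481606803303) share an address but are not equal as full identities.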

I would think it would be more reliable to stop ChaosMonkey (or whatever process is stopping RegionServers) before trying to restore the cluster back to "normal". Granted, we could still run into this in the normal case, but if RSGroups requires the server to be online to change groups, I'm not coming up with a way to fix the test (as we would have to block until the server came back online for correctness).
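As a rough illustration of what "block until the server came back online" could look like, compared with a fixed number of quick retries, here is a sketch of a deadline-based wait. Everything in it is an assumption made for illustration: `WaitForServer`, `waitUntilOnline`, and the `Supplier` that reports online `host:port` addresses are hypothetical and are not part of the attached patch or of the RSGroups API.

```java
import java.util.Set;
import java.util.function.Supplier;

/** Hypothetical helper: poll a view of online servers until one appears or a deadline passes. */
final class WaitForServer {

    /**
     * Returns true once onlineAddresses reports the given host:port, or false if
     * timeoutMs elapses first or the waiting thread is interrupted.
     */
    static boolean waitUntilOnline(Supplier<Set<String>> onlineAddresses,
                                   String address,
                                   long timeoutMs,
                                   long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (onlineAddresses.get().contains(address)) {
                return true;
            }
            long remainingNs = deadline - System.nanoTime();
            if (remainingNs <= 0) {
                return false;
            }
            try {
                // Sleep for the poll interval, but never past the deadline.
                Thread.sleep(Math.min(pollMs, Math.max(1L, remainingNs / 1_000_000L)));
            } catch (InterruptedException e) {
                // Preserve the interrupt for the caller and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

The point of the deadline is that the wait is bounded by wall-clock time rather than by attempt count, so three fast failures in a row cannot exhaust it while the RS is still down. It only helps if nothing stops the server again in the meantime, which is why stopping ChaosMonkey first still seems like the more reliable step.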

> IntegrationTestRSGroup#testRegionMove may fail due to region server not online
> ------------------------------------------------------------------------------
>
>                 Key: HBASE-17306
>                 URL: https://issues.apache.org/jira/browse/HBASE-17306
>             Project: HBase
>          Issue Type: Test
>            Reporter: Ted Yu
>            Priority: Minor
>         Attachments: 17306.v1.txt
>
>
> {code}
> 2016-12-13 05:26:57,965|INFO|MainThread|machine.py:145 - run()|2) testRegionMove(org.apache.hadoop.hbase.rsgroup.IntegrationTestRSGroup)
> 2016-12-13 05:26:57,965|INFO|MainThread|machine.py:145 - run()|org.apache.hadoop.hbase.constraint.ConstraintException: org.apache.hadoop.hbase.constraint.ConstraintException: Server ctr-e77-1481596162056-0240-01-000005.a.com:16020 is not an online server in default group.
> 2016-12-13 05:26:57,966|INFO|MainThread|machine.py:145 - run()|at org.apache.hadoop.hbase.rsgroup.RSGroupAdminServer.moveServers(RSGroupAdminServer.java:135)
> 2016-12-13 05:26:57,966|INFO|MainThread|machine.py:145 - run()|at org.apache.hadoop.hbase.rsgroup.RSGroupAdminEndpoint.moveServers(RSGroupAdminEndpoint.java:169)
> 2016-12-13 05:26:57,966|INFO|MainThread|machine.py:145 - run()|at org.apache.hadoop.hbase.protobuf.generated.RSGroupAdminProtos$RSGroupAdminService.callMethod(RSGroupAdminProtos.java:11136)
> 2016-12-13 05:26:57,966|INFO|MainThread|machine.py:145 - run()|at org.apache.hadoop.hbase.master.MasterRpcServices.execMasterService(MasterRpcServices.java:679)
> 2016-12-13 05:26:57,966|INFO|MainThread|machine.py:145 - run()|at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$2
> {code}
> Shortly before the test failure, the server was shut down:
> {code}
> 2016-12-13 05:21:25,428 INFO  [MASTER_SERVER_OPERATIONS-ctr-e77-1481596162056-0240-01-000008:20000-4] handler.ServerShutdownHandler: Finished processing of shutdown of ctr-e77-1481596162056-0240-01-000005.a.com,16020,1481606309159
> ...
> 2016-12-13 05:26:57,935 INFO  [RpcServer.FifoWFPBQ.priority.handler=19,queue=1,port=20000] master.ServerManager: Registering server=ctr-e77-1481596162056-0240-01-000005.hwx.site,16020,1481606803303
> 2016-12-13 05:27:06,219 DEBUG [main-EventThread] zookeeper.RegionServerTracker: Added tracking of RS /hbase-secure/rs/ctr-e77-1481596162056-0240-01-000005.a.com,16020,1481606803303
> {code}
> The registration of the new server (start code 1481606803303) happened shortly after the test failure.


