Posted to issues@hbase.apache.org by "stack (JIRA)" <ji...@apache.org> on 2018/03/06 04:43:00 UTC
[jira] [Created] (HBASE-20137) TestRSGroups is flakey

stack created HBASE-20137:
-----------------------------

             Summary: TestRSGroups is flakey
                 Key: HBASE-20137
                 URL: https://issues.apache.org/jira/browse/HBASE-20137
             Project: HBase
          Issue Type: Bug
          Components: flakey
    Affects Versions: 2.0.0-beta-2
            Reporter: stack
            Assignee: stack


It was the single test that failed the hbase-2 nightlies in #440 at the hadoop2 stage.

The failure manifests as a timeout. The cause is interesting: it calls into question some of the clauses in UnassignProcedure#remoteCallFailed.

We are running a disabletable concurrently with a shutdown. pid=309 is the disable. pid=311 is the interesting one. The log below is a little hard to read -- the exception 'message' is the current procedure as a String... hard to parse, fixing -- but we are trying to unassign as part of the disabletable. Our RPC fails because the server we are trying to RPC to is currently being processed as crashed (pid=308 is a servercrashprocedure for this server). As part of processing the failed RPC we expire the server -- if we can't RPC to it, it must be gone. The current procedure is then suspended until it gets woken up by the servercrashprocedure triggered by the expire... only in this case we are shutting down, so the expire is ignored. The current procedure is left in its suspended state. This prevents the Master from going down. So we time out.

2018-03-05 11:29:22,507 INFO  [PEWorker-13] assignment.RegionTransitionProcedure(213): Dispatch pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524; rit=CLOSING, location=1cfd208ff882,40584,1520249102524
2018-03-05 11:29:22,508 WARN  [PEWorker-13] assignment.RegionTransitionProcedure(187): Remote call failed pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524; rit=CLOSING, location=1cfd208ff882,40584,1520249102524; exception=pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524 to 1cfd208ff882,40584,1520249102524
2018-03-05 11:29:22,508 WARN  [PEWorker-13] assignment.UnassignProcedure(276): Expiring server pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524; rit=CLOSING, location=1cfd208ff882,40584,1520249102524, exception=org.apache.hadoop.hbase.master.assignment.FailedRemoteDispatchException: pid=311, ppid=309, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=Group_ns:testKillRS, region=de7534c208a06502537cd95c248b3043, server=1cfd208ff882,40584,1520249102524 to 1cfd208ff882,40584,1520249102524
2018-03-05 11:29:22,508 WARN  [PEWorker-13] master.ServerManager(580): Expiration of 1cfd208ff882,40584,1520249102524 but server shutdown already in progress

I need to cater for the case where the server expire is rejected.
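
A rough sketch of the shape of the problem, as a toy model only. None of the names below (ToyServerManager, onRemoteCallFailed, Outcome) are HBase classes or methods; they are hypothetical stand-ins. The sketch assumes the expire call can report whether it was accepted, and shows the caller suspending only when it was. What the procedure should actually do when the expire is rejected is left open here.

```java
import java.util.HashSet;
import java.util.Set;

public class ExpireRejectedSketch {

    /** Hypothetical stand-in for the master's server manager; NOT the HBase class. */
    static class ToyServerManager {
        private final boolean clusterShutdown;
        private final Set<String> expired = new HashSet<>();

        ToyServerManager(boolean clusterShutdown) {
            this.clusterShutdown = clusterShutdown;
        }

        /**
         * Assumption for this sketch: the expire reports whether it was accepted.
         * Returns false when shutdown is in progress and the expire is ignored.
         */
        boolean expireServer(String serverName) {
            if (clusterShutdown) {
                // Expire ignored: no crash procedure will be scheduled for this call.
                return false;
            }
            expired.add(serverName);
            // Expire accepted: a crash procedure would later wake any waiter.
            return true;
        }
    }

    /** What the failed-RPC handler decides to do with the current procedure. */
    enum Outcome { SUSPEND_UNTIL_WOKEN, DO_NOT_SUSPEND }

    /** Hypothetical handler: suspend only if something will wake us up again. */
    static Outcome onRemoteCallFailed(ToyServerManager sm, String serverName) {
        boolean accepted = sm.expireServer(serverName);
        return accepted ? Outcome.SUSPEND_UNTIL_WOKEN : Outcome.DO_NOT_SUSPEND;
    }

    public static void main(String[] args) {
        String server = "1cfd208ff882,40584,1520249102524";
        // Normal case: expire accepted, so waiting for the wake-up is safe.
        System.out.println(onRemoteCallFailed(new ToyServerManager(false), server));
        // Shutdown case from the log above: expire rejected, so suspending would hang.
        System.out.println(onRemoteCallFailed(new ToyServerManager(true), server));
    }
}
```

The point of the sketch is only that the suspend decision depends on the result of the expire; without that check, the shutdown case leaves the procedure waiting for a wake-up that never comes.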




