Posted to issues@hbase.apache.org by "Ted Yu (JIRA)" <ji...@apache.org> on 2018/01/05 00:11:00 UTC
[jira] [Updated] (HBASE-19710) hbase:namespace table was stuck in transition

     [ https://issues.apache.org/jira/browse/HBASE-19710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Yu updated HBASE-19710:
---------------------------
    Attachment: master-006.tar.gz
                rs-009.log.tar.gz
                master-005-log.tar.gz

rs-009 is the log of the region server where the namespace table was last open.
master-006 is the log of the master that first saw the namespace table get stuck.
master-005 is the log of the master that became active next, with the namespace table still stuck.

> hbase:namespace table was stuck in transition
> ---------------------------------------------
>
>                 Key: HBASE-19710
>                 URL: https://issues.apache.org/jira/browse/HBASE-19710
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ted Yu
>            Priority: Critical
>         Attachments: master-005-log.tar.gz, master-006.tar.gz, rs-009.log.tar.gz
>
>
> ITBLL with chaos monkey failed because the namespace table got stuck in transition.
> From hbase-hbase-master-ctr-e137-1514896590304-3629-01-000006.hwx.site.log, we can see that the master closed the namespace table on 000009:
> {code}
> 2018-01-04 17:24:35,067 DEBUG [main-EventThread] zookeeper.ZKWatcher: master:20000-0x160c222710c0028, quorum=ctr-e137-1514896590304-3629-01-000011.hwx.site:2181,ctr-e137-1514896590304-3629-01-000014.hwx.site:2181,ctr-e137-1514896590304-3629-01-000009.hwx.site:2181,ctr-e137-1514896590304-3629-01-000006.hwx.site:2181,ctr-e137-1514896590304-3629-01-000003.hwx.site:2181,ctr-e137-1514896590304-3629-01-000007.hwx.site:2181,ctr-e137-1514896590304-3629-01-000013.hwx.site:2181,ctr-e137-1514896590304-3629-01-000002.hwx.site:2181,ctr-e137-1514896590304-3629-01-000012.hwx.site:2181,ctr-e137-1514896590304-3629-01-000008.hwx.site:2181,ctr-e137-1514896590304-3629-01-000010.hwx.site:2181, baseZNode=/hbase-unsecure Received ZooKeeper Event, type=NodeChildrenChanged, state=SyncConnected, path=/hbase-unsecure/rs
> 2018-01-04 17:24:35,067 INFO  [ProcExecWrkr-5] assignment.RegionStateStore: pid=643 updating hbase:meta row=hbase:namespace,,1515085217343.a95ed2d7434a43390fbec73abeeb9fd9., regionState=CLOSING, regionLocation=ctr-e137-1514896590304-3629-01-000009.hwx.site,16020,1515086643872
> ...
> 2018-01-04 17:24:35,246 INFO  [ProcExecWrkr-12] procedure.MasterProcedureScheduler: pid=647, ppid=642, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=hbase:namespace, region=a95ed2d7434a43390fbec73abeeb9fd9 hbase:namespace hbase:namespace,,1515085217343.a95ed2d7434a43390fbec73abeeb9fd9.
> 2018-01-04 17:25:17,041 DEBUG [ctr-e137-1514896590304-3629-01-000006:20000.masterManager] procedure2.ProcedureExecutor: Loading pid=641, state=WAITING:MOVE_REGION_ASSIGN; MoveRegionProcedure hri=hbase:namespace,,1515085217343.a95ed2d7434a43390fbec73abeeb9fd9., source=ctr-e137-1514896590304-3629-01-000009.hwx.site,16020,1515086643872, destination=
> {code}
> For the move operation, from ctr-e137-1514896590304-3629-01-000009.hwx.site log:
> {code}
> 2018-01-04 17:24:34,855 DEBUG [RS_CLOSE_REGION-ctr-e137-1514896590304-3629-01-000009:16020-0] coprocessor.CoprocessorHost: Stop coprocessor org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint
> 2018-01-04 17:24:34,855 INFO  [RS_CLOSE_REGION-ctr-e137-1514896590304-3629-01-000009:16020-0] regionserver.HRegion: Closed hbase:namespace,,1515085217343.a95ed2d7434a43390fbec73abeeb9fd9.
> 2018-01-04 17:24:34,856 DEBUG [RS_CLOSE_REGION-ctr-e137-1514896590304-3629-01-000009:16020-0] handler.CloseRegionHandler: Closed hbase:namespace,,1515085217343.a95ed2d7434a43390fbec73abeeb9fd9.
> ...
> 2018-01-04 17:25:47,607 DEBUG [RpcServer.priority.FPBQ.Fifo.handler=18,queue=0,port=16020] ipc.RpcServer: callId: 16 service: ClientService methodName: Get size: 103 connection: 172.27.13.80:36738 deadline: 1515086837568
> org.apache.hadoop.hbase.NotServingRegionException: hbase:namespace,,1515085217343.a95ed2d7434a43390fbec73abeeb9fd9. is not online on ctr-e137-1514896590304-3629-01-000009.hwx.site,16020,1515086719163
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3312)
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:3289)
>         at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1354)
>         at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2360)
>         at org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:41544)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:403)
> {code}
> We can see that the region server was not serving the region.
> After that, the masters kept believing the namespace table was on 000009, leading to prolonged downtime.


