You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hbase.apache.org by Alexander Batyrshin <0x...@gmail.com> on 2019/08/13 17:11:01 UTC

hbase:meta not online

 Hello,
We’ve got problem with hbase:meta that cause several servers to die.
Our version is HBase-1.4.8

prod006 - flush hbase:meta and move it: 

Aug 13 15:46:23 prod006 hbase[63435]: 2019-08-13 15:46:23,019 INFO  [prod006,60020,1564942121270_ChoreService_1] regionserver.HRegionServer: prod006,60020,1564942121270-MemstoreFlusherChore requesting flush of hbase:meta,,1.1588230740 because info has an old edit so flush to free WALs after random delay 226,666 ms
Aug 13 15:50:09 prod006 hbase[63435]: 2019-08-13 15:50:09,709 INFO  [MemStoreFlusher.1] regionserver.HRegion: Finished memstore flush of ~2.99 KB/3064, currentsize=0 B/0 for region hbase:meta,,1.1588230740 in 24ms, sequenceid=323883, compaction requested=false
Aug 13 15:58:10 prod006 hbase[63435]: 2019-08-13 15:58:10,943 INFO  [StoreCloserThread-hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0.-1] regionserver.HStore: Closed info
Aug 13 15:58:10 prod006 hbase[63435]: 2019-08-13 15:58:10,951 INFO  [StoreCloserThread-hbase:meta,,1.1588230740-1] regionserver.HStore: Closed info
Aug 13 15:58:10 prod006 hbase[63435]: 2019-08-13 15:58:10,976 INFO  [RS_CLOSE_META-prod006:60020-0] regionserver.HRegion: Closed hbase:meta,,1.1588230740
Aug 13 15:58:10 prod006 hbase[63435]: 2019-08-13 15:58:10,979 INFO  [RS_CLOSE_REGION-prod006:60020-0] regionserver.HRegion: Closed hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0.
Aug 13 15:58:10 prod006 hbase[63435]: 2019-08-13 15:58:10,979 INFO  [RS_CLOSE_REGION-prod006:60020-0] regionserver.HRegionServer: Adding moved region record: d3d9ad37f8506c02a84458121965c0b0 to prod023,60020,1565701089060 as of 147

At this time at master we have this:

Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,982 INFO  [AM.ZK.Worker-pool5-t1878] master.RegionStates: Transition {1588230740 state=PENDING_CLOSE, ts=1565701090850, server=prod006,60020,1564942121270} to {158823074
0 state=CLOSED, ts=1565701090982, server=prod006,60020,1564942121270}
Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,982 INFO  [AM.-pool3-t178] master.AssignmentManager: Setting node as OFFLINED in ZooKeeper for region {ENCODED => 1588230740, NAME => 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''}
Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,982 INFO  [AM.-pool3-t178] master.RegionStates: Transition {1588230740 state=CLOSED, ts=1565701090982, server=prod006,60020,1564942121270} to {1588230740 state=OFFLINE, ts=1565701090982, server=prod006,60020,1564942121270}
Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,984 INFO  [AM.ZK.Worker-pool5-t1879] master.RegionStates: Transition {d3d9ad37f8506c02a84458121965c0b0 state=PENDING_CLOSE, ts=1565701090871, server=prod006,60020,1564942121270} to {d3d9ad37f8506c02a84458121965c0b0 state=CLOSED, ts=1565701090984, server=prod006,60020,1564942121270}
Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,984 INFO  [AM.-pool3-t178] master.AssignmentManager: Assigning hbase:meta,,1.1588230740 to prod023,60020,1565701089060
Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,984 INFO  [AM.-pool3-t178] master.RegionStates: Transition {1588230740 state=OFFLINE, ts=1565701090982, server=prod006,60020,1564942121270} to {1588230740 state=PENDING_OPEN, ts=1565701090984, server=prod023,60020,1565701089060}
Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,985 INFO  [AM.-pool3-t179] master.AssignmentManager: Setting node as OFFLINED in ZooKeeper for region {ENCODED => d3d9ad37f8506c02a84458121965c0b0, NAME => 'hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0.', STARTKEY => '', ENDKEY => ''}
Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,985 INFO  [AM.-pool3-t179] master.RegionStates: Transition {d3d9ad37f8506c02a84458121965c0b0 state=CLOSED, ts=1565701090984, server=prod006,60020,1564942121270} to {d3d9ad37f8506c02a84458121965c0b0 state=OFFLINE, ts=1565701090985, server=prod006,60020,1564942121270}
Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,986 INFO  [AM.-pool3-t179] master.AssignmentManager: Assigning hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0. to prod023,60020,1565701089060
Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,986 INFO  [AM.-pool3-t179] master.RegionStates: Transition {d3d9ad37f8506c02a84458121965c0b0 state=OFFLINE, ts=1565701090985, server=prod006,60020,1564942121270} to {d3d9ad37f8506c02a84458121965c0b0 state=PENDING_OPEN, ts=1565701090986, server=prod023,60020,1565701089060}
Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,039 WARN  [AM.-pool3-t179] master.AssignmentManager: Failed assignment of hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0. to prod023,60020,1565701089060, waiting a little before trying on the same region server try=1 of 10
Aug 13 15:58:11 prod001 hbase[14953]: org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server prod023,60020,1565701089060 is not running yet
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:1320)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:1707)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:22737)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2380)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
Aug 13 15:58:11 prod001 hbase[14953]:         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
Aug 13 15:58:11 prod001 hbase[14953]:         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
Aug 13 15:58:11 prod001 hbase[14953]:         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
Aug 13 15:58:11 prod001 hbase[14953]:         at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.instantiateException(RemoteWithExtrasException.java:95)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.unwrapRemoteException(RemoteWithExtrasException.java:85)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.protobuf.ProtobufUtil.makeIOExceptionOfException(ProtobufUtil.java:368)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:330)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:787)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:2248)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1694)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.master.AssignCallable.call(AssignCallable.java:48)
Aug 13 15:58:11 prod001 hbase[14953]:         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
Aug 13 15:58:11 prod001 hbase[14953]:         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
Aug 13 15:58:11 prod001 hbase[14953]:         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Aug 13 15:58:11 prod001 hbase[14953]:         at java.lang.Thread.run(Thread.java:748)
Aug 13 15:58:11 prod001 hbase[14953]: Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.ipc.ServerNotRunningYetException): org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server prod023,60020,1565701089060 is not running yet
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:1320)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:1707)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:22737)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2380)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:386)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:94)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:409)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:405)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:103)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:118)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.BlockingRpcConnection.readResponse(BlockingRpcConnection.java:600)
Aug 13 15:58:11 prod001 hbase[14953]:         at org.apache.hadoop.hbase.ipc.BlockingRpcConnection.run(BlockingRpcConnection.java:334)
Aug 13 15:58:11 prod001 hbase[14953]:         ... 1 more
Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,217 INFO  [AM.ZK.Worker-pool5-t1882] master.RegionStates: Transition {1588230740 state=PENDING_OPEN, ts=1565701090984, server=prod023,60020,1565701089060} to {1588230740 state=OPENING, ts=1565701091217, server=prod023,60020,1565701089060}
Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,244 INFO  [AM.-pool3-t179] master.AssignmentManager: Assigning hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0. to prod023,60020,1565701089060
Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,391 INFO  [AM.ZK.Worker-pool5-t1883] master.RegionStates: Transition {d3d9ad37f8506c02a84458121965c0b0 state=PENDING_OPEN, ts=1565701091244, server=prod023,60020,1565701089060} to {d3d9ad37f8506c02a84458121965c0b0 state=OPENING, ts=1565701091391, server=prod023,60020,1565701089060}
Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,666 INFO  [AM.ZK.Worker-pool5-t1884] master.RegionStates: Transition {1588230740 state=OPENING, ts=1565701091217, server=prod023,60020,1565701089060} to {1588230740 state=OPEN, ts=1565701091666, server=prod023,60020,1565701089060}
Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,666 INFO  [AM.ZK.Worker-pool5-t1884] coordination.ZkOpenRegionCoordination: Handling OPENED of 1588230740 from prod001,60000,1565179968855; deleting unassigned node
Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,668 INFO  [AM.ZK.Worker-pool5-t1886] master.RegionStates: Offlined 1588230740 from prod006,60020,1564942121270
Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,805 INFO  [AM.ZK.Worker-pool5-t1887] master.RegionStates: Transition {d3d9ad37f8506c02a84458121965c0b0 state=OPENING, ts=1565701091391, server=prod023,60020,1565701089060} to {d3d9ad37f8506c02a84458121965c0b0 state=OPEN, ts=1565701091805, server=prod023,60020,1565701089060}
Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,807 INFO  [AM.ZK.Worker-pool5-t1889] master.RegionStates: Offlined d3d9ad37f8506c02a84458121965c0b0 from prod006,60020,1564942121270


From this moment hbase:meta and hbase:namespace is moved to prod023 from prod006

But some other servers still try to get hbase:meta from prod006:

Aug 13 16:02:24 prod021 hbase[103657]: 2019-08-13 16:02:24,832 INFO  [regionserver/prod021/10.73.65.28:60020-splits-1564981231292] client.RpcRetryingCaller: Call exception, tries=10, retries=350, started=38462 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.NotServingRegionException: Region hbase:meta,,1 is not online on prod006,60020,1564942121270
Aug 13 16:02:24 prod021 hbase[103657]: #011at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3086)
Aug 13 16:02:24 prod021 hbase[103657]: #011at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1271)
Aug 13 16:02:24 prod021 hbase[103657]: #011at org.apache.hadoop.hbase.regionserver.RSRpcServices.execService(RSRpcServices.java:2197)
Aug 13 16:02:24 prod021 hbase[103657]: #011at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36617)
Aug 13 16:02:24 prod021 hbase[103657]: #011at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2369)
Aug 13 16:02:24 prod021 hbase[103657]: #011at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
Aug 13 16:02:24 prod021 hbase[103657]: #011at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
Aug 13 16:02:24 prod021 hbase[103657]: #011at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
Aug 13 16:02:24 prod021 hbase[103657]:  row ‘IDX,\x1Abb6f7662-275e-459d-b0f3-7fd6aac9b6a4\x0000000046218537kV0WbJ_\x00\x13\x80\x00\x01k\x90\xFEzH\x00\x00\x00\x00,1562868892011.a215bcc7b396e5d08563f0a4183ba5cc.44' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=prod023,60020,1565701089060, seqNum=0

….

Aug 13 16:21:41 prod021 hbase[103657]: 2019-08-13 16:21:41,136 INFO  [regionserver/prod021/10.73.65.28:60020-splits-1564981231292] client.RpcRetryingCaller: Call exception, tries=68, retries=350, started=1194772 ms ago, cancell
ed=false, msg=org.apache.hadoop.hbase.NotServingRegionException: Region hbase:meta,,1 is not online on prod006,60020,1564942121270
Aug 13 16:21:41 prod021 hbase[103657]: 2019-08-13 16:21:41,146 INFO  [regionserver/prod021/10.73.65.28:60020-splits-1564981231292] regionserver.SplitRequest: Running rollback/cleanup of failed split of IDX,\x1Abb6f7662-275e-459d-b0f3-7fd6aac9b6a4\x0000000046218537kV0WbJ_\x00\x13\x80\x00\x01k\x90\xFEzH\x00\x00\x00\x00,1562868892011.a215bcc7b396e5d08563f0a4183ba5cc.; callTimeout=1200000, callDuration=1214938: org.apache.hadoop.hbase.NotServingRegionException: Region hbase:meta,,1 is not online on prod006,60020,1564942121270
Aug 13 16:21:41 prod021 hbase[103657]: 2019-08-13 16:21:41,193 FATAL [regionserver/prod021/10.73.65.28:60020-splits-1564981231292] regionserver.HRegionServer: ABORTING region server prod021,60020,1564943354635: Abort; we got an error after point-of-no-return

The same picture on other crashed servers


Any ideas why this happened and how to prevent this in future?

Re: hbase:meta not online

Posted by Alexander Batyrshin <0x...@gmail.com>.
Great thanks for fast and valuable response

> On 14 Aug 2019, at 06:23, OpenInx <op...@gmail.com> wrote:
> 
> I think upgrading  to HBase1.4.10 can fix your problem here.
> Please see: https://issues.apache.org/jira/browse/HBASE-21464
> 
> Thanks.
> 
> On Wed, Aug 14, 2019 at 10:22 AM OpenInx <op...@gmail.com> wrote:
> 
>> Hi Alexander
>> Thanks for the log report & share.  I guess we did not handle
>> the NotServingRegionException for hbase:meta correctly, says when a client
>> encounter a NotServingRegionException from hbase:meta, it won't try to
>> relocate the hbase:meta location. (in 1.4.8)
>> Let me have a check.
>> 
>> On Wed, Aug 14, 2019 at 1:11 AM Alexander Batyrshin <0x...@gmail.com>
>> wrote:
>> 
>>> Hello,
>>> We’ve got problem with hbase:meta that cause several servers to die.
>>> Our version is HBase-1.4.8
>>> 
>>> prod006 - flush hbase:meta and move it:
>>> 
>>> Aug 13 15:46:23 prod006 hbase[63435]: 2019-08-13 15:46:23,019 INFO
>>> [prod006,60020,1564942121270_ChoreService_1] regionserver.HRegionServer:
>>> prod006,60020,1564942121270-MemstoreFlusherChore requesting flush of
>>> hbase:meta,,1.1588230740 because info has an old edit so flush to free WALs
>>> after random delay 226,666 ms
>>> Aug 13 15:50:09 prod006 hbase[63435]: 2019-08-13 15:50:09,709 INFO
>>> [MemStoreFlusher.1] regionserver.HRegion: Finished memstore flush of ~2.99
>>> KB/3064, currentsize=0 B/0 for region hbase:meta,,1.1588230740 in 24ms,
>>> sequenceid=323883, compaction requested=false
>>> Aug 13 15:58:10 prod006 hbase[63435]: 2019-08-13 15:58:10,943 INFO
>>> [StoreCloserThread-hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0.-1]
>>> regionserver.HStore: Closed info
>>> Aug 13 15:58:10 prod006 hbase[63435]: 2019-08-13 15:58:10,951 INFO
>>> [StoreCloserThread-hbase:meta,,1.1588230740-1] regionserver.HStore: Closed
>>> info
>>> Aug 13 15:58:10 prod006 hbase[63435]: 2019-08-13 15:58:10,976 INFO
>>> [RS_CLOSE_META-prod006:60020-0] regionserver.HRegion: Closed
>>> hbase:meta,,1.1588230740
>>> Aug 13 15:58:10 prod006 hbase[63435]: 2019-08-13 15:58:10,979 INFO
>>> [RS_CLOSE_REGION-prod006:60020-0] regionserver.HRegion: Closed
>>> hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0.
>>> Aug 13 15:58:10 prod006 hbase[63435]: 2019-08-13 15:58:10,979 INFO
>>> [RS_CLOSE_REGION-prod006:60020-0] regionserver.HRegionServer: Adding moved
>>> region record: d3d9ad37f8506c02a84458121965c0b0 to
>>> prod023,60020,1565701089060 as of 147
>>> 
>>> At this time at master we have this:
>>> 
>>> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,982 INFO
>>> [AM.ZK.Worker-pool5-t1878] master.RegionStates: Transition {1588230740
>>> state=PENDING_CLOSE, ts=1565701090850, server=prod006,60020,1564942121270}
>>> to {158823074
>>> 0 state=CLOSED, ts=1565701090982, server=prod006,60020,1564942121270}
>>> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,982 INFO
>>> [AM.-pool3-t178] master.AssignmentManager: Setting node as OFFLINED in
>>> ZooKeeper for region {ENCODED => 1588230740, NAME => 'hbase:meta,,1',
>>> STARTKEY => '', ENDKEY => ''}
>>> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,982 INFO
>>> [AM.-pool3-t178] master.RegionStates: Transition {1588230740 state=CLOSED,
>>> ts=1565701090982, server=prod006,60020,1564942121270} to {1588230740
>>> state=OFFLINE, ts=1565701090982, server=prod006,60020,1564942121270}
>>> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,984 INFO
>>> [AM.ZK.Worker-pool5-t1879] master.RegionStates: Transition
>>> {d3d9ad37f8506c02a84458121965c0b0 state=PENDING_CLOSE, ts=1565701090871,
>>> server=prod006,60020,1564942121270} to {d3d9ad37f8506c02a84458121965c0b0
>>> state=CLOSED, ts=1565701090984, server=prod006,60020,1564942121270}
>>> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,984 INFO
>>> [AM.-pool3-t178] master.AssignmentManager: Assigning
>>> hbase:meta,,1.1588230740 to prod023,60020,1565701089060
>>> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,984 INFO
>>> [AM.-pool3-t178] master.RegionStates: Transition {1588230740 state=OFFLINE,
>>> ts=1565701090982, server=prod006,60020,1564942121270} to {1588230740
>>> state=PENDING_OPEN, ts=1565701090984, server=prod023,60020,1565701089060}
>>> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,985 INFO
>>> [AM.-pool3-t179] master.AssignmentManager: Setting node as OFFLINED in
>>> ZooKeeper for region {ENCODED => d3d9ad37f8506c02a84458121965c0b0, NAME =>
>>> 'hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0.',
>>> STARTKEY => '', ENDKEY => ''}
>>> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,985 INFO
>>> [AM.-pool3-t179] master.RegionStates: Transition
>>> {d3d9ad37f8506c02a84458121965c0b0 state=CLOSED, ts=1565701090984,
>>> server=prod006,60020,1564942121270} to {d3d9ad37f8506c02a84458121965c0b0
>>> state=OFFLINE, ts=1565701090985, server=prod006,60020,1564942121270}
>>> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,986 INFO
>>> [AM.-pool3-t179] master.AssignmentManager: Assigning
>>> hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0. to
>>> prod023,60020,1565701089060
>>> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,986 INFO
>>> [AM.-pool3-t179] master.RegionStates: Transition
>>> {d3d9ad37f8506c02a84458121965c0b0 state=OFFLINE, ts=1565701090985,
>>> server=prod006,60020,1564942121270} to {d3d9ad37f8506c02a84458121965c0b0
>>> state=PENDING_OPEN, ts=1565701090986, server=prod023,60020,1565701089060}
>>> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,039 WARN
>>> [AM.-pool3-t179] master.AssignmentManager: Failed assignment of
>>> hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0. to
>>> prod023,60020,1565701089060, waiting a little before trying on the same
>>> region server try=1 of 10
>>> Aug 13 15:58:11 prod001 hbase[14953]:
>>> org.apache.hadoop.hbase.ipc.ServerNotRunningYetException:
>>> org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server
>>> prod023,60020,1565701089060 is not running yet
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:1320)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:1707)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:22737)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2380)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.instantiateException(RemoteWithExtrasException.java:95)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.unwrapRemoteException(RemoteWithExtrasException.java:85)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> org.apache.hadoop.hbase.protobuf.ProtobufUtil.makeIOExceptionOfException(ProtobufUtil.java:368)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:330)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:787)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:2248)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1694)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> org.apache.hadoop.hbase.master.AssignCallable.call(AssignCallable.java:48)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> java.lang.Thread.run(Thread.java:748)
>>> Aug 13 15:58:11 prod001 hbase[14953]: Caused by:
>>> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.ipc.ServerNotRunningYetException):
>>> org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server
>>> prod023,60020,1565701089060 is not running yet
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:1320)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:1707)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:22737)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2380)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:386)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:94)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:409)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:405)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:103)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> org.apache.hadoop.hbase.ipc.Call.setException(Call.java:118)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> org.apache.hadoop.hbase.ipc.BlockingRpcConnection.readResponse(BlockingRpcConnection.java:600)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>>> org.apache.hadoop.hbase.ipc.BlockingRpcConnection.run(BlockingRpcConnection.java:334)
>>> Aug 13 15:58:11 prod001 hbase[14953]:         ... 1 more
>>> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,217 INFO
>>> [AM.ZK.Worker-pool5-t1882] master.RegionStates: Transition {1588230740
>>> state=PENDING_OPEN, ts=1565701090984, server=prod023,60020,1565701089060}
>>> to {1588230740 state=OPENING, ts=1565701091217,
>>> server=prod023,60020,1565701089060}
>>> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,244 INFO
>>> [AM.-pool3-t179] master.AssignmentManager: Assigning
>>> hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0. to
>>> prod023,60020,1565701089060
>>> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,391 INFO
>>> [AM.ZK.Worker-pool5-t1883] master.RegionStates: Transition
>>> {d3d9ad37f8506c02a84458121965c0b0 state=PENDING_OPEN, ts=1565701091244,
>>> server=prod023,60020,1565701089060} to {d3d9ad37f8506c02a84458121965c0b0
>>> state=OPENING, ts=1565701091391, server=prod023,60020,1565701089060}
>>> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,666 INFO
>>> [AM.ZK.Worker-pool5-t1884] master.RegionStates: Transition {1588230740
>>> state=OPENING, ts=1565701091217, server=prod023,60020,1565701089060} to
>>> {1588230740 state=OPEN, ts=1565701091666,
>>> server=prod023,60020,1565701089060}
>>> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,666 INFO
>>> [AM.ZK.Worker-pool5-t1884] coordination.ZkOpenRegionCoordination: Handling
>>> OPENED of 1588230740 from prod001,60000,1565179968855; deleting unassigned
>>> node
>>> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,668 INFO
>>> [AM.ZK.Worker-pool5-t1886] master.RegionStates: Offlined 1588230740 from
>>> prod006,60020,1564942121270
>>> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,805 INFO
>>> [AM.ZK.Worker-pool5-t1887] master.RegionStates: Transition
>>> {d3d9ad37f8506c02a84458121965c0b0 state=OPENING, ts=1565701091391,
>>> server=prod023,60020,1565701089060} to {d3d9ad37f8506c02a84458121965c0b0
>>> state=OPEN, ts=1565701091805, server=prod023,60020,1565701089060}
>>> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,807 INFO
>>> [AM.ZK.Worker-pool5-t1889] master.RegionStates: Offlined
>>> d3d9ad37f8506c02a84458121965c0b0 from prod006,60020,1564942121270
>>> 
>>> 
>>> From this moment hbase:meta and hbase:namespace is moved to prod023 from
>>> prod006
>>> 
>>> But some other servers still try to get hbase:meta from prod006:
>>> 
>>> Aug 13 16:02:24 prod021 hbase[103657]: 2019-08-13 16:02:24,832 INFO
>>> [regionserver/prod021/10.73.65.28:60020-splits-1564981231292]
>>> client.RpcRetryingCaller: Call exception, tries=10, retries=350,
>>> started=38462 ms ago, cancelled=false,
>>> msg=org.apache.hadoop.hbase.NotServingRegionException: Region hbase:meta,,1
>>> is not online on prod006,60020,1564942121270
>>> Aug 13 16:02:24 prod021 hbase[103657]: #011at
>>> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3086)
>>> Aug 13 16:02:24 prod021 hbase[103657]: #011at
>>> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1271)
>>> Aug 13 16:02:24 prod021 hbase[103657]: #011at
>>> org.apache.hadoop.hbase.regionserver.RSRpcServices.execService(RSRpcServices.java:2197)
>>> Aug 13 16:02:24 prod021 hbase[103657]: #011at
>>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36617)
>>> Aug 13 16:02:24 prod021 hbase[103657]: #011at
>>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2369)
>>> Aug 13 16:02:24 prod021 hbase[103657]: #011at
>>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>>> Aug 13 16:02:24 prod021 hbase[103657]: #011at
>>> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>>> Aug 13 16:02:24 prod021 hbase[103657]: #011at
>>> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>>> Aug 13 16:02:24 prod021 hbase[103657]:  row
>>> ‘IDX,\x1Abb6f7662-275e-459d-b0f3-7fd6aac9b6a4\x0000000046218537kV0WbJ_\x00\x13\x80\x00\x01k\x90\xFEzH\x00\x00\x00\x00,1562868892011.a215bcc7b396e5d08563f0a4183ba5cc.44'
>>> on table 'hbase:meta' at region=hbase:meta,,1.1588230740,
>>> hostname=prod023,60020,1565701089060, seqNum=0
>>> 
>>> ….
>>> 
>>> Aug 13 16:21:41 prod021 hbase[103657]: 2019-08-13 16:21:41,136 INFO
>>> [regionserver/prod021/10.73.65.28:60020-splits-1564981231292]
>>> client.RpcRetryingCaller: Call exception, tries=68, retries=350,
>>> started=1194772 ms ago, cancell
>>> ed=false, msg=org.apache.hadoop.hbase.NotServingRegionException: Region
>>> hbase:meta,,1 is not online on prod006,60020,1564942121270
>>> Aug 13 16:21:41 prod021 hbase[103657]: 2019-08-13 16:21:41,146 INFO
>>> [regionserver/prod021/10.73.65.28:60020-splits-1564981231292]
>>> regionserver.SplitRequest: Running rollback/cleanup of failed split of
>>> IDX,\x1Abb6f7662-275e-459d-b0f3-7fd6aac9b6a4\x0000000046218537kV0WbJ_\x00\x13\x80\x00\x01k\x90\xFEzH\x00\x00\x00\x00,1562868892011.a215bcc7b396e5d08563f0a4183ba5cc.;
>>> callTimeout=1200000, callDuration=1214938:
>>> org.apache.hadoop.hbase.NotServingRegionException: Region hbase:meta,,1 is
>>> not online on prod006,60020,1564942121270
>>> Aug 13 16:21:41 prod021 hbase[103657]: 2019-08-13 16:21:41,193 FATAL
>>> [regionserver/prod021/10.73.65.28:60020-splits-1564981231292]
>>> regionserver.HRegionServer: ABORTING region server
>>> prod021,60020,1564943354635: Abort; we got an error after point-of-no-return
>>> 
>>> The same picture on other crashed servers
>>> 
>>> 
>>> Any ideas why this happened and how to prevent this in future?
>> 
>> 


Re: hbase:meta not online

Posted by OpenInx <op...@gmail.com>.
I think upgrading  to HBase1.4.10 can fix your problem here.
Please see: https://issues.apache.org/jira/browse/HBASE-21464

Thanks.

On Wed, Aug 14, 2019 at 10:22 AM OpenInx <op...@gmail.com> wrote:

> Hi Alexander
> Thanks for the log report & share.  I guess we did not handle
> the NotServingRegionException for hbase:meta correctly, says when a client
> encounter a NotServingRegionException from hbase:meta, it won't try to
> relocate the hbase:meta location. (in 1.4.8)
> Let me have a check.
>
> On Wed, Aug 14, 2019 at 1:11 AM Alexander Batyrshin <0x...@gmail.com>
> wrote:
>
>>  Hello,
>> We’ve got problem with hbase:meta that cause several servers to die.
>> Our version is HBase-1.4.8
>>
>> prod006 - flush hbase:meta and move it:
>>
>> Aug 13 15:46:23 prod006 hbase[63435]: 2019-08-13 15:46:23,019 INFO
>> [prod006,60020,1564942121270_ChoreService_1] regionserver.HRegionServer:
>> prod006,60020,1564942121270-MemstoreFlusherChore requesting flush of
>> hbase:meta,,1.1588230740 because info has an old edit so flush to free WALs
>> after random delay 226,666 ms
>> Aug 13 15:50:09 prod006 hbase[63435]: 2019-08-13 15:50:09,709 INFO
>> [MemStoreFlusher.1] regionserver.HRegion: Finished memstore flush of ~2.99
>> KB/3064, currentsize=0 B/0 for region hbase:meta,,1.1588230740 in 24ms,
>> sequenceid=323883, compaction requested=false
>> Aug 13 15:58:10 prod006 hbase[63435]: 2019-08-13 15:58:10,943 INFO
>> [StoreCloserThread-hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0.-1]
>> regionserver.HStore: Closed info
>> Aug 13 15:58:10 prod006 hbase[63435]: 2019-08-13 15:58:10,951 INFO
>> [StoreCloserThread-hbase:meta,,1.1588230740-1] regionserver.HStore: Closed
>> info
>> Aug 13 15:58:10 prod006 hbase[63435]: 2019-08-13 15:58:10,976 INFO
>> [RS_CLOSE_META-prod006:60020-0] regionserver.HRegion: Closed
>> hbase:meta,,1.1588230740
>> Aug 13 15:58:10 prod006 hbase[63435]: 2019-08-13 15:58:10,979 INFO
>> [RS_CLOSE_REGION-prod006:60020-0] regionserver.HRegion: Closed
>> hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0.
>> Aug 13 15:58:10 prod006 hbase[63435]: 2019-08-13 15:58:10,979 INFO
>> [RS_CLOSE_REGION-prod006:60020-0] regionserver.HRegionServer: Adding moved
>> region record: d3d9ad37f8506c02a84458121965c0b0 to
>> prod023,60020,1565701089060 as of 147
>>
>> At this time at master we have this:
>>
>> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,982 INFO
>> [AM.ZK.Worker-pool5-t1878] master.RegionStates: Transition {1588230740
>> state=PENDING_CLOSE, ts=1565701090850, server=prod006,60020,1564942121270}
>> to {158823074
>> 0 state=CLOSED, ts=1565701090982, server=prod006,60020,1564942121270}
>> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,982 INFO
>> [AM.-pool3-t178] master.AssignmentManager: Setting node as OFFLINED in
>> ZooKeeper for region {ENCODED => 1588230740, NAME => 'hbase:meta,,1',
>> STARTKEY => '', ENDKEY => ''}
>> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,982 INFO
>> [AM.-pool3-t178] master.RegionStates: Transition {1588230740 state=CLOSED,
>> ts=1565701090982, server=prod006,60020,1564942121270} to {1588230740
>> state=OFFLINE, ts=1565701090982, server=prod006,60020,1564942121270}
>> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,984 INFO
>> [AM.ZK.Worker-pool5-t1879] master.RegionStates: Transition
>> {d3d9ad37f8506c02a84458121965c0b0 state=PENDING_CLOSE, ts=1565701090871,
>> server=prod006,60020,1564942121270} to {d3d9ad37f8506c02a84458121965c0b0
>> state=CLOSED, ts=1565701090984, server=prod006,60020,1564942121270}
>> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,984 INFO
>> [AM.-pool3-t178] master.AssignmentManager: Assigning
>> hbase:meta,,1.1588230740 to prod023,60020,1565701089060
>> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,984 INFO
>> [AM.-pool3-t178] master.RegionStates: Transition {1588230740 state=OFFLINE,
>> ts=1565701090982, server=prod006,60020,1564942121270} to {1588230740
>> state=PENDING_OPEN, ts=1565701090984, server=prod023,60020,1565701089060}
>> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,985 INFO
>> [AM.-pool3-t179] master.AssignmentManager: Setting node as OFFLINED in
>> ZooKeeper for region {ENCODED => d3d9ad37f8506c02a84458121965c0b0, NAME =>
>> 'hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0.',
>> STARTKEY => '', ENDKEY => ''}
>> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,985 INFO
>> [AM.-pool3-t179] master.RegionStates: Transition
>> {d3d9ad37f8506c02a84458121965c0b0 state=CLOSED, ts=1565701090984,
>> server=prod006,60020,1564942121270} to {d3d9ad37f8506c02a84458121965c0b0
>> state=OFFLINE, ts=1565701090985, server=prod006,60020,1564942121270}
>> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,986 INFO
>> [AM.-pool3-t179] master.AssignmentManager: Assigning
>> hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0. to
>> prod023,60020,1565701089060
>> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,986 INFO
>> [AM.-pool3-t179] master.RegionStates: Transition
>> {d3d9ad37f8506c02a84458121965c0b0 state=OFFLINE, ts=1565701090985,
>> server=prod006,60020,1564942121270} to {d3d9ad37f8506c02a84458121965c0b0
>> state=PENDING_OPEN, ts=1565701090986, server=prod023,60020,1565701089060}
>> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,039 WARN
>> [AM.-pool3-t179] master.AssignmentManager: Failed assignment of
>> hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0. to
>> prod023,60020,1565701089060, waiting a little before trying on the same
>> region server try=1 of 10
>> Aug 13 15:58:11 prod001 hbase[14953]:
>> org.apache.hadoop.hbase.ipc.ServerNotRunningYetException:
>> org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server
>> prod023,60020,1565701089060 is not running yet
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:1320)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:1707)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:22737)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2380)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.instantiateException(RemoteWithExtrasException.java:95)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.unwrapRemoteException(RemoteWithExtrasException.java:85)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> org.apache.hadoop.hbase.protobuf.ProtobufUtil.makeIOExceptionOfException(ProtobufUtil.java:368)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:330)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:787)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:2248)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1694)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> org.apache.hadoop.hbase.master.AssignCallable.call(AssignCallable.java:48)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> java.lang.Thread.run(Thread.java:748)
>> Aug 13 15:58:11 prod001 hbase[14953]: Caused by:
>> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.ipc.ServerNotRunningYetException):
>> org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server
>> prod023,60020,1565701089060 is not running yet
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:1320)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:1707)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:22737)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2380)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:386)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:94)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:409)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:405)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:103)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> org.apache.hadoop.hbase.ipc.Call.setException(Call.java:118)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> org.apache.hadoop.hbase.ipc.BlockingRpcConnection.readResponse(BlockingRpcConnection.java:600)
>> Aug 13 15:58:11 prod001 hbase[14953]:         at
>> org.apache.hadoop.hbase.ipc.BlockingRpcConnection.run(BlockingRpcConnection.java:334)
>> Aug 13 15:58:11 prod001 hbase[14953]:         ... 1 more
>> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,217 INFO
>> [AM.ZK.Worker-pool5-t1882] master.RegionStates: Transition {1588230740
>> state=PENDING_OPEN, ts=1565701090984, server=prod023,60020,1565701089060}
>> to {1588230740 state=OPENING, ts=1565701091217,
>> server=prod023,60020,1565701089060}
>> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,244 INFO
>> [AM.-pool3-t179] master.AssignmentManager: Assigning
>> hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0. to
>> prod023,60020,1565701089060
>> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,391 INFO
>> [AM.ZK.Worker-pool5-t1883] master.RegionStates: Transition
>> {d3d9ad37f8506c02a84458121965c0b0 state=PENDING_OPEN, ts=1565701091244,
>> server=prod023,60020,1565701089060} to {d3d9ad37f8506c02a84458121965c0b0
>> state=OPENING, ts=1565701091391, server=prod023,60020,1565701089060}
>> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,666 INFO
>> [AM.ZK.Worker-pool5-t1884] master.RegionStates: Transition {1588230740
>> state=OPENING, ts=1565701091217, server=prod023,60020,1565701089060} to
>> {1588230740 state=OPEN, ts=1565701091666,
>> server=prod023,60020,1565701089060}
>> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,666 INFO
>> [AM.ZK.Worker-pool5-t1884] coordination.ZkOpenRegionCoordination: Handling
>> OPENED of 1588230740 from prod001,60000,1565179968855; deleting unassigned
>> node
>> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,668 INFO
>> [AM.ZK.Worker-pool5-t1886] master.RegionStates: Offlined 1588230740 from
>> prod006,60020,1564942121270
>> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,805 INFO
>> [AM.ZK.Worker-pool5-t1887] master.RegionStates: Transition
>> {d3d9ad37f8506c02a84458121965c0b0 state=OPENING, ts=1565701091391,
>> server=prod023,60020,1565701089060} to {d3d9ad37f8506c02a84458121965c0b0
>> state=OPEN, ts=1565701091805, server=prod023,60020,1565701089060}
>> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,807 INFO
>> [AM.ZK.Worker-pool5-t1889] master.RegionStates: Offlined
>> d3d9ad37f8506c02a84458121965c0b0 from prod006,60020,1564942121270
>>
>>
>> From this moment hbase:meta and hbase:namespace is moved to prod023 from
>> prod006
>>
>> But some other servers still try to get hbase:meta from prod006:
>>
>> Aug 13 16:02:24 prod021 hbase[103657]: 2019-08-13 16:02:24,832 INFO
>> [regionserver/prod021/10.73.65.28:60020-splits-1564981231292]
>> client.RpcRetryingCaller: Call exception, tries=10, retries=350,
>> started=38462 ms ago, cancelled=false,
>> msg=org.apache.hadoop.hbase.NotServingRegionException: Region hbase:meta,,1
>> is not online on prod006,60020,1564942121270
>> Aug 13 16:02:24 prod021 hbase[103657]: #011at
>> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3086)
>> Aug 13 16:02:24 prod021 hbase[103657]: #011at
>> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1271)
>> Aug 13 16:02:24 prod021 hbase[103657]: #011at
>> org.apache.hadoop.hbase.regionserver.RSRpcServices.execService(RSRpcServices.java:2197)
>> Aug 13 16:02:24 prod021 hbase[103657]: #011at
>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36617)
>> Aug 13 16:02:24 prod021 hbase[103657]: #011at
>> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2369)
>> Aug 13 16:02:24 prod021 hbase[103657]: #011at
>> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>> Aug 13 16:02:24 prod021 hbase[103657]: #011at
>> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>> Aug 13 16:02:24 prod021 hbase[103657]: #011at
>> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
>> Aug 13 16:02:24 prod021 hbase[103657]:  row
>> ‘IDX,\x1Abb6f7662-275e-459d-b0f3-7fd6aac9b6a4\x0000000046218537kV0WbJ_\x00\x13\x80\x00\x01k\x90\xFEzH\x00\x00\x00\x00,1562868892011.a215bcc7b396e5d08563f0a4183ba5cc.44'
>> on table 'hbase:meta' at region=hbase:meta,,1.1588230740,
>> hostname=prod023,60020,1565701089060, seqNum=0
>>
>> ….
>>
>> Aug 13 16:21:41 prod021 hbase[103657]: 2019-08-13 16:21:41,136 INFO
>> [regionserver/prod021/10.73.65.28:60020-splits-1564981231292]
>> client.RpcRetryingCaller: Call exception, tries=68, retries=350,
>> started=1194772 ms ago, cancell
>> ed=false, msg=org.apache.hadoop.hbase.NotServingRegionException: Region
>> hbase:meta,,1 is not online on prod006,60020,1564942121270
>> Aug 13 16:21:41 prod021 hbase[103657]: 2019-08-13 16:21:41,146 INFO
>> [regionserver/prod021/10.73.65.28:60020-splits-1564981231292]
>> regionserver.SplitRequest: Running rollback/cleanup of failed split of
>> IDX,\x1Abb6f7662-275e-459d-b0f3-7fd6aac9b6a4\x0000000046218537kV0WbJ_\x00\x13\x80\x00\x01k\x90\xFEzH\x00\x00\x00\x00,1562868892011.a215bcc7b396e5d08563f0a4183ba5cc.;
>> callTimeout=1200000, callDuration=1214938:
>> org.apache.hadoop.hbase.NotServingRegionException: Region hbase:meta,,1 is
>> not online on prod006,60020,1564942121270
>> Aug 13 16:21:41 prod021 hbase[103657]: 2019-08-13 16:21:41,193 FATAL
>> [regionserver/prod021/10.73.65.28:60020-splits-1564981231292]
>> regionserver.HRegionServer: ABORTING region server
>> prod021,60020,1564943354635: Abort; we got an error after point-of-no-return
>>
>> The same picture on other crashed servers
>>
>>
>> Any ideas why this happened and how to prevent this in future?
>
>

Re: hbase:meta not online

Posted by OpenInx <op...@gmail.com>.
Hi Alexander
Thanks for the log report & share.  I guess we did not handle
the NotServingRegionException for hbase:meta correctly, says when a client
encounter a NotServingRegionException from hbase:meta, it won't try to
relocate the hbase:meta location. (in 1.4.8)
Let me have a check.

On Wed, Aug 14, 2019 at 1:11 AM Alexander Batyrshin <0x...@gmail.com>
wrote:

>  Hello,
> We’ve got problem with hbase:meta that cause several servers to die.
> Our version is HBase-1.4.8
>
> prod006 - flush hbase:meta and move it:
>
> Aug 13 15:46:23 prod006 hbase[63435]: 2019-08-13 15:46:23,019 INFO
> [prod006,60020,1564942121270_ChoreService_1] regionserver.HRegionServer:
> prod006,60020,1564942121270-MemstoreFlusherChore requesting flush of
> hbase:meta,,1.1588230740 because info has an old edit so flush to free WALs
> after random delay 226,666 ms
> Aug 13 15:50:09 prod006 hbase[63435]: 2019-08-13 15:50:09,709 INFO
> [MemStoreFlusher.1] regionserver.HRegion: Finished memstore flush of ~2.99
> KB/3064, currentsize=0 B/0 for region hbase:meta,,1.1588230740 in 24ms,
> sequenceid=323883, compaction requested=false
> Aug 13 15:58:10 prod006 hbase[63435]: 2019-08-13 15:58:10,943 INFO
> [StoreCloserThread-hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0.-1]
> regionserver.HStore: Closed info
> Aug 13 15:58:10 prod006 hbase[63435]: 2019-08-13 15:58:10,951 INFO
> [StoreCloserThread-hbase:meta,,1.1588230740-1] regionserver.HStore: Closed
> info
> Aug 13 15:58:10 prod006 hbase[63435]: 2019-08-13 15:58:10,976 INFO
> [RS_CLOSE_META-prod006:60020-0] regionserver.HRegion: Closed
> hbase:meta,,1.1588230740
> Aug 13 15:58:10 prod006 hbase[63435]: 2019-08-13 15:58:10,979 INFO
> [RS_CLOSE_REGION-prod006:60020-0] regionserver.HRegion: Closed
> hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0.
> Aug 13 15:58:10 prod006 hbase[63435]: 2019-08-13 15:58:10,979 INFO
> [RS_CLOSE_REGION-prod006:60020-0] regionserver.HRegionServer: Adding moved
> region record: d3d9ad37f8506c02a84458121965c0b0 to
> prod023,60020,1565701089060 as of 147
>
> At this time at master we have this:
>
> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,982 INFO
> [AM.ZK.Worker-pool5-t1878] master.RegionStates: Transition {1588230740
> state=PENDING_CLOSE, ts=1565701090850, server=prod006,60020,1564942121270}
> to {158823074
> 0 state=CLOSED, ts=1565701090982, server=prod006,60020,1564942121270}
> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,982 INFO
> [AM.-pool3-t178] master.AssignmentManager: Setting node as OFFLINED in
> ZooKeeper for region {ENCODED => 1588230740, NAME => 'hbase:meta,,1',
> STARTKEY => '', ENDKEY => ''}
> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,982 INFO
> [AM.-pool3-t178] master.RegionStates: Transition {1588230740 state=CLOSED,
> ts=1565701090982, server=prod006,60020,1564942121270} to {1588230740
> state=OFFLINE, ts=1565701090982, server=prod006,60020,1564942121270}
> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,984 INFO
> [AM.ZK.Worker-pool5-t1879] master.RegionStates: Transition
> {d3d9ad37f8506c02a84458121965c0b0 state=PENDING_CLOSE, ts=1565701090871,
> server=prod006,60020,1564942121270} to {d3d9ad37f8506c02a84458121965c0b0
> state=CLOSED, ts=1565701090984, server=prod006,60020,1564942121270}
> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,984 INFO
> [AM.-pool3-t178] master.AssignmentManager: Assigning
> hbase:meta,,1.1588230740 to prod023,60020,1565701089060
> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,984 INFO
> [AM.-pool3-t178] master.RegionStates: Transition {1588230740 state=OFFLINE,
> ts=1565701090982, server=prod006,60020,1564942121270} to {1588230740
> state=PENDING_OPEN, ts=1565701090984, server=prod023,60020,1565701089060}
> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,985 INFO
> [AM.-pool3-t179] master.AssignmentManager: Setting node as OFFLINED in
> ZooKeeper for region {ENCODED => d3d9ad37f8506c02a84458121965c0b0, NAME =>
> 'hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0.',
> STARTKEY => '', ENDKEY => ''}
> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,985 INFO
> [AM.-pool3-t179] master.RegionStates: Transition
> {d3d9ad37f8506c02a84458121965c0b0 state=CLOSED, ts=1565701090984,
> server=prod006,60020,1564942121270} to {d3d9ad37f8506c02a84458121965c0b0
> state=OFFLINE, ts=1565701090985, server=prod006,60020,1564942121270}
> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,986 INFO
> [AM.-pool3-t179] master.AssignmentManager: Assigning
> hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0. to
> prod023,60020,1565701089060
> Aug 13 15:58:10 prod001 hbase[14953]: 2019-08-13 15:58:10,986 INFO
> [AM.-pool3-t179] master.RegionStates: Transition
> {d3d9ad37f8506c02a84458121965c0b0 state=OFFLINE, ts=1565701090985,
> server=prod006,60020,1564942121270} to {d3d9ad37f8506c02a84458121965c0b0
> state=PENDING_OPEN, ts=1565701090986, server=prod023,60020,1565701089060}
> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,039 WARN
> [AM.-pool3-t179] master.AssignmentManager: Failed assignment of
> hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0. to
> prod023,60020,1565701089060, waiting a little before trying on the same
> region server try=1 of 10
> Aug 13 15:58:11 prod001 hbase[14953]:
> org.apache.hadoop.hbase.ipc.ServerNotRunningYetException:
> org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server
> prod023,60020,1565701089060 is not running yet
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:1320)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:1707)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:22737)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2380)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.instantiateException(RemoteWithExtrasException.java:95)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException.unwrapRemoteException(RemoteWithExtrasException.java:85)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> org.apache.hadoop.hbase.protobuf.ProtobufUtil.makeIOExceptionOfException(ProtobufUtil.java:368)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:330)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:787)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:2248)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1694)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> org.apache.hadoop.hbase.master.AssignCallable.call(AssignCallable.java:48)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> java.lang.Thread.run(Thread.java:748)
> Aug 13 15:58:11 prod001 hbase[14953]: Caused by:
> org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.ipc.ServerNotRunningYetException):
> org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server
> prod023,60020,1565701089060 is not running yet
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.checkOpen(RSRpcServices.java:1320)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.openRegion(RSRpcServices.java:1707)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$2.callBlockingMethod(AdminProtos.java:22737)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2380)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:386)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:94)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:409)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:405)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:103)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> org.apache.hadoop.hbase.ipc.Call.setException(Call.java:118)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> org.apache.hadoop.hbase.ipc.BlockingRpcConnection.readResponse(BlockingRpcConnection.java:600)
> Aug 13 15:58:11 prod001 hbase[14953]:         at
> org.apache.hadoop.hbase.ipc.BlockingRpcConnection.run(BlockingRpcConnection.java:334)
> Aug 13 15:58:11 prod001 hbase[14953]:         ... 1 more
> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,217 INFO
> [AM.ZK.Worker-pool5-t1882] master.RegionStates: Transition {1588230740
> state=PENDING_OPEN, ts=1565701090984, server=prod023,60020,1565701089060}
> to {1588230740 state=OPENING, ts=1565701091217,
> server=prod023,60020,1565701089060}
> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,244 INFO
> [AM.-pool3-t179] master.AssignmentManager: Assigning
> hbase:namespace,,1516784579153.d3d9ad37f8506c02a84458121965c0b0. to
> prod023,60020,1565701089060
> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,391 INFO
> [AM.ZK.Worker-pool5-t1883] master.RegionStates: Transition
> {d3d9ad37f8506c02a84458121965c0b0 state=PENDING_OPEN, ts=1565701091244,
> server=prod023,60020,1565701089060} to {d3d9ad37f8506c02a84458121965c0b0
> state=OPENING, ts=1565701091391, server=prod023,60020,1565701089060}
> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,666 INFO
> [AM.ZK.Worker-pool5-t1884] master.RegionStates: Transition {1588230740
> state=OPENING, ts=1565701091217, server=prod023,60020,1565701089060} to
> {1588230740 state=OPEN, ts=1565701091666,
> server=prod023,60020,1565701089060}
> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,666 INFO
> [AM.ZK.Worker-pool5-t1884] coordination.ZkOpenRegionCoordination: Handling
> OPENED of 1588230740 from prod001,60000,1565179968855; deleting unassigned
> node
> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,668 INFO
> [AM.ZK.Worker-pool5-t1886] master.RegionStates: Offlined 1588230740 from
> prod006,60020,1564942121270
> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,805 INFO
> [AM.ZK.Worker-pool5-t1887] master.RegionStates: Transition
> {d3d9ad37f8506c02a84458121965c0b0 state=OPENING, ts=1565701091391,
> server=prod023,60020,1565701089060} to {d3d9ad37f8506c02a84458121965c0b0
> state=OPEN, ts=1565701091805, server=prod023,60020,1565701089060}
> Aug 13 15:58:11 prod001 hbase[14953]: 2019-08-13 15:58:11,807 INFO
> [AM.ZK.Worker-pool5-t1889] master.RegionStates: Offlined
> d3d9ad37f8506c02a84458121965c0b0 from prod006,60020,1564942121270
>
>
> From this moment hbase:meta and hbase:namespace is moved to prod023 from
> prod006
>
> But some other servers still try to get hbase:meta from prod006:
>
> Aug 13 16:02:24 prod021 hbase[103657]: 2019-08-13 16:02:24,832 INFO
> [regionserver/prod021/10.73.65.28:60020-splits-1564981231292]
> client.RpcRetryingCaller: Call exception, tries=10, retries=350,
> started=38462 ms ago, cancelled=false,
> msg=org.apache.hadoop.hbase.NotServingRegionException: Region hbase:meta,,1
> is not online on prod006,60020,1564942121270
> Aug 13 16:02:24 prod021 hbase[103657]: #011at
> org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3086)
> Aug 13 16:02:24 prod021 hbase[103657]: #011at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1271)
> Aug 13 16:02:24 prod021 hbase[103657]: #011at
> org.apache.hadoop.hbase.regionserver.RSRpcServices.execService(RSRpcServices.java:2197)
> Aug 13 16:02:24 prod021 hbase[103657]: #011at
> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36617)
> Aug 13 16:02:24 prod021 hbase[103657]: #011at
> org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2369)
> Aug 13 16:02:24 prod021 hbase[103657]: #011at
> org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
> Aug 13 16:02:24 prod021 hbase[103657]: #011at
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
> Aug 13 16:02:24 prod021 hbase[103657]: #011at
> org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)
> Aug 13 16:02:24 prod021 hbase[103657]:  row
> ‘IDX,\x1Abb6f7662-275e-459d-b0f3-7fd6aac9b6a4\x0000000046218537kV0WbJ_\x00\x13\x80\x00\x01k\x90\xFEzH\x00\x00\x00\x00,1562868892011.a215bcc7b396e5d08563f0a4183ba5cc.44'
> on table 'hbase:meta' at region=hbase:meta,,1.1588230740,
> hostname=prod023,60020,1565701089060, seqNum=0
>
> ….
>
> Aug 13 16:21:41 prod021 hbase[103657]: 2019-08-13 16:21:41,136 INFO
> [regionserver/prod021/10.73.65.28:60020-splits-1564981231292]
> client.RpcRetryingCaller: Call exception, tries=68, retries=350,
> started=1194772 ms ago, cancell
> ed=false, msg=org.apache.hadoop.hbase.NotServingRegionException: Region
> hbase:meta,,1 is not online on prod006,60020,1564942121270
> Aug 13 16:21:41 prod021 hbase[103657]: 2019-08-13 16:21:41,146 INFO
> [regionserver/prod021/10.73.65.28:60020-splits-1564981231292]
> regionserver.SplitRequest: Running rollback/cleanup of failed split of
> IDX,\x1Abb6f7662-275e-459d-b0f3-7fd6aac9b6a4\x0000000046218537kV0WbJ_\x00\x13\x80\x00\x01k\x90\xFEzH\x00\x00\x00\x00,1562868892011.a215bcc7b396e5d08563f0a4183ba5cc.;
> callTimeout=1200000, callDuration=1214938:
> org.apache.hadoop.hbase.NotServingRegionException: Region hbase:meta,,1 is
> not online on prod006,60020,1564942121270
> Aug 13 16:21:41 prod021 hbase[103657]: 2019-08-13 16:21:41,193 FATAL
> [regionserver/prod021/10.73.65.28:60020-splits-1564981231292]
> regionserver.HRegionServer: ABORTING region server
> prod021,60020,1564943354635: Abort; we got an error after point-of-no-return
>
> The same picture on other crashed servers
>
>
> Any ideas why this happened and how to prevent this in future?