Posted to dev@hbase.apache.org by Andrew Purtell <ap...@apache.org> on 2010/06/19 22:46:49 UTC

master fails to reassign regions from dead RS?

Sorry, I'm stuck in an airport about to be on planes for 24+ hours, with limited time.

Full logs available at: https://tm-files.s3.amazonaws.com/20100618-test0-webtable-crash.tar.bz2

I witnessed a problem while testing the latest build of our TM-2 HBase+Hadoop internal distribution last night. TM-2 is HBase 0.20.5-dev (the latest RC) with no HBase patches. Where we diverge from upstream for this build is HDFS: we use Hadoop 0.20.2 plus 0.20-append branch changesets plus a couple of other HDFS patches. 

The divergence between our branch and upstream may ONLY be material with regard to why a region server crashed. (That's a separate issue I am looking into.) What I am reporting now is a possible problem with the master.

As far as I can see, in this test scenario the master never reassigns regions away from a crashed RS. 


***
CLIENT

The client sees the RS hosting ROOT and META die.

10/06/19 07:59:38 DEBUG client.HConnectionManager$TableServers: locateRegionInMeta attempt 0 of 10 failed; retrying after sleep of 1000 because: Call to /10.222.238.255:60020 failed on local exception: java.io.EOFException
10/06/19 07:59:38 DEBUG client.HConnectionManager$TableServers: Removed .META.,,1 for tableName=.META. from cache because of TestTable,9f9a5c1e-f4c3-b955-85b2-f60636a9cb25,99999999999999
10/06/19 07:59:38 DEBUG client.HConnectionManager$TableServers: locateRegionInMeta attempt 0 of 10 failed; retrying after sleep of 1000 because: Connection refused
10/06/19 07:59:38 DEBUG zookeeper.ZooKeeperWrapper: Read ZNode /hbase/root-region-server got 10.222.238.255:60020
10/06/19 07:59:38 DEBUG client.HConnectionManager$TableServers: Root region location changed. Sleeping.

***
MASTER

HMaster also sees the RS hosting ROOT and META die.

2010-06-19 07:59:35,424 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_OPEN: TestTable,8b241b8b-4ab0-f252-8d18-0949c7a918bf,1276948774514 from domU-12-31-38-04-ED-D1.compute-1.internal,60020,1276920183192; 1 of 1
2010-06-19 07:59:35,625 INFO org.apache.hadoop.hbase.master.RegionServerOperation: Updated row TestTable,8b35278f-56fa-4955-93fe-2ff7c4163dc3,1276948774514 in region .META.,,1 with startcode=1276920183192, server=10.220.238.31:60020
2010-06-19 07:59:35,625 INFO org.apache.hadoop.hbase.master.RegionServerOperation: TestTable,8b241b8b-4ab0-f252-8d18-0949c7a918bf,1276948774514 open on 10.220.238.31:60020
2010-06-19 07:59:35,704 INFO org.apache.hadoop.hbase.master.RegionServerOperation: Updated row TestTable,8b241b8b-4ab0-f252-8d18-0949c7a918bf,1276948774514 in region .META.,,1 with startcode=1276920183192, server=10.220.238.31:60020
2010-06-19 08:00:04,370 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 10.222.238.255:60020, regionname: -ROOT-,,0, startKey: <>}
2010-06-19 08:00:04,373 WARN org.apache.hadoop.hbase.master.BaseScanner: Scan ROOT region
java.net.ConnectException: Connection refused
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
	at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:308)
	at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:843)
	at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:715)
	at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:333)
	at $Proxy1.openScanner(Unknown Source)
	at org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:177)
	at org.apache.hadoop.hbase.master.RootScanner.scanRoot(RootScanner.java:54)
	at org.apache.hadoop.hbase.master.RootScanner.maintenanceScan(RootScanner.java:79)
	at org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:153)
	at org.apache.hadoop.hbase.Chore.run(Chore.java:68)
2010-06-19 08:00:04,532 INFO org.apache.hadoop.hbase.master.ServerManager: 5 region servers, 0 dead, average load 456.4
2010-06-19 08:00:07,399 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.metaScanner scanning meta region {server: 10.222.238.255:60020, regionname: .META.,,1, startKey: <>}
2010-06-19 08:00:07,400 WARN org.apache.hadoop.hbase.master.BaseScanner: Scan one META region: {server: 10.222.238.255:60020, regionname: .META.,,1, startKey: <>}
java.net.ConnectException: Connection refused
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
	at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:308)
	at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:843)
	at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:715)
	at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:333)
	at $Proxy1.openScanner(Unknown Source)
	at org.apache.hadoop.hbase.master.BaseScanner.scanRegion(BaseScanner.java:177)
	at org.apache.hadoop.hbase.master.MetaScanner.scanOneMetaRegion(MetaScanner.java:73)
	at org.apache.hadoop.hbase.master.MetaScanner.maintenanceScan(MetaScanner.java:129)
	at org.apache.hadoop.hbase.master.BaseScanner.chore(BaseScanner.java:153)
	at org.apache.hadoop.hbase.Chore.run(Chore.java:68)
2010-06-19 08:00:07,407 INFO org.apache.hadoop.hbase.master.BaseScanner: All 1 .META. region(s) scanned
2010-06-19 08:00:29,708 INFO org.apache.hadoop.hbase.master.ServerManager: domU-12-31-38-06-EC-F1.compute-1.internal,60020,1276920180076 znode expired
2010-06-19 08:00:29,710 INFO org.apache.hadoop.hbase.master.RegionManager: -ROOT- region unset (but not set to be reassigned)
2010-06-19 08:00:29,710 INFO org.apache.hadoop.hbase.master.RegionManager: META region removed from onlineMetaRegions
2010-06-19 08:00:29,711 INFO org.apache.hadoop.hbase.master.RegionServerOperation: process shutdown of server domU-12-31-38-06-EC-F1.compute-1.internal,60020,1276920180076: logSplit: false, rootRescanned: false, numberOfMetaRegions: 1, onlineMetaRegions.size(): 0
2010-06-19 08:00:29,756 INFO org.apache.hadoop.hbase.regionserver.HLog: Splitting 38 hlog(s) in hdfs://domU-12-31-38-06-EC-81.compute-1.internal:8020/hbase/.logs/domU-12-31-38-06-EC-F1.compute-1.internal,60020,1276920180076
2010-06-19 08:00:34,979 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:00:38,418 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:00:42,021 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:00:47,539 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:00:49,649 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:00:51,761 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:00:57,992 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:01:00,297 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:01:02,206 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:01:04,827 INFO org.apache.hadoop.hbase.master.ServerManager: 4 region servers, 1 dead, average load 456.25[domU-12-31-38-06-EC-F1.compute-1.internal,60020,1276920180076]
2010-06-19 08:01:07,690 INFO org.apache.hadoop.hbase.master.BaseScanner: All 0 .META. region(s) scanned
2010-06-19 08:01:08,042 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:01:09,670 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:01:11,710 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:01:16,659 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:01:19,058 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:01:20,908 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:01:25,805 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:01:27,674 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:01:29,533 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:01:33,806 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:01:36,320 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:01:38,512 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:01:44,046 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:01:46,109 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:01:48,316 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:01:54,333 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:01:56,089 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:01:58,738 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:02:03,269 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:02:04,829 INFO org.apache.hadoop.hbase.master.ServerManager: 4 region servers, 1 dead, average load 456.25[domU-12-31-38-06-EC-F1.compute-1.internal,60020,1276920180076]
2010-06-19 08:02:04,908 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:02:07,420 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:02:07,692 INFO org.apache.hadoop.hbase.master.BaseScanner: All 0 .META. region(s) scanned
2010-06-19 08:02:12,709 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:02:15,076 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:02:18,160 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:02:23,213 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:02:25,326 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:02:27,388 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:02:31,922 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:02:32,584 INFO org.apache.hadoop.hbase.regionserver.HLog: Failed open for append, waiting on lease recovery: hdfs://domU-12-31-38-06-EC-81.compute-1.internal/hbase/.logs/domU-12-31-38-06-EC-F1.compute-1.internal,60020,1276920180076/hlog.dat.1276948769573
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/.logs/domU-12-31-38-06-EC-F1.compute-1.internal,60020,1276920180076/hlog.dat.1276948769573 for DFSClient_1937332980 on client 10.222.239.111, because this file is already being created by NN_Recovery on 10.222.238.255
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1166)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1258)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:396)
	at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1044)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1040)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1038)

	at org.apache.hadoop.ipc.Client.call(Client.java:740)
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
	at $Proxy0.append(Unknown Source)
	at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
	at $Proxy0.append(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:521)
	at org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:186)
	at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:530)
	at org.apache.hadoop.hbase.regionserver.HLog.recoverLog(HLog.java:1203)
	at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:959)
	at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:872)
	at org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:298)
	at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:497)
	at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:432)
2010-06-19 08:02:33,591 INFO org.apache.hadoop.hbase.regionserver.HLog: Failed open for append, waiting on lease recovery: hdfs://domU-12-31-38-06-EC-81.compute-1.internal/hbase/.logs/domU-12-31-38-06-EC-F1.compute-1.internal,60020,1276920180076/hlog.dat.1276948769573
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/.logs/domU-12-31-38-06-EC-F1.compute-1.internal,60020,1276920180076/hlog.dat.1276948769573 for DFSClient_1937332980 on client 10.222.239.111, because this file is already being created by NN_Recovery on 10.222.238.255
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1166)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1258)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:396)
	at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1044)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1040)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1038)

	at org.apache.hadoop.ipc.Client.call(Client.java:740)
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
	at $Proxy0.append(Unknown Source)
	at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
	at $Proxy0.append(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:521)
	at org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:186)
	at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:530)
	at org.apache.hadoop.hbase.regionserver.HLog.recoverLog(HLog.java:1203)
	at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:959)
	at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:872)
	at org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:298)
	at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:497)
	at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:432)
2010-06-19 08:02:34,597 INFO org.apache.hadoop.hbase.regionserver.HLog: Failed open for append, waiting on lease recovery: hdfs://domU-12-31-38-06-EC-81.compute-1.internal/hbase/.logs/domU-12-31-38-06-EC-F1.compute-1.internal,60020,1276920180076/hlog.dat.1276948769573
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /hbase/.logs/domU-12-31-38-06-EC-F1.compute-1.internal,60020,1276920180076/hlog.dat.1276948769573 for DFSClient_1937332980 on client 10.222.239.111, because this file is already being created by NN_Recovery on 10.222.238.255
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1166)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1258)
	at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:396)
	at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1044)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1040)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1038)

	at org.apache.hadoop.ipc.Client.call(Client.java:740)
	at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
	at $Proxy0.append(Unknown Source)
	at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
	at $Proxy0.append(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:521)
	at org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:186)
	at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:530)
	at org.apache.hadoop.hbase.regionserver.HLog.recoverLog(HLog.java:1203)
	at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:959)
	at org.apache.hadoop.hbase.regionserver.HLog.splitLog(HLog.java:872)
	at org.apache.hadoop.hbase.master.ProcessServerShutdown.process(ProcessServerShutdown.java:298)
	at org.apache.hadoop.hbase.master.HMaster.processToDoQueue(HMaster.java:497)
	at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:432)
2010-06-19 08:02:36,330 INFO org.apache.hadoop.hbase.regionserver.HLog: Past out lease recovery
2010-06-19 08:02:49,498 INFO org.apache.hadoop.hbase.regionserver.HLog: hlog file splitting completed in 139783 millis for hdfs://domU-12-31-38-06-EC-81.compute-1.internal:8020/hbase/.logs/domU-12-31-38-06-EC-F1.compute-1.internal,60020,1276920180076
2010-06-19 08:02:49,498 INFO org.apache.hadoop.hbase.master.RegionServerOperation: Log split complete, meta reassignment and scanning:

*** Log recovery complete.
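
The repeated "Failed open for append" traces above look consistent with HLog.recoverLog looping on fs.append() against the dead RS's last hlog until NameNode lease recovery releases the file, sleeping between attempts. As a generic sketch of that retry shape (my paraphrase, not the actual HLog code):

```java
import java.util.concurrent.Callable;

// Generic retry-until-lease-released loop; a sketch of the pattern only,
// not the actual HLog.recoverLog implementation.
public class LeaseRecoveryRetry {
    static <T> T retry(Callable<T> op, int maxAttempts, long sleepMs) throws Exception {
        Exception last = null;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                return op.call(); // e.g. fs.append(hlogPath)
            } catch (Exception e) {
                last = e;         // e.g. AlreadyBeingCreatedException while NN_Recovery holds the lease
                Thread.sleep(sleepMs);
            }
        }
        throw last;
    }
}
```

The important point for this report is that the retries did eventually succeed (hlog splitting completed, per the log above), so log recovery is not where the reassignment got lost.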

2010-06-19 08:02:49,498 INFO org.apache.hadoop.hbase.master.RegionServerOperation: ProcessServerShutdown reassigning ROOT region
2010-06-19 08:02:49,498 INFO org.apache.hadoop.hbase.master.RegionManager: -ROOT- region unset (but not set to be reassigned)
2010-06-19 08:02:49,498 INFO org.apache.hadoop.hbase.master.RegionManager: ROOT inserted into regionsInTransition
2010-06-19 08:02:49,498 INFO org.apache.hadoop.hbase.master.RegionServerOperation: ProcessServerShutdown setting to unassigned: {server: 10.222.238.255:60020, regionname: .META.,,1, startKey: <>}
2010-06-19 08:02:49,724 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region -ROOT-,,0 to domU-12-31-38-01-71-B1.compute-1.internal,60020,1276920193543
2010-06-19 08:02:49,796 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_OPEN: -ROOT-,,0 from domU-12-31-38-01-71-B1.compute-1.internal,60020,1276920193543; 1 of 1

*** ROOT is reopened

2010-06-19 08:02:49,800 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 10.253.118.63:60020, regionname: -ROOT-,,0, startKey: <>}
2010-06-19 08:02:49,854 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {server: 10.253.118.63:60020, regionname: -ROOT-,,0, startKey: <>} complete
2010-06-19 08:02:49,854 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scanning meta region {server: 10.253.118.63:60020, regionname: -ROOT-,,0, startKey: <>}
2010-06-19 08:02:49,861 INFO org.apache.hadoop.hbase.master.BaseScanner: RegionManager.rootScanner scan of 1 row(s) of meta region {server: 10.253.118.63:60020, regionname: -ROOT-,,0, startKey: <>} complete
2010-06-19 08:02:49,888 INFO org.apache.hadoop.hbase.master.RegionManager: Assigning region .META.,,1 to domU-12-31-38-04-ED-D1.compute-1.internal,60020,1276920183192
2010-06-19 08:03:04,831 INFO org.apache.hadoop.hbase.master.ServerManager: 4 region servers, 1 dead, average load 456.5[domU-12-31-38-06-EC-F1.compute-1.internal,60020,1276920180076]
2010-06-19 08:03:07,693 INFO org.apache.hadoop.hbase.master.BaseScanner: All 0 .META. region(s) scanned
2010-06-19 08:03:11,618 INFO org.apache.hadoop.hbase.master.ServerManager: Processing MSG_REPORT_OPEN: .META.,,1 from domU-12-31-38-04-ED-D1.compute-1.internal,60020,1276920183192; 1 of 1
2010-06-19 08:03:11,619 INFO org.apache.hadoop.hbase.master.RegionServerOperation: .META.,,1 open on 10.220.238.31:60020
2010-06-19 08:03:11,647 INFO org.apache.hadoop.hbase.master.RegionServerOperation: Updated row .META.,,1 in region -ROOT-,,0 with startcode=1276920183192, server=10.220.238.31:60020

*** META is reopened


***
CLIENT

After ROOT and META are reopened

*** Finds new ROOT

10/06/19 08:02:53 DEBUG client.HConnectionManager$TableServers: Wake. Retry finding root region.
10/06/19 08:02:53 DEBUG zookeeper.ZooKeeperWrapper: Read ZNode /hbase/root-region-server got 10.253.118.63:60020
10/06/19 08:02:53 DEBUG client.HConnectionManager$TableServers: Found ROOT at 10.253.118.63:60020
10/06/19 08:02:53 DEBUG zookeeper.ZooKeeperWrapper: Read ZNode /hbase/root-region-server got 10.253.118.63:60020
10/06/19 08:02:53 DEBUG client.HConnectionManager$TableServers: Found ROOT at 10.253.118.63:60020

*** Finds new META

10/06/19 08:03:15 DEBUG client.HConnectionManager$TableServers: locateRegionInMeta attempt 8 of 10 failed; retrying after sleep of 16000 because: Connection refused
10/06/19 08:03:15 DEBUG client.HConnectionManager$TableServers: Removed .META.,,1 for tableName=.META. from cache because of TestTable,aad6db21-f4c3-e255-96d1-3bc9a9339734,99999999999999
10/06/19 08:03:15 DEBUG client.HConnectionManager$TableServers: Cached location for .META.,,1 is 10.220.238.31:60020

*** Tries to relocate 00771560-c9e1-7a50-9183-ac99e0be03b8,1276935951367 which was hosted on dead RS

10/06/19 08:04:10 DEBUG client.HConnectionManager$TableServers: Removed TestTable,00771560-c9e1-7a50-9183-ac99e0be03b8,1276935951367 for tableName=TestTable from cache because of 0094d9bd-41ad-8c51-a266-1e54a277d247
10/06/19 08:04:10 DEBUG client.HConnectionManager$TableServers: Cached location for TestTable,00771560-c9e1-7a50-9183-ac99e0be03b8,1276935951367 is 10.222.238.255:60020
10/06/19 08:04:10 DEBUG client.HConnectionManager$TableServers: Removed TestTable,00771560-c9e1-7a50-9183-ac99e0be03b8,1276935951367 for tableName=TestTable from cache because of 008169ff-7ad8-f05d-854f-954bc10de3f3

*** Still trying hours later (this region among others)

10/06/19 11:38:08 DEBUG client.HConnectionManager$TableServers: Cached location for TestTable,00771560-c9e1-7a50-9183-ac99e0be03b8,1276935951367 is 10.222.238.255:60020
10/06/19 11:38:08 DEBUG client.HConnectionManager$TableServers: Removed TestTable,00771560-c9e1-7a50-9183-ac99e0be03b8,1276935951367 for tableName=TestTable from cache because of 008169ff-7ad8-f05d-854f-954bc10de3f3
10/06/19 11:38:08 DEBUG client.HConnectionManager$TableServers: Cached location for TestTable,00771560-c9e1-7a50-9183-ac99e0be03b8,1276935951367 is 10.222.238.255:60020
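
The client appears stuck in a loop here: the .META. row still maps the region to the dead RS (10.222.238.255), so each lookup re-caches the stale address, the RPC fails, the entry is evicted, and the cycle repeats. A toy model of that cycle (hypothetical names, not HConnectionManager code):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

// Toy model of the client region-location cache behavior seen in the
// log above; class and method names are mine, not HBase's.
public class StaleLocationLoop {
    final Map<String, String> cache = new HashMap<>();
    final Function<String, String> metaLookup; // region name -> "ip:port" as written in .META.

    StaleLocationLoop(Function<String, String> metaLookup) {
        this.metaLookup = metaLookup;
    }

    // One client attempt: cache the location from .META., try the RPC,
    // and evict the cached entry on failure ("Removed ... from cache").
    boolean attempt(String region, Predicate<String> rpcSucceeds) {
        String server = cache.computeIfAbsent(region, metaLookup);
        if (rpcSucceeds.test(server)) {
            return true;
        }
        cache.remove(region); // next attempt re-reads the same stale row
        return false;
    }
}
```

Because nothing ever rewrites the .META. row, metaLookup keeps returning the dead server's address and the loop can run for hours, exactly as in the 11:38 log lines.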


***
MASTER

So what happened with TestTable,00771560-c9e1-7a50-9183-ac99e0be03b8,1276935951367?

This is the last entry in the master log regarding this region:

2010-06-19 04:25:55,646 INFO org.apache.hadoop.hbase.master.RegionServerOperation: TestTable,00771560-c9e1-7a50-9183-ac99e0be03b8,1276935951367 open on 10.222.238.255:60020
2010-06-19 04:25:55,663 INFO org.apache.hadoop.hbase.master.RegionServerOperation: Updated row TestTable,00771560-c9e1-7a50-9183-ac99e0be03b8,1276935951367 in region .META.,,1 with startcode=1276920180076, server=10.222.238.255:60020

The RS failure happened around 10/06/19 07:59.

The master never tries to reassign this or other regions hosted on the dead RS.
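
For comparison, here is what I'd expect shutdown processing to do after log splitting: walk .META. for rows whose server column still points at the dead RS and mark each such region unassigned. A toy sketch of that scan (my names, not ProcessServerShutdown's actual code), which the master evidently never gets to for user regions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy model of the post-split step I'd expect: find every .META. row
// still pointing at the dead server and queue it for reassignment.
// Names are hypothetical; this is not HBase code.
public class ReassignSketch {
    // metaRows maps region name -> hosting "ip:port" as written in .META.
    static List<String> regionsToReassign(Map<String, String> metaRows, String deadServer) {
        List<String> pending = new ArrayList<>();
        for (Map.Entry<String, String> row : metaRows.entrySet()) {
            if (deadServer.equals(row.getValue())) {
                pending.add(row.getKey()); // would be set unassigned by the master
            }
        }
        return pending;
    }
}
```

From the logs it looks like only ROOT and .META. themselves went through a reassignment path; nothing ever rescanned .META. for the dead server's user regions.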

The master processes MSG_REPORT_SPLIT_INCLUDES_DAUGHTERS messages that come in over time from this point, but never a MSG_REPORT_OPEN, or a MSG_REPORT_CLOSE either, for the dead server's regions.

Do I have this right?

I admit I've gone through these logs hastily as I'm pressed for time.

   - Andy