Posted to dev@hbase.apache.org by "Sergey Shelukhin (JIRA)" <ji...@apache.org> on 2018/12/11 00:06:00 UTC
[jira] [Created] (HBASE-21576) master should proactively reassign meta when killing a RS with it

Sergey Shelukhin created HBASE-21576:
----------------------------------------

             Summary: master should proactively reassign meta when killing a RS with it
                 Key: HBASE-21576
                 URL: https://issues.apache.org/jira/browse/HBASE-21576
             Project: HBase
          Issue Type: Bug
            Reporter: Sergey Shelukhin


The master killed an RS that was hosting meta because of some internal error (I still need to determine whether that is a separate bug or just a machine/HDFS issue; I lost the RS logs because of HBASE-21575).
The RS took a very long time to die (again, possibly a separate bug; I'll file one if I see a repro) and a long time to restart. Meanwhile, the master never tried to reassign meta, and it eventually killed itself because it was unable to update meta.
An RS on a bad machine seems especially prone to slow abort/startup, as well as to the kinds of issues that cause the master to kill it. It would therefore make sense for the master to relocate meta immediately once the meta-hosting RS is dead, or even as soon as it kills the RS. In the former case (if the RS needs to die before meta can be reassigned safely), perhaps an RS hosting meta in particular should try to die fast in such circumstances and skip any cleanup.
{noformat}
2018-12-08 04:52:55,144 WARN  [RpcServer.default.FPBQ.Fifo.handler=39,queue=4,port=17000] master.MasterRpcServices: <server1>,17020,1544264858183 reported a fatal error:
***** ABORTING region server <server1>,17020,1544264858183: Replay of WAL required. Forcing server shutdown *****
.... [aborting for ~7 minutes]
2018-12-08 04:53:44,190 INFO  [PEWorker-7] client.RpcRetryingCallerImpl: Call exception, tries=6, retries=61, started=41190 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.regionserver.RegionServerAbortedException: Server <server1>,17020,1544264858183 aborting, details=row '...' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=<server1>,17020,1544264858183, seqNum=-1
... [starting for ~5 minutes]
2018-12-08 04:59:58,574 INFO  [RpcServer.default.FPBQ.Fifo.handler=45,queue=0,port=17000] client.RpcRetryingCallerImpl: Call exception, tries=10, retries=61, started=392702 ms ago, cancelled=false, msg=Call to <server1> failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.ConnectTimeoutException: connection timed out: <server1>, details=row '...' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=<server1>,17020,1544264858183, seqNum=-1
... [re-initializing for at least ~7 minutes]
2018-12-08 05:04:17,271 INFO  [hconnection-0x4d58bcd4-shared-pool3-t1877] client.RpcRetryingCallerImpl: Call exception, tries=6, retries=61, started=41137 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server <server1>,17020,1544274145387 is not running yet
...
2018-12-08 05:11:18,470 ERROR [RpcServer.default.FPBQ.Fifo.handler=38,queue=3,port=17000] master.HMaster: ***** ABORTING master ...,17000,1544230401860: FAILED persisting region=... state=OPEN *****
{noformat}

There are no signs of meta assignment activity at all in the master logs.
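
To make the proposed behavior concrete, here is a rough sketch of the control flow I have in mind. It is not HBase code: `MetaReassignSketch`, `Master`, `onServerFatal` and `reassignMeta` are hypothetical names invented for illustration, servers are plain strings, and the sketch deliberately ignores everything that makes this hard in practice (WAL splitting, fencing the old server, procedure persistence). It only shows the ordering: when the master decides a server is dead or must die, it checks whether that server hosts meta and, if so, moves meta right away instead of waiting for the server to finish aborting and restarting.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch only; none of these types exist in HBase. */
final class MetaReassignSketch {

    /** Minimal stand-in for the master's view of the cluster. */
    static final class Master {
        private final List<String> liveServers = new ArrayList<>();
        private final List<String> events = new ArrayList<>();
        private String metaLocation; // server currently hosting hbase:meta, or null

        Master(List<String> servers, String metaLocation) {
            this.liveServers.addAll(servers);
            this.metaLocation = metaLocation;
        }

        /** Called when the master kills a server or learns that it is aborting. */
        void onServerFatal(String server) {
            liveServers.remove(server);
            events.add("expire " + server);
            if (server.equals(metaLocation)) {
                // Proposed behavior: do not wait for the old server to
                // finish its (possibly very slow) abort and restart.
                reassignMeta();
            }
        }

        /** Picks the first remaining live server; a real master would use its balancer. */
        private void reassignMeta() {
            if (liveServers.isEmpty()) {
                metaLocation = null;
                events.add("meta unassigned: no live servers");
                return;
            }
            metaLocation = liveServers.get(0);
            events.add("assign meta to " + metaLocation);
        }

        String metaLocation() {
            return metaLocation;
        }

        List<String> events() {
            return events;
        }
    }

    public static void main(String[] args) {
        Master master = new Master(List.of("rs1", "rs2", "rs3"), "rs1");
        master.onServerFatal("rs1"); // the meta-hosting server reports a fatal error
        master.events().forEach(System.out::println);
        System.out.println("meta is now on " + master.metaLocation());
    }
}
```

If reassignment is only safe once the old server is really gone, the same hook would instead have to wait for confirmation of its death, which is why a fast, cleanup-free abort on the meta-hosting RS would matter.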



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)