Posted to issues@hbase.apache.org by "Sergey Shelukhin (JIRA)" <ji...@apache.org> on 2018/12/11 00:07:00 UTC

[jira] [Updated] (HBASE-21576) master should proactively reassign meta when killing a RS with it

     [ https://issues.apache.org/jira/browse/HBASE-21576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Shelukhin updated HBASE-21576:
-------------------------------------
    Description: 
The master killed an RS that was hosting meta due to some internal error (I still need to determine whether it's a separate bug or just a machine/HDFS issue; I've lost the RS logs due to HBASE-21575).
The RS took a very long time to die (again, this might be a separate bug; I'll file one if I see a repro) and a long time to restart; meanwhile, the master never tried to reassign meta and eventually killed itself because it could not update it.
It seems like an RS on a bad machine would be especially prone to slow abort/startup, as well as to the kinds of issues that cause the master to kill it, so it would make sense for the master to immediately relocate meta once the meta-hosting RS is dead after a kill, or even while killing the RS. In the former case (if the RS needs to die before meta can be reassigned safely), perhaps an RS hosting meta should try to die fast in such circumstances and skip any cleanup. A rough sketch of the master-side check follows the log excerpt below.
{noformat}
2018-12-08 04:52:55,144 WARN  [RpcServer.default.FPBQ.Fifo.handler=39,queue=4,port=17000] master.MasterRpcServices: <server1>,17020,1544264858183 reported a fatal error:
***** ABORTING region server <server1>,17020,1544264858183: Replay of WAL required. Forcing server shutdown *****
.... [aborting for ~7 minutes]
2018-12-08 04:53:44,190 INFO  [PEWorker-7] client.RpcRetryingCallerImpl: Call exception, tries=6, retries=61, started=41190 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.regionserver.RegionServerAbortedException: Server <server1>,17020,1544264858183 aborting, details=row '...' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=<server1>,17020,1544264858183, seqNum=-1
... [starting for ~5 minutes]
2018-12-08 04:59:58,574 INFO  [RpcServer.default.FPBQ.Fifo.handler=45,queue=0,port=17000] client.RpcRetryingCallerImpl: Call exception, tries=10, retries=61, started=392702 ms ago, cancelled=false, msg=Call to <server1> failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.ConnectTimeoutException: connection timed out: <server1>, details=row '...' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=<server1>,17020,1544264858183, seqNum=-1
... [re-initializing for at least ~7 minutes]
2018-12-08 05:04:17,271 INFO  [hconnection-0x4d58bcd4-shared-pool3-t1877] client.RpcRetryingCallerImpl: Call exception, tries=6, retries=61, started=41137 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server <server1>,17020,1544274145387 is not running yet
...
2018-12-08 05:11:18,470 ERROR [RpcServer.default.FPBQ.Fifo.handler=38,queue=3,port=17000] master.HMaster: ***** ABORTING master ...,17000,1544230401860: FAILED persisting region=... state=OPEN *****
{noformat}

There are no signs of meta assignment activity at all in the master logs.
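
For illustration only, here is a minimal sketch of the proposed master-side check. The interfaces and names below (MetaLocator, MetaAssigner, ServerKiller, reassignMetaAwayFrom, etc.) are hypothetical stand-ins, not the real HMaster/AssignmentManager API, and the sketch deliberately sidesteps the safety question above (whether meta can be reassigned before the RS is confirmed dead and its WAL handled).
{code:java}
import java.util.Objects;

/**
 * Hypothetical sketch only -- not the real HMaster code path. The interfaces
 * below stand in for the master's internal services (roughly ServerManager /
 * AssignmentManager in real HBase).
 */
public final class ProactiveMetaReassignSketch {

  /** Stand-in for the master's view of where hbase:meta currently lives. */
  interface MetaLocator {
    /** Returns the server currently hosting hbase:meta, or null if unknown. */
    String metaServer();
  }

  /** Stand-in for whatever schedules a meta assignment procedure. */
  interface MetaAssigner {
    void reassignMetaAwayFrom(String deadOrDyingServer);
  }

  /** Stand-in for the existing "kill this RS" path on the master. */
  interface ServerKiller {
    void expireServer(String serverName);
  }

  private final MetaLocator locator;
  private final MetaAssigner assigner;
  private final ServerKiller killer;

  ProactiveMetaReassignSketch(MetaLocator locator, MetaAssigner assigner, ServerKiller killer) {
    this.locator = locator;
    this.assigner = assigner;
    this.killer = killer;
  }

  /**
   * Kill an RS; if it is the one hosting hbase:meta, schedule meta
   * reassignment right away instead of waiting for the (possibly very slow)
   * abort/restart cycle seen in the logs above. If reassignment is only safe
   * once the RS is actually dead, this call would instead be made from the
   * point where the server is confirmed dead.
   */
  void killRegionServer(String serverName) {
    killer.expireServer(serverName);
    if (Objects.equals(serverName, locator.metaServer())) {
      // Proactive part: do not wait for the RS abort to finish.
      assigner.reassignMetaAwayFrom(serverName);
    }
  }
}
{code}
The complementary RS-side change (abort quickly and skip cleanup when hosting meta) would live in the region server abort path and is not sketched here.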

  was:
Master has killed an RS that was hosting meta due to some internal error (still need to see if it's a separate bug or just a machine/HDFS issue, I've lost the RS logs due to HBASE-21575).
RS took a very long time to die (again, might be a separate bug, I'll file if I see repro), and a long time to restart; meanwhile master never tried to reassign meta, and eventually killed itself not being able to update it.
It seems like a RS on a bad machine would be especially prone to slow abort/startup, as well as to issues causing master to kill it, so it would make sense for master to immediately relocate meta once meta-hosting RS is dead; or even when killing the RS. In the former case (if the RS needs to die for meta to be reassigned safely), perhaps the RS hosting meta in particular should try to die fast in such circumstances, and not do any cleanup.
{noformat}
2018-12-08 04:52:55,144 WARN  [RpcServer.default.FPBQ.Fifo.handler=39,queue=4,port=17000] master.MasterRpcServices: <server1>,17020,1544264858183 reported a fatal error:
***** ABORTING region server <server1>,17020,1544264858183: Replay of WAL required. Forcing server shutdown *****
.... [aborting for ~7 minutes]
2018-12-08 04:53:44,190 INFO  [PEWorker-7] client.RpcRetryingCallerImpl: Call exception, tries=6, retries=61, started=41190 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.regionserver.RegionServerAbortedException: Server <server1>,17020,1544264858183 aborting, details=row '...' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=<server1>,17020,1544264858183, seqNum=-1
... [starting for ~5]
2018-12-08 04:59:58,574 INFO  [RpcServer.default.FPBQ.Fifo.handler=45,queue=0,port=17000] client.RpcRetryingCallerImpl: Call exception, tries=10, retries=61, started=392702 ms ago, cancelled=false, msg=Call to <server1> failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.ConnectTimeoutException: connection timed out: <server1>, details=row '...' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=<server1>,17020,1544264858183, seqNum=-1
... [re-initializing for at least ~7]
2018-12-08 05:04:17,271 INFO  [hconnection-0x4d58bcd4-shared-pool3-t1877] client.RpcRetryingCallerImpl: Call exception, tries=6, retries=61, started=41137 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.ipc.ServerNotRunningYetException: Server <server1>,17020,1544274145387 is not running yet
...
2018-12-08 05:11:18,470 ERROR [RpcServer.default.FPBQ.Fifo.handler=38,queue=3,port=17000] master.HMaster: ***** ABORTING master ...,17000,1544230401860: FAILED persisting region=... state=OPEN *****
{noformat}

There are no signs of meta assignment activity at all in master logs


> master should proactively reassign meta when killing a RS with it
> -----------------------------------------------------------------
>
>                 Key: HBASE-21576
>                 URL: https://issues.apache.org/jira/browse/HBASE-21576
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Priority: Major
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)