You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hbase.apache.org by "Allan Yang (JIRA)" <ji...@apache.org> on 2019/01/21 04:18:00 UTC
[jira] [Commented] (HBASE-21742) master can create bad procedures during abort, making entire cluster unusable

    [ https://issues.apache.org/jira/browse/HBASE-21742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16747647#comment-16747647 ] 

Allan Yang commented on HBASE-21742:
------------------------------------

What Version of HBase are you using? I don't recall the issue number, in previous versions, the default retry number for region assign is 10 times, so if the SCP fails 10 times, it will give up. And since the WAL dirs are already deleted, no SCP later will handle the assignment even if restart master or RS.
For the newest version of HBase, the retry number now changed to Integer.MAX, so it won't be a problem.
For the cluster already hit the problem, HBCK2 is required, or just delete the meta-region ZNode in the zk, and restart the master, it will re-trigger a assignment of meta table.

> master can create bad procedures during abort, making entire cluster unusable
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-21742
>                 URL: https://issues.apache.org/jira/browse/HBASE-21742
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Priority: Critical
>
> Some small HDFS hiccup causes master and meta RS to fail together. Master goes first:
> {noformat}
> 2019-01-18 08:09:46,790 INFO  [KeepAlivePEWorker-311] zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in ZooKeeper as meta-rs,17020,1547824792484
> ...
> 2019-01-18 10:01:16,904 ERROR [PEWorker-11] master.HMaster: ***** ABORTING master master,17000,1547604554447: FAILED [blah] *****
> ...
> 2019-01-18 10:01:17,087 INFO  [master/master:17000] assignment.AssignmentManager: Stopping assignment manager
> {noformat}
> Bunch of stuff keeps happening, including procedure retries, which is also suspect, but not the point here:
> {noformat}
> 2019-01-18 10:01:21,598 INFO  [PEWorker-3] procedure2.TimeoutExecutorThread: ADDED pid=104031, state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, ... {noformat}
> {noformat}
> Then the meta RS decides it's time to go:
> {noformat}
> 2019-01-18 10:01:25,319 INFO  [RegionServerTracker-0] master.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [meta-rs,17020,1547824792484]
> ...
> 2019-01-18 10:01:25,463 INFO  [RegionServerTracker-0] assignment.AssignmentManager: Added meta-rs,17020,1547824792484 to dead servers which carryingMeta=false, submitted ServerCrashProcedure pid=104313
> {noformat}
> This SCP gets persisted, so when the next master starts, it waits forever for meta to be onlined, while there's no SCP with meta=true to online it.
> The only way around this is to delete the procv2 WAL - master has all the information here, as it often does in bugs I've found recently, but some split brain procedures cause it to get stuck one way or another.
> I will file a separate bug about that.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)