Posted to issues@hbase.apache.org by "stack (JIRA)" <ji...@apache.org> on 2018/09/06 00:58:00 UTC

[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

    [ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605123#comment-16605123 ] 

stack commented on HBASE-21035:
-------------------------------

Late to the party....

I've been playing with removing the procedure WALs and doing other damage to the cluster to see how well we recover. I ran into the issue that [~allan163] describes, where there is no assign for the meta region on startup. In my case, I'd removed the procedure WAL dirs as Allan does in his test but, unlike in Allan's case, there was no WAL dir for the server that had been carrying meta. I think it'd been removed across restarts (I didn't check), but I was able to repro by manually removing the empty WAL dir (empty because there'd been a clean shutdown).

After reading the healthy back and forth above: the system seems pretty robust as is (I had trouble breaking it by removing stuff), and Allan's patch would catch a particular case not covered now, but I agree that we need the "assign meta" fix-it in our hbck2 vocabulary. Let me add scheduling of a meta assign (and log search and recovery) as an hbck2 option to the list of fix-its we need in hbck2, as we discussed in person a few weeks back.

In my investigations, it seems like we need something similar for the hbase:namespace table. It can get banjaxed in the same way, and if it is not online the cluster is a mess.

Master initialization gets stuck trying to read from meta to populate the TableStates. It later gets stuck trying to initialize the TableNamespaceManager if the namespace table is not online.

I filed HBASE-21156.

> Meta Table should be able to online even if all procedures are lost
> -------------------------------------------------------------------
>
>                 Key: HBASE-21035
>                 URL: https://issues.apache.org/jira/browse/HBASE-21035
>             Project: HBase
>          Issue Type: Sub-task
>    Affects Versions: 2.1.0
>            Reporter: Allan Yang
>            Assignee: Allan Yang
>            Priority: Major
>         Attachments: HBASE-21035.branch-2.0.001.patch
>
>
> After HBASE-20708, we changed the way we init after the master starts. It will only check the WAL dirs and compare them to the ZooKeeper RS nodes to decide which servers need to expire. For servers whose dir ends with 'SPLITTING', we assume that there will be an SCP for it.
> But if the server with the meta region crashed before the master restarts, and all the procedure WALs are lost (due to a bug, manual deletion, whatever), the newly restarted master will be stuck when initing, since no one will bring the meta region online.
> Although it is an anomalous case, I think that no matter what happens, we need to online the meta region. Otherwise we are sitting ducks; nothing can be done.
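
To make the two failure cases in this thread concrete, here is a toy model of the startup check as the description above reads. It is a sketch under assumptions, not HBase code: the class and method names are hypothetical, the "-splitting" suffix is an assumed spelling of the 'SPLITTING' marker, and the wording in the description is read as "we assume an SCP already exists" for a splitting dir.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the startup check described in the issue. Not HBase code; all names are hypothetical. */
class MetaStartupSketch {
    // Assumed marker on a WAL dir whose server is already being recovered.
    static final String SPLITTING_SUFFIX = "-splitting";

    /**
     * Servers the master would expire on startup in this model: those that left a
     * plain (non-splitting) WAL dir behind but have no live ZooKeeper RS node.
     * Splitting dirs are skipped because an existing SCP is assumed to cover them.
     */
    static Set<String> serversToExpire(List<String> walDirs, Set<String> liveInZk) {
        Set<String> dead = new LinkedHashSet<>();
        for (String dir : walDirs) {
            if (dir.endsWith(SPLITTING_SUFFIX)) {
                continue; // assumed to be handled by an SCP stored in the procedure WALs
            }
            if (!liveInZk.contains(dir)) {
                dead.add(dir); // WAL dir name doubles as the server name here
            }
        }
        return dead;
    }

    /** True if, in this model, something would bring meta online after the restart. */
    static boolean metaGetsAssigned(String metaServer, List<String> walDirs,
                                    Set<String> liveInZk, boolean procedureWalsIntact) {
        if (liveInZk.contains(metaServer)) {
            return true; // the server carrying meta is still alive
        }
        if (serversToExpire(walDirs, liveInZk).contains(metaServer)) {
            return true; // a new SCP is scheduled and reassigns meta
        }
        // A splitting dir only helps if the SCP it implies survived in the procedure WALs.
        return walDirs.contains(metaServer + SPLITTING_SUFFIX) && procedureWalsIntact;
    }
}
```

In this model, Allan's case (a splitting dir for the meta server, procedure WALs lost) and the case from the comment above (no WAL dir at all for the meta server) both return false from `metaGetsAssigned`, which is the "no one will bring meta online" situation. An ordinary crash that leaves a plain WAL dir behind returns true.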



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)