Posted to issues@hbase.apache.org by "Sergey Shelukhin (JIRA)" <ji...@apache.org> on 2019/02/05 20:48:00 UTC

[jira] [Comment Edited] (HBASE-21844) Master could get stuck in initializing state while waiting for meta

    [ https://issues.apache.org/jira/browse/HBASE-21844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761197#comment-16761197 ] 

Sergey Shelukhin edited comment on HBASE-21844 at 2/5/19 8:47 PM:
------------------------------------------------------------------

We are basically running master (that's the version [~bahramch] is referring to, ".0.4" is a gimmick due to some internal build system restrictions). In lieu of HBCK it's me (and others) doing manual recovery by updating meta, restarting master, and nuking proc WAL in various combinations :)

My 50kft-level take is that we don't need procedures for anything less complicated than at least a split-merge (i.e. all of the assignment). At the 20kft level, I think it's a good idea to make master startup in particular resilient to procedure issues when the state is redundant, because startup is where many bugs/races manifest; in our experience that would allow the vast majority of issues to be fixed by a master restart, without extra cluster downtime or manual recovery.
At the 10kft level, there is this patch ;)
Perhaps we should put it behind a general purpose config, so that this and any other such features could be enabled together, and be off by default.
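
To illustrate the "general purpose config" idea, here is a minimal sketch, assuming a single boolean switch that every such startup-recovery feature consults. The key name and the class are hypothetical (they are not existing HBase properties or APIs), and a plain Map stands in for the real configuration object.

```java
import java.util.Map;

// Hypothetical sketch: one shared, off-by-default switch for lenient startup recovery.
final class StartupRecoveryConfig {
    // Assumed key name for illustration only; not an existing HBase property.
    static final String KEY = "hbase.master.startup.lenient.recovery";

    // Returns false unless the key is explicitly set to "true".
    static boolean lenientRecoveryEnabled(Map<String, String> conf) {
        return Boolean.parseBoolean(conf.getOrDefault(KEY, "false"));
    }
}
```

Each recovery feature would check lenientRecoveryEnabled(conf) before acting, so they are all enabled together and stay off when the key is absent.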



was (Author: sershe):
We are basically running master (that's the version [~bahramch] is referring to, ".0.4" is a gimmick due to some internal build system restrictions). In lieu of HBCK it's me (and others) doing manual recovery by updating meta, restarting master, and nuking proc WAL in various combinations :)

My 50k ft level take is that we don't need procedures anything less complicated than at least a split (i.e. all of the assignment).. 20kft I think it's a good idea to make master startup in particular resilient to procedure issues when the state is redundant, because it's the place where many bugs/races manifest, and in our experience that would allow vast majority of issues to be fixed by master restart not resulting in extra cluster downtime or manual recovery. 
10kft level there is this patch ;)
Perhaps we should put it behind a general purpose config, so that this and any other such features could be enabled together, and be off by default.


> Master could get stuck in initializing state while waiting for meta
> -------------------------------------------------------------------
>
>                 Key: HBASE-21844
>                 URL: https://issues.apache.org/jira/browse/HBASE-21844
>             Project: HBase
>          Issue Type: Bug
>          Components: master, meta
>    Affects Versions: 3.0.0
>            Reporter: Bahram Chehrazy
>            Assignee: Bahram Chehrazy
>            Priority: Major
>         Attachments: 0001-HBASE-21844-Handling-incorrect-Meta-state-on-Zookeep.patch
>
>
> If the active master crashes after the meta server dies, there is a slight chance of the master getting into a state where ZK says meta is OPEN, but the server is dead and there is no active SCP to recover it (perhaps the SCP has aborted and the procWALs were corrupted). In this case waitForMetaOnline never returns.
>  
> We've seen this happen a few times after a temporary HDFS outage. The following log lines show this state.
>  
> 2019-01-17 18:55:48,497 WARN  [master/************:16000:becomeActiveMaster] master.HMaster: hbase:meta,,1.1588230740 is NOT online; state={1588230740 state=OPEN, ts=1547780128227, server=*************,16020,1547776821322}; ServerCrashProcedures=false. Master startup cannot progress, in holding-pattern until region onlined.
>  
> I'm still investigating why and how to prevent getting into this bad state, but nevertheless the master should be able to recover during a restart by initiating a new SCP to fix the meta.
>  
>  
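
The stuck condition described in the issue (ZK says meta is OPEN, the hosting server is dead, and no SCP is pending) can be expressed as a simple predicate. The following is only a sketch under stated assumptions: the class, enum, and method names are hypothetical, server names are plain strings, and the attached patch may well detect and handle this differently.

```java
import java.util.Set;

// Hypothetical sketch of the startup check; not HBase's actual API.
final class MetaStartupCheck {
    // Simplified stand-in for the meta region state recorded in ZooKeeper.
    enum MetaState { OPEN, OPENING, CLOSED, OFFLINE }

    // True when ZK reports meta as OPEN on a server that is not live
    // and no ServerCrashProcedure is pending to recover it.
    static boolean needsNewCrashProcedure(MetaState zkState, String metaServer,
                                          Set<String> liveServers, boolean scpPending) {
        return zkState == MetaState.OPEN
            && !liveServers.contains(metaServer)
            && !scpPending;
    }
}
```

When the predicate holds during startup, the master would schedule a new SCP for the dead server instead of waiting indefinitely for meta to come online.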


