Posted to issues@hbase.apache.org by "Bahram Chehrazy (JIRA)" <ji...@apache.org> on 2019/02/07 19:00:00 UTC

[jira] [Issue Comment Deleted] (HBASE-21844) Master could get stuck in initializing state while waiting for meta

     [ https://issues.apache.org/jira/browse/HBASE-21844?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bahram Chehrazy updated HBASE-21844:
------------------------------------
    Comment: was deleted

(was: I have a similar repro of this. This time the meta is stuck in the OPENING state, but the server carrying the meta has already come online. The *OpenRegionProcedure* doesn't seem to be reporting back, because I don't see any more lines with *pid=32701*. Of course, this patch doesn't handle this case.

2019-02-06 03:11:32,421 INFO  [PEWorker-2] procedure.ServerCrashProcedure: Start *pid=32695*, state=RUNNABLE:*SERVER_CRASH_START*, hasLock=true; ServerCrashProcedure server=************,16020,1549448004110, splitWal=true, meta=true

2019-02-06 03:11:37,314 INFO  [PEWorker-11] assignment.TransitRegionStateProcedure: Starting *pid=32700*, ppid=32695, state=RUNNABLE:REGION_*STATE_TRANSITION_GET_ASSIGN_CANDIDATE*, hasLock=true; TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; rit=CLOSING, location=***********,16020,1549448004110; forceNewPlan=false, retain=false

2019-02-06 03:11:37,519 INFO  [PEWorker-7] zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in ZooKeeper as *<server1>,16020,1549450371876*

2019-02-06 03:11:37,525 INFO  [PEWorker-7] procedure2.ProcedureExecutor: Initialized subprocedures=[{*pid=32701*, *ppid=32700*, state=RUNNABLE, hasLock=false; org.apache.hadoop.hbase.master.assignment.*OpenRegionProcedure*}]

2019-02-06 03:11:39,728 WARN  [master/*************:16000:becomeActiveMaster] master.HMaster: hbase:meta,,1.1588230740 is NOT online; state={1588230740 state=*OPENING*, ts=1549451497525, server=*<server1>,16020,1549450371876*}; ServerCrashProcedures=*true*. Master startup cannot progress, in holding-pattern until region onlined.)

> Master could get stuck in initializing state while waiting for meta
> -------------------------------------------------------------------
>
>                 Key: HBASE-21844
>                 URL: https://issues.apache.org/jira/browse/HBASE-21844
>             Project: HBase
>          Issue Type: Bug
>          Components: master, meta
>    Affects Versions: 3.0.0
>            Reporter: Bahram Chehrazy
>            Assignee: Bahram Chehrazy
>            Priority: Major
>         Attachments: 0001-HBASE-21844-Handling-incorrect-Meta-state-on-Zookeep.patch
>
>
> If the active master crashes after the meta server dies, there is a slight chance of the master getting into a state where ZooKeeper says meta is OPEN, but the server is dead and there is no active SCP to recover it (perhaps the SCP was aborted and the procWALs were corrupted). In this case, waitForMetaOnline never returns.
>  
> We've seen this happen a few times when there had been a temporary HDFS outage. The following log line shows this state.
>  
> 2019-01-17 18:55:48,497 WARN  [master/************:16000:becomeActiveMaster] master.HMaster: hbase:meta,,1.1588230740 is NOT online; state={1588230740 *state=OPEN*, ts=1547780128227, server=*************,16020,1547776821322}; *ServerCrashProcedures=false*. Master startup cannot progress, in holding-pattern until region onlined.
>  
> I'm still investigating why this happens and how to prevent getting into this bad state, but nevertheless the master should be able to recover during a restart by initiating a new SCP to fix the meta (a rough sketch of this recovery idea follows below).
>  
>  
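Below is a minimal, hedged sketch of the recovery idea described in the report: at active-master startup, if ZooKeeper reports hbase:meta as OPEN on a server that is no longer live and no ServerCrashProcedure (SCP) is pending for it, schedule one so meta gets reassigned instead of the master blocking forever in waitForMetaOnline. This is not the attached patch; the class and all helper methods (getMetaLocationFromZk, isServerLive, hasPendingScpFor, scheduleServerCrashProcedure) are hypothetical placeholders, not actual HBase APIs.

{code:java}
// Hedged sketch of the recovery check, not the attached patch.
// Every name below is a hypothetical illustration, not an HBase API.
import org.apache.hadoop.hbase.ServerName;

public class MetaRecoveryCheck {

  /**
   * Intended to run during active-master startup, before blocking on meta:
   * if ZK says hbase:meta is OPEN on a server that is no longer live and no
   * SCP is pending for it, schedule one so meta gets reassigned instead of
   * the master waiting forever.
   */
  public void recoverMetaIfCarrierIsDead() {
    ServerName metaServer = getMetaLocationFromZk();  // hypothetical: read hbase:meta location from ZooKeeper
    if (metaServer == null) {
      return;                                         // no recorded location; nothing to repair here
    }
    if (!isServerLive(metaServer) && !hasPendingScpFor(metaServer)) {
      // The bad state from the report: OPEN in ZK, dead carrier, no SCP.
      scheduleServerCrashProcedure(metaServer);       // hypothetical: enqueue an SCP that carries meta
    }
  }

  // Stubs standing in for master-side lookups; real code would consult ZK,
  // the live-server list, and the procedure executor.
  private ServerName getMetaLocationFromZk() { return null; }
  private boolean isServerLive(ServerName sn) { return false; }
  private boolean hasPendingScpFor(ServerName sn) { return false; }
  private void scheduleServerCrashProcedure(ServerName sn) { }
}
{code}

The actual fix attached to this issue may take a different approach; the sketch only illustrates the "dead carrier, no SCP" condition the report says the master should detect and repair on restart.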



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)