Posted to dev@hbase.apache.org by "Jack Ye (Jira)" <ji...@apache.org> on 2020/04/29 17:21:00 UTC

[jira] [Created] (HBASE-24286) HMaster won't become healthy after cloning or creating a new cluster pointing at the same file system

Jack Ye created HBASE-24286:
-------------------------------

             Summary: HMaster won't become healthy after cloning or creating a new cluster pointing at the same file system
                 Key: HBASE-24286
                 URL: https://issues.apache.org/jira/browse/HBASE-24286
             Project: HBase
          Issue Type: Bug
          Components: master, Region Assignment
    Affects Versions: 2.2.3
            Reporter: Jack Ye


h1. How to reproduce:
 # A user starts an HBase cluster on top of a file system.
 # The user performs some operations and shuts down the cluster; all data remains persisted in the file system.
 # The user creates a new HBase cluster, using a different set of servers, on top of the same file system with the same root directory.
 # HMaster cannot initialize.

h1. Root cause:

During the HMaster initialization phase, the following happens:
 # HMaster waits for the namespace table to come online.
 # AssignmentManager loads the region info for all namespace table regions.
 # The region servers recorded for the namespace table regions are already dead, so the online check fails.
 # HMaster keeps waiting for the namespace regions to come online, retrying up to 1000 times, which in practice means forever.

Code that waits for the namespace table to be online: https://github.com/apache/hbase/blob/rel/2.2.3/hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java#L1102
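
The steps above can be pictured with a minimal, self-contained Java sketch. It does not use the HBase API and is not the HMaster code: RegionState, isRegionOnline and waitForRegionOnline are hypothetical stand-ins, the retry bound and sleep interval are illustrative parameters, and the "new" server name is made up. It only shows why a region whose persisted state is OPEN on a server from the old cluster can never pass an online check in the new cluster.

```java
import java.util.Set;

class NamespaceOnlineCheckSketch {

    /** Hypothetical stand-in for the region state the master reloads from persisted data. */
    static final class RegionState {
        final String encodedName;
        final boolean opened;
        final String serverName; // "host,port,startcode" of the last known hosting server

        RegionState(String encodedName, boolean opened, String serverName) {
            this.encodedName = encodedName;
            this.opened = opened;
            this.serverName = serverName;
        }
    }

    /** A region counts as online only if it is OPEN and its server is currently live. */
    static boolean isRegionOnline(RegionState rs, Set<String> onlineServers) {
        return rs.opened && onlineServers.contains(rs.serverName);
    }

    /** Bounded wait; returns false once the attempts are exhausted or the thread is interrupted. */
    static boolean waitForRegionOnline(RegionState rs, Set<String> onlineServers,
                                       int maxAttempts, long sleepMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (isRegionOnline(rs, onlineServers)) {
                return true;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // State as persisted by the old cluster (values taken from the log line in this report).
        RegionState namespace = new RegionState(
            "d34b65b91a52644ed3e77c5fbb065c2b", true,
            "ip-10-12-13-14.ec2.internal,16020,1587628031614");
        // Servers of the new cluster (hypothetical name, for illustration only).
        Set<String> onlineServers = Set.of("ip-10-12-13-99.ec2.internal,16020,1587629900000");

        boolean online = waitForRegionOnline(namespace, onlineServers, 3, 10L);
        System.out.println("namespace region online: " + online);
    }
}
```

In this sketch the stale server name never appears in the set of live servers, so no number of retries helps; the wait can only succeed if something reassigns the region to a live server first.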
h1. Stack trace (running on S3):

2020-04-23 08:15:57,185 WARN [master/ip-10-12-13-14:16000:becomeActiveMaster] master.HMaster: hbase:namespace,,1587628169070.d34b65b91a52644ed3e77c5fbb065c2b. is NOT online; state={d34b65b91a52644ed3e77c5fbb065c2b state=OPEN, ts=1587629742129, server=ip-10-12-13-14.ec2.internal,16020,1587628031614}; ServerCrashProcedures=false. Master startup cannot progress, in holding-pattern until region onlined.

Here, ip-10-12-13-14.ec2.internal is the old region server that hosted the hbase:namespace region.
h1. Discussion for the fix

There is already a fix for this on branch-3: https://issues.apache.org/jira/browse/HBASE-21154. Before we provide a patch, we would like to hear from the community whether we should backport that change to branch-2 or apply a separate fix with minimal code changes.


