Posted to dev@hbase.apache.org by "Rushabh Shah (Jira)" <ji...@apache.org> on 2022/09/21 20:55:00 UTC

[jira] [Created] (HBASE-27383) Add dead region server to SplitLogManager#deadWorkers set as the first step.

Rushabh Shah created HBASE-27383:
------------------------------------

             Summary: Add dead region server to SplitLogManager#deadWorkers set as the first step.
                 Key: HBASE-27383
                 URL: https://issues.apache.org/jira/browse/HBASE-27383
             Project: HBase
          Issue Type: Bug
    Affects Versions: 1.7.2, 1.6.0
            Reporter: Rushabh Shah
            Assignee: Rushabh Shah


Currently we add a dead region server to the +SplitLogManager#deadWorkers+ set in the SERVER_CRASH_SPLIT_LOGS state. 
Consider a case where a region server is handling the split log task for the hbase:meta table, and SplitLogManager has exhausted all of its retries and won't resubmit the task to any other region server. 
The region server that is handling the split log task then dies. 
SplitLogManager has a check for this: if a region server is declared dead and it is responsible for a split log task, the task is forcefully resubmitted. See the code [here|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/SplitLogManager.java#L721-L726].

But we only add a region server to the SplitLogManager#deadWorkers set in the [SERVER_CRASH_SPLIT_LOGS|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java#L252] state. 
Before that, the procedure runs the [SERVER_CRASH_GET_REGIONS|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/ServerCrashProcedure.java#L214] state and checks whether the hbase:meta table is online. In this case, the hbase:meta table was not online, which prevented SplitLogManager from adding this RS to the deadWorkers set. This created a deadlock, and the HBase cluster was completely down for an extended period of time until we failed over the active HMaster. See HBASE-27382 for more details.

Improvements:
1.  We should add a dead region server to the +SplitLogManager#deadWorkers+ set as the first step.
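
To illustrate why the ordering matters, here is a toy model of the scenario described above. It is a sketch under stated assumptions, not HBase code: the class and method names (DeadWorkerOrderingSketch, ToySplitLogManager, markDead, tryResubmit, crashRecoveryCompletes) are hypothetical, and it assumes, as the deadlock in this report implies, that hbase:meta can come online only after the stuck split log task has been resubmitted to a live region server.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the ordering problem; all names here are hypothetical, not HBase APIs. */
final class DeadWorkerOrderingSketch {

    /** Simplified stand-in for SplitLogManager tracking a single hbase:meta split task. */
    static final class ToySplitLogManager {
        private final Set<String> deadWorkers = new HashSet<>();
        private final boolean retriesExhausted;
        private String taskOwner;

        ToySplitLogManager(String taskOwner, boolean retriesExhausted) {
            this.taskOwner = taskOwner;
            this.retriesExhausted = retriesExhausted;
        }

        /** Records that a server is dead (the step this issue proposes to run first). */
        void markDead(String serverName) {
            deadWorkers.add(serverName);
        }

        /**
         * Models the resubmit check: once retries are exhausted, the task moves
         * only if its current owner is already in the deadWorkers set.
         */
        boolean tryResubmit(String liveServer) {
            boolean ownerKnownDead = deadWorkers.contains(taskOwner);
            if (retriesExhausted && !ownerKnownDead) {
                return false; // task stays stuck on the dead server
            }
            taskOwner = liveServer;
            return true;
        }
    }

    /** Returns true if the simulated crash handling gets past the hbase:meta check. */
    static boolean crashRecoveryCompletes(boolean markDeadFirst) {
        String deadServer = "rs1";
        String liveServer = "rs2";
        ToySplitLogManager slm = new ToySplitLogManager(deadServer, true);

        if (markDeadFirst) {
            slm.markDead(deadServer); // proposed ordering: first step of crash handling
        }

        // Assumption: hbase:meta comes online only once the task has been resubmitted.
        boolean metaOnline = slm.tryResubmit(liveServer);

        // SERVER_CRASH_GET_REGIONS: blocked while hbase:meta is offline.
        if (!metaOnline) {
            return false; // SERVER_CRASH_SPLIT_LOGS is never reached
        }

        // SERVER_CRASH_SPLIT_LOGS: where the dead server is added today.
        slm.markDead(deadServer);
        return true;
    }

    public static void main(String[] args) {
        System.out.println("current order recovers:  " + crashRecoveryCompletes(false));
        System.out.println("proposed order recovers: " + crashRecoveryCompletes(true));
    }
}
```

In this model the current ordering reports false (the task is never resubmitted, so the hbase:meta check never passes), while marking the server dead first reports true. The real SplitLogManager and ServerCrashProcedure involve timers, ZooKeeper state and procedure retries that this sketch leaves out.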



