Posted to issues@hbase.apache.org by "Caroline (Jira)" <ji...@apache.org> on 2020/12/01 20:42:00 UTC

[jira] [Comment Edited] (HBASE-25032) Wait for region server to become online before adding it to online servers in Master

    [ https://issues.apache.org/jira/browse/HBASE-25032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17241847#comment-17241847 ] 

Caroline edited comment on HBASE-25032 at 12/1/20, 8:41 PM:
------------------------------------------------------------

[~anoop.hbase] [~apurtell]

Had some discussion with [~sandeep.guggilam] about possible fixes for this issue:
 # Move `reportForDuty()` to after replication setup in `HRegionServer.java`. I think this means moving [this line|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java#L1546] out of the `handleReportForDutyResponse()` method and placing it above the `reportForDuty()` call [here|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/HRegionServer.java#L1033].
 # Postpone adding the regionserver to the master's online servers list until the regionserver's `online` flag has been set to true (i.e. all of the regionserver's initialization steps have completed). I believe this means replacing [this line|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java#L280] with a thread or thread pool executor that asynchronously polls the regionserver (by calling [ServerManager.isServerReachable()|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/ServerManager.java#L998]) and only calls `ServerManager.checkAndRecordNewServer()` once a response is received. We could create a new single-thread executor every time `ServerManager.regionServerStartup()` is called, use the `MASTER_SERVER_OPERATIONS` service thread, or create a new executor service or thread pool with a configurable number of threads for this kind of task. Any thoughts on how we should configure the thread pool here?
 # Do not force the region state to offline in the bulk assign method [here|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java#L1762]. I haven't investigated the implications of this.
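
To make option 2 concrete, here is a rough, standalone sketch of the "poll, then register" idea using a shared scheduled pool with a configurable thread count. It is not HBase code: the class `DeferredServerRegistration`, its methods, and the `isReady` predicate (a stand-in for a reachability probe such as `isServerReachable()`) are all hypothetical names made up for illustration, and a real change would also need a deadline or retry limit and handling for servers that die while being polled.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Predicate;

/** Hypothetical sketch only; none of these names are HBase APIs. */
public class DeferredServerRegistration {
  private final Set<String> onlineServers = ConcurrentHashMap.newKeySet();
  private final ScheduledExecutorService poller;
  private final Predicate<String> isReady; // stand-in for a reachability probe
  private final long pollIntervalMs;

  public DeferredServerRegistration(int threads, long pollIntervalMs,
      Predicate<String> isReady) {
    this.poller = Executors.newScheduledThreadPool(threads);
    this.pollIntervalMs = pollIntervalMs;
    this.isReady = isReady;
  }

  /** Called at the point where the server would otherwise be recorded immediately. */
  public void onServerStartup(String serverName) {
    poller.execute(() -> probe(serverName));
  }

  /** Records the server once the probe succeeds; otherwise schedules another probe. */
  private void probe(String serverName) {
    boolean ready;
    try {
      ready = isReady.test(serverName);
    } catch (RuntimeException e) {
      ready = false; // treat a failed probe as "not ready yet"
    }
    if (ready) {
      onlineServers.add(serverName);
    } else {
      poller.schedule(() -> probe(serverName), pollIntervalMs, TimeUnit.MILLISECONDS);
    }
  }

  public boolean isOnline(String serverName) {
    return onlineServers.contains(serverName);
  }

  /** Test helper: waits up to timeoutMs for the server to be recorded as online. */
  public boolean awaitOnline(String serverName, long timeoutMs) {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (!isOnline(serverName) && System.currentTimeMillis() < deadline) {
      try {
        Thread.sleep(5);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break;
      }
    }
    return isOnline(serverName);
  }

  public void shutdown() {
    poller.shutdownNow();
  }

  public static void main(String[] args) {
    // Simulated regionserver that only reports ready on the third probe.
    AtomicInteger probes = new AtomicInteger();
    DeferredServerRegistration reg =
        new DeferredServerRegistration(1, 10, name -> probes.incrementAndGet() >= 3);
    reg.onServerStartup("rs1");
    System.out.println("online right after startup: " + reg.isOnline("rs1"));
    System.out.println("online after polling: " + reg.awaitOnline("rs1", 5000));
    reg.shutdown();
  }
}
```

A shared pool sized by configuration avoids creating an executor per `regionServerStartup()` call, at the cost of slower registration if many servers start at once and the pool is small.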



> Wait for region server to become online before adding it to online servers in Master
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-25032
>                 URL: https://issues.apache.org/jira/browse/HBASE-25032
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sandeep Guggilam
>            Assignee: Caroline
>            Priority: Major
>
> As part of RS startup, the RS reports for duty to the Master. The Master acknowledges the request and adds the RS to the onlineServers list so that regions can be assigned to it.
> Once the Master acknowledges the reportForDuty and sends back the response, the RS performs further initialization, such as setting up replication sources, before becoming online. However, initializing replication sources can sometimes stall when the RS is unable to connect to peer clusters because of a Kerberos configuration issue, delaying the RS from becoming online by around 20 minutes.
>  
> Since the Master already considers the RS online, it tries to assign regions, which fails with a ServerNotRunningYet exception. The Master then tries to unassign, which fails with the same exception and leaves the region in FAILED_CLOSE state.
>  
> It would be good to check whether the RS is ready to accept assignment requests before adding it to the online servers list, which would account for delays such as the one described above.


