Posted to issues@hbase.apache.org by "Bharath Vissapragada (Jira)" <ji...@apache.org> on 2021/01/06 02:01:00 UTC

[jira] [Commented] (HBASE-25032) Wait for region server to become online before adding it to online servers in Master

    [ https://issues.apache.org/jira/browse/HBASE-25032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17259347#comment-17259347 ] 

Bharath Vissapragada commented on HBASE-25032:
----------------------------------------------

Thanks [~caroliney14] for pinging me on the PR. I just caught up with the comments and read the code.

bq. The approach taken in the PRs is to leave the RS reportForDuty/handleReportForDuty logic as is, and change the Master-side logic so that Master asynchronously polls 

I think this complicates the state machine further, and I don't think we should take this approach. Essentially, the root problem is that the RS is not actually ready for duty when it makes the `regionServerStartup()` RPC. This mismatch between master and regionserver needs to be fixed: the RS should be marked online (in master's state) only when it is actually online and ready to accept region requests.

However, regionServerStartup() does more than its name says: it also returns some configuration that the RS should use for init. So I think we should break the initialization of the RS into the following steps.

1. RPC getRegionServerStartupConfiguration()
2. Do the usual init (ZK ephemeral node, replication init etc)
3. RPC handleReportForDuty()

Break down the RPCs accordingly. This should work, right?
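
To make the ordering concrete, here is a rough sketch of the three steps above. It is only an illustration under stated assumptions: the `Master` and `RegionServer` classes are simplified in-memory stand-ins rather than the real HBase classes, the two methods reuse the RPC names proposed in this comment and do not exist in HBase today, and the configuration entry is a placeholder.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class StartupHandshakeSketch {

    /** Hypothetical stand-in for the Master; not the real HBase classes. */
    static class Master {
        private final Set<String> onlineServers = new HashSet<>();

        // Step 1: return startup configuration only. The server is NOT marked online here.
        Map<String, String> getRegionServerStartupConfiguration(String serverName) {
            Map<String, String> conf = new HashMap<>();
            conf.put("example.key", "example-value"); // placeholder, not a real HBase setting
            return conf;
        }

        // Step 3: the RS has finished init, so only now is it added to the online servers.
        void handleReportForDuty(String serverName) {
            onlineServers.add(serverName);
        }

        boolean isOnline(String serverName) {
            return onlineServers.contains(serverName);
        }
    }

    /** Hypothetical stand-in for the region server. */
    static class RegionServer {
        final String name;
        boolean initialized = false;

        RegionServer(String name) {
            this.name = name;
        }

        // Step 2: placeholder for the usual init (ZK ephemeral node, replication init, ...).
        void initialize(Map<String, String> conf) {
            initialized = true;
        }

        void start(Master master) {
            Map<String, String> conf = master.getRegionServerStartupConfiguration(name); // step 1
            initialize(conf);                                                             // step 2
            master.handleReportForDuty(name);                                             // step 3
        }
    }

    public static void main(String[] args) {
        Master master = new Master();
        RegionServer rs = new RegionServer("rs1");

        // Fetching the configuration alone must not make the server assignable.
        master.getRegionServerStartupConfiguration(rs.name);
        System.out.println("online after config fetch: " + master.isOnline(rs.name));

        rs.start(master);
        System.out.println("online after report for duty: " + master.isOnline(rs.name));
    }
}
```

The point of the split is that a slow step 2 (for example a stuck replication init) only delays step 3. The master never sees the server as online in the meantime, so it never attempts an assignment that would fail.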

> Wait for region server to become online before adding it to online servers in Master
> ------------------------------------------------------------------------------------
>
>                 Key: HBASE-25032
>                 URL: https://issues.apache.org/jira/browse/HBASE-25032
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sandeep Guggilam
>            Assignee: Caroline
>            Priority: Major
>
> As part of RS startup, the RS reports for duty to the Master. The Master acknowledges the request and adds the RS to the onlineServers list so that regions can be assigned to it.
> Once the Master acknowledges the reportForDuty and sends back the response, the RS does further initialization, such as setting up replication sources, before becoming online. However, initializing replication sources can stall when the RS is unable to connect to peer clusters because of a Kerberos configuration issue, which delays the RS becoming online by around 20 minutes.
>  
> Since the master already considers the RS online, it tries to assign regions, which fails with a ServerNotRunningYet exception. The master then tries to unassign, which fails with the same exception, leaving the region in the FAILED_CLOSE state.
>  
> It would be good to check whether the RS is ready to accept assignment requests before adding it to the online servers list, which would account for delays like the one described above.


