
Posted to issues@hbase.apache.org by "Sergey Shelukhin (JIRA)" <ji...@apache.org> on 2019/05/01 01:53:00 UTC

[jira] [Commented] (HBASE-22081) master shutdown: close RpcServer and procWAL first thing

    [ https://issues.apache.org/jira/browse/HBASE-22081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830810#comment-16830810 ] 

Sergey Shelukhin commented on HBASE-22081:
------------------------------------------

[~Apache9] does this patch make sense to you? It moves closing the RpcServer and the proc WAL to the beginning of shutdown, to limit potential race conditions involving incorrect state or new requests.

> master shutdown: close RpcServer and procWAL first thing
> --------------------------------------------------------
>
>                 Key: HBASE-22081
>                 URL: https://issues.apache.org/jira/browse/HBASE-22081
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>            Priority: Major
>         Attachments: HBASE-22081.01.patch, HBASE-22081.02.patch, HBASE-22081.03.patch, HBASE-22081.patch
>
>
> I had a master get stuck due to HBASE-22079 and noticed it was logging RS abort messages during shutdown.
> [~bahramch] found some issues where messages are processed by the old master during shutdown due to a race condition in the RS cache (it could also happen due to a network race).
> Previously I found a bug where an SCP created during master shutdown had incorrect state (because some structures had already been cleaned up).
> I think before master fencing is implemented we can at least make these issues much less likely by thinking about shutdown order.
> 1) First kill the RPC server so we don't receive any more messages. There's no need to receive messages when we are shutting down. Server heartbeats could be impacted I guess, but I don't think they will be, because we currently only kill an RS on ZK timeout.
> 2) Then do whatever cleanup we think is needed that requires proc wal.
> 3) Then close proc WAL so no errant threads can create more procs.
> 4) Then do whatever other cleanup.
> 5) Finally delete znode.
> Right now znode is deleted somewhat early I think, and RpcServer is closed very late.
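
The five steps above can be sketched as a toy model. This is only an illustration of the proposed ordering, not the patch itself: the class `ShutdownOrderSketch` and all of its methods and fields are hypothetical and are not HBase APIs. It assumes the RPC server, proc WAL and znode can each be modelled as a single flag.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch of the proposed shutdown order. None of these names
 * are real HBase classes or methods.
 */
final class ShutdownOrderSketch {
    // Each flag stands in for a real component (assumption for illustration).
    private volatile boolean rpcOpen = true;
    private volatile boolean procWalOpen = true;
    private volatile boolean znodePresent = true;
    private final List<String> steps = new ArrayList<>();

    /** Stands in for any incoming RPC; rejected once the RPC server is stopped. */
    boolean handleRequest(String request) {
        return rpcOpen;
    }

    /** Stands in for creating a procedure (e.g. an SCP); rejected once the proc WAL is closed. */
    boolean submitProcedure(String name) {
        return procWalOpen;
    }

    boolean isZnodePresent() {
        return znodePresent;
    }

    /** Runs the five steps in the order proposed in the description. */
    List<String> shutdown() {
        rpcOpen = false;                            // 1) no more incoming messages
        steps.add("stop RPC server");
        steps.add("cleanup that needs proc WAL");   // 2) placeholder, proc WAL still open
        procWalOpen = false;                        // 3) errant threads can no longer create procs
        steps.add("close proc WAL");
        steps.add("other cleanup");                 // 4) placeholder
        znodePresent = false;                       // 5) znode goes last
        steps.add("delete znode");
        return new ArrayList<>(steps);
    }

    public static void main(String[] args) {
        ShutdownOrderSketch master = new ShutdownOrderSketch();
        System.out.println("request accepted before shutdown: " + master.handleRequest("heartbeat"));
        for (String step : master.shutdown()) {
            System.out.println(step);
        }
        System.out.println("request accepted after shutdown: " + master.handleRequest("heartbeat"));
        System.out.println("procedure accepted after shutdown: " + master.submitProcedure("SCP"));
    }
}
```

The point of the ordering is that once `shutdown()` starts, nothing arriving over RPC can observe partially cleaned state, and once the proc WAL is closed no late thread can persist a new procedure. A real implementation would also have to deal with in-flight requests and concurrent shutdown calls, which this sketch ignores.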


