Posted to issues@hbase.apache.org by "Duo Zhang (JIRA)" <ji...@apache.org> on 2018/12/19 14:53:00 UTC

[jira] [Commented] (HBASE-21611) REGION_STATE_TRANSITION_CONFIRM_CLOSED should interact better with crash procedure

    [ https://issues.apache.org/jira/browse/HBASE-21611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16725066#comment-16725066 ] 

Duo Zhang commented on HBASE-21611:
-----------------------------------

This is by design, I'd say: we have to retry until the SCP (ServerCrashProcedure) interrupts us. Checking for an SCP may be possible, but it would complicate the logic and open the door to more races and bugs. And does it really spam the logs? Maybe the problem is that the backoff logic is broken; otherwise the retry interval should quickly grow to seconds or even minutes.
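
For illustration, here is a minimal sketch of the kind of capped exponential backoff the comment has in mind, where the wait between retries doubles on each attempt until it hits a ceiling. The class name, method name, and the 1 second base / 10 minute cap are hypothetical and chosen only for the example; this is not HBase's actual retry code.

```java
public class BackoffSketch {
    // Hypothetical helper: delay before retry number `attempt` (0-based).
    // The delay doubles with every attempt and is capped at maxMillis.
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        // Cap the shift so that, for base delays of a few seconds,
        // the multiplication stays well within the range of a long.
        int shift = Math.min(Math.max(attempt, 0), 30);
        long delay = baseMillis * (1L << shift);
        return Math.min(delay, maxMillis);
    }

    public static void main(String[] args) {
        // Assumed values: 1 second base delay, 10 minute cap.
        for (int attempt = 0; attempt < 12; attempt++) {
            System.out.println("attempt " + attempt + " -> "
                + delayMillis(attempt, 1000L, 600_000L) + " ms");
        }
    }
}
```

With working backoff of this shape, a procedure that keeps failing would log only a handful of retries before the interval reaches minutes, so a steady stream of retry messages would suggest the attempt counter is being reset or the backoff is not applied.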

> REGION_STATE_TRANSITION_CONFIRM_CLOSED should interact better with crash procedure
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-21611
>                 URL: https://issues.apache.org/jira/browse/HBASE-21611
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Priority: Major
>
> 1) Not a bug per se, since HDFS is not supposed to lose files, just a bit fragile.
> When a dead server's WAL directory is deleted (due to manual intervention, or some issue with HDFS) while some regions are in CLOSING state on that server, they get stuck forever in a loop of REGION_STATE_TRANSITION_CONFIRM_CLOSED - REGION_STATE_TRANSITION_CLOSE - "give up and mark the procedure as complete, the parent procedure will take care of this". There's no crash procedure for the server, so nobody ever takes care of that.
> 2) Under normal circumstances, when a large WAL is being split, this same loop keeps spamming the logs and wasting resources for no reason until the crash procedure completes. There's no reason for it to retry; it should just wait for the crash procedure.


