Posted to issues@hbase.apache.org by "Bahram Chehrazy (JIRA)" <ji...@apache.org> on 2019/02/07 18:59:00 UTC

[jira] [Comment Edited] (HBASE-21788) OpenRegionProcedure (after recovery?) is unreliable and needs to be improved

    [ https://issues.apache.org/jira/browse/HBASE-21788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16762968#comment-16762968 ] 

Bahram Chehrazy edited comment on HBASE-21788 at 2/7/19 6:58 PM:
-----------------------------------------------------------------

Quote from @[~Apache9]:

 

The design for OpenRegionProcedure/CloseRegionProcedure is that, if we successfully send the request out, the RS must send a report back to wake us up; if the RS is dead, SCP will wake us up instead.

So if you see an OpenRegionProcedure stuck, the problem may be:

1. Something is wrong with the RSProcedureDispatcher: we fail to send the request out and also fail to tell the procedure about it.
2. Something is wrong on the RS side: we do not send the report back, but the RS is not aborted.
3. There are races that cause us to miss the report from the RS. Maybe something like HBASE-21811?
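
To make the contract above concrete, here is a minimal sketch of a suspended procedure that can be woken by exactly one of three sources. It is only an illustration: WakeUpSketch, PendingOpen, WakeSource and the method names are hypothetical and are not HBase classes or APIs. A procedure appears stuck precisely when none of the wake-up paths is ever invoked.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical model of the wake-up contract; none of these names are HBase classes.
final class WakeUpSketch {
    enum WakeSource { RS_REPORT, SERVER_CRASH, DISPATCH_FAILURE }

    static final class PendingOpen {
        private final CompletableFuture<WakeSource> wake = new CompletableFuture<>();

        // The dispatcher could not send the request and must tell the procedure.
        void onDispatchFailure() { wake.complete(WakeSource.DISPATCH_FAILURE); }

        // Normal path: the region server reports the result of the open.
        void onRegionServerReport() { wake.complete(WakeSource.RS_REPORT); }

        // The region server died; crash handling wakes the procedure instead.
        void onServerCrash() { wake.complete(WakeSource.SERVER_CRASH); }

        // True while no source has woken the procedure, i.e. it would appear stuck.
        boolean isWaiting() { return !wake.isDone(); }

        // The first wake-up wins; returns null while still waiting.
        WakeSource wokenBy() { return wake.getNow(null); }
    }

    public static void main(String[] args) {
        PendingOpen open = new PendingOpen();
        System.out.println("waiting: " + open.isWaiting());
        open.onRegionServerReport();
        open.onServerCrash(); // a later wake-up is ignored
        System.out.println("woken by: " + open.wokenBy());
    }
}
```

In this model, cases 1 and 2 correspond to a path that never calls its wake-up method, and case 3 to a report that arrives but is dropped before it reaches the waiting procedure.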



> OpenRegionProcedure (after recovery?) is unreliable and needs to be improved
> ----------------------------------------------------------------------------
>
>                 Key: HBASE-21788
>                 URL: https://issues.apache.org/jira/browse/HBASE-21788
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>            Reporter: Sergey Shelukhin
>            Priority: Critical
>
> Not much for this one yet.
> I repeatedly see cases where a region is stuck in OPENING; after a master restart the RIT is recovered and stays WAITING, and its OpenRegionProcedure (also recovered) is stuck in Runnable and does nothing for hours. I cannot find logs on the target server indicating that it ever tried to do anything after the master restart.
> This procedure needs, at the very least, logging of what it is trying to do, and maybe a timeout so that it unconditionally fails after a configurable period (1 hour?).
> I may also investigate why it doesn't do anything and file a separate bug. I wonder if it's somehow related to the region status check, but this is just a hunch.
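
The logging and timeout suggested in the description could look roughly like the sketch below. This is an assumption-laden illustration, not a proposed patch: OpenTimeoutSketch, TrackedOpen, the State values and the idea of a periodic check() call are all hypothetical, and the one-hour value simply mirrors the "(1 hour?)" guess above. Time is passed in explicitly so the behaviour is deterministic.

```java
import java.time.Duration;

// Hypothetical sketch of a deadline plus status logging for a pending region open.
final class OpenTimeoutSketch {
    enum State { WAITING, OPENED, FAILED_TIMEOUT }

    static final class TrackedOpen {
        private final String region;
        private final String targetServer;
        private final long startMillis;
        private final long timeoutMillis;
        private State state = State.WAITING;

        TrackedOpen(String region, String targetServer, long startMillis, Duration timeout) {
            this.region = region;
            this.targetServer = targetServer;
            this.startMillis = startMillis;
            this.timeoutMillis = timeout.toMillis();
        }

        // Called when the region server reports a successful open.
        void reportOpened() {
            if (state == State.WAITING) {
                state = State.OPENED;
            }
        }

        // Called periodically; fails the open once the configured period has elapsed.
        State check(long nowMillis) {
            if (state == State.WAITING && nowMillis - startMillis >= timeoutMillis) {
                state = State.FAILED_TIMEOUT;
            }
            return state;
        }

        // A log line stating what the procedure is waiting for and for how long.
        String describe(long nowMillis) {
            return "open of " + region + " on " + targetServer
                + ": state=" + state + ", elapsedMs=" + (nowMillis - startMillis);
        }
    }

    public static void main(String[] args) {
        TrackedOpen open = new TrackedOpen("region-a", "rs-1", 0L, Duration.ofHours(1));
        long halfHour = Duration.ofMinutes(30).toMillis();
        open.check(halfHour);
        System.out.println(open.describe(halfHour));
        long twoHours = Duration.ofHours(2).toMillis();
        open.check(twoHours);
        System.out.println(open.describe(twoHours));
    }
}
```

Failing unconditionally after the deadline trades a silent hang for an explicit failure that can be retried or investigated, which is the behaviour the description asks for.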



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)