
Posted to issues@hbase.apache.org by "Nick Dimiduk (Jira)" <ji...@apache.org> on 2020/09/17 18:41:00 UTC

[jira] [Commented] (HBASE-25059) TransitionRegionStateProcedure should timeout, rollback, retry instead of waiting infinitely on CONFIRMED_OPEN

    [ https://issues.apache.org/jira/browse/HBASE-25059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17197898#comment-17197898 ] 

Nick Dimiduk commented on HBASE-25059:
--------------------------------------

In this case, after 220 attempts that each fail with {{CallTimeoutException}}, the region server starts responding with {{CallQueueTooBigException}}. Still, we never give up.

Down in {{RSProcedureDispatcher$ExecuteProceduresRemoteCall#scheduleForRetry}}, I see no consideration for {{CallTimeoutException}}. There is handling for {{CallQueueTooBigException}}, but it's a highly specialized case that only applies when the very first attempt is rejected.

{noformat}
      // This exception is thrown in the rpc framework, where we can make sure that the call has not
      // been executed yet, so it is safe to mark it as fail. Especially for open a region, we'd
      // better choose another region server.
      // Notice that, it is safe to quit only if this is the first time we send request to region
      // server. Maybe the region server has accepted our request the first time, and then there is
      // a network error which prevents we receive the response, and the second time we hit a
      // CallQueueTooBigException, obviously it is not safe to quit here, otherwise it may lead to a
      // double assign...
      if (e instanceof CallQueueTooBigException && numberOfAttemptsSoFar == 0) {
        LOG.warn("request to {} failed due to {}, try={}, this usually because" +
          " server is overloaded, give up", serverName, e.toString(), numberOfAttemptsSoFar);
        return false;
      }
{noformat}
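
To make the consequence of that guard concrete, here is a minimal standalone sketch that replays the sequence described above. It is not the HBase code: the exception classes are local stand-ins, {{firstGiveUp}} is a hypothetical helper, and the sketch assumes every path other than the quoted guard results in a retry (the real method has other checks that are not modeled here).

```java
import java.io.IOException;

class RetryDecisionSketch {
  // Local stand-ins for the HBase exception types; NOT the real classes.
  static class CallTimeoutException extends IOException {}
  static class CallQueueTooBigException extends IOException {}

  // Models only the quoted guard: give up (return false) solely when the very
  // first attempt is rejected with CallQueueTooBigException.
  // Assumption: every other case is retried.
  static boolean scheduleForRetry(IOException e, int numberOfAttemptsSoFar) {
    if (e instanceof CallQueueTooBigException && numberOfAttemptsSoFar == 0) {
      return false;
    }
    return true;
  }

  // Hypothetical helper: replays `timeouts` CallTimeoutExceptions followed by
  // `queueTooBig` CallQueueTooBigExceptions and returns the attempt index at
  // which the guard gave up, or -1 if it never did.
  static int firstGiveUp(int timeouts, int queueTooBig) {
    for (int attempt = 0; attempt < timeouts + queueTooBig; attempt++) {
      IOException e = attempt < timeouts
          ? new CallTimeoutException()
          : new CallQueueTooBigException();
      if (!scheduleForRetry(e, attempt)) {
        return attempt;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    // 220 timeouts, then queue-too-big responses: the guard never fires.
    System.out.println(firstGiveUp(220, 100)); // prints -1
    // Queue-too-big on the very first attempt: the guard fires immediately.
    System.out.println(firstGiveUp(0, 1)); // prints 0
  }
}
```

Under these assumptions, once the first attempt has failed with anything other than {{CallQueueTooBigException}}, no later exception of either type can end the retry loop, which matches what we observe.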

> TransitionRegionStateProcedure should timeout, rollback, retry instead of waiting infinitely on CONFIRMED_OPEN
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-25059
>                 URL: https://issues.apache.org/jira/browse/HBASE-25059
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 2.3.2
>            Reporter: Nick Dimiduk
>            Priority: Major
>
> Testing 2.3.2RC1 with ITBLL. The region server assigned to open meta locked up due to HBASE-24896. Meanwhile, the master waits indefinitely on a procedure {{pid=176583, ppid=176532, state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED; TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN}}.
> AssignmentManager needs a way to rescind assignment when a RS fails to complete within a reasonable timeout window, roll back the procedure, and try again with a new target.


