Posted to issues@hbase.apache.org by "Rushabh Shah (Jira)" <ji...@apache.org> on 2020/05/10 02:40:00 UTC
[jira] [Commented] (HBASE-24292) A "stuck" master should not idle as active without taking action

    [ https://issues.apache.org/jira/browse/HBASE-24292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17103597#comment-17103597 ] 

Rushabh Shah commented on HBASE-24292:
--------------------------------------

 {code:title=HMaster.java|borderStyle=solid}
 private boolean isRegionOnline(RegionInfo ri) throws InterruptedException {
    RetryCounter rc = null;
    while (!isStopped()) {
      RegionState rs = this.assignmentManager.getRegionStates().getRegionState(ri);
      if (rs.isOpened()) {
        if (this.getServerManager().isServerOnline(rs.getServerName())) {
          return true;
        }
      }
      // Region is not OPEN.
      Optional<Procedure<MasterProcedureEnv>> optProc = this.procedureExecutor.getProcedures().
          stream().filter(p -> p instanceof ServerCrashProcedure).findAny();
      // TODO: Add a page to refguide on how to do repair. Have this log message point to it.
      // Page will talk about loss of edits, how to schedule at least the meta WAL recovery, and
      // then how to assign including how to break region lock if one held.
      LOG.warn("{} is NOT online; state={}; ServerCrashProcedures={}. Master startup cannot " +
          "progress, in holding-pattern until region onlined.",
          ri.getRegionNameAsString(), rs, optProc.isPresent());
      // Check once-a-minute.
      if (rc == null) {
        rc = new RetryCounterFactory(1000).create();
      }
      Threads.sleep(rc.getBackoffTimeAndIncrementAttempts());
    }
    return false;
  }
{code}

If I understand the code correctly, it sleeps until the hbase:meta region comes online and *doesn't* give up. The one problem I see is that the sleep time never maxes out: it grows exponentially without limit. Maybe we should cap it at 1 or 2 minutes.
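
A minimal sketch of the cap suggested above, assuming a 1-second base interval (as in the quoted code) and a 2-minute ceiling. The class `CappedBackoff` and its methods are hypothetical and are not part of HBase. They only illustrate the arithmetic of doubling the sleep on each attempt until it hits the ceiling.

```java
// Hypothetical helper, not part of HBase: exponential backoff with an upper bound.
final class CappedBackoff {
  private final long baseMillis; // first sleep interval
  private final long maxMillis;  // ceiling that the sleep never exceeds
  private int attempts = 0;

  CappedBackoff(long baseMillis, long maxMillis) {
    if (baseMillis <= 0 || maxMillis < baseMillis) {
      throw new IllegalArgumentException("need 0 < baseMillis <= maxMillis");
    }
    this.baseMillis = baseMillis;
    this.maxMillis = maxMillis;
  }

  /** Returns the sleep for the current attempt, then increments the attempt count. */
  long nextSleepMillis() {
    long sleep = sleepFor(baseMillis, maxMillis, attempts);
    if (attempts < Integer.MAX_VALUE) {
      attempts++;
    }
    return sleep;
  }

  /** Computes min(baseMillis * 2^attempts, maxMillis) without overflowing. */
  static long sleepFor(long baseMillis, long maxMillis, int attempts) {
    // If doubling `attempts` times would pass the cap (or overflow), return the cap.
    if (attempts >= 62 || baseMillis > (maxMillis >> attempts)) {
      return maxMillis;
    }
    return baseMillis << attempts;
  }
}
```

With these assumed values the sleeps are 1s, 2s, 4s, 8s, 16s, 32s, 64s, and then 120s for every later attempt, so the loop keeps polling at a bounded interval rather than backing off indefinitely.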

> A "stuck" master should not idle as active without taking action
> ----------------------------------------------------------------
>
>                 Key: HBASE-24292
>                 URL: https://issues.apache.org/jira/browse/HBASE-24292
>             Project: HBase
>          Issue Type: Bug
>          Components: master, Region Assignment
>    Affects Versions: 2.3.0
>            Reporter: Nick Dimiduk
>            Priority: Critical
>
> The master schedules an SCP for the region server hosting meta. However, due to a misconfiguration, the cluster cannot make progress. After fixing the configuration issue and restarting, the cluster still cannot make progress. After the configured period (15 minutes), the master enters a "holding pattern" where it retains Active master status, but isn't taking any action.
> This "brown-out" state is toxic. It should either keep trying to make progress, or it should abort. Staying up and not doing anything is the wrong thing to do.


