Posted to issues@hbase.apache.org by "Duo Zhang (Jira)" <ji...@apache.org> on 2020/06/08 09:05:00 UTC

[jira] [Commented] (HBASE-24117) If move target RS crashes, move fails if concurrent master crash

    [ https://issues.apache.org/jira/browse/HBASE-24117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17128045#comment-17128045 ] 

Duo Zhang commented on HBASE-24117:
-----------------------------------

OK, I found the root cause.

There is still something wrong with the shutdown code.

{noformat}
2020-05-08 15:57:51,275 INFO  [M:0;localhost:51555] assignment.AssignmentManager(287): Stopping assignment manager
2020-05-08 15:57:51,277 INFO  [M:0;localhost:51555] procedure2.RemoteProcedureDispatcher(113): Stopping procedure remote dispatcher
2020-05-08 15:57:51,277 INFO  [PEWorker-4] procedure.ServerCrashProcedure(476): pid=13, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure server=localhost,51560,1588978659320, splitWal=true, meta=false found a region state=OFFLINE, location=null, table=Backoff, region=3db619df21f59db7441495065e782264 which is no longer on us localhost,51560,1588978659320, give up assigning...
{noformat}

In AssignmentManager.stop we clean up the regionStates map, but then, when an in-flight SCP reaches SCP.assignRegions, it recreates the RegionStateNode from scratch since the original has already been cleared. The recreated node is OFFLINE with a null location (as the log above shows), so we give up assigning because of this check:

{code}
        // This is possible, as when a server is dead, TRSP will fail to schedule a RemoteProcedure
        // and then try to assign the region to a new RS. And before it has updated the region
        // location to the new RS, we may have already called the am.getRegionsOnServer so we will
        // consider the region is still on this crashed server. Then before we arrive here, the
        // TRSP could have updated the region location, or even finished itself, so the region is
        // no longer on this crashed server any more. We should not try to assign it again. Please
        // see HBASE-23594 for more details.
        // UPDATE: HBCKServerCrashProcedure overrides isMatchingRegionLocation; this check can get
        // in the way of our clearing out 'Unknown Servers'.
        if (!isMatchingRegionLocation(regionNode)) {
          LOG.info("{} found {} whose regionLocation no longer matches {}, skipping assign...",
            this, regionNode, serverName);
          continue;
        }
{code}
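
Here is a minimal, self-contained sketch of the race (hypothetical classes, not the real AssignmentManager/RegionStates code): once stop() clears the map, the later lookup from SCP.assignRegions recreates a node whose location is null, so the check above can never match the crashed server and we skip the assign.

{code}
// Hedged sketch with made-up names; it only illustrates the race described above.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class RegionStateNodeSketch {
  volatile String location; // null until the region is assigned somewhere

  public static void main(String[] args) {
    ConcurrentMap<String, RegionStateNodeSketch> regionStates = new ConcurrentHashMap<>();
    String region = "3db619df21f59db7441495065e782264";
    String crashedServer = "localhost,51560,1588978659320";

    RegionStateNodeSketch existing = new RegionStateNodeSketch();
    existing.location = crashedServer;
    regionStates.put(region, existing); // node exists while the RS is alive

    regionStates.clear(); // AssignmentManager.stop() wipes regionStates

    // SCP.assignRegions() looks the region up again and recreates a fresh node
    RegionStateNodeSketch node =
        regionStates.computeIfAbsent(region, k -> new RegionStateNodeSketch());
    boolean matching = crashedServer.equals(node.location); // false: location is null
    System.out.println("isMatchingRegionLocation: " + matching + ", give up assigning");
  }
}
{code}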

Typically, we should not shut down the AssignmentManager before all the procedures have quit...

Let me think about how to fix this...
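
A rough sketch of the ordering I mean (hypothetical interfaces, not a real patch against HMaster): drain the procedure workers first, then stop the AssignmentManager, so no in-flight SCP can observe a cleared regionStates.

{code}
// Hedged sketch with made-up types; the shape of the fix, not the fix itself.
class MasterShutdownSketch {
  interface Stoppable { void stop(); }
  interface Joinable extends Stoppable { void join() throws InterruptedException; }

  void stopServiceThreads(Joinable procedureExecutor, Stoppable assignmentManager)
      throws InterruptedException {
    procedureExecutor.stop(); // stop scheduling new procedures
    procedureExecutor.join(); // wait until every PEWorker has quit
    assignmentManager.stop(); // only now is it safe to clear regionStates
  }
}
{code}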

> If move target RS crashes, move fails if concurrent master crash
> ----------------------------------------------------------------
>
>                 Key: HBASE-24117
>                 URL: https://issues.apache.org/jira/browse/HBASE-24117
>             Project: HBase
>          Issue Type: Bug
>          Components: proc-v2
>            Reporter: Michael Stack
>            Assignee: Michael Stack
>            Priority: Major
>         Attachments: org.apache.hadoop.hbase.master.assignment.TestCloseRegionWhileRSCrash-output.txt
>
>
> I saw this on TestCloseRegionWhileRSCrash. The Region 788a516d1f86af98e0a16bcc1afe4fa1 was being moved to RS example.com,62652,1586032098445 just after it was killed. The Move's Close fails because the RS has no node in the Master. The Move then tries to 'confirm' the close, but that fails too because there is no remote RS. We are then left to wait in this state until an operator or some other procedure intervenes to 'fix' the state. Normally a ServerCrashProcedure would do the job, but in this test the Master is restarted after the RS is killed, a condition we do not accommodate.
> Let me attach the test log.


