Posted to issues@hbase.apache.org by "Sergey Shelukhin (JIRA)" <ji...@apache.org> on 2019/01/15 21:49:00 UTC

[jira] [Commented] (HBASE-21627) race condition between a recovered RIT for meta replica, and master startup

    [ https://issues.apache.org/jira/browse/HBASE-21627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16743424#comment-16743424 ] 

Sergey Shelukhin commented on HBASE-21627:
------------------------------------------

Update: we've seen another instance of this, or a similar race, where the meta replica actually gets assigned by one of the procedures, but then the other one messes something up. The RS thinks the replica is opened while the master thinks it's still in the opening state, and it stays stuck like that forever, for some reason even across master restarts (either the bad state is persistent, or the race repeats every time).
We kinda gave up on meta replicas at this point, so I'm not investigating further. But I think the meta replica handling needs to be redone to solve all the races and crashes (see also HBASE-21624). Right now it seems uncoordinated with anything.

> race condition between a recovered RIT for meta replica, and master startup
> ---------------------------------------------------------------------------
>
>                 Key: HBASE-21627
>                 URL: https://issues.apache.org/jira/browse/HBASE-21627
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Priority: Major
>
> The master recovers the RIT for a meta replica:
> {noformat}
> 2018-12-14 23:16:12,008 INFO  [master/...:17000:becomeActiveMaster] assignment.AssignmentManager: Attach pid=83796, ppid=83788, state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, hasLock=false; TransitRegionStateProcedure table=hbase:meta, region=(region), ASSIGN to rit=OFFLINE, location=null, table=hbase:meta, region=(region) to restore RIT
> 2018-12-14 23:16:16,475 WARN  [PEWorker-8] assignment.TransitRegionStateProcedure: No location specified for {ENCODED => (region), NAME => 'hbase:meta,,1_0001', STARTKEY => '', ENDKEY => '', REPLICA_ID => 1}, jump back to state REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE to get one
> ...
> 2018-12-14 23:16:30,010 INFO  [PEWorker-16] procedure2.ProcedureExecutor: Finished pid=83796, ppid=83788, state=SUCCESS, hasLock=false; TransitRegionStateProcedure table=hbase:meta, region=(region), ASSIGN in 8mins, 23.39sec
> {noformat}
> Then it tries to assign the replicas:
> {noformat}
> 2018-12-14 23:16:36,091 ERROR [master/...:17000:becomeActiveMaster] master.HMaster: Failed to become active master
> org.apache.hadoop.hbase.client.DoNotRetryRegionException: Unexpected state for rit=OPEN, location=server,17020,1544858156805, table=hbase:meta, region=(region)
>                 at org.apache.hadoop.hbase.master.assignment.AssignmentManager.preTransitCheck(AssignmentManager.java:548)
>                 at org.apache.hadoop.hbase.master.assignment.AssignmentManager.assign(AssignmentManager.java:563)
>                 at org.apache.hadoop.hbase.master.MasterMetaBootstrap.assignMetaReplicas(MasterMetaBootstrap.java:84)
>                 at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:1146)
> {noformat}
> Unfortunately I misplaced the full log after copy-pasting a grep result, so that's all I have for this.
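
The startup failure in the stack trace above can be illustrated with a toy model. This is only a sketch, not HBase code: ToyAssignmentManager, finishRecoveredAssign, assignStrict and assignIfOffline are hypothetical names made up for illustration, and the two-state enum is a big simplification of the real region states. It assumes the recovered procedure has already moved the replica to OPEN by the time the startup path asks for an assign, which is what the "Unexpected state for rit=OPEN" message suggests. The tolerant variant is just one possible way to avoid the crash, not a proposed fix, and it does not cover the "RS says opened, master says opening" case from the comment.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: none of these types are HBase classes.
class MetaReplicaAssignRace {
    enum State { OFFLINE, OPEN }

    static class ToyAssignmentManager {
        private final Map<String, State> states = new HashMap<>();

        // Regions we have never seen are treated as OFFLINE.
        synchronized State stateOf(String region) {
            return states.getOrDefault(region, State.OFFLINE);
        }

        // Stand-in for the recovered RIT procedure finishing its ASSIGN.
        synchronized void finishRecoveredAssign(String region) {
            states.put(region, State.OPEN);
        }

        // Strict assign: rejects any region that is not OFFLINE, loosely
        // mimicking the "Unexpected state" failure seen during startup.
        synchronized void assignStrict(String region) {
            State s = stateOf(region);
            if (s != State.OFFLINE) {
                throw new IllegalStateException(
                    "Unexpected state for rit=" + s + ", region=" + region);
            }
            states.put(region, State.OPEN);
        }

        // Tolerant assign (an assumption, not the HBase behavior): treats an
        // already-open region as done. Returns true only if it did the assign.
        synchronized boolean assignIfOffline(String region) {
            if (stateOf(region) == State.OPEN) {
                return false;
            }
            states.put(region, State.OPEN);
            return true;
        }
    }

    public static void main(String[] args) {
        ToyAssignmentManager am = new ToyAssignmentManager();
        String replica = "hbase:meta,,1_0001";

        // The recovered procedure wins the race and opens the replica first.
        am.finishRecoveredAssign(replica);

        // The startup path then asks for an unconditional assign and fails.
        try {
            am.assignStrict(replica);
        } catch (IllegalStateException e) {
            System.out.println("strict assign failed: " + e.getMessage());
        }

        // The tolerant path notices the replica is already open and skips it.
        System.out.println("tolerant assign did work: " + am.assignIfOffline(replica));
    }
}
```

In this model the order of the two callers decides whether startup succeeds, which is the kind of missing coordination the comment is complaining about.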