Posted to issues@hbase.apache.org by "Sergey Shelukhin (JIRA)" <ji...@apache.org> on 2019/02/12 23:31:00 UTC
[jira] [Comment Edited] (HBASE-21623) ServerCrashProcedure can stomp on a RIT for a wrong server

    [ https://issues.apache.org/jira/browse/HBASE-21623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16766575#comment-16766575 ] 

Sergey Shelukhin edited comment on HBASE-21623 at 2/12/19 11:30 PM:
--------------------------------------------------------------------

Can you elaborate on which pieces of code would race? The procedure is set under lock; the problem here is that the SCP updates an unrelated procedure. 
It's possible that the procedure update for RIT somewhere resets the procedure without checking, but I'm not sure how that would affect SCP in particular. There might be a different race condition, but that would be a separate bug.


was (Author: sershe):
Can you elaborate, which pieces of code would race? Procedure is set under lock; the problem here is that the SCP updates an unrelated procedure. 
It's possible that the procedure update for RIT somewhere resets the procedure without checking, but I'm not sure how it will affect SCP in particular. There might be a different race condition.

> ServerCrashProcedure can stomp on a RIT for a wrong server
> ----------------------------------------------------------
>
>                 Key: HBASE-21623
>                 URL: https://issues.apache.org/jira/browse/HBASE-21623
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2
>    Affects Versions: 3.0.0, 2.2.0
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>            Priority: Critical
>         Attachments: HBASE-21623.patch
>
>
> A server died while a region was being opened on it; eventually the open failed, and the RIT procedure started retrying on a different server.
> However, by then the SCP for the dying server had already obtained the region from the list of regions on the old server, and it proceeded to overwrite the state of the RIT, which was already working with the new server.
> {noformat}
> 2018-12-18 23:06:03,160 INFO  [PEWorker-14] procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=151404, ppid=151104, state=RUNNABLE, hasLock=false; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]
> ...
> 2018-12-18 23:06:38,208 INFO  [PEWorker-10] procedure.ServerCrashProcedure: Start pid=151632, state=RUNNABLE:SERVER_CRASH_START, hasLock=true; ServerCrashProcedure server=oldServer,17020,1545202098577, splitWal=true, meta=false
> ...
> 2018-12-18 23:06:41,953 WARN  [RSProcedureDispatcher-pool4-t115] assignment.RegionRemoteProcedureBase: The remote operation pid=151404, ppid=151104, state=RUNNABLE, hasLock=false; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region {ENCODED => region1, ... } to server oldServer,17020,1545202098577 failed
> org.apache.hadoop.hbase.regionserver.RegionServerAbortedException: org.apache.hadoop.hbase.regionserver.RegionServerAbortedException: Server oldServer,17020,1545202098577 aborting
> 2018-12-18 23:06:42,485 INFO  [PEWorker-5] procedure2.ProcedureExecutor: Finished subprocedure(s) of pid=151104, ppid=150875, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; TransitRegionStateProcedure table=t1, region=region1, ASSIGN; resume parent processing.
> 2018-12-18 23:06:42,485 INFO  [PEWorker-13] assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; pid=151104, ppid=150875, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; TransitRegionStateProcedure table=t1, region=region1, ASSIGN; rit=OPENING, location=oldServer,17020,1545202098577
> 2018-12-18 23:06:42,500 INFO  [PEWorker-13] assignment.TransitRegionStateProcedure: Starting pid=151104, ppid=150875, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; TransitRegionStateProcedure table=t1, region=region1, ASSIGN; rit=OPENING, location=null; forceNewPlan=true, retain=false
> 2018-12-18 23:06:42,657 INFO  [PEWorker-2] assignment.RegionStateStore: pid=151104 updating hbase:meta row=region1, regionState=OPENING, regionLocation=newServer,17020,1545202111238
> ...
> 2018-12-18 23:06:43,094 INFO  [PEWorker-4] procedure.ServerCrashProcedure: pid=151632, state=RUNNABLE:SERVER_CRASH_ASSIGN, hasLock=true; ServerCrashProcedure server=oldServer,17020,1545202098577, splitWal=true, meta=false found RIT  pid=151104, ppid=150875, state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; TransitRegionStateProcedure table=t1, region=region1, ASSIGN; rit=OPENING, location=newServer,17020,1545202111238, table=t1, region=region1
> 2018-12-18 23:06:43,094 INFO  [PEWorker-4] assignment.RegionStateStore: pid=151104 updating hbase:meta row=region1, regionState=ABNORMALLY_CLOSED
> {noformat}
> Later, it seems, the RIT overwrote the state again, and the region then got stuck in OPENING state forever. I'm not sure yet whether that is due to this bug alone or whether there was another bug after it. For now, this issue can be addressed on its own.
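
The failure mode described above can be illustrated with a minimal sketch of a guard. This is not the attached patch and does not use HBase's actual classes: `RegionNode`, `markCrashedIfStillOnServer`, and the string-based server names and states are hypothetical stand-ins. The idea, under those assumptions, is that the SCP re-checks the region's current location under the region lock and only touches the region if it still points at the crashed server.

```java
import java.util.Objects;
import java.util.concurrent.locks.ReentrantLock;

public class ScpGuardSketch {

    /** Hypothetical stand-in for the master's in-memory state of one region. */
    static final class RegionNode {
        final ReentrantLock lock = new ReentrantLock();
        String location; // server the region is assigned to or opening on
        String state;    // e.g. "OPENING", "ABNORMALLY_CLOSED"

        RegionNode(String location, String state) {
            this.location = location;
            this.state = state;
        }
    }

    /**
     * Hypothetical SCP step: mark the region as abnormally closed only if it
     * is still located on the crashed server. Returns true if the state was
     * changed, false if the region had already moved and was left untouched.
     */
    static boolean markCrashedIfStillOnServer(RegionNode node, String crashedServer) {
        node.lock.lock();
        try {
            // The region list was read earlier; the RIT may have retried on
            // another server since then, so re-check under the lock.
            if (!Objects.equals(node.location, crashedServer)) {
                return false;
            }
            node.state = "ABNORMALLY_CLOSED";
            node.location = null;
            return true;
        } finally {
            node.lock.unlock();
        }
    }

    public static void main(String[] args) {
        String oldServer = "oldServer,17020,1545202098577";
        String newServer = "newServer,17020,1545202111238";

        // The RIT has already moved the region to the new server.
        RegionNode moved = new RegionNode(newServer, "OPENING");
        System.out.println(markCrashedIfStillOnServer(moved, oldServer) + " " + moved.state);

        // The region is still opening on the crashed server.
        RegionNode stale = new RegionNode(oldServer, "OPENING");
        System.out.println(markCrashedIfStillOnServer(stale, oldServer) + " " + stale.state);
    }
}
```

In the log above, the region's location was already newServer when the SCP found the RIT, so a check of this kind would have skipped it instead of writing ABNORMALLY_CLOSED.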


