Posted to issues@hbase.apache.org by "Sergey Shelukhin (JIRA)" <ji...@apache.org> on 2019/02/11 19:51:00 UTC

[jira] [Comment Edited] (HBASE-21862) region can be assigned to 2 servers due to a timed-out call or an unknown exception

    [ https://issues.apache.org/jira/browse/HBASE-21862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765338#comment-16765338 ] 

Sergey Shelukhin edited comment on HBASE-21862 at 2/11/19 7:50 PM:
-------------------------------------------------------------------

Can you please not take over JIRAs like that and change the root cause analysis? If you think there's a different bug, please file a different JIRA unless there's some discussion... I cloned this for the new fix. I disagree that it actually addresses the issue properly.
(1) At the very least, it should be based on a whitelist, not a blacklist - because, as the comment mentions, we don't know how such errors can manifest and which exceptions are network exceptions. If there's some other error we didn't account for in the network error list, double assignment can happen again (see the sketch after this list).
(2) Even now, with a network error we'd waste time retrying against a bad RS instead of just reassigning to another one. Network timeouts can be very long and there's no way for the current model to terminate one. If we based this on a whitelist per (1), we'd also retry on some exception we didn't account for that would never go away (e.g. a bug in the RS specific to this one region).
(3) If the open works, the retry on the RS will result in the RS not calling "regionOpened" again (per the WARN it logs that the region is already opened), so the master will get stuck. This is a fundamental issue: assignment is based on actions and not target states. That may not need to be resolved here, but to adopt the infinite-retry fix, at least this should be addressed.
(4) I am not even convinced that infinite retry solves the straightforward case as is, before all the other complications. It seems like it should work, but I'm not sure there can't be a combination of errors (e.g. the same situation as here, but on retry we get a non-network error from the RS, like a BUSY/call-queue-full - it looks like we'd count the call as failed even though it didn't fail).
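To make (1) concrete, here is a minimal, hypothetical sketch of what a whitelist-based classification could look like (class and method names are illustrative, not actual HBase code): only failures that provably mean the request never reached the RS allow an immediate reassign, and everything else, including exceptions nobody anticipated, defaults to "region state unknown".
{noformat}
// Hypothetical sketch only, not actual HBase code: a whitelist of failures
// that provably mean the open request never executed on the RS. Anything not
// on the list (timeouts, unexpected wrappers, future bugs) is treated as
// "server state unknown", which is the safe default against double assign.
import java.net.ConnectException;
import java.nio.channels.ClosedChannelException;
import java.util.Set;

final class RemoteFailureClassifier {
  private static final Set<Class<? extends Throwable>> SAFE_TO_REASSIGN =
      Set.of(ConnectException.class, ClosedChannelException.class);

  static boolean safeToReassignImmediately(Throwable error) {
    // Not on the whitelist -> do NOT assume the region is unassigned.
    return SAFE_TO_REASSIGN.stream().anyMatch(c -> c.isInstance(error));
  }
}
{noformat}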




> region can be assigned to 2 servers due to a timed-out call or an unknown exception
> -----------------------------------------------------------------------------------
>
>                 Key: HBASE-21862
>                 URL: https://issues.apache.org/jira/browse/HBASE-21862
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 3.0.0, 2.2.0
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>            Priority: Blocker
>             Fix For: 3.0.0, 2.2.0, 2.1.3, 2.0.5, 2.3.0
>
>
> It's a classic bug, sort of... the call to open the region times out, but the RS actually processes it fine. It could also happen if the response didn't make it back due to a network issue.
> As a result, the region is opened on two servers.
> There are some mitigations possible to narrow down the race window.
> 1) Don't process expired open calls, fail them. Won't help for network issues.
> 2) Don't ignore invalid RS state, kill the RS (YouAreDead exception) - but that will require fixing other network races where the master kills the RS, which would require adding state versioning to the protocol.
> The fundamental fix, though, would require one of:
> 1) on an unknown failure from open, ascertaining the state of the region from the server. Again, this would probably require protocol changes to make sure we ascertain that the region is not opened, and also that the already-failed-on-master open is NOT going to be processed if it's sitting in some queue or even in transit on the network (via a nonce-like mechanism, sketched after this list)?
> 2) some form of a distributed lock per region, e.g. in ZK
> 3) some form of 2PC? but the participant list cannot be determined in a manner that's both scalable and guaranteed correct. Theoretically it could be all RSes.
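> For illustration, a minimal, hypothetical sketch of the nonce-like fencing from (1), seen from the master side (all names are made up for the example, not actual assignment-manager code): every open attempt gets a fresh id that rides on the open RPC, and a completion report carrying a stale id gets an explicit "close it again" verdict instead of being silently ignored.
> {noformat}
> // Hypothetical sketch, not actual HBase code.
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
> import java.util.concurrent.atomic.AtomicLong;
>
> final class OpenAttemptTracker {
>   enum Verdict { ACCEPT_OPEN, STALE_MUST_CLOSE }
>
>   private final AtomicLong ids = new AtomicLong();
>   // encoded region name -> id of the newest open attempt the master dispatched
>   private final Map<String, Long> currentAttempt = new ConcurrentHashMap<>();
>
>   /** Called whenever the master (re)dispatches an open; the id rides on the RPC. */
>   long newAttempt(String encodedRegionName) {
>     long id = ids.incrementAndGet();
>     currentAttempt.put(encodedRegionName, id);
>     return id;
>   }
>
>   /** Called when an RS reports "opened", echoing back the id it was given. */
>   Verdict onRegionOpened(String encodedRegionName, long reportedId) {
>     Long expected = currentAttempt.get(encodedRegionName);
>     // A report for a superseded attempt means the region is open on a server the
>     // master no longer considers the owner: it must be closed, not just ignored.
>     return (expected != null && expected == reportedId)
>         ? Verdict.ACCEPT_OPEN : Verdict.STALE_MUST_CLOSE;
>   }
> }
> {noformat}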
> {noformat}
> 2019-02-08 03:21:31,715 INFO  [PEWorker-7] procedure.MasterProcedureScheduler: Took xlock for pid=260626, ppid=260595, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; TransitRegionStateProcedure table=table, region=d0214809147e43dc6870005742d5d204, ASSIGN
> 2019-02-08 03:21:31,758 INFO  [PEWorker-7] assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; TransitRegionStateProcedure table=table, region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPEN, location=server1,17020,1549567999303; forceNewPlan=false, retain=true
> 2019-02-08 03:21:31,984 INFO  [PEWorker-13] assignment.RegionStateStore: pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, regionState=OPENING, regionLocation=server1,17020,1549623714617
> 2019-02-08 03:22:32,552 WARN  [RSProcedureDispatcher-pool4-t3451] assignment.RegionRemoteProcedureBase: The remote operation pid=260637, ppid=260626, state=RUNNABLE, hasLock=false; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region ... to server server1,17020,1549623714617 failed
> java.io.IOException: Call to server1/...:17020 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, waitTime=60145, rpcTimeout=60000
>         at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:185)
>         at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)
> ...
> Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=27191, waitTime=60145, rpcTimeout=60000
>         at org.apache.hadoop.hbase.ipc.RpcConnection$1.run(RpcConnection.java:200)
>         ... 4 more
> {noformat}
> RS:
> {noformat}
> hbase-regionserver.log:2019-02-08 03:22:41,131 INFO  [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: Open ...d0214809147e43dc6870005742d5d204.
> ...
> hbase-regionserver.log:2019-02-08 03:25:44,751 INFO  [RS_OPEN_REGION-regionserver/server1:17020-2] handler.AssignRegionHandler: Opened ...d0214809147e43dc6870005742d5d204.
> {noformat}
> Retry:
> {noformat}
> 2019-02-08 03:22:32,967 INFO  [PEWorker-6] assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; pid=260626, ppid=260595, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; TransitRegionStateProcedure table=table, region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, location=server1,17020,1549623714617
> 2019-02-08 03:22:33,084 INFO  [PEWorker-6] assignment.TransitRegionStateProcedure: Starting pid=260626, ppid=260595, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; TransitRegionStateProcedure table=table, region=d0214809147e43dc6870005742d5d204, ASSIGN; rit=OPENING, location=null; forceNewPlan=true, retain=false
> 2019-02-08 03:22:33,238 INFO  [PEWorker-7] assignment.RegionStateStore: pid=260626 updating hbase:meta row=d0214809147e43dc6870005742d5d204, regionState=OPENING, regionLocation=server2,17020,1549569075319
> {noformat}
> The ignore message:
> {noformat}
> 2019-02-08 03:25:44,754 WARN  [RpcServer.default.FPBQ.Fifo.handler=34,queue=4,port=17000] assignment.TransitRegionStateProcedure: Received report OPENED transition from server1,17020,1549623714617 for rit=OPENING, location=server2,17020,1549569075319, table=table, region=d0214809147e43dc6870005742d5d204, pid=260626 but the region is not on it, should be a retry, ignore
> {noformat}
> The 2nd assignment
> {noformat}
> 2019-02-08 03:26:18,915 INFO  [PEWorker-7] procedure2.ProcedureExecutor: Finished subprocedure(s) of pid=260626, ppid=260595, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, hasLock=true; TransitRegionStateProcedure table=table, region=d0214809147e43dc6870005742d5d204, ASSIGN; resume parent processing.
> 2019-02-08 03:26:18,971 INFO  [PEWorker-11] procedure2.ProcedureExecutor: Finished pid=260626, ppid=260595, state=SUCCESS, hasLock=false; TransitRegionStateProcedure table=table, region=d0214809147e43dc6870005742d5d204, ASSIGN in 4mins, 47.347sec
> {noformat}
> ======= by Duo Zhang =======
> The actual problem here is that in IPCUtil.wrapException we want to add the remote server address to the exception message to make debugging easier, and there are several instanceof checks in it to keep the original exception type, since the upper layer may depend on the exception type for error recovery. But CallTimeoutException is not checked for in this method, so it gets wrapped in a plain IOException, which breaks the code in RSProcedureDispatcher and causes the double assign.
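> To illustrate the wrapping pattern described above (a simplified sketch, not the real IPCUtil source; only CallTimeoutException and IOException correspond to the actual classes involved): the instanceof chain preserves known exception types, but any type not listed, which before the fix included CallTimeoutException, falls through to a generic IOException wrapper, so RSProcedureDispatcher can no longer recognize it.
> {noformat}
> // Simplified illustration of the wrapping pattern, not the actual IPCUtil code.
> // CallTimeoutException is defined locally here as a stand-in for
> // org.apache.hadoop.hbase.ipc.CallTimeoutException.
> import java.io.IOException;
> import java.net.ConnectException;
>
> class CallTimeoutException extends IOException {
>   CallTimeoutException(String msg) { super(msg); }
> }
>
> final class WrapExceptionSketch {
>   static IOException wrapException(String addr, Throwable error) {
>     if (error instanceof ConnectException) {
>       // known type: re-create it with the server address so callers can still
>       // match on the exception type for error recovery
>       return (IOException) new ConnectException("Call to " + addr + " failed: " + error).initCause(error);
>     }
>     // CallTimeoutException is not checked, so it falls through here and the
>     // caller only sees a plain IOException -- the type information is lost.
>     return new IOException("Call to " + addr + " failed on local exception: " + error, error);
>   }
>
>   public static void main(String[] args) {
>     IOException wrapped =
>         wrapException("server1:17020", new CallTimeoutException("Call id=27191, waitTime=60145"));
>     System.out.println(wrapped instanceof CallTimeoutException); // false: the dispatcher can't tell
>   }
> }
> {noformat}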


