Posted to issues@hbase.apache.org by "Andrew Purtell (JIRA)" <ji...@apache.org> on 2018/12/01 01:44:00 UTC
[jira] [Comment Edited] (HBASE-21464) Splitting blocked with meta NSRE during split transaction

    [ https://issues.apache.org/jira/browse/HBASE-21464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16705565#comment-16705565 ] 

Andrew Purtell edited comment on HBASE-21464 at 12/1/18 1:43 AM:
-----------------------------------------------------------------

I don't think recursive region relocation works the way we all expect, namely that when we get an NSRE on meta we always end up in ConnectionManager#locateMeta with useCache = false. Taken as a whole, the recursive region relocation code is hard to understand and should be rewritten. I'm not going to do that today. What I do have is a patch that reliably fixes the issue in my test environment when meta is moved during split activity, while preserving the intents of HBASE-10785 (don't overload ZooKeeper by looking up meta every time) and HBASE-19260 (don't overload ZooKeeper with unnecessary concurrent lookups). There is a new limit on cache entry age for meta, hardcoded to 10 seconds (should it be configurable? I don't think it matters much...), to prevent getting stuck on a stale meta location. Consider it a safety valve we need while we continue to look at this problem.
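
For readers who want to see the shape of the age limit described above, here is a minimal Java sketch. It is not the attached patch and does not use HBase's real client classes: the class name AgeLimitedLocationCache, the locate method, the injected lookup supplier (standing in for the expensive ZooKeeper lookup), and the injected clock are all hypothetical names chosen for illustration. Only the 10 second limit and the useCache flag come from the comment itself.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical illustration of an age-limited location cache; not HBase code. */
final class AgeLimitedLocationCache {
    /** Hardcoded limit on cache entry age, mirroring the 10 seconds mentioned above. */
    static final long MAX_AGE_NANOS = TimeUnit.SECONDS.toNanos(10);

    private final Supplier<String> lookup;   // assumed stand-in for the ZooKeeper lookup
    private final LongSupplier nanoClock;    // injected so the behavior can be tested
    private String cached;                   // last known location, or null
    private long cachedAtNanos;              // clock reading when 'cached' was stored

    AgeLimitedLocationCache(Supplier<String> lookup, LongSupplier nanoClock) {
        this.lookup = lookup;
        this.nanoClock = nanoClock;
    }

    /**
     * Returns the cached location when the caller allows it and the entry is
     * younger than MAX_AGE_NANOS; otherwise performs a fresh lookup.
     * Synchronized so that concurrent callers wait for one lookup and then
     * find a fresh entry instead of each issuing their own.
     */
    synchronized String locate(boolean useCache) {
        long now = nanoClock.getAsLong();
        boolean fresh = cached != null && (now - cachedAtNanos) < MAX_AGE_NANOS;
        if (useCache && fresh) {
            return cached;
        }
        // Entry is missing, too old, or the caller asked to bypass the cache.
        cached = lookup.get();
        cachedAtNanos = now;
        return cached;
    }
}
```

With this shape, a caller that keeps passing useCache = true can be pointed at a stale location for at most 10 seconds before the next lookup refreshes it, while lookups still happen far less often than once per call.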

How to reproduce:
 * Run a load client. I use YCSB with 100 threads. The test table is named "test".
 * In the HBase shell: while true ; do sleep 30 ; balancer ; flush 'test'; compact 'test' ; split 'test' ; balancer ; done

You've hit the problem when the result of the shell 'balancer' command is always false. On the master, you'll find a split in progress that can't finish. On the regionserver attempting the split, you'll find the split worker going back again and again, looking for meta, to the regionserver that no longer hosts it.


was (Author: apurtell):
I don't think recursive region relocation works the way we are all expecting, that when we NSRE on meta we will always end up in ConnectionManager#locateRegion with useCache = false. The sum of recursive region relocation code is hard to understand and should be rewritten. I'm not going to do that today. What I do have is a patch that works reliably to fix the issue in my test environment when meta is moved during split activity while preserving the intents of HBASE-10785 (don't overload zookeeper with lookups by looking up meta every time) and HBASE-19260 (don't overload zookeeper with unnecessary concurrent lookups). There is a new limit on cache entry age for meta, hardcoded to 10 seconds (should it be configurable? I don't think it matters much...), to prevent getting stuck on a stale meta location. Consider it a safety valve we need while continuing to look at this problem.

How to reproduce:
 * Run a load client. I use YCSB with 100 threads. The test table is named "test".
 * In the HBase shell: while true ; do sleep 30 ; balancer ; flush 'test'; compact 'test' ; split 'test' ; balancer ; done

You've hit the problem when the result of the shell 'balancer' command is always false. Go to the master, you'll find a split in progress that can't finish. Go to the regionserver attempting the split and you'll find the split worker going back again and again to the regionserver no longer hosting meta looking for meta.

> Splitting blocked with meta NSRE during split transaction
> ---------------------------------------------------------
>
>                 Key: HBASE-21464
>                 URL: https://issues.apache.org/jira/browse/HBASE-21464
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.5.0, 1.4.3, 1.4.4, 1.4.5, 1.4.6, 1.4.8, 1.4.7
>            Reporter: Andrew Purtell
>            Assignee: Andrew Purtell
>            Priority: Blocker
>             Fix For: 1.5.0, 1.4.9
>
>         Attachments: HBASE-21464-branch-1.patch, HBASE-21464-branch-1.patch, HBASE-21464-branch-1.patch, HBASE-21464-branch-1.patch
>
>
> Splitting is blocked during split transaction. The split worker is trying to update meta but isn't able to relocate it after NSRE:
> {noformat}
> 2018-11-09 17:50:45,277 INFO  [regionserver/ip-172-31-5-92.us-west-2.compute.internal/172.31.5.92:8120-splits-1541785709434] client.RpcRetryingCaller: Call exception, tries=13, retries=350, started=88590 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.NotServingRegionException: Region hbase:meta,,1 is not online on ip-172-31-13-83.us-west-2.compute.internal,8120,1541785618832
>      at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3088)
>         at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1271)
>         at org.apache.hadoop.hbase.regionserver.RSRpcServices.execService(RSRpcServices.java:2198)
>         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36617)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2396)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)row 'test,,1541785709452.5ba6596f0050c2dab969d152829227c6.44' on table 'hbase:meta' at region=hbase:meta,1.1588230740, hostname=ip-172-31-15-225.us-west-2.compute.internal,8120,1541785640586, seqNum=0{noformat}
> Clients, in this case YCSB, are hung with part of the keyspace missing:
> {noformat}
> 2018-11-09 17:51:06,033 DEBUG [hconnection-0x5739e567-shared--pool1-t165] client.ConnectionManager$HConnectionImplementation: locateRegionInMeta parentTable=hbase:meta, metaLocation=, attempt=14 of 35 failed; retrying after sleep of 20158 because: No server address listed in hbase:meta for region test,user307326104267982763,1541785754600.ef90030b05cb02305b75e9bfbc3ee081. containing row user3301635648728421323{noformat}
> Balancing is blocked indefinitely because the split transaction is stuck:
> {noformat}
> 2018-11-09 17:49:55,478 DEBUG [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=8100] master.HMaster: Not running balancer because 3 region(s) in transition: [{ef90030b05cb02305b75e9bfbc3ee081 state=SPLITTING_NEW, ts=1541785754606, server=ip-172-31-5-92.us-west-2.compute.internal,8120,1541785626417}, {5ba6596f0050c2dab969d152829227c6 state=SPLITTING, ts=1541785754606, server=ip-172-31-5-92.us-west-2.compute....{noformat}
>  


