Posted to issues@hbase.apache.org by "Andrew Purtell (JIRA)" <ji...@apache.org> on 2018/11/14 02:07:00 UTC
[jira] [Commented] (HBASE-21464) Splitting blocked with meta NSRE during split transaction

    [ https://issues.apache.org/jira/browse/HBASE-21464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16686000#comment-16686000 ] 

Andrew Purtell commented on HBASE-21464:
----------------------------------------

On the regionserver doing the split, we reach the point where it's time to update meta, at 18:07:35:
{noformat}
2018-11-09 18:07:35,704 DEBUG [regionserver/ip-172-31-13-83.us-west-2.compute.internal/172.31.13.83:8120-splits-1541786530557] regionserver.SplitTransaction: Split storefiles for \
region test,user4112339446054425864,1541786730764.2802f0bfbe9e7d88d530c16539f95cfd. Daughter A: 6 storefiles, Daughter B: 6 storefiles.

...

2018-11-09 18:07:35,757 DEBUG [regionserver/ip-172-31-13-83.us-west-2.compute.internal/172.31.13.83:8120-splits-1541786530557] ipc.BlockingRpcConnection: Connecting to ip-172-31-5-92.us-west-2.compute.internal/172.31.5.92:8120

...

2018-11-09 18:08:14,168 INFO  [regionserver/ip-172-31-13-83.us-west-2.compute.internal/172.31.13.83:8120-splits-1541786530557] client.RpcRetryingCaller: Call exception, tries=10, retries=350, started=38412 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.NotServingRegionException: Region hbase:meta,,1 is not online on ip-172-31-5-92.us-west-2.compute.internal,8120,1541786481463{noformat}
However, META had recently moved. It was no longer on ip-172-31-5-92; it had moved to ip-172-31-15-225 about a minute earlier, at 18:06:24.

From the master:
{noformat}
2018-11-09 18:06:24,690 DEBUG [AM.ZK.Worker-pool5-t64] master.AssignmentManager: Znode hbase:meta,,1.1588230740 deleted, state: {1588230740 state=OPEN, ts=1541786784688, server=ip-172-31-15-225.us-west-2.compute.internal,8120,1541786485409}
{noformat}
From regionserver ip-172-31-15-225:
{noformat}
2018-11-09 18:06:24,686 DEBUG [PostOpenDeployTasks:1588230740] regionserver.HRegionServer: Finished post open deploy task for hbase:meta,,1.1588230740

...

2018-11-09 18:06:24,688 DEBUG [RS_OPEN_META-ip-172-31-15-225:8120-0] handler.OpenRegionHandler: Opened hbase:meta,,1.1588230740 on ip-172-31-15-225.us-west-2.compute.internal,8120,1541786485409
{noformat}
The stuck split happens about a minute later, after META has been redeployed and is live on ip-172-31-15-225.
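To make the timeline concrete, here is a small sketch that derives when the retrying call started from the "started=38412 ms ago" field in the RpcRetryingCaller line, and how long META had already been open on ip-172-31-15-225 by then. The class and method names are made up for illustration; only the timestamps and the 38412 ms value come from the logs above.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class SplitTimeline {
    // Timestamp layout used in the log excerpts, e.g. "2018-11-09 18:08:14,168".
    static final DateTimeFormatter LOG_TS =
        DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss,SSS");

    static LocalDateTime parse(String ts) {
        return LocalDateTime.parse(ts, LOG_TS);
    }

    // When the retrying call began: log time minus the "started=... ms ago" value.
    static LocalDateTime callStart(String retryLogTs, long startedMsAgo) {
        return parse(retryLogTs).minus(Duration.ofMillis(startedMsAgo));
    }

    // Milliseconds from the moment META opened to the start of the retrying call.
    static long staleForMillis(String metaOpenedTs, LocalDateTime callStart) {
        return Duration.between(parse(metaOpenedTs), callStart).toMillis();
    }

    public static void main(String[] args) {
        // Values taken from the RpcRetryingCaller and OpenRegionHandler lines.
        LocalDateTime start = callStart("2018-11-09 18:08:14,168", 38412);
        long stale = staleForMillis("2018-11-09 18:06:24,688", start);
        System.out.println("call started at " + start.format(LOG_TS));
        System.out.println("META had been open elsewhere for " + stale + " ms");
    }
}
```

The computed start is 18:07:35,756, one millisecond before the "Connecting to ip-172-31-5-92" line, and 71068 ms after OpenRegionHandler reported META open on ip-172-31-15-225. So the very first attempt already went to the old location.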

The relevant code attempting the update is in SplitTransactionImpl.
{code:java}
    if (!testing && useZKForAssignment) {
      if (metaEntries == null || metaEntries.isEmpty()) {
        MetaTableAccessor.splitRegion(server.getConnection(),
          parent.getRegionInfo(), daughterRegions.getFirst().getRegionInfo(),
          daughterRegions.getSecond().getRegionInfo(), server.getServerName(),
          parent.getTableDesc().getRegionReplication());
      } else {
        offlineParentInMetaAndputMetaEntries(server.getConnection(),
          parent.getRegionInfo(), daughterRegions.getFirst().getRegionInfo(), daughterRegions
              .getSecond().getRegionInfo(), server.getServerName(), metaEntries,
              parent.getTableDesc().getRegionReplication());
      }
{code}
(Not relevant here: this branch notifies the master directly when ZK-less assignment is in use.)
{code:java}
    } else if (services != null && !useZKForAssignment) {
      if (!services.reportRegionStateTransition(TransitionCode.SPLIT_PONR,
          parent.getRegionInfo(), hri_a, hri_b)) {
        // Passed PONR, let SSH clean it up
        throw new IOException("Failed to notify master that split passed PONR: "
          + parent.getRegionInfo().getRegionNameWithoutKeyAsString());
      }
    }
{code}
So we either call MetaTableAccessor.splitRegion here or MetaTableAccessor.mutateMetaTable via offlineParentInMetaAndputMetaEntries.

The question is why the connection (acquired by server.getConnection()) used by MetaTableAccessor is not relocating META after the NSRE.
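As a purely illustrative toy model of the symptom (this is not HBase's ConnectionManager code, and the root cause is not yet known at this point), the sketch below contrasts a client that drops its cached location after a "not serving" answer with one that keeps retrying the stale entry. All class, field, and method names are hypothetical; the host names and the 350-try budget are borrowed from the logs above.

```java
import java.util.function.Supplier;

class StaleLocationRetryDemo {
    /** Toy client that caches the location of one region and retries calls. */
    static class ToyClient {
        private final Supplier<String> authoritativeLookup; // stands in for a fresh META lookup
        private final boolean invalidateOnFailure;
        private String cachedLocation;
        int attempts;

        ToyClient(Supplier<String> authoritativeLookup, String initialCachedLocation,
                  boolean invalidateOnFailure) {
            this.authoritativeLookup = authoritativeLookup;
            this.cachedLocation = initialCachedLocation;
            this.invalidateOnFailure = invalidateOnFailure;
        }

        /** Returns true if some attempt reached the host actually serving the region. */
        boolean call(String actualHost, int maxTries) {
            for (int i = 0; i < maxTries; i++) {
                attempts++;
                if (cachedLocation == null) {
                    cachedLocation = authoritativeLookup.get(); // relocate
                }
                if (cachedLocation.equals(actualHost)) {
                    return true; // the server hosts the region, so the call succeeds
                }
                // Otherwise the server answers "region is not online" (the NSRE case).
                if (invalidateOnFailure) {
                    cachedLocation = null; // force a fresh lookup on the next try
                }
            }
            return false;
        }
    }

    public static void main(String[] args) {
        String oldHost = "ip-172-31-5-92";
        String newHost = "ip-172-31-15-225";

        ToyClient relocating = new ToyClient(() -> newHost, oldHost, true);
        System.out.println("relocating: ok=" + relocating.call(newHost, 350)
            + " attempts=" + relocating.attempts);

        ToyClient stuck = new ToyClient(() -> newHost, oldHost, false);
        System.out.println("stuck: ok=" + stuck.call(newHost, 350)
            + " attempts=" + stuck.attempts);
    }
}
```

In this model the relocating client succeeds on its second attempt, while the one that never invalidates its cache burns the whole retry budget against the old host, which is the pattern the split worker's log shows.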

There is one change to MetaTableAccessor between 1.4.2, which does not reproduce the problem, and 1.4.3, which does, but looking at it I can't see how it would be related.
{noformat}
commit 0d8fee2158e08bc6d0907d4abbe1215eaded6ce3
Author: Pankaj Kumar <pa...@huawei.com>
Date:   Thu Dec 7 22:51:01 2017 +0530

    HBASE-19364, Truncate_preserve fails with table when replica region > 1
{noformat}
Looking at ConnectionManager now.

> Splitting blocked with meta NSRE during split transaction
> ---------------------------------------------------------
>
>                 Key: HBASE-21464
>                 URL: https://issues.apache.org/jira/browse/HBASE-21464
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.5.0, 1.4.3, 1.4.4, 1.4.5, 1.4.6, 1.4.8, 1.4.7
>            Reporter: Andrew Purtell
>            Priority: Blocker
>             Fix For: 1.5.0, 1.4.9
>
>
> Splitting is blocked during split transaction. The split worker is trying to update meta but isn't able to relocate it after NSRE:
> {noformat}
> 2018-11-09 17:50:45,277 INFO  [regionserver/ip-172-31-5-92.us-west-2.compute.internal/172.31.5.92:8120-splits-1541785709434] client.RpcRetryingCaller: Call exception, tries=13, retries=350, started=88590 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.NotServingRegionException: Region hbase:meta,,1 is not online on ip-172-31-13-83.us-west-2.compute.internal,8120,1541785618832
>      at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionByEncodedName(HRegionServer.java:3088)
>         at org.apache.hadoop.hbase.regionserver.RSRpcServices.getRegion(RSRpcServices.java:1271)
>         at org.apache.hadoop.hbase.regionserver.RSRpcServices.execService(RSRpcServices.java:2198)
>         at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:36617)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2396)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:297)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:277)row 'test,,1541785709452.5ba6596f0050c2dab969d152829227c6.44' on table 'hbase:meta' at region=hbase:meta,1.1588230740, hostname=ip-172-31-15-225.us-west-2.compute.internal,8120,1541785640586, seqNum=0{noformat}
> Clients, in this case YCSB, are hung with part of the keyspace missing:
> {noformat}
> 2018-11-09 17:51:06,033 DEBUG [hconnection-0x5739e567-shared--pool1-t165] client.ConnectionManager$HConnectionImplementation: locateRegionInMeta parentTable=hbase:meta, metaLocation=, attempt=14 of 35 failed; retrying after sleep of 20158 because: No server address listed in hbase:meta for region test,user307326104267982763,1541785754600.ef90030b05cb02305b75e9bfbc3ee081. containing row user3301635648728421323{noformat}
> Balancing is blocked indefinitely because the split transaction is stuck:
> {noformat}
> 2018-11-09 17:49:55,478 DEBUG [RpcServer.default.FPBQ.Fifo.handler=29,queue=2,port=8100] master.HMaster: Not running balancer because 3 region(s) in transition: [{ef90030b05cb02305b75e9bfbc3ee081 state=SPLITTING_NEW, ts=1541785754606, server=ip-172-31-5-92.us-west-2.compute.internal,8120,1541785626417}, {5ba6596f0050c2dab969d152829227c6 state=SPLITTING, ts=1541785754606, server=ip-172-31-5-92.us-west-2.compute....{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)