You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hbase.apache.org by "Abhishek Singh Chouhan (JIRA)" <ji...@apache.org> on 2017/02/09 07:15:42 UTC

[jira] [Commented] (HBASE-17069) RegionServer writes invalid META entries for split daughters in some circumstances

    [ https://issues.apache.org/jira/browse/HBASE-17069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15859144#comment-15859144 ] 

Abhishek Singh Chouhan commented on HBASE-17069:
------------------------------------------------

Did a bit of debugging around this. I think the issue is as follows:
- We try to split a region A into B and C. SplitTransactionImpl.execute(..) -> SplitTransactionImpl.createDaughters(..) -> MetaTableAccessor.splitRegion(..)
In MetaTableAccessor.splitRegion() we create a multimutate to add entries of B and C in meta, the puts here have hregioninfo. 
- We encounter an exception in MetaTableAccessor.multimutate(..) since we just slayed the RS hosting meta (this was the cached location)
MultiRowMutationProtos.MultiRowMutationService.BlockingInterface service =
      MultiRowMutationProtos.MultiRowMutationService.newBlockingStub(channel);
    try {
      service.mutateRows(null, mmrBuilder.build());
    } catch (ServiceException ex) {
      ProtobufUtil.toIOException(ex);
    }
- We swallow the exception here and move on and try to online the regions using SplitTransactionImpl.openDaughters(..). 
- We successfully open the region a target RS. At the time of opening in HRegionServer.postOpenDeployTasks(..) we do MetaTableAccessor.updateRegionLocation(..). This put has other entries like servername, startcode etc but doesn't have hregioninfo. This request now goes through since meta is now up again on another RS. We end up with partial info in hbase:meta and the region is online on a RS. The clients however can't access the data belonging to this region and we get "java.io.IOException: HRegionInfo was null".
The immediate possible solutions to the problem seems to be either throwing the exception rather than swallowing it, which since we were doing it after PONR would result in failure to rollback the split and eventually RS aborting, or having the hregioninfo in the put that we do during postOpen task so that we have complete information in meta even after the earlier multi fails. Throwing the exception resulting in rs aborting seems radical since the scenario of split+meta moving around can be there quite often especially in case of rolling restarts on a cluster under load. However i'm not very sure about the other fix too since we'll be in a state where we assume we have regioninfo in meta which wont be there untill the daughter opens successfully on a RS (not sure of the failure scenarios between the two events). 
I'm doing some testing with the adding hregioninfo during postOpen task part. I can put up a patch if this sounds ok. [~stack] [~apurtell] [~lhofhansl] Thoughts?

> RegionServer writes invalid META entries for split daughters in some circumstances
> ----------------------------------------------------------------------------------
>
>                 Key: HBASE-17069
>                 URL: https://issues.apache.org/jira/browse/HBASE-17069
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 1.2.4
>            Reporter: Andrew Purtell
>            Priority: Critical
>         Attachments: daughter_1_d55ef81c2f8299abbddfce0445067830.log, daughter_2_08629d59564726da2497f70451aafcdb.log, logs.tar.gz, parent-393d2bfd8b1c52ce08540306659624f2.log
>
>
> I have been seeing frequent ITBLL failures testing various versions of 1.2.x. 
> Over the lifetime of 1.2.x the following issues have been fixed:
> - HBASE-15315 (Remove always set super user call as high priority)
> - HBASE-16093 (Fix splits failed before creating daughter regions leave meta inconsistent)
> And this one is pending:
> - HBASE-17044 (Fix merge failed before creating merged region leaves meta inconsistent)
> I can apply all of the above to branch-1.2 and still see this failure: 
> *The life of stillborn region d55ef81c2f8299abbddfce0445067830*
> *Master sees SPLITTING_NEW*
> {noformat}
> 2016-11-08 04:23:21,186 INFO  [AM.ZK.Worker-pool2-t82] master.RegionStates: Transition null to {d55ef81c2f8299abbddfce0445067830 state=SPLITTING_NEW, ts=1478579001186, server=node-3.cluster,16020,1478578389506}
> {noformat}
> *The RegionServer creates it*
> {noformat}
> 2016-11-08 04:23:26,035 INFO  [StoreOpener-d55ef81c2f8299abbddfce0445067830-1] hfile.CacheConfig: Created cacheConfig for GomnU: blockCache=LruBlockCache{blockCount=34, currentSize=14996112, freeSize=12823716208, maxSize=12838712320, heapSize=14996112, minSize=12196776960, minFactor=0.95, multiSize=6098388480, multiFactor=0.5, singleSize=3049194240, singleFactor=0.25}, cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false
> 2016-11-08 04:23:26,038 INFO  [StoreOpener-d55ef81c2f8299abbddfce0445067830-1] hfile.CacheConfig: Created cacheConfig for big: blockCache=LruBlockCache{blockCount=34, currentSize=14996112, freeSize=12823716208, maxSize=12838712320, heapSize=14996112, minSize=12196776960, minFactor=0.95, multiSize=6098388480, multiFactor=0.5, singleSize=3049194240, singleFactor=0.25}, cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false
> 2016-11-08 04:23:26,442 INFO  [StoreOpener-d55ef81c2f8299abbddfce0445067830-1] hfile.CacheConfig: Created cacheConfig for meta: blockCache=LruBlockCache{blockCount=63, currentSize=17187656, freeSize=12821524664, maxSize=12838712320, heapSize=17187656, minSize=12196776960, minFactor=0.95, multiSize=6098388480, multiFactor=0.5, singleSize=3049194240, singleFactor=0.25}, cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false
> 2016-11-08 04:23:26,713 INFO  [StoreOpener-d55ef81c2f8299abbddfce0445067830-1] hfile.CacheConfig: Created cacheConfig for nwmrW: blockCache=LruBlockCache{blockCount=96, currentSize=19178440, freeSize=12819533880, maxSize=12838712320, heapSize=19178440, minSize=12196776960, minFactor=0.95, multiSize=6098388480, multiFactor=0.5, singleSize=3049194240, singleFactor=0.25}, cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false
> 2016-11-08 04:23:26,715 INFO  [StoreOpener-d55ef81c2f8299abbddfce0445067830-1] hfile.CacheConfig: Created cacheConfig for piwbr: blockCache=LruBlockCache{blockCount=96, currentSize=19178440, freeSize=12819533880, maxSize=12838712320, heapSize=19178440, minSize=12196776960, minFactor=0.95, multiSize=6098388480, multiFactor=0.5, singleSize=3049194240, singleFactor=0.25}, cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false
> 2016-11-08 04:23:26,717 INFO  [StoreOpener-d55ef81c2f8299abbddfce0445067830-1] hfile.CacheConfig: Created cacheConfig for tiny: blockCache=LruBlockCache{blockCount=96, currentSize=19178440, freeSize=12819533880, maxSize=12838712320, heapSize=19178440, minSize=12196776960, minFactor=0.95, multiSize=6098388480, multiFactor=0.5, singleSize=3049194240, singleFactor=0.25}, cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false
> {noformat}
> *The RegionServer onlines it*
> {noformat}
> 2016-11-08 04:23:27,015 INFO  [node-3.cluster,16020,1478578389506-daughterOpener=d55ef81c2f8299abbddfce0445067830] regionserver.HRegion: Onlined d55ef81c2f8299abbddfce0445067830; next sequenceid=19184
> 2016-11-08 04:23:27,029 INFO  [regionserver/node-3.cluster/192.168.124.4:16020-splits-1478579001099] regionserver.HRegionServer: Post open deploy tasks for IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830.
> 2016-11-08 04:23:27,047 INFO  [regionserver/node-3.cluster/192.168.124.4:16020-splits-1478579001099] hbase.MetaTableAccessor: Updated row IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830. with server=node-3.cluster,16020,1478578389506
> {noformat}
> *The Master transitions state from SPLITTING_NEW to OPEN*
> {noformat}
> 2016-11-08 04:23:27,058 INFO  [AM.ZK.Worker-pool2-t84] master.RegionStates: Transition {d55ef81c2f8299abbddfce0445067830 state=SPLITTING_NEW, ts=1478579007057, server=node-3.cluster,16020,1478578389506} to {d55ef81c2f8299abbddfce0445067830 state=OPEN, ts=1478579007058, server=node-3.cluster,16020,1478578389506}
> 2016-11-08 04:23:27,059 INFO  [AM.ZK.Worker-pool2-t84] master.AssignmentManager: Handled SPLIT event; parent=IntegrationTestBigLinkedList,,1478577020916.393d2bfd8b1c52ce08540306659624f2., daughter a=IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830., daughter b=IntegrationTestBigLinkedList,/\xFB\x14,1478579001155.08629d59564726da2497f70451aafcdb., on node-3.cluster,16020,1478578389506
> {noformat}
> *RegionServer updates META  - BUT APPARENTLY NOT CORRECTLY*
> {noformat}
> 2016-11-08 04:23:27,165 INFO  [regionserver/node-3.cluster/192.168.124.4:16020-splits-1478579001099] regionserver.SplitRequest: Region split, hbase:meta updated, and report to master. Parent=IntegrationTestBigLinkedList,,1478577020916.393d2bfd8b1c52ce08540306659624f2., new regions: IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830., IntegrationTestBigLinkedList,/\xFB\x14,1478579001155.08629d59564726da2497f70451aafcdb.. Split took 6sec
> {noformat}
> *RegionServer delays flush*
> (Is this important?)
> {noformat}
> 2016-11-08 04:24:14,639 WARN  [MemStoreFlusher.0] regionserver.MemStoreFlusher: Region IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830. has too many store files; delaying flush up to 90000ms
> {noformat}
> *Immediate warnings about No serialized HRegionInfo*
> {noformat}
> 2016-11-08 04:24:44,691 WARN  [B.defaultRpcServer.handler=26,queue=2,port=16000] hbase.MetaTableAccessor: No serialized HRegionInfo in keyvalues={IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:seqnumDuringOpen/1478579007029/Put/vlen=8/seqid=0, IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:server/1478579007029/Put/vlen=20/seqid=0, IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:serverstartcode/1478579007029/Put/vlen=8/seqid=0}
> {noformat}
> *Master is not happy either*
> {noformat}
> 2016-11-08 04:24:51,148 WARN  [MASTER_TABLE_OPERATIONS-node-1:16000-0] hbase.MetaTableAccessor: No serialized HRegionInfo in keyvalues={IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:seqnumDuringOpen/1478579007029/Put/vlen=8/seqid=0, IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:server/1478579007029/Put/vlen=20/seqid=0, IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:serverstartcode/1478579007029/Put/vlen=8/seqid=0}
> {noformat}
> *TestRunner MetaScanner complains about invalid entries in META missing HRegionInfo*
> {noformat}
> (standard input):9086:2016-11-08 05:04:17,230 WARN  [B.defaultRpcServer.handler=4,queue=1,port=16000] hbase.MetaTableAccessor: No serialized HRegionInfo in keyvalues={IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:seqnumDuringOpen/1478581041080/Put/vlen=8/seqid=0, IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:server/1478581041080/Put/vlen=20/seqid=0, IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:serverstartcode/1478581041080/Put/vlen=8/seqid=0}
> {noformat}
> *ITBLL MapReduce tasks fail because part of the keyspace cannot be located:*
> {noformat}
> java.io.IOException: HRegionInfo was null in IntegrationTestBigLinkedList, row=keyvalues={IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:seqnumDuringOpen/1478581041080/Put/vlen=8/seqid=0, IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:server/1478581041080/Put/vlen=20/seqid=0, IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:serverstartcode/1478581041080/Put/vlen=8/seqid=0}
>         at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1293)
>         at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1185)
>         at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:410)
>         at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:359)
>         at org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:238)
>         at org.apache.hadoop.hbase.client.BufferedMutatorImpl.mutate(BufferedMutatorImpl.java:154)
>         at org.apache.hadoop.hbase.client.BufferedMutatorImpl.mutate(BufferedMutatorImpl.java:121)
>         at org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Generator$GeneratorMapper.persist(IntegrationTestBigLinkedList.java:486)
>         at org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Generator$GeneratorMapper.map(IntegrationTestBigLinkedList.java:431)
>         at org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Generator$GeneratorMapper.map(IntegrationTestBigLinkedList.java:375)
>         at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>         at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:170)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1719)
>         at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:164)
> {noformat}
> {noformat}
> ./application_1478574724776_0002/container_1478574724776_0002_01_000008/syslog:920:java.io.IOException: HRegionInfo was null in IntegrationTestBigLinkedList, row=keyvalues={IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:seqnumDuringOpen/1478580288482/Put/vlen=8/seqid=0, IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:server/1478580288482/Put/vlen=20/seqid=0, IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:serverstartcode/1478580288482/Put/vlen=8/seqid=0}
> {noformat}
> {noformat}
> ./application_1478574724776_0002/container_1478574724776_0002_01_000010/syslog:920:java.io.IOException: HRegionInfo was null in IntegrationTestBigLinkedList, row=keyvalues={IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:seqnumDuringOpen/1478580288482/Put/vlen=8/seqid=0, IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:server/1478580288482/Put/vlen=20/seqid=0, IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:serverstartcode/1478580288482/Put/vlen=8/seqid=0}
> {noformat}
> {noformat}
> ./application_1478574724776_0002/container_1478574724776_0002_01_000011/syslog:909:java.io.IOException: HRegionInfo was null in IntegrationTestBigLinkedList, row=keyvalues={IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:seqnumDuringOpen/1478580288482/Put/vlen=8/seqid=0, IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:server/1478580288482/Put/vlen=20/seqid=0, IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:serverstartcode/1478580288482/Put/vlen=8/seqid=0}
> {noformat}
> {noformat}
> ./application_1478574724776_0002/container_1478574724776_0002_01_000030/syslog:48:java.io.IOException: HRegionInfo was null in IntegrationTestBigLinkedList, row=keyvalues={IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:seqnumDuringOpen/1478581041080/Put/vlen=8/seqid=0, IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:server/1478581041080/Put/vlen=20/seqid=0, IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:serverstartcode/1478581041080/Put/vlen=8/seqid=0}
> {noformat}
> {noformat}
> ./application_1478574724776_0002/container_1478574724776_0002_01_000048/syslog:48:java.io.IOException: HRegionInfo was null in IntegrationTestBigLinkedList, row=keyvalues={IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:seqnumDuringOpen/1478581041080/Put/vlen=8/seqid=0, IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:server/1478581041080/Put/vlen=20/seqid=0, IntegrationTestBigLinkedList,,1478579001155.d55ef81c2f8299abbddfce0445067830./info:serverstartcode/1478581041080/Put/vlen=8/seqid=0}
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)