Posted to dev@hbase.apache.org by "Mikhail Antonov (JIRA)" <ji...@apache.org> on 2017/01/02 23:02:58 UTC

[jira] [Created] (HBASE-17406) Occasional failed splits on branch-1.3

Mikhail Antonov created HBASE-17406:
---------------------------------------

             Summary: Occasional failed splits on branch-1.3
                 Key: HBASE-17406
                 URL: https://issues.apache.org/jira/browse/HBASE-17406
             Project: HBase
          Issue Type: Bug
          Components: Compaction, regionserver
    Affects Versions: 1.3.0
            Reporter: Mikhail Antonov
             Fix For: 1.3.1


We observed occasional (rare) failed splits on branch-1.3 builds, which might be another echo of HBASE-13082.

Essentially here's what seems to be happening:

First, the RS hosting the to-be-split parent seems to see some FileNotFoundExceptions in the logs. It could be a simple file-not-found on some scanner path:

{quote}
16/11/21 07:19:28 WARN hdfs.DFSClient: DFS Read
java.io.FileNotFoundException: File does not exist: <path to HFile>
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)

....
	at org.apache.hadoop.hbase.io.hfile.HFileBlock.readWithExtra(HFileBlock.java:733)
	at org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1461)
	at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1715)
	at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1560)
	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:454)
	at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:271)
	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:651)
	at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.reseekTo(HFileReaderV2.java:631)
	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseekAtOrAfter(StoreFileScanner.java:292)
	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:201)
	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.enforceSeek(StoreFileScanner.java:412)
	at org.apache.hadoop.hbase.regionserver.StoreFileScanner.requestSeek(StoreFileScanner.java:375)
	at org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:310)
	at org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:268)
	at org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:889)
	at org.apache.hadoop.hbase.regionserver.StoreScanner.seekToNextRow(StoreScanner.java:867)
	at org.apache.hadoop.hbase.regionserver.Stor
{quote}
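
(For what it's worth, a quick way to check whether this is the HBASE-13082-style pattern, i.e. the file was compacted away and archived while something still referenced it, is to see whether the HFile has moved to the archive directory. The paths below are placeholders assuming the default 1.x layout and would need to be adjusted for the actual cluster:)

{quote}
# is the HFile still under the region's store directory?
hdfs dfs -ls /hbase/data/<namespace>/<table>/<encoded-region-name>/<family>/<hfile-name>

# ...or has it already been moved to the archive?
hdfs dfs -ls /hbase/archive/data/<namespace>/<table>/<encoded-region-name>/<family>/<hfile-name>
{quote}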

Or it could be a warning from HFileArchiver:

{quote}
16/11/21 07:20:44 WARN backup.HFileArchiver: Failed to archive class org.apache.hadoop.hbase.backup.HFileArchiver$FileableStoreFile, <HFile path> because it does not exist! Skipping and continuing on.
java.io.FileNotFoundException: File/Directory <HFile Path> does not exist.
	at org.apache.hadoop.hdfs.server.namenode.FSDirAttrOp.setTimes(FSDirAttrOp.java:121)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setTimes(FSNamesystem.java:1910)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.setTimes(NameNodeRpcServer.java:1223)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.setTimes(ClientNamenodeProtocolServerSideTranslatorPB.java:915)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
	at or
{quote}

Then on the RS hosting the parent I'm seeing:

{quote}16/11/21 18:03:17 ERROR regionserver.HRegion: Could not initialize all stores for the region=<region name>
16/11/21 18:03:17 ERROR regionserver.HRegion: Could not initialize all stores for the <region name>
16/11/21 18:03:17 WARN ipc.Client: interrupted waiting to send rpc request to server
java.lang.InterruptedException
	at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404)
	at java.util.concurrent.FutureTask.get(FutureTask.java:191)
	at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1060)
	at org.apache.hadoop.ipc.Client.call(Client.java:1455)
	at org.apache.hadoop.ipc.Client.call(Client.java:1413)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
	at com.sun.proxy.$Proxy16.getFileInfo(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
	at sun.reflect.GeneratedMethodAccessor83.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
	at com.sun.proxy.$Proxy17.getFileInfo(Unknown Source)
	at sun.reflect.GeneratedMethodAccessor83.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:302)
	at com.sun.proxy.$Proxy18.getFileInfo(Unknown Source)
	at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2112)
	at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
	at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
	at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
	at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
	at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
	at org.apache.hadoop.hbase.regionserver.HRegionFileSystem.createStoreDir(HRegionFileSystem.java:171)
	at org.apache.hadoop.hbase.regionserver.HStore.<init>(HStore.java:224)
	at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:5185)
	at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:926)
	at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:923)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
...
16/11/21 18:03:17 FATAL regionserver.HRegionServer: ABORTING region server <server name>: Abort; we got an error after point-of-no-return
{quote}

So we've got past the PONR and aborted; then on the RSs where the daughters are to be opened I'm seeing:

{quote}
16/11/21 18:03:43 ERROR handler.OpenRegionHandler: Failed open of region= <region name>, starting to roll back the global memstore size.
java.io.IOException: java.io.IOException: java.io.FileNotFoundException: File does not exist: < HFile name>
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
	at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:588)
	at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
	at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
	at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)

	at org.apache.hadoop.hbase.regionserver.HRegion.initializeStores(HRegion.java:952)
	at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:827)
	at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:802)
	at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6708)
	at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6669)
	at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6640)
	at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6596)
	at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6547)
	at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:362)
	at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:129)
	at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
{quote}

And the regions remain offline. There's no data loss here, as the daughters never open up, and the failed split can be recovered manually using the following procedure (a sketch of the corresponding commands follows the list):

 - manually remove daughters from hbase:meta
 - move daughter region HDFS directories out of the way
 - delete the parent region from hbase:meta
 - hbck -fixMeta to add the parent region back
 - failover the active master
 - hbck -fixAssignments after master startup
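
For illustration, a rough sketch of that procedure as shell / hbase shell commands is below. All region names, paths, and the directory layout are placeholders / assumptions about a typical 1.x deployment and should be verified against the actual cluster before running anything like this:

{quote}
# 1. In the hbase shell: remove the daughter regions' rows from hbase:meta
#    (<daughter-region-name-N> = full region name of each daughter, as shown in meta)
deleteall 'hbase:meta', '<daughter-region-name-1>'
deleteall 'hbase:meta', '<daughter-region-name-2>'

# 2. On HDFS: move the daughter region directories out of the way
#    (default 1.x layout assumed: /hbase/data/<namespace>/<table>/<encoded-region-name>)
hdfs dfs -mv /hbase/data/<namespace>/<table>/<daughter-encoded-name-1> <some-scratch-dir>/
hdfs dfs -mv /hbase/data/<namespace>/<table>/<daughter-encoded-name-2> <some-scratch-dir>/

# 3. In the hbase shell: delete the parent region's row from hbase:meta
deleteall 'hbase:meta', '<parent-region-name>'

# 4. Re-create the parent's meta entry from its regioninfo on HDFS
hbase hbck -fixMeta

# 5. Fail over the active master (restart the active HMaster; exact mechanics
#    depend on the deployment), then fix assignments once the new master is up
hbase hbck -fixAssignments
{quote}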






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)