Posted to user@hbase.apache.org by anil gupta <an...@gmail.com> on 2018/02/10 16:13:33 UTC

[Production Impacted] Any workaround for https://issues.apache.org/jira/browse/HBASE-16464?

Hi Folks,

We are running HBase 1.1.2. It seems we are hitting
https://issues.apache.org/jira/browse/HBASE-16464 in our production
cluster. Our oldwals folder has grown to 9.5 TB. I am aware that this is
fixed in releases after 2016, but unfortunately we need to operate this
production cluster for a few more months. (We are already migrating to a
newer version of HBase.)

I have verified that we don't have any snapshots in this cluster. Also, we
removed all the replication_peers from that cluster. We already restarted
the HBase master a few days ago, but it didn't help. We have TBs of oldwals
and tens of thousands of recovered-edits files (I assume recovered.edits
files are also cleaned up by the cleaner chore). It seems the problem
started happening around mid-December, but at that time we didn't make any
major changes on this cluster.

Is there a workaround for HBASE-16464? Are there any references to those
deleted snapshots left in HDFS or ZooKeeper? If yes, how can I clean them
up?
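
In case it helps, this is roughly what I had in mind for checking the HDFS
side (just a rough, untested sketch against the Hadoop FileSystem API; the
/apps/hbase/data paths are simply our layout, and the znodes I was planning
to browse separately with "hbase zkcli"):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CheckSnapshotLeftovers {
      public static void main(String[] args) throws Exception {
        // Assumes hbase.rootdir is /apps/hbase/data (our layout); adjust as needed.
        FileSystem fs = FileSystem.get(new Configuration());

        // Anything still listed under .hbase-snapshot (including .tmp) is a
        // snapshot reference the cleaner will try to honor.
        Path snapshotDir = new Path("/apps/hbase/data/.hbase-snapshot");
        for (FileStatus s : fs.listStatus(snapshotDir)) {
          System.out.println("snapshot entry: " + s.getPath());
        }

        // Total size of the old WALs directory (oldWALs in our tree).
        Path oldWals = new Path("/apps/hbase/data/oldWALs");
        System.out.println("oldWALs bytes: "
            + fs.getContentSummary(oldWals).getLength());
      }
    }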

I keep on seeing this in HMaster logs:
2018-02-07 09:10:08,514 ERROR [hdpmaster6.bigdataprod1.wh.truecarcorp.com,60000,1517601353645_ChoreService_3] snapshot.SnapshotHFileCleaner: Exception while checking if files were valid, keeping them just in case.
org.apache.hadoop.hbase.snapshot.CorruptedSnapshotException: Couldn't read snapshot info from:hdfs://PRODNN/apps/hbase/data/.hbase-snapshot/.tmp/LEAD_SALES-1517979610/.snapshotinfo
    at org.apache.hadoop.hbase.snapshot.SnapshotDescriptionUtils.readSnapshotInfo(SnapshotDescriptionUtils.java:313)
    at org.apache.hadoop.hbase.snapshot.SnapshotReferenceUtil.getHFileNames(SnapshotReferenceUtil.java:328)
    at org.apache.hadoop.hbase.master.snapshot.SnapshotHFileCleaner$1.filesUnderSnapshot(SnapshotHFileCleaner.java:85)
    at org.apache.hadoop.hbase.master.snapshot.SnapshotFileCache.getSnapshotsInProgress(SnapshotFileCache.java:303)
    at org.apache.hadoop.hbase.master.snapshot.SnapshotFileCache.getUnreferencedFiles(SnapshotFileCache.java:194)
    at org.apache.hadoop.hbase.master.snapshot.SnapshotHFileCleaner.getDeletableFiles(SnapshotHFileCleaner.java:62)
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteFiles(CleanerChore.java:233)
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteEntries(CleanerChore.java:157)
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteDirectory(CleanerChore.java:180)
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteEntries(CleanerChore.java:149)
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteDirectory(CleanerChore.java:180)
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteEntries(CleanerChore.java:149)
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteDirectory(CleanerChore.java:180)
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteEntries(CleanerChore.java:149)
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteDirectory(CleanerChore.java:180)
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteEntries(CleanerChore.java:149)
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteDirectory(CleanerChore.java:180)
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteEntries(CleanerChore.java:149)
    at org.apache.hadoop.hbase.master.cleaner.CleanerChore.chore(CleanerChore.java:124)
    at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:185)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.FileNotFoundException: File does not exist: /apps/hbase/data/.hbase-snapshot/.tmp/LEAD_SALES-1517979610/.snapshotinfo
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
    at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:652)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)

    at sun.reflect.GeneratedConstructorAccessor22.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106)
    at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:73)
    at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:1242)
    at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1227)
    at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:1215)
    at org.apache.hadoop.hdfs.DFSInputStream.fetchLocatedBlocksAndGetLastBlockLength(DFSInputStream.java:303)
    at org.apache.hadoop.hdfs.DFSInputStream.openInfo(DFSInputStream.java:269)
    at org.apache.hadoop.hdfs.DFSInputStream.<init>(DFSInputStream.java:261)
    at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1540)
    at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:303)
    at org.apache.hadoop.hdfs.DistributedFileSystem$3.doCall(DistributedFileSystem.java:299)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:299)
    at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:767)
    at org.apache.hadoop.hbase.snapshot.SnapshotDescriptionUtils.readSnapshotInfo(SnapshotDescriptionUtils.java:306)
    ... 26 more


-- 
Thanks & Regards,
Anil Gupta

Re: [Production Impacted] Any workaround for https://issues.apache.org/jira/browse/HBASE-16464?

Posted by Ted Yu <yu...@gmail.com>.
bq. Apache projects are supposed to encourage collaboration

I totally agree with you.

Cheers

On Sat, Feb 10, 2018 at 10:32 AM, anil gupta <an...@gmail.com> wrote:

> Thanks Ted. Will try to do the clean-up. Unfortunately, we ran out of
> support for this cluster since it's nearing end-of-life. For our new
> clusters we are in the process of getting support.
>
> PS: IMO, I agree that I should use the vendor forum/list for any
> vendor-specific stuff, but I think it's appropriate to use this mailing
> list for Apache HBase questions/issues. As per my understanding, Apache
> projects are supposed to encourage collaboration rather than building
> boundaries around vendors. ("Collaboration and openness" is one of the
> reasons I like Apache projects.)
>
> On Sat, Feb 10, 2018 at 10:11 AM, Ted Yu <yu...@gmail.com> wrote:
>
> > You can clean up the oldwal directory, beginning with the oldest data.
> >
> > Please open a support case with the vendor.
> >
> > On Sat, Feb 10, 2018 at 10:02 AM, anil gupta <an...@gmail.com>
> > wrote:
> >
> > > Hi Ted,
> > >
> > > We cleaned up all the snapshots around Feb 7-8th. You were right that I
> > > don't see the CorruptedSnapshotException since then. Nice observation!
> > > So, I am back to square one again. Not really sure why oldwals and
> > > recovered.edits are not getting cleaned up. I have already removed all
> > > the replication peers and deleted all the snapshots.
> > > Is it OK if I just go ahead and clean up the oldwal directory manually?
> > > Can I also clean up recovered.edits?
> > >
> > > Thanks,
> > > Anil
> > >
> > > On Sat, Feb 10, 2018 at 9:37 AM, Ted Yu <yu...@gmail.com> wrote:
> > >
> > > > Can you clarify whether /apps/hbase/data/.hbase-snapshot/.tmp/
> > > > became empty after 2018-02-07 09:10:08?
> > > >
> > > > Do you see CorruptedSnapshotException for files outside of
> > > > /apps/hbase/data/.hbase-snapshot/.tmp/?
> > > >
> > > > Cheers
> > > >
> > >
> > >
> > >
> > > --
> > > Thanks & Regards,
> > > Anil Gupta
> > >
> >
>
>
>
> --
> Thanks & Regards,
> Anil Gupta
>

Re: [Production Impacted] Any workaround for https://issues.apache.org/jira/browse/HBASE-16464?

Posted by anil gupta <an...@gmail.com>.
Thanks Ted. Will try to do the clean-up. Unfortunately, we ran out of
support for this cluster since it's nearing end-of-life. For our new
clusters we are in the process of getting support.

PS: IMO, I agree that I should use the vendor forum/list for any
vendor-specific stuff, but I think it's appropriate to use this mailing
list for Apache HBase questions/issues. As per my understanding, Apache
projects are supposed to encourage collaboration rather than building
boundaries around vendors. ("Collaboration and openness" is one of the
reasons I like Apache projects.)

On Sat, Feb 10, 2018 at 10:11 AM, Ted Yu <yu...@gmail.com> wrote:

> You can clean up the oldwal directory, beginning with the oldest data.
>
> Please open a support case with the vendor.
>
> On Sat, Feb 10, 2018 at 10:02 AM, anil gupta <an...@gmail.com>
> wrote:
>
> > Hi Ted,
> >
> > We cleaned up all the snapshots around Feb 7-8th. You were right that I
> > don't see the CorruptedSnapshotException since then. Nice observation!
> > So, I am back to square one again. Not really sure why oldwals and
> > recovered.edits are not getting cleaned up. I have already removed all
> > the replication peers and deleted all the snapshots.
> > Is it OK if I just go ahead and clean up the oldwal directory manually?
> > Can I also clean up recovered.edits?
> >
> > Thanks,
> > Anil
> >
> > On Sat, Feb 10, 2018 at 9:37 AM, Ted Yu <yu...@gmail.com> wrote:
> >
> > > Can you clarify whether /apps/hbase/data/.hbase-snapshot/.tmp/ became
> > > empty after 2018-02-07 09:10:08?
> > >
> > > Do you see CorruptedSnapshotException for files outside of
> > > /apps/hbase/data/.hbase-snapshot/.tmp/?
> > >
> > > Cheers
> > >
> >
> >
> >
> > --
> > Thanks & Regards,
> > Anil Gupta
> >
>



-- 
Thanks & Regards,
Anil Gupta

Re: [Production Impacted] Any workaround for https://issues.apache.org/jira/browse/HBASE-16464?

Posted by Ted Yu <yu...@gmail.com>.
You can clean up the oldwal directory, beginning with the oldest data.
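
Something along these lines could do it (only a sketch, assuming the
oldWALs directory sits under the /apps/hbase/data root shown in your log;
the 30-day cutoff is just an example, so verify it before deleting
anything):

    import java.util.Arrays;
    import java.util.Comparator;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TrimOldWals {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Path assumed from the layout in your log; adjust to your rootdir.
        Path oldWals = new Path("/apps/hbase/data/oldWALs");
        // Example cutoff: keep the most recent 30 days of old WALs.
        long cutoff = System.currentTimeMillis() - 30L * 24 * 60 * 60 * 1000;

        FileStatus[] files = fs.listStatus(oldWals);
        Arrays.sort(files, Comparator.comparingLong(FileStatus::getModificationTime));
        for (FileStatus f : files) {
          if (f.isFile() && f.getModificationTime() < cutoff) {
            System.out.println("deleting " + f.getPath());
            fs.delete(f.getPath(), false); // oldest first, non-recursive
          }
        }
      }
    }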

Please open a support case with the vendor.

On Sat, Feb 10, 2018 at 10:02 AM, anil gupta <an...@gmail.com> wrote:

> Hi Ted,
>
> We cleaned up all the snapshots around Feb 7-8th. You were right that I
> don't see the CorruptedSnapshotException since then. Nice observation!
> So, I am back to square one again. Not really sure why oldwals and
> recovered.edits are not getting cleaned up. I have already removed all the
> replication peers and deleted all the snapshots.
> Is it OK if I just go ahead and clean up the oldwal directory manually?
> Can I also clean up recovered.edits?
>
> Thanks,
> Anil
>
> On Sat, Feb 10, 2018 at 9:37 AM, Ted Yu <yu...@gmail.com> wrote:
>
> > Can you clarify whether /apps/hbase/data/.hbase-snapshot/.tmp/ became
> > empty after 2018-02-07 09:10:08?
> >
> > Do you see CorruptedSnapshotException for files outside of
> > /apps/hbase/data/.hbase-snapshot/.tmp/?
> >
> > Cheers
> >
>
>
>
> --
> Thanks & Regards,
> Anil Gupta
>

Re: [Production Impacted] Any workaround for https://issues.apache.org/jira/browse/HBASE-16464?

Posted by anil gupta <an...@gmail.com>.
Hi Ted,

We cleaned up all the snapshots around Feb 7-8th. You were right that I
don't see the CorruptedSnapshotException since then. Nice observation!
So, I am back to square one again. Not really sure why oldwals and
recovered.edits are not getting cleaned up. I have already removed all the
replication peers and deleted all the snapshots.
Is it OK if I just go ahead and clean up the oldwal directory manually?
Can I also clean up recovered.edits?

Thanks,
Anil

On Sat, Feb 10, 2018 at 9:37 AM, Ted Yu <yu...@gmail.com> wrote:

> Can you clarify whether /apps/hbase/data/.hbase-snapshot/.tmp/ became
> empty after 2018-02-07 09:10:08?
>
> Do you see CorruptedSnapshotException for files outside of
> /apps/hbase/data/.hbase-snapshot/.tmp/?
>
> Cheers
>



-- 
Thanks & Regards,
Anil Gupta

Re: [Production Impacted] Any workaround for https://issues.apache.org/jira/browse/HBASE-16464?

Posted by Ted Yu <yu...@gmail.com>.
Can you clarify whether /apps/hbase/data/.hbase-snapshot/.tmp/ became empty
after 2018-02-07 09:10:08?

Do you see CorruptedSnapshotException for files outside of
/apps/hbase/data/.hbase-snapshot/.tmp/?

Cheers

Re: [Production Impacted] Any workaround for https://issues.apache.org/jira/browse/HBASE-16464?

Posted by anil gupta <an...@gmail.com>.
Hi Ted,

Thanks for your reply. I read the comments on the JIRA. But in my case,
"/apps/hbase/data/.hbase-snapshot/.tmp/" is already empty, so I am not
really sure what I can sideline. Please let me know if I am missing
something.

~Anil


On Sat, Feb 10, 2018 at 8:35 AM, Ted Yu <yu...@gmail.com> wrote:

> Please see the first few review comments of HBASE-16464.
>
> You can sideline the corrupt snapshots (according to the master log).
>
> You can also contact the vendor for a HOTFIX.
>
> Cheers
>
> On Sat, Feb 10, 2018 at 8:13 AM, anil gupta <an...@gmail.com> wrote:
>
> > Hi Folks,
> >
> > We are running HBase 1.1.2. It seems we are hitting
> > https://issues.apache.org/jira/browse/HBASE-16464 in our production
> > cluster. Our oldwals folder has grown to 9.5 TB. I am aware that this is
> > fixed in releases after 2016, but unfortunately we need to operate this
> > production cluster for a few more months. (We are already migrating to a
> > newer version of HBase.)
> >
> > I have verified that we don't have any snapshots in this cluster. Also,
> > we removed all the replication_peers from that cluster. We already
> > restarted the HBase master a few days ago, but it didn't help. We have
> > TBs of oldwals and tens of thousands of recovered-edits files (I assume
> > recovered.edits files are also cleaned up by the cleaner chore). It
> > seems the problem started happening around mid-December, but at that
> > time we didn't make any major changes on this cluster.
> >
> > Is there a workaround for HBASE-16464? Are there any references to those
> > deleted snapshots left in HDFS or ZooKeeper? If yes, how can I clean
> > them up?
> >
> > I keep on seeing this in HMaster logs:
> > [stack trace snipped]
> >
> >
> > --
> > Thanks & Regards,
> > Anil Gupta
> >
>



-- 
Thanks & Regards,
Anil Gupta

Re: [Production Impacted] Any workaround for https://issues.apache.org/jira/browse/HBASE-16464?

Posted by Ted Yu <yu...@gmail.com>.
Please see the first few review comments of HBASE-16464.

You can sideline the corrupt snapshots (according to the master log).
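
If you would rather script the sidelining than move things by hand, a
rough sketch (the source path is taken from your log; the sideline target
is just an example location outside the HBase root dir):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SidelineCorruptSnapshots {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // The in-progress snapshot dir the cleaner complains about, per your log.
        Path tmpSnapshots = new Path("/apps/hbase/data/.hbase-snapshot/.tmp");
        // Example sideline target outside the HBase root dir.
        Path sideline = new Path("/tmp/hbase-sidelined-snapshots");
        fs.mkdirs(sideline);

        if (fs.exists(tmpSnapshots)) {
          for (FileStatus s : fs.listStatus(tmpSnapshots)) {
            // Move rather than delete, so the data can still be inspected later.
            fs.rename(s.getPath(), new Path(sideline, s.getPath().getName()));
          }
        }
      }
    }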

You can also contact the vendor for a HOTFIX.

Cheers

On Sat, Feb 10, 2018 at 8:13 AM, anil gupta <an...@gmail.com> wrote:

> Hi Folks,
>
> We are running HBase 1.1.2. It seems we are hitting
> https://issues.apache.org/jira/browse/HBASE-16464 in our production
> cluster. Our oldwals folder has grown to 9.5 TB. I am aware that this is
> fixed in releases after 2016, but unfortunately we need to operate this
> production cluster for a few more months. (We are already migrating to a
> newer version of HBase.)
>
> I have verified that we don't have any snapshots in this cluster. Also, we
> removed all the replication_peers from that cluster. We already restarted
> the HBase master a few days ago, but it didn't help. We have TBs of
> oldwals and tens of thousands of recovered-edits files (I assume
> recovered.edits files are also cleaned up by the cleaner chore). It seems
> the problem started happening around mid-December, but at that time we
> didn't make any major changes on this cluster.
>
> Is there a workaround for HBASE-16464? Are there any references to those
> deleted snapshots left in HDFS or ZooKeeper? If yes, how can I clean them
> up?
>
> I keep on seeing this in HMaster logs:
> [stack trace snipped]
>
>
> --
> Thanks & Regards,
> Anil Gupta
>