Posted to user@hbase.apache.org by Dave Latham <la...@davelink.net> on 2011/08/26 18:44:52 UTC
Re: corrupt WAL and Java Heap Space...
We just hit the same issue. I attached log snippets from the regionserver
and master to https://issues.apache.org/jira/browse/HBASE-4107
I was able to get the log file out of hdfs. Is there a location I can put
it back in to have it picked up?
Dave
On Fri, Jul 15, 2011 at 12:23 PM, Andy Sautins
<an...@returnpath.net>wrote:
>
> I no longer have the log. Not sure what I was thinking when I deleted it. I
> was a little too aggressive in wanting to get my fsck back to 0 corrupt
> blocks.
>
> What you say is interesting. It's more than possible that I'm
> misunderstanding what is going on.
>
> What we saw with the log file is that we could cat it, but we couldn't copy
> it (the copy would complain about a bad checksum). I know that's not hard
> data, but going by that, what you say about applying the log up until the
> last sync would make sense. What might have thrown me is that after a
> restart the logs (including the corrupt log) were still in the .logs
> folder. We did a full shutdown/restart and the following stacktrace was in
> the master logs. After this stacktrace HBase continued to start up; however,
> the logs (all logs up until the corrupt log) for the region with the
> corrupt log file were left in the .logs directory. When we removed the
> corrupt log file and restarted again, all the existing logs were removed
> after a successful restart, as I would expect.
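
The cat-works-but-copy-fails symptom is consistent with chunk-level checksumming: HDFS stores a CRC per fixed-size data chunk (io.bytes.per.checksum, 512 bytes by default), and a verified read fails on the first chunk whose stored checksum no longer matches, even though the raw bytes are still streamable. A minimal, self-contained illustration of that idea (toy code, not HDFS's actual FSInputChecker):

```java
import java.util.Arrays;
import java.util.zip.CRC32;

// Illustrative only: one CRC32 per fixed-size chunk, mimicking the
// shape of HDFS block checksums. A reader that skips verification can
// still stream the bytes; a checksummed read fails at the first chunk
// whose stored CRC is stale relative to its data.
public class ChunkChecksumDemo {
    static final int CHUNK = 512;

    static long[] checksumChunks(byte[] data) {
        int n = (data.length + CHUNK - 1) / CHUNK;
        long[] sums = new long[n];
        for (int i = 0; i < n; i++) {
            CRC32 crc = new CRC32();
            crc.update(data, i * CHUNK, Math.min(CHUNK, data.length - i * CHUNK));
            sums[i] = crc.getValue();
        }
        return sums;
    }

    // Index of the first chunk whose stored CRC mismatches the data,
    // or -1 if every chunk verifies cleanly.
    static int firstBadChunk(byte[] data, long[] storedSums) {
        long[] actual = checksumChunks(data);
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] != storedSums[i]) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] data = new byte[2048];
        Arrays.fill(data, (byte) 'x');
        long[] sums = checksumChunks(data);
        // Simulate a writer that updated chunk 3's data but died before
        // the matching checksum was persisted: the stale CRC mismatches.
        data[3 * CHUNK] = (byte) 'y';
        System.out.println("first bad chunk: " + firstBadChunk(data, sums));
    }
}
```

This is only a sketch of the failure mode being discussed; the real HDFS reader raises a ChecksumException (as in the stacktrace below) at the offending offset instead of returning an index.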
>
> So is it more likely that the error on shutdown is reasonable and that
> the log cleanup just didn't happen on startup? I suppose it makes sense not
> to remove them if there is an error, but it did throw me that the corrupt
> file as well as previous files were still in the .logs directory.
>
> 2011-07-14 18:07:45,954 ERROR org.apache.hadoop.hbase.master.MasterFileSystem: Failed splitting hdfs://hdnn.dfs.returnpath.net:8020/user/hbase/.logs/hd31.dfs.returnpath.net,60020,1309294522164
> org.apache.hadoop.fs.ChecksumException: Checksum error: /blk_-8148723766791273697:of:/user/hbase/.logs/hd31.dfs.returnpath.net,60020,1309294522164/hd31.dfs.returnpath.net%3A60020.1310675410770 at 57790464
>         at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
>         at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
>         at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:176)
>         at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:193)
>         at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
>         at org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1249)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:1899)
>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1951)
>         at java.io.DataInputStream.read(DataInputStream.java:132)
>         at java.io.DataInputStream.readFully(DataInputStream.java:178)
>         at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
>         at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1945)
>         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1845)
>         at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1891)
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:198)
>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:172)
>         at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.parseHLog(HLogSplitter.java:429)
>         at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:262)
>         at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:188)
>         at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:197)
>         at org.apache.hadoop.hbase.master.MasterFileSystem.splitLogAfterStartup(MasterFileSystem.java:181)
>         at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:384)
>         at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:283)
>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
> Sent: Friday, July 15, 2011 12:59 PM
> To: user@hbase.apache.org
> Subject: Re: corrupt WAL and Java Heap Space...
>
> I'd have expected the log to be recoverable up to the last time you
> called sync. What were you seeing? Do you have the log still? (It
> should recover to the last edit)
>
> St.Ack
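
Stack's expectation above — recover everything up to the last sync — matches the usual recovery strategy for a write-ahead log: scan records in order and stop at the first one that is truncated or fails verification. A sketch of that read-until-failure loop, using a toy [length][payload][CRC32] record layout rather than the real SequenceFile-based HLog format:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.CRC32;

// Sketch of split/recovery semantics with a toy record format
// ([int length][payload][long crc]), not HBase's actual HLogSplitter:
// everything before the first bad or truncated record is recovered,
// everything after it is abandoned.
public class WalReplayDemo {

    static int recoverableRecords(byte[] log) {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(log));
        int recovered = 0;
        try {
            while (in.available() > 0) {
                int len = in.readInt();
                byte[] payload = new byte[len];
                in.readFully(payload);
                long stored = in.readLong();
                CRC32 crc = new CRC32();
                crc.update(payload, 0, len);
                if (crc.getValue() != stored) break; // corrupt record: stop here
                recovered++;
            }
        } catch (IOException truncated) {
            // Partial trailing record (writer died mid-append): ignore it.
        }
        return recovered;
    }

    static void writeRecord(DataOutputStream out, byte[] payload) throws IOException {
        out.writeInt(payload.length);
        out.write(payload);
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        out.writeLong(crc.getValue());
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        for (int i = 0; i < 3; i++) writeRecord(out, ("edit-" + i).getBytes());
        out.writeInt(100); // start of a fourth record the writer never finished
        System.out.println("recovered: " + recoverableRecords(buf.toByteArray()));
    }
}
```

Under these semantics a trailing half-written record is harmless; the problem reported in this thread is different — the checksum failure made the whole file unreadable to the splitter rather than just truncating the tail.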
>
> On Fri, Jul 15, 2011 at 11:32 AM, Andy Sautins
> <an...@returnpath.net> wrote:
> >
> > Thanks. I filed JIRA HBASE-4107 (
> https://issues.apache.org/jira/browse/HBASE-4107 ).
> >
> > It does seem like the OOME is causing a write to the WAL to be left in
> > an inconsistent state. I haven't had a chance to look yet, but it would
> > seem that the flush isn't atomic, so possibly the data was synced but the
> > checksum couldn't be updated. If that logic is right, then it would be
> > an issue in the sync to hdfs.
> >
> > In either case it is sad that the log looks like it could get left in an
> unusable state. That seems like the last thing we'd really want. Not sure
> about keeping a reservoir of memory around. It seems you could free just
> about anything to let the write finish and then exit potentially
> ungracefully. The WAL would need to be recovered, but that's much
> preferable to data loss.
> >
> > It does feel like the full sync is not atomic, and failing somewhere
> > before the checksum is fully written out can potentially lead to WAL
> > corruption. That's a guess; I need to look at it further.
> >
> > Thanks
> >
> > Andy
> >
> > -----Original Message-----
> > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of
> Stack
> > Sent: Friday, July 15, 2011 10:41 AM
> > To: user@hbase.apache.org
> > Subject: Re: corrupt WAL and Java Heap Space...
> >
> > Please file an issue. Sounds like an OOME while writing causes us to
> > exit without closing the WAL (you think that's the case?). My guess is
> > that in this low-memory situation, a close might fail anyway (with
> > another OOME) unless we did some extra gymnastics, releasing the little
> > reservoir of memory we keep around so that cleanup succeeds whenever we
> > see an OOME.
> >
> > St.Ack
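
The "reservoir of memory" Stack mentions is a known defensive pattern, which can be sketched as follows. This is a hypothetical illustration with invented names, not HBase's actual implementation: preallocate a buffer at startup and, on OutOfMemoryError, drop it (giving the GC something to reclaim) before attempting cleanup, so the close path has some heap to work with.

```java
import java.util.concurrent.Callable;

// Hedged sketch of the OOME-reservoir pattern (names invented, not
// HBase code): hold a preallocated buffer for the life of the process;
// on OutOfMemoryError, release it and run cleanup (e.g. closing the
// WAL) before letting the error propagate.
public class OomeReservoir {
    private static byte[] reservoir = new byte[5 * 1024 * 1024]; // 5 MB of headroom

    static synchronized void release() {
        reservoir = null; // make the headroom reclaimable before cleanup runs
    }

    static <T> T runGuarded(Callable<T> work, Runnable cleanup) throws Exception {
        try {
            return work.call();
        } catch (OutOfMemoryError oome) {
            release();     // free the reservoir...
            System.gc();   // ...and encourage the VM to reclaim it
            cleanup.run(); // now closing the WAL has a chance to succeed
            throw oome;    // still a fatal condition: propagate after cleanup
        }
    }

    public static void main(String[] args) throws Exception {
        String result = runGuarded(() -> "synced",
                                   () -> System.out.println("closing WAL"));
        System.out.println(result);
    }
}
```

The trade-off is the one Andy raises below: the reservoir permanently sacrifices some heap, and even with it a close under memory pressure may still fail, so a recoverable (tail-truncated) log format is the stronger safety net.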
> >
> > On Fri, Jul 15, 2011 at 9:32 AM, Andy Sautins
> > <an...@returnpath.net> wrote:
> >>
> >> Yesterday we ran into an interesting issue. We were shutting down our
> >> HBase cluster (0.90.1, CDH3u0) and in the process one of the nodes
> >> encountered a Java heap space exception. The bummer is that the log file
> >> was listed as corrupt by hadoop fsck and couldn't be read when
> >> restarting the database. We were able to recover in our situation by
> >> removing the corrupt log, and we did not appear to lose any data.
> >>
> >> Has anyone else seen this issue? If I'm reading the situation right,
> >> it looks like a Java heap space error during the WAL checksum write
> >> could leave the WAL corrupt, which doesn't seem like desired behavior.
> >>
> >> I'll look into it further, but any thoughts would be appreciated.
> >>
> >>
> >> 2011-07-14 14:54:53,741 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not append. Requesting close of hlog
> >> java.io.IOException: Reflection
> >>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:147)
> >>         at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:987)
> >>         at org.apache.hadoop.hbase.regionserver.wal.HLog$LogSyncer.run(HLog.java:964)
> >> Caused by: java.lang.reflect.InvocationTargetException
> >>         at sun.reflect.GeneratedMethodAccessor1336.invoke(Unknown Source)
> >>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>         at java.lang.reflect.Method.invoke(Method.java:597)
> >>         at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:145)
> >>         ... 2 more
> >> Caused by: java.lang.OutOfMemoryError: Java heap space
> >>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$Packet.<init>(DFSClient.java:2375)
> >>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.writeChunk(DFSClient.java:3271)
> >>         at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:150)
> >>         at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:132)
> >>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3354)
> >>         at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
> >>         at org.apache.hadoop.io.SequenceFile$Writer.syncFs(SequenceFile.java:944)
> >>         ... 6 more
> >>
> >>
> >
>
Re: corrupt WAL and Java Heap Space...
Posted by Stack <st...@duboce.net>.
On Fri, Aug 26, 2011 at 9:44 AM, Dave Latham <la...@davelink.net> wrote:
> I was able to get the log file out of hdfs. Is there a location I can put
> it back in to have it picked up?
>
Someone needs to finish up the hbase walplayer: HBASE-3619.
St.Ack