Posted to user@hbase.apache.org by Dave Latham <la...@davelink.net> on 2011/08/26 18:44:52 UTC

Re: corrupt WAL and Java Heap Space...

We just hit the same issue.  I attached log snippets from the regionserver
and master to https://issues.apache.org/jira/browse/HBASE-4107

I was able to get the log file out of hdfs.  Is there a location I can put
it back in to have it picked up?

Dave

On Fri, Jul 15, 2011 at 12:23 PM, Andy Sautins
<an...@returnpath.net> wrote:

>
>  I don't have the log still.  Not sure what I was thinking deleting it.  I
> was a little too aggressive wanting to get my fsck back to having 0 corrupt
> blocks.
>
>  What you say is interesting.  It's more than possible that I'm
> misunderstanding what is going on.
>
>  What we saw with the log file is that we could cat it, but couldn't copy
> the file ( would complain about a bad checksum ).  I know that's not hard
> data, but going by that, what you say about applying the log up until the
> last sync would make sense.  What might have thrown me is that after a
> re-start the logs ( including the corrupt log ) were still in the .logs
> folder.  We did a full shutdown/restart and the following stacktrace was in
> the master logs. After this stacktrace hbase continued to startup, however
> the logs ( all logs up until the corrupt log ) for the region with the
> corrupt log file were left in the .logs directory.  When we removed the
> corrupt log file and re-started again all the existing logs were removed
> after successful restart as I would expect.
>
>   So is it more likely that the error on shutdown is reasonable and that
> the log cleanup just didn't happen on startup?  I suppose it makes sense not
> to remove them if there is an error, but it did throw me that the corrupt
> file as well as previous files were still in the .logs directory.
>
> 2011-07-14 18:07:45,954 ERROR org.apache.hadoop.hbase.master.MasterFileSystem: Failed splitting hdfs://hdnn.dfs.returnpath.net:8020/user/hbase/.logs/hd31.dfs.returnpath.net,60020,1309294522164
> org.apache.hadoop.fs.ChecksumException: Checksum error: /blk_-8148723766791273697:of:/user/hbase/.logs/hd31.dfs.returnpath.net,60020,1309294522164/hd31.dfs.returnpath.net%3A60020.1310675410770 at 57790464
>        at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
>        at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
>        at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:176)
>        at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:193)
>        at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
>        at org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1249)
>        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:1899)
>        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1951)
>        at java.io.DataInputStream.read(DataInputStream.java:132)
>        at java.io.DataInputStream.readFully(DataInputStream.java:178)
>        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
>        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1945)
>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1845)
>        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1891)
>        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:198)
>        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:172)
>        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.parseHLog(HLogSplitter.java:429)
>        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:262)
>        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:188)
>        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:197)
>        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLogAfterStartup(MasterFileSystem.java:181)
>        at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:384)
>        at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:283)
>
> -----Original Message-----
> From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
> Sent: Friday, July 15, 2011 12:59 PM
> To: user@hbase.apache.org
> Subject: Re: corrupt WAL and Java Heap Space...
>
> I'd have expected the log to be recoverable up to the last time you
> called sync.  What were you seeing?  Do you have the log still?  (It
> should recover to the last edit)
>
> St.Ack
>
> On Fri, Jul 15, 2011 at 11:32 AM, Andy Sautins
> <an...@returnpath.net> wrote:
> >
> >  Thanks.  I filed JIRA HBASE-4107 (
> > https://issues.apache.org/jira/browse/HBASE-4107 ).
> >
> >  It does seem like the OOME is causing a write to the WAL to be left in
> > an inconsistent state.  I haven't had a chance to look yet, but it would
> > seem that the flush isn't atomic, so possibly the data was synced but the
> > checksum wasn't able to be updated.  If that logic is right then it would be
> > an issue in the sync to hdfs.
> >
> >  In either case it is sad that the log looks like it could get left in an
> > unusable state.  That seems like the last thing we'd really want.  Not sure
> > about keeping a reservoir of memory around.  It seems you could free just
> > about anything to let the write finish and then exit potentially
> > ungracefully.  The WAL would need to be recovered, but that's much
> > preferable to data loss.
> >
> >  I need to look further but it does feel like the full sync is not atomic
> > and failing somewhere before the checksum is fully written out can
> > potentially lead to WAL corruption.  That's a guess.  I need to look at it
> > further.
> >
> >  Thanks
> >
> >  Andy
> >
> > -----Original Message-----
> > From: saint.ack@gmail.com [mailto:saint.ack@gmail.com] On Behalf Of Stack
> > Sent: Friday, July 15, 2011 10:41 AM
> > To: user@hbase.apache.org
> > Subject: Re: corrupt WAL and Java Heap Space...
> >
> > Please file an issue.  Sounds like an OOME while writing causes us to
> > exit w/o closing the WAL (you think that's the case?).  My guess is that
> > in this low memory situation, a close might fail anyways (with another
> > OOME) unless we did some extra gymnastics releasing the little
> > reservoir of memory we keep around to release so cleanup succeeds
> > whenever we see OOME.
> >
> > St.Ack
> >
> > On Fri, Jul 15, 2011 at 9:32 AM, Andy Sautins
> > <an...@returnpath.net> wrote:
> >>
> >>   Yesterday we ran into an interesting issue.  We were shutting down our
> >> HBase cluster ( 0.90.1 CDH3u0 ) and in the process one of the nodes
> >> encountered a Java heap space exception.  The bummer is the log file was
> >> listed as corrupt by hadoop fsck and could not be read when
> >> re-starting the database.  We were able to recover in our situation by
> >> removing the corrupt log and did not appear to lose any data.
> >>
> >>    Has anyone else seen this issue?  If I'm reading the situation right
> >> it looks like a Java heap space error during the WAL checksum write
> >> could leave the WAL corrupt, which doesn't seem like desired behavior.
> >>
> >>    I'll look into it further but any thoughts would be appreciated.
> >>
> >>
> >> 2011-07-14 14:54:53,741 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not append. Requesting close of hlog
> >> java.io.IOException: Reflection
> >>        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:147)
> >>        at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:987)
> >>        at org.apache.hadoop.hbase.regionserver.wal.HLog$LogSyncer.run(HLog.java:964)
> >> Caused by: java.lang.reflect.InvocationTargetException
> >>        at sun.reflect.GeneratedMethodAccessor1336.invoke(Unknown Source)
> >>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>        at java.lang.reflect.Method.invoke(Method.java:597)
> >>        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:145)
> >>        ... 2 more
> >> Caused by: java.lang.OutOfMemoryError: Java heap space
> >>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$Packet.<init>(DFSClient.java:2375)
> >>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.writeChunk(DFSClient.java:3271)
> >>        at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:150)
> >>        at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:132)
> >>        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3354)
> >>        at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
> >>        at org.apache.hadoop.io.SequenceFile$Writer.syncFs(SequenceFile.java:944)
> >>        ... 6 more
> >>
> >>
> >
>

Re: corrupt WAL and Java Heap Space...

Posted by Stack <st...@duboce.net>.
On Fri, Aug 26, 2011 at 9:44 AM, Dave Latham <la...@davelink.net> wrote:
> I was able to get the log file out of hdfs.  Is there a location I can put
> it back in to have it picked up?
>
Someone needs to finish up the hbase walplayer: HBASE-3619.
St.Ack
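
For anyone following the thread, the expectation that a log should be "recoverable up to the last sync" can be illustrated with a small, self-contained sketch. Everything in it is an assumption made for illustration: the record layout (a length and CRC32 header followed by the payload) and the names append_record and recover_prefix are hypothetical. It is not the SequenceFile/HLog format, nor how HDFS stores its per-chunk checksums. It only shows the general idea of replaying records until the first one that fails its checksum, rather than rejecting the whole file.

```python
import struct
import zlib
from typing import List

# Hypothetical toy layout: 4-byte payload length, 4-byte CRC32, then payload.
HEADER = struct.Struct(">II")


def append_record(buf: bytearray, payload: bytes) -> None:
    """Append one checksummed record to an in-memory log buffer."""
    buf += HEADER.pack(len(payload), zlib.crc32(payload))
    buf += payload


def recover_prefix(data: bytes) -> List[bytes]:
    """Return every record up to the first truncated or corrupt one."""
    records = []
    offset = 0
    while offset + HEADER.size <= len(data):
        length, crc = HEADER.unpack_from(data, offset)
        start = offset + HEADER.size
        end = start + length
        if end > len(data):
            break  # tail was cut off mid-record
        payload = data[start:end]
        if zlib.crc32(payload) != crc:
            break  # checksum mismatch: stop, keep what was read so far
        records.append(payload)
        offset = end
    return records


if __name__ == "__main__":
    log = bytearray()
    for edit in (b"put row1", b"put row2", b"put row3"):
        append_record(log, edit)

    damaged = bytearray(log)
    damaged[-1] ^= 0xFF  # flip bits in the last record's payload
    print(recover_prefix(bytes(damaged)))
```

With the last record damaged, the sketch returns only the first two edits. Whether a real WAL reader behaves this way depends on the HBase and HDFS versions involved, which is the open question in this thread.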