Posted to user@hbase.apache.org by Andy Sautins <an...@returnpath.net> on 2011/07/15 18:32:34 UTC

corrupt WAL and Java Heap Space...

   Yesterday we ran into an interesting issue.  We were shutting down our HBase cluster ( 0.90.1 CDH3u0 ) and in the process one of the nodes hit a Java heap space exception.  The bummer is that the log file was then listed as corrupt by hadoop fsck and could not be read when restarting the database.  We were able to recover by removing the corrupt log and did not appear to lose any data.

    Has anyone else seen this issue?  If I'm reading the situation right, it looks like a Java heap space error during the WAL checksum write can leave the WAL corrupt, which doesn't seem like desired behavior.

    I'll keep looking into it, but any thoughts would be appreciated.


2011-07-14 14:54:53,741 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not append. Requesting close of hlog
java.io.IOException: Reflection
        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:147)
        at org.apache.hadoop.hbase.regionserver.wal.HLog.sync(HLog.java:987)
        at org.apache.hadoop.hbase.regionserver.wal.HLog$LogSyncer.run(HLog.java:964)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.GeneratedMethodAccessor1336.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter.sync(SequenceFileLogWriter.java:145)
        ... 2 more
Caused by: java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$Packet.<init>(DFSClient.java:2375)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.writeChunk(DFSClient.java:3271)
        at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:150)
        at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:132)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.sync(DFSClient.java:3354)
        at org.apache.hadoop.fs.FSDataOutputStream.sync(FSDataOutputStream.java:97)
        at org.apache.hadoop.io.SequenceFile$Writer.syncFs(SequenceFile.java:944)
        ... 6 more
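
To make the failure mode concrete, here is a toy model of the problem being described. This is plain Python, not HBase's or HDFS's actual on-disk format, and `write_record`/`read_records` are invented names for illustration: a log where each record carries its own checksum, a writer that can die between flushing the data and writing the matching checksum, and a reader that replays everything up to the last complete record.

```python
import struct
import zlib

def write_record(buf: bytearray, payload: bytes, crash_before_crc: bool = False) -> None:
    """Append one record: 4-byte big-endian length, payload, CRC32 of payload."""
    buf += struct.pack(">I", len(payload))
    buf += payload
    if crash_before_crc:
        return  # simulate the process dying (e.g. OOME) before the checksum lands
    buf += struct.pack(">I", zlib.crc32(payload))

def read_records(buf: bytes) -> list:
    """Replay records until the first truncated or checksum-failing one."""
    out, pos = [], 0
    while pos + 4 <= len(buf):
        (n,) = struct.unpack_from(">I", buf, pos)
        if pos + 8 + n > len(buf):
            break  # torn tail: the record was never fully written
        payload = bytes(buf[pos + 4 : pos + 4 + n])
        (crc,) = struct.unpack_from(">I", buf, pos + 4 + n)
        if crc != zlib.crc32(payload):
            break  # checksum mismatch: stop and discard the rest
        out.append(payload)
        pos += 8 + n
    return out

log = bytearray()
write_record(log, b"edit-1")
write_record(log, b"edit-2")
write_record(log, b"edit-3", crash_before_crc=True)  # writer died mid-record
print(read_records(log))  # [b'edit-1', b'edit-2']
```

A WAL splitter behaves analogously: it replays edits up to the last good sync point and drops the torn tail, which is why a crash mid-write is expected to cost at most the unsynced edits rather than the whole log.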


Re: corrupt WAL and Java Heap Space...

Posted by Stack <st...@duboce.net>.
On Fri, Aug 26, 2011 at 9:44 AM, Dave Latham <la...@davelink.net> wrote:
> I was able to get the log file out of hdfs.  Is there a location I can put
> it back in to have it picked up?
>
Someone needs to finish up the hbase walplayer: HBASE-3619.
St.Ack

Re: corrupt WAL and Java Heap Space...

Posted by Dave Latham <la...@davelink.net>.
We just hit the same issue.  I attached log snippets from the regionserver
and master into https://issues.apache.org/jira/browse/HBASE-4107

I was able to get the log file out of hdfs.  Is there a location I can put
it back in to have it picked up?

Dave


RE: corrupt WAL and Java Heap Space...

Posted by Andy Sautins <an...@returnpath.net>.
  I don't have the log still.  Not sure what I was thinking deleting it.  I was a little too aggressive wanting to get my fsck back to having 0 corrupt blocks.

  What you say is interesting.  It's more than possible that I'm misunderstanding what is going on.   

  What we saw with the log file is that we could cat it, but couldn't copy the file ( it would complain about a bad checksum ).  I know that's not hard data, but going by that, what you say about applying the log up until the last sync would make sense.  What might have thrown me is that after a re-start the logs ( including the corrupt log ) were still in the .logs folder.  We did a full shutdown/restart and the following stacktrace was in the master logs.  After this stacktrace hbase continued to start up, however the logs ( all logs up until the corrupt log ) for the region server with the corrupt log file were left in the .logs directory.  When we removed the corrupt log file and restarted again, all the existing logs were removed after a successful restart, as I would expect.

   So is it more likely that the error on shutdown is reasonable and that the log cleanup just didn't happen on startup?  I suppose it makes sense not to remove them if there is an error, but it did throw me that the corrupt file as well as previous files were still in the .logs directory.

2011-07-14 18:07:45,954 ERROR org.apache.hadoop.hbase.master.MasterFileSystem: Failed splitting hdfs://hdnn.dfs.returnpath.net:8020/user/hbase/.logs/hd31.dfs.returnpath.net,60020,1309294522164
org.apache.hadoop.fs.ChecksumException: Checksum error: /blk_-8148723766791273697:of:/user/hbase/.logs/hd31.dfs.returnpath.net,60020,1309294522164/hd31.dfs.returnpath.net%3A60020.1310675410770 at 57790464
        at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
        at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
        at org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java:176)
        at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:193)
        at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
        at org.apache.hadoop.hdfs.DFSClient$BlockReader.read(DFSClient.java:1249)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.readBuffer(DFSClient.java:1899)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1951)
        at java.io.DataInputStream.read(DataInputStream.java:132)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1945)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1845)
        at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1891)
        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:198)
        at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.next(SequenceFileLogReader.java:172)
        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.parseHLog(HLogSplitter.java:429)
        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:262)
        at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLog(HLogSplitter.java:188)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLog(MasterFileSystem.java:197)
        at org.apache.hadoop.hbase.master.MasterFileSystem.splitLogAfterStartup(MasterFileSystem.java:181)
        at org.apache.hadoop.hbase.master.HMaster.finishInitialization(HMaster.java:384)
        at org.apache.hadoop.hbase.master.HMaster.run(HMaster.java:283)


Re: corrupt WAL and Java Heap Space...

Posted by Stack <st...@duboce.net>.
I'd have expected the log to be recoverable up to the last time you
called sync.  What were you seeing?  Do you have the log still?  (It
should recover to the last edit)

St.Ack


RE: corrupt WAL and Java Heap Space...

Posted by Andy Sautins <an...@returnpath.net>.
  Thanks.  I filed JIRA HBASE-4107 ( https://issues.apache.org/jira/browse/HBASE-4107 ).

  It does seem like the OOME is causing a write to the WAL to be left in an inconsistent state.  I haven't had a chance to look yet, but it would seem that the flush isn't atomic, so possibly the data was synced but the checksum wasn't able to be updated.  If that logic is right then it would be an issue in the sync to hdfs.

  In either case it is sad that the log looks like it could get left in an unusable state.  That seems like the last thing we'd really want.  Not sure about keeping a reservoir of memory around.  It seems you could free just about anything to let the write finish and then exit potentially ungracefully.  The WAL would need to be recovered, but that's much preferable to data loss.

  I need to look further but it does feel like the full sync is not atomic and failing somewhere before the checksum is fully written out can potentially lead to WAL corruption.  That's a guess.  I need to look at it further.

  Thanks

  Andy 
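
Andy's hunch above, that the data and its checksum go out in separate steps, can be sketched as follows. This is a simplified Python illustration, not the DFSClient code path; `append_two_step`/`append_one_shot` are invented names, and collapsing a record into a single user-space write only narrows the crash window, it is not a durability guarantee on a real filesystem or an HDFS pipeline.

```python
import io
import struct
import zlib

def append_two_step(f, payload: bytes) -> None:
    # Data first, checksum second: a crash (or OOME) between the two
    # writes leaves a record whose bytes are present but unverifiable.
    f.write(struct.pack(">I", len(payload)) + payload)
    f.write(struct.pack(">I", zlib.crc32(payload)))

def append_one_shot(f, payload: bytes) -> None:
    # Assemble length + payload + checksum in memory, then write once:
    # a crash before the write loses the edit, but never tears the log.
    record = (struct.pack(">I", len(payload))
              + payload
              + struct.pack(">I", zlib.crc32(payload)))
    f.write(record)

a, b = io.BytesIO(), io.BytesIO()
append_two_step(a, b"edit")
append_one_shot(b, b"edit")
print(a.getvalue() == b.getvalue())  # True: same bytes, different crash windows
```

Both produce identical bytes on success; the difference only shows up when the writer dies mid-append, which is exactly the scenario in the stack trace above.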


Re: corrupt WAL and Java Heap Space...

Posted by Stack <st...@duboce.net>.
Please file an issue.  Sounds like an OOME while writing causes us to
exit w/o closing the WAL (you think that's the case?).  My guess is that
in this low-memory situation, a close might fail anyways (with another
OOME) unless we did some extra gymnastics, releasing the little
reservoir of memory we keep around so cleanup succeeds
whenever we see OOME.

St.Ack
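
The "reservoir of memory" idea is sometimes called a memory parachute: preallocate a ballast buffer and, when an out-of-memory error hits, free it so that shutdown work has headroom to run. A minimal sketch in Python; `OomParachute` and `FakeWal` are invented names for illustration, and real JVM behavior after an OutOfMemoryError is far less predictable than this.

```python
import gc

class OomParachute:
    """Preallocated ballast to release when memory runs out."""
    def __init__(self, size: int = 4 * 1024 * 1024):
        self._ballast = bytearray(size)

    def release(self) -> None:
        self._ballast = None  # drop the reserve so cleanup can allocate
        gc.collect()

class FakeWal:
    """Stand-in for a write-ahead log whose append can exhaust the heap."""
    def __init__(self):
        self.closed = False

    def append(self, edit: bytes) -> None:
        raise MemoryError("simulated heap exhaustion")

    def close(self) -> None:
        self.closed = True  # in reality, closing needs some free memory too

parachute = OomParachute()
wal = FakeWal()
try:
    wal.append(b"edit")
except MemoryError:
    parachute.release()  # free the reserve first...
    wal.close()          # ...then close the log so it isn't left torn
print(wal.closed)  # True
```

As Andy notes later in the thread, this only helps if the cleanup path itself stays small; the more robust property is a log format that recovers cleanly no matter where the writer died.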
