Posted to issues@hbase.apache.org by "Sergey Shelukhin (JIRA)" <ji...@apache.org> on 2019/02/01 00:47:00 UTC

[jira] [Commented] (HBASE-21601) corrupted WAL is not handled in all places (NegativeArraySizeException)

    [ https://issues.apache.org/jira/browse/HBASE-21601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16757835#comment-16757835 ] 

Sergey Shelukhin commented on HBASE-21601:
------------------------------------------

So I was able to repro the issue and capture a WAL, and then repro the failure again from that file.
This actually looks like it's also a write-side problem, because I'd assume no network issue would be able to produce a result like this.
The WAL record structure of the corrupted record is intact.
The WALKey, the cell count, and the length of every one of the 476 cells in the corrupt record exactly match those of another, earlier record in the file that has valid data.
So the structure of the record is well-formed, which is why none of the IO/EOF exceptions happen.
However, every cell of the corrupted record contains garbage data that seems to be pulled randomly from somewhere else, possibly elsewhere in the file.
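
To illustrate why a record with intact framing can still fail in cloneFamily, here is a minimal sketch. It is not HBase code. It assumes the family length is stored as a single byte and read as a signed Java byte, so a garbage value of 0x80 or above comes back negative. The class FamilyLengthSketch and the helpers readSignedLength, isPlausibleLength and cloneFamilyChecked are hypothetical names made up for this example.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; this is not HBase code.
final class FamilyLengthSketch {

    // Java bytes are signed, so raw values 0x80..0xFF come back as -128..-1.
    static int readSignedLength(byte[] buf, int lengthOffset) {
        return buf[lengthOffset];
    }

    // Hypothetical guard: the length must be non-negative and the family
    // bytes must fit inside the buffer after the length byte.
    static boolean isPlausibleLength(int length, int lengthOffset, int bufLength) {
        return length >= 0 && lengthOffset + 1 + length <= bufLength;
    }

    // Copies the family bytes, failing with a descriptive error instead of
    // a bare NegativeArraySizeException when the length is implausible.
    static byte[] cloneFamilyChecked(byte[] buf, int lengthOffset) {
        int length = readSignedLength(buf, lengthOffset);
        if (!isPlausibleLength(length, lengthOffset, buf.length)) {
            throw new IllegalStateException(
                "Implausible family length " + length + " at offset " + lengthOffset);
        }
        byte[] out = new byte[length];
        System.arraycopy(buf, lengthOffset + 1, out, 0, length);
        return out;
    }

    public static void main(String[] args) {
        byte[] valid = {2, 'c', 'f'};              // length byte 2, then "cf"
        byte[] garbage = {(byte) 0xC3, 'x', 'y'};  // garbage in the length position

        System.out.println(
            new String(cloneFamilyChecked(valid, 0), StandardCharsets.US_ASCII));

        try {
            // Same pattern as the unchecked allocation in the stack trace.
            byte[] unchecked = new byte[readSignedLength(garbage, 0)];
            System.out.println(unchecked.length);
        } catch (NegativeArraySizeException e) {
            System.out.println("unchecked: " + e);
        }

        try {
            cloneFamilyChecked(garbage, 0);
        } catch (IllegalStateException e) {
            System.out.println("checked: " + e.getMessage());
        }
    }
}
```

A check like this would only turn the failure into a clearer corruption error on the read side; it would not explain how the garbage got written.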

I suspect some buffers are being reused on retry; however, I see no errors for this file in the logs of the RS that was writing it. The RS did quit unexpectedly.
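
Purely as an illustration of that suspicion, and not a diagnosis, the sketch below shows one hypothetical way buffer reuse could give this symptom: the framing is captured when the write is queued, the payload is held by reference, and the buffer is overwritten before the flush. None of this is the actual HBase write path; BufferReuseSketch, PendingWrite, flush and enqueueThenReuse are made-up names, and the one-byte length prefix is an assumption for the example.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; this is not the HBase WAL write path.
final class BufferReuseSketch {

    // Hypothetical queued write: the length is captured at enqueue time,
    // while the payload is only a reference to the caller's buffer.
    static final class PendingWrite {
        final int length;
        final byte[] payload;

        PendingWrite(int length, byte[] payload) {
            this.length = length;
            this.payload = payload;
        }
    }

    // Serializes as [1-byte length][payload bytes], reading the payload
    // at flush time rather than at enqueue time.
    static byte[] flush(PendingWrite w) {
        byte[] out = new byte[1 + w.length];
        out[0] = (byte) w.length;
        System.arraycopy(w.payload, 0, out, 1, w.length);
        return out;
    }

    // Queues a write, overwrites the source buffer, then flushes.
    // With defensiveCopy the queued write keeps its own copy of the data.
    static byte[] enqueueThenReuse(boolean defensiveCopy) {
        byte[] shared = "row1:cf:a".getBytes(StandardCharsets.US_ASCII);
        PendingWrite queued =
            new PendingWrite(shared.length, defensiveCopy ? shared.clone() : shared);

        // The buffer is reused for unrelated data before the flush happens.
        byte[] unrelated = "XXXXXXXXX".getBytes(StandardCharsets.US_ASCII);
        System.arraycopy(unrelated, 0, shared, 0, shared.length);

        return flush(queued);
    }

    public static void main(String[] args) {
        byte[] reused = enqueueThenReuse(false);
        byte[] copied = enqueueThenReuse(true);
        System.out.println("reused: length=" + reused[0] + " payload="
            + new String(reused, 1, reused[0], StandardCharsets.US_ASCII));
        System.out.println("copied: length=" + copied[0] + " payload="
            + new String(copied, 1, copied[0], StandardCharsets.US_ASCII));
    }
}
```

In the reused case the length prefix is still correct while the payload is whatever ended up in the buffer, which is the same general shape as the record described above.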

> corrupted WAL is not handled in all places (NegativeArraySizeException)
> -----------------------------------------------------------------------
>
>                 Key: HBASE-21601
>                 URL: https://issues.apache.org/jira/browse/HBASE-21601
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>            Priority: Major
>
> {noformat}
> 2018-12-13 17:01:12,208 ERROR [RS_LOG_REPLAY_OPS-regionserver/...] executor.EventHandler: Caught throwable while processing event RS_LOG_REPLAY
> java.lang.RuntimeException: java.lang.NegativeArraySizeException
> 	at org.apache.hadoop.hbase.wal.WALSplitter$PipelineController.checkForErrors(WALSplitter.java:846)
> 	at org.apache.hadoop.hbase.wal.WALSplitter$OutputSink.finishWriting(WALSplitter.java:1203)
> 	at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.finishWritingAndClose(WALSplitter.java:1267)
> 	at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:349)
> 	at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:196)
> 	at org.apache.hadoop.hbase.regionserver.SplitLogWorker.splitLog(SplitLogWorker.java:178)
> 	at org.apache.hadoop.hbase.regionserver.SplitLogWorker.lambda$new$0(SplitLogWorker.java:90)
> 	at org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:70)
> 	at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NegativeArraySizeException
> 	at org.apache.hadoop.hbase.CellUtil.cloneFamily(CellUtil.java:113)
> 	at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.filterCellByStore(WALSplitter.java:1542)
> 	at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.appendBuffer(WALSplitter.java:1586)
> 	at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.append(WALSplitter.java:1560)
> 	at org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.writeBuffer(WALSplitter.java:1085)
> 	at org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.doRun(WALSplitter.java:1077)
> 	at org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.run(WALSplitter.java:1047)
> {noformat}
> Unfortunately I cannot share the file.
> The issue appears to be straightforward: for whatever reason the family length is negative. I am not sure how such a cell got created; I suspect the file was corrupted.
> {code}
> byte[] output = new byte[cell.getFamilyLength()];
> {code}


