Posted to issues@hbase.apache.org by "Sergey Shelukhin (JIRA)" <ji...@apache.org> on 2019/02/01 02:17:00 UTC

[jira] [Updated] (HBASE-21817) skip records with corrupted cells in WAL splitting

     [ https://issues.apache.org/jira/browse/HBASE-21817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Shelukhin updated HBASE-21817:
-------------------------------------
    Description: 
See HBASE-21601 for context.
I looked at the code a bit, but it will take a while to understand, so for now I'm going to mitigate the problem by skipping such records. Given that the record is bogus and its lengths are intact, skipping it is safe in this scenario. However, I suppose a bug elsewhere could make skipping such a record lead to data loss. Regardless, failing to split the WAL would lose even more data in this case, so it should be ok to handle errors where the record structure is correct but the cells are corrupted.
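
As a rough illustration of the intended mitigation (the helper name and its placement are hypothetical, not the actual patch), the split path could validate cell lengths before appending and skip entries that fail the check:
{code}
import java.util.List;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.wal.WAL;

// Hypothetical helper: returns true if any cell in the entry reports an impossible
// (negative) length even though the WAL record framing itself was read successfully.
private static boolean hasCorruptCells(WAL.Entry entry) {
  List<Cell> cells = entry.getEdit().getCells();
  for (Cell cell : cells) {
    if (cell.getFamilyLength() < 0 || cell.getQualifierLength() < 0
        || cell.getValueLength() < 0) {
      return true;
    }
  }
  return false;
}

// In the split loop the entry would then be skipped with a warning instead of letting
// a NegativeArraySizeException fail the whole split, e.g.:
//   if (hasCorruptCells(entry)) {
//     LOG.warn("Skipping WAL entry with corrupted cells: " + entry.getKey());
//     continue;
//   }
{code}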

  was:
{noformat}
2018-12-13 17:01:12,208 ERROR [RS_LOG_REPLAY_OPS-regionserver/...] executor.EventHandler: Caught throwable while processing event RS_LOG_REPLAY
java.lang.RuntimeException: java.lang.NegativeArraySizeException
	at org.apache.hadoop.hbase.wal.WALSplitter$PipelineController.checkForErrors(WALSplitter.java:846)
	at org.apache.hadoop.hbase.wal.WALSplitter$OutputSink.finishWriting(WALSplitter.java:1203)
	at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.finishWritingAndClose(WALSplitter.java:1267)
	at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:349)
	at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:196)
	at org.apache.hadoop.hbase.regionserver.SplitLogWorker.splitLog(SplitLogWorker.java:178)
	at org.apache.hadoop.hbase.regionserver.SplitLogWorker.lambda$new$0(SplitLogWorker.java:90)
	at org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:70)
	at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NegativeArraySizeException
	at org.apache.hadoop.hbase.CellUtil.cloneFamily(CellUtil.java:113)
	at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.filterCellByStore(WALSplitter.java:1542)
	at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.appendBuffer(WALSplitter.java:1586)
	at org.apache.hadoop.hbase.wal.WALSplitter$LogRecoveredEditsOutputSink.append(WALSplitter.java:1560)
	at org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.writeBuffer(WALSplitter.java:1085)
	at org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.doRun(WALSplitter.java:1077)
	at org.apache.hadoop.hbase.wal.WALSplitter$WriterThread.run(WALSplitter.java:1047)
{noformat}

Unfortunately I cannot share the file.
The issue appears to be straightforward: for whatever reason, the family length is negative. I'm not sure how such a cell got created; I suspect the file was corrupted.
{code}
byte[] output = new byte[cell.getFamilyLength()];
{code}
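
For reference, the failure mode is the array allocation itself; any negative size throws immediately. A minimal standalone demonstration of that JVM behavior (not taken from the WAL file):
{code}
// Demonstration only: a cell whose family length was corrupted to a negative value
// makes the allocation in CellUtil.cloneFamily throw NegativeArraySizeException.
byte familyLength = (byte) -1;           // corrupted length as read from the cell
byte[] output = new byte[familyLength];  // throws java.lang.NegativeArraySizeException
{code}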


> skip records with corrupted cells in WAL splitting
> --------------------------------------------------
>
>                 Key: HBASE-21817
>                 URL: https://issues.apache.org/jira/browse/HBASE-21817
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Priority: Critical
>
> See HBASE-21601 for context.
> I looked at the code a bit, but it will take a while to understand, so for now I'm going to mitigate the problem by skipping such records. Given that the record is bogus and its lengths are intact, skipping it is safe in this scenario. However, I suppose a bug elsewhere could make skipping such a record lead to data loss. Regardless, failing to split the WAL would lose even more data in this case, so it should be ok to handle errors where the record structure is correct but the cells are corrupted.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)