Posted to issues@hbase.apache.org by "Sergey Shelukhin (JIRA)" <ji...@apache.org> on 2019/02/01 19:02:00 UTC

[jira] [Comment Edited] (HBASE-21817) skip records with corrupted cells in WAL splitting

    [ https://issues.apache.org/jira/browse/HBASE-21817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16758582#comment-16758582 ] 

Sergey Shelukhin edited comment on HBASE-21817 at 2/1/19 7:01 PM:
------------------------------------------------------------------

Currently, a failure to split the log in this case results in RSes crashing with various errors, generally about array offsets, and in regions being offline.
I think with skipErrors it's OK to just skip the record like this patch does; I can add the check.
I wonder if a better way would be to write the corrupted records to a separate WAL, and keep offline only the regions that have corrupted records, not all the regions in the WAL. That would be a bigger change, though, to handle it gracefully on the RS as well as the master. The admin could then keep or delete the file when they notice the region is offline.
I can remove the main method; it's not intended for recovery, just for debugging, so we don't really need it.
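
To make the two options above concrete, here is a rough, self-contained sketch of the idea, not the code in the attached patch. Every name in it (WalSplitSketch, Record, Result, cellLooksValid, split) is hypothetical and none of them are HBase classes. The skipErrors argument is a plain boolean standing in for the skipErrors setting, and the "cell" layout (a 4-byte length followed by row bytes) is a toy format assumed only for illustration. Records whose cells fail the check are set aside instead of being replayed, and the affected regions are remembered so that only those would need to stay offline.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration only; these are not HBase types. */
final class WalSplitSketch {

    /** A record whose outer framing was readable; its cell bytes may still be bad. */
    static final class Record {
        final String region;
        final byte[] cellBytes; // toy layout: 4-byte big-endian row length, then row bytes

        Record(String region, byte[] cellBytes) {
            this.region = region;
            this.cellBytes = cellBytes;
        }
    }

    /** Outcome of splitting: what can be replayed and what was set aside. */
    static final class Result {
        final List<Record> replayable = new ArrayList<>();
        final List<Record> quarantined = new ArrayList<>();
        final Set<String> regionsWithCorruption = new LinkedHashSet<>();
    }

    /** Toy validity check: the declared row length must fit inside the buffer. */
    static boolean cellLooksValid(byte[] b) {
        if (b.length < 4) {
            return false;
        }
        int rowLen = ByteBuffer.wrap(b).getInt();
        return rowLen >= 0 && rowLen <= b.length - 4;
    }

    /**
     * Walks the records once. With skipErrors set, a record with a corrupted
     * cell is quarantined and its region noted; without it, the split fails.
     */
    static Result split(List<Record> records, boolean skipErrors) {
        Result result = new Result();
        for (Record rec : records) {
            if (cellLooksValid(rec.cellBytes)) {
                result.replayable.add(rec);
                continue;
            }
            if (!skipErrors) {
                throw new IllegalStateException(
                    "corrupted cell in record for region " + rec.region);
            }
            result.quarantined.add(rec);
            result.regionsWithCorruption.add(rec.region);
        }
        return result;
    }
}
```

In this sketch the quarantined list plays the role of the separate WAL, and regionsWithCorruption is the set an admin would look at before deciding to keep or delete that file. Skipping is only possible here because the record boundaries are still known; if the lengths themselves were wrong, the reader could not find the next record at all.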


was (Author: sershe):
Currently failure to split log in this case results in RSes crashing with various errors, generally about array offsets, and region being offline.
I think with skipErrors it's ok to just skip the record like this patch does, I can add the check.
I wonder if a better way would be to write the corrupted records to a separate WAL, and only keep the regions that have corrupted record offline, not all the regions in the WAL. That would be a bigger change though to handle it gracefully on RS as well as master. Then the admin can keep or delete the file when they notice the region is offline.
I can remove the main method; it's not intended to recovery, just for debugging so we don't really need it. 

> skip records with corrupted cells in WAL splitting
> --------------------------------------------------
>
>                 Key: HBASE-21817
>                 URL: https://issues.apache.org/jira/browse/HBASE-21817
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>            Priority: Critical
>         Attachments: HBASE-21817.patch
>
>
> See HBASE-21601 for context.
> I looked at the code a bit, but it will take a while to understand, so for now I'm going to mitigate it by skipping such records. Given that this record is bogus and the lengths are intact, it's safe to do so in this scenario. However, I guess it's possible to have a bug where skipping such a record would lead to data loss. Regardless, failing to split the WAL will lead to even more data loss in this case, so it should be OK to handle errors where the structure is correct but the cells are corrupted.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)