You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hbase.apache.org by "stack (JIRA)" <ji...@apache.org> on 2018/06/20 20:19:00 UTC
[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order

    [ https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16518554#comment-16518554 ] 

stack commented on HBASE-18152:
-------------------------------

Had another incidence here on internal test rig.

Procedure updates seem to have gone in in reversed order.... The 'last update' time and the list of stack ids -- the sub-procedures the root needs to roll back are out of order with the first update having two stack ids where the one that comes after has one (see example below).... A bunch of sub-procedure Assigns of a SCP were written this way.  I'm having difficulty because the update to the Store is under region lock so the updates should be in order.... but timeout doesn't have lock.

The incident that brought this on does not show in the logs because a STUCK spew rolled away the logs. In another issue I tune down the STUCK logging. Hopefully it helps. Would like to figure what the 'event' was. Would help narrow which path brought this about (also my master WAL doesn't show the events that were a problem... They seem to have been cut off... the file was not closed nicely).

{code}
2018-06-14 12:57:16,956 WARN org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormatReader: NOT INCREASING! current=class_name: "org.apache.hadoop.hbase.master.assignment.AssignProcedure"
parent_id: 864403
proc_id: 916353
submitted_time: 1528730391463
owner: "hbase"
state: RUNNABLE
stack_id: 251
stack_id: 332
last_update: 1528730393598
state_message {
  type_url: "type.googleapis.com/hbase.pb.AssignRegionStateData"
  value: "\b\002\022Y\b\201\327\353\230\271,\0226\n\adefault\022+IntegrationTestBigLinkedList_20180524090611\032\b\274\240D\345\226\ad\020\"\b\274\246\375\243\rc\370\221(\0000\0008\000"
}
, candidate=class_name: "org.apache.hadoop.hbase.master.assignment.AssignProcedure"
parent_id: 864403
proc_id: 916353
submitted_time: 1528730391463
owner: "hbase"
state: RUNNABLE
stack_id: 251
last_update: 1528730393404
state_message {
  type_url: "type.googleapis.com/hbase.pb.AssignRegionStateData"
  value: "\b\002\022Y\b\201\327\353\230\271,\0226\n\adefault\022+IntegrationTestBigLinkedList_20180524090611\032\b\274\240D\345\226\ad\020\"\b\274\246\375\243\rc\370\221(\0000\0008\000"
}
{code}



> [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
> ---------------------------------------------------------------------
>
>                 Key: HBASE-18152
>                 URL: https://issues.apache.org/jira/browse/HBASE-18152
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 2.0.0
>            Reporter: stack
>            Assignee: stack
>            Priority: Critical
>             Fix For: 3.0.0
>
>         Attachments: HBASE-18152.master.001.patch, hbase-hbase-master-ctr-e138-1518143905142-221855-01-000002.hwx.site.log.gz, pv2-00000000000000000036.log, pv2-00000000000000000047.log, reading_bad_wal.patch
>
>
> I've seen corruption from time-to-time testing.  Its rare enough. Often we can get over it but sometimes we can't. It took me a while to capture an instance of corruption. Turns out we are write to the WAL out-of-order which undoes a basic tenet; that WAL content is ordered in line w/ execution.
> Below I'll post a corrupt WAL.
> Looking at the write-side, there is a lot going on. I'm not clear on how we could write out of order. Will try and get more insight. Meantime parking this issue here to fill data into.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)