Posted to issues@hbase.apache.org by "stack (JIRA)" <ji...@apache.org> on 2018/07/28 19:41:00 UTC

[jira] [Commented] (HBASE-19121) HBCK for AMv2 (A.K.A HBCK2)

    [ https://issues.apache.org/jira/browse/HBASE-19121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16560863#comment-16560863 ] 

stack commented on HBASE-19121:
-------------------------------

I think I've asked for this above but here is more detail.

A corrupt Master proc WAL file left two regions stuck in OPENING. It looks like this in the Master log:

{code}
2018-07-28 12:33:49,724 WARN  [ProcExecTimeout] assignment.AssignmentManager: STUCK Region-In-Transition rit=OPENING, location=ve0530.halxg.cloudera.com,16020,1532716446468, table=IntegrationTestBigLinkedList, region=8198218a4532a0ee544cb069970f9a77
2018-07-28 12:33:49,724 WARN  [ProcExecTimeout] assignment.AssignmentManager: STUCK Region-In-Transition rit=OPENING, location=ve0530.halxg.cloudera.com,16020,1532716446468, table=IntegrationTestBigLinkedList, region=4459746bcff48c116337e732ac4df705
2018-07-28 12:34:17,532 WARN  [PEWorker-2] assignment.RegionTransitionProcedure: Failed transition, suspend 3600secs pid=14482, ppid=14168, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=IntegrationTestBigLinkedList, region=4459746bcff48c116337e732ac4df705, server=ve0530.halxg.cloudera.com,16020,1532716446468; rit=OPENING, location=ve0530.halxg.cloudera.com,16020,1532716446468; waiting on rectified condition fixed by other Procedure or operator intervention
org.apache.hadoop.hbase.exceptions.UnexpectedStateException: Expected [SPLITTING, SPLIT, MERGING, OPEN, CLOSING] so could move to CLOSING but current state=OPENING
  at org.apache.hadoop.hbase.master.assignment.RegionStates$RegionStateNode.transitionState(RegionStates.java:164)
  at org.apache.hadoop.hbase.master.assignment.AssignmentManager.markRegionAsClosing(AssignmentManager.java:1542)
  at org.apache.hadoop.hbase.master.assignment.UnassignProcedure.updateTransition(UnassignProcedure.java:204)
  at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:345)
  at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:95)
  at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:850)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1474)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1249)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:76)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1763)
2018-07-28 12:34:17,533 INFO  [PEWorker-2] procedure2.TimeoutExecutorThread: ADDED pid=14482, ppid=14168, state=WAITING_TIMEOUT:REGION_TRANSITION_DISPATCH; UnassignProcedure table=IntegrationTestBigLinkedList, region=4459746bcff48c116337e732ac4df705, server=ve0530.halxg.cloudera.com,16020,1532716446468; timeout=3600000, timestamp=1532810057533
2018-07-28 12:34:19,078 WARN  [PEWorker-12] assignment.RegionTransitionProcedure: Failed transition, suspend 3600secs pid=14373, ppid=14168, state=RUNNABLE:REGION_TRANSITION_DISPATCH; UnassignProcedure table=IntegrationTestBigLinkedList, region=8198218a4532a0ee544cb069970f9a77, server=ve0530.halxg.cloudera.com,16020,1532716446468; rit=OPENING, location=ve0530.halxg.cloudera.com,16020,1532716446468; waiting on rectified condition fixed by other Procedure or operator intervention
org.apache.hadoop.hbase.exceptions.UnexpectedStateException: Expected [SPLITTING, SPLIT, MERGING, OPEN, CLOSING] so could move to CLOSING but current state=OPENING
  at org.apache.hadoop.hbase.master.assignment.RegionStates$RegionStateNode.transitionState(RegionStates.java:164)
  at org.apache.hadoop.hbase.master.assignment.AssignmentManager.markRegionAsClosing(AssignmentManager.java:1542)
  at org.apache.hadoop.hbase.master.assignment.UnassignProcedure.updateTransition(UnassignProcedure.java:204)
  at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:345)
  at org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure.execute(RegionTransitionProcedure.java:95)
  at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:850)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1474)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1249)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$900(ProcedureExecutor.java:76)
  at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1763)
{code}

At least the log is clear on what has to be done. We see the STUCK messages, and then the prescription comes out periodically.
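For anyone wanting to pull the affected regions out of a Master log mechanically, here is a minimal sketch. It assumes the STUCK warning keeps exactly the field order seen in the log above (rit, location, table, region); the function name `find_stuck_regions` and the regular expression are illustrative only and are not part of HBase.

```python
import re

# Matches "STUCK Region-In-Transition" warnings in the shape seen in the log above.
# Assumption: the fields always appear in the order rit, location, table, region.
STUCK_RE = re.compile(
    r"STUCK Region-In-Transition rit=(?P<state>\w+), "
    r"location=(?P<location>\S+), "
    r"table=(?P<table>[^,\s]+), "
    r"region=(?P<region>[0-9a-f]{32})"
)


def find_stuck_regions(log_lines):
    """Return {encoded region name: (state, location, table)} for STUCK warnings."""
    stuck = {}
    for line in log_lines:
        match = STUCK_RE.search(line)
        if match:
            stuck[match.group("region")] = (
                match.group("state"),
                match.group("location"),
                match.group("table"),
            )
    return stuck


# The two STUCK lines copied from the log above.
sample = [
    "2018-07-28 12:33:49,724 WARN  [ProcExecTimeout] assignment.AssignmentManager: "
    "STUCK Region-In-Transition rit=OPENING, "
    "location=ve0530.halxg.cloudera.com,16020,1532716446468, "
    "table=IntegrationTestBigLinkedList, region=8198218a4532a0ee544cb069970f9a77",
    "2018-07-28 12:33:49,724 WARN  [ProcExecTimeout] assignment.AssignmentManager: "
    "STUCK Region-In-Transition rit=OPENING, "
    "location=ve0530.halxg.cloudera.com,16020,1532716446468, "
    "table=IntegrationTestBigLinkedList, region=4459746bcff48c116337e732ac4df705",
]

for region, (state, location, table) in find_stuck_regions(sample).items():
    print(region, state, table, location)
```

Keying the result on the encoded region name means repeated warnings for the same region collapse into one entry.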

The Locks&Procedures UI shows that there is an exclusive lock on the two regions making it so no other Procedure can run to do fixup:

{code}
Locks
REGION: 8198218a4532a0ee544cb069970f9a77
Lock type: EXCLUSIVE

Owner procedure: { ID => '14373', PARENT_ID => '14168', STATE => 'WAITING_TIMEOUT', OWNER => 'stack', TYPE => 'UnassignProcedure table=IntegrationTestBigLinkedList, region=8198218a4532a0ee544cb069970f9a77, server=ve0530.halxg.cloudera.com,16020,1532716446468', START_TIME => 'Fri Jul 27 14:25:57 PDT 2018', LAST_UPDATE => 'Sat Jul 28 12:34:19 PDT 2018', PARAMETERS => [ { transitionState => 'REGION_TRANSITION_DISPATCH', regionInfo => { regionId => '1532507299562', tableName => { namespace => 'ZGVmYXVsdA==', qualifier => 'SW50ZWdyYXRpb25UZXN0QmlnTGlua2VkTGlzdA==' }, startKey => 'WHJsjHk=', endKey => 'WOReaw==', offline => 'false', split => 'false', replicaId => '0' }, hostingServer => { hostName => 've0530.halxg.cloudera.com', port => '16020', startCode => '1532716446468' }, attempt => '34' } ] }

REGION: 4459746bcff48c116337e732ac4df705
Lock type: EXCLUSIVE

Owner procedure: { ID => '14482', PARENT_ID => '14168', STATE => 'WAITING_TIMEOUT', OWNER => 'stack', TYPE => 'UnassignProcedure table=IntegrationTestBigLinkedList, region=4459746bcff48c116337e732ac4df705, server=ve0530.halxg.cloudera.com,16020,1532716446468', START_TIME => 'Fri Jul 27 14:25:57 PDT 2018', LAST_UPDATE => 'Sat Jul 28 12:34:17 PDT 2018', PARAMETERS => [ { transitionState => 'REGION_TRANSITION_DISPATCH', regionInfo => { regionId => '1532510055516', tableName => { namespace => 'ZGVmYXVsdA==', qualifier => 'SW50ZWdyYXRpb25UZXN0QmlnTGlua2VkTGlzdA==' }, startKey => 'h43UQpjowgpeGc6LsEkspQ==', endKey => 'h/+PcQ==', offline => 'false', split => 'false', replicaId => '0' }, hostingServer => { hostName => 've0530.halxg.cloudera.com', port => '16020', startCode => '1532716446468' }, attempt => '34' } ] }
{code}
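The namespace, qualifier and key fields in the dump above are not readable as-is; the namespace and qualifier values decode cleanly as base64. A small sketch for decoding them follows. The helper name `decode_field` is hypothetical, and falling back to hex for values that are not printable text (such as the row keys) is my own choice, not something the UI does.

```python
import base64


def decode_field(value):
    """Decode a base64 field; return text if printable, otherwise a hex string."""
    raw = base64.b64decode(value)
    try:
        text = raw.decode("utf-8")
        if text.isprintable():
            return text
    except UnicodeDecodeError:
        pass
    # Binary values such as row keys are shown as hex instead.
    return raw.hex()


# Values copied from the lock dump above.
print(decode_field("ZGVmYXVsdA=="))
print(decode_field("SW50ZWdyYXRpb25UZXN0QmlnTGlua2VkTGlzdA=="))
print(decode_field("WHJsjHk="))
```

The first two calls give the namespace `default` and the table name `IntegrationTestBigLinkedList`, matching the table named in the log lines.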

So we need some way of cleaning up the locks held by the stuck procedures and then assigning the regions.

> HBCK for AMv2 (A.K.A HBCK2)
> ---------------------------
>
>                 Key: HBASE-19121
>                 URL: https://issues.apache.org/jira/browse/HBASE-19121
>             Project: HBase
>          Issue Type: Bug
>          Components: hbck
>            Reporter: stack
>            Assignee: Umesh Agashe
>            Priority: Major
>         Attachments: hbase-19121.master.001.patch
>
>
> We don't have an hbck for the new AM. Old hbck may actually do damage going against AMv2.
> Fix.


