Posted to issues@hbase.apache.org by "Nick Dimiduk (Jira)" <ji...@apache.org> on 2020/06/09 19:09:00 UTC

[jira] [Updated] (HBASE-24526) Deadlock executing assign meta procedure

     [ https://issues.apache.org/jira/browse/HBASE-24526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nick Dimiduk updated HBASE-24526:
---------------------------------
    Description: 
I have what appears to be a deadlock while assigning meta. During recovery, the master creates the assign procedure for meta and immediately marks meta as assigned in ZooKeeper. It then creates the subprocedure to open meta on the target region server. However, the PEWorker pool is full of procedures that are stuck, I think because their calls to update meta are going nowhere. For what it's worth, the balancer is running concurrently and has calculated a plan size of 41.
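
To make the failure shape concrete, here is a minimal, self-contained sketch of the same pool-exhaustion pattern, using plain java.util.concurrent rather than HBase's ProcedureExecutor (names and numbers are illustrative only): every worker in a bounded pool blocks on a condition that can only be satisfied by a task still sitting in the queue.

{code:java}
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Toy illustration only, not HBase code: a bounded worker pool deadlocks because
// every worker blocks on a condition ("meta is online") that only a still-queued
// task can satisfy.
public class PoolExhaustionDeadlock {
  public static void main(String[] args) {
    final int workers = 16; // mirrors the 16 PEWorker threads
    ExecutorService pool = Executors.newFixedThreadPool(workers);
    CountDownLatch metaOnline = new CountDownLatch(1);

    // The "region transition" tasks grab every worker and block waiting on meta.
    for (int i = 0; i < workers; i++) {
      pool.submit(() -> {
        try {
          metaOnline.await(); // stands in for an hbase:meta update that never returns
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
    }

    // The task that would bring meta online (the OpenRegionProcedure in the real
    // system) is queued behind them and never gets a worker.
    pool.submit(metaOnline::countDown);

    System.out.println("All workers are blocked; the only task that could unblock them is queued.");
  }
}
{code}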

From the master log,

{noformat}
2020-06-06 00:34:07,314 INFO org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure: Starting pid=17802, ppid=17801, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; state=OPEN, location=null; forceNewPlan=true, retain=false
2020-06-06 00:34:07,465 INFO org.apache.hadoop.hbase.zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in ZooKeeper as hbasedn139.example.com,16020,1591403576247
2020-06-06 00:34:07,466 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=17803, ppid=17802, state=RUNNABLE; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]
{noformat}

{{pid=17803}} is not mentioned again. hbasedn139 never receives an {{openRegion}} RPC.

Meanwhile, additional procedures are scheduled and picked up by workers, each of which gets "stuck". I see "Worker stuck" log lines for all 16 PEWorker threads.

{noformat}
2020-06-06 00:34:07,961 INFO org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock for pid=17804, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure table=IntegrationTestBigLinkedList, region=54f4f6c0e921e6d25e6043cba79c09aa, REOPEN/MOVE
2020-06-06 00:34:07,961 INFO org.apache.hadoop.hbase.master.assignment.RegionStateStore: pid=17804 updating hbase:meta row=54f4f6c0e921e6d25e6043cba79c09aa, regionState=CLOSING, regionLocation=hbasedn046.example.com,16020,1591402383956
...
2020-06-06 00:34:22,295 WARN org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Worker stuck PEWorker-16(pid=17804), run time 14.3340 sec
...
2020-06-06 00:34:27,295 WARN org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Worker stuck PEWorker-16(pid=17804), run time 19.3340 sec
...
{noformat}
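
Those "Worker stuck" warnings come from a periodic check of how long each worker has spent on its current procedure. A rough sketch of that kind of watchdog, assuming a simple start-timestamp per worker and a fixed threshold (illustrative only, not the actual ProcedureExecutor internals):

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Toy watchdog: periodically warns about workers that have been running one task too long.
public class StuckWorkerMonitor {
  // worker name -> start time (ms) of the task it is currently executing
  private final Map<String, Long> taskStartMs = new ConcurrentHashMap<>();
  private final long stuckThresholdMs;

  public StuckWorkerMonitor(long stuckThresholdMs) {
    this.stuckThresholdMs = stuckThresholdMs;
  }

  public void taskStarted(String worker) { taskStartMs.put(worker, System.currentTimeMillis()); }
  public void taskFinished(String worker) { taskStartMs.remove(worker); }

  public void start(ScheduledExecutorService scheduler) {
    scheduler.scheduleAtFixedRate(() -> {
      long now = System.currentTimeMillis();
      taskStartMs.forEach((worker, start) -> {
        long runMs = now - start;
        if (runMs > stuckThresholdMs) {
          System.out.printf("WARN Worker stuck %s, run time %.4f sec%n", worker, runMs / 1000.0);
        }
      });
    }, 5, 5, TimeUnit.SECONDS);
  }

  public static void main(String[] args) throws InterruptedException {
    StuckWorkerMonitor monitor = new StuckWorkerMonitor(10_000L); // warn after 10 seconds
    monitor.start(Executors.newSingleThreadScheduledExecutor());
    monitor.taskStarted("PEWorker-16"); // simulate a worker that never finishes its task
    Thread.sleep(30_000L);
  }
}
{code}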

The cluster stays in this state, with PEWorker threads stuck, for upwards of 15 minutes. Eventually the master starts logging

{noformat}
2020-06-06 00:50:18,033 INFO org.apache.hadoop.hbase.client.RpcRetryingCallerImpl: Call exception, tries=30, retries=31, started=970072 ms ago, cancelled=false, msg=Call queue is full on hbasedn139.example.com,16020,1591403576247, too many items queued ?, details=row 'IntegrationTestBigLinkedList,,1591398987965.54f4f6c0e921e6d25e6043cba79c09aa.' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=hbasedn139.example.com,16020,1591403576247, seqNum=-1, see https://s.apache.org/timeout
{noformat}
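
That message comes from the client-side retry machinery behind the blocked meta updates, which keeps retrying with backoff while the target's call queue stays full. A generic sketch of a capped retry loop of that shape (illustrative only, not HBase's RpcRetryingCallerImpl):

{code:java}
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

// Generic capped-retry loop with exponential backoff, logging progress like the line above.
public final class RetryLoop {
  public static <T> T callWithRetries(Callable<T> call, int maxRetries, long basePauseMs)
      throws Exception {
    long startMs = System.currentTimeMillis();
    for (int tries = 0; ; tries++) {
      try {
        return call.call();
      } catch (IOException e) {
        if (tries >= maxRetries) {
          throw e; // out of retries: surface the failure to the caller
        }
        System.out.printf("INFO Call exception, tries=%d, retries=%d, started=%d ms ago, msg=%s%n",
            tries, maxRetries, System.currentTimeMillis() - startMs, e.getMessage());
        // Exponential backoff with a little jitter before the next attempt.
        long pauseMs = basePauseMs * (1L << Math.min(tries, 6))
            + ThreadLocalRandom.current().nextLong(100);
        Thread.sleep(pauseMs);
      }
    }
  }
}
{code}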

The master never recovers on its own.

I'm not sure how common this condition might be. It popped up after about 20 total hours of running ITBLL with ServerKillingMonkey.


> Deadlock executing assign meta procedure
> ----------------------------------------
>
>                 Key: HBASE-24526
>                 URL: https://issues.apache.org/jira/browse/HBASE-24526
>             Project: HBase
>          Issue Type: Bug
>          Components: proc-v2, Region Assignment
>    Affects Versions: 2.3.0
>            Reporter: Nick Dimiduk
>            Priority: Critical
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)