You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hbase.apache.org by "Bahram Chehrazy (JIRA)" <ji...@apache.org> on 2019/03/19 01:58:00 UTC

[jira] [Updated] (HBASE-22060) postOpenDeployTasks could send OPENED region transition state to the wrong master

     [ https://issues.apache.org/jira/browse/HBASE-22060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bahram Chehrazy updated HBASE-22060:
------------------------------------
    Description: 
As was reported in HBASE-21788, we have repeatedly seen regions getting stuck in OPENING after master restarts. Here is one scenario that I've observed recently:

 

1) There is a region is transit (RIT).

2) The active master aborts and begins shutting down.

3) The backup master becomes active quickly, finds the RIT, creates OpenRegionProcedure and send request to some server.

4) The server quickly opens the region and posts OPENED state transition, but it uses its cached master instead of the new one. 

5) The old active master which had not completely shutdown its assignment manager yet, notes the OPENED state report and ignores it. Because no corresponding procedure can be found.

6) The new master waits forever for a response to its OPEN region request.

 

This happens more often with the meta region because it's small and takes a few seconds to open. Below are some related logs:




*Previous HMaster:*
2019-03-14 13:19:16,310 ERROR [PEWorker-1] master.HMaster: ***** ABORTING master <master-1>,17000,1552438242232: Shutting down HBase cluster: file system not available *****
2019-03-14 13:19:16,310 INFO [PEWorker-1] regionserver.HRegionServer: ***** STOPPING region server '<master-1>,17000,1552438242232' *****
2019-03-14 13:20:54,358 WARN [RpcServer.priority.FPBQ.Fifo.handler=11,queue=1,port=17000] assignment.AssignmentManager: No matching procedure found for rit=OPEN, location=*************,17020,1552561955412, table=hbase:meta, region=1588230740 transition to OPENED
2019-03-14 13:20:55,707 INFO [master/<master-1>:17000] assignment.AssignmentManager: Stopping assignment manager


*New HMaster logs:*

2019-03-14 13:19:16,907 INFO [master/<master-2>:17000:becomeActiveMaster] master.ActiveMasterManager: Deleting ZNode for /HBaseServerZnodeCommonDir/**************/backup-masters/<master-2>,17000,1552438259871 from backup master directory
2019-03-14 13:19:17,031 INFO [master/<master-2>:17000:becomeActiveMaster] master.ActiveMasterManager: Registered as active master=<master-2>,17000,1552438259871
2019-03-14 13:20:52,017 INFO [PEWorker-12] zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in ZooKeeper as <server-1>,17020,1552536956826
2019-03-14 13:20:52,105 INFO [PEWorker-12] procedure2.ProcedureExecutor: Initialized subprocedures=[\{pid=178230, ppid=178229, state=RUNNABLE, hasLock=false; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]

 

*HServer logs:*

2019-03-14 13:20:52,708 INFO [RS_CLOSE_META-regionserver/<server-1>:17020-0] handler.AssignRegionHandler: Open hbase:meta,,1.1588230740
2019-03-14 13:20:54,353 INFO [RS_CLOSE_META-regionserver/<server-1>:17020-0] regionserver.HRegion: Opened 1588230740; next sequenceid=229166
2019-03-14 13:20:54,356 INFO [RS_CLOSE_META-regionserver/<server-1>:17020-0] regionserver.HRegionServer: Post open deploy tasks for hbase:meta,,1.1588230740
2019-03-14 13:20:54,358 INFO [RS_CLOSE_META-regionserver/<server-1>:17020-0] handler.AssignRegionHandler: Opened hbase:meta,,1.1588230740

 

 

  was:
Not much for this one yet.
I repeatedly see the cases when the region is stuck in OPENING, and after master restart RIT is recovered, and stays WAITING; its OpenRegionProcedure (also recovered) is stuck in Runnable and never does anything for hours. I cannot find logs on the target server indicating that it ever tried to do anything after master restart.

This procedure needs at the very least logging of what it's trying to do, and maybe a timeout so it unconditionally fails after a configurable period (1 hour?).
I may also investigate why it doesn't do anything and file a separate bug. I wonder if it's somehow related to the region status check, but this is just a hunch.


> postOpenDeployTasks could send OPENED region transition state to the wrong master
> ---------------------------------------------------------------------------------
>
>                 Key: HBASE-22060
>                 URL: https://issues.apache.org/jira/browse/HBASE-22060
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>            Reporter: Bahram Chehrazy
>            Assignee: stack
>            Priority: Critical
>
> As was reported in HBASE-21788, we have repeatedly seen regions getting stuck in OPENING after master restarts. Here is one scenario that I've observed recently:
>  
> 1) There is a region is transit (RIT).
> 2) The active master aborts and begins shutting down.
> 3) The backup master becomes active quickly, finds the RIT, creates OpenRegionProcedure and send request to some server.
> 4) The server quickly opens the region and posts OPENED state transition, but it uses its cached master instead of the new one. 
> 5) The old active master which had not completely shutdown its assignment manager yet, notes the OPENED state report and ignores it. Because no corresponding procedure can be found.
> 6) The new master waits forever for a response to its OPEN region request.
>  
> This happens more often with the meta region because it's small and takes a few seconds to open. Below are some related logs:
> *Previous HMaster:*
> 2019-03-14 13:19:16,310 ERROR [PEWorker-1] master.HMaster: ***** ABORTING master <master-1>,17000,1552438242232: Shutting down HBase cluster: file system not available *****
> 2019-03-14 13:19:16,310 INFO [PEWorker-1] regionserver.HRegionServer: ***** STOPPING region server '<master-1>,17000,1552438242232' *****
> 2019-03-14 13:20:54,358 WARN [RpcServer.priority.FPBQ.Fifo.handler=11,queue=1,port=17000] assignment.AssignmentManager: No matching procedure found for rit=OPEN, location=*************,17020,1552561955412, table=hbase:meta, region=1588230740 transition to OPENED
> 2019-03-14 13:20:55,707 INFO [master/<master-1>:17000] assignment.AssignmentManager: Stopping assignment manager
> *New HMaster logs:*
> 2019-03-14 13:19:16,907 INFO [master/<master-2>:17000:becomeActiveMaster] master.ActiveMasterManager: Deleting ZNode for /HBaseServerZnodeCommonDir/**************/backup-masters/<master-2>,17000,1552438259871 from backup master directory
> 2019-03-14 13:19:17,031 INFO [master/<master-2>:17000:becomeActiveMaster] master.ActiveMasterManager: Registered as active master=<master-2>,17000,1552438259871
> 2019-03-14 13:20:52,017 INFO [PEWorker-12] zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in ZooKeeper as <server-1>,17020,1552536956826
> 2019-03-14 13:20:52,105 INFO [PEWorker-12] procedure2.ProcedureExecutor: Initialized subprocedures=[\{pid=178230, ppid=178229, state=RUNNABLE, hasLock=false; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]
>  
> *HServer logs:*
> 2019-03-14 13:20:52,708 INFO [RS_CLOSE_META-regionserver/<server-1>:17020-0] handler.AssignRegionHandler: Open hbase:meta,,1.1588230740
> 2019-03-14 13:20:54,353 INFO [RS_CLOSE_META-regionserver/<server-1>:17020-0] regionserver.HRegion: Opened 1588230740; next sequenceid=229166
> 2019-03-14 13:20:54,356 INFO [RS_CLOSE_META-regionserver/<server-1>:17020-0] regionserver.HRegionServer: Post open deploy tasks for hbase:meta,,1.1588230740
> 2019-03-14 13:20:54,358 INFO [RS_CLOSE_META-regionserver/<server-1>:17020-0] handler.AssignRegionHandler: Opened hbase:meta,,1.1588230740
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)