Posted to issues@hbase.apache.org by "Bryan Beaudreault (Jira)" <ji...@apache.org> on 2023/11/14 17:20:00 UTC
[jira] [Created] (HBASE-28203) Graceful shutdown of active hmaster
Bryan Beaudreault created HBASE-28203:
-----------------------------------------
Summary: Graceful shutdown of active hmaster
Key: HBASE-28203
URL: https://issues.apache.org/jira/browse/HBASE-28203
Project: HBase
Issue Type: Improvement
Reporter: Bryan Beaudreault
We recently had an operational incident due to bad interplay between ongoing maintenance of regionservers and a new maintenance of hmasters. We'd been running a rolling restart of regionservers when someone unknowingly started a rolling restart of hmasters. This caused long RITs (regions in transition) on a few clusters. What happened was:
* regionserver restart executed a region move
* hmaster saw it and started a TRSP (TransitRegionStateProcedure), sending a request to the RS to close the region
* immediately after, the hmaster stopped due to that rolling restart
* the regionserver saw the close request, closed the region, and tried to report state back to the hmaster.
* The report kept failing, spamming tons of failures for the next 30+ secs until the new hmaster became active.
* Finally the new hmaster started up and recovered the state and the RIT finished.
Long RITs are really painful and effectively downtime for hbase. I think we should have a shutdown hook on the hmaster which:
* Sets state so new move requests are rejected
* Waits for any existing move TRSPs to finish
* Shuts down
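The three steps above could be sketched roughly as follows. This is only an illustration of the reject-then-drain idea, not HBase code: the class MoveDrainGate, its methods, and the bounded wait are all hypothetical names and assumptions made for the example, and a real change would have to hook into the master's procedure handling.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical gate that tracks in-flight region moves; not an HBase class. */
final class MoveDrainGate {
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition drained = lock.newCondition();
  private boolean shuttingDown = false;
  private int inFlightMoves = 0;

  /** Called before starting a move. Returns false once shutdown has begun. */
  boolean tryBeginMove() {
    lock.lock();
    try {
      if (shuttingDown) {
        return false; // step 1: new move requests are rejected
      }
      inFlightMoves++;
      return true;
    } finally {
      lock.unlock();
    }
  }

  /** Called when a move that was admitted by tryBeginMove() has finished. */
  void endMove() {
    lock.lock();
    try {
      if (inFlightMoves > 0 && --inFlightMoves == 0) {
        drained.signalAll();
      }
    } finally {
      lock.unlock();
    }
  }

  /**
   * Stops admitting moves, then waits for in-flight moves to finish (step 2).
   * The timeout is an assumption of this sketch so shutdown cannot hang forever.
   * Returns true if everything drained, false on timeout or interruption.
   */
  boolean beginShutdownAndAwait(long timeoutMillis) {
    long nanosLeft = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    lock.lock();
    try {
      shuttingDown = true;
      while (inFlightMoves > 0) {
        if (nanosLeft <= 0L) {
          return false;
        }
        nanosLeft = drained.awaitNanos(nanosLeft);
      }
      return true; // step 3: the caller can now shut down
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      lock.unlock();
    }
  }
}
```

One way to wire this in would be to call beginShutdownAndAwait from a thread registered with Runtime.getRuntime().addShutdownHook, or from the master's own stop path, before the process exits. Whether a timeout is acceptable, and what to do when it expires, is a design question this sketch leaves open.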
--
This message was sent by Atlassian Jira
(v8.20.10#820010)