Posted to issues@hbase.apache.org by "Bryan Beaudreault (Jira)" <ji...@apache.org> on 2023/11/14 17:20:00 UTC

[jira] [Created] (HBASE-28203) Graceful shutdown of active hmaster

Bryan Beaudreault created HBASE-28203:
-----------------------------------------

             Summary: Graceful shutdown of active hmaster
                 Key: HBASE-28203
                 URL: https://issues.apache.org/jira/browse/HBASE-28203
             Project: HBase
          Issue Type: Improvement
            Reporter: Bryan Beaudreault


We recently had an operational incident caused by bad interplay between an ongoing maintenance of regionservers and a new maintenance of hmasters. We'd been running a rolling restart of regionservers when someone unknowingly started a rolling restart of hmasters. This caused long RITs (regions in transition) on a few clusters. What happened was:
 * regionserver restart executed a region move
 * hmaster saw it and started a TRSP (TransitRegionStateProcedure), sending a request to the RS to close the region
 * immediately after, the hmaster stopped due to that rolling restart
 * the regionserver saw the close request, closed the region, and tried to report state back to the hmaster
 * those report attempts failed repeatedly, spamming failures for the next 30+ seconds while the new hmaster became active
 * finally the new hmaster started up, recovered the state, and the RIT finished

Long RITs are really painful and are effectively downtime for HBase. I think we should have a shutdown hook on the hmaster which:
 * Sets state so new move requests are rejected
 * Waits for any existing move TRSPs to finish
 * Shuts down
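
The three steps above could be sketched roughly as follows. This is only an illustration of the reject/drain/stop pattern, not HBase code: `MoveGate`, `tryBeginMove`, `endMove`, and `drain` are hypothetical names, and a real implementation would need to hook into the master's procedure executor and decide what timeout is acceptable.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical gate tracking in-flight region moves; not an HBase class. */
final class MoveGate {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition drained = lock.newCondition();
    private boolean draining = false; // set once shutdown has begun
    private int inFlight = 0;         // moves admitted but not yet finished

    /** Admits a new move, or returns false if shutdown has begun. */
    boolean tryBeginMove() {
        lock.lock();
        try {
            if (draining) {
                return false; // step 1: new move requests are rejected
            }
            inFlight++;
            return true;
        } finally {
            lock.unlock();
        }
    }

    /** Marks one previously admitted move as finished. */
    void endMove() {
        lock.lock();
        try {
            if (inFlight == 0) {
                throw new IllegalStateException("no move in flight");
            }
            inFlight--;
            if (inFlight == 0) {
                drained.signalAll(); // wake a waiting drain() call
            }
        } finally {
            lock.unlock();
        }
    }

    /**
     * Stops admitting moves, then waits up to the timeout for in-flight
     * moves to finish (step 2). Returns true if everything drained,
     * false on timeout or interruption. The caller then shuts down (step 3).
     */
    boolean drain(long timeout, TimeUnit unit) {
        long nanos = unit.toNanos(timeout);
        lock.lock();
        try {
            draining = true;
            while (inFlight > 0) {
                if (nanos <= 0L) {
                    return false; // timed out with moves still running
                }
                nanos = drained.awaitNanos(nanos);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
            return false;
        } finally {
            lock.unlock();
        }
    }
}
```

In this sketch the shutdown path would call something like `gate.drain(30, TimeUnit.SECONDS)` before stopping the process, for example from a thread registered with `Runtime.getRuntime().addShutdownHook(...)`. The timeout is an assumption here, meant to keep a stuck move from blocking shutdown indefinitely.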



--
This message was sent by Atlassian Jira
(v8.20.10#820010)