Posted to issues@hbase.apache.org by GitBox <gi...@apache.org> on 2020/09/30 21:28:02 UTC

[GitHub] [hbase] saintstack commented on pull request #2113: HBASE-24286: HMaster won't become healthy after after cloning or crea…

saintstack commented on pull request #2113:
URL: https://github.com/apache/hbase/pull/2113#issuecomment-701656158

On @joshelser's good question, 'I was concerned about making sure all "old" RegionServers are actually down before we reassign regions onto new servers': SCP should probably call expire on the ServerName it is passed. That would be redundant in most cases. I thought it did this already, but it does not. Queuing an SCP adds the server to the 'Dead Servers' list (I think -- check), so if the old server checks in at any later time, it will be told 'YouAreDead..' and will shut itself down.

On the @Apache9 question:

> And I prefer we just add a new operation in HBCK2, to scan for unknown servers and schedule SCP for them? Or maybe we already have one in place? @saintstack Can you recall sir? I'm not very familiar with all the operations in HBCK2.

Currently HBCK2 does not have special handling for 'Unknown Servers'. The 'HBCK Report' page that reports 'Unknown Servers' found by a CatalogJanitor run suggests:

`The below are servers mentioned in the hbase:meta table that are no longer 'live' or known 'dead'.
The server likely belongs to an older cluster epoch since replaced by a new instance because of a restart/crash.
To clear 'Unknown Servers', run 'hbck2 scheduleRecoveries UNKNOWN_SERVERNAME'. This will schedule a ServerCrashProcedure.
It will clear out 'Unknown Server' references and schedule reassigns of any Regions that were associated with this host.
But first!, be sure the referenced Region is not currently stuck looping trying to OPEN. Does it show as a Region-In-Transition on the
Master home page? Is it mentioned in the 'Procedures and Locks' Procedures list? If so, perhaps it stuck in a loop
trying to OPEN but unable to because of a missing reference or file.
Read the Master log looking for the most recent
mentions of the associated Region name. Try and address any such complaint first. If successful, a side-effect
should be the clean up of the 'Unknown Servers' list. It may take a while. OPENs are retried forever but the interval
between retries grows. The 'Unknown Server' may be cleared because it is just the last RegionServer the Region was
successfully opened on; on the next open, the 'Unknown Server' will be purged.`

So, the 'fix' for 'Unknown Servers', as I exercised it recently, was to parse the 'HBCK Report' page to make a list of all 'Unknown Servers' and then script a call to 'hbck2 scheduleRecoveries' for each one. We should be able to do better than this -- either add handling of 'Unknown Servers' to the set of issues 'fixed' when we run 'hbck2 fixMeta' or, as is done here, schedule an SCP for any 'Unknown Server' found when CatalogJanitor runs.
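
The scripted workaround described above might look roughly like the following Python sketch. It assumes that ServerNames take the usual `host,port,startcode` form, that the 'Unknown Servers' text has already been copied into a string (no particular layout of the report page is assumed), and that HBCK2 is invoked as `hbase hbck -j <jar>`. The jar path, function names, and sample server names are hypothetical. The sketch only builds the command lines; it does not run them.

```python
import re
import shlex

# Assumed shape of a ServerName: host,port,startcode
# (e.g. rs1.example.org,16020,1601500000000).
SERVER_NAME = re.compile(r"\b[A-Za-z0-9][A-Za-z0-9.-]*,\d{1,5},\d+\b")


def unknown_servers(report_text):
    """Return unique ServerName-looking tokens in order of first appearance."""
    seen = []
    for name in SERVER_NAME.findall(report_text):
        if name not in seen:
            seen.append(name)
    return seen


def schedule_recoveries_commands(servers, hbck2_jar):
    """Build one 'scheduleRecoveries' command line per server (nothing is executed)."""
    return [
        ["hbase", "hbck", "-j", hbck2_jar, "scheduleRecoveries", server]
        for server in servers
    ]


if __name__ == "__main__":
    # Hypothetical text copied from the 'Unknown Servers' section of the report.
    sample = """
    rs1.example.org,16020,1601500000000
    rs2.example.org,16020,1601500000001
    rs1.example.org,16020,1601500000000
    """
    commands = schedule_recoveries_commands(
        unknown_servers(sample), "/path/to/hbase-hbck2.jar"
    )
    for cmd in commands:
        print(" ".join(shlex.quote(part) for part in cmd))
```

Printing the commands rather than executing them leaves room for the check the report text asks for first: confirming that the referenced Regions are not stuck looping in OPEN before any SCP is scheduled.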

On the latter auto-fix, there is understandable reluctance. I think this comes from 'Unknown Servers' being an ill-defined entity type; the auto-fix can wait until the concept hardens.

I like this comment of @Apache9:

> I do not think there is a guarantee that you change the filesystem layout of HBase internal, and HBase cluster will still be functional. Even if sometimes it could, as you said, on 1.4, it does not mean that we will always keep this behavior in new versions.

But there should be a 'safe' means of attaining your ends, @taklwu.

Perhaps of help is a little-known facility, the hbase.master.maintenance_mode config, which starts the Master in 'maintenance' mode (HBASE-21073): the Master comes up and assigns meta but nothing else, so that you can ask it to make edits of state/procedures/meta. Perhaps you could script moving the cluster to the new location, starting the Master there in maintenance mode, editing meta (an SCP that doesn't assign?), then shutting it down, followed by a normal restart.
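
As a minimal sketch of the config side only, the Python below renders a `<property>` fragment for hbase.master.maintenance_mode. It assumes the setting is supplied through the Master's hbase-site.xml in the usual name/value form; the function name is hypothetical, and the rest of the move/edit/restart sequence is not modelled.

```python
import xml.etree.ElementTree as ET


def maintenance_mode_property(enabled=True):
    """Build a <property> element for hbase.master.maintenance_mode."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = "hbase.master.maintenance_mode"
    ET.SubElement(prop, "value").text = "true" if enabled else "false"
    return prop


if __name__ == "__main__":
    # Emits a fragment meant to be placed inside <configuration> in hbase-site.xml.
    print(ET.tostring(maintenance_mode_property(), encoding="unicode"))
```

The property would need to be removed, or set back to false, before the normal restart that follows.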
