Posted to dev@hbase.apache.org by "Sergey Shelukhin (JIRA)" <ji...@apache.org> on 2019/01/28 18:52:00 UTC

[jira] [Created] (HBASE-21797) more resilient master startup for bad cluster state

Sergey Shelukhin created HBASE-21797:
----------------------------------------

Summary: more resilient master startup for bad cluster state
Key: HBASE-21797
URL: https://issues.apache.org/jira/browse/HBASE-21797
Project: HBase
Issue Type: Bug
Reporter: Sergey Shelukhin

See HBASE-21743 for broader context.
When the master restarts after a failure, it should already be able to handle having failed to persist the state of some procedures (by definition, the cluster is much more likely to be in a bad state if the master restarted due to some issue). So it should also be able to abandon old recovery procedures (SCP & RIT and their children) as if they were not saved, and create new ones during startup.

This should be off by default.

The idea is (some steps can be done in parallel as they are now, e.g. loading server list and meta):
1) During proc WAL recovery do not recover SCP and open/close related procs.
2) Load server list as usual (dead and alive).
3) If meta is on a dead server, recover it via either a new SCP or perhaps just a separate meta recovery proc without extra SCP steps (leaving the SCP for step 5).
4) Load region list as usual.
5) Create SCPs for dead servers.
6) Reassign any regions on non-existent servers (we've seen some issues with this when SCP finishes amid lots of HDFS errors and/or manual intervention: master "forgets" the server ever existed and the region stays "open" there forever).
7) ? Look for other simple inconsistencies that don't require HBCK-level changes.
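
A rough sketch of steps 1 and 6, to make the intent concrete. Everything here is hypothetical: the class, the ProcKind enum, the abandonRecoveryProcs flag, and the use of plain strings for regions and servers are assumptions for illustration only, not HBase APIs or the eventual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch; none of these names exist in HBase.
final class ResilientStartupSketch {

    // Assumed procedure kinds; SCP stands for the server crash procedure.
    enum ProcKind { SCP, REGION_OPEN, REGION_CLOSE, OTHER }

    // Step 1: when the (assumed, off-by-default) option is on, drop SCP and
    // open/close related procs while replaying the proc WAL; keep the rest.
    static List<ProcKind> procsToRecover(List<ProcKind> fromWal,
                                         boolean abandonRecoveryProcs) {
        List<ProcKind> kept = new ArrayList<>();
        for (ProcKind kind : fromWal) {
            boolean recoveryRelated = kind != ProcKind.OTHER;
            if (!(abandonRecoveryProcs && recoveryRelated)) {
                kept.add(kind);
            }
        }
        return kept;
    }

    // Step 6: find regions whose recorded server is in neither the alive nor
    // the dead server list, i.e. the master has "forgotten" that server.
    static List<String> regionsOnUnknownServers(Map<String, String> regionToServer,
                                                Set<String> aliveServers,
                                                Set<String> deadServers) {
        List<String> orphaned = new ArrayList<>();
        // Copy into a TreeMap so the result is ordered by region name.
        for (Map.Entry<String, String> e : new TreeMap<>(regionToServer).entrySet()) {
            String server = e.getValue();
            if (!aliveServers.contains(server) && !deadServers.contains(server)) {
                orphaned.add(e.getKey());
            }
        }
        return orphaned;
    }
}
```

In this sketch, regions on dead servers are deliberately not returned by regionsOnUnknownServers, since the SCPs created in step 5 would cover them; only regions pointing at servers the master no longer knows about would need the extra reassignment.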
