Posted to issues@hbase.apache.org by "Sergey Shelukhin (JIRA)" <ji...@apache.org> on 2019/01/23 02:45:00 UTC

[jira] [Comment Edited] (HBASE-21743) stateless assignment

    [ https://issues.apache.org/jira/browse/HBASE-21743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16749406#comment-16749406 ] 

Sergey Shelukhin edited comment on HBASE-21743 at 1/23/19 2:44 AM:
-------------------------------------------------------------------

We've been running a master snapshot. Indeed, we found that sometimes procv2 deletion can lead to additional issues; however, sometimes it's also the only way forward.
[~stack] there's no way to find out about old dead servers on restart other than from WAL directories (or by inferring them from stale region assignments stored in meta), because the servers are not stored anywhere else (and the ZK node for a dead server is gone, as intended).

The basic idea is to look at the list of regions (meta) and at the live and dead servers - both of which the master already does - and schedule procedures from scratch as required, instead of relying on the procedure WAL for RIT/SCP/etc. (optionally, i.e. behind a config).
Personally (as we've discussed years ago ;)) I would prefer something like an actor model, where a central fast actor does this in a loop and fires off idempotent slow actions asynchronously; but even within the current paradigm I think reducing state would provide some benefit.
Right now, every bug I file (and all those I don't file that result from subtly-incorrect or too-aggressive manual interventions needed to address other bugs) would be trivial to resolve if the master were looking at cluster state; but because of the split-brain problem, every part of the system is waiting for some other part with incorrect assumptions. So the whole thing is very fragile w.r.t. both bugs and the manual interventions that, as we know, are often necessary despite best intentions (hence hbck/offlinerepair/etc.).
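To make the first part of that concrete, here is a rough sketch of such a reconciliation pass in Java. Everything in it (RegionState, the server sets, assign/recover) is an illustrative stand-in for information the master already has, not HBase code:

{code:java}
import java.util.List;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Rough sketch only - not HBase code. RegionState and the assign/recover calls are
// simplified stand-ins for information the master already has (meta plus server lists).
public class ReconcileLoop {
  enum State { OPEN, OPENING, CLOSING, CLOSED }

  static class RegionState {
    final String region; final State state; final String server;
    RegionState(String region, State state, String server) {
      this.region = region; this.state = state; this.server = server;
    }
  }

  private final ExecutorService slowActions = Executors.newCachedThreadPool();

  /** One pass of the central "fast actor": look at meta and the server lists,
   *  decide what looks wrong, and fire off idempotent fixes asynchronously. */
  void reconcile(List<RegionState> metaRegions, Set<String> liveServers,
                 Set<String> serversWithWalDirs) {
    for (RegionState rs : metaRegions) {
      boolean serverDead = rs.server != null && !liveServers.contains(rs.server);
      if (rs.state == State.CLOSED) {
        slowActions.submit(() -> assign(rs.region));                 // needs a home
      } else if (serverDead) {
        boolean hasWals = serversWithWalDirs.contains(rs.server);
        slowActions.submit(() -> recover(rs.region, rs.server, hasWals));
      }
      // anything else is healthy or already in flight; the next pass re-checks it anyway
    }
  }

  void assign(String region) {
    // pick a live server and ask it to open the region; safe to retry on the next pass
  }

  void recover(String region, String deadServer, boolean hasWals) {
    // split the dead server's WALs if the directory exists, then mark the region
    // closed in meta and reassign it; also safe to retry
  }
}
{code}

The fast actor only decides; the slow actions are idempotent, so they can fail and simply be retried on the next pass, with no persisted procedure state involved.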

For example, the above bug with an incorrect SCP for the meta server happened because master init waits for the SCP to fix meta, but the SCP doesn't know it needs to fix meta because of some bug. Of course, if persistent SCPs didn't exist, this bug couldn't exist in the first place; but abstractly, if one actor were looking at this, it would simply see meta assigned to a dead server and recover it, just like that. No state is needed other than where meta is and the list of servers.
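To illustrate how little state that decision actually needs, here is a hedged sketch (the class and names are made up for illustration, not HBase APIs):

{code:java}
import java.util.Collections;
import java.util.Set;

// Hypothetical illustration, not HBase code: with no persisted SCP, deciding whether
// meta needs recovery takes only two inputs the master already has.
public class MetaCheck {
  /** Meta is broken exactly when its recorded server is missing or not alive. */
  static boolean metaNeedsRecovery(String metaServerFromMeta, Set<String> liveServers) {
    return metaServerFromMeta == null || !liveServers.contains(metaServerFromMeta);
  }

  public static void main(String[] args) {
    Set<String> live = Collections.singleton("rs1,16020,1548200000000");
    String metaServer = "rs9,16020,1548100000000"; // a server that is no longer alive
    if (metaNeedsRecovery(metaServer, live)) {
      // split that server's WAL directory if it exists, then reassign meta elsewhere
      System.out.println("meta on dead server " + metaServer + " -> recover");
    }
  }
}
{code}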

Then, to resolve this, we had to nuke the proc WAL to get rid of the bad SCP. Some more SCPs for other servers got lost in the nuke, and we were left with regions CLOSING on dead servers that had neither an SCP nor a WAL directory. Again, looking from a unified perspective we could see - whoops, the region is closing on that server and the server has no WALs to split - so just count it as closed. Whereas now the close-region procedure is not responsible for this; it just waits for an SCP to deal with the server. But there is no SCP, because there is no WAL directory. So nobody looks at these two things together... and after this manual intervention (or, say, an HDFS issue where the WAL write did not succeed) the cluster is broken and I have to go and fix those regions by hand.
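That unified-perspective rule fits in a few lines; the following is only a sketch under assumed inputs (the sets and names are stand-ins for what the master already knows, not real HBase classes):

{code:java}
import java.util.Set;

// Sketch only: resolving a region stuck in CLOSING from the cluster view described above.
public class ClosingOnDeadServer {
  enum Verdict { WAIT_FOR_LIVE_SERVER, SPLIT_WALS_THEN_CLOSE, ALREADY_CLOSED }

  /** A region stuck in CLOSING can be resolved from just these inputs. */
  static Verdict resolveClosing(String server, Set<String> liveServers,
                                Set<String> serversWithWalDirs) {
    if (liveServers.contains(server)) {
      return Verdict.WAIT_FOR_LIVE_SERVER;   // a normal close is still in progress
    }
    if (serversWithWalDirs.contains(server)) {
      return Verdict.SPLIT_WALS_THEN_CLOSE;  // dead, but its edits must be replayed first
    }
    return Verdict.ALREADY_CLOSED;           // dead with nothing to split: just count it closed
  }
}
{code}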

Now I go to meta and set the regions to CLOSED (pretend I'm actually hbck2). If assignment were stateless, the master would see the closed regions and assign them. Instead, the confirm-close retry loop is so well-isolated that it doesn't care about anything else in the world and just blindly resets them back to CLOSING, so I additionally have to kill -9 the master to make those RITs go away, so that on restart the master actually recovers the regions.
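As a hedged illustration of that difference (again with made-up types, not HBase code), a stateless pass would simply pick up the manually edited rows on its next scan of meta:

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch only: an operator edit that sets a region to CLOSED in meta is just another
// input to the next pass, so the region gets assigned instead of being reset to
// CLOSING by an isolated confirm-close retry loop.
public class PickUpManualEdits {
  /** metaRows maps region name -> state string as read from meta on this pass. */
  static List<String> regionsToAssign(Map<String, String> metaRows) {
    List<String> toAssign = new ArrayList<>();
    for (Map.Entry<String, String> e : metaRows.entrySet()) {
      if ("CLOSED".equals(e.getValue())) {   // includes rows an operator (or hbck2) just edited
        toAssign.add(e.getKey());
      }
    }
    return toAssign;
  }
}
{code}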

Luckily, when the recovered RIT procedures in this case see a CLOSED region with an empty server, they just silently go away (which might technically be a bug, but it works for me ;)). I've seen other cases where, when some procedure sees a region in an unexpected state (due to a race condition), it either fails the master (as with meta replicas) or updates the region to some other state, leaving things in a strange condition.

This is just one example. And at all 3.5 of these steps the persistent procedure state is 100% unnecessary, because the master has all the information it needs to make the correct decisions - as long as this is done in a sane way, e.g. with a hybrid actor model that has no persistent state of its own...





> stateless assignment
> --------------------
>
>                 Key: HBASE-21743
>                 URL: https://issues.apache.org/jira/browse/HBASE-21743
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Priority: Major
>
> Running HBase for only a few weeks, we found dozens(?) of bugs with assignment that all seem to have the same nature - a split brain between two procedures; or between a procedure and master startup (meta replica bugs); or a procedure and master shutdown (HBASE-21742); or a procedure and something else (when an SCP had an incorrect region list persisted; don't recall the bug#). 
> To me, it starts to look like a pattern: as in AMv1, where concurrent interactions were unclear and hard to reason about, AMv2 - despite its cleaner individual pieces - has preserved the problem of unclear concurrent interactions, and in fact made it worse because of operation state persistence and isolation.
> Procedures are great for multi-step operations that need rollback and the like, e.g. creating a table or a snapshot, or even region splitting. However, I'm not so sure about assignment. 
> We already have the persisted information - region state in meta (incl. transition states like opening or closing), and the server list in the form of the WAL directory list. Procedure state is not any more reliable than those (one can argue that a meta update can fail, but so can a procv2 WAL flush, so we have to handle out-of-date information regardless). So we don't need any extra state to decide on assignment, whether for recovery or for balancing. In fact, as mentioned in some bugs, deleting the procv2 WAL is often the best way to recover the cluster, because the master can already figure out what to do without the additional state.
> I think there should be an option for stateless assignment that does that.
> It could either be a separate pluggable assignment procedure, or an option that does not recover SCPs, RITs, etc. from the WAL but always derives recovery procedures from the existing cluster state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)