Posted to dev@hbase.apache.org by Jonathan Hsieh <jo...@cloudera.com> on 2013/10/19 16:12:37 UTC

[hbase-5487] Master design discussion.

We're moving a discussion started in HBASE-5487 (
https://issues.apache.org/jira/browse/HBASE-5487) based on stack's
suggestion.

Master5 refers to the design doc sergey posted, and hbck-master refers to
the design doc that I posted.  (I'd suggest referring to them by doc name
rather than by who wrote them.)

Here are the questions I posed to sergey:

---

I have a lot of questions. I'll hit the big questions first. Also, would it
be possible to put a version of this up as a gdoc so we can point out nits
and places that need minor clarification? (I have a marked-up physical copy
of the doc; it would be easier to provide feedback that way.)

Main Concerns:

What is a failure and how do you react to failures? I think the master5
design needs to spend more effort considering failure and recovery cases.
I claim there are 4 types of responses from a networked IO operation: the
two states we normally deal with, ack success and ack failure (nack), plus
unknown due to a timeout that actually succeeded (timeout success) and
unknown due to a timeout that actually failed (timeout failure). We have
historically missed the last two timeout cases, or assumed a timeout means
a nack. It seems that master5 makes the same assumptions.
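
To make the taxonomy concrete, here's a minimal Java sketch (names are
mine, not from either doc). The crux is that the caller cannot tell the
last two outcomes apart at the time of the timeout:

// Rough illustration of the four outcomes of a networked IO operation.
// The caller only ever observes three things -- ack, nack, or timeout --
// and a timeout hides whether the remote operation actually ran.
enum RpcOutcome {
  ACK_SUCCESS,      // operation ran and succeeded; caller saw the ack
  ACK_FAILURE,      // operation ran and failed; caller saw the nack
  TIMEOUT_SUCCESS,  // operation succeeded, but the ack never arrived
  TIMEOUT_FAILURE   // operation never ran (or failed); no nack arrived
}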

I'm very concerned about what we need to do to invalidate cached RS
information at clients in the case of a hang, and whether that will violate
the isolation guarantees we claim to provide. I really want an in-depth
failure-handling case analysis, including a client with cached RS
assignments, for move and for something more complicated such as split or
alter.

I really want more invariants specified for the FSM states. E.g., if a
region is in state X, does it have a row in meta? Does it have data on the
FS? Is it open on another regionserver? Is it open on only one
regionserver? I think the 8 pages of tables at the back of the master5 doc
could be made more concise and precise, which would help us attempt to
prove correctness.
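
As an illustration of what I'm after (state names and fields are mine,
purely a sketch), the tables could collapse into per-state invariants that
can actually be checked:

// Sketch: each FSM state declares the invariants that must hold for any
// region in that state, so correctness arguments (and hbck) can test them.
enum RegionStateInvariants {
  OFFLINE(true, true, 0),   // row in meta, data on FS, open nowhere
  OPENING(true, true, 0),   // assigned, but not serving yet
  OPEN(true, true, 1);      // open on exactly one regionserver

  final boolean rowInMeta;     // must a row exist in meta?
  final boolean dataOnFs;      // must region data exist on the FS?
  final int maxOpenServers;    // never open on more than this many RS's

  RegionStateInvariants(boolean rowInMeta, boolean dataOnFs,
      int maxOpenServers) {
    this.rowInMeta = rowInMeta;
    this.dataOnFs = dataOnFs;
    this.maxOpenServers = maxOpenServers;
  }
}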

Clarification questions:

1) State update coordination. What is a "state update from the outside"?
Do RS's initiate splitting on their own? Maybe a picture would help so we
can figure out if it is similar to or different from hbck-master's?

2) Single point of truth. What is this truth? The actions the user
specified? What the RS's are reporting? The last state we were confirmed to
be at? hbck-master tries to define what single point of truth means by
defining intended, current, and actual state data, with durability
properties on each kind. What do clients look at, and who modifies what?

3) Table record: "if regions is out of date, it should be closed and
reopened". It is not clear in master5 how regionservers find out that they
are out of date. Moreover, how do clients talking to RS's with stale
versions know they are going to the correct RS, especially in the face of
RS failures due to timeout?

4) region record: transition states. Shouldn't these be defined as part of
the region record? (This is really similar to hbck-master's current state
and intended state.)

5) Note on user operations: the forgetting thing is scary to me – in your
move/split example, what happens if an RS reads state that is forgotten?

6) table state machine. How do we guarantee clients are not writing
against out-of-date region versions? (In hang situations, a region could be
open in multiple places – the hung RS and the new RS the region was
assigned to and successfully opened on.)

7) region state machine. An earlier draft had splitting and merge cases.
Are they elided in master5, or are they no longer present? How would this
get extended to handle jeffrey's distributed log replay/fast write recovery
feature?

8) logical interactions: it sounds like master5 allows concurrent
operations on a specific region and a specific table (e.g. it will allow
moves and splits and merges on the same region). hbck-master (though not
fully documented) only allows certain region transitions depending on
whether the table is enabled or disabled. Are we sure we don't get into
race conditions? What happens if a disable gets issued? It's possible for
someone to reopen the region and for old clients to continue writing to it
even though it is supposed to be closed.

nit. 9) "in cursive" should be "in italics".

10) The table operations section has tables which I believe describe the
actions between FSM states in the table or region FSMs. Is this correct?
Can the edges be labeled to describe which steps these transitions
correspond to?

Short doc:
nit: Design Constraints, code should: Have AM logic isolated from the
persistent storage of state.
// I think this should be "abstracted" so we can plug in different
implementations of persistent storage of state.

--
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com

Re: [hbase-5487] Master design discussion.

Posted by Jonathan Hsieh <jo...@cloudera.com>.
Responses inline.  I'm going to wait for the next doc before I ask more
questions about the master5 specifics. :)

I suggest we start a thread about the hang/double assign/fencing scenario.

Jon.

On Sat, Oct 19, 2013 at 7:17 AM, Jonathan Hsieh <jo...@cloudera.com> wrote:

> Here are sergey's replies from jira (modulo some reformatting for email).
>
> ----
> Answers lifted from email also (some fixes + one answer was modified due
> to clarification here).
>
> What is a failure and how do you react to failures? I think the master5
>> design needs to spend more effort considering failure and recovery
>> cases. I claim there are 4 types of responses from a networked IO
>> operation: the two states we normally deal with, ack success and ack
>> failure (nack), plus unknown due to a timeout that succeeded (timeout
>> success) and unknown due to a timeout that failed (timeout failure). We
>> have historically missed the last two cases and they aren't considered
>> in the master5 design.
>
>
> There are a few considerations. Let me examine if there are other cases
> than these.
> I am assuming the collocated table, which should reduce such cases for
> state (probably, if collocated table cannot be written reliably, master
> must stop-the-world and fail over).
>

For the master, I agree.  In the case of a nack it knows it failed and
should fail over; in the case of a timeout it should verify and retry, and
if it cannot, it should abdicate.  This seems equivalent to the 9x-master's
"if we can't get to zk we abort" behavior.  We need to guarantee that the
old master and the new master are fenced off from each other to make sure
they don't split-brain.

Off-list, you suggested WAL fencing, and hbck-master sketched out a
lease-timeout mechanism that provides isolation. Correctness in the face
of hangs and partitions is my biggest concern, and a lot of the questions
I'm asking focus on these scenarios.

Can you point me to where I can read about and consider the WAL fencing?
Do you have thoughts on, or a comparison of, these approaches?  Are there
other techniques?
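
For reference, here's roughly what I mean by the lease mechanism (a sketch
under my own assumptions, not lifted verbatim from the hbck-master doc): an
RS may serve a region only while its lease is unexpired, and the master
waits out the lease before reassigning.

// Sketch of lease-based fencing. A hung RS stops serving once its lease
// lapses, even though the process is still alive.
final class RegionLease {
  private final long expiresAtMillis;

  RegionLease(long grantedAtMillis, long leaseMillis) {
    this.expiresAtMillis = grantedAtMillis + leaseMillis;
  }

  // RS side: refuse reads/writes once the lease has lapsed.
  boolean mayServe(long nowMillis) {
    return nowMillis < expiresAtMillis;
  }

  // Master side: reassign only after the lease has certainly expired,
  // with slack for clock skew between master and RS.
  boolean safeToReassign(long nowMillis, long maxClockSkewMillis) {
    return nowMillis > expiresAtMillis + maxClockSkewMillis;
  }
}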


> When RS contacts master to do state update, it errs on the side of caution
> - no state update, no open region (or split).
>

not sure what "it" in the second part of the sentence means here -- is
this the master or the RS?  If it is the master, I think this is what
hbck-master calls an "actual state update".

If "it" means the RS's, I'm confused.


> Thus, except for the case of multiple masters running, we can always
> assume RS didn't online the region if we don't know about it.
>

I don't think that is a safe assumption -- I still think double assignment
can and will happen.  If an RS hangs long enough for the master to think it
is dead, the master would reassign.  If that RS came back, it could still
have the region open and be serving clients that cached the old location.

> Then, for messages to RS, see "Note on messages"; they are idempotent so
> they can always be resent.
>
>
> 1) State update coordination. What is a "state update from the outside"?
>> Do RS's initiate splitting on their own? Maybe a picture would help so we
>> can figure out if it is similar to or different from hbck-master's?
>
>
> Yes, these are RS messages. They are mentioned in some operation
> descriptions in part 2 - opening->opened, closing->closed; splitting, etc.
>
> [will revisit when part 2 posted]


> 2) Single point of truth. hbck-master tries to define what single point of
>> truth means by defining intended, current, and actual state data with
>> durability properties on each kind. What do clients look at, and who
>> modifies what?
>
>
> Sorry, don't understand the question. I mean single source of truth mainly
> about what is going on with the region; it is described in design
> considerations.
> I like the idea of "intended state", however without more detailed reading
> I am not sure how it works for multiple ops e.g. master recovering the
> region while the user intends to split it, so the split should be executed
> after it's opened.
>
> I'm suggesting there are multiple versions of truth since we have
multiple actors.  What the master thinks and knows is one kind
(hbck-master: currentState), what the master wants is another (hbck-master:
intendedState), and what the regionservers think is another (hbck-master:
actualState).  The master doesn't know exactly what is really going on (it
could be partitioned from some of the RS's that are still actively serving
clients!).

Rephrase of question:  "What do clients look at, and who modifies the
truth?".


>
> 3) Table record: "if regions is out of date, it should be closed and
>> reopened". It is not clear in master5 how regionservers find out that they
>> are out of date. Moreover, how do clients talking to RS's with stale
>> versions know they are going to the correct RS, especially in the face
>> of RS failures due to timeout?
>
>
> On alter (and on startup, if it failed), master tries to reopen all
> regions that are out of date.
> Regions that are not open will either pick up the new version when they
> are opened, or (e.g. if they are now Opening with the old version) master
> discovers they are out of date when they are transitioned to Opened by
> RS, and reopens them again.
>
> I buy this part of the versioning scheme (though this is another place
susceptible to the hang/double-assign scenario I described a few questions
earlier).


> As for any case of alter on an enabled table, there are no guarantees
> for clients.
> To provide these w/o disable/enable (or the logical equivalent of
> coordinating all close-s and open-s), one would need some form of
> version-time-travel, or waiting for versions, or both.
>

By version time-travel you mean something like this, right?
client gets from region R1 with version 1.
client gets from region R2 with version 1.
alter initiated.
R1 updated to v2.
client gets from R1 with version 2.
client gets from R2 with version 1 (version time travel here).
R2 updated to V2.

I think we'd want to wait instead of allowing version time-travel, and pass
version info to the client that it then needs to pass along on RS requests.
This would prevent version time-travel, which could potentially lead to
data-hazard situations.  (Or maybe online alter should not be allowed to
add/remove CFs -- only to modify their settings.)
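
A sketch of the version-threading idea (the mechanism is my assumption; it
isn't in either doc as written): the client carries the highest schema
version it has observed, and an RS still serving an older version rejects
the request instead of answering, forcing a retry after the reopen.

// Hypothetical check on the RS request path.
final class SchemaVersionCheck {
  static void check(long clientSeenVersion, long rsSchemaVersion) {
    if (rsSchemaVersion < clientSeenVersion) {
      throw new IllegalStateException("RS schema v" + rsSchemaVersion
          + " is older than client-observed v" + clientSeenVersion
          + "; retry after the region reopens");
    }
  }
}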



> 4) region record: transition states. This is really similar to
>> hbck-masters current state and intended state. Shouldn't be defined as part
>> of the region record?
>
>
> I mention somewhere that could be done. One thing is that if several paths
> are possible between states, it's useful to know which is taken.
> But do note that I store user intent separately from what is currently
> going on, so they are not exactly similar as far as I see.
>
> [I'll wait for the next master5 doc to consider the paths you mention.]

I think that user intent and system intent can probably be stored in the
same place with just a flag to differentiate (one that probably gives user
intent priority).


>
> 5) Note on user operations: the forgetting thing is scary to me – in your
>> move split example, what happens if an RS reads state that is forgotten?
>
>
> I think my description of this might be too vague. State is not forgotten;
> previous intent is forgotten. I.e. if user does several operations in order
> that conflict (e.g. split and then merge), the first one will be canceled
> (safely ).
> Also, RS does not read state as a guideline to what needs to be done.
>
> Got it.  I think this is similar in hbck-master.  In hbck-master
parlance, the intended state may be updated multiple times by the user.
Instead of canceling, however, hbck-master would figure out how to recover
to the latest intended state.  How this is done in hbck-master is generic,
but it's a little fuzzy on the details currently (where is it an FSM and
where is it a push-down automaton (PDA)).


> 6) table state machine. how do we guarantee clients are writing from the
>> correct version in the face of failures?
>
>
> Can you please elaborate?
>
> This is the hang/double assign scenario again.


>
> 7) region state machine. An earlier draft had splitting and merge cases.
>> Are they elided in master5, or are they no longer present? How would
>> this get extended to handle jeffrey's distributed log replay/fast write
>> recovery feature?
>
>
> As I mention somewhere, these could be separate states. I was kind of
> afraid of blowing up the state machine too much; I noticed that for
> split/merge you store siblings/children anyway, so you can recognize
> them, and for most purposes the different split/merge states are the same
> as Opened and Closed.
>
> I will add those back; it would make sense.
>
> There are properties that lead master5 to consider the two to be the
same state -- what are these invariants?  I felt they were different
because they perform a different set of IO operations.


>  8) logical interactions: sounds like master5 allows concurrent region
>> and table operations. hbck-master (though not fully documented) only
>> allows certain region transitions when the table is enabled or if the
>> table is disabled. Are we sure we don't get into race conditions? What
>> happens if a disable gets issued? It's possible for someone to reopen
>> the region and for old clients to continue writing to it even though it
>> is closed.
>
>
> Yes, parallelism is intended. You can never be sure you have no races,
> but we should aim for it.
>
>
You can prove that you won't be affected by races and can eliminate cases
where you are.

Let me rephrase as a few scenarios:
* what happens if multiple operations are issued to a specific region
concurrently (let's say split and merge involving the split target, or
maybe something simpler)? How does master5 resolve this?
* table disable is issued concurrently with a region split.  what are the
possible outcomes? what should happen?

master5 is missing the disabled/enabled check; that is a mistake.
>
> Part1 operation interactions already cover it:
>
> table disable doesn't ack until all regions are closed (master5 is wrong).
>

[ah, the shorter doc is updated from master5.. I'll wait for the new one]

region opening cannot start if table is already disabling or disabled.
>
good

> if region is already opening when disable is issued, the opening will be
> opportunistically canceled.
>
> if disable fails to cancel the opening, or the server opens it first in a
> race, the region will be opened, and master will issue a close
> immediately after the state update. Given that the region is not closed,
> disable is not complete.
> if opening (or closing) times out, master will fence off the RS and mark
> the region as closed. If there were some way of fencing the region
> separately (ZK lease?), it would be possible to use that.
>
> Interesting... why not just let it finish the open on the RS, but not
let the client see that the region is assigned there, and then close?
(This is why I'm trying to separate what clients see from what the master
sees.)

I think hbck-master would just update intended state and then separately
"recover" the problem if the region showed up open in actual state.

> In any case, until the client checks table state before every write,
> there's no easy way to prevent writes on a disabling table. Writes on a
> disabled table will not be possible.
>
> I'm thinking we can stop the writes to all regions of a table on an RS
once the RS knows the table is disabling.  This could be done with the
version-number-threading idea or maybe some other fencing mechanism.
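
Something like this on the RS write path is what I have in mind (a sketch;
how the RS learns of the state change is left open):

// Sketch: once an RS hears the table is disabling, it fails writes for
// every region of that table it still hosts.
enum TableState { ENABLED, DISABLING, DISABLED }

final class TableWriteGate {
  private volatile TableState state = TableState.ENABLED;

  void onTableStateChange(TableState newState) {
    state = newState;
  }

  void checkWriteAllowed() {
    if (state != TableState.ENABLED) {
      throw new IllegalStateException(
          "table is " + state + "; rejecting write");
    }
  }
}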


> On ensuring there's no double assignment due to RS hanging:
> The intent is to fence the WAL for the region server, the way we do now.
> One could also use another mechanism.
>

[I need a pointer to this]
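
My rough reading of "fence the WAL" is the following (a concept sketch
only -- hence the request for a pointer):

// Sketch: recovery makes the old RS's WAL unwritable before reassigning,
// so a merely-hung RS can no longer ack writes to clients.
interface WalFence {
  // After this returns, appends from the fenced RS must fail, e.g. by
  // recovering the file lease and/or moving the RS's WAL files aside.
  void fence(String regionServerName) throws java.io.IOException;
}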


> Perhaps I could specify it more clearly; I think the problem of making
> sure RS is dead is nearly orthogonal.
> In my model, due to how opening region is committed to opened, we can only
> be unsure when the region is in Opened state (or similar states such as
> Splitting which are not present in my current version, but will be added).
> In that case, in absence of normal transition, we cannot do literally
> anything with the region unless we are sufficiently sure that RS is
> sufficiently dead (e.g. cannot write).
> So, while we ensure that RS is dead we don't reassign.
> My document implies (but doesn't elaborate; I'll fix that) that master
> does the direct Opened->Closed transition only when that is true.
> A state called "MaybeOpened" could be added. Let me add it...
>
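
To make sure I follow, the rule would be something like this (the state
name is from your note; the guard is my paraphrase):

// Sketch: MaybeOpened marks "an open was in flight, outcome unknown".
// A forced transition to Closed is legal only once the hosting RS is
// known to be fenced; otherwise we risk closing in meta while the region
// still serves clients that cached the old location.
enum RegionFsmState {
  CLOSED, OPENING, MAYBE_OPENED, OPENED;

  boolean mayForceClose(boolean rsFenced) {
    return (this == OPENED || this == MAYBE_OPENED) && rsFenced;
  }
}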
> looking forward to the update.


> --
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // jon@cloudera.com
>
>



-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com

Re: [hbase-5487] Master design discussion.

Posted by Jonathan Hsieh <jo...@cloudera.com>.
Here are sergey's replies from jira (modulo some reformatting for email).

----
Answers lifted from email also (some fixes + one answer was modified due to
clarification here).

What is a failure and how do you react to failures? I think the master5
> design needs to spend more effort considering failure and recovery cases.
> I claim there are 4 types of responses from a networked IO operation: the
> two states we normally deal with, ack success and ack failure (nack),
> plus unknown due to a timeout that succeeded (timeout success) and
> unknown due to a timeout that failed (timeout failure). We have
> historically missed the last two cases and they aren't considered in the
> master5 design.


There are a few considerations. Let me examine if there are other cases
than these.
I am assuming the collocated table, which should reduce such cases for
state (probably, if collocated table cannot be written reliably, master
must stop-the-world and fail over).
When RS contacts master to do state update, it errs on the side of caution
- no state update, no open region (or split).
Thus, except for the case of multiple masters running, we can always assume
RS didn't online the region if we don't know about it.
Then, for messages to RS, see "Note on messages"; they are idempotent so
they can always be resent.

1) State update coordination. What is a "state update from the outside"?
> Do RS's initiate splitting on their own? Maybe a picture would help so we
> can figure out if it is similar to or different from hbck-master's?


Yes, these are RS messages. They are mentioned in some operation
descriptions in part 2 - opening->opened, closing->closed; splitting, etc.

2) Single point of truth. hbck-master tries to define what single point of
> truth means by defining intended, current, and actual state data with
> durability properties on each kind. What do clients look at, and who
> modifies what?


Sorry, don't understand the question. I mean single source of truth mainly
about what is going on with the region; it is described in design
considerations.
I like the idea of "intended state", however without more detailed reading
I am not sure how it works for multiple ops e.g. master recovering the
region while the user intends to split it, so the split should be executed
after it's opened.

3) Table record: "if regions is out of date, it should be closed and
> reopened". It is not clear in master5 how regionservers find out that they
> are out of date. Moreover, how do clients talking to RS's with stale
> versions know they are going to the correct RS, especially in the face of
> RS failures due to timeout?


On alter (and on startup, if it failed), master tries to reopen all regions
that are out of date.
Regions that are not open will either pick up the new version when they are
opened, or (e.g. if they are now Opening with the old version) master
discovers they are out of date when they are transitioned to Opened by RS,
and reopens them again.

As for any case of alter on an enabled table, there are no guarantees for
clients.
To provide these w/o disable/enable (or the logical equivalent of
coordinating all close-s and open-s), one would need some form of
version-time-travel, or waiting for versions, or both.

4) region record: transition states. This is really similar to hbck-masters
> current state and intended state. Shouldn't be defined as part of the
> region record?


I mention somewhere that could be done. One thing is that if several paths
are possible between states, it's useful to know which is taken.
But do note that I store user intent separately from what is currently
going on, so they are not exactly similar as far as I see.

5) Note on user operations: the forgetting thing is scary to me – in your
> move split example, what happens if an RS reads state that is forgotten?


I think my description of this might be too vague. State is not forgotten;
previous intent is forgotten. I.e. if user does several operations in order
that conflict (e.g. split and then merge), the first one will be canceled
(safely ).
Also, RS does not read state as a guideline to what needs to be done.

6) table state machine. how do we guarantee clients are writing from the
> correct version in the face of failures?


Can you please elaborate?

7) region state machine. An earlier draft had splitting and merge cases.
> Are they elided in master5, or are they no longer present? How would this
> get extended to handle jeffrey's distributed log replay/fast write
> recovery feature?


As I mention somewhere, these could be separate states. I was kind of
afraid of blowing up the state machine too much; I noticed that for
split/merge you store siblings/children anyway, so you can recognize them,
and for most purposes the different split/merge states are the same as
Opened and Closed.

I will add those back; it would make sense.

8) logical interactions: sounds like master5 allows concurrent region and
> table operations. hbck-master (though not fully documented) only allows
> certain region transitions when the table is enabled or if the table is
> disabled. Are we sure we don't get into race conditions? What happens if
> a disable gets issued? It's possible for someone to reopen the region and
> for old clients to continue writing to it even though it is closed.


Yes, parallelism is intended. You can never be sure you have no races, but
we should aim for it.

master5 is missing the disabled/enabled check; that is a mistake.

Part1 operation interactions already cover it:

table disable doesn't ack until all regions are closed (master5 is wrong).
region opening cannot start if table is already disabling or disabled.
if region is already opening when disable is issued, the opening will be
opportunistically canceled.
if disable fails to cancel the opening, or the server opens it first in a
race, the region will be opened, and master will issue a close immediately
after the state update. Given that the region is not closed, disable is not
complete.
if opening (or closing) times out, master will fence off the RS and mark
the region as closed. If there were some way of fencing the region
separately (ZK lease?), it would be possible to use that.

In any case, until the client checks table state before every write,
there's no easy way to prevent writes on a disabling table. Writes on a
disabled table will not be possible.

On ensuring there's no double assignment due to RS hanging:
The intent is to fence the WAL for the region server, the way we do now.
One could also use another mechanism.
Perhaps I could specify it more clearly; I think the problem of making sure
RS is dead is nearly orthogonal.
In my model, due to how opening region is committed to opened, we can only
be unsure when the region is in Opened state (or similar states such as
Splitting which are not present in my current version, but will be added).
In that case, in absence of normal transition, we cannot do literally
anything with the region unless we are sufficiently sure that RS is
sufficiently dead (e.g. cannot write).
So, while we ensure that RS is dead we don't reassign.
My document implies (but doesn't elaborate; I'll fix that) that master
does the direct Opened->Closed transition only when that is true.
A state called "MaybeOpened" could be added. Let me add it...

-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com