Posted to dev@hbase.apache.org by Jonathan Hsieh <jo...@cloudera.com> on 2012/09/05 21:38:31 UTC

Hbase Assignments in trunk.

I generally think in pictures, so I've mapped out the single Assignment
control flow as found in trunk yesterday in terms of threads and network
communications (each of which can possibly fail).  It is a process that has
18 or so network communications, 3 processes, and about 8 threads
coordinating (excluding meta writes).

I wanted to put this out because we've had some discussions about
simplifying it or making it more accessible so we can comfortably assess
patches and possibly use it as a rough design doc or a counter to new
potential strawman designs.  For me at least it would be useful when
reviewing patches in this area.

We've also talked about defining design and code invariants -- here's the
one that I've gotten so far:  (We can pull up more from discussion)

* ZK state should be transient (treat it like memory). If deleted, hbase
should be able to recover and essentially be in the same state (a few
exceptions -- enabled/disabled state)

A few questions I have from this exercise:

1) Why do we have ZK asynchronously update the HM?  (why not do it
synchronously?)
2) Why do we have the RS update ZK as it opens -- why not have the HM
manage all ZK comms and not have the RS talk directly to ZK in this
process?  Then ZK is just for failover and less so for coordination.
3) Clients who issue assign calls are partially asynchronous and partially
synchronous.  Why not go all the way?
4) Why are there multiple error conventions -- abort, FAILED_OPEN, throwing
exception, (and cases where we "return" silently without notification)?
5) How do we handle timeout situations -- IMO it makes sense to have a
rollback or fail forward policy for different places on the timeline.
6) Can we use cancellation instead of checking for
enabling/disabled/disabling/shutdown/stopping all over the place? (let's
say these cluster ops would cancel the assign and then win by blocking
assigns).
7) In-memory state has different but similarly named states in the HM, ZK,
and in the RS's.  And there are transition events that could be missed.
8) Is having multiple processes "responsible for acting" necessary?  (why
not have the HM open and then update meta)?

Thoughts? (and corrections please!)

Jon.
-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com

Re: Hbase Assignments in trunk.

Posted by n keywal <nk...@gmail.com>.

Region assignment in ZK could be interesting, plus having the regionserver
state available. This would require some work in ZK I fear (ZOOKEEPER-1147).

However, persisting data in ZK is dangerous: it leaves the cluster state
stored in two places, which makes the whole thing complicated to
manage (I'm thinking about snapshots for example). It should be possible to
restart the cluster with an empty ZK, with HBase/HDFS as the single
persistent store.

And making 3.4+ mandatory for 0.98 seems a good thing to do as well :-).

On Tue, Sep 11, 2012 at 4:45 AM, Enis Söztutar <en...@hortonworks.com> wrote:

> +1 on rethinking the assignment + splitting code paths, and using zk as a
> transactional database. Just my 2 cents w/o spending a lot of time on the
> details, but maybe we should stop keeping master and RS in memory metadata,
> but keep region-assignments in zk, and HM and RS just keep a consistent
> in-memory cache.
>
> Enis
>
> On Mon, Sep 10, 2012 at 3:29 PM, lars hofhansl <lh...@yahoo.com>
> wrote:
>
> > I've been saying a while ago that we should require ZK 3.4.x for 0.96+.
> >
> > Distributed consensus without a "transaction" option always rang a bit
> > weird to me.
> >
> > Maybe switch in 0.98+?
> >
> > -- Lars
> >
> >
> > ----- Original Message -----
> > From: n keywal <nk...@gmail.com>
> > To: dev@hbase.apache.org
> > Cc:
> > Sent: Thursday, September 6, 2012 12:53 AM
> > Subject: Re: Hbase Assignments in trunk.
> >
> > On the Async vs. sync: there are 3 different ways to write multiple
> znodes
> > in ZK, and huge differences in the performances between them:
> >
> > 1) for loop sync
> > 2) for loop async
> > 3) multi
> >
> > Async will be 20 to 100 times faster than sync. multi will be 2 to 4
> times
> > faster than async (that is, 80 to 400 times faster than sync).
> >
> > Multi was not available before ZK 3.4. It has several obvious advantages
> > over async imho: it's faster, it's synchronous and it's a transaction.
> That
> > simplifies the user code usually.
> >
> > It has other advantages:
> > - async and sync will typically send 1 or more packet per znode (naggle
> is
> > not activated iirc), while there will be only a few packets for all the
> > znodes with multi
> > - you can expect async to write multiple times on the disk, while multi
> > should write only once. This is usually better for i/o performances.
> >
> > On a serious recovery situation, with all the regions moving all other
> the
> > place, saving disk and network i/o for ZooKeeper is important.
> >
> > Disadvantage: it's new.
> >
> > On Thu, Sep 6, 2012 at 7:49 AM, Stack <st...@duboce.net> wrote:
> >
> > > On Wed, Sep 5, 2012 at 5:17 PM, Jonathan Hsieh <jo...@cloudera.com>
> wrote:
> > > > Here's a link to the pdf/picture.
> > > >
> > > > http://people.apache.org/~jmhsieh/hbase/120905-hbase-assignment.pdf
> > > >
> > >
> > > Pretty picture.  Not a pretty story.
> > >
> > > What you thinking?
> > >
> > > St.Ack
> > >
> >
> >
>

Re: Hbase Assignments in trunk.

Posted by Enis Söztutar <en...@hortonworks.com>.

+1 on rethinking the assignment + splitting code paths, and using zk as a
transactional database. Just my 2 cents w/o spending a lot of time on the
details, but maybe we should stop keeping master and RS in memory metadata,
but keep region-assignments in zk, and HM and RS just keep a consistent
in-memory cache.

Enis

On Mon, Sep 10, 2012 at 3:29 PM, lars hofhansl <lh...@yahoo.com> wrote:

> I've been saying a while ago that we should require ZK 3.4.x for 0.96+.
>
> Distributed consensus without a "transaction" option always rang a bit
> weird to me.
>
> Maybe switch in 0.98+?
>
> -- Lars
>
>
> ----- Original Message -----
> From: n keywal <nk...@gmail.com>
> To: dev@hbase.apache.org
> Cc:
> Sent: Thursday, September 6, 2012 12:53 AM
> Subject: Re: Hbase Assignments in trunk.
>
> On the Async vs. sync: there are 3 different ways to write multiple znodes
> in ZK, and huge differences in the performances between them:
>
> 1) for loop sync
> 2) for loop async
> 3) multi
>
> Async will be 20 to 100 times faster than sync. multi will be 2 to 4 times
> faster than async (that is, 80 to 400 times faster than sync).
>
> Multi was not available before ZK 3.4. It has several obvious advantages
> over async imho: it's faster, it's synchronous and it's a transaction. That
> simplifies the user code usually.
>
> It has other advantages:
> - async and sync will typically send 1 or more packet per znode (naggle is
> not activated iirc), while there will be only a few packets for all the
> znodes with multi
> - you can expect async to write multiple times on the disk, while multi
> should write only once. This is usually better for i/o performances.
>
> On a serious recovery situation, with all the regions moving all other the
> place, saving disk and network i/o for ZooKeeper is important.
>
> Disadvantage: it's new.
>
> On Thu, Sep 6, 2012 at 7:49 AM, Stack <st...@duboce.net> wrote:
>
> > On Wed, Sep 5, 2012 at 5:17 PM, Jonathan Hsieh <jo...@cloudera.com> wrote:
> > > Here's a link to the pdf/picture.
> > >
> > > http://people.apache.org/~jmhsieh/hbase/120905-hbase-assignment.pdf
> > >
> >
> > Pretty picture.  Not a pretty story.
> >
> > What you thinking?
> >
> > St.Ack
> >
>
>

Re: Hbase Assignments in trunk.

Posted by lars hofhansl <lh...@yahoo.com>.

I've been saying for a while that we should require ZK 3.4.x for 0.96+.

Distributed consensus without a "transaction" option always rang a bit weird to me.

Maybe switch in 0.98+?

-- Lars

----- Original Message -----
From: n keywal <nk...@gmail.com>
To: dev@hbase.apache.org
Cc: 
Sent: Thursday, September 6, 2012 12:53 AM
Subject: Re: Hbase Assignments in trunk.

On the Async vs. sync: there are 3 different ways to write multiple znodes
in ZK, and huge differences in the performances between them:

1) for loop sync
2) for loop async
3) multi

Async will be 20 to 100 times faster than sync. multi will be 2 to 4 times
faster than async (that is, 80 to 400 times faster than sync).

Multi was not available before ZK 3.4. It has several obvious advantages
over async imho: it's faster, it's synchronous and it's a transaction. That
simplifies the user code usually.

It has other advantages:
- async and sync will typically send 1 or more packet per znode (naggle is
not activated iirc), while there will be only a few packets for all the
znodes with multi
- you can expect async to write multiple times on the disk, while multi
should write only once. This is usually better for i/o performances.

On a serious recovery situation, with all the regions moving all other the
place, saving disk and network i/o for ZooKeeper is important.

Disadvantage: it's new.

On Thu, Sep 6, 2012 at 7:49 AM, Stack <st...@duboce.net> wrote:

> On Wed, Sep 5, 2012 at 5:17 PM, Jonathan Hsieh <jo...@cloudera.com> wrote:
> > Here's a link to the pdf/picture.
> >
> > http://people.apache.org/~jmhsieh/hbase/120905-hbase-assignment.pdf
> >
>
> Pretty picture.  Not a pretty story.
>
> What you thinking?
>
> St.Ack
>

Re: Hbase Assignments in trunk.

Posted by Dave Wang <ds...@cloudera.com>.

There's a discussion on the ZK mailing list about releasing ZK 3.4.4, which
will have multi and some other fixes.  Once that is out, we can move to
that on trunk.  That will also help with one of the replication patches
that Himanshu currently has pending, which relies on multi.

- Dave

On Thu, Sep 6, 2012 at 3:20 AM, Jonathan Hsieh <jo...@cloudera.com> wrote:

> IMO, moving to new ZK seems to  makes sense for HBase trunk.
>
> Jon.
>
> On Thu, Sep 6, 2012 at 12:53 AM, n keywal <nk...@gmail.com> wrote:
>
> > On the Async vs. sync: there are 3 different ways to write multiple
> znodes
> > in ZK, and huge differences in the performances between them:
> >
> > 1) for loop sync
> > 2) for loop async
> > 3) multi
> >
> > Async will be 20 to 100 times faster than sync. multi will be 2 to 4
> times
> > faster than async (that is, 80 to 400 times faster than sync).
> >
> > Multi was not available before ZK 3.4. It has several obvious advantages
> > over async imho: it's faster, it's synchronous and it's a transaction.
> That
> > simplifies the user code usually.
> >
> > It has other advantages:
> >  - async and sync will typically send 1 or more packet per znode (naggle
> is
> > not activated iirc), while there will be only a few packets for all the
> > znodes with multi
> >  - you can expect async to write multiple times on the disk, while multi
> > should write only once. This is usually better for i/o performances.
> >
> > On a serious recovery situation, with all the regions moving all other
> the
> > place, saving disk and network i/o for ZooKeeper is important.
> >
> > Disadvantage: it's new.
> >
> > On Thu, Sep 6, 2012 at 7:49 AM, Stack <st...@duboce.net> wrote:
> >
> > > On Wed, Sep 5, 2012 at 5:17 PM, Jonathan Hsieh <jo...@cloudera.com>
> wrote:
> > > > Here's a link to the pdf/picture.
> > > >
> > > > http://people.apache.org/~jmhsieh/hbase/120905-hbase-assignment.pdf
> > > >
> > >
> > > Pretty picture.  Not a pretty story.
> > >
> > > What you thinking?
> > >
> > > St.Ack
> > >
> >
>
>
>
> --
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // jon@cloudera.com
>

Re: Hbase Assignments in trunk.

Posted by Jonathan Hsieh <jo...@cloudera.com>.

IMO, moving to the new ZK seems to make sense for HBase trunk.

Jon.

On Thu, Sep 6, 2012 at 12:53 AM, n keywal <nk...@gmail.com> wrote:

> On the Async vs. sync: there are 3 different ways to write multiple znodes
> in ZK, and huge differences in the performances between them:
>
> 1) for loop sync
> 2) for loop async
> 3) multi
>
> Async will be 20 to 100 times faster than sync. multi will be 2 to 4 times
> faster than async (that is, 80 to 400 times faster than sync).
>
> Multi was not available before ZK 3.4. It has several obvious advantages
> over async imho: it's faster, it's synchronous and it's a transaction. That
> simplifies the user code usually.
>
> It has other advantages:
>  - async and sync will typically send 1 or more packet per znode (naggle is
> not activated iirc), while there will be only a few packets for all the
> znodes with multi
>  - you can expect async to write multiple times on the disk, while multi
> should write only once. This is usually better for i/o performances.
>
> On a serious recovery situation, with all the regions moving all other the
> place, saving disk and network i/o for ZooKeeper is important.
>
> Disadvantage: it's new.
>
> On Thu, Sep 6, 2012 at 7:49 AM, Stack <st...@duboce.net> wrote:
>
> > On Wed, Sep 5, 2012 at 5:17 PM, Jonathan Hsieh <jo...@cloudera.com> wrote:
> > > Here's a link to the pdf/picture.
> > >
> > > http://people.apache.org/~jmhsieh/hbase/120905-hbase-assignment.pdf
> > >
> >
> > Pretty picture.  Not a pretty story.
> >
> > What you thinking?
> >
> > St.Ack
> >
>



-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com

Re: Hbase Assignments in trunk.

Posted by n keywal <nk...@gmail.com>.

On the async vs. sync question: there are 3 different ways to write multiple
znodes in ZK, and huge performance differences between them:

1) for loop sync
2) for loop async
3) multi

Async will be 20 to 100 times faster than sync. Multi will be 2 to 4 times
faster than async (that is, 40 to 400 times faster than sync).

Multi was not available before ZK 3.4. It has several obvious advantages
over async imho: it's faster, it's synchronous and it's a transaction. That
simplifies the user code usually.
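To illustrate the "it's a transaction" point, here is a minimal sketch that contrasts per-znode writes in a loop with an all-or-nothing batch. It uses a toy in-memory store, not the ZooKeeper client API; the class name, the helper function, and the /region/... paths are all hypothetical and chosen only for illustration.

```python
class InMemoryStore:
    """Toy stand-in for a znode tree; NOT the ZooKeeper client API."""

    def __init__(self):
        self.nodes = {}

    def create(self, path, data):
        # A single create fails if the node already exists.
        if path in self.nodes:
            raise KeyError(f"node already exists: {path}")
        self.nodes[path] = data

    def multi(self, ops):
        """Apply every (path, data) create, or none of them."""
        staged = dict(self.nodes)  # work on a copy of the current state
        for path, data in ops:
            if path in staged:
                raise KeyError(f"node already exists: {path}")
            staged[path] = data
        self.nodes = staged  # commit only after every op succeeded


def create_in_loop(store, ops):
    """Per-znode writes: a failure midway leaves earlier writes in place."""
    for path, data in ops:
        store.create(path, data)


# The third op collides with the first one, so the batch cannot fully succeed.
ops = [
    ("/region/a", b"OFFLINE"),
    ("/region/b", b"OFFLINE"),
    ("/region/a", b"OFFLINE"),
]

loop_store = InMemoryStore()
try:
    create_in_loop(loop_store, ops)
except KeyError:
    pass  # the caller must now clean up a partial result

multi_store = InMemoryStore()
try:
    multi_store.multi(ops)
except KeyError:
    pass  # nothing was applied, so there is nothing to clean up
```

With the loop, the caller is left holding two of the three nodes and has to decide how to recover; with the batch, the store is untouched. That is the sense in which a transaction usually simplifies the user code.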

It has other advantages:
 - async and sync will typically send 1 or more packets per znode (Nagle is
not activated iirc), while there will be only a few packets for all the
znodes with multi
 - you can expect async to write multiple times on the disk, while multi
should write only once. This is usually better for i/o performance.

In a serious recovery situation, with all the regions moving all over the
place, saving disk and network i/o for ZooKeeper is important.

Disadvantage: it's new.

On Thu, Sep 6, 2012 at 7:49 AM, Stack <st...@duboce.net> wrote:

> On Wed, Sep 5, 2012 at 5:17 PM, Jonathan Hsieh <jo...@cloudera.com> wrote:
> > Here's a link to the pdf/picture.
> >
> > http://people.apache.org/~jmhsieh/hbase/120905-hbase-assignment.pdf
> >
>
> Pretty picture.  Not a pretty story.
>
> What you thinking?
>
> St.Ack
>

Re: Hbase Assignments in trunk.

Posted by Stack <st...@duboce.net>.

On Wed, Sep 5, 2012 at 5:17 PM, Jonathan Hsieh <jo...@cloudera.com> wrote:
> Here's a link to the pdf/picture.
>
> http://people.apache.org/~jmhsieh/hbase/120905-hbase-assignment.pdf
>

Pretty picture.  Not a pretty story.

What you thinking?

St.Ack

Re: Hbase Assignments in trunk.

Posted by Jonathan Hsieh <jo...@cloudera.com>.

Here's a link to the pdf/picture.

http://people.apache.org/~jmhsieh/hbase/120905-hbase-assignment.pdf

Jon.

On Wed, Sep 5, 2012 at 5:07 PM, Jonathan Hsieh <jo...@cloudera.com> wrote:

>
>
> On Wed, Sep 5, 2012 at 4:08 PM, Stack <st...@duboce.net> wrote:
>
>> On Wed, Sep 5, 2012 at 12:38 PM, Jonathan Hsieh <jo...@cloudera.com> wrote:
>> > I generally think in pictures, so I've mapped out the single Assignment
>> > control flow as found in trunk yesterday in terms of threads and network
>> > communications (each of which can possibly fail).  It is a process that
>> has
>> > 18 or so network communications, 3 processes, and about 8 threads
>> > coordinating (excluding meta writes)
>> >
>>
>> Did you attach your picture Jon?
>>
>
> I attached a 571k pdf.  If it didn't get through, I'll post it somewhere
> so folks can see it.
>
>
> --
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // jon@cloudera.com
>
>
>


-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com

Re: Hbase Assignments in trunk.

Posted by Jonathan Hsieh <jo...@cloudera.com>.

On Wed, Sep 5, 2012 at 4:08 PM, Stack <st...@duboce.net> wrote:

> On Wed, Sep 5, 2012 at 12:38 PM, Jonathan Hsieh <jo...@cloudera.com> wrote:
> > I generally think in pictures, so I've mapped out the single Assignment
> > control flow as found in trunk yesterday in terms of threads and network
> > communications (each of which can possibly fail).  It is a process that
> has
> > 18 or so network communications, 3 processes, and about 8 threads
> > coordinating (excluding meta writes)
> >
>
> Did you attach your picture Jon?
>

I attached a 571k pdf.  If it didn't get through, I'll post it somewhere so
folks can see it.


-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com

Re: Hbase Assignments in trunk.

Posted by Stack <st...@duboce.net>.

On Thu, Sep 6, 2012 at 3:16 AM, Jonathan Hsieh <jo...@cloudera.com> wrote:
> On Wed, Sep 5, 2012 at 4:08 PM, Stack <st...@duboce.net> wrote:
>
>> On Wed, Sep 5, 2012 at 12:38 PM, Jonathan Hsieh <jo...@cloudera.com> wrote:
...
>> We should post these invariants somewhere?  In dev section of refguide?
>>
> We should definitely put this in the javadoc.  Maybe we should have a
> dev-guide section of the ref-guide where some of these things are also
> captured?
>

I added an invariants section to the developer pages.  I used your
wording of the zk data axiom above.

(What other invariants do we have?)

> On a code craft point of view, failure behavior is buried deeply and could
> be pulled out to the process methods of the handlers.  In many cases, it
> isn't easy to figure out why one behavior is chosen vs others.
>

Nod.

> I'm also suggesting that we could avoid using ZK event callbacks like the
> OPENING and OPENED zk transition and instead have the master would manage
> those.  We'd have an opening RS would tickle some other znode to show
> progress.   At least then RegionState would be closer to accurate, and the
> HM would go through all state transitions.
>

Perhaps.

I would look at any prospective design to see if I could see holes
where master and regionserver might diverge in terms of what they
think a particular region's state is at any one time (Up to this,
they've done it via the znode proxy that one or the other purportedly
owns outright at any time; there is even some facility for progressing
in the face of missed callbacks though for sure we are now into a gray
area).

St.Ack

Re: Hbase Assignments in trunk.

Posted by Jonathan Hsieh <jo...@cloudera.com>.

On Wed, Sep 5, 2012 at 4:08 PM, Stack <st...@duboce.net> wrote:

> On Wed, Sep 5, 2012 at 12:38 PM, Jonathan Hsieh <jo...@cloudera.com> wrote:
>
...

> > We've also talked about defining design and code invariants -- here's the
> > one that I've gotten so far:  (We can pull up more from discussion)
> >
> > * ZK state should transient (treat it like memory). If deleted, hbase
> should
> > be able to recover and essentially be in the same state (a few
> exceptions --
> > enabled/disable state)
> >
>
> Yes.
>
> We should post these invariants somewhere?  In dev section of refguide?
>
We should definitely put this in the javadoc.  Maybe we should have a
dev-guide section of the ref-guide where some of these things are also
captured?


> > 4) Why are there multiple error conventions -- abort, FAILED_OPEN,
> throwing
> > exception, (and cases where we "return" silently without notification)?
>
> I would have to look at the particular instance but high level I'd say
> its a case of:
>
> 1. On the one hand your classic myopic patch-centric view
> 2. While on the other, you can't throw an exception out to the master
> if the rpc open has been successfully handed off and the rpc has
> completed... there needs to be another means flagging error.
>
From a code craft point of view, failure behavior is buried deeply and could
be pulled out to the process methods of the handlers.  In many cases, it
isn't easy to figure out why one behavior is chosen vs others.


> > 5) How do we handle timeout situations -- IMO it makes sense to have a
> > rollback or fail forward policy for different places on the timeline.
>
> Yes.  There are a couple of flavors of this in the code base at
> present.  Could do w/ a revisit for sure.
>
This is more a question -- I'm not familiar with the details of rpc
timeouts currently.


> > 6) Can we use cancellation instead of checking for
> > enabling/disabled/disabling/shutdown/stopping all over the place? (let's
> say
> > these cluster ops would cancel the assign and then win by blocking
> assigns).
>
> The enabling, etc., checks are done on assign to make sure we don't go
> ahead if table state has changed since the order to assign was given.
>
> To me cancel seems like something else; the open or close has gone out
> already and we want to stop it happening.
>
> They seem like different things to me.
>
I'm suggesting that when an overriding operation like
enable/disable/shutdown/stop is triggered we internally use cancellation to
preempt assignments/unassignments.  This could be in the same places where
we currently do the checks, but also eventually be used to cancel
open/close operations.  Maybe this is too far out for the time being.
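A minimal sketch of what that could look like, assuming one cancel token per in-flight assign. None of these names come from the HBase code base: AssignmentGate, run_assign, and the step callables are hypothetical, and threading.Event stands in for whatever cancellation primitive would actually be used.

```python
import threading


class AssignmentGate:
    """Hypothetical tracker: one cancel token per in-flight assign."""

    def __init__(self):
        self._lock = threading.Lock()
        self._blocked = False
        self._pending = {}  # region name -> threading.Event used as cancel token

    def start_assign(self, region):
        """Return a cancel token, or None if assigns are currently blocked."""
        with self._lock:
            if self._blocked:
                return None
            token = threading.Event()
            self._pending[region] = token
            return token

    def finish_assign(self, region):
        with self._lock:
            self._pending.pop(region, None)

    def override(self):
        """An enable/disable/shutdown/stop-style operation: it blocks new
        assigns and cancels the ones already in flight."""
        with self._lock:
            self._blocked = True
            for token in self._pending.values():
                token.set()


def run_assign(gate, region, steps):
    """Run the assign steps, checking the cancel token before each one."""
    token = gate.start_assign(region)
    if token is None:
        return "rejected"
    try:
        for step in steps:
            if token.is_set():
                return "cancelled"
            step()
        return "assigned"
    finally:
        gate.finish_assign(region)


gate = AssignmentGate()
first = run_assign(gate, "r1", [lambda: None])
# Here the override fires in the middle of the second assign.
second = run_assign(gate, "r2", [gate.override, lambda: None])
third = run_assign(gate, "r3", [lambda: None])
```

In this sketch the first assign completes, the second is cancelled between steps, and the third is rejected up front. The state check happens in one place (the token) rather than being repeated for each of enabling/disabled/disabling/shutdown/stopping.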


> > 7) In memory state has different but similarly named states in the HM,
> ZK,
> > and in the RS's.  And there are the transition events could be missed.
>
> Yes.  This is a problem.
>
> My peeve is the one where we cannot trust what RegionState says and
> even if we could, its states are not 'clean'; e.g. OFFINE is both
> BEGIN the open of a region but also a catchall parking state that we
> put regions into when not sure what else to do w/ them.
>
>
There is the state name (I agree).  Also, there is the fact that
RegionState is not always right (possibly more than one state transition
behind).  This is actually why I was considering taking the zk-based
control flow elements and putting them in the master.  If states are
skipped we need to make sure the transitions happen on the master (or that
we can safely skip the transition).

I'm also suggesting that we could avoid using ZK event callbacks like the
OPENING and OPENED zk transitions and instead have the master manage
those.  We'd have an opening RS tickle some other znode to show
progress.   At least then RegionState would be closer to accurate, and the
HM would go through all state transitions.



> > 8) Is having multiple processes "responsible for acting" necessary?  (why
> > not have the HM open and then update meta)?
> >
>
> It could be good having master do all meta edits.  Would be good to
> see what advantage it would bring us before going about making the
> change.
>
>
I'm pretty sure it would have more latency.  Controlling when the
assigned region becomes available might make this trickier.  (Jimmy caught a bug
in an earlier version of this).


> I can provide more history and provenance if needed, np.
>
> St.Ack
>



-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com

Re: Hbase Assignments in trunk.

Posted by Stack <st...@duboce.net>.

On Wed, Sep 5, 2012 at 12:38 PM, Jonathan Hsieh <jo...@cloudera.com> wrote:
> I generally think in pictures, so I've mapped out the single Assignment
> control flow as found in trunk yesterday in terms of threads and network
> communications (each of which can possibly fail).  It is a process that has
> 18 or so network communications, 3 processes, and about 8 threads
> coordinating (excluding meta writes)
>

Did you attach your picture Jon?

> We've also talked about defining design and code invariants -- here's the
> one that I've gotten so far:  (We can pull up more from discussion)
>
> * ZK state should transient (treat it like memory). If deleted, hbase should
> be able to recover and essentially be in the same state (a few exceptions --
> enabled/disable state)
>

Yes.

We should post these invariants somewhere?  In dev section of refguide?

> A few questions I have from this exercise:
>
> 1) Why do we have ZK asynchronously update the HM?  (why not do it
> synchronously?)

IIRC, it was faster.

> 2) Why do we have the RS update ZK as it opens -- why not have the HM manage
> all ZK comms and not have the RS talk directly to ZK in this process?  Then
> ZK is just for failover and less so for coordination.

IIRC, the notion was that we could keep an eye on the regionserver
progress opening a region.  RS could take a long time opening and as
long as it was tickling zk by resetting state, the master would not
take control of the region away from the RS.  Inversely, if the RS
froze mid-open, it'd know it had lost control if, when it tried to set
state, the sequence id had moved on from what it thought it was.
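The ownership check described here is essentially a versioned compare-and-set. Below is a minimal sketch with an in-memory node, not the ZooKeeper client API; VersionedNode and VersionMismatch are hypothetical names, and the integer version plays the role of the sequence id mentioned above.

```python
class VersionMismatch(Exception):
    """Raised when a writer's expected version is stale."""


class VersionedNode:
    """Toy stand-in for a znode whose version bumps on every write."""

    def __init__(self, state):
        self.state = state
        self.version = 0

    def set_state(self, new_state, expected_version):
        # Refuse the write if someone else has written since we last looked.
        if expected_version != self.version:
            raise VersionMismatch(
                f"expected version {expected_version}, found {self.version}"
            )
        self.state = new_state
        self.version += 1
        return self.version


node = VersionedNode("OFFLINE")

# The RS starts opening and then "tickles" the node to show progress.
rs_version = node.set_state("OPENING", 0)
rs_version = node.set_state("OPENING", rs_version)

# The RS freezes; the master gives up waiting and takes the region back.
master_version = node.set_state("OFFLINE", node.version)

# The RS wakes up and tries to finish with the version it remembers.
try:
    node.set_state("OPENED", rs_version)
    rs_lost_control = False
except VersionMismatch:
    rs_lost_control = True  # the RS learns it no longer owns the region
```

The stale write from the RS is rejected, so the node keeps the state the master set and the RS finds out it lost control without any extra message from the master.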

> 3) Clients who issue assign calls are partially asynchronous and partially
> synchronous.  Why not go all the way?

No reason.  The thought was async meant less friction.  The work was
just never done to async it all.

> 4) Why are there multiple error conventions -- abort, FAILED_OPEN, throwing
> exception, (and cases where we "return" silently without notification)?

I would have to look at the particular instance but high level I'd say
it's a case of:

1. On the one hand your classic myopic patch-centric view
2. While on the other, you can't throw an exception out to the master
if the rpc open has been successfully handed off and the rpc has
completed... there needs to be another means of flagging the error.

> 5) How do we handle timeout situations -- IMO it makes sense to have a
> rollback or fail forward policy for different places on the timeline.

Yes.  There are a couple of flavors of this in the code base at
present.  Could do w/ a revisit for sure.

> 6) Can we use cancellation instead of checking for
> enabling/disabled/disabling/shutdown/stopping all over the place? (let's say
> these cluster ops would cancel the assign and then win by blocking assigns).

The enabling, etc., checks are done on assign to make sure we don't go
ahead if table state has changed since the order to assign was given.

To me cancel seems like something else; the open or close has gone out
already and we want to stop it happening.

They seem like different things to me.

> 7) In memory state has different but similarly named states in the HM, ZK,
> and in the RS's.  And there are the transition events could be missed.

Yes.  This is a problem.

My peeve is the one where we cannot trust what RegionState says and
even if we could, its states are not 'clean'; e.g. OFFLINE is both
the BEGIN state for the open of a region and also a catchall parking
state that we put regions into when not sure what else to do w/ them.

> 8) Is having multiple processes "responsible for acting" necessary?  (why
> not have the HM open and then update meta)?
>

It could be good having master do all meta edits.  Would be good to
see what advantage it would bring us before going about making the
change.

I can provide more history and provenance if needed, np.

St.Ack