Posted to dev@hbase.apache.org by Jonathan Hsieh <jo...@cloudera.com> on 2013/12/03 07:00:44 UTC

Re: [Shadow Regions / Read Replicas ] Block Affinity

> Enis:
> I was trying to refer to not having co-location constraints for secondary
replicas whose primaries are hosted by the same
> RS. For example, if R1(replica=0), and R2(replica=0) are hosted on RS1,
R1(replica=1) and R2(replica=1) can be hosted by RS2
> and RS3 respectively. This can definitely use the hdfs block affinity
work though.

This particular example doesn't have enough to tease out the different
ideal situations.  Hopefully this will help:

We have RS-A hosting regions X and Y.  With affinity groups let's say
RS-A's logs are written to RS-A, RS-L, and RS-M.

Let's also say that X is written to RS-A, RS-B and RS-C and Y is to RS-A,
RS-D, RS-E.

>> Jon:
>> However, I don't think we get into a situation where all RS's must read
all other RS's logs – we only need to have the shadow RS's read the
primary RS's log.
> Enis:
> I am assuming a random distribution of secondary regions per above. In
this case, for replication=2, a region server will
have half of its regions in primary and the other half in secondary mode. For
all the regions in the secondary mode, it has to
> tail the logs of the rs where the primary is hosted. However, since there
is no co-location guarantee, the primaries are
> also randomly distributed. For n secondary regions, and m region servers,
you will have to tail the logs of most of the RSs
> if n > m with a high probability (I do not have the smarts to calculate
the exact probability)

For high-availability stale-read replicas (read replicas) it seems best to
assign the secondary regions to the rs's where the HFiles are hosted. Thus
this approach would want to assign the secondary regions like this (this is
the "random distribution of secondary regions"):
* X on RS-A(rep=0), RS-B(rep=1), and RS-C(rep=2); and
* Y on RS-A(rep=0), RS-D(rep=1), and RS-E(rep=2).

For the most efficient consistent read-recovery (shadow regions/memstores),
it would make sense to have them assigned to the rs's where the Hlogs are
local. Thus this approach would want to assign shadow regions for regions
X, Y, and Z on RS-L and RS-M.
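
To make the two placement policies above concrete, here is a rough
standalone sketch (plain Java; class and method names are illustrative
only, not HBase code) that derives both placements for regions X and Y
from the scenario at the top of this mail: read replicas follow the
regions' HFile favored nodes, while shadow memstores follow the primary's
HLog block locations.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Standalone sketch (not HBase code): derive secondary placement for the
// scenario in this thread. RS and region names mirror the example above;
// everything else is illustrative.
public class PlacementSketch {
    // HFile favored nodes per region (HDFS block affinity).
    static final Map<String, List<String>> HFILES = Map.of(
            "X", List.of("RS-A", "RS-B", "RS-C"),
            "Y", List.of("RS-A", "RS-D", "RS-E"));

    // HLog (WAL) block replicas for RS-A, which hosts both primaries.
    static final List<String> RS_A_LOG = List.of("RS-A", "RS-L", "RS-M");

    // Policy 1: read replicas go where the region's HFiles already are.
    static Map<String, List<String>> readReplicaPlacement() {
        Map<String, List<String>> out = new LinkedHashMap<>();
        HFILES.forEach((region, nodes) -> out.put(region, nodes.subList(1, nodes.size())));
        return out;
    }

    // Policy 2: shadow memstores go where the primary's HLog blocks are,
    // so tailing the log is a local read.
    static Map<String, List<String>> shadowMemstorePlacement() {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (String region : HFILES.keySet()) {
            out.put(region, RS_A_LOG.subList(1, RS_A_LOG.size())); // RS-L, RS-M
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println("read replicas:    " + readReplicaPlacement());
        System.out.println("shadow memstores: " + shadowMemstorePlacement());
    }
}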

A simple optimal solution for both read replicas and shadow regions would
be to assign the regions and the HLog to the same set of machines so that
the RS's for the logs and region x, y, and z hosted are on the same
machines -- let's say RS-A, RS-H, and RS-I.  This has some non-optimal
balancing ramifications upon machine failure -- the work of RS-A would be
split between RS-H and RS-I.

A more complex solution for both would be to choose machines for the
purpose they are best suited for.  Read replicas are hosted on their
respective machines, and shadow region memstores on the hlog's rs's.
 Promotion becomes a more complicated dance where upon RS-A's failure, we
have the log tailing shadow region catchup and perform a flush of the
affected memstores to the appropriate hdfs affinity group/favored nodes.
 So the shadow memstore for region X would flush the hfile to A,B,C and
region Y to A,D,E.  Then the read replicas would be promoted (close
secondary, open as primary) based on where the regions'/hfiles' affinity
group is.  This feels like an optimization done on the 2nd or 3rd rev.

Jon

-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com

Re: [Shadow Regions / Read Replicas ] Block Affinity

Posted by Jonathan Hsieh <jo...@cloudera.com>.
On Tue, Dec 3, 2013 at 3:46 PM, Nick Dimiduk <nd...@gmail.com> wrote:

> On Tue, Dec 3, 2013 at 11:37 AM, Enis Söztutar <en...@gmail.com> wrote:
>
> > I think we do not want to differentiate between RS's by splitting them
> between
> > primaries and shadows. This will complicate provisioning, administration,
> > monitoring and load balancing a lot, and will not achieve very cheap
> > secondary region promotions (because you have to move the region still as
> > you described).
> >
>
> The idea of having "primary hosts" and "replica hosts" was brought up in
> initial design discussions over here. I am particularly against this
> approach because of the additional complexity. I need to update myself on
> Enis's doc (I'm a week+ behind), but my opinion is that we treat a
> non-primary region (be it a "read replica" or a "shadow region") as a
> first-class and independent entity. These entities can be assigned to any
> host in the cluster, each with their own individual state machine
> instances.
>
> Of course, the balancer would need to be aware of the relationship between
> the primary and its non-primaries in order to maintain the balancing policy
> requirements. However, I see no reason for there to be specialization at
> the host level, and I agree with Enis's arguments against it.
>
> -n
>

I think there was a misunderstanding here -- I made a distinction between
the "normal" primary regions, eventually-consistent-read-replica/secondary
regions, and shadow memstore regions (for fast consistent read recovery).
 All region servers would be able to host normal primary regions,
read-replica regions and shadow memstore regions.

There would be different potential sweet spots if read-replica regions and
shadow memstore regions were co-located per region at recovery time, with
trade-offs between fast consistent recovery, the ability to serve more
recent values, locality optimizations, and load balancing optimizations.
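
To be explicit about the taxonomy (illustrative only, not HBase code), the
distinction I'm drawing is between three logical roles that any region
server can host; they differ only in what they serve and where the
balancer would prefer to place them:

// Illustrative only (not HBase code): the three logical region roles
// discussed in this thread. Any region server can host any mix of them.
public enum RegionRole {
    PRIMARY,          // serves consistent reads and writes
    READ_REPLICA,     // serves possibly-stale reads; prefers the HFile
                      // favored nodes for locality
    SHADOW_MEMSTORE;  // tails the primary's HLog for fast consistent
                      // recovery; prefers the HLog block locations

    public static void main(String[] args) {
        for (RegionRole role : values()) {
            System.out.println(role);
        }
    }
}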

Jon.

-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com

Re: [Shadow Regions / Read Replicas ] Block Affinity

Posted by Nick Dimiduk <nd...@gmail.com>.
On Tue, Dec 3, 2013 at 11:37 AM, Enis Söztutar <en...@gmail.com> wrote:

> I think we do not want to differentiate between RS's by splitting them between
> primaries and shadows. This will complicate provisioning, administration,
> monitoring and load balancing a lot, and will not achieve very cheap
> secondary region promotions (because you have to move the region still as
> you described).
>

The idea of having "primary hosts" and "replica hosts" was brought up in
initial design discussions over here. I am particularly against this
approach because of the additional complexity. I need to update myself on
Enis's doc (I'm a week+ behind), but my opinion is that we treat a
non-primary region (be it a "read replica" or a "shadow region") as a
first-class and independent entity. These entities can be assigned to any
host in the cluster, each with their own individual state machine instances.

Of course, the balancer would need to be aware of the relationship between
the primary and its non-primaries in order to maintain the balancing policy
requirements. However, I see no reason for there to be specialization at
the host level, and I agree with Enis's arguments against it.

-n

Re: [Shadow Regions / Read Replicas ] Block Affinity

Posted by Jonathan Hsieh <jo...@cloudera.com>.
On Tue, Dec 3, 2013 at 11:37 AM, Enis Söztutar <en...@gmail.com> wrote:

> Responses inlined.
>
> On Mon, Dec 2, 2013 at 10:00 PM, Jonathan Hsieh <jo...@cloudera.com> wrote:
>
> > For the most efficient consistent read-recovery (shadow
> regions/memstores),
> > it would make sense to have them assigned to the rs's where the Hlogs are
> > local. Thus this approach would want to assign shadow regions for regions
> > X, Y, and Z on RS-L and RS-M.
> >
>
> I don't think this is the case.


Clarification: For the goal of low-latency recovery using the fewest
resources (efficiency for this goal), this is the design point that best
achieves it.


> Recovery is a multi step process, and
> reading and
> applying the log is only one step.


Yes, and the replay to recover read consistency seems to be one of the more
expensive steps.


> After the region is opened, you
> definitely want
> the data files to be local as much as possible.


Ideal but not necessary for correctness or faster consistent read recovery.
 Today we don't have data files local as much as possible and normal hbase
users can't use the feature yet.


> Considering the relative
> sizes of
> the files and the WALs, I think we will always want to use hdfs affinity
> groups for
> hfiles rather than hlogs to assign secondary replicas. This will help both
> stale reads
> and local reads in case of a promotion to primary.
>
>
>
This is why I'm suggesting separating the two functions (read replica and
shadow memstore) into separate logical types with different placement
options.

I agree with you on selecting hfile-related region servers for read
replicas (which I hadn't fully considered when I was working on the shadow
memstore writeup).  However, for minimizing recovery time, replay is more
costly (e.g. the n^2 tailers).

I don't think hlog tailers should be coupled to the hfile affinity groups
for the same reasons as you -- n^2 tailers where n is # of regions.
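
As a rough back-of-the-envelope (illustrative parameters only, not a real
measurement), here is a small standalone Java simulation of that fan-in:
with randomly placed secondaries, nearly every RS ends up tailing nearly
every other RS's log, whereas pinning shadows to the HLog block replicas
keeps it at roughly two tailers per log.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Random;
import java.util.Set;

// Back-of-the-envelope simulation of log-tailing fan-in (illustrative only).
// Compares random secondary placement against shadows pinned to the two
// other hosts holding the primary's WAL block replicas.
public class TailerCount {
    public static void main(String[] args) {
        int m = 20;             // region servers
        int regionsPerRs = 50;  // primary regions per RS
        Random rnd = new Random(42);

        // random.get(rs) = set of other RS's whose logs rs must tail.
        Map<Integer, Set<Integer>> random = new HashMap<>();
        for (int primaryRs = 0; primaryRs < m; primaryRs++) {
            for (int r = 0; r < regionsPerRs; r++) {
                int secondaryRs;
                do {
                    secondaryRs = rnd.nextInt(m);
                } while (secondaryRs == primaryRs);
                random.computeIfAbsent(secondaryRs, k -> new HashSet<>()).add(primaryRs);
            }
        }
        double avgRandom =
                random.values().stream().mapToInt(Set::size).average().orElse(0);

        // Shadows pinned to the HLog replicas: each RS's log is tailed only by
        // the 2 hosts carrying its other WAL block replicas.
        double avgPinned = 2.0;

        System.out.printf("random secondaries: each RS tails ~%.1f of %d other logs%n",
                avgRandom, m - 1);
        System.out.printf("shadows on HLog replicas: each RS tails ~%.1f logs%n",
                avgPinned);
    }
}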

 >
> > A simple optimal solution for both read replicas and shadow regions would
> > be to assign the regions and the HLog to the same set of machines so that
> > the RS's for the logs and region x, y, and z hosted are on the same
> > machines -- let's say RS-A, RS-H, and RS-I.  This has some non-optimal
> > balancing ramifications upon machine failure -- the work of RS-A would be
> > split between RS-H and RS-I.
> >
>
> I don't think we want this. This implies that we are creating region
> assignment groups (group-based
> assignment as described in the doc). The problem is that in case of a
> crash, you cannot evenly
> distribute out the regions from the primary; otherwise you will still end up
> tailing all the logs for
> all the region servers. Plus if you want to load balance, it will be even
> harder to satisfy the constraints while
> keeping the balance.
>
> In your example, if you have replication=2, we cannot simply
> move all the primary regions
> of RS-A to RS-H, which will then suddenly have twice the number of regions.
>
>
I think a realistic use would be to set replication to 3 since we have
three replicas of the logs.  Instead, the client would just choose to hit
the first two replicas (rep0=primary and rep1=secondary) to reduce the
memory pressure on the 3rd node (rep=secondary, not read from).
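
To illustrate the client side of that (hypothetical helper, not the HBase
client API): reads only ever go to replica 0 and, if needed, replica 1, so
read-driven memory pressure stays off the third replica.

import java.util.List;

// Hypothetical sketch, NOT the HBase client API: "read rep=0, fall back to
// rep=1, never read rep=2" as described above.
public class ReadPreference {
    // replicaHosts is ordered: index 0 is the primary.
    static List<String> readTargets(List<String> replicaHosts) {
        // Only the primary and the first secondary are eligible for reads,
        // keeping block cache / memstore pressure off the third replica.
        return replicaHosts.subList(0, Math.min(2, replicaHosts.size()));
    }

    public static void main(String[] args) {
        List<String> regionX = List.of("RS-A", "RS-B", "RS-C"); // rep=0,1,2
        System.out.println("read targets for X: " + readTargets(regionX));
    }
}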

Here's an extension to this hybrid approach which potentially buys us both
good recency and high availability (at the cost of poor balance if we
enforce optimal locality).  We essentially assign a group of regions to the
same three RS's, and rotate the replica roles and hlogs within that group
in a little cycle.

Ex:
Region X on RS-A (rep=0), RS-B(rep=1), RS-C(rep=2).
Region Y on RS-A (rep=1), RS-B(rep=2), RS-C(rep=0).
Region Z on RS-A (rep=2), RS-B(rep=0), RS-C(rep=1).
RS-A's log on RS-A, RS-B, RS-C.
RS-B's log on RS-B, RS-C, RS-A.
RS-C's log on RS-C, RS-A, RS-B.

If any RS's go down the load is spread between the other two in the group.
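
A sketch of how that cyclic assignment could be generated (illustrative
only, not balancer code): replica i of region j lands on group member
(i - j) mod 3, and every member's HLog stays within the group.

import java.util.List;

// Illustrative sketch of the cyclic group assignment above (not HBase
// balancer code). Reproduces the X/Y/Z example: replica i of region j is
// placed on group member (i - j) mod 3, and each member's HLog is
// replicated within the group.
public class CyclicGroupAssignment {
    public static void main(String[] args) {
        List<String> group = List.of("RS-A", "RS-B", "RS-C");
        List<String> regions = List.of("X", "Y", "Z");
        int n = group.size();

        for (int j = 0; j < regions.size(); j++) {
            StringBuilder line = new StringBuilder("Region " + regions.get(j) + ":");
            for (int i = 0; i < n; i++) {
                line.append(" rep=").append(i).append(" on ")
                    .append(group.get(Math.floorMod(i - j, n)));
            }
            System.out.println(line);
        }
        for (String rs : group) {
            // Each member's WAL blocks also stay inside the group.
            System.out.println(rs + "'s log on " + group);
        }
    }
}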


> >
> > A more complex solution for both would be to choose machines for the
> > purpose they are best suited for.  Read replicas are hosted on their
> > respective machines, and shadow region memstores on the hlog's rs's.
> >  Promotion becomes a more complicated dance where upon RS-A's failure, we
> > have the log tailing shadow region catchup and perform a flush of the
> > affected memstores to the appropriate hdfs affinity group/favored nodes.
> >  So the shadow memstore for region X would flush the hfile to A,B,C and
> > region Y to A,D,E.  Then the read replicas would be promoted (close
> > secondary, open as primary) based on where the regions'/hfiles' affinity
> > group is.  This feels like an optimization done on the 2nd or 3rd rev.
> >
>
> I think we do not want to differentiate between RS's by splitting them
> between primaries and shadows.
> This will complicate provisioning, administration, monitoring and load
> balancing a lot, and will not achieve
> very cheap secondary region promotions (because you have to move the region
> still as you described).
>
>
I think there is a misunderstanding here -- clarifying.  In this combined
approach, we have a pool of RS's, each of which can host a combination of
primary regions, secondary read replica regions, and shadow memstore
regions.  If we don't separate out the shadow memstore regions from the
secondary read replicas, we end up with the inefficient design implied in
the read replica write up -- where all region servers need to essentially
read all hlogs from the other region servers, causing n^2 tailers across
the cluster instead of n or 2n tailers.

We will want to have different metrics for monitoring and logging for the
read replicas and the shadow memstores.  Ideally we'd know how far behind
we are.


In the combined approach, this isn't a move -- the flush is completed by
the shadow to the nodes that are assigned as the secondaries, and then we
promote the secondary to primary by closing and then opening the region as
primary.

Ex:
Region X on RS-A (rep=0), RS-B(rep=1), RS-C(rep=2).
Region Y on RS-A (rep=0), RS-D(rep=1), RS-E(rep=2).
RS-A's log on RS-A, RS-F, RS-G.
RS-F and RS-G shadow RS-A's Hlog.

RS-A goes down.
RS-F and RS-G catch up to the end of RS-A's HLog.
RS-F flushes the region X shadow memstore to nodes RS-B, RS-C (and some
other node).
RS-G flushes the region Y shadow memstore to nodes RS-D, RS-E (and some
other node).
Master promotes region X on secondary RS-B (closeReplica, open X) with all
its stores local to RS-B and RS-C.
Master promotes region Y on secondary RS-D (closeReplica, open Y) with all
its stores local to RS-D and RS-E.
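
The same sequence as a rough sketch (the method names -- catchUpFromLog,
flushShadowTo, promoteReplica -- are hypothetical, not HBase APIs; this
only captures the ordering of the steps):

import java.util.List;
import java.util.Map;

// Rough sketch of the recovery sequence above. Method names are
// hypothetical, not HBase APIs; only the step ordering matters here.
public class RecoverySequenceSketch {
    public static void main(String[] args) {
        Map<String, List<String>> favoredNodes = Map.of(
                "X", List.of("RS-B", "RS-C"),
                "Y", List.of("RS-D", "RS-E"));

        System.out.println("RS-A goes down");
        // 1. The shadow hosts finish tailing RS-A's HLog.
        catchUpFromLog("RS-F", "RS-A");
        catchUpFromLog("RS-G", "RS-A");
        // 2. The shadows flush their memstores to each region's favored
        //    nodes, so the new HFiles are local to the future primaries.
        flushShadowTo("RS-F", "X", favoredNodes.get("X"));
        flushShadowTo("RS-G", "Y", favoredNodes.get("Y"));
        // 3. The master promotes the read replicas that already have
        //    local stores.
        promoteReplica("X", "RS-B");
        promoteReplica("Y", "RS-D");
    }

    static void catchUpFromLog(String shadowHost, String failedRs) {
        System.out.println(shadowHost + " replays the tail of " + failedRs + "'s HLog");
    }

    static void flushShadowTo(String shadowHost, String region, List<String> nodes) {
        System.out.println(shadowHost + " flushes region " + region
                + "'s shadow memstore to " + nodes);
    }

    static void promoteReplica(String region, String newPrimary) {
        System.out.println("Master closes the secondary and reopens region "
                + region + " as primary on " + newPrimary);
    }
}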



> >
> > Jon
> >
> > --
> > // Jonathan Hsieh (shay)
> > // Software Engineer, Cloudera
> > // jon@cloudera.com
> >
>



-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com

Re: [Shadow Regions / Read Replicas ] Block Affinity

Posted by Enis Söztutar <en...@gmail.com>.
Responses inlined.

On Mon, Dec 2, 2013 at 10:00 PM, Jonathan Hsieh <jo...@cloudera.com> wrote:

> > Enis:
> > I was trying to refer to not having co-location constraints for secondary
> replicas whose primaries are hosted by the same
> > RS. For example, if R1(replica=0), and R2(replica=0) are hosted on RS1,
> R1(replica=1) and R2(replica=1) can be hosted by RS2
> > and RS3 respectively. This can definitely use the hdfs block affinity
> work though.
>
> This particular example doesn't have enough to tease out the different
> ideal situations.  Hopefully this will help:
>
> We have RS-A hosting regions X and Y.  With affinity groups let's say
> RS-A's logs are written to RS-A, RS-L, and RS-M.
>
> Let's also say that X is written to RS-A, RS-B and RS-C and Y is to RS-A,
> RS-D, RS-E.
>
> >> Jon:
> >> However, I don't think we get into a situation where all RS's must read
> all other RS's logs – we only need to have the shadow RS's read the
> primary RS's log.
> > Enis:
> > I am assuming a random distribution of secondary regions per above. In
> this case, for replication=2, a region server will
> have half of its regions in primary and the other half in secondary mode. For
> all the regions in the secondary mode, it has to
> > tail the logs of the rs where the primary is hosted. However, since there
> is no co-location guarantee, the primaries are
> > also randomly distributed. For n secondary regions, and m region servers,
> you will have to tail the logs of most of the RSs
> > if n > m with a high probability (I do not have the smarts to calculate
> the exact probability)
>
> For high-availability stale-read replicas (read replicas) it seems best to
> assign the secondary regions to the rs's where the HFiles are hosted. Thus
> this approach would want to assign the secondary regions like this (this is
> the "random distribution of secondary regions"):
> * X on RS-A(rep=0), RS-B(rep=1), and RS-C(rep=2); and
> * Y on RS-A(rep=0), RS-D(rep=1), and RS-E(rep=2).
>
> For the most efficient consistent read-recovery (shadow regions/memstores),
> it would make sense to have them assigned to the rs's where the Hlogs are
> local. Thus this approach would want to assign shadow regions for regions
> X, Y, and Z on RS-L and RS-M.
>

I don't think this is the case. Recovery is a multi step process, and
reading and
applying the log is only one step. After the region is opened, you
definitely want
the data files to be local as much as possible. Considering the relative
sizes of
the files and the WALs, I think we will always want to use hdfs affinity
groups for
hfiles rather than hlogs to assign secondary replicas. This will help both
stale reads
and local reads in case of a promotion to primary.


>
> A simple optimal solution for both read replicas and shadow regions would
> be to assign the regions and the HLog to the same set of machines so that
> the RS's for the logs and region x, y, and z hosted are on the same
> machines -- let's say RS-A, RS-H, and RS-I.  This has some non-optimal
> balancing ramifications upon machine failure -- the work of RS-A would be
> split between RS-H and RS-I.
>

I don't think we want this. This implies that we are creating region
assignment groups (group-based
assignment as described in the doc). The problem is that in case of a
crash, you cannot evenly
distribute out the regions from the primary; otherwise you will still end up
tailing all the logs for
all the region servers. Plus if you want to load balance, it will be even
harder to satisfy the constraints while
keeping the balance.

In your example, if you have replication=2, we cannot simply
move all the primary regions
of RS-A to RS-H, which will then suddenly have twice the number of regions.


>
> A more complex solution for both would be to choose machines for the
> purpose they are best suited for.  Read replicas are hosted on their
> respective machines, and shadow region memstores on the hlog's rs's.
>  Promotion becomes a more complicated dance where upon RS-A's failure, we
> have the log tailing shadow region catchup and perform a flush of the
> affected memstores to the appropriate hdfs affinity group/favored nodes.
>  So the shadow memstore for region X would flush the hfile to A,B,C and
> region Y to A,D,E.  Then the read replicas would be promoted (close
> secondary, open as primary) based on where the regions'/hfiles' affinity
> group is.  This feels like an optimization done on the 2nd or 3rd rev.
>

I think we do not want to differentiate between RS's by splitting them
between primaries and shadows.
This will complicate provisioning, administration, monitoring and load
balancing a lot, and will not achieve
very cheap secondary region promotions (because you have to move the region
still as you described).


>
> Jon
>
> --
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // jon@cloudera.com
>