Posted to dev@hbase.apache.org by Jonathan Hsieh <jo...@cloudera.com> on 2013/12/03 06:54:54 UTC

[Shadow Regions / Read Replicas ]

HBASE-10070 [1]  looks to be heading into a discussion more apt for the
mailing list than in the jira. Moving this to the dev list for threaded
discussion.  I'll start a few threads by replying to this thread with
edited titles

[1] https://issues.apache.org/jira/browse/HBASE-10070

-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com

Re: [Shadow Regions / Read Replicas ]

Posted by Enis Söztutar <en...@gmail.com>.
On Wed, Dec 4, 2013 at 12:25 PM, Jimmy Xiang <jx...@cloudera.com> wrote:

> I am concerned about reading stale data. I understand some people may want
> this feature. One of the reasons is region availability. If we can make
> sure those regions are always available, we don't have to compromise,
> right?  How about supporting something like a region pipeline? For each
> important region, we assign it to two or three region servers and make sure
> all writes go to all three region instances, and just one of them persists
> data to the hlog, or each region instance has its own local hlog (on the
> local fs, not hdfs). Is this too complex to consider, or is the write
> overhead too high?
>

It is not that simple. In a pipeline model, you can only do reads from the
primary, since only that node knows what is committed and what is not. HDFS
pipelines work for reads from the other replicas even while the pipeline is
still open, because the data is immutable; the visible length of the block
is updated when the block replica ACKs it. In hdfs' case the pipeline is
like an append-only WAL, with the length as the transaction id.

In a pipelined sync-replication style (like ZAB or RAFT) you still have to
read from the primary for consistent reads, because the followers do not
learn about commits until after the leader commits them and sends the
commit message.

I think paxos-style quorum reads might decide what is committed and what is
not, and can provide strong consistency, but I am still not sure of the
exact details of a practical system.
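As a toy illustration of why a quorum read can decide what is committed
(illustrative Java only, not HBase code or any proposed API; it relies on
the fact that any two majorities intersect, so a majority of responses must
include the latest committed write):

```java
import java.util.Comparator;
import java.util.List;

public class QuorumReadSketch {
    // A replica's answer: the highest sequence number it has accepted,
    // plus the value at that sequence number. Names are hypothetical.
    record Versioned(long seq, String value) {}

    // Return the freshest value among a majority of replica responses.
    static Versioned quorumRead(List<Versioned> responses, int clusterSize) {
        int quorum = clusterSize / 2 + 1;
        if (responses.size() < quorum) {
            throw new IllegalStateException("need " + quorum + " responses for a quorum");
        }
        // Any committed write was acked by some majority; that majority
        // intersects this one, so the max-seq response includes the write.
        return responses.stream()
                .max(Comparator.comparingLong(Versioned::seq))
                .orElseThrow();
    }
}
```

The hard part a real system must add, and what the "exact details" above
refer to, is distinguishing a value that merely *reached* one replica from
one that was committed; this sketch glosses over that.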


> [snip]

Re: [Shadow Regions / Read Replicas ]

Posted by Jimmy Xiang <jx...@cloudera.com>.
I am concerned about reading stale data. I understand some people may want
this feature. One of the reasons is region availability. If we can make
sure those regions are always available, we don't have to compromise,
right?  How about supporting something like a region pipeline? For each
important region, we assign it to two or three region servers and make sure
all writes go to all three region instances, and just one of them persists
data to the hlog, or each region instance has its own local hlog (on the
local fs, not hdfs). Is this too complex to consider, or is the write
overhead too high?
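For concreteness, a toy model of the pipeline idea above (all names and
structure are hypothetical, nothing like the actual regionserver code):
every instance applies the write in memory, but only the designated one
persists it to an hlog.

```java
import java.util.ArrayList;
import java.util.List;

public class RegionPipelineSketch {
    static class RegionInstance {
        final List<String> memstore = new ArrayList<>(); // in-memory edits
        final List<String> hlog = new ArrayList<>();     // persisted edits
        final boolean persistsToHlog;

        RegionInstance(boolean persistsToHlog) { this.persistsToHlog = persistsToHlog; }

        void apply(String edit) {
            if (persistsToHlog) hlog.add(edit); // durability path, one instance only
            memstore.add(edit);                 // state kept on every instance
        }
    }

    // A write succeeds only after every instance in the pipeline applied it.
    static void write(String edit, List<RegionInstance> pipeline) {
        for (RegionInstance r : pipeline) r.apply(edit);
    }
}
```

This also makes Enis's objection below concrete: every instance holds the
edit, but nothing here tells a non-primary instance which of its in-memory
edits are actually committed, so consistent reads still need the primary.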


On Tue, Dec 3, 2013 at 10:20 PM, Devaraj Das <dd...@hortonworks.com> wrote:

> [snip]

Re: [Shadow Regions / Read Replicas ]

Posted by Devaraj Das <dd...@hortonworks.com>.
On Tue, Dec 3, 2013 at 6:47 PM, Jonathan Hsieh <jo...@cloudera.com> wrote:

> On Tue, Dec 3, 2013 at 2:04 PM, Enis Söztutar <en...@gmail.com> wrote:
>
> > On Tue, Dec 3, 2013 at 11:51 AM, Jonathan Hsieh <jo...@cloudera.com>
> wrote:>
> >  >
> > > On Tue, Dec 3, 2013 at 11:07 AM, Enis Söztutar <en...@apache.org>
> wrote:
> > >
> > > > Thanks Jon for bringing this to dev@.
> > > >
> > > >
> > > > On Mon, Dec 2, 2013 at 10:01 PM, Jonathan Hsieh <jo...@cloudera.com>
> > > wrote:
> > > >
> > > > > Fundamentally, I'd prefer focusing on making HBase "HBasier" instead
> > > > > of tackling a feature that other systems architecturally can do
> > > > > better (inconsistent reads).  I consider consistent reads/writes to
> > > > > be one of HBase's defining features. That said, I think read
> > > > > replicas make sense and are a nice feature to have.
> > > > >
> > > >
> > > > Our design proposal has a specific use-case goal, and hopefully we can
> > > > demonstrate the benefits of having this in HBase so that even more
> > > > pieces can be built on top of it. Plus I imagine this will be a widely
> > > > used feature for read-only tables or bulk-loaded tables. We are not
> > > > proposing to rework strong consistency semantics or make major
> > > > architectural changes. I think having the tables defined with a
> > > > replication count, and the proposed client API changes (Consistency
> > > > definition), plug into the HBase model rather well.
> > > >
> > > >
> > > I do think that without any recency mechanism, we are limiting the
> > > usefulness of this feature to essentially *only* read-only or
> > > bulk-load-only tables.  Recency, if there were any edits/updates, would
> > > be severely lagging (by default potentially an hour), especially in
> > > cases where there are only a few edits to a primarily bulk-loaded
> > > table.  This limitation is not mentioned in the tradeoffs or
> > > requirements (or a non-requirements section) and definitely should be
> > > listed there.
> > >
> >
> > Obviously the amount of lag you would observe depends on whether you are
> > using "Region snapshots", "WAL-Tailing" or "Async wal replication". I
> > think there are still use cases where you can live with >1-hour-old stale
> > reads, so "Region snapshots" is not *just* for read-only tables. I'll add
> > these to the tradeoffs section.
> >
>
> Thanks for adding it there -- I really think it is a big headline caveat on
> my expectation of "eventual consistency".  Other systems out there give you
> eventual consistency at the millisecond level for most cases, while in this
> initial implementation "eventual" would mean tens of minutes, or at best a
> handful of minutes, behind (with the snapshot flush mechanism)!
>
>
But that's just how the implementation is broken up currently. When WAL
tailing is implemented, we will be close behind, maybe on the order of
seconds.


> There are a handful of other things in the phase one part of the
> implementation section that limit the usefulness of the feature to a
> certain kind of constrained hbase user.  I'll start another thread for
> those.
>
>
Cool. The one thing I just realized is that we might have some additional
work to handle security issues for the shadow regions.


>
> >
> > We are proposing to implement "Region snapshots" first and "Async wal
> > replication" second. As argued, I think wal-tailing only makes sense with
> > WALpr, so that work is left until after we have WAL per region.
> >
> >
> This is our main disagreement -- I'm not convinced that wal tailing only
> makes sense for the wal-per-region hlog implementation.  Instead of
> bouncing around hypotheticals, it sounds like I'll be doing more
> experiments to prove it to myself and to convince you. :)
>
>
>
Thanks :-) The async WAL replication approach outlined in the doc does not
require WALpr, and it also has the advantage that the source itself can
direct the edits to the specific regionservers hosting the replicas in
question.
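A minimal sketch of that source-directed routing (all names here are made
up for illustration; a real implementation would batch, retry, and ship the
edits asynchronously):

```java
import java.util.List;

public class AsyncReplicationSketch {
    // The primary ships each WAL edit directly to the servers hosting that
    // region's replicas, so no WAL tailing or WALpr is needed on the read
    // side. Edit, ReplicaDirectory, and Shipper are hypothetical names.
    record Edit(String region, byte[] payload) {}

    interface ReplicaDirectory { List<String> replicaServers(String region); }

    interface Shipper { void ship(String server, Edit edit); }

    // Look up the replica locations for the edit's region and push to each.
    static int replicate(Edit edit, ReplicaDirectory dir, Shipper shipper) {
        List<String> targets = dir.replicaServers(edit.region());
        for (String server : targets) shipper.ship(server, edit);
        return targets.size();
    }
}
```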


> >
> > >
> > > With the current design it might be best to have a flag on the table
> > > which marks it read-only or bulk-load-only, so that it only gets used
> > > by users when the table is in that mode (and maybe an "escape hatch"
> > > for power users).
> > >
> >
> > I think we have a read-only flag already. We might not have a
> > bulk-load-only flag though. It makes sense to add one if we want to
> > allow bulk loads while preventing other writes.
> >
> > Great.
>
> >
> > >
> > > [snip]
> > > >
> > > > > - I think the two goals are both worthy on their own, each with
> > > > > their own optimal points.  We should make sure in the design that
> > > > > we can support both goals.
> > > > >
> > > >
> > > > I think our proposal is consistent with your doc, and we have
> > > > considered secondary region promotion in the future section. It would
> > > > be good if you can review and comment on whether you see any points
> > > > missing.
> > > >
> > > >
> > > I definitely will. At the moment, I think the hybrid for the wals/hlogs
> > > I suggested in the other thread seems to be an optimal solution
> > > considering locality.  Though feasible, it is obviously more complex
> > > than just one approach alone.
> > >
> > >
> > > > > - I want to make sure the proposed design has a path for optimal
> > > > > fast-consistent read-recovery.
> > > > >
> > > >
> > > > We think that it does, but it is a secondary goal for the initial
> > > > work. I don't see any reason why secondary promotion cannot be built
> > > > on top of this, once the branch is in a better state.
> > > >
> > >
> > > Based on the detail in the design doc and this statement, it sounds
> > > like you have a prototype branch already?  Is this the case?
> > >
> >
> > Indeed. I think that is mentioned in the jira description. We have some
> > parts of the changes for region, region server, HRI, and master. Client
> > changes are on the way. I think we can post that in a github branch for
> > now to share the code early and solicit early reviews.
> >
> I think that would be great.  Back when we did snapshots, we had active
> development against a prototype and spent a bit of time breaking it down
> into manageable, more polished pieces that had slightly lenient reviews.
> This exercise really helped us with our interfaces.  We committed code to
> the dev branch, which limited merge pains and diffs for modifications made
> by different contributors.  In the end, when we had something we were happy
> with on the dev branch, we merged with trunk and fixed bugs/diffs that had
> cropped up in the meantime.  I'd suggest a similar process for this.
>
>
> --
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // jon@cloudera.com
>

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Re: [Shadow Regions / Read Replicas ]

Posted by Jonathan Hsieh <jo...@cloudera.com>.
On Wed, Dec 4, 2013 at 3:56 PM, Stack <st...@duboce.net> wrote:

> On Thu, Dec 5, 2013 at 7:46 AM, Enis Söztutar <en...@gmail.com> wrote:
>
> > I did not know that we were reopening the log file for tailing. From what
> > Nicolas talks about in
> > https://issues.apache.org/jira/browse/HDFS-3219 it seems that the
> visible
> > length is not
> > updated for the open stream which is a shame. However in the append
> design,
> > the primary can
> > send the committed length ( minimum of RAs) to replicas, so that replicas
> > can make that data visible
> > to the client.
> >
> >
> Nit: We could, but then we'd have two 'tailing' workarounds in hbase: the
> file reopen when we run off the end (used by replication), and then the
> suggestion made here.
>
>
I've written a tailing experiment, and I can say that with the newish and
poorly named FS.isFileClosed() call, we can get correctness. If and when
better tailing facilities are added to hdfs, we can take advantage of those
apis instead of today's non-optimal version.

We'd have one version of this if we proceed with the work.
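Roughly, the loop such an experiment exercises looks like this (abstracted
behind an assumed interface, not the real HDFS stream API; a real tailer
would reopen the file to refresh the visible length and back off instead of
spinning):

```java
import java.util.function.Consumer;

public class TailSketch {
    // Assumed interface standing in for an HDFS file we are tailing.
    interface TailableLog {
        boolean isFileClosed();        // writer finished; length is final
        byte[] readFrom(long offset);  // bytes currently visible past offset
    }

    // Keep consuming past the last known offset; stop only when no new
    // bytes are visible AND the file reports closed.
    static long tail(TailableLog log, Consumer<byte[]> apply) {
        long offset = 0;
        while (true) {
            byte[] chunk = log.readFrom(offset);
            if (chunk.length > 0) {
                apply.accept(chunk);
                offset += chunk.length;
            } else if (log.isFileClosed()) {
                return offset; // everything consumed and writer is done
            }
            // in a real tailer: sleep/backoff here instead of spinning
        }
    }
}
```

The isFileClosed() check is what gives correctness at the end of the file:
without it, an empty read is ambiguous between "no data yet" and "no data
ever".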

> [snip]



-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com

Re: [Shadow Regions / Read Replicas ]

Posted by Stack <st...@duboce.net>.
On Thu, Dec 5, 2013 at 7:46 AM, Enis Söztutar <en...@gmail.com> wrote:

> I did not know that we were reopening the log file for tailing. From what
> Nicolas talks about in
> https://issues.apache.org/jira/browse/HDFS-3219 it seems that the visible
> length is not
> updated for the open stream which is a shame. However in the append design,
> the primary can
> send the committed length ( minimum of RAs) to replicas, so that replicas
> can make that data visible
> to the client.
>
>
Nit: We could, but then we'd have two 'tailing' workarounds in hbase: the
file reopen when we run off the end (used by replication), and then the
suggestion made here.


> It would be good if we can implement this in hdfs.
>
>
Yes.  This would be better all around.



> About the minimum work agreed that we should not merge this in unless there
> are real benefits
> demonstrated. That is why we proposed to do the work for phase 1 in a
> branch, and at the end of
> that, we are hoping we can have something useful and working (but without
> wal tailing and async
> wal replication), and we will have a more detailed plan for the remaining
> steps. We would love to
> hear more feedback of how to test / stabilize the feature at the merge
> discussions.
>

Sounds good E.

St.Ack

Re: [Shadow Regions / Read Replicas ]

Posted by Enis Söztutar <en...@gmail.com>.
I did not know that we were reopening the log file for tailing. From what
Nicolas says in https://issues.apache.org/jira/browse/HDFS-3219, it seems
that the visible length is not updated for the open stream, which is a
shame. However, in the append design the primary can send the committed
length (the minimum of the RAs) to the replicas, so that the replicas can
make that data visible to the client.
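That committed length is just a minimum over what the replicas have
acknowledged; as a sketch (illustrative names only, not any real API):

```java
public class CommittedLengthSketch {
    // The primary may advertise as committed only what every replica has
    // acked; a replica can then safely serve reads up to that offset,
    // because every other replica is known to have those bytes too.
    static long committedLength(long[] replicaAckedLengths) {
        if (replicaAckedLengths.length == 0) return 0;
        long min = Long.MAX_VALUE;
        for (long acked : replicaAckedLengths) min = Math.min(min, acked);
        return min;
    }
}
```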

It would be good if we can implement this in hdfs.

About the minimum work: agreed that we should not merge this in unless there
are real benefits demonstrated. That is why we proposed to do the phase 1
work in a branch; at the end of that, we hope to have something useful and
working (but without wal tailing and async wal replication), and we will
have a more detailed plan for the remaining steps. We would love to hear
more feedback on how to test / stabilize the feature at the merge
discussions.

Enis



On Wed, Dec 4, 2013 at 2:47 PM, Stack <st...@duboce.net> wrote:

> A few comments after reading through this thread:
>
> + Thanks for moving the (good) discussion here out of the issue.
> + Testing WAL 'tailing'* would be a good input to have.  My sense is that a
> WALpr would make for about the same load on HDFS (and if so, let's just go
> there altogether).
> + I like the notion of doing the minimum work necessary first, BUT as has
> been said above, we can't add a 'feature' that works for one exotic use
> case only; it will just rot.  Any 'customer' of said addition likely does
> not want to be in a position where they are the only ones using any such
> new addition.
> + I like the list Vladimir makes above.  We need to work on his list too
> but it should be aside from this one.
>
> Thanks,
> St.Ack
>
> * HDFS does not support 'tailing'.  Rather it is a heavyweight reopen of
> the file each time we run off the end of the data.  Doing this for
> replication and then per region replica would impose 'heavy' HDFS loading
> (to be measured)
>
>
> On Thu, Dec 5, 2013 at 6:00 AM, Enis Söztutar <en...@gmail.com> wrote:
>
> > >
> > >
> > > Thanks for adding it there -- I really think it is a big headline caveat
> > > on my expectation of "eventual consistency".  Other systems out there
> > > give you eventual consistency at the millisecond level for most cases,
> > > while this initial implementation would have "eventual" mean tens of
> > > minutes, or even handfuls of minutes, behind (with the snapshot flush
> > > mechanism)!
> >
> >
> > > There are a handful of other things in the phase one part of the
> > > implementation section that limit the usefulness of the feature to a
> > > certain kind of constrained hbase user.  I'll start another thread for
> > > those.
> > >
> > >
> > Yes, hopefully we will not stop with only phase 1, and continue to
> > implement
> > the more-latent async wal replication and/or wal tailing. However phase 1
> > will get us
> > to the point of demonstrating that replicated regions work, the client
> > side of execution
> > is manageable, and there is real benefit for read-only or bulk loaded
> > tables plus some
> > specific use cases for read/write tables.
> >
> >
> > >
> > > >
> > > > We are proposing to implement "Region snapshots" first and "Async wal
> > > > replication" second.
> > > > As argued, I think wal-tailing only makes sense with WALpr, so that
> > > > work is
> > > > left until after we have WAL
> > > > per region.
> > > >
> > > >
> > > This is our main disagreement -- I'm not convinced that wal tailing
> > > only makes sense for the wal per region hlog implementation.  Instead of
> > > bouncing around hypotheticals, it sounds like I'll be doing more
> > > experiments to prove it to myself and to convince you. :)
> > >
> >
> > That would be awesome! Region grouping or other related proposals for
> > efficient wal tailing
> > deserves its own design doc(s).
> >
> >
> > > >
> > > > I think that would be great.  Back when we did snapshots, we had
> active
> > > development against a prototype and spent a bit of time breaking it
> down
> > > into manageable, more polished pieces that had slightly lenient reviews.
> > >  This exercise really helped us with our interfaces.  We committed code
> > to
> > > the dev branch which limited merge pains and diff for modifications
> made
> > by
> > > different contributors.  In the end when we had something we were happy
> > > with on the dev branch we merged with trunk and fixed bugs/diffs that
> > > cropped up in the meantime.  I'd suggest a similar process for this.
> > >
> >
> > Agreed. We can make use of the previous best practices. Shame that we
> > still do not have a read-write git repo.
> >
> >
> > >
> > >
> > > --
> > > // Jonathan Hsieh (shay)
> > > // Software Engineer, Cloudera
> > > // jon@cloudera.com
> > >
> >
>

Re: [Shadow Regions / Read Replicas ]

Posted by Stack <st...@duboce.net>.
A few comments after reading through this thread:

+ Thanks for moving the (good) discussion here out of the issue.
+ Testing WAL 'tailing'* would be a good input to have.  My sense is that a
WALpr would make for about the same load on HDFS (and if so, let's just go
there altogether).
+ I like the notion of doing the minimum work necessary first BUT, as has
been said above, we can't add a 'feature' that only works for one exotic
use case; it will just rot.  Any 'customer' of said addition likely
does not want to be in a position where they are the only ones using any
such new addition.
+ I like the list Vladimir makes above.  We need to work on his list too
but it should be aside from this one.

Thanks,
St.Ack

* HDFS does not support 'tailing'.  Rather it is a heavyweight reopen of
the file each time we run off the end of the data.  Doing this for
replication and then per region replica would impose 'heavy' HDFS loading
(to be measured)
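
To illustrate the reopen pattern (a plain-Java stand-in; a real tailer would re-open via FileSystem.open and seek on the resulting FSDataInputStream, and ByteArrayInputStream merely plays the role of the reopened file here):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;

// Stand-in for the reopen-to-tail pattern: since a stream's visible length is
// fixed when it is opened, the tailer must re-open the file on every poll,
// seek past what it has already consumed, and read whatever is new. The
// repeated open/seek is what makes this heavyweight on HDFS.
class ReopenTailer {
    private long consumed = 0; // bytes already processed

    // One poll cycle; currentFileContents plays the role of the reopened file.
    byte[] poll(byte[] currentFileContents) throws IOException {
        ByteArrayInputStream in = new ByteArrayInputStream(currentFileContents);
        in.skip(consumed);                // seek(consumed) on a real stream
        byte[] fresh = in.readAllBytes(); // edits appended since the last poll
        consumed += fresh.length;
        return fresh;
    }
}
```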


On Thu, Dec 5, 2013 at 6:00 AM, Enis Söztutar <en...@gmail.com> wrote:

> >
> >
> > Thanks for adding it there -- I really think it is a big headline caveat
> > on my expectation of "eventual consistency".  Other systems out there
> > give you eventual consistency at the millisecond level for most cases,
> > while this initial implementation would have "eventual" mean tens of
> > minutes, or even handfuls of minutes, behind (with the snapshot flush
> > mechanism)!
>
>
> > There are a handful of other things in the phase one part of the
> > implementation section that limit the usefulness of the feature to a
> > certain kind of constrained hbase user.  I'll start another thread for
> > those.
> >
> >
> Yes, hopefully we will not stop with only phase 1, and continue to
> implement
> the more-latent async wal replication and/or wal tailing. However phase 1
> will get us
> to the point of demonstrating that replicated regions work, the client
> side of execution
> is manageable, and there is real benefit for read-only or bulk loaded
> tables plus some
> specific use cases for read/write tables.
>
>
> >
> > >
> > > We are proposing to implement "Region snapshots" first and "Async wal
> > > replication" second.
> > As argued, I think wal-tailing only makes sense with WALpr, so that
> > work is
> > > left until after we have WAL
> > > per region.
> > >
> > >
> > This is our main disagreement -- I'm not convinced that wal tailing only
> > makes sense for the wal per region hlog implementation.  Instead of
> > bouncing around hypotheticals, it sounds like I'll be doing more
> > experiments to prove it to myself and to convince you. :)
> >
>
> That would be awesome! Region grouping or other related proposals for
> efficient wal tailing
> deserves its own design doc(s).
>
>
> > >
> > > I think that would be great.  Back when we did snapshots, we had active
> > development against a prototype and spent a bit of time breaking it down
> > into manageable, more polished pieces that had slightly lenient reviews.
> >  This exercise really helped us with our interfaces.  We committed code
> to
> > the dev branch which limited merge pains and diff for modifications made
> by
> > different contributors.  In the end when we had something we were happy
> > with on the dev branch we merged with trunk and fixed bugs/diffs that
> > cropped up in the meantime.  I'd suggest a similar process for this.
> >
>
> Agreed. We can make use of the previous best practices. Shame that we still
> do not have a read-write git repo.
>
>
> >
> >
> > --
> > // Jonathan Hsieh (shay)
> > // Software Engineer, Cloudera
> > // jon@cloudera.com
> >
>

Re: [Shadow Regions / Read Replicas ]

Posted by Enis Söztutar <en...@gmail.com>.
>
>
> Thanks for adding it there -- I really think it is a big headline caveat on
> my expectation of "eventual consistency".  Other systems out there
> give you eventual consistency at the millisecond level for most cases,
> while this initial implementation would have "eventual" mean tens of minutes,
> or even handfuls of minutes, behind (with the snapshot flush mechanism)!


> There are a handful of other things in the phase one part of the
> implementation section that limit the usefulness of the feature to a
> certain kind of constrained hbase user.  I'll start another thread for
> those.
>
>
Yes, hopefully we will not stop with only phase 1, and continue to
implement
the more-latent async wal replication and/or wal tailing. However phase 1
will get us
to the point of demonstrating that replicated regions work, the client
side of execution
is manageable, and there is real benefit for read-only or bulk loaded
tables plus some
specific use cases for read/write tables.
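
To make the client-side contract concrete, a toy model (plain Java; the class and method names are invented for illustration, not the proposed HBase API): a TIMELINE read may be served by a possibly-stale replica, and the result says so, while a STRONG read only ever answers from the primary.

```java
// Toy model of timeline-consistent reads: invented names, not HBase API.
class TimelineRead {
    enum Consistency { STRONG, TIMELINE }

    static class Result {
        final String value;
        final boolean stale; // lets the caller know a replica served the read
        Result(String value, boolean stale) { this.value = value; this.stale = stale; }
    }

    // STRONG reads only succeed against the primary; TIMELINE reads may fall
    // back to a secondary whose data can lag the primary.
    static Result read(Consistency c, String primaryValue, String replicaValue,
                       boolean primaryUp) {
        if (primaryUp) {
            return new Result(primaryValue, false);
        }
        if (c == Consistency.TIMELINE) {
            return new Result(replicaValue, true);
        }
        throw new IllegalStateException("primary unavailable; STRONG read fails");
    }
}
```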


>
> >
> > We are proposing to implement "Region snapshots" first and "Async wal
> > replication" second.
> > As argued, I think wal-tailing only makes sense with WALpr, so that work
> > is
> > left until after we have WAL
> > per region.
> >
> >
> This is our main disagreement -- I'm not convinced that wal tailing only
> makes sense for the wal per region hlog implementation.  Instead of
> bouncing around hypotheticals, it sounds like I'll be doing more
> experiments to prove it to myself and to convince you. :)
>

That would be awesome! Region grouping or other related proposals for
efficient wal tailing
deserves its own design doc(s).


> >
> > I think that would be great.  Back when we did snapshots, we had active
> development against a prototype and spent a bit of time breaking it down
> into manageable, more polished pieces that had slightly lenient reviews.
>  This exercise really helped us with our interfaces.  We committed code to
> the dev branch which limited merge pains and diff for modifications made by
> different contributors.  In the end when we had something we were happy
> with on the dev branch we merged with trunk and fixed bugs/diffs that
> cropped up in the meantime.  I'd suggest a similar process for this.
>

Agreed. We can make use of the previous best practices. Shame that we still
do not have a read-write git repo.


>
>
> --
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // jon@cloudera.com
>

Re: [Shadow Regions / Read Replicas ]

Posted by Jimmy Xiang <jx...@cloudera.com>.
We don't have to ship the edits one by one.  We can use a configurable
batch to control the impact on the network.
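
A minimal sketch of that batching (plain Java; EditBatcher and its names are invented for illustration, not HBase code): edits accumulate until the configured batch size is hit, and a flush drains anything still pending, which is also what a memstore flush would do.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of shipping edits in configurable batches rather than one RPC per
// edit. "Shipping" is modeled as collecting batches; names are invented.
class EditBatcher {
    private final int batchSize;
    private final List<String> pending = new ArrayList<>();
    private final List<List<String>> shipped = new ArrayList<>();

    EditBatcher(int batchSize) { this.batchSize = batchSize; }

    void append(String edit) {
        pending.add(edit);
        if (pending.size() >= batchSize) {
            flush(); // ship a full batch
        }
    }

    // Also the natural hook for a memstore flush: once flushed, pending
    // edits no longer need to be shipped, so the queue drains.
    void flush() {
        if (!pending.isEmpty()) {
            shipped.add(new ArrayList<>(pending));
            pending.clear();
        }
    }

    List<List<String>> shippedBatches() { return shipped; }
}
```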


On Tue, Dec 3, 2013 at 7:59 PM, Jimmy Xiang <jx...@cloudera.com> wrote:

> A separate branch similar to that for snapshot is great. +1.
>
> For wal tailing, we can just skip those edits not for the shadow regions,
> right?
>
> To tail the wal, we need to wait till the wal block is available. There
> seems to be a hard latency.  Is it better to have a pool of daemon threads
> to ship corresponding wal edits directly?  This way, we have better
> control over which edits to ship around. The shadow region will be much closer
> to the primary region.  We don't need a queue for those edits not shipped
> yet.  We can just use the memstore as the queue.  Once the memstore is
> flushed, its contents no longer need to be shipped around.
>
>
>
> On Tue, Dec 3, 2013 at 6:47 PM, Jonathan Hsieh <jo...@cloudera.com> wrote:
>
>> On Tue, Dec 3, 2013 at 2:04 PM, Enis Söztutar <en...@gmail.com> wrote:
>>
>> > On Tue, Dec 3, 2013 at 11:51 AM, Jonathan Hsieh <jo...@cloudera.com>
>> wrote:>
>> >  >
>> > > On Tue, Dec 3, 2013 at 11:07 AM, Enis Söztutar <en...@apache.org>
>> wrote:
>> > >
>> > > > Thanks Jon for bringing this to dev@.
>> > > >
>> > > >
>> > > > On Mon, Dec 2, 2013 at 10:01 PM, Jonathan Hsieh <jo...@cloudera.com>
>> > > wrote:
>> > > >
>> > > > > Fundamentally, I'd prefer focusing on making HBase "HBasier"
>> instead
>> > of
>> > > > > tackling a feature that other systems architecturally can do
>> better
>> > > > > (inconsistent reads).   I consider consistent reads/writes being
>> one
>> > of
>> > > > > HBase's defining features. That said, I think read replicas makes
>> > sense
>> > > > and
>> > > > > is a nice feature to have.
>> > > > >
>> > > >
>> > > > Our design proposal has a specific use case goal, and hopefully we
>> can
>> > > > demonstrate the
>> > > > benefits of having this in HBase so that even more pieces can be
>> built
>> > on
>> > > > top of this. Plus I imagine this will
>> > > > be a widely used feature for read-only tables or bulk loaded
>> tables. We
>> > > are
>> > > > not
>> > > > proposing to rework strong consistency semantics or major
>> > > architectural
>> > > > changes. I think by
>> > > > having the tables to be defined with replication count, and the
>> > proposed
>> > > > client API changes (Consistency definition)
>> > > > plugs into the HBase model rather well.
>> > > >
>> > > >
>> > > I do think that without any recent updating mechanism, we are limiting
>> > > the usefulness of this feature to essentially *only* read-only or
>> > > bulk-load-only tables.  Recency, if there were any edits/updates, would
>> > > be severely lagging (by default potentially an hour), especially in
>> > > cases where there are only a few edits to a primarily bulk-loaded
>> > > table.  This limitation is not mentioned in the tradeoffs or
>> > > requirements (or a non-requirements section) and definitely should be
>> > > listed there.
>> > >
>> >
>> > Obviously the amount of lag you would observe depends on whether you are
>> > using
>> > "Region snapshots", "WAL-Tailing" or "Async wal replication". I think
>> there
>> > are still
>> > use cases where you can live with >1 hour old stale reads, so that
>> "Region
>> > snapshots"
>> > is not *just* for read-only tables. I'll add these to the tradeoffs
>> > section.
>> >
>>
>> Thanks for adding it there -- I really think it is a big headline caveat
>> on my expectation of "eventual consistency".  Other systems out there
>> give you eventual consistency at the millisecond level for most cases,
>> while this initial implementation would have "eventual" mean tens of
>> minutes, or even handfuls of minutes, behind (with the snapshot flush
>> mechanism)!
>>
>> There are a handful of other things in the phase one part of the
>> implementation section that limit the usefulness of the feature to a
>> certain kind of constrained hbase user.  I'll start another thread for
>> those.
>>
>>
>> >
>> > We are proposing to implement "Region snapshots" first and "Async wal
>> > replication" second.
>> > As argued, I think wal-tailing only makes sense with WALpr, so that
>> > work is
>> > left until after we have WAL
>> > per region.
>> >
>> >
>> This is our main disagreement -- I'm not convinced that wal tailing only
>> makes sense for the wal per region hlog implementation.  Instead of
>> bouncing around hypotheticals, it sounds like I'll be doing more
>> experiments to prove it to myself and to convince you. :)
>>
>>
>> >
>> > >
>> > > With the current design it might be best to have a flag on the table
>> > which
>> > > marks it read-only or bulk-load only so that it only gets used by
>> users
>> > > when the table is in that mode?  (and maybe an "escape hatch" for
>> power
>> > > users).
>> > >
>> >
>> > I think we have a read-only flag already. We might not have bulk-load
>> only
>> > flag though. Makes sense to add it
>> > if we want to allow bulk loads while preventing writes.
>> >
>> > Great.
>>
>> >
>> > >
>> > > [snip]
>> > > >
>> > > > - I think the two goals are both worthy on their own each with their
>> > own
>> > > > > optimal points.  We should make sure in the design that we can
>> > > > > support
>> > > > > both goals.
>> > > > >
>> > > >
>> > > > I think our proposal is consistent with your doc, and we have
>> > considered
>> > > > secondary region promotion
>> > > > in the future section. It would be good if you can review and
>> comment
>> > on
>> > > > whether you see any points
>> > > > missing.
>> > > >
>> > > >
>> > > > I definitely will. At the moment, I think the hybrid for the
>> > wals/hlogs I
>> > > suggested in the other thread seems to be an optimal solution
>> considering
>> > > locality.  Though feasible, it is obviously more complex than just one
>> > > approach alone.
>> > >
>> > >
>> > > > > - I want to make sure the proposed design has a path for
>> > > > > optimal fast-consistent read-recovery.
>> > > > >
>> > > >
>> > > > We think that it is, but it is a secondary goal for the initial
>> work. I
>> > > > don't see any reason why secondary
>> > > promotion cannot be built on top of this, once the branch is in a
>> > better
>> > > > state.
>> > > >
>> > >
>> > > Based on the detail in the design doc and this statement it sounds
>> like
>> > you
>> > > have a prototype branch already?  Is this the case?
>> > >
>> >
>> > Indeed. I think that is mentioned in the jira description. We have some
>> > parts of the
>> > changes for region, region server, HRI, and master. Client changes are
>> on
>> > the way.
>> > I think we can post that in a github branch for now to share the code
>> early
>> > and solicit
>> > early reviews.
>> >
>> > I think that would be great.  Back when we did snapshots, we had active
>> development against a prototype and spent a bit of time breaking it down
>> into manageable, more polished pieces that had slightly lenient reviews.
>>  This exercise really helped us with our interfaces.  We committed code to
>> the dev branch which limited merge pains and diff for modifications made
>> by
>> different contributors.  In the end when we had something we were happy
>> with on the dev branch we merged with trunk and fixed bugs/diffs that
>> cropped up in the meantime.  I'd suggest a similar process for this.
>>
>>
>> --
>> // Jonathan Hsieh (shay)
>> // Software Engineer, Cloudera
>> // jon@cloudera.com
>>
>
>

Re: [Shadow Regions / Read Replicas ]

Posted by Jimmy Xiang <jx...@cloudera.com>.
A separate branch similar to that for snapshot is great. +1.

For wal tailing, we can just skip those edits not for the shadow regions,
right?

To tail the wal, we need to wait till the wal block is available. There
seems to be a hard latency.  Is it better to have a pool of daemon threads
to ship corresponding wal edits directly?  This way, we have better
control over which edits to ship around. The shadow region will be much closer
to the primary region.  We don't need a queue for those edits not shipped
yet.  We can just use the memstore as the queue.  Once the memstore is
flushed, its contents no longer need to be shipped around.



On Tue, Dec 3, 2013 at 6:47 PM, Jonathan Hsieh <jo...@cloudera.com> wrote:

> On Tue, Dec 3, 2013 at 2:04 PM, Enis Söztutar <en...@gmail.com> wrote:
>
> > On Tue, Dec 3, 2013 at 11:51 AM, Jonathan Hsieh <jo...@cloudera.com>
> wrote:>
> >  >
> > > On Tue, Dec 3, 2013 at 11:07 AM, Enis Söztutar <en...@apache.org>
> wrote:
> > >
> > > > Thanks Jon for bringing this to dev@.
> > > >
> > > >
> > > > On Mon, Dec 2, 2013 at 10:01 PM, Jonathan Hsieh <jo...@cloudera.com>
> > > wrote:
> > > >
> > > > > Fundamentally, I'd prefer focusing on making HBase "HBasier"
> instead
> > of
> > > > > tackling a feature that other systems architecturally can do better
> > > > > (inconsistent reads).   I consider consistent reads/writes being
> one
> > of
> > > > > HBase's defining features. That said, I think read replicas makes
> > sense
> > > > and
> > > > > is a nice feature to have.
> > > > >
> > > >
> > > > Our design proposal has a specific use case goal, and hopefully we
> can
> > > > demonstrate the
> > > > benefits of having this in HBase so that even more pieces can be
> built
> > on
> > > > top of this. Plus I imagine this will
> > > > be a widely used feature for read-only tables or bulk loaded tables.
> We
> > > are
> > > > not
> > > > proposing to rework strong consistency semantics or major
> > > architectural
> > > > changes. I think by
> > > > having the tables to be defined with replication count, and the
> > proposed
> > > > client API changes (Consistency definition)
> > > > plugs into the HBase model rather well.
> > > >
> > > >
> > > I do think that without any recent updating mechanism, we are limiting
> > > the usefulness of this feature to essentially *only* read-only or
> > > bulk-load-only tables.  Recency, if there were any edits/updates, would
> > > be severely lagging (by default potentially an hour), especially in
> > > cases where there are only a few edits to a primarily bulk-loaded
> > > table.  This limitation is not mentioned in the tradeoffs or
> > > requirements (or a non-requirements section) and definitely should be
> > > listed there.
> > >
> >
> > Obviously the amount of lag you would observe depends on whether you are
> > using
> > "Region snapshots", "WAL-Tailing" or "Async wal replication". I think
> there
> > are still
> > use cases where you can live with >1 hour old stale reads, so that
> "Region
> > snapshots"
> > is not *just* for read-only tables. I'll add these to the tradeoffs
> > section.
> >
>
> Thanks for adding it there -- I really think it is a big headline caveat on
> my expectation of "eventual consistency".  Other systems out there
> give you eventual consistency at the millisecond level for most cases,
> while this initial implementation would have "eventual" mean tens of minutes,
> or even handfuls of minutes, behind (with the snapshot flush mechanism)!
>
> There are a handful of other things in the phase one part of the
> implementation section that limit the usefulness of the feature to a
> certain kind of constrained hbase user.  I'll start another thread for
> those.
>
>
> >
> > We are proposing to implement "Region snapshots" first and "Async wal
> > replication" second.
> > As argued, I think wal-tailing only makes sense with WALpr, so that work
> > is
> > left until after we have WAL
> > per region.
> >
> >
> This is our main disagreement -- I'm not convinced that wal tailing only
> makes sense for the wal per region hlog implementation.  Instead of
> bouncing around hypotheticals, it sounds like I'll be doing more
> experiments to prove it to myself and to convince you. :)
>
>
> >
> > >
> > > With the current design it might be best to have a flag on the table
> > which
> > > marks it read-only or bulk-load only so that it only gets used by users
> > > when the table is in that mode?  (and maybe an "escape hatch" for power
> > > users).
> > >
> >
> > I think we have a read-only flag already. We might not have bulk-load
> only
> > flag though. Makes sense to add it
> > if we want to allow bulk loads while preventing writes.
> >
> > Great.
>
> >
> > >
> > > [snip]
> > > >
> > > > - I think the two goals are both worthy on their own each with their
> > own
> > > > > optimal points.  We should make sure in the design that we can
> > > > > support
> > > > > both goals.
> > > > >
> > > >
> > > > I think our proposal is consistent with your doc, and we have
> > considered
> > > > secondary region promotion
> > > > in the future section. It would be good if you can review and comment
> > on
> > > > whether you see any points
> > > > missing.
> > > >
> > > >
> > > > I definitely will. At the moment, I think the hybrid for the
> > wals/hlogs I
> > > suggested in the other thread seems to be an optimal solution
> considering
> > > locality.  Though feasible, it is obviously more complex than just one
> > > approach alone.
> > >
> > >
> > > > > - I want to make sure the proposed design has a path for optimal
> > > > > fast-consistent read-recovery.
> > > > >
> > > >
> > > > We think that it is, but it is a secondary goal for the initial
> work. I
> > > > don't see any reason why secondary
> > > > promotion cannot be built on top of this, once the branch is in a
> > better
> > > > state.
> > > >
> > >
> > > Based on the detail in the design doc and this statement it sounds like
> > you
> > > have a prototype branch already?  Is this the case?
> > >
> >
> > Indeed. I think that is mentioned in the jira description. We have some
> > parts of the
> > changes for region, region server, HRI, and master. Client changes are on
> > the way.
> > I think we can post that in a github branch for now to share the code
> early
> > and solicit
> > early reviews.
> >
> > I think that would be great.  Back when we did snapshots, we had active
> development against a prototype and spent a bit of time breaking it down
> into manageable, more polished pieces that had slightly lenient reviews.
>  This exercise really helped us with our interfaces.  We committed code to
> the dev branch which limited merge pains and diff for modifications made by
> different contributors.  In the end when we had something we were happy
> with on the dev branch we merged with trunk and fixed bugs/diffs that
> cropped up in the meantime.  I'd suggest a similar process for this.
>
>
> --
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // jon@cloudera.com
>

Re: [Shadow Regions / Read Replicas ]

Posted by Jonathan Hsieh <jo...@cloudera.com>.
On Tue, Dec 3, 2013 at 2:04 PM, Enis Söztutar <en...@gmail.com> wrote:

> On Tue, Dec 3, 2013 at 11:51 AM, Jonathan Hsieh <jo...@cloudera.com> wrote:>
>  >
> > On Tue, Dec 3, 2013 at 11:07 AM, Enis Söztutar <en...@apache.org> wrote:
> >
> > > Thanks Jon for bringing this to dev@.
> > >
> > >
> > > On Mon, Dec 2, 2013 at 10:01 PM, Jonathan Hsieh <jo...@cloudera.com>
> > wrote:
> > >
> > > > Fundamentally, I'd prefer focusing on making HBase "HBasier" instead
> of
> > > > tackling a feature that other systems architecturally can do better
> > > > (inconsistent reads).   I consider consistent reads/writes being one
> of
> > > > HBase's defining features. That said, I think read replicas makes
> sense
> > > and
> > > > is a nice feature to have.
> > > >
> > >
> > > Our design proposal has a specific use case goal, and hopefully we can
> > > demonstrate the
> > > benefits of having this in HBase so that even more pieces can be built
> on
> > > top of this. Plus I imagine this will
> > > be a widely used feature for read-only tables or bulk loaded tables. We
> > are
> > > not
> > > proposing to rework strong consistency semantics or major
> > architectural
> > > changes. I think by
> > > having the tables to be defined with replication count, and the
> proposed
> > > client API changes (Consistency definition)
> > > plugs into the HBase model rather well.
> > >
> > >
> > I do think that without any recent updating mechanism, we are limiting
> > the usefulness of this feature to essentially *only* read-only or
> > bulk-load-only tables.  Recency, if there were any edits/updates, would be
> > severely lagging (by default potentially an hour), especially in cases
> > where there are only a few edits to a primarily bulk-loaded table.  This
> > limitation is not mentioned in the tradeoffs or requirements (or a
> > non-requirements section) and definitely should be listed there.
> >
>
> Obviously the amount of lag you would observe depends on whether you are
> using
> "Region snapshots", "WAL-Tailing" or "Async wal replication". I think there
> are still
> use cases where you can live with >1 hour old stale reads, so that "Region
> snapshots"
> is not *just* for read-only tables. I'll add these to the tradeoffs
> section.
>

Thanks for adding it there -- I really think it is a big headline caveat on
my expectation of "eventual consistency".  Other systems out there
give you eventual consistency at the millisecond level for most cases,
while this initial implementation would have "eventual" mean tens of minutes,
or even handfuls of minutes, behind (with the snapshot flush mechanism)!

There are a handful of other things in the phase one part of the
implementation section that limit the usefulness of the feature to a
certain kind of constrained hbase user.  I'll start another thread for
those.


>
> We are proposing to implement "Region snapshots" first and "Async wal
> replication" second.
> As argued, I think wal-tailing only makes sense with WALpr, so that work is
> left until after we have WAL
> per region.
>
>
This is our main disagreement -- I'm not convinced that wal tailing only
makes sense for the wal per region hlog implementation.  Instead of
bouncing around hypotheticals, it sounds like I'll be doing more
experiments to prove it to myself and to convince you. :)


>
> >
> > With the current design it might be best to have a flag on the table
> which
> > marks it read-only or bulk-load only so that it only gets used by users
> > when the table is in that mode?  (and maybe an "escape hatch" for power
> > users).
> >
>
> I think we have a read-only flag already. We might not have bulk-load only
> flag though. Makes sense to add it
> if we want to allow bulk loads while preventing writes.
>
> Great.

>
> >
> > [snip]
> > >
> > > - I think the two goals are both worthy on their own each with their
> own
> > > > optimal points.  We should make sure in the design that we can
> > > > support
> > > > both goals.
> > > >
> > >
> > > I think our proposal is consistent with your doc, and we have
> considered
> > > secondary region promotion
> > > in the future section. It would be good if you can review and comment
> on
> > > whether you see any points
> > > missing.
> > >
> > >
> > > I definitely will. At the moment, I think the hybrid for the
> wals/hlogs I
> > suggested in the other thread seems to be an optimal solution considering
> > locality.  Though feasible, it is obviously more complex than just one
> > approach alone.
> >
> >
> > > > - I want to make sure the proposed design has a path for optimal
> > > > fast-consistent read-recovery.
> > > >
> > >
> > > We think that it is, but it is a secondary goal for the initial work. I
> > > don't see any reason why secondary
> > > promotion cannot be built on top of this, once the branch is in a
> better
> > > state.
> > >
> >
> > Based on the detail in the design doc and this statement it sounds like
> you
> > have a prototype branch already?  Is this the case?
> >
>
> Indeed. I think that is mentioned in the jira description. We have some
> parts of the
> changes for region, region server, HRI, and master. Client changes are on
> the way.
> I think we can post that in a github branch for now to share the code early
> and solicit
> early reviews.
>
> I think that would be great.  Back when we did snapshots, we had active
development against a prototype and spent a bit of time breaking it down
into manageable, more polished pieces that had slightly lenient reviews.
 This exercise really helped us with our interfaces.  We committed code to
the dev branch which limited merge pains and diff for modifications made by
different contributors.  In the end when we had something we were happy
with on the dev branch we merged with trunk and fixed bugs/diffs that
cropped up in the meantime.  I'd suggest a similar process for this.


-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com

Re: [Shadow Regions / Read Replicas ]

Posted by Enis Söztutar <en...@gmail.com>.
On Tue, Dec 3, 2013 at 11:51 AM, Jonathan Hsieh <jo...@cloudera.com> wrote:

> To keep the discussion focused on the design goals, I'm going start
> referring to enis and deveraj's eventually consistent read replicas as the
> *read replica* design, and consistent fast read recovery mechanism based on
> shadowing/tailing the wals as *shadow regions* or *shadow memstores*.  Can
> we agree on nomenclature?
>

Makes sense.


>
>
> On Tue, Dec 3, 2013 at 11:07 AM, Enis Söztutar <en...@apache.org> wrote:
>
> > Thanks Jon for bringing this to dev@.
> >
> >
> > On Mon, Dec 2, 2013 at 10:01 PM, Jonathan Hsieh <jo...@cloudera.com>
> wrote:
> >
> > > Fundamentally, I'd prefer focusing on making HBase "HBasier" instead of
> > > tackling a feature that other systems architecturally can do better
> > > (inconsistent reads).   I consider consistent reads/writes being one of
> > > HBase's defining features. That said, I think read replicas makes sense
> > and
> > > is a nice feature to have.
> > >
> >
> > Our design proposal has a specific use case goal, and hopefully we can
> > demonstrate the
> > benefits of having this in HBase so that even more pieces can be built on
> > top of this. Plus I imagine this will
> > be a widely used feature for read-only tables or bulk loaded tables. We
> are
> > not
> > proposing of reworking strong consistency semantics or major
> architectural
> > changes. I think by
> > having the tables to be defined with replication count, and the proposed
> > client API changes (Consistency definition)
> > plugs into the HBase model rather well.
> >
> >
> I do think that without any recent updating mechanism, we are limiting
> the usefulness of this feature to essentially *only* read-only or
> bulk-load-only tables.  Recency, if there were any edits/updates, would be
> severely lagging (by default potentially an hour), especially in cases
> where there are only a few edits to a primarily bulk-loaded table.  This
> limitation is not mentioned in the tradeoffs or requirements (or a
> non-requirements section) and definitely should be listed there.
>

Obviously the amount of lag you would observe depends on whether you are
using
"Region snapshots", "WAL-tailing" or "Async wal replication". I think there
are still
use cases where you can live with >1 hour old stale reads, so "Region
snapshots"
is not *just* for read-only tables. I'll add these to the tradeoffs
section.

We are proposing to implement "Region snapshots" first and "Async wal
replication" second.
As argued, I think wal-tailing only makes sense with WAL-per-region (WALpr),
so that work is left until after we have WAL per region.
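To make the proposed client-side Consistency contract a bit more concrete, here is a minimal stand-in sketch of how a timeline read might surface staleness to the caller. All type and method names below are invented for illustration; the real API shape is exactly what HBASE-10070 is still designing:

```java
/**
 * Stand-in sketch of the proposed replica-read client contract: the caller
 * opts into possibly-stale TIMELINE reads, and the result says whether it
 * was served by a lagging secondary. All names here are invented.
 */
public class TimelineReadSketch {
    enum Consistency { STRONG, TIMELINE }

    static final class Result {
        final String value;
        final boolean stale;  // true when served by a secondary replica
        Result(String value, boolean stale) { this.value = value; this.stale = stale; }
    }

    /** Pretend region with a primary and one lagging secondary replica. */
    static Result get(String row, Consistency consistency, boolean primaryUp) {
        if (consistency == Consistency.STRONG || primaryUp) {
            // strong reads always go to (and wait for) the primary
            return new Result("fresh:" + row, false);
        }
        // TIMELINE read falling back to a secondary: data may lag, flag it stale
        return new Result("maybe-old:" + row, true);
    }
}
```

The point of the sketch is only the shape of the contract: staleness is visible per result, so an application can decide whether a lagging answer is acceptable for its use case.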


>
> With the current design it might be best to have a flag on the table which
> marks it read-only or bulk-load only so that it only gets used by users
> when the table is in that mode?  (and maybe an "escape hatch" for power
> users).
>

I think we have a read-only flag already. We might not have a bulk-load-only
flag though. Makes sense to add one
if we want to allow bulk loads while still preventing normal writes.
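If a bulk-load-only mode were added alongside the existing read-only flag, the server-side guard might look roughly like this (the mode names are hypothetical, not an existing HBase API):

```java
/**
 * Hypothetical per-table write-mode guard. A READ_ONLY table rejects all
 * mutations; a BULK_LOAD_ONLY table rejects normal puts but still accepts
 * bulk loads. Flag names are invented for illustration.
 */
public class WriteModeGuard {
    enum TableMode { READ_WRITE, READ_ONLY, BULK_LOAD_ONLY }

    /** Normal client writes are allowed only in the fully writable mode. */
    static boolean allowPut(TableMode mode) {
        return mode == TableMode.READ_WRITE;
    }

    /** Bulk loads are allowed unless the table is fully read-only. */
    static boolean allowBulkLoad(TableMode mode) {
        return mode != TableMode.READ_ONLY;
    }
}
```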


>
> [snip]
> >
> > - I think the two goals are both worthy on their own each with their own
> > > optimal points.  We should in the design makes sure that we can support
> > > both goals.
> > >
> >
> > I think our proposal is consistent with your doc, and we have considered
> > secondary region promotion
> > in the future section. It would be good if you can review and comment on
> > whether you see any points
> > missing.
> >
> >
> > I definitely will. At the moment, I think the hybrid for the wals/hlogs I
> suggested in the other thread seems to be an optimal solution considering
> locality.  Though feasible is obviously more complex than just one approach
> alone.
>
>
> > > - I want to making sure the proposed design have a path for optimal
> > > fast-consistent read-recovery.
> > >
> >
> > We think that it is, but it is a secondary goal for the initial work. I
> > don't see any reason why secondary
> > promotion cannot be build on top of this, once the branch is in a better
> > state.
> >
>
> Based on the detail in the design doc and this statement it sounds like you
> have a prototype branch already?  Is this the case?
>

Indeed. I think that is mentioned in the jira description. We have some
parts of the
changes for region, region server, HRI, and master. Client changes are on
the way.
I think we can post that in a github branch for now to share the code early
and solicit
early reviews.


>
> --
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // jon@cloudera.com
>

Re: [Shadow Regions / Read Replicas ]

Posted by Jonathan Hsieh <jo...@cloudera.com>.
On Tue, Dec 3, 2013 at 2:48 PM, Vladimir Rodionov <vl...@gmail.com> wrote:

> >MTTR and this work is ortagonal. In a distributed system, you cannot
> >differentiate between
> >a process not responding because it is down or it is busy or network is
> >down, or whatnot. Having
> >a couple of seconds detection time is unrealistic. You will end up in a
> >very unstable state where
> >you will be failing servers all over the place. An external beacon also
> >cannot differentiate between
> >the main process not responding because it is busy, or it is down. What
> >happens why there is a temporary
> >network partition.
>
> Be pro-active, predict node failure (slow requests recently), detect
> possible router/network issues (syslog on each node), temporal network
> partitions are bad,  but they usually affect multiple servers - not just
> one. Pro-activity means that Master can disable RS before RS will go down.
> But ,you are right - its totally orthogonal to what you are proposing here.


I think this is a job for a separate daemon-management system.


>

> I am just wondering, if FB claim 99.99% of their HBase availability
> (HBaseCon 2013) may be it is worth borrowing some their ideas? How did they
> achieve this?
>
>
Here's the deck http://www.slideshare.net/cloudera/operations-session-2

here's a quick tl;dr
- focus on rack switch failures
- lower timeouts
- improvements in the regionserver (HBASE-6638 in 0.94.2 / HBASE-6508 not in
yet)
- locality based stuff (we have a version ported to 0.96, but it only really
works in constrained hbase setups like FB's -- it doesn't work with balancing
or splitting at the moment)
- HDFS read from other replica (not in upstream hdfs yet)

Facebook's master is based off the hbase 0.20/0.89 master, which is
significantly different from the hbase master in 0.94/0.96/trunk
today.

Jon.

-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com

Re: [Shadow Regions / Read Replicas ]

Posted by Vladimir Rodionov <vl...@gmail.com>.
>MTTR and this work is ortagonal. In a distributed system, you cannot
>differentiate between
>a process not responding because it is down or it is busy or network is
>down, or whatnot. Having
>a couple of seconds detection time is unrealistic. You will end up in a
>very unstable state where
>you will be failing servers all over the place. An external beacon also
>cannot differentiate between
>the main process not responding because it is busy, or it is down. What
>happens why there is a temporary
>network partition.

Be pro-active: predict node failure (slow requests recently), detect
possible router/network issues (syslog on each node). Temporary network
partitions are bad, but they usually affect multiple servers - not just
one. Pro-activity means that the Master can disable an RS before the RS goes
down. But, you are right - it's totally orthogonal to what you are proposing
here. I am just wondering: if FB claims 99.99% HBase availability
(HBaseCon 2013), maybe it is worth borrowing some of their ideas? How did
they achieve this?
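For what it's worth, the pro-active detection described above could be sketched as a simple per-server latency beacon that flags a regionserver as suspect when its recent tail latency degrades. Everything here (class name, window size, threshold) is an illustrative assumption, not anything HBase ships:

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

/**
 * Illustrative sketch of a pro-active health beacon: keep a sliding window
 * of observed request latencies for one server, and report the server as
 * suspect when the window's p99 exceeds a limit, so a master could drain
 * it before it fails hard. Names and thresholds are invented.
 */
public class LatencyBeacon {
    private final Deque<Long> recentMillis = new ArrayDeque<>();
    private final int window;          // how many recent requests to keep
    private final long p99LimitMillis; // tail-latency threshold

    public LatencyBeacon(int window, long p99LimitMillis) {
        this.window = window;
        this.p99LimitMillis = p99LimitMillis;
    }

    /** Record one observed request latency, evicting the oldest sample. */
    public void record(long millis) {
        recentMillis.addLast(millis);
        if (recentMillis.size() > window) {
            recentMillis.removeFirst();
        }
    }

    /** True when the p99 of the recent window exceeds the limit. */
    public boolean suspect() {
        if (recentMillis.isEmpty()) return false;
        long[] sorted = recentMillis.stream().mapToLong(Long::longValue).toArray();
        Arrays.sort(sorted);
        long p99 = sorted[Math.min(sorted.length - 1, (int) (sorted.length * 0.99))];
        return p99 > p99LimitMillis;
    }
}
```

As the thread notes, the hard part is not computing this signal but interpreting it: a slow server and a partitioned server look the same from outside.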



On Tue, Dec 3, 2013 at 2:18 PM, Enis Söztutar <en...@gmail.com> wrote:

> On Tue, Dec 3, 2013 at 12:31 PM, Vladimir Rodionov
> <vl...@gmail.com>wrote:
>
> > The downside:
> >
> > - Double/Triple memstore usage
> > - Increased block cache usage (effectively, block cache will have 50%
> > capacity may be less)
>
>
> These are covered at the tradeoff section at the design doc.
>
>
> >
> >
> These downsides are pretty serious ones. This will result:
> >
> > 1. in decreased overall performance due to decreased efficient block
> cache
> > size
> >
>
> You can elect to not fill up the block cache for secondary reads. It will
> be a configuration option, and a
> tradeoff you may or may not want to pay. Details are in the doc.
>
>
> >  2. In more frequent memstore flushes - this will affect compaction and
> > write tput.
> >
>
> More frequent flushes is not needed unless you are using region snapshots
> approach,
> and want to bound the lag better. It is a tradeoff between expected lag vs
> more
> write amplification.
>
>
> >
> > I do not believe that  HBase 'large' MTTR does not allow to meet 99% SLA.
> > of 10-20ms unless your RSs go down 2-3 times a day for several minutes
> each
> > time. You have to analyze first why are you having so frequent failures,
> > than fix the root source of the problem. Its possible to reduce
> 'detection'
> > phase in MTTR process to couple seconds either by using external beacon
> > process (as I suggested already) or by rewriting some code inside HBase
> and
> > NameNode to move all data out from Java heap to off-heap and reducing
> > GC-induced timeouts from 30 sec to 1-2 sec max. Its tough, but doable.
> The
> > result: you will decrease MTTR by 50% at least w/o sacrificing the
> overall
> > cluster performance.
> >
> > I think, its RS and NN large heaps   and frequent s-t-w GC  activities
> > prevents meeting strict SLA - not occasional server failures.
> >
>
> MTTR and this work is ortagonal. In a distributed system, you cannot
> differentiate between
> a process not responding because it is down or it is busy or network is
> down, or whatnot. Having
> a couple of seconds detection time is unrealistic. You will end up in a
> very unstable state where
> you will be failing servers all over the place. An external beacon also
> cannot differentiate between
> the main process not responding because it is busy, or it is down. What
> happens why there is a temporary
> network partition.
>
>
>
> >
> >
> >
> > On Tue, Dec 3, 2013 at 11:51 AM, Jonathan Hsieh <jo...@cloudera.com>
> wrote:
> >
> > > To keep the discussion focused on the design goals, I'm going start
> > > referring to enis and deveraj's eventually consistent read replicas as
> > the
> > > *read replica* design, and consistent fast read recovery mechanism
> based
> > on
> > > shadowing/tailing the wals as *shadow regions* or *shadow memstores*.
> >  Can
> > > we agree on nomenclature?
> > >
> > >
> > > On Tue, Dec 3, 2013 at 11:07 AM, Enis Söztutar <en...@apache.org>
> wrote:
> > >
> > > > Thanks Jon for bringing this to dev@.
> > > >
> > > >
> > > > On Mon, Dec 2, 2013 at 10:01 PM, Jonathan Hsieh <jo...@cloudera.com>
> > > wrote:
> > > >
> > > > > Fundamentally, I'd prefer focusing on making HBase "HBasier"
> instead
> > of
> > > > > tackling a feature that other systems architecturally can do better
> > > > > (inconsistent reads).   I consider consistent reads/writes being
> one
> > of
> > > > > HBase's defining features. That said, I think read replicas makes
> > sense
> > > > and
> > > > > is a nice feature to have.
> > > > >
> > > >
> > > > Our design proposal has a specific use case goal, and hopefully we
> can
> > > > demonstrate the
> > > > benefits of having this in HBase so that even more pieces can be
> built
> > on
> > > > top of this. Plus I imagine this will
> > > > be a widely used feature for read-only tables or bulk loaded tables.
> We
> > > are
> > > > not
> > > > proposing of reworking strong consistency semantics or major
> > > architectural
> > > > changes. I think by
> > > > having the tables to be defined with replication count, and the
> > proposed
> > > > client API changes (Consistency definition)
> > > > plugs well into the HBase model rather well.
> > > >
> > > >
> > > I do agree think that without any recent updating mechanism, we are
> > > limiting this usefulness of this feature to essentially *only* the
> > > read-only or bulk load only tables.  Recency if there were any
> > > edits/updates would be severely lagging (by default potentially an
> hour)
> > > especially in cases where there are only a few edits to a primarily
> bulk
> > > loaded table.  This limitation is not mentioned in the tradeoffs or
> > > requirements (or a non-requirements section) definitely should be
> listed
> > > there.
> > >
> > > With the current design it might be best to have a flag on the table
> > which
> > > marks it read-only or bulk-load only so that it only gets used by users
> > > when the table is in that mode?  (and maybe an "escape hatch" for power
> > > users).
> > >
> > > [snip]
> > > >
> > > > - I think the two goals are both worthy on their own each with their
> > own
> > > > > optimal points.  We should in the design makes sure that we can
> > support
> > > > > both goals.
> > > > >
> > > >
> > > > I think our proposal is consistent with your doc, and we have
> > considered
> > > > secondary region promotion
> > > > in the future section. It would be good if you can review and comment
> > on
> > > > whether you see any points
> > > > missing.
> > > >
> > > >
> > > > I definitely will. At the moment, I think the hybrid for the
> > wals/hlogs I
> > > suggested in the other thread seems to be an optimal solution
> considering
> > > locality.  Though feasible is obviously more complex than just one
> > approach
> > > alone.
> > >
> > >
> > > > > - I want to making sure the proposed design have a path for optimal
> > > > > fast-consistent read-recovery.
> > > > >
> > > >
> > > > We think that it is, but it is a secondary goal for the initial
> work. I
> > > > don't see any reason why secondary
> > > > promotion cannot be build on top of this, once the branch is in a
> > better
> > > > state.
> > > >
> > >
> > > Based on the detail in the design doc and this statement it sounds like
> > you
> > > have a prototype branch already?  Is this the case?
> > >
> > > --
> > > // Jonathan Hsieh (shay)
> > > // Software Engineer, Cloudera
> > > // jon@cloudera.com
> > >
> >
>

Re: Re: [Shadow Regions / Read Replicas ]

Posted by 谢良 <xi...@xiaomi.com>.
Hi Enis,

Thanks for the reply. I have realized that we still need the ability to read
from a secondary node to achieve a lower 99th or 99.9th percentile read
latency, e.g. during a big GC on an RS node.
I have an idea to implement this ability on the hbase-client side: we could
issue the read request to the slave cluster as well, which would let us:
1) warm up the slave cluster, so we have more performance confidence when
switching traffic to the slave cluster if the current master cluster suffers
a breakdown or similar;
2) get very realistic stress-testing results :)
We could implement several read policies, similar to the ones in your design.
3) this behavior would be similar to the traditional RDBMS pattern: write to
the master, and read from the slave or slave+master :) making the system
scale for reads.
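The client-side read-to-both-clusters idea above is essentially a backup (speculative) request. A minimal sketch of the pattern follows, with the actual cluster reads stubbed out as suppliers, since the real code would wrap Table.get() calls on two separate cluster connections (wiring omitted here):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/**
 * Sketch of a backup-request read: fire the primary-cluster read, and if it
 * has not answered within a small delay, also fire the secondary-cluster
 * read and take whichever finishes first. The cluster reads are stubbed as
 * Suppliers; names and the delay value are illustrative only.
 */
public class BackupRequestRead {
    static String read(Supplier<String> primary,
                       Supplier<String> secondary,
                       long backupDelayMillis) throws Exception {
        CompletableFuture<String> p = CompletableFuture.supplyAsync(primary);
        try {
            // fast path: primary answers within the delay
            return p.get(backupDelayMillis, TimeUnit.MILLISECONDS);
        } catch (java.util.concurrent.TimeoutException e) {
            // primary is slow (e.g. a GC pause): race it against the secondary
            CompletableFuture<String> s = CompletableFuture.supplyAsync(secondary);
            return (String) CompletableFuture.anyOf(p, s).get();
        }
    }
}
```

This cuts the latency tail without touching the server side, at the cost of extra cross-cluster read traffic and, as noted below, only working when replication is running.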

The main shortcoming of the above is that it is only suitable when
replication is running. But I still like it, and indeed I plan to write
something next week to do prototype testing. I still have the same concern
as before: the cost is heavy if you want to enable reads from the secondary
RS in the same cluster, with the block cache always warmed up to achieve the
lower read latency.

Sorry, my comments are probably not about the main design point (HA for
reads), but focus on latency.

Thanks,
________________________________________
From: Enis Söztutar [enis.soz@gmail.com]
Sent: December 10, 2013 5:24
To: dev@hbase.apache.org
Subject: Re: Re: [Shadow Regions / Read Replicas ]

We are also proposing to implement HBASE-7509 as a part of this major
undertaking. HBASE-7509 will help with HBase in general (even if you are
not using HBASE-10070), and possibly some other hdfs clients.
HBASE-10070 will give you similar benefits to HBASE-7509 if your use case
needs that, but on the hbase layer which will sit on top of HBASE-7509.

Enis


On Sat, Dec 7, 2013 at 5:39 AM, 谢良 <xi...@xiaomi.com> wrote:

> For one advantage of this design(ability to do low latency reads with
> <20ms 99.9% latencies for stale reads), to me, i more prefer to hbase-7509
> solution, Since if you want to ganrantee similar high performance read
> ability in
> shadow regions, then you must let the shadow rs warmup the related hot
> blocks
> into block cache.(In deed, i have a similar worry with Vladimir).
> I tried to think how this design could beat hbase-7509 on cutting the
> latency tail,
> but no result still.
>
> Enis, could you share your thoughts on it? thanks
>
> Thanks,
>
> ________________________________________
> From: Enis Söztutar [enis.soz@gmail.com]
> Sent: December 4, 2013 6:18
> To: dev@hbase.apache.org
> Subject: Re: [Shadow Regions / Read Replicas ]
>
> On Tue, Dec 3, 2013 at 12:31 PM, Vladimir Rodionov
> <vl...@gmail.com>wrote:
>
> > The downside:
> >
> > - Double/Triple memstore usage
> > - Increased block cache usage (effectively, block cache will have 50%
> > capacity may be less)
>
>
> These are covered at the tradeoff section at the design doc.
>
>
> >
> >
> These downsides are pretty serious ones. This will result:
> >
> > 1. in decreased overall performance due to decreased efficient block
> cache
> > size
> >
>
> You can elect to not fill up the block cache for secondary reads. It will
> be a configuration option, and a
> tradeoff you may or may not want to pay. Details are in the doc.
>
>
> >  2. In more frequent memstore flushes - this will affect compaction and
> > write tput.
> >
>
> More frequent flushes is not needed unless you are using region snapshots
> approach,
> and want to bound the lag better. It is a tradeoff between expected lag vs
> more
> write amplification.
>
>
> >
> > I do not believe that  HBase 'large' MTTR does not allow to meet 99% SLA.
> > of 10-20ms unless your RSs go down 2-3 times a day for several minutes
> each
> > time. You have to analyze first why are you having so frequent failures,
> > than fix the root source of the problem. Its possible to reduce
> 'detection'
> > phase in MTTR process to couple seconds either by using external beacon
> > process (as I suggested already) or by rewriting some code inside HBase
> and
> > NameNode to move all data out from Java heap to off-heap and reducing
> > GC-induced timeouts from 30 sec to 1-2 sec max. Its tough, but doable.
> The
> > result: you will decrease MTTR by 50% at least w/o sacrificing the
> overall
> > cluster performance.
> >
> > I think, its RS and NN large heaps   and frequent s-t-w GC  activities
> > prevents meeting strict SLA - not occasional server failures.
> >
>
> MTTR and this work is ortagonal. In a distributed system, you cannot
> differentiate between
> a process not responding because it is down or it is busy or network is
> down, or whatnot. Having
> a couple of seconds detection time is unrealistic. You will end up in a
> very unstable state where
> you will be failing servers all over the place. An external beacon also
> cannot differentiate between
> the main process not responding because it is busy, or it is down. What
> happens why there is a temporary
> network partition.
>
>
>
> >
> >
> >
> > On Tue, Dec 3, 2013 at 11:51 AM, Jonathan Hsieh <jo...@cloudera.com>
> wrote:
> >
> > > To keep the discussion focused on the design goals, I'm going start
> > > referring to enis and deveraj's eventually consistent read replicas as
> > the
> > > *read replica* design, and consistent fast read recovery mechanism
> based
> > on
> > > shadowing/tailing the wals as *shadow regions* or *shadow memstores*.
> >  Can
> > > we agree on nomenclature?
> > >
> > >
> > > On Tue, Dec 3, 2013 at 11:07 AM, Enis Söztutar <en...@apache.org>
> wrote:
> > >
> > > > Thanks Jon for bringing this to dev@.
> > > >
> > > >
> > > > On Mon, Dec 2, 2013 at 10:01 PM, Jonathan Hsieh <jo...@cloudera.com>
> > > wrote:
> > > >
> > > > > Fundamentally, I'd prefer focusing on making HBase "HBasier"
> instead
> > of
> > > > > tackling a feature that other systems architecturally can do better
> > > > > (inconsistent reads).   I consider consistent reads/writes being
> one
> > of
> > > > > HBase's defining features. That said, I think read replicas makes
> > sense
> > > > and
> > > > > is a nice feature to have.
> > > > >
> > > >
> > > > Our design proposal has a specific use case goal, and hopefully we
> can
> > > > demonstrate the
> > > > benefits of having this in HBase so that even more pieces can be
> built
> > on
> > > > top of this. Plus I imagine this will
> > > > be a widely used feature for read-only tables or bulk loaded tables.
> We
> > > are
> > > > not
> > > > proposing of reworking strong consistency semantics or major
> > > architectural
> > > > changes. I think by
> > > > having the tables to be defined with replication count, and the
> > proposed
> > > > client API changes (Consistency definition)
> > > > plugs well into the HBase model rather well.
> > > >
> > > >
> > > I do agree think that without any recent updating mechanism, we are
> > > limiting this usefulness of this feature to essentially *only* the
> > > read-only or bulk load only tables.  Recency if there were any
> > > edits/updates would be severely lagging (by default potentially an
> hour)
> > > especially in cases where there are only a few edits to a primarily
> bulk
> > > loaded table.  This limitation is not mentioned in the tradeoffs or
> > > requirements (or a non-requirements section) definitely should be
> listed
> > > there.
> > >
> > > With the current design it might be best to have a flag on the table
> > which
> > > marks it read-only or bulk-load only so that it only gets used by users
> > > when the table is in that mode?  (and maybe an "escape hatch" for power
> > > users).
> > >
> > > [snip]
> > > >
> > > > - I think the two goals are both worthy on their own each with their
> > own
> > > > > optimal points.  We should in the design makes sure that we can
> > support
> > > > > both goals.
> > > > >
> > > >
> > > > I think our proposal is consistent with your doc, and we have
> > considered
> > > > secondary region promotion
> > > > in the future section. It would be good if you can review and comment
> > on
> > > > whether you see any points
> > > > missing.
> > > >
> > > >
> > > > I definitely will. At the moment, I think the hybrid for the
> > wals/hlogs I
> > > suggested in the other thread seems to be an optimal solution
> considering
> > > locality.  Though feasible is obviously more complex than just one
> > approach
> > > alone.
> > >
> > >
> > > > > - I want to making sure the proposed design have a path for optimal
> > > > > fast-consistent read-recovery.
> > > > >
> > > >
> > > > We think that it is, but it is a secondary goal for the initial
> work. I
> > > > don't see any reason why secondary
> > > > promotion cannot be build on top of this, once the branch is in a
> > better
> > > > state.
> > > >
> > >
> > > Based on the detail in the design doc and this statement it sounds like
> > you
> > > have a prototype branch already?  Is this the case?
> > >
> > > --
> > > // Jonathan Hsieh (shay)
> > > // Software Engineer, Cloudera
> > > // jon@cloudera.com
> > >
> >
>

Re: Re: [Shadow Regions / Read Replicas ]

Posted by Enis Söztutar <en...@gmail.com>.
We are also proposing to implement HBASE-7509 as a part of this major
undertaking. HBASE-7509 will help with HBase in general (even if you are
not using HBASE-10070), and possibly some other hdfs clients.
HBASE-10070 will give you similar benefits to HBASE-7509 if your use case
needs that, but on the hbase layer which will sit on top of HBASE-7509.

Enis


On Sat, Dec 7, 2013 at 5:39 AM, 谢良 <xi...@xiaomi.com> wrote:

> For one advantage of this design(ability to do low latency reads with
> <20ms 99.9% latencies for stale reads), to me, i more prefer to hbase-7509
> solution, Since if you want to ganrantee similar high performance read
> ability in
> shadow regions, then you must let the shadow rs warmup the related hot
> blocks
> into block cache.(In deed, i have a similar worry with Vladimir).
> I tried to think how this design could beat hbase-7509 on cutting the
> latency tail,
> but no result still.
>
> Enis, could you share your thoughts on it? thanks
>
> Thanks,
>
> ________________________________________
> From: Enis Söztutar [enis.soz@gmail.com]
> Sent: December 4, 2013 6:18
> To: dev@hbase.apache.org
> Subject: Re: [Shadow Regions / Read Replicas ]
>
> On Tue, Dec 3, 2013 at 12:31 PM, Vladimir Rodionov
> <vl...@gmail.com>wrote:
>
> > The downside:
> >
> > - Double/Triple memstore usage
> > - Increased block cache usage (effectively, block cache will have 50%
> > capacity may be less)
>
>
> These are covered at the tradeoff section at the design doc.
>
>
> >
> >
> These downsides are pretty serious ones. This will result:
> >
> > 1. in decreased overall performance due to decreased efficient block
> cache
> > size
> >
>
> You can elect to not fill up the block cache for secondary reads. It will
> be a configuration option, and a
> tradeoff you may or may not want to pay. Details are in the doc.
>
>
> >  2. In more frequent memstore flushes - this will affect compaction and
> > write tput.
> >
>
> More frequent flushes is not needed unless you are using region snapshots
> approach,
> and want to bound the lag better. It is a tradeoff between expected lag vs
> more
> write amplification.
>
>
> >
> > I do not believe that  HBase 'large' MTTR does not allow to meet 99% SLA.
> > of 10-20ms unless your RSs go down 2-3 times a day for several minutes
> each
> > time. You have to analyze first why are you having so frequent failures,
> > than fix the root source of the problem. Its possible to reduce
> 'detection'
> > phase in MTTR process to couple seconds either by using external beacon
> > process (as I suggested already) or by rewriting some code inside HBase
> and
> > NameNode to move all data out from Java heap to off-heap and reducing
> > GC-induced timeouts from 30 sec to 1-2 sec max. Its tough, but doable.
> The
> > result: you will decrease MTTR by 50% at least w/o sacrificing the
> overall
> > cluster performance.
> >
> > I think, its RS and NN large heaps   and frequent s-t-w GC  activities
> > prevents meeting strict SLA - not occasional server failures.
> >
>
> MTTR and this work is ortagonal. In a distributed system, you cannot
> differentiate between
> a process not responding because it is down or it is busy or network is
> down, or whatnot. Having
> a couple of seconds detection time is unrealistic. You will end up in a
> very unstable state where
> you will be failing servers all over the place. An external beacon also
> cannot differentiate between
> the main process not responding because it is busy, or it is down. What
> happens why there is a temporary
> network partition.
>
>
>
> >
> >
> >
> > On Tue, Dec 3, 2013 at 11:51 AM, Jonathan Hsieh <jo...@cloudera.com>
> wrote:
> >
> > > To keep the discussion focused on the design goals, I'm going start
> > > referring to enis and deveraj's eventually consistent read replicas as
> > the
> > > *read replica* design, and consistent fast read recovery mechanism
> based
> > on
> > > shadowing/tailing the wals as *shadow regions* or *shadow memstores*.
> >  Can
> > > we agree on nomenclature?
> > >
> > >
> > > On Tue, Dec 3, 2013 at 11:07 AM, Enis Söztutar <en...@apache.org>
> wrote:
> > >
> > > > Thanks Jon for bringing this to dev@.
> > > >
> > > >
> > > > On Mon, Dec 2, 2013 at 10:01 PM, Jonathan Hsieh <jo...@cloudera.com>
> > > wrote:
> > > >
> > > > > Fundamentally, I'd prefer focusing on making HBase "HBasier"
> instead
> > of
> > > > > tackling a feature that other systems architecturally can do better
> > > > > (inconsistent reads).   I consider consistent reads/writes being
> one
> > of
> > > > > HBase's defining features. That said, I think read replicas makes
> > sense
> > > > and
> > > > > is a nice feature to have.
> > > > >
> > > >
> > > > Our design proposal has a specific use case goal, and hopefully we
> can
> > > > demonstrate the
> > > > benefits of having this in HBase so that even more pieces can be
> built
> > on
> > > > top of this. Plus I imagine this will
> > > > be a widely used feature for read-only tables or bulk loaded tables.
> We
> > > are
> > > > not
> > > > proposing of reworking strong consistency semantics or major
> > > architectural
> > > > changes. I think by
> > > > having the tables to be defined with replication count, and the
> > proposed
> > > > client API changes (Consistency definition)
> > > > plugs well into the HBase model rather well.
> > > >
> > > >
> > > I do agree think that without any recent updating mechanism, we are
> > > limiting this usefulness of this feature to essentially *only* the
> > > read-only or bulk load only tables.  Recency if there were any
> > > edits/updates would be severely lagging (by default potentially an
> hour)
> > > especially in cases where there are only a few edits to a primarily
> bulk
> > > loaded table.  This limitation is not mentioned in the tradeoffs or
> > > requirements (or a non-requirements section) definitely should be
> listed
> > > there.
> > >
> > > With the current design it might be best to have a flag on the table
> > which
> > > marks it read-only or bulk-load only so that it only gets used by users
> > > when the table is in that mode?  (and maybe an "escape hatch" for power
> > > users).
> > >
> > > [snip]
> > > >
> > > > - I think the two goals are both worthy on their own each with their
> > own
> > > > > optimal points.  We should in the design makes sure that we can
> > support
> > > > > both goals.
> > > > >
> > > >
> > > > I think our proposal is consistent with your doc, and we have
> > considered
> > > > secondary region promotion
> > > > in the future section. It would be good if you can review and comment
> > on
> > > > whether you see any points
> > > > missing.
> > > >
> > > >
> > > > I definitely will. At the moment, I think the hybrid for the
> > wals/hlogs I
> > > suggested in the other thread seems to be an optimal solution
> considering
> > > locality.  Though feasible is obviously more complex than just one
> > approach
> > > alone.
> > >
> > >
> > > > > - I want to making sure the proposed design have a path for optimal
> > > > > fast-consistent read-recovery.
> > > > >
> > > >
> > > > We think that it is, but it is a secondary goal for the initial
> work. I
> > > > don't see any reason why secondary
> > > > promotion cannot be build on top of this, once the branch is in a
> > better
> > > > state.
> > > >
> > >
> > > Based on the detail in the design doc and this statement it sounds like
> > you
> > > have a prototype branch already?  Is this the case?
> > >
> > > --
> > > // Jonathan Hsieh (shay)
> > > // Software Engineer, Cloudera
> > > // jon@cloudera.com
> > >
> >
>

Re: [Shadow Regions / Read Replicas ]

Posted by 谢良 <xi...@xiaomi.com>.
For one advantage of this design (the ability to do low-latency reads with
<20ms 99.9th-percentile latencies for stale reads), I prefer the HBASE-7509
solution, since if you want to guarantee similarly high read performance in
the shadow regions, you must let the shadow RS warm the related hot blocks
into its block cache. (Indeed, I have a worry similar to Vladimir's.)
I tried to think of how this design could beat HBASE-7509 at cutting the
latency tail, but still have no answer.

Enis, could you share your thoughts on it?

Thanks,

________________________________________
From: Enis Söztutar [enis.soz@gmail.com]
Sent: December 4, 2013 6:18
To: dev@hbase.apache.org
Subject: Re: [Shadow Regions / Read Replicas ]

On Tue, Dec 3, 2013 at 12:31 PM, Vladimir Rodionov
<vl...@gmail.com> wrote:

> The downside:
>
> - Double/Triple memstore usage
> - Increased block cache usage (effectively, block cache will have 50%
> capacity may be less)


These are covered at the tradeoff section at the design doc.


>
>
These downsides are pretty serious ones. This will result:
>
> 1. in decreased overall performance due to decreased efficient block cache
> size
>

You can elect to not fill up the block cache for secondary reads. It will
be a configuration option, and a
tradeoff you may or may not want to pay. Details are in the doc.


>  2. In more frequent memstore flushes - this will affect compaction and
> write tput.
>

More frequent flushes is not needed unless you are using region snapshots
approach,
and want to bound the lag better. It is a tradeoff between expected lag vs
more
write amplification.


>
> I do not believe that  HBase 'large' MTTR does not allow to meet 99% SLA.
> of 10-20ms unless your RSs go down 2-3 times a day for several minutes each
> time. You have to analyze first why are you having so frequent failures,
> than fix the root source of the problem. Its possible to reduce 'detection'
> phase in MTTR process to couple seconds either by using external beacon
> process (as I suggested already) or by rewriting some code inside HBase and
> NameNode to move all data out from Java heap to off-heap and reducing
> GC-induced timeouts from 30 sec to 1-2 sec max. Its tough, but doable. The
> result: you will decrease MTTR by 50% at least w/o sacrificing the overall
> cluster performance.
>
> I think, its RS and NN large heaps   and frequent s-t-w GC  activities
> prevents meeting strict SLA - not occasional server failures.
>

MTTR and this work are orthogonal. In a distributed system, you cannot
differentiate between
a process not responding because it is down, or it is busy, or the network is
down, or whatnot. Having
a couple of seconds detection time is unrealistic. You will end up in a
very unstable state where
you will be failing servers all over the place. An external beacon also
cannot differentiate between
the main process not responding because it is busy, or it is down. What
happens when there is a temporary
network partition?



>
>
>
> On Tue, Dec 3, 2013 at 11:51 AM, Jonathan Hsieh <jo...@cloudera.com> wrote:
>
> > To keep the discussion focused on the design goals, I'm going start
> > referring to enis and deveraj's eventually consistent read replicas as
> the
> > *read replica* design, and consistent fast read recovery mechanism based
> on
> > shadowing/tailing the wals as *shadow regions* or *shadow memstores*.
>  Can
> > we agree on nomenclature?
> >
> >
> > On Tue, Dec 3, 2013 at 11:07 AM, Enis Söztutar <en...@apache.org> wrote:
> >
> > > Thanks Jon for bringing this to dev@.
> > >
> > >
> > > On Mon, Dec 2, 2013 at 10:01 PM, Jonathan Hsieh <jo...@cloudera.com>
> > wrote:
> > >
> > > > Fundamentally, I'd prefer focusing on making HBase "HBasier" instead
> of
> > > > tackling a feature that other systems architecturally can do better
> > > > (inconsistent reads).   I consider consistent reads/writes being one
> of
> > > > HBase's defining features. That said, I think read replicas makes
> sense
> > > and
> > > > is a nice feature to have.
> > > >
> > >
> > > Our design proposal has a specific use case goal, and hopefully we can
> > > demonstrate the
> > > benefits of having this in HBase so that even more pieces can be built
> on
> > > top of this. Plus I imagine this will
> > > be a widely used feature for read-only tables or bulk loaded tables. We
> > are
> > > not
> > > proposing of reworking strong consistency semantics or major
> > architectural
> > > changes. I think by
> > > having the tables to be defined with replication count, and the
> proposed
> > > client API changes (Consistency definition)
> > > plugs well into the HBase model rather well.
> > >
> > >
> > I do agree think that without any recent updating mechanism, we are
> > limiting this usefulness of this feature to essentially *only* the
> > read-only or bulk load only tables.  Recency if there were any
> > edits/updates would be severely lagging (by default potentially an hour)
> > especially in cases where there are only a few edits to a primarily bulk
> > loaded table.  This limitation is not mentioned in the tradeoffs or
> > requirements (or a non-requirements section) definitely should be listed
> > there.
> >
> > With the current design it might be best to have a flag on the table
> which
> > marks it read-only or bulk-load only so that it only gets used by users
> > when the table is in that mode?  (and maybe an "escape hatch" for power
> > users).
> >
> > [snip]
> > >
> > > - I think the two goals are both worthy on their own each with their
> own
> > > > optimal points.  We should in the design makes sure that we can
> support
> > > > both goals.
> > > >
> > >
> > > I think our proposal is consistent with your doc, and we have
> considered
> > > secondary region promotion
> > > in the future section. It would be good if you can review and comment
> on
> > > whether you see any points
> > > missing.
> > >
> > >
> > > I definitely will. At the moment, I think the hybrid for the
> wals/hlogs I
> > suggested in the other thread seems to be an optimal solution considering
> > locality.  Though feasible is obviously more complex than just one
> approach
> > alone.
> >
> >
> > > > - I want to making sure the proposed design have a path for optimal
> > > > fast-consistent read-recovery.
> > > >
> > >
> > > We think that it is, but it is a secondary goal for the initial work. I
> > > don't see any reason why secondary
> > > promotion cannot be build on top of this, once the branch is in a
> better
> > > state.
> > >
> >
> > Based on the detail in the design doc and this statement it sounds like
> you
> > have a prototype branch already?  Is this the case?
> >
> > --
> > // Jonathan Hsieh (shay)
> > // Software Engineer, Cloudera
> > // jon@cloudera.com
> >
>

Re: [Shadow Regions / Read Replicas ]

Posted by Enis Söztutar <en...@gmail.com>.
On Tue, Dec 3, 2013 at 12:31 PM, Vladimir Rodionov
<vl...@gmail.com>wrote:

> The downside:
>
> - Double/Triple memstore usage
> - Increased block cache usage (effectively, block cache will have 50%
> capacity may be less)


These are covered in the tradeoffs section of the design doc.


>
>
These downsides are pretty serious ones. This will result:
>
> 1. in decreased overall performance due to decreased efficient block cache
> size
>

You can elect not to fill the block cache on secondary reads. It will be a
configuration option, and a tradeoff you may or may not want to make.
Details are in the doc.
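
A minimal sketch of that tradeoff (a toy LRU cache, not HBase's actual
block cache implementation; the `cache_secondary_reads` flag is a
hypothetical stand-in for the configuration option described above): when
secondary-replica reads also populate the cache, their blocks compete with
the primary's working set for the same capacity.

```python
from collections import OrderedDict

class BlockCache:
    """Tiny LRU cache illustrating why caching blocks read via a
    secondary replica squeezes out the primary's working set."""
    def __init__(self, capacity, cache_secondary_reads=True):
        self.capacity = capacity
        self.cache_secondary_reads = cache_secondary_reads
        self.blocks = OrderedDict()

    def read(self, block_id, from_secondary=False):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # LRU touch
            return "hit"
        # Cache miss: optionally skip insertion for secondary reads.
        if from_secondary and not self.cache_secondary_reads:
            return "miss-not-cached"
        self.blocks[block_id] = True
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)    # evict LRU block
        return "miss-cached"

# With secondary caching on, 4 primary + 4 secondary blocks thrash a
# cache of capacity 4; with it off, the 4 primary blocks stay resident.
for flag in (True, False):
    cache = BlockCache(capacity=4, cache_secondary_reads=flag)
    for b in range(4):
        cache.read(("primary", b))
    for b in range(4):
        cache.read(("secondary", b), from_secondary=True)
    hits = sum(cache.read(("primary", b)) == "hit" for b in range(4))
    print(f"cache_secondary_reads={flag}: primary re-read hits={hits}")
```

With the flag on, every primary block is evicted before it is re-read;
with it off, the primary working set survives, at the price of secondary
reads always going to disk.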


>  2. In more frequent memstore flushes - this will affect compaction and
> write tput.
>

More frequent flushes are not needed unless you are using the region
snapshots approach and want to bound the lag more tightly. It is a
tradeoff between expected lag and additional write amplification.
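
A back-of-the-envelope model of that tradeoff (my own arithmetic, not
numbers from the design doc): under the region-snapshots approach, a
secondary serving from shipped flush files lags the primary by at most
roughly one flush interval plus the time to ship and open the files, so
bounding the lag more tightly means flushing proportionally more often.

```python
def snapshot_replica_tradeoff(flush_interval_s, ship_delay_s, window_s=3600):
    """Rough model: worst-case staleness is one flush interval plus the
    ship/open delay; flush count (write amplification) grows inversely
    with the interval."""
    worst_case_lag_s = flush_interval_s + ship_delay_s
    flushes_per_window = window_s / flush_interval_s
    return worst_case_lag_s, flushes_per_window

# Halving the lag bound roughly doubles the flushes per hour:
for interval in (3600, 900, 60):
    lag, flushes = snapshot_replica_tradeoff(interval, ship_delay_s=30)
    print(f"flush every {interval:>4}s -> lag <= {lag:>4.0f}s, "
          f"{flushes:.0f} flushes/hour")
```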


>
> I do not believe that  HBase 'large' MTTR does not allow to meet 99% SLA.
> of 10-20ms unless your RSs go down 2-3 times a day for several minutes each
> time. You have to analyze first why are you having so frequent failures,
> than fix the root source of the problem. Its possible to reduce 'detection'
> phase in MTTR process to couple seconds either by using external beacon
> process (as I suggested already) or by rewriting some code inside HBase and
> NameNode to move all data out from Java heap to off-heap and reducing
> GC-induced timeouts from 30 sec to 1-2 sec max. Its tough, but doable. The
> result: you will decrease MTTR by 50% at least w/o sacrificing the overall
> cluster performance.
>
> I think, its RS and NN large heaps   and frequent s-t-w GC  activities
> prevents meeting strict SLA - not occasional server failures.
>

MTTR and this work are orthogonal. In a distributed system, you cannot
differentiate between a process that is not responding because it is down,
because it is busy, because the network is down, or whatnot. Having a
couple of seconds of detection time is unrealistic. You would end up in a
very unstable state where you are failing servers all over the place. An
external beacon also cannot differentiate between the main process not
responding because it is busy and it actually being down. And what happens
when there is a temporary network partition?
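
A minimal sketch of why aggressive detection is unstable (a toy
timeout-based detector, not HBase's or ZooKeeper's actual session logic):
the detector only observes gaps between heartbeats, so a
healthy-but-paused server is indistinguishable from a dead one.

```python
def declared_dead(heartbeat_gaps_s, timeout_s):
    """A timeout-based detector cannot tell 'down' from 'busy': it only
    sees missing heartbeats. Returns True if any gap exceeds the timeout."""
    return any(gap > timeout_s for gap in heartbeat_gaps_s)

# A healthy-but-busy RS: normal 1s heartbeats with one 8s GC pause.
busy_but_alive = [1, 1, 8, 1, 1]

# A 2s detection timeout "fails" the live server; a 30s one does not.
print(declared_dead(busy_but_alive, timeout_s=2))   # True  (false positive)
print(declared_dead(busy_but_alive, timeout_s=30))  # False
```

Shrinking the timeout to a couple of seconds turns every ordinary GC
pause or network hiccup into a declared failure, which is the instability
described above.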



>
>
>
> On Tue, Dec 3, 2013 at 11:51 AM, Jonathan Hsieh <jo...@cloudera.com> wrote:
>
> > To keep the discussion focused on the design goals, I'm going start
> > referring to enis and deveraj's eventually consistent read replicas as
> the
> > *read replica* design, and consistent fast read recovery mechanism based
> on
> > shadowing/tailing the wals as *shadow regions* or *shadow memstores*.
>  Can
> > we agree on nomenclature?
> >
> >
> > On Tue, Dec 3, 2013 at 11:07 AM, Enis Söztutar <en...@apache.org> wrote:
> >
> > > Thanks Jon for bringing this to dev@.
> > >
> > >
> > > On Mon, Dec 2, 2013 at 10:01 PM, Jonathan Hsieh <jo...@cloudera.com>
> > wrote:
> > >
> > > > Fundamentally, I'd prefer focusing on making HBase "HBasier" instead
> of
> > > > tackling a feature that other systems architecturally can do better
> > > > (inconsistent reads).   I consider consistent reads/writes being one
> of
> > > > HBase's defining features. That said, I think read replicas makes
> sense
> > > and
> > > > is a nice feature to have.
> > > >
> > >
> > > Our design proposal has a specific use case goal, and hopefully we can
> > > demonstrate the
> > > benefits of having this in HBase so that even more pieces can be built
> on
> > > top of this. Plus I imagine this will
> > > be a widely used feature for read-only tables or bulk loaded tables. We
> > are
> > > not
> > > proposing of reworking strong consistency semantics or major
> > architectural
> > > changes. I think by
> > > having the tables to be defined with replication count, and the
> proposed
> > > client API changes (Consistency definition)
> > > plugs well into the HBase model rather well.
> > >
> > >
> > I do agree think that without any recent updating mechanism, we are
> > limiting this usefulness of this feature to essentially *only* the
> > read-only or bulk load only tables.  Recency if there were any
> > edits/updates would be severely lagging (by default potentially an hour)
> > especially in cases where there are only a few edits to a primarily bulk
> > loaded table.  This limitation is not mentioned in the tradeoffs or
> > requirements (or a non-requirements section) definitely should be listed
> > there.
> >
> > With the current design it might be best to have a flag on the table
> which
> > marks it read-only or bulk-load only so that it only gets used by users
> > when the table is in that mode?  (and maybe an "escape hatch" for power
> > users).
> >
> > [snip]
> > >
> > > - I think the two goals are both worthy on their own each with their
> own
> > > > optimal points.  We should in the design makes sure that we can
> support
> > > > both goals.
> > > >
> > >
> > > I think our proposal is consistent with your doc, and we have
> considered
> > > secondary region promotion
> > > in the future section. It would be good if you can review and comment
> on
> > > whether you see any points
> > > missing.
> > >
> > >
> > > I definitely will. At the moment, I think the hybrid for the
> wals/hlogs I
> > suggested in the other thread seems to be an optimal solution considering
> > locality.  Though feasible is obviously more complex than just one
> approach
> > alone.
> >
> >
> > > > - I want to making sure the proposed design have a path for optimal
> > > > fast-consistent read-recovery.
> > > >
> > >
> > > We think that it is, but it is a secondary goal for the initial work. I
> > > don't see any reason why secondary
> > > promotion cannot be build on top of this, once the branch is in a
> better
> > > state.
> > >
> >
> > Based on the detail in the design doc and this statement it sounds like
> you
> > have a prototype branch already?  Is this the case?
> >
> > --
> > // Jonathan Hsieh (shay)
> > // Software Engineer, Cloudera
> > // jon@cloudera.com
> >
>

Re: [Shadow Regions / Read Replicas ]

Posted by Devaraj Das <dd...@hortonworks.com>.
On Tue, Dec 3, 2013 at 12:31 PM, Vladimir Rodionov
<vl...@gmail.com>wrote:

> The downside:
>
> - Double/Triple memstore usage
> - Increased block cache usage (effectively, block cache will have 50%
> capacity may be less)
>
> These downsides are pretty serious ones. This will result:
>
> 1. in decreased overall performance due to decreased efficient block cache
> size
>  2. In more frequent memstore flushes - this will affect compaction and
> write tput.
>
>
The thing is that this is configurable on a per-table basis. Depending on
the hardware characteristics, one may choose not to have more than one
replica per region. Certain classes of application + cluster combinations
can still benefit from this.



> I do not believe that  HBase 'large' MTTR does not allow to meet 99% SLA.
> of 10-20ms unless your RSs go down 2-3 times a day for several minutes each
> time. You have to analyze first why are you having so frequent failures,
> than fix the root source of the problem. Its possible to reduce 'detection'
> phase in MTTR process to couple seconds either by using external beacon
> process (as I suggested already) or by rewriting some code inside HBase and
> NameNode to move all data out from Java heap to off-heap and reducing
> GC-induced timeouts from 30 sec to 1-2 sec max. Its tough, but doable. The
> result: you will decrease MTTR by 50% at least w/o sacrificing the overall
> cluster performance.
>
> I think, its RS and NN large heaps   and frequent s-t-w GC  activities
> prevents meeting strict SLA - not occasional server failures.
>
>
>
Possibly. Work on better MTTR and on handling GC issues will continue - no
doubt. But there is still that window of time when certain regions are
unavailable. We want to address that for applications that can tolerate
eventual consistency.


>
> On Tue, Dec 3, 2013 at 11:51 AM, Jonathan Hsieh <jo...@cloudera.com> wrote:
>
> > To keep the discussion focused on the design goals, I'm going start
> > referring to enis and deveraj's eventually consistent read replicas as
> the
> > *read replica* design, and consistent fast read recovery mechanism based
> on
> > shadowing/tailing the wals as *shadow regions* or *shadow memstores*.
>  Can
> > we agree on nomenclature?
> >
> >
> > On Tue, Dec 3, 2013 at 11:07 AM, Enis Söztutar <en...@apache.org> wrote:
> >
> > > Thanks Jon for bringing this to dev@.
> > >
> > >
> > > On Mon, Dec 2, 2013 at 10:01 PM, Jonathan Hsieh <jo...@cloudera.com>
> > wrote:
> > >
> > > > Fundamentally, I'd prefer focusing on making HBase "HBasier" instead
> of
> > > > tackling a feature that other systems architecturally can do better
> > > > (inconsistent reads).   I consider consistent reads/writes being one
> of
> > > > HBase's defining features. That said, I think read replicas makes
> sense
> > > and
> > > > is a nice feature to have.
> > > >
> > >
> > > Our design proposal has a specific use case goal, and hopefully we can
> > > demonstrate the
> > > benefits of having this in HBase so that even more pieces can be built
> on
> > > top of this. Plus I imagine this will
> > > be a widely used feature for read-only tables or bulk loaded tables. We
> > are
> > > not
> > > proposing of reworking strong consistency semantics or major
> > architectural
> > > changes. I think by
> > > having the tables to be defined with replication count, and the
> proposed
> > > client API changes (Consistency definition)
> > > plugs well into the HBase model rather well.
> > >
> > >
> > I do agree think that without any recent updating mechanism, we are
> > limiting this usefulness of this feature to essentially *only* the
> > read-only or bulk load only tables.  Recency if there were any
> > edits/updates would be severely lagging (by default potentially an hour)
> > especially in cases where there are only a few edits to a primarily bulk
> > loaded table.  This limitation is not mentioned in the tradeoffs or
> > requirements (or a non-requirements section) definitely should be listed
> > there.
> >
> > With the current design it might be best to have a flag on the table
> which
> > marks it read-only or bulk-load only so that it only gets used by users
> > when the table is in that mode?  (and maybe an "escape hatch" for power
> > users).
> >
> > [snip]
> > >
> > > - I think the two goals are both worthy on their own each with their
> own
> > > > optimal points.  We should in the design makes sure that we can
> support
> > > > both goals.
> > > >
> > >
> > > I think our proposal is consistent with your doc, and we have
> considered
> > > secondary region promotion
> > > in the future section. It would be good if you can review and comment
> on
> > > whether you see any points
> > > missing.
> > >
> > >
> > > I definitely will. At the moment, I think the hybrid for the
> wals/hlogs I
> > suggested in the other thread seems to be an optimal solution considering
> > locality.  Though feasible is obviously more complex than just one
> approach
> > alone.
> >
> >
> > > > - I want to making sure the proposed design have a path for optimal
> > > > fast-consistent read-recovery.
> > > >
> > >
> > > We think that it is, but it is a secondary goal for the initial work. I
> > > don't see any reason why secondary
> > > promotion cannot be build on top of this, once the branch is in a
> better
> > > state.
> > >
> >
> > Based on the detail in the design doc and this statement it sounds like
> you
> > have a prototype branch already?  Is this the case?
> >
> > --
> > // Jonathan Hsieh (shay)
> > // Software Engineer, Cloudera
> > // jon@cloudera.com
> >
>

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Re: [Shadow Regions / Read Replicas ]

Posted by Vladimir Rodionov <vl...@gmail.com>.
The downside:

- Double/Triple memstore usage
- Increased block cache usage (effectively, the block cache will have 50%
capacity, maybe less)

These downsides are pretty serious ones. This will result in:

1. Decreased overall performance due to a smaller effective block cache
size.
2. More frequent memstore flushes - this will affect compaction and
write throughput.

I do not believe that HBase's 'large' MTTR prevents meeting a 99% SLA of
10-20ms, unless your RSs go down 2-3 times a day for several minutes each
time. You have to analyze first why you are having such frequent failures,
then fix the root cause of the problem. It's possible to reduce the
'detection' phase of the MTTR process to a couple of seconds, either by
using an external beacon process (as I suggested already) or by rewriting
some code inside HBase and the NameNode to move all data out of the Java
heap to off-heap storage, reducing GC-induced timeouts from 30 sec to 1-2
sec max. It's tough, but doable. The result: you will decrease MTTR by at
least 50% without sacrificing overall cluster performance.

I think it's the large RS and NN heaps and frequent stop-the-world GC
activity that prevent meeting strict SLAs - not occasional server failures.



On Tue, Dec 3, 2013 at 11:51 AM, Jonathan Hsieh <jo...@cloudera.com> wrote:

> To keep the discussion focused on the design goals, I'm going start
> referring to enis and deveraj's eventually consistent read replicas as the
> *read replica* design, and consistent fast read recovery mechanism based on
> shadowing/tailing the wals as *shadow regions* or *shadow memstores*.  Can
> we agree on nomenclature?
>
>
> On Tue, Dec 3, 2013 at 11:07 AM, Enis Söztutar <en...@apache.org> wrote:
>
> > Thanks Jon for bringing this to dev@.
> >
> >
> > On Mon, Dec 2, 2013 at 10:01 PM, Jonathan Hsieh <jo...@cloudera.com>
> wrote:
> >
> > > Fundamentally, I'd prefer focusing on making HBase "HBasier" instead of
> > > tackling a feature that other systems architecturally can do better
> > > (inconsistent reads).   I consider consistent reads/writes being one of
> > > HBase's defining features. That said, I think read replicas makes sense
> > and
> > > is a nice feature to have.
> > >
> >
> > Our design proposal has a specific use case goal, and hopefully we can
> > demonstrate the
> > benefits of having this in HBase so that even more pieces can be built on
> > top of this. Plus I imagine this will
> > be a widely used feature for read-only tables or bulk loaded tables. We
> are
> > not
> > proposing of reworking strong consistency semantics or major
> architectural
> > changes. I think by
> > having the tables to be defined with replication count, and the proposed
> > client API changes (Consistency definition)
> > plugs well into the HBase model rather well.
> >
> >
> I do agree think that without any recent updating mechanism, we are
> limiting this usefulness of this feature to essentially *only* the
> read-only or bulk load only tables.  Recency if there were any
> edits/updates would be severely lagging (by default potentially an hour)
> especially in cases where there are only a few edits to a primarily bulk
> loaded table.  This limitation is not mentioned in the tradeoffs or
> requirements (or a non-requirements section) definitely should be listed
> there.
>
> With the current design it might be best to have a flag on the table which
> marks it read-only or bulk-load only so that it only gets used by users
> when the table is in that mode?  (and maybe an "escape hatch" for power
> users).
>
> [snip]
> >
> > - I think the two goals are both worthy on their own each with their own
> > > optimal points.  We should in the design makes sure that we can support
> > > both goals.
> > >
> >
> > I think our proposal is consistent with your doc, and we have considered
> > secondary region promotion
> > in the future section. It would be good if you can review and comment on
> > whether you see any points
> > missing.
> >
> >
> > I definitely will. At the moment, I think the hybrid for the wals/hlogs I
> suggested in the other thread seems to be an optimal solution considering
> locality.  Though feasible is obviously more complex than just one approach
> alone.
>
>
> > > - I want to making sure the proposed design have a path for optimal
> > > fast-consistent read-recovery.
> > >
> >
> > We think that it is, but it is a secondary goal for the initial work. I
> > don't see any reason why secondary
> > promotion cannot be build on top of this, once the branch is in a better
> > state.
> >
>
> Based on the detail in the design doc and this statement it sounds like you
> have a prototype branch already?  Is this the case?
>
> --
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // jon@cloudera.com
>

Re: [Shadow Regions / Read Replicas ]

Posted by Jonathan Hsieh <jo...@cloudera.com>.
To keep the discussion focused on the design goals, I'm going to start
referring to Enis and Devaraj's eventually consistent read replicas as the
*read replica* design, and the consistent fast read recovery mechanism based on
shadowing/tailing the WALs as *shadow regions* or *shadow memstores*.  Can
we agree on nomenclature?


On Tue, Dec 3, 2013 at 11:07 AM, Enis Söztutar <en...@apache.org> wrote:

> Thanks Jon for bringing this to dev@.
>
>
> On Mon, Dec 2, 2013 at 10:01 PM, Jonathan Hsieh <jo...@cloudera.com> wrote:
>
> > Fundamentally, I'd prefer focusing on making HBase "HBasier" instead of
> > tackling a feature that other systems architecturally can do better
> > (inconsistent reads).   I consider consistent reads/writes being one of
> > HBase's defining features. That said, I think read replicas makes sense
> and
> > is a nice feature to have.
> >
>
> Our design proposal has a specific use case goal, and hopefully we can
> demonstrate the
> benefits of having this in HBase so that even more pieces can be built on
> top of this. Plus I imagine this will
> be a widely used feature for read-only tables or bulk loaded tables. We are
> not
> proposing of reworking strong consistency semantics or major architectural
> changes. I think by
> having the tables to be defined with replication count, and the proposed
> client API changes (Consistency definition)
> plugs well into the HBase model rather well.
>
>
I do think that without any recency-updating mechanism, we are limiting
the usefulness of this feature to essentially *only* read-only or
bulk-load-only tables.  Recency, if there were any edits/updates, would be
severely lagging (by default potentially an hour), especially in cases
where there are only a few edits to a primarily bulk loaded table.  This
limitation is not mentioned in the tradeoffs or requirements (or a
non-requirements section) and definitely should be listed there.

With the current design it might be best to have a flag on the table which
marks it read-only or bulk-load-only, so that the feature only gets used
when the table is in that mode?  (and maybe an "escape hatch" for power
users).
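
A sketch of what such gating could look like (all names here are
hypothetical illustrations of the suggestion above, not an existing HBase
API): replica reads are permitted only when the table descriptor carries
the read-only/bulk-load-only flag, unless the caller explicitly opts in.

```python
class TableDescriptor:
    """Hypothetical table descriptor carrying the suggested mode flags."""
    def __init__(self, name, read_only=False, bulk_load_only=False):
        self.name = name
        self.read_only = read_only
        self.bulk_load_only = bulk_load_only

def may_read_from_replica(desc, escape_hatch=False):
    """Allow replica reads only for tables in a mode where staleness is
    bounded by design, or when a power user overrides the check."""
    return desc.read_only or desc.bulk_load_only or escape_hatch

static_table = TableDescriptor("dimensions", read_only=True)
live_table = TableDescriptor("user_events")

print(may_read_from_replica(static_table))                   # True
print(may_read_from_replica(live_table))                     # False
print(may_read_from_replica(live_table, escape_hatch=True))  # True
```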

[snip]
>
> - I think the two goals are both worthy on their own each with their own
> > optimal points.  We should in the design makes sure that we can support
> > both goals.
> >
>
> I think our proposal is consistent with your doc, and we have considered
> secondary region promotion
> in the future section. It would be good if you can review and comment on
> whether you see any points
> missing.
>
>
I definitely will. At the moment, I think the hybrid for the WALs/HLogs I
suggested in the other thread seems to be an optimal solution considering
locality, though it is obviously more complex than just one approach
alone.


> > - I want to making sure the proposed design have a path for optimal
> > fast-consistent read-recovery.
> >
>
> We think that it is, but it is a secondary goal for the initial work. I
> don't see any reason why secondary
> promotion cannot be build on top of this, once the branch is in a better
> state.
>

Based on the detail in the design doc and this statement it sounds like you
have a prototype branch already?  Is this the case?

-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com

Re: [Shadow Regions / Read Replicas ]

Posted by Devaraj Das <dd...@hortonworks.com>.
On Tue, Dec 3, 2013 at 11:07 AM, Enis Söztutar <en...@apache.org> wrote:

> Thanks Jon for bringing this to dev@.
>
>
> On Mon, Dec 2, 2013 at 10:01 PM, Jonathan Hsieh <jo...@cloudera.com> wrote:
>
> > Fundamentally, I'd prefer focusing on making HBase "HBasier" instead of
> > tackling a feature that other systems architecturally can do better
> > (inconsistent reads).   I consider consistent reads/writes being one of
> > HBase's defining features. That said, I think read replicas makes sense
> and
> > is a nice feature to have.
> >
>
> Our design proposal has a specific use case goal, and hopefully we can
> demonstrate the
> benefits of having this in HBase so that even more pieces can be built on
> top of this. Plus I imagine this will
> be a widely used feature for read-only tables or bulk loaded tables. We are
> not
> proposing of reworking strong consistency semantics or major architectural
> changes. I think by
> having the tables to be defined with replication count, and the proposed
> client API changes (Consistency definition)
> plugs well into the HBase model rather well.
>
>
>
The good part is that the proposed architecture and the underlying
implementation can be extended to provide strong consistency semantics in
the presence of shadows. I guess much of the work that is being proposed
here would be needed even there.


> >
> > A few thoughts:
> > - Both approaches need to have more failure handling cases thought out.
> >
>
> Agreed. We cover most of the assignment / master side of things in the doc
> though.
>
>
> > - I think the two goals are both worthy on their own each with their own
> > optimal points.  We should in the design makes sure that we can support
> > both goals.
> >
>
> I think our proposal is consistent with your doc, and we have considered
> secondary region promotion
> in the future section. It would be good if you can review and comment on
> whether you see any points
> missing.
>
>
> > - I want to making sure the proposed design have a path for optimal
> > fast-consistent read-recovery.
> >
>
> We think that it is, but it is a secondary goal for the initial work. I
> don't see any reason why secondary
> promotion cannot be build on top of this, once the branch is in a better
> state.
>
>
> >
> > Jon.
> >
> >
> > On Mon, Dec 2, 2013 at 9:54 PM, Jonathan Hsieh <jo...@cloudera.com> wrote:
> >
> > > HBASE-10070 [1]  looks to be heading into a discussion more apt for the
> > > mailing list than in the jira. Moving this to the dev list for threaded
> > > discussion.  I'll start a few threads by replying to this thread with
> > > edited titles
> > >
> > > [1] https://issues.apache.org/jira/browse/HBASE-10070
> > >
> > > --
> > > // Jonathan Hsieh (shay)
> > > // Software Engineer, Cloudera
> > > // jon@cloudera.com
> > >
> > >
> >
> >
> >
> > --
> > // Jonathan Hsieh (shay)
> > // Software Engineer, Cloudera
> > // jon@cloudera.com
> >
>


Re: [Shadow Regions / Read Replicas ]

Posted by Enis Söztutar <en...@apache.org>.
Thanks Jon for bringing this to dev@.


On Mon, Dec 2, 2013 at 10:01 PM, Jonathan Hsieh <jo...@cloudera.com> wrote:

> Fundamentally, I'd prefer focusing on making HBase "HBasier" instead of
> tackling a feature that other systems architecturally can do better
> (inconsistent reads).   I consider consistent reads/writes being one of
> HBase's defining features. That said, I think read replicas makes sense and
> is a nice feature to have.
>

Our design proposal has a specific use case goal, and hopefully we can
demonstrate the benefits of having this in HBase so that even more pieces
can be built on top of it. Plus I imagine this will be a widely used
feature for read-only tables or bulk loaded tables. We are not proposing
to rework strong consistency semantics or make major architectural
changes. I think having tables defined with a replication count, and the
proposed client API changes (the Consistency definition), plugs into the
HBase model rather well.
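
As a rough sketch of how such a Consistency knob might look from the
client's perspective (the real proposed API is in the design doc; the
names and read-path logic below are my own illustration): strong reads go
to the primary or fail, while timeline reads may fall back to a secondary
and flag the result as possibly stale.

```python
import enum

class Consistency(enum.Enum):
    STRONG = "strong"      # primary only; fail if the primary is down
    TIMELINE = "timeline"  # allow possibly-stale reads from secondaries

def get(row, primary_up, secondaries, consistency):
    """Sketch of the read path such an API implies."""
    if primary_up:
        return {"row": row, "stale": False, "served_by": "primary"}
    if consistency is Consistency.TIMELINE and secondaries:
        # Secondary may lag the primary; mark the result as stale.
        return {"row": row, "stale": True, "served_by": secondaries[0]}
    raise IOError(f"region unavailable for {row!r}")

print(get("r1", True, ["rs2"], Consistency.STRONG)["served_by"])  # primary
print(get("r1", False, ["rs2"], Consistency.TIMELINE)["stale"])   # True
```

The key point is that the default (STRONG) behavior is unchanged; only
callers that explicitly ask for TIMELINE can observe stale data, and the
result tells them so.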


>
> A few thoughts:
> - Both approaches need to have more failure handling cases thought out.
>

Agreed. We cover most of the assignment / master side of things in the doc
though.


> - I think the two goals are both worthy on their own each with their own
> optimal points.  We should in the design makes sure that we can support
> both goals.
>

I think our proposal is consistent with your doc, and we have considered
secondary region promotion
in the future section. It would be good if you can review and comment on
whether you see any points
missing.


> - I want to making sure the proposed design have a path for optimal
> fast-consistent read-recovery.
>

We think that it does, but it is a secondary goal for the initial work. I
don't see any reason why secondary promotion cannot be built on top of
this, once the branch is in a better state.


>
> Jon.
>
>
> On Mon, Dec 2, 2013 at 9:54 PM, Jonathan Hsieh <jo...@cloudera.com> wrote:
>
> > HBASE-10070 [1]  looks to be heading into a discussion more apt for the
> > mailing list than in the jira. Moving this to the dev list for threaded
> > discussion.  I'll start a few threads by replying to this thread with
> > edited titles
> >
> > [1] https://issues.apache.org/jira/browse/HBASE-10070
> >
> > --
> > // Jonathan Hsieh (shay)
> > // Software Engineer, Cloudera
> > // jon@cloudera.com
> >
> >
>
>
>
> --
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // jon@cloudera.com
>

Re: [Shadow Regions / Read Replicas ]

Posted by Jonathan Hsieh <jo...@cloudera.com>.
Fundamentally, I'd prefer focusing on making HBase "HBasier" instead of
tackling a feature that other systems architecturally can do better
(inconsistent reads).  I consider consistent reads/writes to be one of
HBase's defining features. That said, I think read replicas make sense and
are a nice feature to have.

A few thoughts:
- Both approaches need to have more failure-handling cases thought out.
- I think the two goals are each worthy on their own, each with their own
optimal points.  We should make sure the design can support both goals.
- I want to make sure the proposed design has a path for optimal
fast, consistent read-recovery.

Jon.


On Mon, Dec 2, 2013 at 9:54 PM, Jonathan Hsieh <jo...@cloudera.com> wrote:

> HBASE-10070 [1]  looks to be heading into a discussion more apt for the
> mailing list than in the jira. Moving this to the dev list for threaded
> discussion.  I'll start a few threads by replying to this thread with
> edited titles
>
> [1] https://issues.apache.org/jira/browse/HBASE-10070
>
> --
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // jon@cloudera.com
>
>



-- 
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// jon@cloudera.com