Posted to dev@hbase.apache.org by 谢良 <xi...@xiaomi.com> on 2013/12/07 14:39:15 UTC

Re: [Shadow Regions / Read Replicas]

Regarding one advantage of this design (the ability to do low-latency reads,
with <20ms 99.9th-percentile latencies for stale reads): I prefer the
HBASE-7509 solution, since if you want to guarantee similarly high read
performance in shadow regions, you must have the shadow RS warm the related
hot blocks into its block cache. (Indeed, I have a worry similar to
Vladimir's.) I have tried to think of how this design could beat HBASE-7509
at cutting the latency tail, but so far without result.

Enis, could you share your thoughts on it? Thanks.

Thanks,

________________________________________
From: Enis Söztutar [enis.soz@gmail.com]
Sent: December 4, 2013 6:18
To: dev@hbase.apache.org
Subject: Re: [Shadow Regions / Read Replicas]

On Tue, Dec 3, 2013 at 12:31 PM, Vladimir Rodionov
<vl...@gmail.com> wrote:

> The downsides:
>
> - Double/triple memstore usage
> - Increased block cache usage (effectively, the block cache will have 50%
> capacity, maybe less)


These are covered in the tradeoffs section of the design doc.


>
>
> These downsides are pretty serious ones. They will result in:
>
> 1. decreased overall performance due to decreased effective block cache
> size
>

You can elect not to fill up the block cache for secondary reads. It will be
a configuration option, and a tradeoff you may or may not want to pay.
Details are in the doc.
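
For illustration, a minimal sketch of what that could look like (the
per-scan knob below is an existing client API; the cluster-wide key name is
hypothetical, not a committed config):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Scan;

  public class SecondaryReadCaching {
    public static void main(String[] args) {
      // Existing per-request knob: do not populate the block cache with
      // the blocks this scan touches.
      Scan scan = new Scan();
      scan.setCacheBlocks(false);

      // Hypothetical cluster-wide option along the lines the doc
      // proposes: never cache blocks read on behalf of secondary
      // replicas. The key name is made up for illustration.
      Configuration conf = HBaseConfiguration.create();
      conf.setBoolean("hbase.region.replica.cache.blocks", false);
    }
  }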


>  2. more frequent memstore flushes - this will affect compaction and
> write throughput.
>

More frequent flushes are not needed unless you are using the region
snapshots approach and want to bound the lag better. It is a tradeoff
between expected lag and write amplification.
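
As a sketch of that knob (the key name below is illustrative, not a
committed config):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;

  public class SnapshotLagTuning {
    public static void main(String[] args) {
      // Region snapshots approach: secondaries pick up new files when the
      // primary flushes, so a shorter flush/refresh period bounds
      // staleness more tightly but costs extra write amplification (more
      // flushes, hence more compactions). Key name is illustrative.
      Configuration conf = HBaseConfiguration.create();
      conf.setInt("hbase.regionserver.storefile.refresh.period", 30000); // 30s
    }
  }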


>
> I do not believe that HBase's 'large' MTTR prevents meeting a 99% SLA of
> 10-20ms, unless your RSs go down 2-3 times a day for several minutes each
> time. You have to analyze first why you are having such frequent failures,
> then fix the root cause of the problem. It's possible to reduce the
> 'detection' phase of the MTTR process to a couple of seconds, either by
> using an external beacon process (as I suggested already) or by rewriting
> some code inside HBase and the NameNode to move all data out of the Java
> heap to off-heap storage, reducing GC-induced timeouts from 30 sec to 1-2
> sec max. It's tough, but doable. The result: you will decrease MTTR by at
> least 50% w/o sacrificing overall cluster performance.
>
> I think it's the RS and NN large heaps and frequent stop-the-world GC
> activity that prevent meeting a strict SLA - not occasional server failures.
>

MTTR and this work are orthogonal. In a distributed system, you cannot
differentiate between a process not responding because it is down, because
it is busy, because the network is down, or whatnot. Having a couple of
seconds of detection time is unrealistic. You will end up in a very unstable
state where you will be failing servers all over the place. An external
beacon also cannot differentiate between the main process not responding
because it is busy and it being down. And what happens when there is a
temporary network partition?



>
>
>
> On Tue, Dec 3, 2013 at 11:51 AM, Jonathan Hsieh <jo...@cloudera.com> wrote:
>
> > To keep the discussion focused on the design goals, I'm going to start
> > referring to Enis and Devaraj's eventually consistent read replicas as
> > the *read replica* design, and the consistent fast read recovery
> > mechanism based on shadowing/tailing the WALs as *shadow regions* or
> > *shadow memstores*. Can we agree on nomenclature?
> >
> >
> > On Tue, Dec 3, 2013 at 11:07 AM, Enis Söztutar <en...@apache.org> wrote:
> >
> > > Thanks Jon for bringing this to dev@.
> > >
> > >
> > > On Mon, Dec 2, 2013 at 10:01 PM, Jonathan Hsieh <jo...@cloudera.com>
> > > > wrote:
> > >
> > > > Fundamentally, I'd prefer focusing on making HBase "HBasier" instead
> > > > of tackling a feature that other systems architecturally can do
> > > > better (inconsistent reads). I consider consistent reads/writes to be
> > > > one of HBase's defining features. That said, I think read replicas
> > > > make sense and are a nice feature to have.
> > > >
> > >
> > > Our design proposal has a specific use case goal, and hopefully we can
> > > demonstrate the benefits of having this in HBase so that even more
> > > pieces can be built on top of this. Plus I imagine this will be a
> > > widely used feature for read-only tables or bulk loaded tables. We are
> > > not proposing to rework strong consistency semantics or make major
> > > architectural changes. I think having tables defined with a
> > > replication count, plus the proposed client API changes (the
> > > Consistency definition), plugs into the HBase model rather well.
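> > >
> > > For illustration, a rough sketch of the proposed model (the names
> > > below are illustrative of the proposal, not an existing API):
> > >
> > >   import java.io.IOException;
> > >   import org.apache.hadoop.hbase.HTableDescriptor;
> > >   import org.apache.hadoop.hbase.TableName;
> > >   import org.apache.hadoop.hbase.client.Consistency;
> > >   import org.apache.hadoop.hbase.client.Get;
> > >   import org.apache.hadoop.hbase.client.Result;
> > >   import org.apache.hadoop.hbase.client.Table;
> > >
> > >   public class TimelineReadSketch {
> > >     // Table-level replica count: one primary plus two secondaries.
> > >     static HTableDescriptor replicatedTable() {
> > >       HTableDescriptor htd =
> > >           new HTableDescriptor(TableName.valueOf("t1"));
> > >       htd.setRegionReplication(3);
> > >       return htd;
> > >     }
> > >
> > >     // Per-request consistency: TIMELINE permits a possibly-stale
> > >     // read served by a secondary replica; the default (STRONG)
> > >     // keeps today's semantics.
> > >     static Result timelineGet(Table table, byte[] row)
> > >         throws IOException {
> > >       Get get = new Get(row);
> > >       get.setConsistency(Consistency.TIMELINE);
> > >       Result result = table.get(get);
> > >       // result.isStale() tells the caller whether a secondary
> > >       // served the read.
> > >       return result;
> > >     }
> > >   }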
> > >
> > >
> > I do think that without any recency-updating mechanism, we are limiting
> > the usefulness of this feature to essentially *only* read-only or
> > bulk-load-only tables.  Recency, if there were any edits/updates, would
> > be severely lagging (by default potentially an hour), especially in
> > cases where there are only a few edits to a primarily bulk-loaded table.
> > This limitation is not mentioned in the tradeoffs or requirements (or a
> > non-requirements section) and definitely should be listed there.
> >
> > With the current design it might be best to have a flag on the table
> > which marks it read-only or bulk-load-only, so that replica reads only
> > get used when the table is in that mode (and maybe an "escape hatch" for
> > power users)?  A sketch of the existing flag follows.
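> >
> > (HTableDescriptor#setReadOnly is a real API today; gating replica reads
> > on it, or adding an analogous bulk-load-only flag, would be new work.)
> >
> >   import org.apache.hadoop.hbase.HTableDescriptor;
> >   import org.apache.hadoop.hbase.TableName;
> >
> >   public class ReadOnlyFlagSketch {
> >     public static void main(String[] args) {
> >       // Existing flag: the RS rejects mutations on a read-only table.
> >       // The suggestion above would additionally gate replica reads on
> >       // it; a bulk-load-only flag would be a new, analogous addition.
> >       HTableDescriptor htd =
> >           new HTableDescriptor(TableName.valueOf("t1"));
> >       htd.setReadOnly(true);
> >     }
> >   }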
> >
> > [snip]
> > >
> > > > - I think the two goals are both worthy on their own, each with its
> > > > own optimal points.  We should make sure the design can support both
> > > > goals.
> > > >
> > >
> > > I think our proposal is consistent with your doc, and we have
> > > considered secondary region promotion in the future work section. It
> > > would be good if you could review and comment on whether you see any
> > > missing points.
> > >
> > >
> > I definitely will. At the moment, I think the hybrid approach for the
> > wals/hlogs I suggested in the other thread seems to be an optimal
> > solution considering locality.  Though feasible, it is obviously more
> > complex than just one approach alone.
> >
> >
> > > > - I want to make sure the proposed design has a path to optimal
> > > > fast-consistent read recovery.
> > > >
> > >
> > > We think that it does, but it is a secondary goal for the initial
> > > work. I don't see any reason why secondary promotion cannot be built
> > > on top of this, once the branch is in a better state.
> > >
> >
> > Based on the detail in the design doc and this statement, it sounds like
> > you have a prototype branch already?  Is this the case?
> >
> > --
> > // Jonathan Hsieh (shay)
> > // Software Engineer, Cloudera
> > // jon@cloudera.com
> >
>

Re: Re: [Shadow Regions / Read Replicas]

Posted by 谢良 <xi...@xiaomi.com>.
Hi Enis,

Thanks for the reply. I have realized that we still need the ability to read
from a secondary node to achieve a lower 99th or 99.9th percentile read
latency, e.g. during a big GC on an RS node. I have an idea to implement
this ability on the hbase-client side: we could issue read requests to the
slave cluster, which would give us:
1) a warmed-up slave cluster, so more confidence in its performance when
switching traffic to it if the current master cluster suffers a breakdown or
the like;
2) a very realistic stress-testing result :)
We could implement several read policies, similar to your design's (a rough
sketch follows below).
3) behaviour similar to the traditional RDBMS pattern: write to the master,
and read from the slave or slave+master :)  making the system scale for
reads.
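
A rough sketch of that client-side policy, assuming one Connection per
cluster and a simple master-then-slave fallback (all class and method names
here are mine, not a committed design, and I use the newer Connection API
for brevity):

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Table;

  // Client-side "read from slave" sketch: try the master cluster first
  // and fall back to the replication slave on failure. A fuller version
  // would hedge on a latency threshold instead of waiting for an error.
  public class DualClusterReader {
    private final Connection master;  // long-lived, one per cluster
    private final Connection slave;

    public DualClusterReader(Configuration masterConf, Configuration slaveConf)
        throws IOException {
      this.master = ConnectionFactory.createConnection(masterConf);
      this.slave = ConnectionFactory.createConnection(slaveConf);
    }

    public Result get(TableName name, Get get) throws IOException {
      try (Table t = master.getTable(name)) {
        return t.get(get);
      } catch (IOException masterFailed) {
        // Slave data may lag replication; acceptable for stale reads.
        try (Table t = slave.getTable(name)) {
          return t.get(get);
        }
      }
    }
  }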

The main shortcoming of the above is that it is suitable only if replication
is running. But I still like it, and indeed plan to write something to do
prototype testing next week. I still have the same concern as before: the
cost is heavy if you want to enable reads from a secondary RS in the same
cluster, with the block cache always warmed up to achieve lower read latency.

Sorry, my comments are probably not about the main design point (HA for
reads), but focus on the latency side.

Thanks,
________________________________________
From: Enis Söztutar [enis.soz@gmail.com]
Sent: December 10, 2013 5:24
To: dev@hbase.apache.org
Subject: Re: Re: [Shadow Regions / Read Replicas]

We are also proposing to implement HBASE-7509 as a part of this major
undertaking. HBASE-7509 will help HBase in general (even if you are not
using HBASE-10070), and possibly some other hdfs clients as well.
HBASE-10070 will give you similar benefits to HBASE-7509 if your use case
needs that, but at the hbase layer, sitting on top of HBASE-7509.
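
For context, a sketch of the hdfs-side hedged read knobs this builds on
(these setting names are from HDFS-5776, which shipped after this thread
was written; they are shown to illustrate the mechanism, not this proposal):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;

  public class HedgedReadSketch {
    public static void main(String[] args) {
      // Hedged reads: if the first datanode has not responded within the
      // threshold, issue a second read to another replica and take
      // whichever response arrives first. This cuts the read latency tail
      // at the cost of extra datanode traffic.
      Configuration conf = HBaseConfiguration.create();
      conf.setInt("dfs.client.hedged.read.threadpool.size", 20);   // >0 enables
      conf.setLong("dfs.client.hedged.read.threshold.millis", 10); // hedge after 10ms
    }
  }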

Enis


On Sat, Dec 7, 2013 at 5:39 AM, 谢良 <xi...@xiaomi.com> wrote:

> [snip]

Re: Re: [Shadow Regions / Read Replicas]

Posted by Enis Söztutar <en...@gmail.com>.
We are also proposing to implement HBASE-7509 as a part of this major
undertaking. HBASE-7509 will help HBase in general (even if you are not
using HBASE-10070), and possibly some other hdfs clients as well.
HBASE-10070 will give you similar benefits to HBASE-7509 if your use case
needs that, but at the hbase layer, sitting on top of HBASE-7509.

Enis


On Sat, Dec 7, 2013 at 5:39 AM, 谢良 <xi...@xiaomi.com> wrote:

> [snip]