Posted to dev@hbase.apache.org by Stack <st...@duboce.net> on 2020/11/14 18:29:44 UTC

HEADS-UP: Merging HBASE-18070 "Enable memstore replication for meta replica" to master and then back to branch-2

HBASE-18070 lets hbase:meta read replicas run much closer to the
primary (sub-second lag rather than minutes). It adds Async WAL
Replication[1] on the hbase:meta table; i.e. edits are sprayed across
replicas as they arrive at the primary's WAL. Before this work, Async WAL
Replication was only available on user-space tables, and the only option for
hbase:meta read replicas was reloading the primary's hfiles on a period
(minutes). HBASE-18070 also adds an optional client-side 'LoadBalance'
policy that favors read replicas ahead of the primary, falling back to the
primary on fault. Together, these additions allow distributing hbase:meta
read load across primary and replicas, alleviating 'hotspotting'.
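
The client-side 'LoadBalance' idea can be sketched roughly as follows. This is
an illustration only, not the HBASE-18070 code: the names locate_region,
lookup and ReplicaUnavailable are hypothetical, replica id 0 is assumed to be
the primary, and only the "fall back to the primary on fault" case is modeled.

```python
import random


class ReplicaUnavailable(Exception):
    """Hypothetical error raised when a replica cannot serve a read."""


def locate_region(row, replica_ids, lookup, rng=random):
    """Try one randomly chosen read replica, then fall back to the primary.

    replica_ids: ids of all hbase:meta replicas (assumption: 0 is the primary).
    lookup:      hypothetical callable (replica_id, row) -> region location.
    rng:         anything with a choice() method (the random module by default).
    """
    secondaries = [r for r in replica_ids if r != 0]
    if secondaries:
        chosen = rng.choice(secondaries)  # spread lookups across the replicas
        try:
            return lookup(chosen, row)
        except ReplicaUnavailable:
            pass  # fault on the replica: fall through to the primary
    return lookup(0, row)
```

Choosing a replica at random is just one simple way to spread lookups; a real
policy may also need to detect stale answers from a lagging replica and retry
against the primary, which this sketch leaves out.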

I would like to merge the feature to the master branch Monday evening if
there is no objection. Soon after, I'll merge to branch-2 so this feature can
hopefully be included in the upcoming 2.4.0RC.

 * For the design, see [2].
 * For an amalgamated PR of the 5 or 6 reviewed PRs that comprise this
feature, see [3].
 * For a PE report that compared performance before and after, see
HBASE-25127 (no regression).
 * A report on ITBLL runs is pending to be attached to HBASE-18070 but runs
so far show no regression with the feature enabled (ITBLL runs were done
against a backport of this feature to branch-2 as the ITBLL state of master
is currently an unknown).

Testing continues, mainly looking for further improvement and to better
understand this feature in operation. Documentation is included but in need
of polish (working on it).

Dump any questions here and I'll be happy to respond. If you need more time
to review, just shout.

Thanks and thanks to all who contributed to this feature; the reviewers and
the testers in particular.

S

1. http://hbase.apache.org/book.html#_asnyc_wal_replication
2. https://docs.google.com/document/d/1jJWVc-idHhhgL4KDRpjMsQJKCl_NRaCLGiH3Wqwd3O8/edit#
This patch is currently missing HBASE-25280, a bug found in testing.
3. https://github.com/apache/hbase/pull/2643

Re: HEADS-UP: Merging HBASE-18070 "Enable memstore replication for meta replica" to master and then back to branch-2

Posted by Andrew Purtell <an...@gmail.com>.
Anyway I don’t wish to discuss this veto at length. What’s done is done. The merge is off for the 2.4 RC. It will proceed without this feature.

> On Nov 16, 2020, at 7:33 AM, Andrew Purtell <an...@gmail.com> wrote:
> 
> Your -1 throws a wrench into an agreement we had to merge this for 2.4. The 2.4 RC has been waiting for two weeks for this reason. It is a rather unfriendly act and frustrating to me as the 2.4 RM. I feel this needs to be said. Despite what you claim, the reason for hedged reads has been explained here and on the JIRA.
> 
>> On Nov 15, 2020, at 11:20 PM, 张铎 <pa...@gmail.com> wrote:
>> 
>> So what is your purpose in distributing region location lookup
>> requests? Is it just because you want to 'distribute the request of
>> region location lookup'?
>>
>> Then I'm -1 on merging. We should at least reach an agreement on what
>> we want to solve before merging.
>>
>> I've helped with this issue since the design doc stage. For me, the
>> purpose of this issue is clear. We want to prevent hotspotting on meta,
>> so the solution is simple: enable meta replicas, and then modify the
>> client to not always go to the primary replica first (this is the old
>> behavior even with the meta replica feature on).
>> This introduces another problem: there is no region replication
>> implementation for meta read replicas, which means the latency will be
>> large as we can only sync the data between replicas through region
>> flush. So we implemented meta region replication.
>>
>> So I think it is very important to verify that we have truly
>> distributed the region location lookup requests, and also to make sure
>> that we can support more region location lookup requests. Otherwise
>> this feature is useless.
>>
>> And I agree with Andrew that, since the feature is off by default on
>> branch-2 and has no regression, it is OK to merge for now. Theoretically
>> our approach here should work, so even if it does not work for now, I
>> think we can fix the problems to make it work.
>>
>> But your reply above made me wonder whether we are talking about the
>> same thing. That's why I'm -1 here. I'm not going to force you to do
>> the test I suggested; as I said, it could be done after merging. I just
>> want to reach an agreement on the goal of this feature.
>> 
>> Thanks.
>> 
>> On Mon, Nov 16, 2020 at 12:35 PM, Stack <st...@duboce.net> wrote:
>> 
>>>> On Sun, Nov 15, 2020 at 9:16 AM Andrew Purtell <an...@gmail.com>
>>>> wrote:
>>>> 
>>>> I agree with Duo’s comment that a performance gain is unlikely but would
>>>> be orthogonal anyway;
>>> 
>>> 
>>> Perf observation is just an aside in the issue. Perf is orthogonal as you
>>> say above (as long as no regression).
>>> 
>>> 
>>> 
>>>> it’s an availability gain that is the goal. We can assume it based on
>>>> theory of operation and unit test results but the gain should be tested
>>>> and measured on a cluster too.
>>>> 
>>> 
>>> 
>>> The feature is about distributing load on hbase:meta to alleviate
>>> hotspotting; it keeps read replicas more current, so replicas are more
>>> likely to satisfy location lookups, making read replicas more effective.
>>> That read replicas improve HA is presumed -- it was the original
>>> justification for this years-old commit -- but HA is not the focus of
>>> this addition; hence no reports on effectiveness in this area.
>>> 
>>> I have no problem working on such tests/reports but suggest that they are
>>> done post merge.
>>> 
>>> 
>>> 
>>>> That said, the results of the testing thus far indicate no regression,
>>>> which gives me confidence to support a merge. Specifically, a merge to
>>>> “unblock” 2.4 (we aren’t really blocked, we are waiting), provided the
>>>> feature is configured off by default there. But please indicate in
>>>> documentation and release notes that the feature is not widely tested
>>>> yet - as is customarily done for new functionality like this.
>>>> 
>>>> 
>>> No problem w/ flagging the feature as new.
>>> 
>>> Thanks,
>>> S
>>> 
>>> 
>>> 
>>>> 
>>>>>> On Nov 15, 2020, at 5:20 AM, 张铎 <pa...@gmail.com> wrote:
>>>>>> 
>>>>>> Replied on jira, I think we missed an important scenario when testing.
>>>>>> 
>>>>>> Thanks.
>>>>>> 

Re: HEADS-UP: Merging HBASE-18070 "Enable memstore replication for meta replica" to master and then back to branch-2

Posted by Andrew Purtell <an...@gmail.com>.
Your -1 throws a wrench into an agreement we had to merge this for 2.4. The 2.4 RC has been waiting for two weeks for this reason. It is a rather unfriendly act and frustrating to me as the 2.4 RM. I feel this needs to be said. Despite what you claim, the reason for hedged reads has been explained here and on the JIRA.


Re: HEADS-UP: Merging HBASE-18070 "Enable memstore replication for meta replica" to master and then back to branch-2

Posted by Huaxiang Sun <hu...@gmail.com>.
Folks,

    As I explained in the JIRA, until our load hits the region server's
limit, the throughput is going to be similar for primary-only and meta
replica LoadBalance mode. We are examining our test today, trying to push
up the load, and will report back.
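
A back-of-the-envelope model of the point above, for illustration only. It
assumes lookups spread evenly across the servers hosting hbase:meta replicas
and that each server saturates independently; the function name and all
numbers (10,000 lookups/s per server, three servers) are hypothetical, not
measurements.

```python
def served_lookups_per_sec(offered, per_server_capacity, servers):
    """Idealized throughput: everything offered is served until capacity runs out."""
    return min(offered, per_server_capacity * servers)


CAPACITY = 10_000  # hypothetical lookups/s one region server can sustain

for offered in (5_000, 10_000, 30_000):
    primary_only = served_lookups_per_sec(offered, CAPACITY, servers=1)
    load_balanced = served_lookups_per_sec(offered, CAPACITY, servers=3)
    print(offered, primary_only, load_balanced)
# prints:
# 5000 5000 5000
# 10000 10000 10000
# 30000 10000 30000
```

Under these assumptions the two modes are indistinguishable until the offered
load exceeds what a single server can sustain, which is why the test load has
to be pushed past that point before any difference can show up.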

    Best Regards,

    Huaxiang Sun




Re: HEADS-UP: Merging HBASE-18070 "Enable memstore replication for meta replica" to master and then back to branch-2

Posted by Andrew Purtell <ap...@apache.org>.
I see, sure we can try to resolve it.

We had two full weeks, from now until the end of the month, to get the RC
out. Further delay shortens that time frame, but I admit it is an arbitrary
target and shouldn't be the topmost concern given the interest of the
contributors as expressed on this thread ("Our group were hoping to throw our
shoulder behind 2.4 stabilizing so we could deploy it to production.").

I commented on the JIRA. Perhaps the difference in technical opinion comes
down to a replica preference policy alternative that can be resolved with
follow-up work, and that is the way forward.


On Mon, Nov 16, 2020 at 7:59 AM Stack <st...@duboce.net> wrote:

> On Mon, Nov 16, 2020 at 7:44 AM Andrew Purtell <ap...@apache.org>
> wrote:
>
> > My apologies, Stack, it's time to move on for 2.4. We can revisit this
> > for 2.5.
> >
> >
> One more day to allow Duo to reconsider (my fault for not making this a VOTE
> thread)?
>
> The work here is mostly that of others. It would be a shame it didn't land
> in 2.4 because of my representation. Our group were hoping to throw our
> shoulder behind 2.4 stabilizing so we could deploy it to production.
> Without this feature, we'll have to reconsider.
>
> Thanks Andrew,
> S
>
>
>
> > On Mon, Nov 16, 2020 at 7:41 AM Stack <st...@duboce.net> wrote:
> >
> > > On Sun, Nov 15, 2020 at 11:20 PM 张铎(Duo Zhang) <pa...@gmail.com>
> > > wrote:
> > >
> > > > So what is your purpose in distributing region location lookup
> > > > requests? Is it just because you want to 'distribute the request of
> > > > region location lookup'?
> > > >
> > > > Then I'm -1 on merging. We should at least reach an agreement on what
> > > > we want to solve before merging.
> > > >
> > > > I've helped with this issue since the design doc stage. For me, the
> > > > purpose of this issue is clear. We want to prevent hotspotting on meta,
> > > > so the solution is simple: enable meta replicas, and then modify the
> > > > client to not always go to the primary replica first (this is the old
> > > > behavior even with the meta replica feature on).
> > > > This introduces another problem: there is no region replication
> > > > implementation for meta read replicas, which means the latency will be
> > > > large as we can only sync the data between replicas through region
> > > > flush. So we implemented meta region replication.
> > > >
> > > > So I think it is very important to verify that we have truly
> > > > distributed the region location lookup requests, and also to make sure
> > > > that we can support more region location lookup requests. Otherwise
> > > > this feature is useless.
> > > >
> > > > And I agree with Andrew that, since the feature is off by default on
> > > > branch-2 and has no regression, it is OK to merge for now. Theoretically
> > > > our approach here should work, so even if it does not work for now, I
> > > > think we can fix the problems to make it work.
> > > >
> > > >
> > > Please undo your -1. We can work on differing understandings in JIRA
> > > while I work on the report you suggested and while 2.4.0RC proceeds.
> > >
> > > S
> > >
> > >
> > >
> > > > But your reply above made me wonder whether we are talking about the
> > > > same thing. That's why I'm -1 here. I'm not going to force you to do
> > > > the test I suggested; as I said, it could be done after merging. I just
> > > > want to reach an agreement on the goal of this feature.
> > > >
> > > > Thanks.
> > > >
> > > > Stack <st...@duboce.net> 于2020年11月16日周一 下午12:35写道:
> > > >
> > > > > On Sun, Nov 15, 2020 at 9:16 AM Andrew Purtell <
> > > andrew.purtell@gmail.com
> > > > >
> > > > > wrote:
> > > > >
> > > > > > I agree with Duo’s comment that a performance gain is unlikely
> but
> > > > would
> > > > > > be orthogonal anyway;
> > > > >
> > > > >
> > > > > Perf observation is just an aside in the issue. Perf is orthogonal
> as
> > > you
> > > > > say above (as long as no regression).
> > > > >
> > > > >
> > > > >
> > > > > > it’s an availability gain that is the goal. We can assume it
> based
> > on
> > > > > > theory of operation and unit test results but the gain should be
> > > tested
> > > > > and
> > > > > > measured on a cluster too.
> > > > > >
> > > > >
> > > > >
> > > > > The feature is about distributing load on hbase:meta to alleviate
> > > > > hotspotting; it makes read replicas more live so replicas are more
> > > likely
> > > > > to satisfy location lookups making read replicas more effective.
> That
> > > > read
> > > > > replicas improve HA is presumed -- it was the original
> justification
> > > for
> > > > > this years old commit -- but HA is not the focus of this addition;
> > > hence
> > > > no
> > > > > reports on effectiveness in this area.
> > > > >
> > > > > I have no problem working on such tests/reports but suggest that
> they
> > > are
> > > > > done post merge.
> > > > >
> > > > >
> > > > >
> > > > > > That said, the results of the testing thus far indicate no
> > > regression,
> > > > > > which gives me confidence to support a merge. Specifically, a
> merge
> > > to
> > > > > > “unblock” 2.4 (we aren’t really blocked, we are waiting),
> provided
> > > the
> > > > > > default there is the feature is configured off. But please
> indicate
> > > in
> > > > > > documentation and release notes that the feature is not widely
> > tested
> > > > > yet -
> > > > > > as is customarily done for new functionality like this.
> > > > > >
> > > > > >
> > > > > No problem w/ flagging the feature as new.
> > > > >
> > > > > Thanks,
> > > > > S
> > > > >
> > > > >
> > > > >
> > > > > >
> > > > > > > On Nov 15, 2020, at 5:20 AM, 张铎 <pa...@gmail.com> wrote:
> > > > > > >
> > > > > > > Replied on jira, I think we missed an important scenario when
> > > > testing.
> > > > > > >
> > > > > > > Thanks.
> > > > > > >
> > > > > > > > On Sun, Nov 15, 2020 at 2:30 AM Stack <st...@duboce.net> wrote:
> > > > > > >
> > > > > > >> HBASE-18070 makes it so hbase:meta read replicas can run
> closer
> > to
> > > > the
> > > > > > >> primary, (< second lags rather than minutes). It adds Async
> WAL
> > > > > > >> Replication[1] on the hbase:meta table; i.e. edits are sprayed
> > > > across
> > > > > > >> replicas as they arrive at the primary's WAL. Before this
> work,
> > > > Async
> > > > > > WAL
> > > > > > >> Replication was only available on user-space tables and the
> only
> > > > > option
> > > > > > for
> > > > > > >> hbase:meta read replicas was reloading the primaries hfiles
> on a
> > > > > period
> > > > > > >> (minutes). HBASE-18070 also adds an optional client-side
> > > > 'LoadBalance'
> > > > > > >> policy that favors read replicas ahead of primary reads
> falling
> > > back
> > > > > to
> > > > > > the
> > > > > > >> primary on fault. Together, these additions allow distributing
> > > > > > hbase:meta
> > > > > > >> read load across primary and replicas alleviating
> 'hotspotting'.
> > > > > > >>
> > > > > > >> I would like to merge the feature to master branch Monday
> > evening
> > > if
> > > > > > there
> > > > > > >> is no objection (Soon after I'll merge to branch-2 so this
> > feature
> > > > can
> > > > > > >> hopefully be included in the upcoming 2.4.0RC).
> > > > > > >>
> > > > > > >> * For the design, see [2].
> > > > > > >> * For an amalgamated PR of the 5 or 6 reviewed PRs that
> comprise
> > > > this
> > > > > > >> feature, see [3].
> > > > > > >> * For a PE report that compared performance before and after,
> > see
> > > > > > >> HBASE-25127 (no regression).
> > > > > > >> * A report on ITBLL runs is pending to be attached to
> > HBASE-18070
> > > > but
> > > > > > runs
> > > > > > >> so far show no regression with the feature enabled (ITBLL runs
> > > were
> > > > > done
> > > > > > >> against a backport of this feature to branch-2 as the ITBLL
> > state
> > > of
> > > > > > master
> > > > > > >> is currently an unknown).
> > > > > > >>
> > > > > > >> Testing continues mainly looking for further improvement and
> to
> > > > better
> > > > > > >> understand this feature in operation. Documentation is
> included
> > > but
> > > > in
> > > > > > need
> > > > > > >> of polish (working on it).
> > > > > > >>
> > > > > > >> Dump any questions here and I'll be happy to respond. If you
> > need
> > > > more
> > > > > > time
> > > > > > >> to review, just shout.
> > > > > > >>
> > > > > > >> Thanks and thanks to all who contributed to this feature; the
> > > > > reviewers
> > > > > > and
> > > > > > >> the testers in particular.
> > > > > > >>
> > > > > > >> S
> > > > > > >>
> > > > > > >> 1. http://hbase.apache.org/book.html#_asnyc_wal_replication
> > > > > > >> 2.
> > > > > > >>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1jJWVc-idHhhgL4KDRpjMsQJKCl_NRaCLGiH3Wqwd3O8/edit#
> > > > > > >> This patch is currently missing HBASE-25280, a bug found in
> > > testing.
> > > > > > >> 3. https://github.com/apache/hbase/pull/2643
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> >
> > --
> > Best regards,
> > Andrew
> >
> > Words like orphans lost among the crosstalk, meaning torn from truth's
> > decrepit hands
> >    - A23, Crosstalk
> >
>


-- 
Best regards,
Andrew

Words like orphans lost among the crosstalk, meaning torn from truth's
decrepit hands
   - A23, Crosstalk

Re: HEAD-UP: Merging HBASE-18070 "Enable memstore replication for meta replica" to master and then back to branch-2

Posted by Stack <st...@duboce.net>.
On Mon, Nov 16, 2020 at 7:44 AM Andrew Purtell <ap...@apache.org> wrote:

> My apologies, Stack, it's time to move on for 2.4. We can revisit this for
> 2.5.
>
>
One more day to allow Duo to reconsider (my fault for not making this a VOTE
thread)?

The work here is mostly that of others. It would be a shame if it didn't land
in 2.4 because of my representation. Our group was hoping to put our
shoulder behind stabilizing 2.4 so we could deploy it to production.
Without this feature, we'll have to reconsider.

Thanks Andrew,
S



> On Mon, Nov 16, 2020 at 7:41 AM Stack <st...@duboce.net> wrote:
>
> > On Sun, Nov 15, 2020 at 11:20 PM 张铎(Duo Zhang) <pa...@gmail.com>
> > wrote:
> >
> > > So what is your purpose of distributing the request of region location
> > > lookup? It is just because you want to 'distribute the request of
> region
> > > location lookup'?
> > >
> > > Then I'm -1 on merging. We should reach an agreement on what we want to
> > > solve before merging at least.
> > >
> > > I've helped this issue from the design doc step. For me, the purpose
> for
> > > this issue is clear. We want to prevent the hotspot of meta, so the
> > > solution is simple, enable meta replica, and then just modify the
> client
> > to
> > > not always go to primary replica first(this is the old behavior even
> with
> > > meta replica feature on).
> > > And this will introduce another problem that, there is no meta region
> > > replication implementation for meta read replicas, which means the
> > latency
> > > will be large as we can only sync the data between replicas through
> > region
> > > flush, so we implement meta region replication.
> > >
> > > So I think it is very important to verify that we have truly
> distributed
> > > the request of region location lookup, and also make sure that we could
> > > support more requests of region location lookup. Otherwise this feature
> > is
> > > useless.
> > >
> > > And I agree with Andrew that, since the feature is default off on
> > branch-2
> > > and has no regression, it is OK to merge for now. Theoretically our
> > > approach here should work, so even it does not work for now, I think we
> > > could fix the problems to make it work.
> > >
> > >
> > Please undo your -1. We can work on differing understandings in JIRA
> while
> > I work on the report you suggested and while 2.4.0RC proceeds.
> >
> > S
> >
> >
> >
> > > But your reply above made me wonder whether we are talking about the
> same
> > > thing. That's why I'm -1 here. I'm not going to force you to do the
> test
> > > suggested by me, as I said it could be done after merging, just want to
> > > reach an agreement on the goal of this feature.
> > >
> > > Thanks.
> > >
> > > On Mon, Nov 16, 2020 at 12:35 PM Stack <st...@duboce.net> wrote:
> > >
> > > > On Sun, Nov 15, 2020 at 9:16 AM Andrew Purtell <
> > andrew.purtell@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > I agree with Duo’s comment that a performance gain is unlikely but
> > > would
> > > > > be orthogonal anyway;
> > > >
> > > >
> > > > Perf observation is just an aside in the issue. Perf is orthogonal as
> > you
> > > > say above (as long as no regression).
> > > >
> > > >
> > > >
> > > > > it’s an availability gain that is the goal. We can assume it based
> on
> > > > > theory of operation and unit test results but the gain should be
> > tested
> > > > and
> > > > > measured on a cluster too.
> > > > >
> > > >
> > > >
> > > > The feature is about distributing load on hbase:meta to alleviate
> > > > hotspotting; it makes read replicas more live so replicas are more
> > likely
> > > > to satisfy location lookups making read replicas more effective. That
> > > read
> > > > replicas improve HA is presumed -- it was the original justification
> > for
> > > > this years old commit -- but HA is not the focus of this addition;
> > hence
> > > no
> > > > reports on effectiveness in this area.
> > > >
> > > > I have no problem working on such tests/reports but suggest that they
> > are
> > > > done post merge.
> > > >
> > > >
> > > >
> > > > > That said, the results of the testing thus far indicate no
> > regression,
> > > > > which gives me confidence to support a merge. Specifically, a merge
> > to
> > > > > “unblock” 2.4 (we aren’t really blocked, we are waiting), provided
> > the
> > > > > default there is the feature is configured off. But please indicate
> > in
> > > > > documentation and release notes that the feature is not widely
> tested
> > > > yet -
> > > > > as is customarily done for new functionality like this.
> > > > >
> > > > >
> > > > No problem w/ flagging the feature as new.
> > > >
> > > > Thanks,
> > > > S
> > > >
> > > >
> > > >
> > > > >
> > > > > > On Nov 15, 2020, at 5:20 AM, 张铎 <pa...@gmail.com> wrote:
> > > > > >
> > > > > > Replied on jira, I think we missed an important scenario when
> > > testing.
> > > > > >
> > > > > > Thanks.
> > > > > >
> > > > > > On Sun, Nov 15, 2020 at 2:30 AM Stack <st...@duboce.net> wrote:
> > > > > >
> > > > > >> HBASE-18070 makes it so hbase:meta read replicas can run closer
> to
> > > the
> > > > > >> primary, (< second lags rather than minutes). It adds Async WAL
> > > > > >> Replication[1] on the hbase:meta table; i.e. edits are sprayed
> > > across
> > > > > >> replicas as they arrive at the primary's WAL. Before this work,
> > > Async
> > > > > WAL
> > > > > >> Replication was only available on user-space tables and the only
> > > > option
> > > > > for
> > > > > >> hbase:meta read replicas was reloading the primaries hfiles on a
> > > > period
> > > > > >> (minutes). HBASE-18070 also adds an optional client-side
> > > 'LoadBalance'
> > > > > >> policy that favors read replicas ahead of primary reads falling
> > back
> > > > to
> > > > > the
> > > > > >> primary on fault. Together, these additions allow distributing
> > > > > hbase:meta
> > > > > >> read load across primary and replicas alleviating 'hotspotting'.
> > > > > >>
> > > > > >> I would like to merge the feature to master branch Monday
> evening
> > if
> > > > > there
> > > > > >> is no objection (Soon after I'll merge to branch-2 so this
> feature
> > > can
> > > > > >> hopefully be included in the upcoming 2.4.0RC).
> > > > > >>
> > > > > >> * For the design, see [2].
> > > > > >> * For an amalgamated PR of the 5 or 6 reviewed PRs that comprise
> > > this
> > > > > >> feature, see [3].
> > > > > >> * For a PE report that compared performance before and after,
> see
> > > > > >> HBASE-25127 (no regression).
> > > > > >> * A report on ITBLL runs is pending to be attached to
> HBASE-18070
> > > but
> > > > > runs
> > > > > >> so far show no regression with the feature enabled (ITBLL runs
> > were
> > > > done
> > > > > >> against a backport of this feature to branch-2 as the ITBLL
> state
> > of
> > > > > master
> > > > > >> is currently an unknown).
> > > > > >>
> > > > > >> Testing continues mainly looking for further improvement and to
> > > better
> > > > > >> understand this feature in operation. Documentation is included
> > but
> > > in
> > > > > need
> > > > > >> of polish (working on it).
> > > > > >>
> > > > > >> Dump any questions here and I'll be happy to respond. If you
> need
> > > more
> > > > > time
> > > > > >> to review, just shout.
> > > > > >>
> > > > > >> Thanks and thanks to all who contributed to this feature; the
> > > > reviewers
> > > > > and
> > > > > >> the testers in particular.
> > > > > >>
> > > > > >> S
> > > > > >>
> > > > > >> 1. http://hbase.apache.org/book.html#_asnyc_wal_replication
> > > > > >> 2.
> > > > > >>
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1jJWVc-idHhhgL4KDRpjMsQJKCl_NRaCLGiH3Wqwd3O8/edit#
> > > > > >> This patch is currently missing HBASE-25280, a bug found in
> > testing.
> > > > > >> 3. https://github.com/apache/hbase/pull/2643
> > > > > >>
> > > > >
> > > >
> > >
> >
>
>
>

Re: HEAD-UP: Merging HBASE-18070 "Enable memstore replication for meta replica" to master and then back to branch-2

Posted by Andrew Purtell <ap...@apache.org>.
My apologies, Stack, it's time to move on for 2.4. We can revisit this for
2.5.

On Mon, Nov 16, 2020 at 7:41 AM Stack <st...@duboce.net> wrote:

> On Sun, Nov 15, 2020 at 11:20 PM 张铎(Duo Zhang) <pa...@gmail.com>
> wrote:
>
> > So what is your purpose of distributing the request of region location
> > lookup? It is just because you want to 'distribute the request of region
> > location lookup'?
> >
> > Then I'm -1 on merging. We should reach an agreement on what we want to
> > solve before merging at least.
> >
> > I've helped this issue from the design doc step. For me, the purpose for
> > this issue is clear. We want to prevent the hotspot of meta, so the
> > solution is simple, enable meta replica, and then just modify the client
> to
> > not always go to primary replica first(this is the old behavior even with
> > meta replica feature on).
> > And this will introduce another problem that, there is no meta region
> > replication implementation for meta read replicas, which means the
> latency
> > will be large as we can only sync the data between replicas through
> region
> > flush, so we implement meta region replication.
> >
> > So I think it is very important to verify that we have truly distributed
> > the request of region location lookup, and also make sure that we could
> > support more requests of region location lookup. Otherwise this feature
> is
> > useless.
> >
> > And I agree with Andrew that, since the feature is default off on
> branch-2
> > and has no regression, it is OK to merge for now. Theoretically our
> > approach here should work, so even it does not work for now, I think we
> > could fix the problems to make it work.
> >
> >
> Please undo your -1. We can work on differing understandings in JIRA while
> I work on the report you suggested and while 2.4.0RC proceeds.
>
> S
>
>
>
> > But your reply above made me wonder whether we are talking about the same
> > thing. That's why I'm -1 here. I'm not going to force you to do the test
> > suggested by me, as I said it could be done after merging, just want to
> > reach an agreement on the goal of this feature.
> >
> > Thanks.
> >
> > On Mon, Nov 16, 2020 at 12:35 PM Stack <st...@duboce.net> wrote:
> >
> > > On Sun, Nov 15, 2020 at 9:16 AM Andrew Purtell <
> andrew.purtell@gmail.com
> > >
> > > wrote:
> > >
> > > > I agree with Duo’s comment that a performance gain is unlikely but
> > would
> > > > be orthogonal anyway;
> > >
> > >
> > > Perf observation is just an aside in the issue. Perf is orthogonal as
> you
> > > say above (as long as no regression).
> > >
> > >
> > >
> > > > it’s an availability gain that is the goal. We can assume it based on
> > > > theory of operation and unit test results but the gain should be
> tested
> > > and
> > > > measured on a cluster too.
> > > >
> > >
> > >
> > > The feature is about distributing load on hbase:meta to alleviate
> > > hotspotting; it makes read replicas more live so replicas are more
> likely
> > > to satisfy location lookups making read replicas more effective. That
> > read
> > > replicas improve HA is presumed -- it was the original justification
> for
> > > this years old commit -- but HA is not the focus of this addition;
> hence
> > no
> > > reports on effectiveness in this area.
> > >
> > > I have no problem working on such tests/reports but suggest that they
> are
> > > done post merge.
> > >
> > >
> > >
> > > > That said, the results of the testing thus far indicate no
> regression,
> > > > which gives me confidence to support a merge. Specifically, a merge
> to
> > > > “unblock” 2.4 (we aren’t really blocked, we are waiting), provided
> the
> > > > default there is the feature is configured off. But please indicate
> in
> > > > documentation and release notes that the feature is not widely tested
> > > yet -
> > > > as is customarily done for new functionality like this.
> > > >
> > > >
> > > No problem w/ flagging the feature as new.
> > >
> > > Thanks,
> > > S
> > >
> > >
> > >
> > > >
> > > > > On Nov 15, 2020, at 5:20 AM, 张铎 <pa...@gmail.com> wrote:
> > > > >
> > > > > Replied on jira, I think we missed an important scenario when
> > testing.
> > > > >
> > > > > Thanks.
> > > > >
> > > > > On Sun, Nov 15, 2020 at 2:30 AM Stack <st...@duboce.net> wrote:
> > > > >
> > > > >> HBASE-18070 makes it so hbase:meta read replicas can run closer to
> > the
> > > > >> primary, (< second lags rather than minutes). It adds Async WAL
> > > > >> Replication[1] on the hbase:meta table; i.e. edits are sprayed
> > across
> > > > >> replicas as they arrive at the primary's WAL. Before this work,
> > Async
> > > > WAL
> > > > >> Replication was only available on user-space tables and the only
> > > option
> > > > for
> > > > >> hbase:meta read replicas was reloading the primaries hfiles on a
> > > period
> > > > >> (minutes). HBASE-18070 also adds an optional client-side
> > 'LoadBalance'
> > > > >> policy that favors read replicas ahead of primary reads falling
> back
> > > to
> > > > the
> > > > >> primary on fault. Together, these additions allow distributing
> > > > hbase:meta
> > > > >> read load across primary and replicas alleviating 'hotspotting'.
> > > > >>
> > > > >> I would like to merge the feature to master branch Monday evening
> if
> > > > there
> > > > >> is no objection (Soon after I'll merge to branch-2 so this feature
> > can
> > > > >> hopefully be included in the upcoming 2.4.0RC).
> > > > >>
> > > > >> * For the design, see [2].
> > > > >> * For an amalgamated PR of the 5 or 6 reviewed PRs that comprise
> > this
> > > > >> feature, see [3].
> > > > >> * For a PE report that compared performance before and after, see
> > > > >> HBASE-25127 (no regression).
> > > > >> * A report on ITBLL runs is pending to be attached to HBASE-18070
> > but
> > > > runs
> > > > >> so far show no regression with the feature enabled (ITBLL runs
> were
> > > done
> > > > >> against a backport of this feature to branch-2 as the ITBLL state
> of
> > > > master
> > > > >> is currently an unknown).
> > > > >>
> > > > >> Testing continues mainly looking for further improvement and to
> > better
> > > > >> understand this feature in operation. Documentation is included
> but
> > in
> > > > need
> > > > >> of polish (working on it).
> > > > >>
> > > > >> Dump any questions here and I'll be happy to respond. If you need
> > more
> > > > time
> > > > >> to review, just shout.
> > > > >>
> > > > >> Thanks and thanks to all who contributed to this feature; the
> > > reviewers
> > > > and
> > > > >> the testers in particular.
> > > > >>
> > > > >> S
> > > > >>
> > > > >> 1. http://hbase.apache.org/book.html#_asnyc_wal_replication
> > > > >> 2.
> > > > >>
> > > > >>
> > > >
> > >
> >
> https://docs.google.com/document/d/1jJWVc-idHhhgL4KDRpjMsQJKCl_NRaCLGiH3Wqwd3O8/edit#
> > > > >> This patch is currently missing HBASE-25280, a bug found in
> testing.
> > > > >> 3. https://github.com/apache/hbase/pull/2643
> > > > >>
> > > >
> > >
> >
>


-- 
Best regards,
Andrew

Words like orphans lost among the crosstalk, meaning torn from truth's
decrepit hands
   - A23, Crosstalk

Re: HEAD-UP: Merging HBASE-18070 "Enable memstore replication for meta replica" to master and then back to branch-2

Posted by Stack <st...@duboce.net>.
On Sun, Nov 15, 2020 at 11:20 PM 张铎(Duo Zhang) <pa...@gmail.com>
wrote:

> So what is your purpose of distributing the request of region location
> lookup? It is just because you want to 'distribute the request of region
> location lookup'?
>
> Then I'm -1 on merging. We should reach an agreement on what we want to
> solve before merging at least.
>
> I've helped this issue from the design doc step. For me, the purpose for
> this issue is clear. We want to prevent the hotspot of meta, so the
> solution is simple, enable meta replica, and then just modify the client to
> not always go to primary replica first(this is the old behavior even with
> meta replica feature on).
> And this will introduce another problem that, there is no meta region
> replication implementation for meta read replicas, which means the latency
> will be large as we can only sync the data between replicas through region
> flush, so we implement meta region replication.
>
> So I think it is very important to verify that we have truly distributed
> the request of region location lookup, and also make sure that we could
> support more requests of region location lookup. Otherwise this feature is
> useless.
>
> And I agree with Andrew that, since the feature is default off on branch-2
> and has no regression, it is OK to merge for now. Theoretically our
> approach here should work, so even it does not work for now, I think we
> could fix the problems to make it work.
>
>
Please undo your -1. We can work on differing understandings in JIRA while
I work on the report you suggested and while 2.4.0RC proceeds.

S



> But your reply above made me wonder whether we are talking about the same
> thing. That's why I'm -1 here. I'm not going to force you to do the test
> suggested by me, as I said it could be done after merging, just want to
> reach an agreement on the goal of this feature.
>
> Thanks.
>
> On Mon, Nov 16, 2020 at 12:35 PM Stack <st...@duboce.net> wrote:
>
> > On Sun, Nov 15, 2020 at 9:16 AM Andrew Purtell <andrew.purtell@gmail.com
> >
> > wrote:
> >
> > > I agree with Duo’s comment that a performance gain is unlikely but
> would
> > > be orthogonal anyway;
> >
> >
> > Perf observation is just an aside in the issue. Perf is orthogonal as you
> > say above (as long as no regression).
> >
> >
> >
> > > it’s an availability gain that is the goal. We can assume it based on
> > > theory of operation and unit test results but the gain should be tested
> > and
> > > measured on a cluster too.
> > >
> >
> >
> > The feature is about distributing load on hbase:meta to alleviate
> > hotspotting; it makes read replicas more live so replicas are more likely
> > to satisfy location lookups making read replicas more effective. That
> read
> > replicas improve HA is presumed -- it was the original justification for
> > this years old commit -- but HA is not the focus of this addition; hence
> no
> > reports on effectiveness in this area.
> >
> > I have no problem working on such tests/reports but suggest that they are
> > done post merge.
> >
> >
> >
> > > That said, the results of the testing thus far indicate no regression,
> > > which gives me confidence to support a merge. Specifically, a merge to
> > > “unblock” 2.4 (we aren’t really blocked, we are waiting), provided the
> > > default there is the feature is configured off. But please indicate in
> > > documentation and release notes that the feature is not widely tested
> > yet -
> > > as is customarily done for new functionality like this.
> > >
> > >
> > No problem w/ flagging the feature as new.
> >
> > Thanks,
> > S
> >
> >
> >
> > >
> > > > On Nov 15, 2020, at 5:20 AM, 张铎 <pa...@gmail.com> wrote:
> > > >
> > > > Replied on jira, I think we missed an important scenario when
> testing.
> > > >
> > > > Thanks.
> > > >
> > > > On Sun, Nov 15, 2020 at 2:30 AM Stack <st...@duboce.net> wrote:
> > > >
> > > >> HBASE-18070 makes it so hbase:meta read replicas can run closer to
> the
> > > >> primary, (< second lags rather than minutes). It adds Async WAL
> > > >> Replication[1] on the hbase:meta table; i.e. edits are sprayed
> across
> > > >> replicas as they arrive at the primary's WAL. Before this work,
> Async
> > > WAL
> > > >> Replication was only available on user-space tables and the only
> > option
> > > for
> > > >> hbase:meta read replicas was reloading the primaries hfiles on a
> > period
> > > >> (minutes). HBASE-18070 also adds an optional client-side
> 'LoadBalance'
> > > >> policy that favors read replicas ahead of primary reads falling back
> > to
> > > the
> > > >> primary on fault. Together, these additions allow distributing
> > > hbase:meta
> > > >> read load across primary and replicas alleviating 'hotspotting'.
> > > >>
> > > >> I would like to merge the feature to master branch Monday evening if
> > > there
> > > >> is no objection (Soon after I'll merge to branch-2 so this feature
> can
> > > >> hopefully be included in the upcoming 2.4.0RC).
> > > >>
> > > >> * For the design, see [2].
> > > >> * For an amalgamated PR of the 5 or 6 reviewed PRs that comprise
> this
> > > >> feature, see [3].
> > > >> * For a PE report that compared performance before and after, see
> > > >> HBASE-25127 (no regression).
> > > >> * A report on ITBLL runs is pending to be attached to HBASE-18070
> but
> > > runs
> > > >> so far show no regression with the feature enabled (ITBLL runs were
> > done
> > > >> against a backport of this feature to branch-2 as the ITBLL state of
> > > master
> > > >> is currently an unknown).
> > > >>
> > > >> Testing continues mainly looking for further improvement and to
> better
> > > >> understand this feature in operation. Documentation is included but
> in
> > > need
> > > >> of polish (working on it).
> > > >>
> > > >> Dump any questions here and I'll be happy to respond. If you need
> more
> > > time
> > > >> to review, just shout.
> > > >>
> > > >> Thanks and thanks to all who contributed to this feature; the
> > reviewers
> > > and
> > > >> the testers in particular.
> > > >>
> > > >> S
> > > >>
> > > >> 1. http://hbase.apache.org/book.html#_asnyc_wal_replication
> > > >> 2.
> > > >>
> > > >>
> > >
> >
> https://docs.google.com/document/d/1jJWVc-idHhhgL4KDRpjMsQJKCl_NRaCLGiH3Wqwd3O8/edit#
> > > >> This patch is currently missing HBASE-25280, a bug found in testing.
> > > >> 3. https://github.com/apache/hbase/pull/2643
> > > >>
> > >
> >
>

Re: HEAD-UP: Merging HBASE-18070 "Enable memstore replication for meta replica" to master and then back to branch-2

Posted by Stack <st...@duboce.net>.
The VOTE on the adjacent thread has passed.

Will rebase and rerun the hadoopqa check to be sure all is good, and then
merge the current unaltered state of branch HBASE-18070 to the master
branch (this afternoon if all goes well).

I will then work on doing the same for branch-2 merging
HBASE-18070.branch-2 (contingent on how the master merge goes).

Will work on the matter of landing outstanding design doc edits
concurrently (Andrew, if ok, please hold on the RC until this is ironed out
-- thanks).

S

On Tue, Nov 17, 2020 at 9:13 AM Stack <st...@duboce.net> wrote:

> I've started an adjacent VOTE thread in an attempt at clarity of
> how-to-go-forward here.
> Thanks,
> S
>
> On Tue, Nov 17, 2020 at 7:56 AM Andrew Purtell <an...@gmail.com>
> wrote:
>
>> Hi Duo,
>>
>> Just to be clear: You are saying go ahead with the merge, but then also
>> go back and start this discussion fresh, to see if anything was missed and
>> more can be done?
>>
>> > On Nov 16, 2020, at 11:25 PM, 张铎 <pa...@gmail.com> wrote:
>> >
>> > Oh, this is my fault. I meant that the old behavior IS to go to the
>> > primary replica first, which is what we want to change here.
>> >
>> > What I commented on JIRA was that we do not need a performance
>> > improvement before merging; it is not the goal of this issue. I also
>> > suggested that if we want to show our advantage, we need to get the
>> > primary replica fucked up. I do not know why the discussion then went
>> > to the HedgeRead, and I could not pull it back. I do not think this
>> > should block the merge, but it was still very hard to communicate, so I
>> > assumed we still have a big gap on what we want to solve here, and
>> > thus I voted -1.
>> >
>> > I think we need to go back to the beginning to reach an agreement on
>> > the goal here. Let’s review the design doc again to see if we missed
>> > something that led us to this situation.
>> >
>> > I also need to say that I do not want to block the issue from being
>> > merged. I tried my best to speed up the process. I suggested landing
>> > the client-side changes on master directly, but that was refused. I
>> > helped add the scan-on-specific-replica feature to branch-2 quickly so
>> > that the port to branch-2 could land cleanly.
>> >
>> > I am on a mobile device, so I cannot review the code or PR. I am very
>> > busy these days, and the health examination this morning told me that I
>> > have high blood pressure. Not a good birthday present. I will get back
>> > to the issue when possible.
>> >
>> > Thanks.
>> >
>> > On Tue, Nov 17, 2020 at 6:34 AM Stack <st...@duboce.net> wrote:
>> >
>> >>> On Sun, Nov 15, 2020 at 11:20 PM 张铎(Duo Zhang) <palomino219@gmail.com
>> >
>> >>> wrote:
>> >>>
>> >>> So what is your purpose of distributing the request of region location
>> >>> lookup? It is just because you want to 'distribute the request of
>> region
>> >>> location lookup'?
>> >>>
>> >>> Then I'm -1 on merging. We should reach an agreement on what we want
>> to
>> >>> solve before merging at least.
>> >>>
>> >>>
>> >> HERE.1
>> >>
>> >>
>> >>> I've helped this issue from the design doc step. For me, the purpose
>> for
>> >>> this issue is clear. We want to prevent the hotspot of meta, so the
>> >>> solution is simple, enable meta replica, and then just modify the
>> client
>> >> to
>> >>> not always go to primary replica first(this is the old behavior even
>> with
>> >>> meta replica feature on).
>> >>> And this will introduce another problem that, there is no meta region
>> >>> replication implementation for meta read replicas, which means the
>> >> latency
>> >>> will be large as we can only sync the data between replicas through
>> >> region
>> >>> flush, so we implement meta region replication.
>> >>>
>> >>> So I think it is very important to verify that we have truly
>> distributed
>> >>> the request of region location lookup, and also make sure that we
>> could
>> >>> support more requests of region location lookup. Otherwise this
>> feature
>> >> is
>> >>> useless.
>> >>>
>> >>> And I agree with Andrew that, since the feature is default off on
>> >> branch-2
>> >>> and has no regression, it is OK to merge for now. Theoretically our
>> >>> approach here should work, so even it does not work for now, I think
>> we
>> >>> could fix the problems to make it work.
>> >>>
>> >>>
>> >> HERE.2
>> >>
>> >> I agree with all of the above between HERE.1 and HERE.2 (except the
>> >> suggestion that the old behavior of read replicas is that they went to
>> >> the replica first; they go to the primary first -- see [1], [2]).
>> >>
>> >> Let's work through any misalignment of understanding/communication
>> >> offline and not in the way of the merge.
>> >>
>> >> Thanks,
>> >> S
>> >>
>> >> 1. http://hbase.apache.org/book.html#_timeline_consistency "In case a
>> >> read is performed with Consistency.TIMELINE, then the read RPC will be
>> >> sent to the primary region server first."
>> >> 2. https://github.com/apache/hbase/blob/branch-2/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ScannerCallableWithReplicas.java#L195
>> >>
>> >>
>> >>
>> >>> But your reply above made me wonder whether we are talking about the
>> same
>> >>> thing. That's why I'm -1 here. I'm not going to force you to do the
>> test
>> >>> suggested by me, as I said it could be done after merging, just want
>> to
>> >>> reach an agreement on the goal of this feature.
>> >>>
>> >>> Thanks.
>> >>>
>> >>> On Mon, Nov 16, 2020 at 12:35 PM Stack <st...@duboce.net> wrote:
>> >>>
>> >>>> On Sun, Nov 15, 2020 at 9:16 AM Andrew Purtell <
>> >> andrew.purtell@gmail.com
>> >>>>
>> >>>> wrote:
>> >>>>
>> >>>>> I agree with Duo’s comment that a performance gain is unlikely but
>> >>> would
>> >>>>> be orthogonal anyway;
>> >>>>
>> >>>>
>> >>>> Perf observation is just an aside in the issue. Perf is orthogonal as
>> >> you
>> >>>> say above (as long as no regression).
>> >>>>
>> >>>>
>> >>>>
>> >>>>> it’s an availability gain that is the goal. We can assume it based
>> on
>> >>>>> theory of operation and unit test results but the gain should be
>> >> tested
>> >>>> and
>> >>>>> measured on a cluster too.
>> >>>>>
>> >>>>
>> >>>>
>> >>>> The feature is about distributing load on hbase:meta to alleviate
>> >>>> hotspotting; it makes read replicas more live so replicas are more
>> >> likely
>> >>>> to satisfy location lookups making read replicas more effective. That
>> >>> read
>> >>>> replicas improve HA is presumed -- it was the original justification
>> >> for
>> >>>> this years old commit -- but HA is not the focus of this addition;
>> >> hence
>> >>> no
>> >>>> reports on effectiveness in this area.
>> >>>>
>> >>>> I have no problem working on such tests/reports but suggest that they
>> >> are
>> >>>> done post merge.
>> >>>>
>> >>>>
>> >>>>
>> >>>>> That said, the results of the testing thus far indicate no
>> >> regression,
>> >>>>> which gives me confidence to support a merge. Specifically, a merge
>> >> to
>> >>>>> “unblock” 2.4 (we aren’t really blocked, we are waiting), provided
>> >> the
>> >>>>> default there is the feature is configured off. But please indicate
>> >> in
>> >>>>> documentation and release notes that the feature is not widely
>> tested
>> >>>> yet -
>> >>>>> as is customarily done for new functionality like this.
>> >>>>>
>> >>>>>
>> >>>> No problem w/ flagging the feature as new.
>> >>>>
>> >>>> Thanks,
>> >>>> S
>> >>>>
>> >>>>
>> >>>>
>> >>>>>
>> >>>>>> On Nov 15, 2020, at 5:20 AM, 张铎 <pa...@gmail.com> wrote:
>> >>>>>>
>> >>>>>> Replied on jira, I think we missed an important scenario when
>> >>> testing.
>> >>>>>>
>> >>>>>> Thanks.
>> >>>>>>
>> >>>>>> Stack <st...@duboce.net> wrote on Sun, Nov 15, 2020 at 2:30 AM:
>> >>>>>>
>> >>>>>>> HBASE-18070 makes it so hbase:meta read replicas can run closer to
>> >>> the
>> >>>>>>> primary, (< second lags rather than minutes). It adds Async WAL
>> >>>>>>> Replication[1] on the hbase:meta table; i.e. edits are sprayed
>> >>> across
>> >>>>>>> replicas as they arrive at the primary's WAL. Before this work,
>> >>> Async
>> >>>>> WAL
>> >>>>>>> Replication was only available on user-space tables and the only
>> >>>> option
>> >>>>> for
>> >>>>>>> hbase:meta read replicas was reloading the primaries hfiles on a
>> >>>> period
>> >>>>>>> (minutes). HBASE-18070 also adds an optional client-side
>> >>> 'LoadBalance'
>> >>>>>>> policy that favors read replicas ahead of primary reads falling
>> >> back
>> >>>> to
>> >>>>> the
>> >>>>>>> primary on fault. Together, these additions allow distributing
>> >>>>> hbase:meta
>> >>>>>>> read load across primary and replicas alleviating 'hotspotting'.
>> >>>>>>>
>> >>>>>>> I would like to merge the feature to master branch Monday evening
>> >> if
>> >>>>> there
>> >>>>>>> is no objection (Soon after I'll merge to branch-2 so this feature
>> >>> can
>> >>>>>>> hopefully be included in the upcoming 2.4.0RC).
>> >>>>>>>
>> >>>>>>> * For the design, see [2].
>> >>>>>>> * For an amalgamated PR of the 5 or 6 reviewed PRs that comprise
>> >>> this
>> >>>>>>> feature, see [3].
>> >>>>>>> * For a PE report that compared performance before and after, see
>> >>>>>>> HBASE-25127 (no regression).
>> >>>>>>> * A report on ITBLL runs is pending to be attached to HBASE-18070
>> >>> but
>> >>>>> runs
>> >>>>>>> so far show no regression with the feature enabled (ITBLL runs
>> >> were
>> >>>> done
>> >>>>>>> against a backport of this feature to branch-2 as the ITBLL state
>> >> of
>> >>>>> master
>> >>>>>>> is currently an unknown).
>> >>>>>>>
>> >>>>>>> Testing continues mainly looking for further improvement and to
>> >>> better
>> >>>>>>> understand this feature in operation. Documentation is included
>> >> but
>> >>> in
>> >>>>> need
>> >>>>>>> of polish (working on it).
>> >>>>>>>
>> >>>>>>> Dump any questions here and I'll be happy to respond. If you need
>> >>> more
>> >>>>> time
>> >>>>>>> to review, just shout.
>> >>>>>>>
>> >>>>>>> Thanks and thanks to all who contributed to this feature; the
>> >>>> reviewers
>> >>>>> and
>> >>>>>>> the testers in particular.
>> >>>>>>>
>> >>>>>>> S
>> >>>>>>>
>> >>>>>>> 1. http://hbase.apache.org/book.html#_asnyc_wal_replication
>> >>>>>>> 2. https://docs.google.com/document/d/1jJWVc-idHhhgL4KDRpjMsQJKCl_NRaCLGiH3Wqwd3O8/edit#
>> >>>>>>> This patch is currently missing HBASE-25280, a bug found in
>> >>>>>>> testing.
>> >>>>>>> 3. https://github.com/apache/hbase/pull/2643
>

Re: HEAD-UP: Merging HBASE-18070 "Enable memstore replication for meta replica" to master and then back to branch-2

Posted by Stack <st...@duboce.net>.
I've started an adjacent VOTE thread in an attempt at clarity of
how-to-go-forward here.
Thanks,
S

On Tue, Nov 17, 2020 at 7:56 AM Andrew Purtell <an...@gmail.com>
wrote:

> Hi Duo,
>
> Just to be clear: You are saying go ahead with the merge, but then also go
> back and start this discussion fresh, to see if anything was missed and
> more can be done?

Re: HEAD-UP: Merging HBASE-18070 "Enable memstore replication for meta replica" to master and then back to branch-2

Posted by Huaxiang Sun <hu...@gmail.com>.
Hi Duo,

    Happy birthday! Let me explain why we chose to land the client patch on
master along with the backend changes (the HBASE-18070 branch).

        1. The client patch does not work very well by itself: without
           "real-time" replication of meta WAL edits, the gap between the
           primary and replica regions is too big.
        2. Extra unit test effort. Per your suggestion, I put up the client
           patch against master for review. There is a tradeoff for unit
           tests, as they need to simulate real-time replication by flushing
           meta table memstores and waiting for the replica hfile refresher
           threads to pick up the updated hfiles. There are a couple of other
           unit test cases which were added to
           TestMetaRegionReplicaReplicationEndpoint. To avoid this test
           rewrite, we decided to merge the client patch into the feature
           branch and then merge the feature branch back to master.

     Best Regards,

     Huaxiang



On Tue, Nov 17, 2020 at 7:56 AM Andrew Purtell <an...@gmail.com>
wrote:

> Hi Duo,
>
> Just to be clear: You are saying go ahead with the merge, but then also go
> back and start this discussion fresh, to see if anything was missed and
> more can be done?

Re: HEAD-UP: Merging HBASE-18070 "Enable memstore replication for meta replica" to master and then back to branch-2

Posted by Andrew Purtell <an...@gmail.com>.
Hi Duo,

Just to be clear: You are saying go ahead with the merge, but then also go back and start this discussion fresh, to see if anything was missed and more can be done?

> On Nov 16, 2020, at 11:25 PM, 张铎 <pa...@gmail.com> wrote:
> 
> Oh, this is my fault. I meant that the old behavior IS to go to the primary
> replica first, which is what we want to change here.
> 
> And what I commented on jira was that we do not need to show a
> performance improvement before merging; that is not the goal of this issue.
> I suggested that if we want to show our advantage, we need to get the
> primary replica fucked up. I do not know why the discussion then went to
> HedgeRead, and I could not pull it back. I do not think this should
> block the merge, but it was still very hard to communicate, so
> I assumed we still have a big gap on what we want to solve here,
> and thus I voted -1.
> 
> I think we need to go back to the beginning, to reach an agreement on the
> goal here. Let’s review the design doc again to see if we missed something
> which led us to this situation.
> 
> And I need to say that I do not want to block the issue from being merged. I
> tried my best to speed up the process. I suggested landing the client-side
> changes on master directly but was refused. I helped add the feature for
> scanning a specific replica to branch-2 early so that the port to branch-2
> could land cleanly.
> 
> On a mobile device so cannot review the code or PR. Very busy these days.
> And the health examination this morning told me that I have high blood
> pressure. Not a good birthday present. Will get back to the issue when
> possible.
> 
> Thanks.
> 
> Stack <st...@duboce.net> wrote on Tue, Nov 17, 2020 at 06:34:
> 
>>> On Sun, Nov 15, 2020 at 11:20 PM 张铎(Duo Zhang) <pa...@gmail.com>
>>> wrote:
>>> 
>>> So what is your purpose of distributing the request of region location
>>> lookup? It is just because you want to 'distribute the request of region
>>> location lookup'?
>>> 
>>> Then I'm -1 on merging. We should reach an agreement on what we want to
>>> solve before merging at least.
>>> 
>>> 
>> HERE.1
>> 
>> 
>>> I've helped with this issue since the design doc step. For me, the purpose
>>> of this issue is clear. We want to prevent the hotspot on meta, so the
>>> solution is simple: enable meta replicas, and then just modify the client to
>>> not always go to the primary replica first (this is the old behavior even
>>> with the meta replica feature on).
>>> And this introduces another problem: there is no meta region
>>> replication implementation for meta read replicas, which means the latency
>>> will be large as we can only sync the data between replicas through region
>>> flush, so we implement meta region replication.
>>> 
>>> So I think it is very important to verify that we have truly distributed
>>> the region location lookup requests, and also to make sure that we can
>>> support more region location lookup requests. Otherwise this feature is
>>> useless.
>>> 
>>> And I agree with Andrew that, since the feature is off by default on branch-2
>>> and has no regression, it is OK to merge for now. Theoretically our
>>> approach here should work, so even if it does not work for now, I think we
>>> could fix the problems to make it work.
>>> 
>>> 
>> HERE.2
>> 
>> I agree with all of the above between HERE.1 and HERE.2 (except the
>> suggestion that the old behavior of read replicas is that they went to the
>> replica first; they go to the primary first -- see [1], [2]).
>> 
>> Let's work out any misalignment of understanding/communication offline and
>> not in the way of merge.
>> 
>> Thanks,
>> S
>> 
>> 1. http://hbase.apache.org/book.html#_timeline_consistency "In case a read
>> is performed with Consistency.TIMELINE, then the read RPC will be sent to
>> the primary region server first."
>> 2. https://github.com/apache/hbase/blob/branch-2/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ScannerCallableWithReplicas.java#L195
>> 
>> 
>> 
>>> But your reply above made me wonder whether we are talking about the same
>>> thing. That's why I'm -1 here. I'm not going to force you to do the test
>>> suggested by me, as I said it could be done after merging, just want to
>>> reach an agreement on the goal of this feature.
>>> 
>>> Thanks.
>>> 
>>> Stack <st...@duboce.net> wrote on Mon, Nov 16, 2020 at 12:35 PM:
>>> 
>>>> On Sun, Nov 15, 2020 at 9:16 AM Andrew Purtell <andrew.purtell@gmail.com>
>>>> wrote:
>>>> 
>>>>> I agree with Duo’s comment that a performance gain is unlikely but would
>>>>> be orthogonal anyway;
>>>> 
>>>> 
>>>> Perf observation is just an aside in the issue. Perf is orthogonal as you
>>>> say above (as long as no regression).
>>>> 
>>>> 
>>>> 
>>>>> it’s an availability gain that is the goal. We can assume it based on
>>>>> theory of operation and unit test results but the gain should be tested
>>>>> and measured on a cluster too.
>>>>> 
>>>> 
>>>> 
>>>> The feature is about distributing load on hbase:meta to alleviate
>>>> hotspotting; it makes read replicas more live so replicas are more likely
>>>> to satisfy location lookups making read replicas more effective. That read
>>>> replicas improve HA is presumed -- it was the original justification for
>>>> this years old commit -- but HA is not the focus of this addition; hence no
>>>> reports on effectiveness in this area.
>>>> 
>>>> I have no problem working on such tests/reports but suggest that they are
>>>> done post merge.
>>>> 
>>>> 
>>>> 
>>>>> That said, the results of the testing thus far indicate no regression,
>>>>> which gives me confidence to support a merge. Specifically, a merge to
>>>>> “unblock” 2.4 (we aren’t really blocked, we are waiting), provided the
>>>>> default there is the feature is configured off. But please indicate in
>>>>> documentation and release notes that the feature is not widely tested
>>>>> yet - as is customarily done for new functionality like this.
>>>>> 
>>>>> 
>>>> No problem w/ flagging the feature as new.
>>>> 
>>>> Thanks,
>>>> S
>>>> 
>>>> 
>>>> 
>>>>> 
>>>>>> On Nov 15, 2020, at 5:20 AM, 张铎 <pa...@gmail.com> wrote:
>>>>>> 
>>>>>> Replied on jira, I think we missed an important scenario when
>>> testing.
>>>>>> 
>>>>>> Thanks.
>>>>>> 
>>>>>> Stack <st...@duboce.net> wrote on Sun, Nov 15, 2020 at 2:30 AM:
>>>>>> 
>>>>>>> HBASE-18070 makes it so hbase:meta read replicas can run closer to the
>>>>>>> primary, (< second lags rather than minutes). It adds Async WAL
>>>>>>> Replication[1] on the hbase:meta table; i.e. edits are sprayed across
>>>>>>> replicas as they arrive at the primary's WAL. Before this work, Async WAL
>>>>>>> Replication was only available on user-space tables and the only option for
>>>>>>> hbase:meta read replicas was reloading the primary's hfiles on a period
>>>>>>> (minutes). HBASE-18070 also adds an optional client-side 'LoadBalance'
>>>>>>> policy that favors read replicas ahead of primary reads falling back to the
>>>>>>> primary on fault. Together, these additions allow distributing hbase:meta
>>>>>>> read load across primary and replicas alleviating 'hotspotting'.
>>>>>>> 
>>>>>>> I would like to merge the feature to master branch Monday evening if there
>>>>>>> is no objection (Soon after I'll merge to branch-2 so this feature can
>>>>>>> hopefully be included in the upcoming 2.4.0RC).
>>>>>>> 
>>>>>>> * For the design, see [2].
>>>>>>> * For an amalgamated PR of the 5 or 6 reviewed PRs that comprise this
>>>>>>> feature, see [3].
>>>>>>> * For a PE report that compared performance before and after, see
>>>>>>> HBASE-25127 (no regression).
>>>>>>> * A report on ITBLL runs is pending to be attached to HBASE-18070 but runs
>>>>>>> so far show no regression with the feature enabled (ITBLL runs were done
>>>>>>> against a backport of this feature to branch-2 as the ITBLL state of master
>>>>>>> is currently an unknown).
>>>>>>> 
>>>>>>> Testing continues mainly looking for further improvement and to better
>>>>>>> understand this feature in operation. Documentation is included but in need
>>>>>>> of polish (working on it).
>>>>>>> 
>>>>>>> Dump any questions here and I'll be happy to respond. If you need more time
>>>>>>> to review, just shout.
>>>>>>> 
>>>>>>> Thanks and thanks to all who contributed to this feature; the reviewers and
>>>>>>> the testers in particular.
>>>>>>> 
>>>>>>> S
>>>>>>> 
>>>>>>> 1. http://hbase.apache.org/book.html#_asnyc_wal_replication
>>>>>>> 2.
>>>>>>> 
>>>>>>> 
>>>>> 
>>>> 
>>> 
>> https://docs.google.com/document/d/1jJWVc-idHhhgL4KDRpjMsQJKCl_NRaCLGiH3Wqwd3O8/edit#
>>>>>>> This patch is currently missing HBASE-25280, a bug found in
>> testing.
>>>>>>> 3. https://github.com/apache/hbase/pull/2643
>>>>>>> 
>>>>> 
>>>> 
>>> 
>> 

Re: HEAD-UP: Merging HBASE-18070 "Enable memstore replication for meta replica" to master and then back to branch-2

Posted by "张铎 (Duo Zhang)" <pa...@gmail.com>.
Oh, this is my fault. I meant that the old behavior IS to go to the primary
replica first, which is what we want to change here.

What I said in my comment on jira is that we do not need to show a
performance improvement before merging; that is not the goal of this issue.
I suggested that if we want to show our advantage, we need to deliberately
break the primary replica. I do not know why the discussion then went to
HedgedRead, and I could not pull it back. I do not think this should block
the merge, but it was still very hard to communicate, so I assumed we still
have a big gap in our understanding of what we want to solve here; thus I
voted -1.

I think we need to go back to the beginning and reach an agreement on the
goal here. Let’s review the design doc again to see if we missed something
that led us to this situation.

And I need to say that I do not want to block this issue from being merged.
I tried my best to speed up the process. I suggested landing the client-side
changes on master directly, but that was refused. I helped add the feature
for scanning a specific replica to branch-2 early so that the port to
branch-2 could land cleanly.

I am on a mobile device, so I cannot review the code or the PR. Very busy
these days. And the health examination this morning told me that I have high
blood pressure. Not a good birthday present. I will get back to the issue
when possible.

Thanks.

On Tue, Nov 17, 2020 at 06:34 Stack <st...@duboce.net> wrote:

> On Sun, Nov 15, 2020 at 11:20 PM 张铎(Duo Zhang) <pa...@gmail.com>
> wrote:
>
> > So what is your purpose of distributing the request of region location
> > lookup? It is just because you want to 'distribute the request of region
> > location lookup'?
> >
> > Then I'm -1 on merging. We should reach an agreement on what we want to
> > solve before merging at least.
> >
> >
> HERE.1
>
>
> > I've helped this issue from the design doc step. For me, the purpose for
> > this issue is clear. We want to prevent the hotspot of meta, so the
> > solution is simple, enable meta replica, and then just modify the client
> to
> > not always go to primary replica first(this is the old behavior even with
> > meta replica feature on).
> > And this will introduce another problem that, there is no meta region
> > replication implementation for meta read replicas, which means the
> latency
> > will be large as we can only sync the data between replicas through
> region
> > flush, so we implement meta region replication.
> >
> > So I think it is very important to verify that we have truly distributed
> > the request of region location lookup, and also make sure that we could
> > support more requests of region location lookup. Otherwise this feature
> is
> > useless.
> >
> > And I agree with Andrew that, since the feature is default off on
> branch-2
> > and has no regression, it is OK to merge for now. Theoretically our
> > approach here should work, so even it does not work for now, I think we
> > could fix the problems to make it work.
> >
> >
> HERE.2
>
> I agree with all of the above between HERE.1 and HERE.2 (except the
> suggestion that the old behavior of read replicas is that they went to the
> replica first; they go to the primary first -- see [1], [2]).
>
> Lets work with any misalignment of understanding/communication offline and
> not in the way of merge.
>
> Thanks,
> S
>
> 1. http://hbase.apache.org/book.html#_timeline_consistency "In case a read
> is performed with Consistency.TIMELINE, then the read RPC will be sent to
> the primary region server first."
> 2.
>
> https://github.com/apache/hbase/blob/branch-2/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ScannerCallableWithReplicas.java#L195
>
>
>
> > But your reply above made me wonder whether we are talking about the same
> > thing. That's why I'm -1 here. I'm not going to force you to do the test
> > suggested by me, as I said it could be done after merging, just want to
> > reach an agreement on the goal of this feature.
> >
> > Thanks.
> >
> > On Mon, Nov 16, 2020 at 12:35 PM Stack <st...@duboce.net> wrote:
> >
> > > On Sun, Nov 15, 2020 at 9:16 AM Andrew Purtell <
> andrew.purtell@gmail.com
> > >
> > > wrote:
> > >
> > > > I agree with Duo’s comment that a performance gain is unlikely but
> > would
> > > > be orthogonal anyway;
> > >
> > >
> > > Perf observation is just an aside in the issue. Perf is orthogonal as
> you
> > > say above (as long as no regression).
> > >
> > >
> > >
> > > > it’s an availability gain that is the goal. We can assume it based on
> > > > theory of operation and unit test results but the gain should be
> tested
> > > and
> > > > measured on a cluster too.
> > > >
> > >
> > >
> > > The feature is about distributing load on hbase:meta to alleviate
> > > hotspotting; it makes read replicas more live so replicas are more
> likely
> > > to satisfy location lookups making read replicas more effective. That
> > read
> > > replicas improve HA is presumed -- it was the original justification
> for
> > > this years old commit -- but HA is not the focus of this addition;
> hence
> > no
> > > reports on effectiveness in this area.
> > >
> > > I have no problem working on such tests/reports but suggest that they
> are
> > > done post merge.
> > >
> > >
> > >
> > > > That said, the results of the testing thus far indicate no
> regression,
> > > > which gives me confidence to support a merge. Specifically, a merge
> to
> > > > “unblock” 2.4 (we aren’t really blocked, we are waiting), provided
> the
> > > > default there is the feature is configured off. But please indicate
> in
> > > > documentation and release notes that the feature is not widely tested
> > > yet -
> > > > as is customarily done for new functionality like this.
> > > >
> > > >
> > > No problem w/ flagging the feature as new.
> > >
> > > Thanks,
> > > S
> > >
> > >
> > >
> > > >
> > > > > On Nov 15, 2020, at 5:20 AM, 张铎 <pa...@gmail.com> wrote:
> > > > >
> > > > > Replied on jira, I think we missed an important scenario when
> > testing.
> > > > >
> > > > > Thanks.
> > > > >
> > > > > On Sun, Nov 15, 2020 at 2:30 AM Stack <st...@duboce.net> wrote:
> > > > >
> > > > >> HBASE-18070 makes it so hbase:meta read replicas can run closer to
> > the
> > > > >> primary, (< second lags rather than minutes). It adds Async WAL
> > > > >> Replication[1] on the hbase:meta table; i.e. edits are sprayed
> > across
> > > > >> replicas as they arrive at the primary's WAL. Before this work,
> > Async
> > > > WAL
> > > > >> Replication was only available on user-space tables and the only
> > > option
> > > > for
> > > > >> hbase:meta read replicas was reloading the primaries hfiles on a
> > > period
> > > > >> (minutes). HBASE-18070 also adds an optional client-side
> > 'LoadBalance'
> > > > >> policy that favors read replicas ahead of primary reads falling
> back
> > > to
> > > > the
> > > > >> primary on fault. Together, these additions allow distributing
> > > > hbase:meta
> > > > >> read load across primary and replicas alleviating 'hotspotting'.
> > > > >>
> > > > >> I would like to merge the feature to master branch Monday evening
> if
> > > > there
> > > > >> is no objection (Soon after I'll merge to branch-2 so this feature
> > can
> > > > >> hopefully be included in the upcoming 2.4.0RC).
> > > > >>
> > > > >> * For the design, see [2].
> > > > >> * For an amalgamated PR of the 5 or 6 reviewed PRs that comprise
> > this
> > > > >> feature, see [3].
> > > > >> * For a PE report that compared performance before and after, see
> > > > >> HBASE-25127 (no regression).
> > > > >> * A report on ITBLL runs is pending to be attached to HBASE-18070
> > but
> > > > runs
> > > > >> so far show no regression with the feature enabled (ITBLL runs
> were
> > > done
> > > > >> against a backport of this feature to branch-2 as the ITBLL state
> of
> > > > master
> > > > >> is currently an unknown).
> > > > >>
> > > > >> Testing continues mainly looking for further improvement and to
> > better
> > > > >> understand this feature in operation. Documentation is included
> but
> > in
> > > > need
> > > > >> of polish (working on it).
> > > > >>
> > > > >> Dump any questions here and I'll be happy to respond. If you need
> > more
> > > > time
> > > > >> to review, just shout.
> > > > >>
> > > > >> Thanks and thanks to all who contributed to this feature; the
> > > reviewers
> > > > and
> > > > >> the testers in particular.
> > > > >>
> > > > >> S
> > > > >>
> > > > >> 1. http://hbase.apache.org/book.html#_asnyc_wal_replication
> > > > >> 2.
> > > > >>
> > > > >>
> > > >
> > >
> >
> https://docs.google.com/document/d/1jJWVc-idHhhgL4KDRpjMsQJKCl_NRaCLGiH3Wqwd3O8/edit#
> > > > >> This patch is currently missing HBASE-25280, a bug found in
> testing.
> > > > >> 3. https://github.com/apache/hbase/pull/2643
> > > > >>
> > > >
> > >
> >
>

Re: HEAD-UP: Merging HBASE-18070 "Enable memstore replication for meta replica" to master and then back to branch-2

Posted by Stack <st...@duboce.net>.
On Sun, Nov 15, 2020 at 11:20 PM 张铎(Duo Zhang) <pa...@gmail.com>
wrote:

> So what is your purpose of distributing the request of region location
> lookup? It is just because you want to 'distribute the request of region
> location lookup'?
>
> Then I'm -1 on merging. We should reach an agreement on what we want to
> solve before merging at least.
>
>
HERE.1


> I've helped this issue from the design doc step. For me, the purpose for
> this issue is clear. We want to prevent the hotspot of meta, so the
> solution is simple, enable meta replica, and then just modify the client to
> not always go to primary replica first(this is the old behavior even with
> meta replica feature on).
> And this will introduce another problem that, there is no meta region
> replication implementation for meta read replicas, which means the latency
> will be large as we can only sync the data between replicas through region
> flush, so we implement meta region replication.
>
> So I think it is very important to verify that we have truly distributed
> the request of region location lookup, and also make sure that we could
> support more requests of region location lookup. Otherwise this feature is
> useless.
>
> And I agree with Andrew that, since the feature is default off on branch-2
> and has no regression, it is OK to merge for now. Theoretically our
> approach here should work, so even it does not work for now, I think we
> could fix the problems to make it work.
>
>
HERE.2

I agree with all of the above between HERE.1 and HERE.2 (except the
suggestion that the old behavior of read replicas is that they went to the
replica first; they go to the primary first -- see [1], [2]).

Let's work through any misalignment of understanding/communication offline
so that it does not get in the way of the merge.

Thanks,
S

1. http://hbase.apache.org/book.html#_timeline_consistency "In case a read
is performed with Consistency.TIMELINE, then the read RPC will be sent to
the primary region server first."
2.
https://github.com/apache/hbase/blob/branch-2/hbase-client/src/main/java/org/apache/hadoop/hbase/client/ScannerCallableWithReplicas.java#L195



> But your reply above made me wonder whether we are talking about the same
> thing. That's why I'm -1 here. I'm not going to force you to do the test
> suggested by me, as I said it could be done after merging, just want to
> reach an agreement on the goal of this feature.
>
> Thanks.
>
> On Mon, Nov 16, 2020 at 12:35 PM Stack <st...@duboce.net> wrote:
>
> > On Sun, Nov 15, 2020 at 9:16 AM Andrew Purtell <andrew.purtell@gmail.com
> >
> > wrote:
> >
> > > I agree with Duo’s comment that a performance gain is unlikely but
> would
> > > be orthogonal anyway;
> >
> >
> > Perf observation is just an aside in the issue. Perf is orthogonal as you
> > say above (as long as no regression).
> >
> >
> >
> > > it’s an availability gain that is the goal. We can assume it based on
> > > theory of operation and unit test results but the gain should be tested
> > and
> > > measured on a cluster too.
> > >
> >
> >
> > The feature is about distributing load on hbase:meta to alleviate
> > hotspotting; it makes read replicas more live so replicas are more likely
> > to satisfy location lookups making read replicas more effective. That
> read
> > replicas improve HA is presumed -- it was the original justification for
> > this years old commit -- but HA is not the focus of this addition; hence
> no
> > reports on effectiveness in this area.
> >
> > I have no problem working on such tests/reports but suggest that they are
> > done post merge.
> >
> >
> >
> > > That said, the results of the testing thus far indicate no regression,
> > > which gives me confidence to support a merge. Specifically, a merge to
> > > “unblock” 2.4 (we aren’t really blocked, we are waiting), provided the
> > > default there is the feature is configured off. But please indicate in
> > > documentation and release notes that the feature is not widely tested
> > yet -
> > > as is customarily done for new functionality like this.
> > >
> > >
> > No problem w/ flagging the feature as new.
> >
> > Thanks,
> > S
> >
> >
> >
> > >
> > > > On Nov 15, 2020, at 5:20 AM, 张铎 <pa...@gmail.com> wrote:
> > > >
> > > > Replied on jira, I think we missed an important scenario when
> testing.
> > > >
> > > > Thanks.
> > > >
> > > > On Sun, Nov 15, 2020 at 2:30 AM Stack <st...@duboce.net> wrote:
> > > >
> > > >> HBASE-18070 makes it so hbase:meta read replicas can run closer to
> the
> > > >> primary, (< second lags rather than minutes). It adds Async WAL
> > > >> Replication[1] on the hbase:meta table; i.e. edits are sprayed
> across
> > > >> replicas as they arrive at the primary's WAL. Before this work,
> Async
> > > WAL
> > > >> Replication was only available on user-space tables and the only
> > option
> > > for
> > > >> hbase:meta read replicas was reloading the primaries hfiles on a
> > period
> > > >> (minutes). HBASE-18070 also adds an optional client-side
> 'LoadBalance'
> > > >> policy that favors read replicas ahead of primary reads falling back
> > to
> > > the
> > > >> primary on fault. Together, these additions allow distributing
> > > hbase:meta
> > > >> read load across primary and replicas alleviating 'hotspotting'.
> > > >>
> > > >> I would like to merge the feature to master branch Monday evening if
> > > there
> > > >> is no objection (Soon after I'll merge to branch-2 so this feature
> can
> > > >> hopefully be included in the upcoming 2.4.0RC).
> > > >>
> > > >> * For the design, see [2].
> > > >> * For an amalgamated PR of the 5 or 6 reviewed PRs that comprise
> this
> > > >> feature, see [3].
> > > >> * For a PE report that compared performance before and after, see
> > > >> HBASE-25127 (no regression).
> > > >> * A report on ITBLL runs is pending to be attached to HBASE-18070
> but
> > > runs
> > > >> so far show no regression with the feature enabled (ITBLL runs were
> > done
> > > >> against a backport of this feature to branch-2 as the ITBLL state of
> > > master
> > > >> is currently an unknown).
> > > >>
> > > >> Testing continues mainly looking for further improvement and to
> better
> > > >> understand this feature in operation. Documentation is included but
> in
> > > need
> > > >> of polish (working on it).
> > > >>
> > > >> Dump any questions here and I'll be happy to respond. If you need
> more
> > > time
> > > >> to review, just shout.
> > > >>
> > > >> Thanks and thanks to all who contributed to this feature; the
> > reviewers
> > > and
> > > >> the testers in particular.
> > > >>
> > > >> S
> > > >>
> > > >> 1. http://hbase.apache.org/book.html#_asnyc_wal_replication
> > > >> 2.
> > > >>
> > > >>
> > >
> >
> https://docs.google.com/document/d/1jJWVc-idHhhgL4KDRpjMsQJKCl_NRaCLGiH3Wqwd3O8/edit#
> > > >> This patch is currently missing HBASE-25280, a bug found in testing.
> > > >> 3. https://github.com/apache/hbase/pull/2643
> > > >>
> > >
> >
>

Re: HEAD-UP: Merging HBASE-18070 "Enable memstore replication for meta replica" to master and then back to branch-2

Posted by "张铎 (Duo Zhang)" <pa...@gmail.com>.
So what is your purpose in distributing region location lookup requests? Is
it just because you want to 'distribute the request of region location
lookup'?

Then I'm -1 on merging. We should at least reach an agreement on what we
want to solve before merging.

I've helped with this issue since the design doc stage. For me, the purpose
of this issue is clear. We want to prevent hotspotting on meta, so the
solution is simple: enable meta replicas, and then just modify the client to
not always go to primary replica first (this is the old behavior even with
the meta replica feature on).
This introduces another problem: there is no region replication
implementation for meta read replicas, which means the lag will be large,
as we can only sync data between replicas through region flush. So we
implemented meta region replication.
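
The client-side change described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the HBase client code: the class MetaReplicaChooser, its method names, and the round-robin rotation are all hypothetical and chosen for illustration. It follows the description in this thread: the old behavior sends the first RPC to the primary, while a LoadBalance-style policy favors read replicas and falls back to the primary on fault.

```java
import java.util.function.IntPredicate;

/** Toy model of choosing which hbase:meta replica receives a lookup first. */
final class MetaReplicaChooser {
    /** Replica id 0 stands for the primary in this sketch. */
    static final int PRIMARY = 0;

    /** Old behavior per the thread: the first RPC always goes to the primary. */
    static int primaryFirst() {
        return PRIMARY;
    }

    /**
     * LoadBalance-style choice: rotate over the read replicas
     * (ids 1..replicaCount-1) and fall back to the primary when the
     * chosen replica is faulted. Round-robin is an assumption of this sketch.
     */
    static int loadBalance(int replicaCount, int requestNumber, IntPredicate isHealthy) {
        int readReplicas = replicaCount - 1;
        if (readReplicas > 0) {
            int candidate = 1 + Math.floorMod(requestNumber, readReplicas);
            if (isHealthy.test(candidate)) {
                return candidate;
            }
        }
        return PRIMARY;
    }

    public static void main(String[] args) {
        int replicaCount = 3; // one primary plus two read replicas
        int[] hits = new int[replicaCount];
        for (int request = 0; request < 300; request++) {
            hits[loadBalance(replicaCount, request, id -> true)]++;
        }
        for (int id = 0; id < replicaCount; id++) {
            System.out.println("replica " + id + " served " + hits[id] + " lookups");
        }
    }
}
```

Counting which replica serves each lookup, as main does here, is one simple way to check whether lookups really are being spread across replicas rather than landing on the primary.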

So I think it is very important to verify that we have truly distributed
region location lookup requests, and also to make sure that we can support
more of them. Otherwise this feature is useless.

And I agree with Andrew that, since the feature is off by default on
branch-2 and has no regression, it is OK to merge for now. Theoretically our
approach here should work, so even if it does not work for now, I think we
can fix the problems to make it work.

But your reply above made me wonder whether we are talking about the same
thing. That's why I'm -1 here. I'm not going to force you to run the test I
suggested; as I said, it can be done after merging. I just want to reach an
agreement on the goal of this feature.

Thanks.

On Mon, Nov 16, 2020 at 12:35 PM Stack <st...@duboce.net> wrote:

> On Sun, Nov 15, 2020 at 9:16 AM Andrew Purtell <an...@gmail.com>
> wrote:
>
> > I agree with Duo’s comment that a performance gain is unlikely but would
> > be orthogonal anyway;
>
>
> Perf observation is just an aside in the issue. Perf is orthogonal as you
> say above (as long as no regression).
>
>
>
> > it’s an availability gain that is the goal. We can assume it based on
> > theory of operation and unit test results but the gain should be tested
> and
> > measured on a cluster too.
> >
>
>
> The feature is about distributing load on hbase:meta to alleviate
> hotspotting; it makes read replicas more live so replicas are more likely
> to satisfy location lookups making read replicas more effective. That read
> replicas improve HA is presumed -- it was the original justification for
> this years old commit -- but HA is not the focus of this addition; hence no
> reports on effectiveness in this area.
>
> I have no problem working on such tests/reports but suggest that they are
> done post merge.
>
>
>
> > That said, the results of the testing thus far indicate no regression,
> > which gives me confidence to support a merge. Specifically, a merge to
> > “unblock” 2.4 (we aren’t really blocked, we are waiting), provided the
> > default there is the feature is configured off. But please indicate in
> > documentation and release notes that the feature is not widely tested
> yet -
> > as is customarily done for new functionality like this.
> >
> >
> No problem w/ flagging the feature as new.
>
> Thanks,
> S
>
>
>
> >
> > > On Nov 15, 2020, at 5:20 AM, 张铎 <pa...@gmail.com> wrote:
> > >
> > > Replied on jira, I think we missed an important scenario when testing.
> > >
> > > Thanks.
> > >
> > > On Sun, Nov 15, 2020 at 2:30 AM Stack <st...@duboce.net> wrote:
> > >
> > >> HBASE-18070 makes it so hbase:meta read replicas can run closer to the
> > >> primary, (< second lags rather than minutes). It adds Async WAL
> > >> Replication[1] on the hbase:meta table; i.e. edits are sprayed across
> > >> replicas as they arrive at the primary's WAL. Before this work, Async
> > WAL
> > >> Replication was only available on user-space tables and the only
> option
> > for
> > >> hbase:meta read replicas was reloading the primaries hfiles on a
> period
> > >> (minutes). HBASE-18070 also adds an optional client-side 'LoadBalance'
> > >> policy that favors read replicas ahead of primary reads falling back
> to
> > the
> > >> primary on fault. Together, these additions allow distributing
> > hbase:meta
> > >> read load across primary and replicas alleviating 'hotspotting'.
> > >>
> > >> I would like to merge the feature to master branch Monday evening if
> > there
> > >> is no objection (Soon after I'll merge to branch-2 so this feature can
> > >> hopefully be included in the upcoming 2.4.0RC).
> > >>
> > >> * For the design, see [2].
> > >> * For an amalgamated PR of the 5 or 6 reviewed PRs that comprise this
> > >> feature, see [3].
> > >> * For a PE report that compared performance before and after, see
> > >> HBASE-25127 (no regression).
> > >> * A report on ITBLL runs is pending to be attached to HBASE-18070 but
> > runs
> > >> so far show no regression with the feature enabled (ITBLL runs were
> done
> > >> against a backport of this feature to branch-2 as the ITBLL state of
> > master
> > >> is currently an unknown).
> > >>
> > >> Testing continues mainly looking for further improvement and to better
> > >> understand this feature in operation. Documentation is included but in
> > need
> > >> of polish (working on it).
> > >>
> > >> Dump any questions here and I'll be happy to respond. If you need more
> > time
> > >> to review, just shout.
> > >>
> > >> Thanks and thanks to all who contributed to this feature; the
> reviewers
> > and
> > >> the testers in particular.
> > >>
> > >> S
> > >>
> > >> 1. http://hbase.apache.org/book.html#_asnyc_wal_replication
> > >> 2.
> > >>
> > >>
> >
> https://docs.google.com/document/d/1jJWVc-idHhhgL4KDRpjMsQJKCl_NRaCLGiH3Wqwd3O8/edit#
> > >> This patch is currently missing HBASE-25280, a bug found in testing.
> > >> 3. https://github.com/apache/hbase/pull/2643
> > >>
> >
>

Re: HEAD-UP: Merging HBASE-18070 "Enable memstore replication for meta replica" to master and then back to branch-2

Posted by Stack <st...@duboce.net>.
On Sun, Nov 15, 2020 at 9:16 AM Andrew Purtell <an...@gmail.com>
wrote:

> I agree with Duo’s comment that a performance gain is unlikely but would
> be orthogonal anyway;


Perf observation is just an aside in the issue. Perf is orthogonal as you
say above (as long as no regression).



> it’s an availability gain that is the goal. We can assume it based on
> theory of operation and unit test results but the gain should be tested and
> measured on a cluster too.
>


The feature is about distributing load on hbase:meta to alleviate
hotspotting; it makes read replicas more live, so replicas are more likely
to satisfy location lookups, making read replicas more effective. That read
replicas improve HA is presumed -- it was the original justification for
this years-old commit -- but HA is not the focus of this addition; hence no
reports on effectiveness in this area.

I have no problem working on such tests/reports but suggest that they are
done post merge.



> That said, the results of the testing thus far indicate no regression,
> which gives me confidence to support a merge. Specifically, a merge to
> “unblock” 2.4 (we aren’t really blocked, we are waiting), provided the
> default there is the feature is configured off. But please indicate in
> documentation and release notes that the feature is not widely tested yet -
> as is customarily done for new functionality like this.
>
>
No problem w/ flagging the feature as new.

Thanks,
S



>
> > On Nov 15, 2020, at 5:20 AM, 张铎 <pa...@gmail.com> wrote:
> >
> > Replied on jira, I think we missed an important scenario when testing.
> >
> > Thanks.
> >
> > On Sun, Nov 15, 2020 at 2:30 AM Stack <st...@duboce.net> wrote:
> >
> >> HBASE-18070 makes it so hbase:meta read replicas can run closer to the
> >> primary, (< second lags rather than minutes). It adds Async WAL
> >> Replication[1] on the hbase:meta table; i.e. edits are sprayed across
> >> replicas as they arrive at the primary's WAL. Before this work, Async
> WAL
> >> Replication was only available on user-space tables and the only option
> for
> >> hbase:meta read replicas was reloading the primaries hfiles on a period
> >> (minutes). HBASE-18070 also adds an optional client-side 'LoadBalance'
> >> policy that favors read replicas ahead of primary reads falling back to
> the
> >> primary on fault. Together, these additions allow distributing
> hbase:meta
> >> read load across primary and replicas alleviating 'hotspotting'.
> >>
> >> I would like to merge the feature to master branch Monday evening if
> there
> >> is no objection (Soon after I'll merge to branch-2 so this feature can
> >> hopefully be included in the upcoming 2.4.0RC).
> >>
> >> * For the design, see [2].
> >> * For an amalgamated PR of the 5 or 6 reviewed PRs that comprise this
> >> feature, see [3].
> >> * For a PE report that compared performance before and after, see
> >> HBASE-25127 (no regression).
> >> * A report on ITBLL runs is pending to be attached to HBASE-18070 but
> runs
> >> so far show no regression with the feature enabled (ITBLL runs were done
> >> against a backport of this feature to branch-2 as the ITBLL state of
> master
> >> is currently an unknown).
> >>
> >> Testing continues mainly looking for further improvement and to better
> >> understand this feature in operation. Documentation is included but in
> need
> >> of polish (working on it).
> >>
> >> Dump any questions here and I'll be happy to respond. If you need more
> time
> >> to review, just shout.
> >>
> >> Thanks and thanks to all who contributed to this feature; the reviewers
> and
> >> the testers in particular.
> >>
> >> S
> >>
> >> 1. http://hbase.apache.org/book.html#_asnyc_wal_replication
> >> 2.
> >>
> >>
> https://docs.google.com/document/d/1jJWVc-idHhhgL4KDRpjMsQJKCl_NRaCLGiH3Wqwd3O8/edit#
> >> This patch is currently missing HBASE-25280, a bug found in testing.
> >> 3. https://github.com/apache/hbase/pull/2643
> >>
>

Re: HEAD-UP: Merging HBASE-18070 "Enable memstore replication for meta replica" to master and then back to branch-2

Posted by Andrew Purtell <an...@gmail.com>.
I agree with Duo’s comment that a performance gain is unlikely but would be orthogonal anyway; it’s an availability gain that is the goal. We can assume it based on theory of operation and unit test results but the gain should be tested and measured on a cluster too. 

That said, the results of the testing thus far indicate no regression, which gives me confidence to support a merge. Specifically, a merge to “unblock” 2.4 (we aren’t really blocked, we are waiting), provided the feature is configured off by default there. But please indicate in documentation and release notes that the feature is not widely tested yet - as is customarily done for new functionality like this.


> On Nov 15, 2020, at 5:20 AM, 张铎 <pa...@gmail.com> wrote:
> 
> Replied on jira, I think we missed an important scenario when testing.
> 
> Thanks.
> 
> On Sun, Nov 15, 2020 at 2:30 AM Stack <st...@duboce.net> wrote:
> 
>> HBASE-18070 makes it so hbase:meta read replicas can run closer to the
>> primary, (< second lags rather than minutes). It adds Async WAL
>> Replication[1] on the hbase:meta table; i.e. edits are sprayed across
>> replicas as they arrive at the primary's WAL. Before this work, Async WAL
>> Replication was only available on user-space tables and the only option for
>> hbase:meta read replicas was reloading the primaries hfiles on a period
>> (minutes). HBASE-18070 also adds an optional client-side 'LoadBalance'
>> policy that favors read replicas ahead of primary reads falling back to the
>> primary on fault. Together, these additions allow distributing hbase:meta
>> read load across primary and replicas alleviating 'hotspotting'.
>> 
>> I would like to merge the feature to master branch Monday evening if there
>> is no objection (Soon after I'll merge to branch-2 so this feature can
>> hopefully be included in the upcoming 2.4.0RC).
>> 
>> * For the design, see [2].
>> * For an amalgamated PR of the 5 or 6 reviewed PRs that comprise this
>> feature, see [3].
>> * For a PE report that compared performance before and after, see
>> HBASE-25127 (no regression).
>> * A report on ITBLL runs is pending to be attached to HBASE-18070 but runs
>> so far show no regression with the feature enabled (ITBLL runs were done
>> against a backport of this feature to branch-2 as the ITBLL state of master
>> is currently an unknown).
>> 
>> Testing continues mainly looking for further improvement and to better
>> understand this feature in operation. Documentation is included but in need
>> of polish (working on it).
>> 
>> Dump any questions here and I'll be happy to respond. If you need more time
>> to review, just shout.
>> 
>> Thanks and thanks to all who contributed to this feature; the reviewers and
>> the testers in particular.
>> 
>> S
>> 
>> 1. http://hbase.apache.org/book.html#_asnyc_wal_replication
>> 2.
>> 
>> https://docs.google.com/document/d/1jJWVc-idHhhgL4KDRpjMsQJKCl_NRaCLGiH3Wqwd3O8/edit#
>> This patch is currently missing HBASE-25280, a bug found in testing.
>> 3. https://github.com/apache/hbase/pull/2643
>> 

Re: HEAD-UP: Merging HBASE-18070 "Enable memstore replication for meta replica" to master and then back to branch-2

Posted by "张铎 (Duo Zhang)" <pa...@gmail.com>.
Replied on jira; I think we missed an important scenario when testing.

Thanks.

On Sun, Nov 15, 2020 at 2:30 AM Stack <st...@duboce.net> wrote:

> HBASE-18070 makes it so hbase:meta read replicas can run closer to the
> primary, (< second lags rather than minutes). It adds Async WAL
> Replication[1] on the hbase:meta table; i.e. edits are sprayed across
> replicas as they arrive at the primary's WAL. Before this work, Async WAL
> Replication was only available on user-space tables and the only option for
> hbase:meta read replicas was reloading the primaries hfiles on a period
> (minutes). HBASE-18070 also adds an optional client-side 'LoadBalance'
> policy that favors read replicas ahead of primary reads falling back to the
> primary on fault. Together, these additions allow distributing hbase:meta
> read load across primary and replicas alleviating 'hotspotting'.
>
> I would like to merge the feature to master branch Monday evening if there
> is no objection (Soon after I'll merge to branch-2 so this feature can
> hopefully be included in the upcoming 2.4.0RC).
>
>  * For the design, see [2].
>  * For an amalgamated PR of the 5 or 6 reviewed PRs that comprise this
> feature, see [3].
>  * For a PE report that compared performance before and after, see
> HBASE-25127 (no regression).
>  * A report on ITBLL runs is pending to be attached to HBASE-18070 but runs
> so far show no regression with the feature enabled (ITBLL runs were done
> against a backport of this feature to branch-2 as the ITBLL state of master
> is currently an unknown).
>
> Testing continues mainly looking for further improvement and to better
> understand this feature in operation. Documentation is included but in need
> of polish (working on it).
>
> Dump any questions here and I'll be happy to respond. If you need more time
> to review, just shout.
>
> Thanks and thanks to all who contributed to this feature; the reviewers and
> the testers in particular.
>
> S
>
> 1. http://hbase.apache.org/book.html#_asnyc_wal_replication
> 2.
>
> https://docs.google.com/document/d/1jJWVc-idHhhgL4KDRpjMsQJKCl_NRaCLGiH3Wqwd3O8/edit#
> This patch is currently missing HBASE-25280, a bug found in testing.
> 3. https://github.com/apache/hbase/pull/2643
>