Posted to dev@hbase.apache.org by Duo Zhang <zh...@apache.org> on 2018/01/06 06:54:38 UTC

[VOTE] Merge branch HBASE-19397 back to master and branch-2.

https://issues.apache.org/jira/browse/HBASE-19397

We aim to move the peer modification framework from zk watcher to procedure
v2 in this issue, and the work is now done.

Copy the release note here:

Introduce 5 procedures to do peer modifications:
> AddPeerProcedure
> RemovePeerProcedure
> UpdatePeerConfigProcedure
> EnablePeerProcedure
> DisablePeerProcedure
>
> The procedures are all executed in the following stages:
> 1. Call the pre CP hook; if an exception is thrown, give up.
> 2. Check whether the operation is valid; if not, give up.
> 3. Update the peer storage. Notice that once we have entered this stage, we
> cannot roll back any more.
> 4. Schedule sub procedures to refresh the peer config on every RS.
> 5. Do post cleanup, if any.
> 6. Call the post CP hook. Any exception thrown here is ignored since we
> have already done the work.
>
> The procedure holds an exclusive lock on the peer id, so there are no
> concurrent modifications on a single peer.
>
> And it is now guaranteed that once the procedure is done, the peer
> modification has taken effect on all RSes.
>
> Abstracted a storage layer for replication peer/queue management, and
> refactored the upper layer to remove zk-related naming/code/comments.
>
> Added pre/postExecuteProcedures CP hooks to RegionServerObserver, and added
> a permission check for the executeProcedures method which requires the
> caller to be the system user or a superuser.
>
> On rolling upgrade: just do not perform any replication peer modifications
> during the rolling upgrade. There are no pb/layout changes to the
> peer/queue storage on zk.
>
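For illustration, the staged execution above could be sketched in Java
roughly as follows; all names here are hypothetical, made up for this sketch,
and are not the actual HBASE-19397 classes:

import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the six stages above; not the real HBASE-19397 code.
public class PeerProcedureSketch {

  private final String peerId; // the procedure holds an exclusive lock on this id

  public PeerProcedureSketch(String peerId) {
    this.peerId = peerId;
  }

  public void execute() throws Exception {
    preCPHook();                     // 1. give up if the coprocessor hook throws
    if (!isOperationValid()) {       // 2. give up if the operation is invalid
      throw new IllegalStateException("invalid operation on peer " + peerId);
    }
    updatePeerStorage();             // 3. point of no return: no rollback after this
    for (String rs : liveRegionServers()) {
      scheduleRefreshPeerSubProcedure(rs); // 4. refresh peer config on every RS
    }
    postCleanup();                   // 5. post cleanup, if any
    try {
      postCPHook();                  // 6. exceptions ignored, work is already done
    } catch (Exception ignored) {
    }
  }

  private void preCPHook() {}
  private boolean isOperationValid() { return true; }
  private void updatePeerStorage() {}
  private List<String> liveRegionServers() { return Arrays.asList("rs1", "rs2"); }
  private void scheduleRefreshPeerSubProcedure(String rs) {}
  private void postCleanup() {}
  private void postCPHook() {}
}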

And there are other benefits.
First, we have introduced a general procedure framework to send tasks to
RSes and report the results back to the Master. It can be used to implement
other operations such as ACL changes (a rough sketch follows below).
Second, since zk is now used only as external storage and we no longer
depend on zk watchers, it will be much easier to implement a 'table based'
replication peer/queue storage.
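A minimal sketch of what such a framework provides, with hypothetical names
(not the real API): the Master dispatches a task to every RS and completes
only once each RS has reported back, which is exactly where the synchronous
guarantee comes from.

import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch of a Master-to-RS task dispatch framework.
interface RemoteTaskDispatcher {
  // Sends one task to one region server; the future completes when the RS reports back.
  CompletableFuture<Void> dispatch(String regionServer, byte[] task);
}

class BroadcastTask {
  private final RemoteTaskDispatcher dispatcher;

  BroadcastTask(RemoteTaskDispatcher dispatcher) {
    this.dispatcher = dispatcher;
  }

  // Returns only after all region servers have acked, so the caller
  // knows the change has taken effect everywhere.
  void runOnAll(List<String> regionServers, byte[] task) {
    CompletableFuture<?>[] acks = regionServers.stream()
        .map(rs -> dispatcher.dispatch(rs, task))
        .toArray(CompletableFuture[]::new);
    CompletableFuture.allOf(acks).join();
  }
}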

Please vote:
[+1] Agree
[-1] Disagree
[0] Neutral

Thanks.
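The storage-layer abstraction mentioned in the release note is what makes
the 'table based' storage cheap later: all peer reads/writes go through an
interface and zk is just one implementation behind it. A rough, hypothetical
shape for such an interface (not the exact one introduced by the branch):

import java.util.List;

// Hypothetical sketch of a replication peer storage abstraction.
// PeerConfig stands in for whatever config type the real code uses.
interface PeerStorageSketch<PeerConfig> {
  void addPeer(String peerId, PeerConfig config, boolean enabled) throws Exception;
  void removePeer(String peerId) throws Exception;
  void setPeerState(String peerId, boolean enabled) throws Exception;
  void updatePeerConfig(String peerId, PeerConfig config) throws Exception;
  List<String> listPeerIds() throws Exception;
  PeerConfig getPeerConfig(String peerId) throws Exception;
}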

Re: [VOTE] Merge branch HBASE-19397 back to master and branch-2.

Posted by Duo Zhang <zh...@apache.org>.
Here is my +1.

I executed the test suites several times with -Dsurefire.secondPartForkCount=1
and with the flakey tests (including TestFromClientSide) excluded, and I could
get a successful build.

I started two clusters with the code of the latest HBASE-19397 and tried
adding a new peer; it worked fine. I loaded 1M rows with LTT at the source
cluster and verified them at the dest cluster; it passed.

During the loading, I disabled the peer for a while and then enabled it. It
did not affect correctness. And after disabling the peer, the qps on the dest
cluster dropped to 0 immediately, which is the expected behavior (compared to
the old, asynchronous zk watcher approach).

Both clusters have 5 nodes, and the add/remove/enable/disable peer commands in
shell always return within 2 seconds, which I think is acceptable.
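For anyone repeating the client-side part of this, the peer operations look
roughly like the following through the Java Admin API (a sketch against my
reading of the HBase 2.0 client; the cluster key is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.replication.ReplicationPeerConfig;

public class PeerAdminExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // source cluster config
    try (Connection conn = ConnectionFactory.createConnection(conf);
        Admin admin = conn.getAdmin()) {
      ReplicationPeerConfig peer = ReplicationPeerConfig.newBuilder()
          .setClusterKey("dest-zk1,dest-zk2,dest-zk3:2181:/hbase") // placeholder
          .build();
      // With the procedure-based implementation, each call returns only after
      // the change has taken effect on every region server.
      admin.addReplicationPeer("1", peer);
      admin.disableReplicationPeer("1"); // shipping to the peer stops at once
      admin.enableReplicationPeer("1");
      admin.removeReplicationPeer("1");
    }
  }
}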

2018-01-06 14:54 GMT+08:00 Duo Zhang <zh...@apache.org>:

> https://issues.apache.org/jira/browse/HBASE-19397
>
> We aim to move the peer modification framework from zk watcher to
> procedure v2 in this issue, and the work is now done.
>
> Copy the release note here:
>
> Introduce 5 procedures to do peer modifications:
>> AddPeerProcedure
>> RemovePeerProcedure
>> UpdatePeerConfigProcedure
>> EnablePeerProcedure
>> DisablePeerProcedure
>>
>> The procedures are all executed in the following stages:
>> 1. Call the pre CP hook; if an exception is thrown, give up.
>> 2. Check whether the operation is valid; if not, give up.
>> 3. Update the peer storage. Notice that once we have entered this stage,
>> we cannot roll back any more.
>> 4. Schedule sub procedures to refresh the peer config on every RS.
>> 5. Do post cleanup, if any.
>> 6. Call the post CP hook. Any exception thrown here is ignored since we
>> have already done the work.
>>
>> The procedure holds an exclusive lock on the peer id, so there are no
>> concurrent modifications on a single peer.
>>
>> And it is now guaranteed that once the procedure is done, the peer
>> modification has taken effect on all RSes.
>>
>> Abstracted a storage layer for replication peer/queue management, and
>> refactored the upper layer to remove zk-related naming/code/comments.
>>
>> Added pre/postExecuteProcedures CP hooks to RegionServerObserver, and
>> added a permission check for the executeProcedures method which requires
>> the caller to be the system user or a superuser.
>>
>> On rolling upgrade: just do not perform any replication peer modifications
>> during the rolling upgrade. There are no pb/layout changes to the
>> peer/queue storage on zk.
>>
>
> And there are other benefits.
> First, we have introduced a general procedure framework to send tasks to
> RSes and report the results back to the Master. It can be used to implement
> other operations such as ACL changes.
> Second, since zk is now used only as external storage and we no longer
> depend on zk watchers, it will be much easier to implement a 'table based'
> replication peer/queue storage.
>
> Please vote:
> [+1] Agree
> [-1] Disagree
> [0] Neutral
>
> Thanks.
>
>
>

Re: [VOTE] Merge branch HBASE-19397 back to master and branch-2.

Posted by Josh Elser <el...@apache.org>.
To be clear, I didn't mean for this discussion to be some snark directed 
at you or the serial-replication feature, Duo. You just happened to be 
the first person who put up a merge-vote like this into branch-2 and it 
got me thinking. I was/am not trying to throw shade at you.

Let me spin this out into a DISCUSS. I think lumping it into your VOTE 
thread was a bad idea (I was thinking out loud, but obviously the 
expectation is that others think about it and weigh in too).

Sorry for the noise.

On 3/13/18 6:44 AM, 张铎(Duo Zhang) wrote:
> I've already merged it to branch-2...
> 
> And for the procedure based replication peer modification, the feature has
> been finished; at least there are no critical TODOs. I will not say it has
> no bugs, but I do not think it will block the 2.1 or 3.0 release too much.
> 
> And please trust my judgement, I'm not a man who only wants to show off.
> For example, I just reverted the serial replication feature from branch-2
> before we release beta2 and tried to redo it on master, because we found
> some critical problems. And since the basic idea has not changed, we
> decided to do it on master, as it will not take too much time to finish.
> HBASE-19064, on the other hand, is a big feature, so we decided to do it on
> a feature branch. We will select the branches which we want to merge back
> at the time we finish the feature.
> 
> For the CP cleanup, I think the problem is that we started too late, and in
> the past we made mistakes and let the related projects inject too much into
> the internals of HBase. B&R is something like the serial replication
> feature: the serial replication feature we tested and found some critical
> problems, while B&R was not well tested (IIRC), so we were afraid there
> would be critical problems. And for AMv2, I think the problem is premature
> optimization. We even implemented our own AvlTree, and also a very
> complicated scheduler which always deadlocks on us...
> 
> These are all very good cases for software engineering in the real world...
> They should be avoided in the future... That's also my duty...
> 
> Thanks.
> 
> 2018-03-13 17:50 GMT+08:00 Apekshit Sharma <ap...@cloudera.com>:
> 
>> I thought the problem with releasing 2.0 was that there were too many open
>> features dragging along, not coming to a finish - AMv2 (the biggest one),
>> CP cleanup, B&R, etc.
>> If that was not the case, it wouldn't have been 1 year between the first
>> alpha and GA, rather just a few months.
>>
>> That said, I don't mind new but hardened features which are already ready
>> to ship (implementation only or not) going into a minor version, but
>> that's my personal opinion. Going too aggressive on that can indeed lead
>> to the trap Josh mentioned above.
>>
>> For this particular change, my +1 was based on the following aspects:
>> - it's internal
>> - it moves ops onto the procv2 framework (gives failure recovery, locking,
>> etc)
>> - Duo's reasoning - "....the synchronous guarantee is really a good thing
>> for our replication related UTs..." and "For the replication peer
>> tracking, it is the same problem. It is hard to do fencing with a zk
>> watcher since it is asynchronous, so the UTs are always kind of flakey in
>> theory."
>>
>> That said, it's pending review.
>> @Duo: As you know it's not possible to spend cycles on it right now -
>> pending 2.0 GA - can you please hold it off for a few weeks (ideally,
>> until GA + 2-3 weeks), which will give the community (whoever is
>> interested, at least me..smile) a decent chance to review it.
>> Thanks
>>
>> -- Appy
>>
>>
>> On Tue, Mar 13, 2018 at 1:02 AM, Josh Elser <el...@apache.org> wrote:
>>
>>> (Sorry for something other than just a vote)
>>>
>>> I worry about seeing a "big" feature branch merge as us falling back into
>>> the 1.x trap. Starting to backport features into 2.x will keep us
>>> delaying 3.0, as we have less and less incentive to push to release 3.0
>>> in a timely manner.
>>>
>>> That said, I also don't want my worries to bar a feature which appears to
>>> be related to implementation only (based on one high-level read of the
>>> changes). Perhaps we need to re-think what is allowable for a Y release
>>> in x.y.z...
>>>
>>> +1 for master (which already happened, maybe?)
>>> +0 for branch-2 (simply because I haven't looked closely enough at the
>>> changes; I can read through and try to change to +1 if you need the votes)
>>>
>>>
>>> On 3/9/18 2:41 AM, 张铎(Duo Zhang) wrote:
>>>
>>>> Since branch-2.0 has been cut and branch-2 is now 2.1.0-SNAPSHOT, I will
>>>> merge branch HBASE-19397-branch-2 back to branch-2.
>>>>
>>>> 2018-01-10 9:20 GMT+08:00 张铎(Duo Zhang) <pa...@gmail.com>:
>>>>
>>>>> If branch-2.0 will be out soon then let's target this to 2.1. No
>>>>> problem.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> 2018-01-10 1:28 GMT+08:00 Stack <st...@duboce.net>:
>>>>>
>>>>>> On Mon, Jan 8, 2018 at 8:19 PM, 张铎(Duo Zhang) <pa...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> OK, let me merge it to master first, and then create a
>>>>>>> HBASE-19397-branch-2 which will keep rebasing on the newest branch-2
>>>>>>> to see if it is stable enough. Since we can define this as a bug
>>>>>>> fix/refactoring rather than a big new feature, it is OK to integrate
>>>>>>> it at any time. If we think it is stable enough before cutting
>>>>>>> branch-2.0 then we can include it in the 2.0.0 release, else let's
>>>>>>> include it in 2.1 (maybe we can backport it to 2.0 later?).
>>>>>>>
>>>>>> I need to cut the Appy-suggested branch-2.0. Shout if
>>>>>> HBASE-19397-branch-2 gets to be too much work and I'll do it sooner
>>>>>> rather than later. Or, if easier on you, just say and I'll make the
>>>>>> branch-2.0 now so you can just commit to branch-2 (branch-2.0 will
>>>>>> become hbase2.0, branch-2 will become hbase2.1...).
>>>>>>
>>>>>> St.Ack
>>>>>>
>>>>>>> Thanks all here.
>>>>>>>
>>>>>>> 2018-01-09 12:06 GMT+08:00 ashish singhi <as...@huawei.com>:
>>>>>>>
>>>>>>>> +1 to merge on master and 2.1.
>>>>>>>> Great work.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Ashish
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: 张铎(Duo Zhang) [mailto:palomino219@gmail.com]
>>>>>>>> Sent: Tuesday, January 09, 2018 6:53 AM
>>>>>>>> To: dev@hbase.apache.org
>>>>>>>> Subject: Re: [VOTE] Merge branch HBASE-19397 back to master and
>>>>>>>> branch-2.
>>>>>>>>
>>>>>>>> Anyway, if there are no objections to merging this into master, let's
>>>>>>>> do it, so that we can start working on the follow-on features, such
>>>>>>>> as table based replication storage, synchronous replication, etc.
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>> 2018-01-09 7:19 GMT+08:00 Stack <st...@duboce.net>:
>>>>>>>>
>>>>>>>>> On Mon, Jan 8, 2018 at 2:50 PM, 张铎(Duo Zhang) <palomino219@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> This 'new' feature only changes the DDL part, not the core part of
>>>>>>>>>> replication, i.e. how to read wal entries and how to replicate them
>>>>>>>>>> to the remote cluster, etc. And also there is no pb message/storage
>>>>>>>>>> layout change; you can think of this as a big refactoring.
>>>>>>>>>> Theoretically we do not even need to add new UTs for this feature,
>>>>>>>>>> i.e. no extra stability work. The only visible change to users is
>>>>>>>>>> that it may require more time to modify peers in shell. So in
>>>>>>>>>> general I think it does less harm to include it in the coming
>>>>>>>>>> release?
>>>>>>>>>>
>>>>>>>>>> And why I think it SHOULD be included in our 2.0 release is that
>>>>>>>>>> the synchronous guarantee is really a good thing for our
>>>>>>>>>> replication related UTs. The correctness of the current
>>>>>>>>>> Test***Replication usually depends on a flakey condition - that we
>>>>>>>>>> will not do a log rolling between the modification on zk and the
>>>>>>>>>> actual loading of the modification on the RS. And we have also
>>>>>>>>>> worked hard to clean up the locking in ReplicationSourceManager and
>>>>>>>>>> add a fat comment to say why it should be synchronized in this way.
>>>>>>>>>> In general, the new code is much easier to read, test and debug,
>>>>>>>>>> and also reduces the possibility of flakeyness, which could help us
>>>>>>>>>> a lot when we start to stabilize our build.
>>>>>>>>>>
>>>>>>>>>> Thanks.
>>>>>>>>>>
>>>>>>>>> You see it as a big bug fix Duo?
>>>>>>>>>
>>>>>>>> Kind of. Just like AM v1: we can do lots of fixes to make it more
>>>>>>>> stable, but we know that we can not fix all the problems, since we
>>>>>>>> store state in several places and it is 'mission impossible' to make
>>>>>>>> all the states stay in sync under every situation... So we introduced
>>>>>>>> AM v2. For the replication peer tracking, it is the same problem. It
>>>>>>>> is hard to do fencing with a zk watcher since it is asynchronous, so
>>>>>>>> the UTs are always kind of flakey in theory. And we depend on
>>>>>>>> replication heavily at Xiaomi; it is always a pain for us.
>>>>>>>>
>>>>>>>>> I'm late to review. Will have a look after beta-1 goes out. This
>>>>>>>>> stuff looks great from outside, especially the distributed procedure
>>>>>>>>> framework which we need all over the place.
>>>>>>>>>
>>>>>>>>> In general I have no problem w/ this in master and an hbase 2.1 (2.1
>>>>>>>>> could be soon after 2.0). It's late for big stuff in 2.0.... but let
>>>>>>>>> me take a looksee sir.
>>>>>>>>>
>>>>>>>> Thanks sir. All the concerns here about whether we should merge this
>>>>>>>> into 2.0 are reasonable, I know. Although I really want this in 2.0
>>>>>>>> because I believe it will help a lot for stabilizing, I'm OK with
>>>>>>>> merging it to 2.1 only if you guys all think so.
>>>>>>>>
>>>>>>>>> St.Ack
>>>>>>>>>
>>>>>>>>>> 2018-01-09 4:53 GMT+08:00 Apekshit Sharma <ap...@cloudera.com>:
>>>>>>>>>>
>>>>>>>>>>> Same questions as Josh's.
>>>>>>>>>>> 1) We have RCs for beta1 now, which means the only commits that
>>>>>>>>>>> can go in are bug fixes. This change - although important, needed
>>>>>>>>>>> for a long time and well done (testing, summary, etc) - seems
>>>>>>>>>>> rather large to get into 2.0 now. Needs good justification why it
>>>>>>>>>>> has to be 2.1 instead of 2.0.
>>>>>>>>>>>
>>>>>>>>>>> -- Appy
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jan 8, 2018 at 8:34 AM, Josh Elser <el...@apache.org>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> -0 From a general project planning point-of-view (not based on
>>>>>>>>>>>> the technical merit of the code) I am uncomfortable about pulling
>>>>>>>>>>>> in a brand new feature after we've already made one beta RC.
>>>>>>>>>>>>
>>>>>>>>>>>> Duo -- can you expand on why this feature is so important that we
>>>>>>>>>>>> should break our release plan? Are there problems that would make
>>>>>>>>>>>> including this in a 2.1/3.0 unnecessarily difficult? Any kind of
>>>>>>>>>>>> color you can provide on "why does this need to go into 2.0?"
>>>>>>>>>>>> would be helpful.
>>>>>>>>>>>>
>>>>>>>>>>>> On 1/6/18 1:54 AM, Duo Zhang wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> https://issues.apache.org/jira/browse/HBASE-19397
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>>
>>>>>>>>>>> -- Appy
>>
>> --
>>
>> -- Appy
>
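To make the fencing argument from the quoted thread concrete: with a zk
watcher, the modification call returns before the region servers have
observed the change, so a test asserting on the new state immediately
afterwards is racy; a procedure returns only after every RS has applied it.
A toy Java sketch of the difference (no HBase types, purely illustrative):

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicBoolean;

// Toy model of async watcher propagation vs. a synchronous procedure.
public class FencingToy {
  private final AtomicBoolean peerEnabledOnRS = new AtomicBoolean(true);

  // Watcher style: returns before the RS has seen the change, leaving a race window.
  void disablePeerViaWatcher() {
    new Thread(() -> {
      try {
        Thread.sleep(100); // arbitrary notification delay
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      peerEnabledOnRS.set(false);
    }).start();
    // Returns immediately; asserting !peerEnabledOnRS.get() here is flakey.
  }

  // Procedure style: returns only after the RS has applied the change.
  void disablePeerViaProcedure() throws InterruptedException {
    CountDownLatch applied = new CountDownLatch(1);
    new Thread(() -> {
      peerEnabledOnRS.set(false); // the refresh sub procedure runs on the RS
      applied.countDown();        // and reports back to the master
    }).start();
    applied.await(); // when this returns, the change is in effect
  }
}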

Re: [VOTE] Merge branch HBASE-19397 back to master and branch-2.

Posted by Stack <st...@duboce.net>.
On Tue, Mar 13, 2018 at 8:44 PM, Stack <st...@duboce.net> wrote:

> On Tue, Mar 13, 2018 at 6:30 PM, 张铎(Duo Zhang) <pa...@gmail.com>
> wrote:
>
>> I would be interested to be the RM for 2.1 :)
>>
>>
> I could help.
>

Or, I meant, good... We need more RMs. I volunteer to help smooth the
experience.

S
>> Stack <st...@duboce.net>于2018年3月14日 周三02:48写道:
>>
>> > I took a look. It's great. Our replication is in bad need of loving. The
>> > patch does a refactor/revamp/fix of an internal facility, putting our
>> > peer management up on a Pv2 basis. It also moves forward the long-rumored
>> > distributed procedure project.
>> >
>> > It's a big change though. Would have been great to have it make 2.0.0,
>> > but it didn't. I'd be up for it in a minor release rather than wait on
>> > 3.0.0, given we allow ourselves some leeway adding facility in minors and
>> > that it is at core a solidifying fix. It needs to be doc'd and tested,
>> > verified on a deploy beyond unit tests. I could help test. Is it proven
>> > compatible with existing replication deploys?
>> >
>> > Who's the 3.0.0 RM? When is it going to roll? Having this in place is the
>> > best argument if folks propose back-ports. Without it, we are doomed to
>> > repeat the 2.0.0 experience.
>> >
>> > Thanks,
>> > St.Ack
>> >

Re: [VOTE] Merge branch HBASE-19397 back to master and branch-2.

Posted by Stack <st...@duboce.net>.
On Tue, Mar 13, 2018 at 6:30 PM, 张铎(Duo Zhang) <pa...@gmail.com>
wrote:

> I would be interested to be the RM for 2.1 :)
>
>
I could help.
S



> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> Abstracte a storage layer for replication peer/queue
> > > > >>>>>>>>>>>> manangement,
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>> and
> > > > >>>>>>>
> > > > >>>>>>>> refactored the upper layer to remove zk related
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>> naming/code/comment.
> > > > >>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>>>>>> Add pre/postExecuteProcedures CP hooks to
> > > > >>>>>>>>>>>> RegionServerObserver, and
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>> add
> > > > >>>>>>>>
> > > > >>>>>>>>> permission check for executeProcedures method which
> requires
> > > > >>>>>>>>>>>> the
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>> caller
> > > > >>>>>>>>
> > > > >>>>>>>>> to
> > > > >>>>>>>>>>>> be system user or super user.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> On rolling upgrade: just do not do any replication peer
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>> modifications
> > > > >>>>>>>
> > > > >>>>>>>> during the rolling upgrading. There is no pb/layout changes
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>> on
> > > > >>>>
> > > > >>>>> the peer/queue storage on zk.
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>>
> > > > >>>>>>>>>>>> And there are other benefits.
> > > > >>>>>>>>>>> First, we have introduced a general procedure framework
> to
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>> send
> > > > >>>>
> > > > >>>>> tasks
> > > > >>>>>>>
> > > > >>>>>>>> to
> > > > >>>>>>>>
> > > > >>>>>>>>> RS
> > > > >>>>>>>>>>> and report the report back to Master. It can be used to
> > > > >>>>>>>>>>> implement
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>> other
> > > > >>>>>>>>
> > > > >>>>>>>>> operations such as ACL change.
> > > > >>>>>>>>>>> Second, zk is used as a external storage now since we do
> > not
> > > > >>>>>>>>>>> depend
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>> on
> > > > >>>>>>>
> > > > >>>>>>>> zk
> > > > >>>>>>>>>
> > > > >>>>>>>>>> watcher any more, it will be much easier to implement a
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>> 'table
> > > > >>>>
> > > > >>>>> based'
> > > > >>>>>>>
> > > > >>>>>>>> replication peer/queue storage.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Please vote:
> > > > >>>>>>>>>>> [+1] Agree
> > > > >>>>>>>>>>> [-1] Disagree
> > > > >>>>>>>>>>> [0] Neutral
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>> Thanks.
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>> --
> > > > >>>>>>>>>
> > > > >>>>>>>>> -- Appy
> > > > >>>>>>>>>
> > > > >>>>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>
> > > > >>>>>>
> > > > >>>>>
> > > > >>>>
> > > > >>>
> > > > >>>
> > > > >>
> > > >
> > > >
> > > > --
> > > >
> > > > -- Appy
> > > >
> > >
> >
>

Re: [VOTE] Merge branch HBASE-19397 back to master and branch-2.

Posted by "张铎 (Duo Zhang)" <pa...@gmail.com>.
I would be interested in being the RM for 2.1 :)


Re: [VOTE] Merge branch HBASE-19397 back to master and branch-2.

Posted by Stack <st...@duboce.net>.
I took a look. It's great. Our replication is in bad need of some love.
The patch does a refactor/revamp/fix of an internal facility, putting our
peer management up on a Pv2 basis. It also moves forward the long-rumored
distributed procedure project.
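
For anyone who hasn't followed the Pv2 work, the general shape of such a
distributed procedure is roughly the sketch below. To be clear, this is
illustrative only, not the HBASE-19397 code; the type and method names in
it are invented:

    import java.util.List;
    import java.util.concurrent.CompletableFuture;

    // Illustrative sketch only -- not the HBASE-19397 code. The master
    // fans a task out to every region server, and the procedure is only
    // done once every server has reported back.
    class DistributedProcedureSketch {

      interface RegionServerEndpoint {
        // Ask one RS to apply the change; the future completes once it has.
        CompletableFuture<Void> refreshPeer(String peerId);
      }

      // Fan out, then block until every RS has acked. Because the caller
      // waits here, the change has taken effect everywhere by the time
      // this returns -- a zk watcher, by contrast, fires asynchronously
      // and gives the master no completion signal to wait on.
      static void refreshAll(List<RegionServerEndpoint> servers, String peerId) {
        CompletableFuture<?>[] acks = servers.stream()
            .map(rs -> rs.refreshPeer(peerId))
            .toArray(CompletableFuture[]::new);
        CompletableFuture.allOf(acks).join();
      }
    }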

It's a big change though. It would have been great to have it make 2.0.0,
but it didn't. I'd be up for it in a minor release rather than waiting on
3.0.0, given that we allow ourselves some leeway adding facilities in
minor releases and that it is at core a solidifying fix. It needs to be
doc'd and tested, and verified on a deploy beyond unit tests. I could help
test. Is it proven compatible with existing replication deploys?
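
If it helps whoever picks up that verification, a first smoke test on a
live deploy can be as small as the sketch below. It just walks the
client-facing replication peer calls on Admin; the peer id and cluster key
in it are placeholders, so point them at your own clusters:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.replication.ReplicationPeerConfig;

    // Smoke-test sketch for peer modifications against a live deploy.
    public class PeerSmokeTest {
      public static void main(String[] args) throws Exception {
        try (Connection conn =
                 ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
          ReplicationPeerConfig peerConfig = new ReplicationPeerConfig()
              .setClusterKey("dest-zk:2181:/hbase");  // placeholder cluster key
          admin.addReplicationPeer("smoke_peer", peerConfig);
          // With the Pv2-based implementation, once each call returns the
          // change should already be visible on every region server.
          admin.disableReplicationPeer("smoke_peer");
          admin.enableReplicationPeer("smoke_peer");
          admin.removeReplicationPeer("smoke_peer");
        }
      }
    }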

Who's the 3.0.0 RM? When is it going to roll? Having this in place is the
best argument if folks propose back-ports. Without it, we are doomed to
repeat the 2.0.0 experience.

Thanks,
St.Ack

Re: [VOTE] Merge branch HBASE-19397 back to master and branch-2.

Posted by "张铎 (Duo Zhang)" <pa...@gmail.com>.
I've already merged it to branch-2...

And for the procedure-based replication peer modification, the feature
has been finished, with at least no critical TODOs left. I will not say it
has no bugs, but I do not think it will block the 2.1 or 3.0 release much.

And please trust my judgement, I'm not a man who only wants to show off.
For example, I just reverted the serial replication feature from branch-2
before we released beta2 and tried to redo it on master, because we found
some critical problems. And since the basic idea has not changed, we
decided to do it on master, as it will not take too much time to finish.
And HBASE-19064 is a big feature, so we decided to do it on a feature
branch. We will select the branches we want to merge back to at the time
we finish the feature.

For the CP cleanup, I think the problem is that we started too late, and
that in the past we made mistakes and let the related projects inject too
much into the internals of HBase. And B&R is something like the serial
replication feature: for serial replication, we tested it and found some
critical problems, while B&R was not well tested (IIRC), so we were afraid
there would be critical problems. And for AMv2, I think the problem is
premature optimization. We even implemented our own AvlTree, and also a
very complicated scheduler which kept deadlocking on us...

These are all very good case studies for software engineering in the real
world... Things to avoid in the future... That's also my duty...

Thanks.

Re: [VOTE] Merge branch HBASE-19397 back to master and branch-2.

Posted by Apekshit Sharma <ap...@cloudera.com>.
I thought the problem with releasing 2.0 was that there were too many open
features dragging along and not coming to a finish - AMv2 (the biggest one),
CP cleanup, B&R, etc.
If that were not the case, it wouldn't have taken a year between the first
alpha and GA, but just a few months.

That said, I don't mind new but hardened features which are already ready
to ship (implementation-only or not) going into a minor version, but that's
my personal opinion. Being too aggressive about that can indeed lead to the
trap Josh mentioned above.

For this particular change, my +1 was based on the following aspects:
- it's internal
- moving ops onto the procv2 framework (gives failure recovery, locking, etc.)
- Duo's reasoning - "....the synchronous guarantee is really a good thing
for our replication related UTs..." and "For the replication peer tracking,
it is the same problem. It is hard to do fencing with zk watcher since it
is asynchronous, so the UTs are always kind of flakey in theoretical."
(a rough sketch of that difference follows below)
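
To make that last point concrete, here is a minimal, self-contained Java
sketch of the difference (hypothetical code, not the HBase API - the method
names and the single-RS latch are made up purely for illustration):

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class PeerDisableSketch {

  // Watcher style: the caller returns as soon as the zk storage is
  // updated; each RS applies the change whenever its watcher fires, so a
  // test asserting right after the call races against the notification.
  static void watcherStyleDisable(AtomicBoolean peerEnabled) {
    new Thread(() -> {
      try {
        Thread.sleep(100); // watcher fires at some later, unknown time
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      peerEnabled.set(false); // applied asynchronously on the RS
    }).start();
    // returns immediately; peerEnabled may still be true here
  }

  // Procedure style: the master schedules a sub-procedure per RS and the
  // parent procedure only finishes once every RS has acked, so the caller
  // gets a synchronous guarantee.
  static void procedureStyleDisable(AtomicBoolean peerEnabled)
      throws InterruptedException {
    CountDownLatch allRsRefreshed = new CountDownLatch(1); // one RS here
    new Thread(() -> {
      peerEnabled.set(false);     // refresh the peer state on the RS
      allRsRefreshed.countDown(); // RS reports back to the master
    }).start();
    allRsRefreshed.await(10, TimeUnit.SECONDS); // block until all done
    // safe to assert here: the peer is disabled on every RS
  }

  public static void main(String[] args) throws InterruptedException {
    AtomicBoolean p1 = new AtomicBoolean(true);
    watcherStyleDisable(p1);
    System.out.println("enabled right after watcher call: " + p1.get());

    AtomicBoolean p2 = new AtomicBoolean(true);
    procedureStyleDisable(p2);
    System.out.println("enabled right after procedure call: " + p2.get());
  }
}

A test against the watcher version needs a sleep or retry loop before it
can assert (the first print may still say true), which is exactly the kind
of flakeyness Duo describes; the procedure version can assert immediately.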

That said, it's pending review.
@Duo: As you know it's not possible to spend cycles on it right now -
pending 2.0 GA - can you please hold it off for a few weeks (ideally, until
GA + 2-3 weeks)? That will give the community (whoever is interested, at
least me..smile) a decent chance to review it.
Thanks

-- Appy


On Tue, Mar 13, 2018 at 1:02 AM, Josh Elser <el...@apache.org> wrote:

> (Sorry for something other than just a vote)
>
> I worry seeing a "big" feature branch merge as us falling back into the
> 1.x trap. Starting to backport features into 2.x will keep us delaying 3.0
> as we have less and less incentive to push to release 3.0 in a timely
> manner.
>
> That said, I also don't want my worries to bar a feature which appears to
> be related to implementation only (based on one high-level read of the
> changes). Perhaps we need to re-think what is allowable for a Y release in
> x.y.z...
>
> +1 for master (which already happened, maybe?)
> +0 for branch-2 (simply because I haven't looked closely enough at
> changes, can read through and try to change to +1 if you need the votes)
>

-- 

-- Appy

Re: [VOTE] Merge branch HBASE-19397 back to master and branch-2.

Posted by Josh Elser <el...@apache.org>.
(Sorry for something other than just a vote)

I worry seeing a "big" feature branch merge as us falling back into the 
1.x trap. Starting to backport features into 2.x will keep us delaying 
3.0 as we have less and less incentive to push to release 3.0 in a 
timely manner.

That said, I also don't want my worries to bar a feature which appears 
to be related to implementation only (based on one high-level read of 
the changes). Perhaps we need to re-think what is allowable for a Y 
release in x.y.z...

+1 for master (which already happened, maybe?)
+0 for branch-2 (simply because I haven't looked closely enough at 
changes, can read through and try to change to +1 if you need the votes)

On 3/9/18 2:41 AM, 张铎(Duo Zhang) wrote:
> Since branch-2.0 has been cut and branch-2 is now 2.1.0-SNAPSHOT, will
> merge branch HBASE-19397-branch-2 back to branch-2.

Re: [VOTE] Merge branch HBASE-19397 back to master and branch-2.

Posted by Umesh Agashe <ua...@cloudera.com>.
+1

On Fri, Mar 9, 2018 at 10:40 AM, Stack <st...@duboce.net> wrote:

> The intent is that this would come out in 2.1.0? This would be soon after a
> 2.0.0?
>
> S

Re: [VOTE] Merge branch HBASE-19397 back to master and branch-2.

Posted by Stack <st...@duboce.net>.
The intent is that this would come out in 2.1.0? This would be soon after a
2.0.0?

S

On Thu, Mar 8, 2018 at 11:41 PM, 张铎(Duo Zhang) <pa...@gmail.com>
wrote:

> Since branch-2.0 has been cut and branch-2 is now 2.1.0-SNAPSHOT, will
> merge branch HBASE-19397-branch-2 back to branch-2.

Re: [VOTE] Merge branch HBASE-19397 back to master and branch-2.

Posted by Apekshit Sharma <ap...@cloudera.com>.
Go for it!

On Fri, Mar 9, 2018 at 1:11 PM, 张铎(Duo Zhang) <pa...@gmail.com> wrote:

> Since branch-2.0 has been cut and branch-2 is now 2.1.0-SNAPSHOT, will
> merge branch HBASE-19397-branch-2 back to branch-2.


-- 

-- Appy

Re: [VOTE] Merge branch HBASE-19397 back to master and branch-2.

Posted by "张铎 (Duo Zhang)" <pa...@gmail.com>.
Since branch-2.0 has been cut and branch-2 is now 2.1.0-SNAPSHOT, I will
merge branch HBASE-19397-branch-2 back to branch-2.

2018-01-10 9:20 GMT+08:00 张铎(Duo Zhang) <pa...@gmail.com>:

> If branch-2.0 will be out soon then let's target this to 2.1. No problem.
>
> Thanks.
>

Re: [VOTE] Merge branch HBASE-19397 back to master and branch-2.

Posted by "张铎 (Duo Zhang)" <pa...@gmail.com>.
If branch-2.0 will be out soon then let's target this to 2.1. No problem.

Thanks.

2018-01-10 1:28 GMT+08:00 Stack <st...@duboce.net>:

> On Mon, Jan 8, 2018 at 8:19 PM, 张铎(Duo Zhang) <pa...@gmail.com>
> wrote:
>
> > OK, let me merge it to master first, and then create a
> > HBASE-19397-branch-2 which will keep rebasing on the newest branch-2
> > to see if it is stable enough. Since we can define this as a bug
> > fix/refactoring rather than a big new feature, it is OK to integrate
> > it at any time. If we think it is stable enough before cutting
> > branch-2.0 then we can include it in the 2.0.0 release; otherwise
> > let's include it in 2.1 (maybe we can backport it to 2.0 later?).
> >
> >
>
> I need to cut the Appy-suggested branch-2.0. Shout if HBASE-19397-branch-2
> gets to be too much work and I'll do it sooner rather than later. Or, if it
> is easier on you, just say so and I'll make branch-2.0 now so you can just
> commit to branch-2 (branch-2.0 will become hbase2.0, branch-2 will become
> hbase2.1...).
>
> St.Ack

Re: [VOTE] Merge branch HBASE-19397 back to master and branch-2.

Posted by Stack <st...@duboce.net>.
On Mon, Jan 8, 2018 at 8:19 PM, 张铎(Duo Zhang) <pa...@gmail.com> wrote:

> OK, let me merge it to master first, and then create a
> HBASE-19397-branch-2 which will keep rebasing on the newest branch-2 to
> see if it is stable enough. Since we can define this as a bug
> fix/refactoring rather than a big new feature, it is OK to integrate it
> at any time. If we think it is stable enough before cutting branch-2.0
> then we can include it in the 2.0.0 release; otherwise let's include it
> in 2.1 (maybe we can backport it to 2.0 later?).
>
>

I need to cut the Appy-suggested branch-2.0. Shout if HBASE-19397-branch-2
gets to be too much work and I'll do it sooner rather than later. Or, if it
is easier on you, just say so and I'll make branch-2.0 now so you can just
commit to branch-2 (branch-2.0 will become hbase2.0, branch-2 will become
hbase2.1...).

St.Ack





Re: [VOTE] Merge branch HBASE-19397 back to master and branch-2.

Posted by "张铎 (Duo Zhang)" <pa...@gmail.com>.
OK, let me merge it to master first, and then create a HBASE-19397-branch-2
which will keep rebasing on the newest branch-2 to see if it is stable
enough. Since we can define this as a bug fix/refactoring rather than a big
new feature, it is OK to integrate it at any time. If we think it is stable
enough before cutting branch-2.0 then we can include it in the 2.0.0
release; otherwise let's include it in 2.1 (maybe we can backport it to 2.0
later?).

Thanks all here.

2018-01-09 12:06 GMT+08:00 ashish singhi <as...@huawei.com>:

> +1 to merge on master and 2.1.
> Great work.
>
> Thanks,
> Ashish
>

RE: [VOTE] Merge branch HBASE-19397 back to master and branch-2.

Posted by ashish singhi <as...@huawei.com>.
+1 to merge on master and 2.1.
Great work.

Thanks,
Ashish

-----Original Message-----
From: 张铎(Duo Zhang) [mailto:palomino219@gmail.com] 
Sent: Tuesday, January 09, 2018 6:53 AM
To: dev@hbase.apache.org
Subject: Re: [VOTE] Merge branch HBASE-19397 back to master and branch-2.

Anyway, if there are no objections to merging this into master, let's do it, so that we can start working on the follow-on features, such as table based replication storage and synchronous replication.
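To make the 'table based' storage direction concrete: the storage layer is
now abstracted behind an interface, so a table based implementation only
needs to provide the same operations the zk based one does. Roughly like
this (just a sketch; the method names are from memory and may not match
the actual interface exactly; ReplicationPeerConfig and
ReplicationException are the existing HBase classes):

import java.util.List;

// Sketch of the abstracted replication peer storage layer.
public interface ReplicationPeerStorage {
  /** Persist a new peer, in enabled or disabled state. */
  void addPeer(String peerId, ReplicationPeerConfig peerConfig,
      boolean enabled) throws ReplicationException;

  /** Remove the peer and all of its persisted state. */
  void removePeer(String peerId) throws ReplicationException;

  /** Flip the enabled/disabled flag of an existing peer. */
  void setPeerState(String peerId, boolean enabled)
      throws ReplicationException;

  /** Overwrite the config of an existing peer. */
  void updatePeerConfig(String peerId, ReplicationPeerConfig peerConfig)
      throws ReplicationException;

  /** List the ids of all persisted peers. */
  List<String> listPeerIds() throws ReplicationException;

  boolean isPeerEnabled(String peerId) throws ReplicationException;

  ReplicationPeerConfig getPeerConfig(String peerId)
      throws ReplicationException;
}

A zk backed implementation and a table backed implementation can then be
swapped underneath without the upper layer noticing, which is why dropping
the watcher dependency matters here.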

Thanks.

2018-01-09 7:19 GMT+08:00 Stack <st...@duboce.net>:

> On Mon, Jan 8, 2018 at 2:50 PM, 张铎(Duo Zhang) <pa...@gmail.com>
> wrote:
>
> > This 'new' feature only changes the DDL part, not the core part of
> > replication, i.e., how to read wal entries and how to replicate them
> > to the remote cluster, etc. And there is no pb message/storage layout
> > change, so you can think of this as a big refactoring. Theoretically
> > we do not even need to add new UTs for this feature, i.e., no extra
> > stability work. The only visible change to users is that it may
> > require more time to modify peers in the shell. So in general I think
> > it does little harm to include it in the coming release?
> >
> > And why I think it SHOULD be included in our 2.0 release is that the
> > synchronous guarantee is really a good thing for our replication
> > related UTs. The correctness of the current Test***Replication tests
> > usually depends on a flakey condition - that no log rolling happens
> > between the modification on zk and the actual loading of the
> > modification on the RS. And we have also done hard work to clean up
> > the locking in ReplicationSourceManager and to add a fat comment
> > saying why it should be synchronized in this way. In general, the new
> > code is much easier to read, test and debug, and it also reduces the
> > possibility of flakeyness, which could help us a lot when we start to
> > stabilize our build.
> >
> > Thanks.
> >
> >
> You see it as a big bug fix Duo?
>
Kind of. Just like AM v1: we can do lots of fixes to make it more stable, but we know that we cannot fix all the problems, since we store state in several places and it is 'mission impossible' to make all the states stay in sync in every situation... So we introduced AM v2.
Replication peer tracking has the same problem: it is hard to do fencing with a zk watcher since it is asynchronous, so the UTs are always somewhat flakey in theory. And we depend on replication heavily at Xiaomi; it has always been a pain for us.
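To show what 'asynchronous' means here, the old style tracking on the RS
side is essentially a watcher callback like the sketch below (heavily
simplified, not the real code; only the ZooKeeper client calls are the
real API):

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Heavily simplified sketch of watcher-based peer state tracking. The
// master only writes the znode; it never learns when (or whether) each
// RS has actually applied the change.
public class PeerStateTracker implements Watcher {
  private final ZooKeeper zk;
  private final String path; // e.g. the peer-state znode of one peer

  public PeerStateTracker(ZooKeeper zk, String path) {
    this.zk = zk;
    this.path = path;
  }

  public void start() throws KeeperException, InterruptedException {
    // Read the current state and re-register the watch in one call.
    applyState(zk.getData(path, this, null));
  }

  @Override
  public void process(WatchedEvent event) {
    if (event.getType() == Watcher.Event.EventType.NodeDataChanged) {
      try {
        start(); // Runs some unknown time after the master's write.
      } catch (Exception e) {
        // A missed refresh simply leaves this RS with stale peer state.
      }
    }
  }

  private void applyState(byte[] state) {
    // Enable or disable the local replication source accordingly.
  }
}

There is no ack back to the master, so the client call returning says
nothing about what the RSes have seen yet - that is the window the flakey
UTs keep falling into. With the procedure v2 approach the master schedules
a refresh sub procedure per RS, and the parent procedure only completes
after every RS has reported back, so that window is gone.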

>
> I'm late to review. Will have a look after beta-1 goes out. This stuff
> looks great from outside, especially the distributed procedure framework
> which we need all over the place.
>
> In general I have no problem w/ this in master and an hbase 2.1 (2.1
> could be soon after 2.0). It's late for big stuff in 2.0... but let me
> take a looksee sir.
>
Thanks sir. All the concerns here about whether we should merge this into
2.0 are reasonable, I know. Although I really want this in 2.0, because I believe it will help a lot with stabilizing, I'm OK with merging it to 2.1 only, if you guys all think so.

>
> St.Ack

Re: [VOTE] Merge branch HBASE-19397 back to master and branch-2.

Posted by Yu Li <ca...@gmail.com>.
+1 for both master and branch-2. IMHO flaky UT designs are a must-fix once
found; the effort will be paid sooner or later in stability testing (or in
production after release, if our SVT coverage is not good enough).
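To make that concrete: with the synchronous guarantee a replication UT can
assert right after the admin call returns, with no sleep or retry loop.
Something like this (just a sketch; I am quoting the 2.0 Admin method names
from memory, so double check them):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

// Sketch: relies on the procedure-based guarantee that the call returns
// only after every RS has applied the peer modification.
public class DisablePeerExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      admin.disableReplicationPeer("1");
      // Safe to assert immediately: no RS is still shipping edits for
      // peer "1", so the sink cluster should see no new writes at all.
    }
  }
}

With the old watcher-based tracking, that assert would race with each RS
noticing the znode change, which is exactly the flakiness above.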

Best Regards,
Yu

On 9 January 2018 at 10:32, OpenInx <op...@gmail.com> wrote:

> +1 for a merge to master first (non-binding).
>
> If we merge into the master branch as early as possible, we can develop
> the follow-on features (table based storage & sync replication) as early
> as possible.
>

Re: [VOTE] Merge branch HBASE-19397 back to master and branch-2.

Posted by OpenInx <op...@gmail.com>.
+1 for a merge to master first (non-binding).

If we merge into the master branch as early as possible, we can develop
the follow-on features (table based storage & sync replication) as early
as possible.

On Tue, Jan 9, 2018 at 9:28 AM, Andrew Purtell <ap...@apache.org> wrote:

> FWIW, you have my +1 for a merge to master.



-- 
==============================
Openinx  blog : http://openinx.github.io

TO BE A GREAT HACKER !
==============================

Re: [VOTE] Merge branch HBASE-19397 back to master and branch-2.

Posted by Andrew Purtell <ap...@apache.org>.
FWIW, you have my +1 for a merge to master.



On Mon, Jan 8, 2018 at 5:22 PM, 张铎(Duo Zhang) <pa...@gmail.com> wrote:

> Anyway, if no objections on merging this into master, let's do it? So that
> we can start working on the follow-on features, such as table based
> replication storage, and synchronous replication, etc.
>
> Thanks.
>



-- 
Best regards,
Andrew

Words like orphans lost among the crosstalk, meaning torn from truth's
decrepit hands
   - A23, Crosstalk

Re: [VOTE] Merge branch HBASE-19397 back to master and branch-2.

Posted by Apekshit Sharma <ap...@cloudera.com>.
> For the replication peer tracking, it is the same problem. It is hard to
> do fencing with zk watcher since it is asynchronous, so the UTs are always
> kind of flakey in theory.

> the synchronous guarantee is really a good thing for our replication
> related UTs.

> And we have also done hard work to clean up the locking in
> ReplicationSourceManager and add a fat comment to say why it should be
> synchronized in this way.

That stuff just makes me love it.

Although it's improving stability (some concrete data on this would go a long
way), it's a big refactor (750kb patch), it changes Admin, and it will take a
couple of days to get reviewed.
As much as I would have loved to see it go into 2.0 (and would have
prioritized its review had it come 10 days earlier), at this point it just
seems too late.
Let's do it in 2.1 please.



On Mon, Jan 8, 2018 at 7:34 PM, Stack <st...@duboce.net> wrote:

> On Mon, Jan 8, 2018 at 5:22 PM, 张铎(Duo Zhang) <pa...@gmail.com>
> wrote:
>
> > Anyway, if no objections on merging this into master, let's do it? So
> that
> > we can start working on the follow-on features, such as table based
> > replication storage, and synchronous replication, etc.
> >
> >
>
> +1 for Master.
>



-- 

-- Appy

Re: [VOTE] Merge branch HBASE-19397 back to master and branch-2.

Posted by Stack <st...@duboce.net>.
On Mon, Jan 8, 2018 at 5:22 PM, 张铎(Duo Zhang) <pa...@gmail.com> wrote:

> Anyway, if no objections on merging this into master, let's do it? So that
> we can start working on the follow-on features, such as table based
> replication storage, and synchronous replication, etc.
>
>

+1 for Master.



> > You see it as a big bug fix Duo?
> >
>

...

> For the replication peer tracking, it is the same problem. It is hard to do
> fencing with zk watcher since it is asynchronous, so the UTs are always
> kind of flakey in theory. And we depend on replication heavily at
> Xiaomi, so it has always been a pain for us.
>
>
I have sympathy for this argument (especially if you call it a bug fix or
can argue that it fixes a bunch of flakies).

2.0.0 is so late though. We can't do anything that might shove out the date
even further. No reason 2.1 couldn't follow just after.

Have you tried this addition at scale? Can you point to marked improvement
in accounting?

What if you set up a branch-2-HBASE-19397 on jenkins; does it pass as many
tests or more?

St.Ack



Re: [VOTE] Merge branch HBASE-19397 back to master and branch-2.

Posted by Ted Yu <yu...@gmail.com>.
+1 on merging to master branch.

On Tue, Jan 9, 2018 at 6:51 AM, Josh Elser <el...@apache.org> wrote:

> +1 on merge to Master.
>
> I appreciate the details you shared, Duo, but I think I'm still -0 on a
> branch-2 merge at this point. I'm with Stack: would rather pull it into a
> fast 2.1 release.
>
>

Re: [VOTE] Merge branch HBASE-19397 back to master and branch-2.

Posted by Josh Elser <el...@apache.org>.
+1 on merge to Master.

I appreciate the details you shared, Duo, but I think I'm still -0 on a 
branch-2 merge at this point. I'm with Stack: would rather pull it into 
a fast 2.1 release.

On 1/8/18 8:22 PM, 张铎(Duo Zhang) wrote:
> Anyway, if no objections on merging this into master, let's do it? So that
> we can start working on the follow-on features, such as table based
> replication storage, and synchronous replication, etc.
> 
> Thanks.
> 

Re: [VOTE] Merge branch HBASE-19397 back to master and branch-2.

Posted by "张铎 (Duo Zhang)" <pa...@gmail.com>.
Anyway, if no objections on merging this into master, let's do it? So that
we can start working on the follow-on features, such as table based
replication storage, and synchronous replication, etc.

Thanks.

2018-01-09 7:19 GMT+08:00 Stack <st...@duboce.net>:

> On Mon, Jan 8, 2018 at 2:50 PM, 张铎(Duo Zhang) <pa...@gmail.com>
> wrote:
>
> > This 'new' feature only changes the DDL part, not the core part of
> > replication, i.e., how to read wal entries and how to replicate them to
> > the remote cluster, etc. And there is no pb message/storage layout
> > change, so you can think of this as a big refactoring. Theoretically we
> > do not even need to add new UTs for this feature, i.e., no extra
> > stability work. The only visible change to users is that it may require
> > more time to modify peers in the shell. So in general I think it hurts
> > less to include it in the coming release?
> >
> > And the reason I think it SHOULD be included in our 2.0 release is that
> > the synchronous guarantee is really a good thing for our replication
> > related UTs. The correctness of the current Test***Replication usually
> > depends on a flakey condition - that we do not do a log rolling between
> > the modification on zk and the actual loading of the modification on the
> > RS. And we have also done hard work to clean up the locking in
> > ReplicationSourceManager and add a fat comment to say why it should be
> > synchronized in this way. In general, the new code is much easier to
> > read, test and debug, and also reduces the possibility of flakiness,
> > which could help us a lot when we start to stabilize our build.
> >
> > Thanks.
> >
> >
> You see it as a big bug fix Duo?
>
Kind of. Just like the AM v1: we can do lots of fixes to make it more stable,
but we know that we can not fix all the problems, since we store state in
several places and it is a 'mission impossible' to make all the states stay
in sync under every situation... So we introduced AM v2.
For the replication peer tracking, it is the same problem. It is hard to do
fencing with zk watcher since it is asynchronous, so the UTs are always
kind of flakey in theory. And we depend on replication heavily at
Xiaomi, so it has always been a pain for us.
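
To make the asynchrony concrete, here is a minimal sketch (not the actual
HBase code; the class name and znode path are made up) of the watcher
pattern we are moving away from. The master's update returns as soon as zk
is written, while each RS only applies the change when its watch fires:

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

// Hypothetical per-RS tracker for a replication peer state znode.
public class PeerStateTracker implements Watcher {

  private final ZooKeeper zk;
  private final String peerPath; // made-up znode path for the peer state

  public PeerStateTracker(ZooKeeper zk, String peerPath) {
    this.zk = zk;
    this.peerPath = peerPath;
  }

  @Override
  public void process(WatchedEvent event) {
    // Fires some time AFTER the master's setData() has already returned, so
    // there is a window where this RS still runs with the old peer state.
    // The master cannot tell when (or whether) this has run on every RS,
    // which is exactly why fencing is hard to do with watchers.
    try {
      byte[] data = zk.getData(peerPath, this, null); // re-arm the watch
      applyPeerState(data);
    } catch (Exception e) {
      // If this fails, the RS may not see the update until the next event.
    }
  }

  private void applyPeerState(byte[] data) {
    // Parse and apply the new peer state locally.
  }
}

With the procedure based approach, the master instead schedules a refresh
sub procedure on every RS and waits for all of them to report back, so when
the procedure finishes the modification is known to have taken effect
everywhere.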

>
> I'm late to review. Will have a look after beta-1 goes out. This stuff
> looks great from outside, especially the distributed procedure framework,
> which we need all over the place.
>
> In general I have no problem w/ this in master and an hbase 2.1 (2.1 could
> be soon after 2.0). It's late for big stuff in 2.0.... but let me take a
> looksee sir.
>
Thanks sir. All the concerns here about whether we should merge this into
2.0 are reasonable, I know. Although I really want this in 2.0 because I
believe it will help a lot with stabilizing, I'm OK with merging it to 2.1
only, if you guys all think so.

>
> St.Ack
>

Re: [VOTE] Merge branch HBASE-19397 back to master and branch-2.

Posted by Stack <st...@duboce.net>.
On Mon, Jan 8, 2018 at 2:50 PM, 张铎(Duo Zhang) <pa...@gmail.com> wrote:

> This 'new' feature only changes the DDL part, not the core part of
> replication, i.e., how to read wal entries and how to replicate them to
> the remote cluster, etc. And there is no pb message/storage layout
> change, so you can think of this as a big refactoring. Theoretically we
> do not even need to add new UTs for this feature, i.e., no extra
> stability work. The only visible change to users is that it may require
> more time to modify peers in the shell. So in general I think it hurts
> less to include it in the coming release?
>
> And the reason I think it SHOULD be included in our 2.0 release is that
> the synchronous guarantee is really a good thing for our replication
> related UTs. The correctness of the current Test***Replication usually
> depends on a flakey condition - that we do not do a log rolling between
> the modification on zk and the actual loading of the modification on the
> RS. And we have also done hard work to clean up the locking in
> ReplicationSourceManager and add a fat comment to say why it should be
> synchronized in this way. In general, the new code is much easier to
> read, test and debug, and also reduces the possibility of flakiness,
> which could help us a lot when we start to stabilize our build.
>
> Thanks.
>
>
You see it as a big bug fix Duo?

I'm late to review. Will have a look after beta-1 goes out. This stuff
looks great from outside, especially the distributed procedure framework,
which we need all over the place.

In general I have no problem w/ this in master and an hbase 2.1 (2.1 could
be soon after 2.0). It's late for big stuff in 2.0.... but let me take a
looksee sir.

St.Ack






Re: [VOTE] Merge branch HBASE-19397 back to master and branch-2.

Posted by "张铎 (Duo Zhang)" <pa...@gmail.com>.
This 'new' feature only changes the DDL part, not the core part of
replication, i.e., how to read wal entries and how to replicate them to the
remote cluster, etc. And there is no pb message/storage layout change, so you
can think of this as a big refactoring. Theoretically we do not even need to
add new UTs for this feature, i.e., no extra stability work. The only visible
change to users is that it may require more time to modify peers in the
shell. So in general I think it hurts less to include it in the coming
release?

And the reason I think it SHOULD be included in our 2.0 release is that the
synchronous guarantee is really a good thing for our replication related UTs.
The correctness of the current Test***Replication usually depends on a flakey
condition - that we do not do a log rolling between the modification on zk
and the actual loading of the modification on the RS. And we have also done
hard work to clean up the locking in ReplicationSourceManager and add a fat
comment to say why it should be synchronized in this way. In general, the new
code is much easier to read, test and debug, and also reduces the possibility
of flakiness, which could help us a lot when we start to stabilize our build.
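
As an illustration, here is a hedged sketch (hypothetical test code, not
from the branch; rollAllWals is a made-up helper) of the race that the
watcher based design leaves open:

import org.apache.hadoop.hbase.client.Admin;

public class PeerModificationRaceExample {

  // Old behavior: disabling the peer returns once zk is updated, while each
  // RS loads the change asynchronously via its watcher.
  static void disableThenRoll(Admin admin, String peerId) throws Exception {
    admin.disableReplicationPeer(peerId);
    // Window: an RS that has not seen the update yet still treats the peer
    // as enabled. If the test rolls the WALs inside this window, its
    // assertions go flakey.
    rollAllWals();
  }

  static void rollAllWals() {
    // Roll the WAL on every RS; placeholder for the real test utility call.
  }
}

With the procedure v2 based DDL, disableReplicationPeer returns only after
every RS has refreshed the peer, so the ordering above is deterministic and
the UTs no longer depend on timing.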

Thanks.

2018-01-09 4:53 GMT+08:00 Apekshit Sharma <ap...@cloudera.com>:

> Same questions as Josh's.
> 1) We have RCs for beta1 now, which means the only commits that can go in
> are bug fixes. This change - although important, needed for a long time,
> and well done (testing, summary, etc) - seems very large to get into 2.0
> now. Needs good justification why it has to be 2.0 instead of 2.1.
>
> -- Appy
>
Re: [VOTE] Merge branch HBASE-19397 back to master and branch-2.

Posted by Apekshit Sharma <ap...@cloudera.com>.
Same questions as Josh's.
1) We have RCs for beta1 now, which means the only commits that can go in
are bug fixes. This change - although important, long needed, and well done
(testing, summary, etc.) - seems very large to get into 2.0 now. It needs
good justification why it has to be 2.0 instead of 2.1.

-- Appy


On Mon, Jan 8, 2018 at 8:34 AM, Josh Elser <el...@apache.org> wrote:

> -0 From a general project planning point-of-view (not based on the
> technical merit of the code) I am uncomfortable about pulling in a brand
> new feature after we've already made one beta RC.
>
> Duo -- can you expand on why this feature is so important that we should
> break our release plan? Are there problems that would make including this
> in a 2.1/3.0 unnecessarily difficult? Any kind of color you can provide on
> "why does this need to go into 2.0?" would be helpful.
>
> [...]


-- 

-- Appy

Re: [VOTE] Merge branch HBASE-19397 back to master and branch-2.

Posted by Josh Elser <el...@apache.org>.
-0 From a general project planning point of view (not based on the
technical merit of the code), I am uncomfortable about pulling in a brand
new feature after we've already made one beta RC.

Duo -- can you expand on why this feature is so important that we should 
break our release plan? Are there problems that would make including 
this in a 2.1/3.0 unnecessarily difficult? Any kind of color you can 
provide on "why does this need to go into 2.0?" would be helpful.

On 1/6/18 1:54 AM, Duo Zhang wrote:
> https://issues.apache.org/jira/browse/HBASE-19397
> [...]