Posted to dev@hbase.apache.org by Enis Söztutar <en...@hortonworks.com> on 2014/01/15 09:44:39 UTC

[PROPOSAL] HBASE-10070 branch

Hi,

I just wanted to give some updates on the HBASE-10070 efforts from the
technical side, and development side, and propose a branch.

From the technical side:
The changes for region replicas phase 1 are becoming more mature and
stable, and most of the "base" changes are starting to become good
candidates for review. The code has been rebased to trunk, and the main
working repo has been moved to the HBASE-10070 branch at
https://github.com/enis/hbase/tree/hbase-10070.

An overview of the changes that are working includes:
 - HRegionInfo & MetaReader & MetaEditor changes for supporting region replicas
 - HTableDescriptor changes and shell changes for supporting
REGION_REPLICATION (see the sketch after this list)
 - WebUI changes to display whether a region is a replica or not
 - AssignmentManager changes coupled with RegionStates & Master changes to
create and assign replicas, alter table, enable table, etc support.
 - Fixed hbck to work with replicas
 - A Consistency API on the client side together with shell support
 - Load Balancer changes for region replicas for replica placement
 - Region and RegionServer changes for opening region replicas, and
refreshing store files
 - Client-side changes for RPC failover support for eventually consistent
gets
 - End to end test mentioned in
https://issues.apache.org/jira/browse/HBASE-10070?focusedCommentId=13849978&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13849978
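
To make a couple of the items above concrete, creating a table with
REGION_REPLICATION from the Java client looks roughly like the sketch
below. Treat it as a sketch against the branch rather than a frozen
API; names may still change in review.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateReplicatedTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      // REGION_REPLICATION = 3: one primary plus two read replicas
      // for every region of the table.
      HTableDescriptor htd = new HTableDescriptor(TableName.valueOf("t1"));
      htd.setRegionReplication(3);
      htd.addFamily(new HColumnDescriptor("f1"));
      admin.createTable(htd);
    } finally {
      admin.close();
    }
  }
}

The shell side is the same idea: pass REGION_REPLICATION => 3 in the
table attributes at create/alter time.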

These are some of the remaining things that we are currently working on:
 - RPC failover support for multi-gets (extending the single-get
failover pattern sketched after this list)
 - RPC failover support for scans
 - RPC cancellation
 - Ability to refresh the client's knowledge about replica location/count
changes
 - Integration tests
 - General hardening
 - Perf tests
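
For context on the failover items: the single-get failover we already
have is essentially a "backup request" pattern, and the multi-get and
scan items extend it. Below is a self-contained sketch of the pattern
with illustrative names; it is not the branch's actual RPC code, which
also has to handle cancellation of the losing calls and primary errors.

import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class BackupRequestGet {
  /** Illustrative stand-in for a get RPC against one replica. */
  public interface ReplicaCall extends Callable<byte[]> {}

  /**
   * Ask the primary first; if it has not answered within
   * primaryTimeoutMs, fan the request out to the replicas and return
   * whichever copy answers first. Replica answers may be stale.
   */
  public static byte[] getWithFailover(ReplicaCall primary,
      List<ReplicaCall> replicas, long primaryTimeoutMs,
      ExecutorService pool) throws Exception {
    CompletionService<byte[]> completion =
        new ExecutorCompletionService<byte[]>(pool);
    completion.submit(primary);
    Future<byte[]> first = completion.poll(primaryTimeoutMs,
        TimeUnit.MILLISECONDS);
    if (first != null) {
      return first.get(); // primary answered in time
    }
    for (ReplicaCall replica : replicas) {
      completion.submit(replica); // backup requests
    }
    return completion.take().get(); // first response wins
  }
}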

Development side:
As discussed in the "Apache code development process" section of the
issue design doc
(https://issues.apache.org/jira/secure/attachment/12616659/HighAvailabilityDesignforreadsApachedoc.pdf),
at this time we would like to propose:
 (1) Creation of an HBASE-10070 branch in svn, which will be a fork of trunk
as of the date the branch is created. All of the target authors (me, Devaraj,
Nicolas, Sergey) are already committers. I do not remember whether our
bylaws require votes on creating branches.
 (2) The branch will only contain commits that have been reviewed and +1'ed
by 2 committers other than the patch author. Every commit in this
branch will have a single patch (maybe with unforeseen addendums) and an
associated jira which is a subtask of HBASE-10070.
 (3) We will use the branch HBASE-10070 hosted at my github repo
https://github.com/enis/hbase/tree/hbase-10070 as a working branch with
semi-dirty history and "this branch might eat your hard drive" guarantees.
 (4) All code contributions / review will be welcome as always. I can give
you push perms to the github branch if you are interested in contributing.
 (5) Once we have HBASE-10070 Phase 1 tasks done (as described in the doc),
we will put up a VOTE to merge the branch in. We will require 3 +1's for
the merge. If we can get early reviews, the merge vote will be much less
painful since the branch will be in a clean state and there will have been
reviews per patch. We might need a final rebase, but that should not cause
major work, I imagine.

We are hoping this will be a nice way to develop and deliver the feature to
the trunk, but as always all suggestions, comments welcome.

Enis

Re: [PROPOSAL] HBASE-10070 branch

Posted by Devaraj Das <dd...@hortonworks.com>.
On Wed, Jan 15, 2014 at 4:43 PM, Elliott Clark <ec...@apache.org> wrote:
> On Wed, Jan 15, 2014 at 3:57 PM, Enis Söztutar <en...@gmail.com> wrote:
>
>> I am afraid, it is not coprocessors or current set of plugins only. We need
>> changes in the
>> RPC, meta, region server, LB and master. Since we cannot easily get hooks
>> into all these in
>> a clean manner, implementing this purely outside would be next to
>> impossible.
>>
>
> I'm pretty unconvinced that this is the correct way forward.  It seems to
> introduce a lot of risk without a lot of gain.  Right now to me the 100%
> correct way forward is through paxos.  That's a lot of work but it has the
> most payoff in the end.  It will allow much faster recovery, much easier
> read sharding, it allows the greatest flexibility on IO.
>

Elliott, if I am not mistaken, we will need the replica management
work for the Paxos case as well. A lot of the work done in HBASE-10070
(to start with, the master/loadbalancer side of the region-replica
management work) would be leveraged if we choose to implement Paxos.

> On the other end of the spectrum is something like MySQL/Postgres read
> slaves (either tables or clusters).  Read slaves built on top of what's
> currently there seem to give all of the benefits of read slaves built into
> the current HBase without all of the risk. Sharding on top of the already
> built datastore is a pretty well known and well understood problem.  There
> are lots of great examples of making this scale to pretty insane heights.
>  You lose very little flexibility and incur almost no risk to the
> stability of HBase.

We have gone over this point before. We are trying to address the
issue within a single cluster. We don't want to create more storage
overhead if we can help it (which we would have if we did
intra-cluster replication).
Again the default behavior of single replica per region, etc. is kept
intact. This should be true even from the stability point of view.


Re: [PROPOSAL] HBASE-10070 branch

Posted by Elliott Clark <ec...@apache.org>.
On Wed, Jan 15, 2014 at 3:57 PM, Enis Söztutar <en...@gmail.com> wrote:

> I am afraid, it is not coprocessors or current set of plugins only. We need
> changes in the
> RPC, meta, region server, LB and master. Since we cannot easily get hooks
> into all these in
> a clean manner, implementing this purely outside would be next to
> impossible.
>

I'm pretty unconvinced that this is the correct way forward.  It seems to
introduce a lot of risk without a lot of gain.  Right now to me the 100%
correct way forward is through paxos.  That's a lot of work but it has the
most payoff in the end.  It will allow much faster recovery, much easier
read sharding, and the greatest flexibility on IO.

On the other end of the spectrum is something like MySQL/Postgres read
slaves (either tables or clusters).  Read slaves built on top of what's
currently there seem to give all of the benefits of read slaves built into
the current HBase without all of the risk. Sharding on top of the already
built datastore is a pretty well known and well understood problem.  There
are lots of great examples of making this scale to pretty insane heights.
 You lose very little flexibility and incur almost no risk to the
stability of HBase.

Re: [PROPOSAL] HBASE-10070 branch

Posted by Enis Söztutar <en...@gmail.com>.
On Wed, Jan 15, 2014 at 2:38 PM, Andrew Purtell <ap...@apache.org> wrote:

> On Wed, Jan 15, 2014 at 2:24 PM, Stack <st...@duboce.net> wrote:
>
> > > However, with different tables, it will be unintuitive
> > > since the meta, and the
> > > client side would have to bring different regions of different tables
> to
> > > make sense. Those tables
> > > will not have any associated data, but refer to the other tables etc.
> > >
> > >
> > That is right.  HBase core would go untouched.  The read replica
> > 'construct' would be an imposition done in a layer above.
> >
>
> This is why I like the latter idea, but maybe it isn't good enough. I need
> to check if HBASE-10070 has something on the specific objectives here
> before I ask. Did we fit the solution to the problem or the problem to the
> solution? When I was considering HBASE-2357, I came to the conclusion it
> was the latter.
>

I am afraid it is not just coprocessors or the current set of plugins. We
need changes in the RPC, meta, region server, LB and master. Since we
cannot easily get hooks into all of these in a clean manner, implementing
this purely outside would be next to impossible.

I think we looked into various options, including cross-cluster and
intra-cluster replication, multiple clusters in the same DC, etc., but came
to the conclusion that the proposed approach would be the cleanest way.


>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>

Re: [PROPOSAL] HBASE-10070 branch

Posted by Stack <st...@duboce.net>.
On Wed, Jan 15, 2014 at 2:38 PM, Andrew Purtell <ap...@apache.org> wrote:

> On Wed, Jan 15, 2014 at 2:24 PM, Stack <st...@duboce.net> wrote:
>
> > > However, with different tables, it will be unintuitive
> > > since the meta, and the
> > > client side would have to bring different regions of different tables
> to
> > > make sense. Those tables
> > > will not have any associated data, but refer to the other tables etc.
> > >
> > >
> > That is right.  HBase core would go untouched.  The read replica
> > 'construct' would be an imposition done in a layer above.
> >
>
> This is why I like the latter idea, but maybe it isn't good enough. I need
> to check if HBASE-10070 has something on the specific objectives here
> before I ask. Did we fit the solution to the problem or the problem to the
> solution? When I was considering HBASE-2357, I came to the conclusion it
> was the latter.
>
>
I asked the above over in HBASE-10070 (whether this solution is what the
user asked for).

Let me add that I just finished the design doc; it is quality work, and it
looks viable to me, with an attempt at minimal imposition on hbase core.

Was going to check out the branch next.

St.Ack

Re: [PROPOSAL] HBASE-10070 branch

Posted by Andrew Purtell <ap...@apache.org>.
On Wed, Jan 15, 2014 at 2:24 PM, Stack <st...@duboce.net> wrote:

> > However, with different tables, it will be unintuitive
> > since the meta, and the
> > client side would have to bring different regions of different tables to
> > make sense. Those tables
> > will not have any associated data, but refer to the other tables etc.
> >
> >
> That is right.  HBase core would go untouched.  The read replica
> 'construct' would be an imposition done in a layer above.
>

This is why I like the latter idea, but maybe it isn't good enough. I need
to check if HBASE-10070 has something on the specific objectives here
before I ask. Did we fit the solution to the problem or the problem to the
solution? When I was considering HBASE-2357, I came to the conclusion it
was the latter.

-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: [PROPOSAL] HBASE-10070 branch

Posted by Stack <st...@duboce.net>.
On Wed, Jan 15, 2014 at 1:47 PM, Enis Söztutar <en...@gmail.com> wrote:

> >
> > > > I am late to the game so take my comments w/ a grain of salt -- I'll
> > > take a
> > > > look at HBASE-10070 -- but high-level do we have to go the read
> > replicas
> > > > route?  IMO, having our current already-strained AssignmentManager
> code
> > > > base manage three replicas instead of one will ensure that Jimmy
> Xiang
> > > and
> > > > Jeffrey Zhong do nothing else for the next year or two but work on
> the
> > > new
> > > > interesting use cases introduced by this new level of complexity put
> > > upon a
> > > > system that has just achieved a hard-won stability.
> > > >
> > >
> > > Stack, the model is that the replicas (HRegionInfo with an added field
> > > 'replicaId') are treated just as any other region in the AM. You can
> > > see the code - it's not adding much at all in terms of new code to
> > > handle replicas.
> > >
> > >
> >
>
> Adding to what Devaraj said, we opted for actually creating one more
> HRegionInfo object
> per region per replica count so that the assignment state machine is not
> affected. The high level
> change is that we are creating replica x num regions many regions, and
> assign them. The LB
> ensures that replica's are placed with high availability across hosts and
> racks.
>
>
Ok.  Then it is about the same amount of work in either case.  The LB is to
be altered to factor in namespaces.  This replicas work seems equivalent,
only along another dimension (can the dimensions be joined so we get
namespace-aware balancing when replica-aware balancing is added?)



> However, with different tables, it will be unintuitive
> since the meta, and the
> client side would have to bring different regions of different tables to
> make sense. Those tables
> will not have any associated data, but refer to the other tables etc.
>
>
That is right.  HBase core would go untouched.  The read replica
'construct' would be an imposition done in a layer above.


> Trying to minimize the new code getting to the objective.
> >
> >
> >
> I think these should be addressed by region changes section in the design
> doc. In region-snapshots
> section, we detail how this will be like single-region snapshots. We do not
> need table snapshots per se,
> since we are opening the region replica from the files of the primary.
> There is already a working patch for this
> in the branch. In async-wal replication section, we mention how this can be
> built using the existing replication
> mechanism. We cannot directly replicate to a different table since we do
> not want to multiply the actual data in hdfs.
> But we will tap into the replica sink to do the in-cluster replication.
>
>
OK.


> > Quorum read/writes as in paxos, raft (Liyin talked about the Facebook
> > Hydrabase project at his keynote at hbasecon last year).
> >
>
> That won't happen without a major architecture surgery in HBase.
> HBASE-10070 is some
> major work, but is in no way a major arch change I would say. Hydrabase /
> megastore is also
> across DC, while we are mostly interested in intra-DC availability right
> now.
>
>
Timeline is one of the questions I have up on HBASE-10070.

Thanks E,
St.Ack

Re: [PROPOSAL] HBASE-10070 branch

Posted by Enis Söztutar <en...@gmail.com>.
>
> > > I am late to the game so take my comments w/ a grain of salt -- I'll
> > take a
> > > look at HBASE-10070 -- but high-level do we have to go the read
> replicas
> > > route?  IMO, having our current already-strained AssignmentManager code
> > > base manage three replicas instead of one will ensure that Jimmy Xiang
> > and
> > > Jeffrey Zhong do nothing else for the next year or two but work on the
> > new
> > > interesting use cases introduced by this new level of complexity put
> > upon a
> > > system that has just achieved a hard-won stability.
> > >
> >
> > Stack, the model is that the replicas (HRegionInfo with an added field
> > 'replicaId') are treated just as any other region in the AM. You can
> > see the code - it's not adding much at all in terms of new code to
> > handle replicas.
> >
> >
>

Adding to what Devaraj said, we opted for actually creating one more
HRegionInfo object per region per replica, so that the assignment state
machine is not affected. The high-level change is that we create
(replica count) x (num regions) regions in total, and assign them all.
The LB ensures that replicas are placed for high availability across
hosts and racks.
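
Illustratively, the per-host constraint the LB enforces boils down to a
check like the one below. This is a sketch only; the balancer code on
the branch is richer and also spreads replicas across racks.

import java.util.List;
import java.util.Map;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.util.Bytes;

public class ReplicaPlacementCheck {
  // Two HRegionInfos are replicas of the same region when they cover
  // the same table and key range; only the replicaId differs.
  static boolean replicasOfSameRegion(HRegionInfo a, HRegionInfo b) {
    return a.getTable().equals(b.getTable())
        && Bytes.equals(a.getStartKey(), b.getStartKey());
  }

  // Placing 'candidate' on 'server' hurts availability if another
  // replica of the same region is already assigned there.
  static boolean violatesHostSpread(ServerName server,
      HRegionInfo candidate,
      Map<ServerName, List<HRegionInfo>> assignments) {
    List<HRegionInfo> assigned = assignments.get(server);
    if (assigned == null) {
      return false;
    }
    for (HRegionInfo region : assigned) {
      if (replicasOfSameRegion(region, candidate)) {
        return true;
      }
    }
    return false;
  }
}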


>
>
>
> > > A few of us chatting offline -- Jimmy, Jon, Elliott, and I -- were
> > > wondering if you couldn't solve this read replicas in a more hbase
> > 'native'
> > > way* by just bringing up three tables -- a main table and then two
> > snapshot
> > > clones with the clones refreshed on a period (via snapshot or via
> > > in-cluster replication) --  and then a shim on top of an HBase client
> > would
> > > read from the main table until failure and then from a snapshot until
> the
> > > main came back.  Reads from snapshot tables could be marked 'stale'.
> >  You'd
> > > have to modify the balancer so the tables -- or at least their regions
> --
> > > were physically distinct... you might be able just have the three
> tables
> > > each in a different namespace.
>

Doing region replicas via tables vs. multiplying the num regions would
involve a very similar amount of code changes. The LB still has to be aware
of the fact that regions from different tables should not be co-hosted. As
per the above, in neither case is the assignment state machine altered.
However, with different tables, it would be unintuitive, since the meta and
the client side would have to bring together different regions of different
tables to make sense of things. Those tables would not have any associated
data, but refer to the other tables, etc.


>  > >
> >
> > At a high level, considering all the work that would be needed in the
> > client (for it to be able to be aware of the primary and the snapshot
> > regions)
>
>
> Minor.  Right?  Snapshot tables would have a _snapshot suffix?
>
>
> > and in the master (to do with managing the placements of the
> > regions),
>
>
> Balancer already factors myriad attributes.  Adding one more rule seems
> like it would be near-in scope.
>
> And this would be work not in the client but in layer above the client.
>
>
>
> > I am not convinced. Also, consider that you will be taking a
> > lot of snapshots and adding to the filesystem's load for the file
> > creations.
> >
> >
> Snapshotting is a well-worn and tested code path.  Making them is pretty
> lightweight op.  Frequency would depend on what the app needs.
>
> Could go the replication route too, another well-worn and  tested code
> path.
>
> Trying to minimize the new code getting to the objective.
>
>
>
I think these should be addressed by the region changes section in the
design doc. In the region-snapshots section, we detail how this will work
like single-region snapshots. We do not need table snapshots per se, since
we are opening the region replica from the files of the primary. There is
already a working patch for this in the branch. In the async-wal
replication section, we mention how this can be built using the existing
replication mechanism. We cannot directly replicate to a different table
since we do not want to multiply the actual data in hdfs. But we will tap
into the replica sink to do the in-cluster replication.
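
To make the freshness story concrete, the two knobs involved look
roughly like the following; take the property names as illustrative of
what is on the branch rather than final.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ReplicaFreshnessConfig {
  public static Configuration create() {
    Configuration conf = HBaseConfiguration.create();
    // Secondary replicas periodically re-open the primary's store
    // files, picking up flushes and compactions (region-snapshots).
    conf.setLong("hbase.regionserver.storefile.refresh.period", 30000L);
    // Tap the replication sink to ship WAL edits to the replicas
    // in-cluster for lower staleness (async-wal replication).
    conf.setBoolean("hbase.region.replica.replication.enabled", true);
    return conf;
  }
}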


>
> > > Or how much more work would it take to follow the route our Facebook
> > > brothers and sisters have taken doing quorum reads and writes
> incluster?
> > >
> >
> > If you are talking about Facebook's work that is talked about in
> > HBASE-7509, the quorum reads is something that we will benefit from,
> > and that will help the filesystem side of the story, but we still need
> > multiple (redundant) regions for the hbase side. If a region is not
> > reachable, the client could go to another replica for the region...
> >
> >
> No.
>
> Quorum read/writes as in paxos, raft (Liyin talked about the Facebook
> Hydrabase project at his keynote at hbasecon last year).
>

That won't happen without major architectural surgery in HBase.
HBASE-10070 is some major work, but it is in no way a major arch change,
I would say. Hydrabase / megastore is also across-DC, while we are mostly
interested in intra-DC availability right now.


>
>
>
> > > * When I say 'native' way in the above, what I mean by this is that
> HBase
> > > has always been about giving clients a 'consistent' view -- at least
> when
> > > the query is to the source cluster.  Introducing talk and APIs that
> talk
> > of
> > > 'eventual consistency' muddies our story.
> > >
> > >
> >
> > As we have discussed in the jira, there are use cases. And it's
> > optional - all the APIs provide 'consistency' by default (status quo).
> >
> >
> Sorry I'm behind.  Let me review.  My concern is that our shell and API now
> will have notions of consistency other than "what you write is what you
> read" all over them because we took on a use case that is 'interesting' but
> up to this at least, a rare request.
>

I think the Consistency API from the client and the shell is intuitive and
can be configured per
request, which is the expected behavior. (
https://github.com/enis/hbase/commit/cf2c94022200a6fa7f3153b7e0655134fb73ec8c
and
https://github.com/enis/hbase/commit/75a4d9d7734ffa4f8a7b5aeb382f7a08e444984e
)
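
From those commits, a timeline-consistent read looks roughly like the
sketch below. The default stays Consistency.STRONG (status quo);
TIMELINE is a per-request opt-in, and a result served by a secondary
replica is flagged stale.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Consistency;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class TimelineGetExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "t1");
    try {
      Get get = new Get(Bytes.toBytes("row1"));
      // Per-request opt-in: allow failing over to region replicas.
      get.setConsistency(Consistency.TIMELINE);
      Result result = table.get(get);
      // true when a secondary replica served the read.
      System.out.println("stale = " + result.isStale());
    } finally {
      table.close();
    }
  }
}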



>
> Thanks,
> St.Ack
>

Re: [PROPOSAL] HBASE-10070 branch

Posted by Stack <st...@duboce.net>.
On Wed, Jan 15, 2014 at 12:51 PM, Devaraj Das <dd...@hortonworks.com> wrote:

> Some responses inline. Thanks for the inputs.
>
> On Wed, Jan 15, 2014 at 11:17 AM, Stack <st...@duboce.net> wrote:
> > On Wed, Jan 15, 2014 at 12:44 AM, Enis Söztutar <enis@hortonworks.com
> >wrote:
> >
> >> Hi,
> >>
> >> I just wanted to give some updates on the HBASE-10070 efforts from the
> >> technical side, and development side, and propose a branch.
> >>
> >> From the technical side:
> >> The changes for region replicas phase 1 are becoming more mature and
> >> stable, and most of the "base" changes are starting to become good
> >> candidates for review. The code has been rebased to trunk, and the main
> >> working repo has been moved to the HBASE-10070 branch at
> >> https://github.com/enis/hbase/tree/hbase-10070.
> >>
> >> An overview of the changes that are working includes:
> >>  - HRegionInfo & MetaReader & MetaEditor changes for supporting region
> >> replicas
> >>  - HTableDescriptor changes and shell changes for supporting
> >> REGION_REPLICATION
> >>  - WebUI changes to display whether a region is a replica or not
> >>  - AssignmentManager changes coupled with RegionStates & Master changes
> to
> >> create and assign replicas, alter table, enable table, etc support.
> >>
> >
> >
> > Thanks for the writeup.
> >
> > I am late to the game so take my comments w/ a grain of salt -- I'll
> take a
> > look at HBASE-10070 -- but high-level do we have to go the read replicas
> > route?  IMO, having our current already-strained AssignmentManager code
> > base manage three replicas instead of one will ensure that Jimmy Xiang
> and
> > Jeffrey Zhong do nothing else for the next year or two but work on the
> new
> > interesting use cases introduced by this new level of complexity put
> upon a
> > system that has just achieved a hard-won stability.
> >
>
> Stack, the model is that the replicas (HRegionInfo with an added field
> 'replicaId') are treated just as any other region in the AM. You can
> see the code - it's not adding much at all in terms of new code to
> handle replicas.
>
>
I'm getting there.  Will check it out.



> > A few of us chatting offline -- Jimmy, Jon, Elliott, and I -- were
> > wondering if you couldn't solve this read replicas in a more hbase
> 'native'
> > way* by just bringing up three tables -- a main table and then two
> snapshot
> > clones with the clones refreshed on a period (via snapshot or via
> > in-cluster replication) --  and then a shim on top of an HBase client
> would
> > read from the main table until failure and then from a snapshot until the
> > main came back.  Reads from snapshot tables could be marked 'stale'.
>  You'd
> > have to modify the balancer so the tables -- or at least their regions --
> > were physically distinct... you might be able just have the three tables
> > each in a different namespace.
> >
>
> At a high level, considering all the work that would be needed in the
> client (for it to be able to be aware of the primary and the snapshot
> regions)


Minor.  Right?  Snapshot tables would have a _snapshot suffix?


> and in the master (to do with managing the placements of the
> regions),


Balancer already factors myriad attributes.  Adding one more rule seems
like it would be near-in scope.

And this would be work not in the client but in a layer above the client.



> I am not convinced. Also, consider that you will be taking a
> lot of snapshots and adding to the filesystem's load for the file
> creations.
>
>
Snapshotting is a well-worn and tested code path.  Making them is a pretty
lightweight op.  Frequency would depend on what the app needs.

Could go the replication route too, another well-worn and  tested code path.

Trying to minimize the new code getting to the objective.



> > Or how much more work would it take to follow the route our Facebook
> > brothers and sisters have taken doing quorum reads and writes incluster?
> >
>
> If you are talking about Facebook's work that is talked about in
> HBASE-7509, the quorum reads is something that we will benefit from,
> and that will help the filesystem side of the story, but we still need
> multiple (redundant) regions for the hbase side. If a region is not
> reachable, the client could go to another replica for the region...
>
>
No.

Quorum read/writes as in paxos, raft (Liyin talked about the Facebook
Hydrabase project at his keynote at hbasecon last year).



> > * When I say 'native' way in the above, what I mean by this is that HBase
> > has always been about giving clients a 'consistent' view -- at least when
> > the query is to the source cluster.  Introducing talk and APIs that talk
> of
> > 'eventual consistency' muddies our story.
> >
> >
>
> As we have discussed in the jira, there are use cases. And it's
> optional - all the APIs provide 'consistency' by default (status quo).
>
>
Sorry I'm behind.  Let me review.  My concern is that our shell and API now
will have notions of consistency other than "what you write is what you
read" all over them because we took on a use case that is 'interesting' but
up to now at least, a rare request.

Thanks,
St.Ack

Re: [PROPOSAL] HBASE-10070 branch

Posted by Devaraj Das <dd...@hortonworks.com>.
Some responses inline. Thanks for the inputs.

On Wed, Jan 15, 2014 at 11:17 AM, Stack <st...@duboce.net> wrote:
> On Wed, Jan 15, 2014 at 12:44 AM, Enis Söztutar <en...@hortonworks.com>wrote:
>
>> Hi,
>>
>> I just wanted to give some updates on the HBASE-10070 efforts from the
>> technical side, and development side, and propose a branch.
>>
>> From the technical side:
>> The changes for region replicas phase 1 are becoming more mature and
>> stable, and most of the "base" changes are starting to become good
>> candidates for review. The code has been rebased to trunk, and the main
>> working repo has been moved to the HBASE-10070 branch at
>> https://github.com/enis/hbase/tree/hbase-10070.
>>
> >> An overview of the changes that are working includes:
> >>  - HRegionInfo & MetaReader & MetaEditor changes for supporting region
> >> replicas
>>  - HTableDescriptor changes and shell changes for supporting
>> REGION_REPLICATION
>>  - WebUI changes to display whether a region is a replica or not
>>  - AssignmentManager changes coupled with RegionStates & Master changes to
>> create and assign replicas, alter table, enable table, etc support.
>>
>
>
> Thanks for the writeup.
>
> I am late to the game so take my comments w/ a grain of salt -- I'll take a
> look at HBASE-10070 -- but high-level do we have to go the read replicas
> route?  IMO, having our current already-strained AssignmentManager code
> base manage three replicas instead of one will ensure that Jimmy Xiang and
> Jeffrey Zhong do nothing else for the next year or two but work on the new
> interesting use cases introduced by this new level of complexity put upon a
> system that has just achieved a hard-won stability.
>

Stack, the model is that the replicas (HRegionInfo with an added field
'replicaId') are treated just as any other region in the AM. You can
see the code - it's not adding much at all in terms of new code to
handle replicas.
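
Conceptually the expansion is as simple as the sketch below: derive the
extra HRegionInfos from the primary and hand all of them to the AM as
ordinary regions. Illustrative only, and it assumes the branch's
replicaId-aware HRegionInfo constructor.

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.HRegionInfo;

public class ReplicaExpansion {
  // Expand each primary region into regionReplication HRegionInfo
  // objects that differ only in replicaId; the AM then assigns every
  // entry as if it were an ordinary region.
  static List<HRegionInfo> expand(List<HRegionInfo> primaries,
      int regionReplication) {
    List<HRegionInfo> all = new ArrayList<HRegionInfo>();
    for (HRegionInfo primary : primaries) {
      all.add(primary); // replicaId 0 is the primary itself
      for (int id = 1; id < regionReplication; id++) {
        // Same table, key range and regionId; only replicaId differs.
        all.add(new HRegionInfo(primary.getTable(),
            primary.getStartKey(), primary.getEndKey(),
            primary.isSplit(), primary.getRegionId(), id));
      }
    }
    return all;
  }
}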

> A few of us chatting offline -- Jimmy, Jon, Elliott, and I -- were
> wondering if you couldn't solve this read replicas in a more hbase 'native'
> way* by just bringing up three tables -- a main table and then two snapshot
> clones with the clones refreshed on a period (via snapshot or via
> in-cluster replication) --  and then a shim on top of an HBase client would
> read from the main table until failure and then from a snapshot until the
> main came back.  Reads from snapshot tables could be marked 'stale'.  You'd
> have to modify the balancer so the tables -- or at least their regions --
> were physically distinct... you might be able just have the three tables
> each in a different namespace.
>

At a high level, considering all the work that would be needed in the
client (for it to be able to be aware of the primary and the snapshot
regions) and in the master (to do with managing the placements of the
regions), I am not convinced. Also, consider that you will be taking a
lot of snapshots and adding to the filesystem's load for the file
creations.

> Or how much more work would it take to follow the route our Facebook
> brothers and sisters have taken doing quorum reads and writes incluster?
>

If you are talking about Facebook's work that is talked about in
HBASE-7509, the quorum reads is something that we will benefit from,
and that will help the filesystem side of the story, but we still need
multiple (redundant) regions for the hbase side. If a region is not
reachable, the client could go to another replica for the region...

> * When I say 'native' way in the above, what I mean by this is that HBase
> has always been about giving clients a 'consistent' view -- at least when
> the query is to the source cluster.  Introducing talk and APIs that talk of
> 'eventual consistency' muddies our story.
>
>

As we have discussed in the jira, there are use cases. And it's
optional - all the APIs provide 'consistency' by default (status quo).

>
>> These are some of the remaining things that we are currently working on:
>>  - RPC failover support for multi-gets
>>  - RPC failover support for scans
>>  - RPC cancellation
>>
>
> This all sounds great.  I was sort of hoping we wouldn't have to do stuff
> like cancellation ourselves though.  Was hoping we could take on an already
> done 'rpc' engine that did this kind of stuff for us.
>
> ...
>
>
>
>> Development side:
>> As discussed in the issue design doc
>>
>> https://issues.apache.org/jira/secure/attachment/12616659/HighAvailabilityDesignforreadsApachedoc.pdf
>> "Apache
>> code development process" section, at this time we would like to
>> propose:
>>  (1) Creation of HBASE-10070 branch in svn which will be a fork of trunk as
>> of the date branch is created. All of the target authors (me, Devaraj,
>> Nicolas, Sergey) are already committers. I do not remember whether our
>> bylaws require votes on creating branches.
>>
>
> We don't have bylaws.  It is my understanding that any committer can freely
> make branches and I see nothing wrong w/ this.
>
>

Great

>
>>  (2) The branch will only contain commits that have been reviewed and +1'ed
>> from 2 other committers other than the patch author. Every commit in this
> >> branch will have a single patch (maybe with unforeseen addendums) and an
>> associated jira which is a subtask of HBASE-10070.
>>
>
> OK.
>
>
>>  (3) We will use the branch HBASE-10070 hosted at my github repo
>> https://github.com/enis/hbase/tree/hbase-10070 as a working branch with
>> semi-dirty history and "this branch might eat your hard drive" guarantees.
>>  (4) All code contributions / review will be welcome as always. I can give
>> you push perms to the github branch if you are interested in contributing.
>>  (5) Once we have HBASE-10070 Phase 1 tasks done (as described in the doc),
>> we will put up a VOTE to merge the branch in. We will require 3 +1's for
>> the merge in. If we can get early reviews the merge vote will be much less
> >> painful since the branch will be in a clean state and there have been reviews
>> per patch. We might need a final rebase, but that should not cause major
>> work I imagine.
>>
>> We are hoping this will be a nice way to develop and deliver the feature to
>> the trunk, but as always all suggestions, comments welcome.
>>
>
> All above sounds good.  Let me go look at what is there in HBASE-10070.

Thanks, please have a look at the code. It's not scary at all :-)

>
> St.Ack


Re: [PROPOSAL] HBASE-10070 branch

Posted by Andrew Purtell <ap...@apache.org>.
On Wed, Jan 15, 2014 at 11:17 AM, Stack <st...@duboce.net> wrote:

> A few of us chatting offline -- Jimmy, Jon, Elliott, and I -- were
> wondering if you couldn't solve this read replicas in a more hbase 'native'
> way* by just bringing up three tables -- a main table and then two snapshot
> clones with the clones refreshed on a period (via snapshot or via
> in-cluster replication) --  and then a shim on top of an HBase client would
> read from the main table until failure and then from a snapshot until the
> main came back.  Reads from snapshot tables could be marked 'stale'.  You'd
> have to modify the balancer so the tables -- or at least their regions --
> were physically distinct... you might be able just have the three tables
> each in a different namespace.
>

I like the idea of building on what we have today without introducing a lot
of new code and/or making existing code twisty.



-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: [PROPOSAL] HBASE-10070 branch

Posted by Stack <st...@duboce.net>.
On Wed, Jan 15, 2014 at 12:44 AM, Enis Söztutar <en...@hortonworks.com>wrote:

> Hi,
>
> I just wanted to give some updates on the HBASE-10070 efforts from the
> technical side, and development side, and propose a branch.
>
> From the technical side:
> The changes for region replicas phase 1 are becoming more mature and
> stable, and most of the "base" changes are starting to become good
> candidates for review. The code has been rebased to trunk, and the main
> working repo has been moved to the HBASE-10070 branch at
> https://github.com/enis/hbase/tree/hbase-10070.
>
> An overview of the changes that are working includes:
>  - HRegionInfo & MetaReader & MetaEditor changes for supporting region
> replicas
>  - HTableDescriptor changes and shell changes for supporting
> REGION_REPLICATION
>  - WebUI changes to display whether a region is a replica or not
>  - AssignmentManager changes coupled with RegionStates & Master changes to
> create and assign replicas, alter table, enable table, etc support.
>


Thanks for the writeup.

I am late to the game so take my comments w/ a grain of salt -- I'll take a
look at HBASE-10070 -- but high-level do we have to go the read replicas
route?  IMO, having our current already-strained AssignmentManager code
base manage three replicas instead of one will ensure that Jimmy Xiang and
Jeffrey Zhong do nothing else for the next year or two but work on the new
interesting use cases introduced by this new level of complexity put upon a
system that has just achieved a hard-won stability.

A few of us chatting offline -- Jimmy, Jon, Elliott, and I -- were
wondering if you couldn't solve this read replicas problem in a more hbase
'native' way* by just bringing up three tables -- a main table and then two
snapshot clones with the clones refreshed on a period (via snapshot or via
in-cluster replication) -- and then a shim on top of an HBase client would
read from the main table until failure and then from a snapshot until the
main came back.  Reads from snapshot tables could be marked 'stale'.  You'd
have to modify the balancer so the tables -- or at least their regions --
were physically distinct... you might be able to just have the three tables
each in a different namespace.
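
Sketching the shim idea in code, with hypothetical names (a single
'_snapshot' fallback table here just for illustration):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;

// Read the main table; on failure fall back to a periodically
// refreshed snapshot clone and surface the result as possibly stale.
public class SnapshotFallbackReader {
  private final HTable main;
  private final HTable clone; // e.g. "t1_snapshot"

  public SnapshotFallbackReader(Configuration conf, String table)
      throws IOException {
    this.main = new HTable(conf, table);
    this.clone = new HTable(conf, table + "_snapshot");
  }

  /** Returns the result plus a staleness flag the caller must respect. */
  public StaleResult get(Get get) throws IOException {
    try {
      return new StaleResult(main.get(get), false);
    } catch (IOException primaryDown) {
      return new StaleResult(clone.get(get), true); // stale read
    }
  }

  public static class StaleResult {
    public final Result result;
    public final boolean stale;
    StaleResult(Result result, boolean stale) {
      this.result = result;
      this.stale = stale;
    }
  }
}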

Or how much more work would it take to follow the route our Facebook
brothers and sisters have taken doing quorum reads and writes in-cluster?

* When I say 'native' way in the above, what I mean by this is that HBase
has always been about giving clients a 'consistent' view -- at least when
the query is to the source cluster.  Introducing talk and APIs that talk of
'eventual consistency' muddies our story.



> These are some of the remaining things that we are currently working on:
>  - RPC failover support for multi-gets
>  - RPC failover support for scans
>  - RPC cancellation
>

This all sounds great.  I was sort of hoping we wouldn't have to do stuff
like cancellation ourselves though.  Was hoping we could take on an already
done 'rpc' engine that did this kind of stuff for us.

...



> Development side:
> As discussed in the issue design doc
>
> https://issues.apache.org/jira/secure/attachment/12616659/HighAvailabilityDesignforreadsApachedoc.pdf
> "Apache
> code development process" section, at this time we would like to
> propose:
>  (1) Creation of HBASE-10070 branch in svn which will be a fork of trunk as
> of the date branch is created. All of the target authors (me, Devaraj,
> Nicolas, Sergey) are already committers. I do not remember whether our
> bylaws require votes on creating branches.
>

We don't have bylaws.  It is my understanding that any committer can freely
make branches and I see nothing wrong w/ this.



>  (2) The branch will only contain commits that have been reviewed and +1'ed
> from 2 other committers other than the patch author. Every commit in this
> branch will have a single patch (maybe with unforeseen addendums) and an
> associated jira which is a subtask of HBASE-10070.
>

OK.


>  (3) We will use the branch HBASE-10070 hosted at my github repo
> https://github.com/enis/hbase/tree/hbase-10070 as a working branch with
> semi-dirty history and "this branch might eat your hard drive" guarantees.
>  (4) All code contributions / review will be welcome as always. I can give
> you push perms to the github branch if you are interested in contributing.
>  (5) Once we have HBASE-10070 Phase 1 tasks done (as described in the doc),
> we will put up a VOTE to merge the branch in. We will require 3 +1's for
> the merge in. If we can get early reviews the merge vote will be much less
> painful since the branch will be in a clean state and there have been reviews
> per patch. We might need a final rebase, but that should not cause major
> work I imagine.
>
> We are hoping this will be a nice way to develop and deliver the feature to
> the trunk, but as always all suggestions, comments welcome.
>

All above sounds good.  Let me go look at what is there in HBASE-10070.

St.Ack