Posted to user@hbase.apache.org by Varun Sharma <va...@pinterest.com> on 2013/05/19 21:13:45 UTC

Questions about HBase replication

Hi,

I have a couple of questions about HBase replication...

1) When we ship edits to the slave cluster, do we retain the timestamps in the
edits? If we don't, I can imagine hitting some inconsistencies.

2) When a region server fails, the master renames the directory containing the
WAL(s). Does this impact reading of those logs for replication?

Thanks
Varun

Re: Questions about HBase replication

Posted by Varun Sharma <va...@pinterest.com>.
So, we have a separate thread doing the recovered logs. That is good to
know. I was mostly concerned about any potential races between the master
renaming the log files, doing the distributed log split, and doing a lease
recovery over the final file when the DN also dies. Apart from that, it
seemed to me that since the master has an authoritative view of the cluster,
it could do the log assignment better in the wake of failures (possibly also
using any block placements, etc.). However, I don't have data to show that
this is a must-have, but it looked like a somewhat cleaner solution and
would only require each region server to care about its own replication
znode.

One thing I am seeing in the region server logs, though, is that the deletion
from ZooKeeper is taking 30 minutes to an hour after the whole WAL is
replicated - there are two outstanding WAL(s) and only the newer one is
being replicated; only the znode of the newer WAL is getting updated while
the older WAL is just lying around. Still digging into it... (this is on
0.94.7)
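
For reference, a bare ZooKeeper client is enough to dump the queue znodes and
see which WALs are still listed. A rough sketch follows; the paths assume the
default zookeeper.znode.parent of /hbase and the 0.94 layout, and the quorum
address is just a placeholder:

import org.apache.zookeeper.ZooKeeper;

public class DumpReplicationQueues {
  public static void main(String[] args) throws Exception {
    // Placeholder quorum address; point this at the cluster's ZK ensemble.
    ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, event -> { });

    // Default 0.94 layout: /hbase/replication/rs/<regionserver>/<peerId>/<wal>
    String rsRoot = "/hbase/replication/rs";
    for (String server : zk.getChildren(rsRoot, false)) {
      for (String peer : zk.getChildren(rsRoot + "/" + server, false)) {
        for (String wal : zk.getChildren(rsRoot + "/" + server + "/" + peer, false)) {
          // Each child znode is a WAL still queued for that peer; its data
          // tracks how far replication has read into the file.
          System.out.println(server + "  peer=" + peer + "  wal=" + wal);
        }
      }
    }
    zk.close();
  }
}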

Thanks
Varun


On Mon, May 20, 2013 at 4:14 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:

> > Yes, but the region server now has 2X the number of WAL(s) to replicate
> and
> > could suffer higher replication lag as a result...
>
> In my experience this hasn't been an issue. Keep in mind that the RS
> will only replicate what's in the queue when it was recovered and
> nothing more. It means you have one more thread reading from a likely
> remote disk (low penalty), then it has to build its own set of edits
> to replicate (unless you are already severely CPU contended, that won't
> be an issue), then you send those edits to the other cluster (unless
> you are already filling that machine's pipe, it won't be an issue).
>
> Was there anything you were thinking about? You'd rather spread those
> logs to a bunch of machines?
>
> J-D
>

Re: Questions about HBase replication

Posted by Jean-Daniel Cryans <jd...@apache.org>.
> Yes, but the region server now has 2X the number of WAL(s) to replicate and
> could suffer higher replication lag as a result...

In my experience this hasn't been an issue. Keep in mind that the RS
will only replicate what's in the queue when it was recovered and
nothing more. It means you have one more thread reading from a likely
remote disk (low penalty), then it has to build its own set of edits
to replicate (unless you are already severely CPU contended, that won't
be an issue), then you send those edits to the other cluster (unless
you are already filling that machine's pipe, it won't be an issue).

Was there anything you were thinking about? You'd rather spread those
logs to a bunch of machines?

J-D

Re: Questions about HBase replication

Posted by Varun Sharma <va...@pinterest.com>.
On Mon, May 20, 2013 at 3:54 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:

> On Mon, May 20, 2013 at 3:48 PM, Varun Sharma <va...@pinterest.com> wrote:
> > Thanks JD for the response... I was just wondering if issues have ever
> been
> > seen with regards to moving over a large number of WAL(s) entirely from
> one
> > region server to another since that would double the replication related
> > load on the one server which takes over.
>
> We only move the znodes, no data is actually being re-written.
>

Yes, but the region server now has 2X the number of WAL(s) to replicate and
could suffer higher replication lag as a result...

> >
> > Another side question: After the WAL has been replicated - is it purged
> > immediately or soonish from the zookeeper ?
>
> The WAL's znode reference is deleted immediately. The actual WAL will
> be deleted according to the chain of log cleaners.
>
> J-D
>

Re: Questions about HBase replication

Posted by Jean-Daniel Cryans <jd...@apache.org>.
On Mon, May 20, 2013 at 3:48 PM, Varun Sharma <va...@pinterest.com> wrote:
> Thanks JD for the response... I was just wondering if issues have ever been
> seen with regards to moving over a large number of WAL(s) entirely from one
> region server to another since that would double the replication related
> load on the one server which takes over.

We only move the znodes, no data is actually being re-written.

>
> Another side question: After the WAL has been replicated - is it purged
> immediately or soonish from the zookeeper ?

The WAL's znode reference is deleted immediately. The actual WAL will
be deleted according to the chain of log cleaners.
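
The chain works like a series of veto holders: the master offers each old WAL
to every cleaner delegate and only deletes the file once all of them agree,
and the replication-aware delegate keeps saying no while any peer still has
the file queued. A toy sketch of that idea (the interface here is invented
for illustration, not HBase's actual cleaner API):

import java.util.Arrays;
import java.util.List;

public class LogCleanerChainSketch {
  /** Invented delegate interface; HBase's real cleaner classes differ in detail. */
  interface CleanerDelegate {
    boolean isDeletable(String walName);
  }

  public static void main(String[] args) {
    // Stand-ins for a TTL-based cleaner and a replication-aware cleaner.
    CleanerDelegate ttlCleaner = wal -> !wal.endsWith("-recent");
    CleanerDelegate replicationCleaner = wal -> !wal.equals("wal-0002"); // pretend this one is still queued
    List<CleanerDelegate> chain = Arrays.asList(ttlCleaner, replicationCleaner);

    for (String wal : new String[] {"wal-0001", "wal-0002", "wal-0003-recent"}) {
      // A WAL is only removed from the archive once every delegate approves.
      boolean deletable = chain.stream().allMatch(c -> c.isDeletable(wal));
      System.out.println(wal + " -> deletable=" + deletable);
    }
  }
}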

J-D

Re: Questions about HBase replication

Posted by Varun Sharma <va...@pinterest.com>.
Thanks JD for the response... I was just wondering if issues have ever been
seen with regard to moving over a large number of WAL(s) entirely from one
region server to another, since that would double the replication-related
load on the one server which takes over.

Another side question: after the WAL has been replicated, is it purged from
ZooKeeper immediately or just soonish?

Thanks
Varun


On Mon, May 20, 2013 at 9:57 AM, Jean-Daniel Cryans <jd...@apache.org> wrote:

> On Mon, May 20, 2013 at 12:35 AM, Varun Sharma <va...@pinterest.com>
> wrote:
> > Hi Lars,
> >
> > Thanks for the response.
> >
> > Regarding #2 again, so if RS1 failed, then the following happens...
> > 1) RS2 takes over its logs...
> > 2) Master renames the log containing directory to have a -splitting in
> the
> > path
> > 3) Does RS2 already know about the "-splitting" path ?
>
> It will look at all the possible locations. See
> ReplicationSource.openReader
>
> >
> > Also on a related note, was there a reason that we have all region
> servers
> > watching all other region server's queue of logs. Otherwise, couldn't the
> > master have done the reassignment of outstanding logs to other region
> > servers more fairly upon failure ?
>
> I think I did it like that because it was easier since the region
> server has to be told to grab the queue(s) anyway.
>
> >
> > Thanks
> > Varun
> >
> >
> > On Sun, May 19, 2013 at 8:49 PM, lars hofhansl <la...@apache.org> wrote:
> >
> >> #1 yes
> >> #2 no
> >>
> >> :)
> >>
> >> Now, there are scenarios where inconsistencies can happen. The edits are
> >> not necessarily shipped in order when there are failures.
> >> So it is possible to have some Puts at T1 and some Deletes at T2 (T1 <
> >> T2), and end up with the deletes shipped first.
> >> Now imagine a compaction happens at the slave after the Deletes are
> >> shipped to the slave, but before the Puts are shipped... The Puts will
> >> reappear.
> >>
> >> -- Lars
> >>
> >>
> >>
> >> ________________________________
> >>  From: Varun Sharma <va...@pinterest.com>
> >> To: user@hbase.apache.org
> >> Sent: Sunday, May 19, 2013 12:13 PM
> >> Subject: Questions about HBase replication
> >>
> >>
> >> Hi,
> >>
> >> I have a couple of questions about HBase replication...
> >>
> >> 1) When we ship edits to slave cluster - do we retain the timestamps in
> the
> >> edits - if we don't, I can imagine hitting some inconsistencies ?
> >>
> >> 2) When a region server fails, the master renames the directory
> containing
> >> WAL(s). Does this impact reading of those logs for replication ?
> >>
> >> Thanks
> >> Varun
> >>
>

Re: Questions about HBase replication

Posted by Jean-Daniel Cryans <jd...@apache.org>.
On Mon, May 20, 2013 at 12:35 AM, Varun Sharma <va...@pinterest.com> wrote:
> Hi Lars,
>
> Thanks for the response.
>
> Regarding #2 again, so if RS1 failed, then the following happens...
> 1) RS2 takes over its logs...
> 2) Master renames the log containing directory to have a -splitting in the
> path
> 3) Does RS2 already know about the "-splitting" path ?

It will look at all the possible locations. See ReplicationSource.openReader
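
For reference, a rough sketch of what "all the possible locations" amounts to
(this is not the actual ReplicationSource.openReader code; the directory names
follow the 0.94 layout and the helper is made up):

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WalLocatorSketch {
  /**
   * Probe the places a WAL can live once its region server is gone: the dead
   * server's .logs directory, that same directory after the master renamed it
   * with the -splitting suffix, and the .oldlogs archive. Returns the first
   * path that exists, or null if the log cannot be found yet.
   */
  public static Path findWal(FileSystem fs, Path rootDir, String serverName,
                             String walName) throws IOException {
    Path[] candidates = new Path[] {
        new Path(rootDir, ".logs/" + serverName + "/" + walName),
        new Path(rootDir, ".logs/" + serverName + "-splitting/" + walName),
        new Path(rootDir, ".oldlogs/" + walName),
    };
    for (Path candidate : candidates) {
      if (fs.exists(candidate)) {
        return candidate;
      }
    }
    return null;
  }
}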

>
> Also on a related note, was there a reason that we have all region servers
> watching all other region server's queue of logs. Otherwise, couldn't the
> master have done the reassignment of outstanding logs to other region
> servers more fairly upon failure ?

I think I did it like that because it was easier since the region
server has to be told to grab the queue(s) anyway.
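
For anyone reading along, the hand-off itself is a race between the surviving
region servers: each one notices the dead server's znode go away and tries to
claim its queues, typically by creating a lock znode first and backing off if
someone beat it to it. A conceptual sketch (not the actual ReplicationZookeeper
code; the znode name is illustrative):

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ClaimQueuesSketch {
  /** Try to become the region server that takes over the dead RS's queues. */
  static boolean tryClaim(ZooKeeper zk, String deadServerQueueZnode) throws Exception {
    try {
      // Whoever creates the lock znode first wins; everyone else backs off.
      zk.create(deadServerQueueZnode + "/lock", new byte[0],
          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
      return true;   // winner: copy the queues under our own znode, then delete the originals
    } catch (KeeperException.NodeExistsException e) {
      return false;  // another region server already claimed these queues
    }
  }
}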

>
> Thanks
> Varun
>
>
> On Sun, May 19, 2013 at 8:49 PM, lars hofhansl <la...@apache.org> wrote:
>
>> #1 yes
>> #2 no
>>
>> :)
>>
>> Now, there are scenarios where inconsistencies can happen. The edits are
>> not necessarily shipped in order when there are failures.
>> So it is possible to have some Puts at T1 and some Deletes at T2 (T1 <
>> T2), and end up with the deletes shipped first.
>> Now imagine a compaction happens at the slave after the Deletes are
>> shipped to the slave, but before the Puts are shipped... The Puts will
>> reappear.
>>
>> -- Lars
>>
>>
>>
>> ________________________________
>>  From: Varun Sharma <va...@pinterest.com>
>> To: user@hbase.apache.org
>> Sent: Sunday, May 19, 2013 12:13 PM
>> Subject: Questions about HBase replication
>>
>>
>> Hi,
>>
>> I have a couple of questions about HBase replication...
>>
>> 1) When we ship edits to slave cluster - do we retain the timestamps in the
>> edits - if we don't, I can imagine hitting some inconsistencies ?
>>
>> 2) When a region server fails, the master renames the directory containing
>> WAL(s). Does this impact reading of those logs for replication ?
>>
>> Thanks
>> Varun
>>

Re: Questions about HBase replication

Posted by Varun Sharma <va...@pinterest.com>.
Hi Lars,

Thanks for the response.

Regarding #2 again, so if RS1 failed, then the following happens...
1) RS2 takes over its logs...
2) The master renames the directory containing the logs to have -splitting in
the path
3) Does RS2 already know about the "-splitting" path?

Also, on a related note, was there a reason that we have all region servers
watching all other region servers' queues of logs? Otherwise, couldn't the
master have done the reassignment of outstanding logs to other region
servers more fairly upon failure?

Thanks
Varun


On Sun, May 19, 2013 at 8:49 PM, lars hofhansl <la...@apache.org> wrote:

> #1 yes
> #2 no
>
> :)
>
> Now, there are scenarios where inconsistencies can happen. The edits are
> not necessarily shipped in order when there are failures.
> So it is possible to have some Puts at T1 and some Deletes at T2 (T1 <
> T2), and end up with the deletes shipped first.
> Now imagine a compaction happens at the slave after the Deletes are
> shipped to the slave, but before the Puts are shipped... The Puts will
> reappear.
>
> -- Lars
>
>
>
> ________________________________
>  From: Varun Sharma <va...@pinterest.com>
> To: user@hbase.apache.org
> Sent: Sunday, May 19, 2013 12:13 PM
> Subject: Questions about HBase replication
>
>
> Hi,
>
> I have a couple of questions about HBase replication...
>
> 1) When we ship edits to slave cluster - do we retain the timestamps in the
> edits - if we don't, I can imagine hitting some inconsistencies ?
>
> 2) When a region server fails, the master renames the directory containing
> WAL(s). Does this impact reading of those logs for replication ?
>
> Thanks
> Varun
>

Re: Questions about HBase replication

Posted by lars hofhansl <la...@apache.org>.
#1 yes
#2 no

:)

Now, there are scenarios where inconsistencies can happen. The edits are not necessarily shipped in order when there are failures.
So it is possible to have some Puts at T1 and some Deletes at T2 (T1 < T2), and end up with the Deletes shipped first.
Now imagine a compaction happens on the slave after the Deletes are shipped, but before the Puts are shipped... The Puts will reappear.
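
To make the sequence concrete, a minimal sketch against the 0.94 client API
(the table, family and values are made up, and it of course needs a running
cluster to actually execute):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ReorderedEditsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "demo_table");   // made-up table name

    long t1 = 1000L;
    long t2 = 2000L;

    // On the master cluster the client writes, in order:
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), t1, Bytes.toBytes("v1")); // Put at T1
    table.put(put);

    Delete del = new Delete(Bytes.toBytes("row1"));
    del.deleteColumns(Bytes.toBytes("f"), Bytes.toBytes("q"), t2);            // Delete at T2
    table.delete(del);
    table.close();

    // Replication ships the edits with their original timestamps, but after a
    // failure the Delete (T2) can reach the slave before the Put (T1). If the
    // slave major-compacts in between, the delete marker is collected, and the
    // late-arriving Put at T1 becomes visible again on the slave.
  }
}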

-- Lars



________________________________
 From: Varun Sharma <va...@pinterest.com>
To: user@hbase.apache.org 
Sent: Sunday, May 19, 2013 12:13 PM
Subject: Questions about HBase replication
 

Hi,

I have a couple of questions about HBase replication...

1) When we ship edits to slave cluster - do we retain the timestamps in the
edits - if we don't, I can imagine hitting some inconsistencies ?

2) When a region server fails, the master renames the directory containing
WAL(s). Does this impact reading of those logs for replication ?

Thanks
Varun