Posted to user@hbase.apache.org by Neil Yalowitz <ne...@gmail.com> on 2012/09/13 22:18:14 UTC

replication - how do I know the status?

Hi all,

I'm using HBase replication between two clusters running CDH3u3 and I
recently noticed that a replicated column family was "lagging" by more than
a day... that is, it required more than 24 hours for a Put to replicate
from master to slave.  The root cause of the lag appears to be swapping and
other bad behavior.

The real question I have is this: how do I know the state of replication at
any given time?  Does a large amount of data in /hbase/.logs indicate that
replication is falling behind?  What about /hbase/.oldlogs which seems to
grow forever?  What red flags should I look for to tell me that there is a
problem with replication?


Neil Yalowitz
neilyalowitz@gmail.com

Re: replication - how do I know the status?

Posted by Jean-Daniel Cryans <jd...@apache.org>.
On Thu, Sep 13, 2012 at 2:28 PM, Neil Yalowitz <ne...@gmail.com> wrote:
> This is a great answer, I can see that particular ganglia metric sharply
> increased when the issue began.  Thanks much.

Nice!

>
> One followup question:
>
> Can a distressed slave cluster cause performance issues on the master
> cluster?  It appears our performance problem was occurring on the slave
> peer, but the master cluster almost crashed as well.  I'm trying to
> determine if that was a coincidence or something more...

That's a tougher one, but FWIW the work required on the master cluster
is low compared to what the slave has to do; the master just needs to
read a bunch of edits and send them whereas the slave has to write
them to the WAL, add them to the MemStore, eventually flush and
compact, etc.

Also, if a big MapReduce job ran on the master cluster and inserted
a lot of data, I would assume that it made everything slower. If that
is also what caused the swapping, it would explain a lot.

J-D

Re: replication - how do I know the status?

Posted by Neil Yalowitz <ne...@gmail.com>.
This is a great answer, I can see that particular ganglia metric sharply
increased when the issue began.  Thanks much.

One followup question:

Can a distressed slave cluster cause performance issues on the master
cluster?  It appears our performance problem was occurring on the slave
peer, but the master cluster almost crashed as well.  I'm trying to
determine if that was a coincidence or something more...


Neil


Re: replication - how do I know the status?

Posted by Jean-Daniel Cryans <jd...@apache.org>.
The best metric at the moment is hbase.replication.sizeOfLogQueue,
published through JMX. If you have Ganglia, OpenTSDB, or Cacti, you can
graph how many logs per server need to be replicated, which gives you
a good idea of how much data is still waiting to be replicated.

If it goes up to more than 2 per server for a few minutes, you know
that either replication is slowing down or someone is inserting a lot
of data.
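
As a rough illustration of the rule of thumb above, here is a minimal
Python sketch that flags region servers whose sizeOfLogQueue stayed
above 2 for a sustained period. It does not talk to JMX or Ganglia;
it assumes you have already collected (timestamp, server, queue size)
samples by some other means. The function name, the sample data, and
the 5-minute reading of "a few minutes" are all assumptions made for
the example, not part of HBase.

```python
def lagging_servers(samples, threshold=2, min_duration_s=300):
    """Return the servers whose log queue stayed above `threshold`
    for at least `min_duration_s` seconds without dropping back.

    `samples` is an iterable of (timestamp_s, server, queue_size)
    tuples, e.g. values of hbase.replication.sizeOfLogQueue that were
    scraped elsewhere. The 300-second default is an assumed reading
    of "a few minutes".
    """
    streak_start = {}  # server -> timestamp when the current streak began
    flagged = set()
    for ts, server, queue_size in sorted(samples):
        if queue_size > threshold:
            start = streak_start.setdefault(server, ts)
            if ts - start >= min_duration_s:
                flagged.add(server)
        else:
            # Queue drained back to a normal level; reset the streak.
            streak_start.pop(server, None)
    return sorted(flagged)


# Hypothetical samples, one per minute for two region servers.
samples = [(t, "rs1", 5) for t in range(0, 360, 60)]
samples += [(0, "rs2", 3), (60, "rs2", 1)]
samples += [(t, "rs2", 4) for t in range(120, 360, 60)]

print(lagging_servers(samples))
```

Here rs1 is above the threshold for the whole window, while rs2 dips
back to 1 and its later streak is shorter than five minutes, so only
rs1 is reported. A short spike alone is not treated as a red flag.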

J-D
