Posted to dev@accumulo.apache.org by Eric Newton <er...@gmail.com> on 2014/03/14 20:18:04 UTC

HA namenode questions

For those of you running HA NN on large clusters, I'm looking for some
advice.

I was looking at an HA NN config today.  Either by default, or by following
the configuration instructions, I saw that the zookeeper timeout was set to
5 seconds.

* is this a reasonable timeout?
* do you provide HA NN its own set of zookeepers?

We have seen problems with large GC pauses with tablet servers.  This
happens less and less as we have learned more tricks, but I'm constantly
talking to users who want their zookeeper timeout as high as two minutes.

We have also had to increase the number of zookeepers on our largest
clusters in order to handle the "thundering herd" load when large
map/reduce jobs kick off and they all start talking to accumulo, which
requires reading information from zookeeper.

Any experience you can share about HA NN configuration at scales over a few
hundred nodes would be appreciated.

-Eric

Re: HA namenode questions

Posted by Eric Newton <er...@gmail.com>.
Thanks Mike (and Todd), that clears things up.  I was not aware that the
zookeeper locks were held by a separate process (ZKFC).

-Eric




Fwd: HA namenode questions

Posted by Mike Drob <ma...@cloudera.com>.
Replies from Todd Lipcon in-line.

Mike

---------- Forwarded message ----------
From: Todd Lipcon <to...@cloudera.com>
Date: Fri, Mar 14, 2014 at 4:14 PM
Subject: Re: HA namenode questions


I'm not on the upstream dev@accumulo list anymore, but here's an answer. Feel
free to forward it on to the public list (I've known Eric for a while).


---------- Forwarded message ----------
> From: Eric Newton <er...@gmail.com>
> Date: Fri, Mar 14, 2014 at 3:18 PM
> Subject: HA namenode questions
> To: dev@accumulo.apache.org
>
>
> For those of you running HA NN on large clusters, I'm looking for some
> advice.
>
> I was looking at an HA NN config today.  Either by default, or by following
> the configuration instructions, I saw that the zookeeper timeout was set to
> 5 seconds.
>
> * is this a reasonable timeout?
>
>
Yes. This timeout is used only by the ZKFC process, which is a very
lightweight process whose _only_ jobs are to (a) ping ZK, and (b) ping the
NN to check its health. It uses on the order of a few MB of heap, so it
should never see a meaningful GC pause. If it goes away for longer than
5 seconds, something is almost certainly wrong with the machine or network.

That said, if you would rather ride out a longer network blip (e.g. a switch
reboot), you could choose to make it longer.
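
For anyone who wants to try a longer value, here is a minimal sketch of what the setting might look like. It assumes the timeout in question is the `ha.zookeeper.session-timeout.ms` property in core-site.xml; the 30000 ms value and the helper name `hadoop_property` are illustrative assumptions, not recommendations from this thread.

```python
import xml.etree.ElementTree as ET


def hadoop_property(name: str, value: str) -> str:
    """Render one Hadoop-style <property> element as an XML string."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")


# 5000 ms is the value reported at the top of this thread.
# 30000 ms is a hypothetical longer value for riding out a switch reboot.
REPORTED_MS = 5000
LONGER_MS = 30000

if __name__ == "__main__":
    # Paste the printed element inside <configuration> in core-site.xml.
    print(hadoop_property("ha.zookeeper.session-timeout.ms", str(LONGER_MS)))
```

Keep in mind that the ZooKeeper servers clamp the negotiated session timeout to their own minimum and maximum (by default 2x and 20x tickTime), so a very large value may not take effect unless the server side is adjusted as well.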


>  * do you provide HA NN its own set of zookeepers?
>
>
So long as the ZKs aren't ridiculously overloaded, sharing should be fine.
If you have a lot of untamed clients hitting some other ZK cluster, it's
probably best from an isolation perspective to run a separate ensemble for
HA purposes. That said, the ZK daemons could be colocated on, for example,
the NNs and the JT, so long as they get dedicated spindles.


>  We have seen problems with large GC pauses with tablet servers.  This
> happens less and less as we have learned more tricks, but I'm constantly
> talking to users who want their zookeeper timeout as high as two minutes.
>
>
Yeah, the ZKFC uses almost no heap, so GC pauses are not a concern there.


>  We have also had to increase the number of zookeepers on our largest
> clusters in order to handle the "thundering herd" load when large
> map/reduce jobs kick off and they all start talking to accumulo, which
> requires reading information from zookeeper.
>
>
Clients in HDFS HA today never talk to ZK, so the number of nodes accessing
ZK is limited to just the two NNs.

>  Any experience you can share about HA NN configuration at scales over a few
> hundred nodes would be appreciated.
>
>
The ZK interaction should have no dependence on cluster size. The timeout
for how long a NN is expected to take to become active can depend on the
number of blocks in the cluster, but you should be able to see that by
doing some "practice failovers". We're working on making the
transitionToActive process quicker and closer to constant-time, rather than
dependent on initializing block replication queues inline with the failover.
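
One rough way to run such a practice failover and time it is sketched below. It assumes the `hdfs haadmin -failover` and `hdfs haadmin -getServiceState` commands are available on the host. The NameNode IDs `nn1` and `nn2`, the helper names, and the polling parameters are placeholders for illustration and would need to match your own setup.

```python
import subprocess
import time
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], str]


def run_cli(cmd: Sequence[str]) -> str:
    """Run a command and return its stripped stdout; raises on non-zero exit."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.strip()


def time_failover(from_nn: str, to_nn: str, run: Runner = run_cli,
                  poll_seconds: float = 1.0, max_polls: int = 600) -> float:
    """Fail over from_nn -> to_nn; return seconds until to_nn reports active."""
    start = time.monotonic()
    run(["hdfs", "haadmin", "-failover", from_nn, to_nn])
    # The failover command may already block until done; polling is a safeguard.
    for _ in range(max_polls):
        if run(["hdfs", "haadmin", "-getServiceState", to_nn]) == "active":
            return time.monotonic() - start
        time.sleep(poll_seconds)
    raise TimeoutError(f"{to_nn} did not become active after {max_polls} polls")


# Example (placeholder NameNode IDs, run on a test cluster first):
#   elapsed = time_failover("nn1", "nn2")
#   print(f"failover took {elapsed:.1f}s")
```

Repeating this a few times as the block count grows should show whether the time to become active is drifting toward whatever timeout you have configured.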

Re: HA namenode questions

Posted by Mike Drob <ma...@cloudera.com>.
Specifically, for dealing with a large number of clients, you can use
ZooKeeper Observers.
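
As a rough illustration of what adding Observers involves, the sketch below generates the `server.N` lines of a zoo.cfg in which extra hosts join as non-voting observers. The hostnames, the helper name `observer_server_lines`, and the 2888/3888 ports are assumptions for the example. Each observer's own config additionally needs `peerType=observer`.

```python
from typing import List


def observer_server_lines(voters: List[str], observers: List[str]) -> str:
    """Build zoo.cfg server lines: voters first, then observers.

    Observer entries carry the ':observer' suffix so that every server in the
    ensemble knows those peers do not take part in voting.
    """
    lines = []
    server_id = 1
    for host in voters:
        lines.append(f"server.{server_id}={host}:2888:3888")
        server_id += 1
    for host in observers:
        lines.append(f"server.{server_id}={host}:2888:3888:observer")
        server_id += 1
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical hosts: three voting servers plus two observers.
    print(observer_server_lines(["zk1", "zk2", "zk3"], ["obs1", "obs2"]))
    # On obs1 and obs2 only, zoo.cfg also needs: peerType=observer
```

Because observers serve reads without voting, adding them should increase read capacity for a large client population without making writes more expensive for the voting quorum.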
