Posted to common-user@hadoop.apache.org by Thomas Koch <th...@koch.ro> on 2010/02/19 09:41:01 UTC

why not zookeeper for the namenode

Hi,

yesterday I read the documentation of zookeeper and the zk contrib bookkeeper.
From what I read, I thought that bookkeeper would be the ideal enhancement
for the namenode, to make it distributed and therefore finally highly available.
I then searched to see whether work in that direction had already started and
found that apparently a totally different approach has been chosen:
http://issues.apache.org/jira/browse/HADOOP-4539

Since I'm new to hadoop, I trust your decision. However, I'd be glad if
somebody could satisfy my curiosity:

- Why hasn't zookeeper(-bookkeeper) been chosen? Especially since it
  seems to do a similar job already in hbase.

- Isn't it the case that with HADOOP-4539 clients can only connect to one
  namenode at a time, leaving the burden of all reads and writes on that one
  node's shoulders?

- Wouldn't zookeeper be more network efficient? It requires only a
  majority of nodes to receive a change, while HADOOP-4539 seems to require
  all backup nodes to receive a change before it's persisted.
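
To make the comparison in the last point concrete, here is a minimal Python
sketch of how many acknowledgements each scheme waits for before a change
counts as persisted. It assumes a strict majority of floor(n/2) + 1 for the
zookeeper-style case and, following the reading of HADOOP-4539 given above,
every backup node for the other. The function names are made up for
illustration and are not part of either project.

```python
def majority_acks(ensemble_size: int) -> int:
    """Acks needed for a strict majority of an ensemble (hypothetical helper)."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def all_backup_acks(backup_count: int) -> int:
    """Acks needed when every backup node must confirm (hypothetical helper)."""
    if backup_count < 0:
        raise ValueError("backup_count must not be negative")
    return backup_count


if __name__ == "__main__":
    # Compare the two rules for a few node counts.
    for n in (1, 3, 5, 7):
        print(f"{n} nodes: majority waits for {majority_acks(n)}, "
              f"all-backups waits for {all_backup_acks(n)}")
```

The two rules only diverge noticeably as the node count grows; with very few
nodes they wait for roughly the same number of acknowledgements.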

Thanks for any explanation,

Thomas Koch, http://www.koch.ro

Re: why not zookeeper for the namenode

Posted by Todd Lipcon <to...@cloudera.com>.
On Fri, Feb 19, 2010 at 12:41 AM, Thomas Koch <th...@koch.ro> wrote:
> Hi,
>
> yesterday I read the documentation of zookeeper and the zk contrib bookkeeper.
> From what I read, I thought that bookkeeper would be the ideal enhancement
> for the namenode, to make it distributed and therefore finally highly available.
> I then searched to see whether work in that direction had already started and
> found that apparently a totally different approach has been chosen:
> http://issues.apache.org/jira/browse/HADOOP-4539
>
> Since I'm new to hadoop, I trust your decision. However, I'd be glad if
> somebody could satisfy my curiosity:
>

I didn't work on that particular design, but I'll do my best to answer
your questions below:

> - Why hasn't zookeeper(-bookkeeper) been chosen? Especially since it
>  seems to do a similar job already in hbase.
>

HBase does not use Bookkeeper, currently. Rather, it just uses ZK for
election and some small amount of metadata tracking. It therefore is
only storing a small amount of data in ZK, whereas the Hadoop NN would
have to store many GB worth of namesystem data. I don't think anyone
has tried putting such a large amount of data in ZK yet, and being the
first to do something is never without problems :)

Additionally, when this design was made, Bookkeeper was very new. It's
still in development, as I understand it.

> - Isn't it the case that with HADOOP-4539 clients can only connect to one
>  namenode at a time, leaving the burden of all reads and writes on that one
>  node's shoulders?
>

Yes.

> - Wouldn't zookeeper be more network efficient? It requires only a
>  majority of nodes to receive a change, while HADOOP-4539 seems to require
>  all backup nodes to receive a change before it's persisted.
>

Potentially. However, "all backup nodes" is usually just 1. In our
experience, and the experience of most other Hadoop deployments I've
spoken with, the primary factors decreasing NN availability are *not*
system crashes, but rather lack of online upgrade capability, slow
restart time for planned restarts, etc. Adding a hot standby can help
with the planned upgrade situation, but two standbys don't give you
much reliability above one. In a datacenter, the failure correlations
are generally such that racks either fail independently, or the entire
DC has lost power. So, there aren't a lot of cases where 3 NN replicas
would buy you much over 2.
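
A rough way to see this is a toy model in Python. It assumes each NN replica
is unavailable independently with probability p_node, and that a whole-DC
outage with probability p_dc takes down every replica at once. Both
probabilities and the function name are invented for illustration; they are
not measurements from any cluster.

```python
def unavailability(replicas: int, p_node: float, p_dc: float) -> float:
    """Probability that no replica is reachable under the toy model.

    The service is down if the whole DC is out, or if the DC is up but
    every replica has failed independently.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return p_dc + (1.0 - p_dc) * p_node ** replicas


if __name__ == "__main__":
    p_node, p_dc = 0.01, 0.001  # hypothetical values, chosen only to illustrate
    for k in (1, 2, 3):
        print(f"{k} replica(s): unavailability {unavailability(k, p_node, p_dc):.6f}")
```

Under these assumptions the step from one replica to two removes most of the
independent-failure term, while the step from two to three changes very
little, because the correlated DC-wide term p_dc is a floor that no number of
replicas in the same datacenter can get below.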

-Todd

> Thanks for any explanation,
>
> Thomas Koch, http://www.koch.ro
>

Re: why not zookeeper for the namenode

Posted by Steve Loughran <st...@apache.org>.
Eli Collins wrote:
>> From what I read, I thought that bookkeeper would be the ideal enhancement
>> for the namenode, to make it distributed and therefore finally highly available.
> 
> Being distributed doesn't imply high availability. Availability is
> about minimizing downtime. For example, a primary that can fail over
> to a secondary (and back) may be more available than a distributed
> system that needs to be restarted when its software or dependencies
> are upgraded. A distributed system that can only handle x ops/second
> may be less available than a non-distributed system that can handle 2x
> ops/second. A large distributed system may be less available than n
> smaller systems, depending on consistency requirements. Implementation
> and management complexity often result in more downtime. Etc.
> Decreasing the time it takes to restart and upgrade an HDFS cluster
> would significantly improve its availability for many users (there
> are jiras for these).
> 

There's another kind of availability: "engineering availability". What we
have today is nice in that the HDFS engineering skills are distributed among
a number of companies, the source is there for anyone else to learn from,
and rebuilding is fairly straightforward.

Don't dismiss that as unimportant. Engineering availability means that 
if you discover a new problem, you have the ability to patch your copy 
of the code, and keep going, while filing a bug report for others to 
deal with. That significantly reduces your downtime on a software
problem compared to filing a bug report and hoping that a future release
will have addressed it.

-steve


Re: why not zookeeper for the namenode

Posted by Eli Collins <el...@cloudera.com>.
> From what I read, I thought that bookkeeper would be the ideal enhancement
> for the namenode, to make it distributed and therefore finally highly available.

Being distributed doesn't imply high availability. Availability is
about minimizing downtime. For example, a primary that can fail over
to a secondary (and back) may be more available than a distributed
system that needs to be restarted when its software or dependencies
are upgraded. A distributed system that can only handle x ops/second
may be less available than a non-distributed system that can handle 2x
ops/second. A large distributed system may be less available than n
smaller systems, depending on consistency requirements. Implementation
and management complexity often result in more downtime. Etc.
Decreasing the time it takes to restart and upgrade an HDFS cluster
would significantly improve its availability for many users (there
are jiras for these).

> - Why hasn't zookeeper(-bookkeeper) been chosen?

Persisting NN metadata over a set of servers is only part of the
problem. You might be interested in checking out HDFS-976.

Thanks,
Eli