Posted to common-user@hadoop.apache.org by Petru Dimulescu <pe...@gmail.com> on 2011/10/18 11:48:05 UTC

automatic node discovery

Hello,

I wonder how you guys see the problem of automatic node discovery: 
having, for instance, a couple of Hadoop instances, with no 
configuration explicitly set whatsoever, simply discover each other and 
work together, as GridGain does: just fire up two instances of the 
product, on the same machine or on different machines in the same LAN, 
and they will use multicast or something similar to discover each other 
and become part of a self-discovered topology.

Of course, if you have special network requirements you should be able 
to specify undiscoverable nodes by IP or name, but grids are often 
installed on LANs and it should really be simpler.

Namenodes are a bit different: they should run on safer machines. I'm 
basically talking about datanodes here, but I still wonder how hard it 
can be to have self-assigned namenodes, perhaps replicated 
automatically across several machines, unless one specific namenode is 
explicitly set via XML configuration.

Also, the passwordless ssh thing is so awkward. If you have a network 
of Hadoop nodes that mutually discover each other, there is really no 
need for this passwordless ssh requirement. This is more of a system 
administrator concern: if sysadmins want to automatically deploy or 
start a program on 5000 machines, they often have the tools and skills 
to do that; it should not be a requirement.

What do you people think about this?

Best
Petru

Re: automatic node discovery

Posted by Steve Loughran <st...@apache.org>.
On 18/10/11 10:48, Petru Dimulescu wrote:
> Hello,
>
> I wonder how you guys see the problem of automatic node discovery:
> having, for instance, a couple of Hadoop instances, with no
> configuration explicitly set whatsoever, simply discover each other and
> work together, as GridGain does: just fire up two instances of the
> product, on the same machine or on different machines in the same LAN,
> and they will use multicast or something similar to discover each other

You can use techniques like Bonjour to have Hadoop services register 
themselves in DNS and locate each other that way, but things only need 
to discover the NameNode (NN) and JobTracker (JT) and report in.
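
As a sketch of that (the "_hadoop-nn._tcp" service type and the 
instance name are made-up placeholders, not anything Hadoop registers 
itself), registration with the JmDNS library could look like:

   import java.net.InetAddress;
   import javax.jmdns.JmDNS;
   import javax.jmdns.ServiceInfo;

   public class AdvertiseNameNode {
     public static void main(String[] args) throws Exception {
       // Join the local mDNS (Bonjour) domain on this host's address.
       JmDNS jmdns = JmDNS.create(InetAddress.getLocalHost());
       // Advertise a hypothetical "_hadoop-nn._tcp" service on 8020,
       // the default NN RPC port, so workers can browse for it.
       ServiceInfo info = ServiceInfo.create(
           "_hadoop-nn._tcp.local.", "namenode-1", 8020, "NameNode RPC");
       jmdns.registerService(info);
       // Keep the JVM alive while the service is advertised.
       Thread.sleep(Long.MAX_VALUE);
     }
   }

A worker would then call jmdns.list("_hadoop-nn._tcp.local.") to find 
the NN instead of reading it from core-site.xml.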

> and become part of a
> self-discovered topology.


Topology inference is an interesting problem. Something purely for 
diagnostics could be useful.

> Of course, if you have special network requirements you should be able
> to specify undiscoverable nodes by IP or name, but grids are often
> installed on LANs and it should really be simpler.

In a production system I'd have a private switch and isolate things 
for bandwidth and security; this is why auto-configuration is generally 
neglected. If it were to be added, it would go via ZooKeeper, leaving 
only the ZooKeeper discovery problem. You can't rely on DNS or 
multicast IP here, as they don't always work in virtualised 
environments.
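
A sketch of that approach (the ensemble address, znode paths, and the 
address string are made up, and the /hadoop/datanodes parent is assumed 
to exist): each datanode announces itself under an ephemeral znode, 
which ZooKeeper removes automatically when the process dies, so 
watchers always see the live set:

   import org.apache.zookeeper.CreateMode;
   import org.apache.zookeeper.WatchedEvent;
   import org.apache.zookeeper.Watcher;
   import org.apache.zookeeper.ZooDefs;
   import org.apache.zookeeper.ZooKeeper;

   public class AnnounceDataNode {
     public static void main(String[] args) throws Exception {
       // The only fixed configuration left: the ZK ensemble address.
       ZooKeeper zk = new ZooKeeper("zk1.example.com:2181", 15000,
           new Watcher() {
             public void process(WatchedEvent event) { /* session events */ }
           });
       // Ephemeral + sequential: removed when this process dies, and
       // every datanode gets a unique child name under the parent.
       zk.create("/hadoop/datanodes/dn-",
           "10.0.0.42:50010".getBytes("UTF-8"),
           ZooDefs.Ids.OPEN_ACL_UNSAFE,
           CreateMode.EPHEMERAL_SEQUENTIAL);
       // A monitor (or the NN) watches /hadoop/datanodes for changes.
       Thread.sleep(Long.MAX_VALUE);
     }
   }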

>
> Namenodes are a bit different: they should run on safer machines. I'm
> basically talking about datanodes here, but I still wonder how hard it
> can be to have self-assigned namenodes, perhaps replicated
> automatically across several machines, unless one specific namenode is
> explicitly set via XML configuration.

I wouldn't touch dynamic namenodes; you really need fixed NNs and 
secondary NNs, and since automatic namenode replication isn't there, 
it's a non-issue.

With fixed NN and JT entries in the DNS table, anything can come up on 
the LAN and talk to them, unless you set up the master nodes with lists 
of the hosts you trust.
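
That trust list already exists: dfs.hosts in hdfs-site.xml names a file 
of hosts allowed to register as datanodes (mapred.hosts does the same 
for tasktrackers on the JT side). The path here is illustrative:

   <property>
     <name>dfs.hosts</name>
     <value>/etc/hadoop/conf/allowed-datanodes</value>
   </property>

A datanode whose hostname isn't in that file is refused when it 
reports in.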


> Also, the passwordless ssh thing is so awkward. If you have a network
> of Hadoop nodes that mutually discover each other, there is really no
> need for this passwordless ssh requirement. This is more of a system
> administrator concern: if sysadmins want to automatically deploy or
> start a program on 5000 machines, they often have the tools and skills
> to do that; it should not be a requirement.

It's not a requirement; there are other ways to deploy. Large clusters 
tend to use cluster management tooling that keeps the OS images 
consistent, or you can use more devops-centric tooling (including 
Apache Whirr) to roll things out.
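
The ssh loop in start-dfs.sh and start-mapred.sh is only a convenience 
wrapper; any tooling you already have can start each daemon locally on 
its own machine (a stock Hadoop layout is assumed):

   $HADOOP_HOME/bin/hadoop-daemon.sh start datanode
   $HADOOP_HOME/bin/hadoop-daemon.sh start tasktracker

so passwordless ssh is never strictly required.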