Posted to common-user@hadoop.apache.org by Lukasz Szybalski <sz...@gmail.com> on 2009/05/27 22:11:40 UTC

hdfs on public internet/wan

Hello,
I wanted to set up HDFS as a public file system where, aside from a
few core computers running the masters, you would have some number of
data nodes/computers located across the internet.

How do I set up the master servers, and then 3-65+ slave servers,
where each server can join or leave at any time?
How would I control how slave servers are added? Assuming they would
give me their IP and available disk size, what would I need to
provide them with in return...?
Should the SSH account that is used be created in some special way?
No shell access, or some other restrictions (a forced command?)
Are there any specific differences that should be accounted for in
this "public" version of a Hadoop cluster?


Let me know.

Thanks,
Lucas

Re: hdfs on public internet/wan

Posted by Alex Loddengaard <al...@cloudera.com>.
It sounds like HDFS probably isn't the right fit for this.  When new
nodes add themselves to the cluster, the administrator needs to rebalance
the cluster in order for the new nodes to receive a share of the existing
data.  Without rebalancing, new data will be stored on those new nodes,
but old data will not be redistributed to them.
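
For example, after bringing a new datanode online, you would kick off
the balancer by hand (these are the 0.20-era script/command names;
check your version):

    # spread blocks around until every datanode is within 10% of
    # the cluster's average utilization
    bin/start-balancer.sh -threshold 10

    # or run it in the foreground instead:
    bin/hadoop balancer -threshold 10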

In the case where a node leaves the cluster for 10 minutes, the master
will start re-replicating the blocks that were on that node onto other
nodes in the cluster.  The point is that -- though HDFS can handle nodes
dying and new nodes being added -- it's not designed for this to happen
all the time.
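
To make that concrete: each block is kept at dfs.replication copies
(3 by default), and the namenode starts re-replicating once a datanode
has missed heartbeats for long enough to be declared dead (derived from
the heartbeat settings; roughly 10 minutes with the defaults).  A minimal
hdfs-site.xml snippet, using the 0.18/0.20-era property names:

    <property>
      <name>dfs.replication</name>
      <value>3</value>
    </property>
    <property>
      <!-- heartbeat period in seconds; the dead-node timeout is
           computed from this and the recheck interval -->
      <name>dfs.heartbeat.interval</name>
      <value>3</value>
    </property>

With datanodes joining and leaving all the time, this re-replication
would be running more or less constantly, which is exactly the problem.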

Similarly, HDFS doesn't have any built-in security: there is no
authentication, so anyone who can reach the ports can read and write
data.  You would have to configure your own firewall to limit access.
I imagine doing so would be really annoying when not all machines are
behind the same router.
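
As a rough sketch (the subnet here is made up, and the ports are the
0.20 defaults: the namenode RPC port from fs.default.name, 50010 for
datanode data transfer, 50020 for datanode IPC, 50070/50075 for the
web UIs), every machine would need something like:

    # allow only the cluster's own machines (hypothetical subnet)
    iptables -A INPUT -p tcp -s 10.0.0.0/24 -m multiport \
        --dports 8020,50010,50020,50070,50075 -j ACCEPT
    # drop everyone else
    iptables -A INPUT -p tcp -m multiport \
        --dports 8020,50010,50020,50070,50075 -j DROP

And since your datanodes would be scattered across the internet, there
is no single subnet to allow -- you'd be maintaining per-host rules as
machines come and go.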

So anyway, you may want to consider other file systems (perhaps there is
something P2P out there?) for what you're trying to do.

Hope this helps.

Alex
