Posted to common-user@hadoop.apache.org by Miles Osborne <mi...@inf.ed.ac.uk> on 2008/02/28 11:43:31 UTC

Cross-data centre DFS communication?

Currently, we have the following setup:

--cluster A, running Nutch: small RAM per node

--cluster B, just running Hadoop:  lots of RAM per node

At some point in the future we will want cluster B to talk to cluster A, and
ideally this should be DFS-to-DFS.

Is this possible?  Or do we need to do something like:

Cluster A --> Unix filesystem --> Cluster B

via hadoop dfs -cat / -put operations, etc.?
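
i.e. roughly the following, where the hostnames, ports and paths are just
placeholders:

# staged through the local Unix filesystem, on a box that can reach both clusters
hadoop dfs -cat hdfs://namenodeA:8020/data/part-00000 > /tmp/part-00000
hadoop dfs -put /tmp/part-00000 hdfs://namenodeB:8020/data/part-00000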

Thanks

Miles

-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.

Re: Cross-data centre DFS communication?

Posted by Steve Sapovits <ss...@invitemedia.com>.
Owen O'Malley wrote:

> Sure, the info server on the name node of HDFS has a read-only interface 
> that lists directories in xml and allows the client to read files over 
> http. There is a FileSystem implementation that provides the client side 
> interface to the xml/http access.
> 
> To use it, you need a path with hftp as the protocol:
> hadoop distcp hftp://namenode1:50070/foo/bar hdfs://namenode2:8020/foo

Very useful.  Thanks.

-- 
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
ssapovits@invitemedia.com


Re: Cross-data centre DFS communication?

Posted by Owen O'Malley <oo...@yahoo-inc.com>.
On Feb 28, 2008, at 8:20 AM, Steve Sapovits wrote:

> Can you further explain the hftp part of this?  I'm not familiar  
> with that. We have a similar need to go cross-data center.

Sure, the info server on the name node of HDFS has a read-only  
interface that lists directories in xml and allows the client to read  
files over http. There is a FileSystem implementation that provides  
the client side interface to the xml/http access.

To use it, you need a path with hftp as the protocol:
hadoop distcp hftp://namenode1:50070/foo/bar hdfs://namenode2:8020/foo
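
The same hftp paths should also work with the ordinary fs commands, if you
just want to browse the remote cluster first (host and port here are
illustrative; 50070 is the default info server port):

# hftp is read-only, so -ls and -cat work, but -put does not
hadoop fs -ls hftp://namenode1:50070/foo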


> In an earlier post it
> was suggested that there was no map/reduce model for that so this
> sounds more like what we're looking for.

It isn't a good idea to run map/reduce jobs across clusters, so you  
usually need to copy the data locally.
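
So the usual pattern is a two-step, sketched here with made-up names:

# 1. pull the input over to the cluster that will run the job
#    (an hftp source also lets the clusters run different hadoop versions)
hadoop distcp hftp://remote-nn:50070/input hdfs://local-nn:8020/input
# 2. run the job against the local copy
hadoop jar my-job.jar /input /output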

-- Owen

Re: Cross-data centre DFS communication?

Posted by Steve Sapovits <ss...@invitemedia.com>.
Owen O'Malley wrote:

> To copy between clusters, there is a tool called distcp. Look at 
> "bin/hadoop distcp". It runs a map/reduce job that copies a group of 
> files. It can also be used to copy between versions of hadoop, if the 
> source file system is hftp, which uses xml to read hdfs.

Can you further explain the hftp part of this?  I'm not familiar with that.
We have a similar need to go cross-data center.  In an earlier post it
was suggested that there was no map/reduce model for that, so this
sounds more like what we're looking for.

-- 
Steve Sapovits
Invite Media  -  http://www.invitemedia.com
ssapovits@invitemedia.com

Re: Cross-data centre DFS communication?

Posted by Owen O'Malley <oo...@yahoo-inc.com>.
On Feb 28, 2008, at 2:43 AM, Miles Osborne wrote:

> Currently, we have the following setup:
>
> --cluster A, running Nutch: small RAM per node
>
> --cluster B, just running Hadoop:  lots of RAM per node
>
> At some point in the future we will want cluster B to talk to  
> cluster A, and
> ideally this should be DFS-to-DFS
>
> Is this possible?  Or do we need to do something like:
>
> Cluster A --> Unix filesystem --> Cluster B
>
> via hadoop dfs -cat / -put operations etc

To copy between clusters, there is a tool called distcp. Look at
"bin/hadoop distcp". It runs a map/reduce job that copies a group of
files. It can also be used to copy between versions of hadoop, if the  
source file system is hftp, which uses xml to read hdfs.
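
For the HDFS-to-HDFS case it is just the following (hostnames, ports and
paths are placeholders; 8020 is a common namenode port):

# both clusters should run the same hadoop version for a raw hdfs:// source;
# use an hftp:// source otherwise
bin/hadoop distcp hdfs://namenodeA:8020/user/nutch/crawl hdfs://namenodeB:8020/user/nutch/crawl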

-- Owen