Posted to common-user@hadoop.apache.org by zenMonkey <nu...@gmail.com> on 2010/03/05 23:25:30 UTC

Copying files between two remote hadoop clusters

I want to write a script that pulls data (flat files) from a remote machine
and pushes that into its hadoop cluster.

At the moment, it is done in two steps:

1 - Secure copy the remote files
2 - Put the files into HDFS
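
Roughly, something like this today (the user, host, and paths are just placeholders):

scp user@datasource:/exports/logs/part-0001.txt /tmp/staging/
hadoop fs -put /tmp/staging/part-0001.txt /incoming/logs/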

I was wondering if it was possible to optimize this by avoiding the copy to
the local fs and instead writing directly to HDFS. I am not sure if this is
something that the Hadoop tools already provide.

Thanks for any help.

-- 
View this message in context: http://old.nabble.com/Copying-files-between-two-remote-hadoop-clusters-tp27799963p27799963.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


Re: Copying files between two remote hadoop clusters

Posted by zenMonkey <nu...@gmail.com>.
distcp seems to copy between clusters.

http://hadoop.apache.org/common/docs/current/distcp.html
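
For example, an inter-cluster copy would look something like this (the namenode addresses and paths are placeholders):

hadoop distcp hdfs://nn1.example.com:8020/foo/bar hdfs://nn2.example.com:8020/bar/foo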




zenMonkey wrote:
> 
> I want to write a script that pulls data (flat files) from a remote
> machine and pushes that into its hadoop cluster.
> 
> At the moment, it is done in two steps:
> 
> 1 - Secure copy the remote files
> 2 - Put the files into HDFS
> 
> I was wondering if it was possible to optimize this by avoiding the copy to
> the local fs and instead writing directly to HDFS. I am not sure if this is
> something that the Hadoop tools already provide.
> 
> Thanks for any help.
> 
> 

-- 
View this message in context: http://old.nabble.com/Copying-files-between-two-remote-hadoop-clusters-tp27799963p27813482.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


RE: Copying files between two remote hadoop clusters

Posted by zhuweimin <xi...@tsm.kddilabs.jp>.
Hi 

The HDFS shell's put command can read from standard input. I think you can use
that to avoid the temporary save to the local file system.

For example:
wget https://web-server/file-path -O - | hadoop fs -put - hdfs://nn.example.com/hadoop/hadoopfile

Refer to this URL
http://hadoop.apache.org/common/docs/current/hdfs_shell.html#put
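
If the source files sit on a machine you can reach over ssh rather than HTTP, the same pattern should work with a pipe (the user, host, and paths below are placeholders):

ssh user@datasource 'cat /exports/logs/part-0001.txt' | hadoop fs -put - hdfs://nn.example.com/hadoop/hadoopfile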

Hope that helps

zhuweimin

> -----Original Message-----
> From: zenMonkey [mailto:numan.salati@gmail.com]
> Sent: Sunday, March 07, 2010 4:23 AM
> To: hadoop-user@lucene.apache.org
> Subject: Copying files between two remote hadoop clusters
> 
> 
> I want to write a script that pulls data (flat files) from a remote machine
> and pushes that into its hadoop cluster.
> 
> At the moment, it is done in two steps:
> 
> 1 - Secure copy the remote files
> 2 - Put the files into HDFS
> 
> I was wondering if it was possible to optimize this by avoiding the copy to
> the local fs and instead writing directly to HDFS. I am not sure if this is
> something that the Hadoop tools already provide.
> 
> Thanks for any help.
> 
> --
> View this message in context:
> http://old.nabble.com/Copying-files-between-two-remote-hadoop-clusters-tp27799963p27799963.html
> Sent from the Hadoop lucene-users mailing list archive at Nabble.com.




Re: Copying files between two remote hadoop clusters

Posted by jiang licht <li...@yahoo.com>.
Forgot to mention that the methods in my previous reply are about copying files from a non-Hadoop machine to a Hadoop cluster. Otherwise, an inter-cluster copy can be handled by hadoop distcp; see here: http://hadoop.apache.org/common/docs/current/distcp.html

Thanks,
--

Michael

--- On Fri, 3/5/10, jiang licht <li...@yahoo.com> wrote:

From: jiang licht <li...@yahoo.com>
Subject: Re: Copying files between two remote hadoop clusters
To: common-user@hadoop.apache.org
Date: Friday, March 5, 2010, 4:37 PM

This is something that I asked recently :)

Here's a list of what I can think of:

1. on the remote box that holds the data: cat filetobesent | ssh hadoopmaster 'hadoop fs -put - dstinhdfs'

2. on the remote box that holds the data, configure core-site.xml to set fs.default.name to hdfs://namenode:port and then run "hadoop fs -copyFromLocal" or "hadoop fs -put" as usual; this works if your namenode is reachable from your data box, either directly or through a VPN.

3. an HDFS-aware GridFTP server; you can read more about it in Brian Bockelman's reply here:

http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201003.mbox/%3C2506096C-C00D-40EC-8751-4ABD8F040009@cse.unl.edu%3E

4. you can write an HDFS-aware data transfer tool that runs on the data box: it reads the data locally, sends it over the network to a partner process on the namenode, and writes directly into the Hadoop cluster.

5. any other ideas?

Thanks,

Michael

--- On Fri, 3/5/10, zenMonkey <nu...@gmail.com> wrote:

From: zenMonkey <nu...@gmail.com>
Subject: Copying files between two remote hadoop clusters
To: hadoop-user@lucene.apache.org
Date: Friday, March 5, 2010, 4:25 PM


I want to write a script that pulls data (flat files) from a remote machine
and pushes that into its hadoop cluster.

At the moment, it is done in two steps:

1 - Secure copy the remote files
2 - Put the files into HDFS

I was wondering if it was possible to optimize this by avoiding the copy to
the local fs and instead writing directly to HDFS. I am not sure if this is
something that the Hadoop tools already provide.

Thanks for any help.

-- 
View this message in context: http://old.nabble.com/Copying-files-between-two-remote-hadoop-clusters-tp27799963p27799963.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


Re: Copying files between two remote hadoop clusters

Posted by jiang licht <li...@yahoo.com>.
This is something that I asked recently :)

Here's a list of what I can think of:

1. on the remote box that holds the data: cat filetobesent | ssh hadoopmaster 'hadoop fs -put - dstinhdfs'

2. on the remote box that holds the data, configure core-site.xml to set fs.default.name to hdfs://namenode:port and then run "hadoop fs -copyFromLocal" or "hadoop fs -put" as usual; this works if your namenode is reachable from your data box, either directly or through a VPN (a quick sketch combining 1 and 2 follows this list).

3. an HDFS-aware GridFTP server; you can read more about it in Brian Bockelman's reply here:

http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201003.mbox/%3C2506096C-C00D-40EC-8751-4ABD8F040009@cse.unl.edu%3E

4. you can write an HDFS-aware data transfer tool that runs on the data box: it reads the data locally, sends it over the network to a partner process on the namenode, and writes directly into the Hadoop cluster.

5. any other ideas?
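
A quick sketch combining 1 and 2, run on the data box with a Hadoop client installed (the namenode address and paths are placeholders); giving the full hdfs:// URI as the destination avoids having to edit core-site.xml:

cat /exports/logs/part-0001.txt | hadoop fs -put - hdfs://namenode.example.com:8020/incoming/logs/part-0001.txt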

Thanks,

Michael

--- On Fri, 3/5/10, zenMonkey <nu...@gmail.com> wrote:

From: zenMonkey <nu...@gmail.com>
Subject: Copying files between two remote hadoop clusters
To: hadoop-user@lucene.apache.org
Date: Friday, March 5, 2010, 4:25 PM


I want to write a script that pulls data (flat files) from a remote machine
and pushes that into its hadoop cluster.

At the moment, it is done in two steps:

1 - Secure copy the remote files
2 - Put the files into HDFS

I was wondering if it was possible to optimize this by avoiding the copy to
the local fs and instead writing directly to HDFS. I am not sure if this is
something that the Hadoop tools already provide.

Thanks for any help.

-- 
View this message in context: http://old.nabble.com/Copying-files-between-two-remote-hadoop-clusters-tp27799963p27799963.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.