Posted to common-user@hadoop.apache.org by zenMonkey <nu...@gmail.com> on 2010/03/05 23:25:30 UTC
Copying files between two remote hadoop clusters
I want to write a script that pulls data (flat files) from a remote machine
and pushes that into its hadoop cluster.
At the moment, it is done in two steps:
1 - Secure copy the remote files
2 - Put the files into HDFS
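In shell terms, the two-step flow described above might look something like this (the hostnames and paths are hypothetical placeholders):

```shell
# Step 1: secure-copy the flat files from the remote machine to a local staging directory
scp user@remote-host:/data/flat-files/* /tmp/staging/

# Step 2: put the staged files into HDFS
hadoop fs -put /tmp/staging/* /user/me/incoming/
```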
I was wondering if it was possible to optimize this by avoiding copying to
local fs before pushing to hdfs; and instead write directly to hdfs. I am
not sure if this is something that hadoop tools already provide.
Thanks for any help.
--
View this message in context: http://old.nabble.com/Copying-files-between-two-remote-hadoop-clusters-tp27799963p27799963.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
Re: Copying files between two remote hadoop clusters
Posted by zenMonkey <nu...@gmail.com>.
distcp seems to copy between clusters.
http://hadoop.apache.org/common/docs/current/distcp.html
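A typical distcp invocation for copying between two clusters might look like this (the namenode addresses and paths are hypothetical placeholders):

```shell
# distcp runs as a MapReduce job, copying the files in parallel across the cluster
hadoop distcp hdfs://src-namenode:8020/data/input hdfs://dst-namenode:8020/data/input
```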
zenMonkey wrote:
>
> I want to write a script that pulls data (flat files) from a remote
> machine and pushes that into its hadoop cluster.
>
> At the moment, it is done in two steps:
>
> 1 - Secure copy the remote files
> 2 - Put the files into HDFS
>
> I was wondering if it was possible to optimize this by avoiding copying to
> local fs before pushing to hdfs; and instead write directly to hdfs. I am
> not sure if this is something that hadoop tools already provide.
>
> Thanks for any help.
>
>
RE: Copying files between two remote hadoop clusters
Posted by zhuweimin <xi...@tsm.kddilabs.jp>.
Hi
The HDFS shell commands support standard input/output, so I think you can use
that to avoid the temporary save to the local file system.
For example:
wget https://web-server/file-path -O - | hadoop fs -put - hdfs://nn.example.com/hadoop/hadoopfile
Refer to this URL
http://hadoop.apache.org/common/docs/current/hdfs_shell.html#put
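The same standard-input trick should also work with ssh in place of wget, which would match the original scp-based setup (the host and file names are hypothetical placeholders):

```shell
# Stream the remote file over ssh straight into HDFS, with no local temporary copy
ssh user@remote-host 'cat /data/flat-file' | hadoop fs -put - hdfs://nn.example.com/hadoop/hadoopfile
```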
Hope that helps
zhuweimin
> -----Original Message-----
> From: zenMonkey [mailto:numan.salati@gmail.com]
> Sent: Sunday, March 07, 2010 4:23 AM
> To: hadoop-user@lucene.apache.org
> Subject: Copying files between two remote hadoop clusters
>
>
> I want to write a script that pulls data (flat files) from a remote
> machine and pushes that into its hadoop cluster.
>
> At the moment, it is done in two steps:
>
> 1 - Secure copy the remote files
> 2 - Put the files into HDFS
>
> I was wondering if it was possible to optimize this by avoiding copying to
> local fs before pushing to hdfs; and instead write directly to hdfs. I am
> not sure if this is something that hadoop tools already provide.
>
> Thanks for any help.
>
Re: Copying files between two remote hadoop clusters
Posted by jiang licht <li...@yahoo.com>.
Forgot to mention that the methods in my previous reply are about copying files from a non-Hadoop machine to a Hadoop cluster. Otherwise, inter-cluster copying can be handled by hadoop distcp; see here: http://hadoop.apache.org/common/docs/current/distcp.html
Thanks,
--
Michael
--- On Fri, 3/5/10, jiang licht <li...@yahoo.com> wrote:
From: jiang licht <li...@yahoo.com>
Subject: Re: Copying files between two remote hadoop clusters
To: common-user@hadoop.apache.org
Date: Friday, March 5, 2010, 4:37 PM
This is something I asked recently :)
Here's a list of what I can think of:
1. On the remote data box: cat filetobesent | ssh hadoopmaster 'hadoop fs -put - dstinhdfs'
2. On the remote data box, configure core-site.xml to set fs.default.name to hdfs://namenode:port, then run "hadoop fs -copyFromLocal" or "hadoop fs -put" as usual; this works if the namenode is reachable from the data box, either directly or through a VPN.
3. HDFS-aware GridFTP; you can read more about it in Brian Bockelman's reply:
http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201003.mbox/%3C2506096C-C00D-40EC-8751-4ABD8F040009@cse.unl.edu%3E
4. Write an HDFS-aware data transfer tool that runs on the data box: it reads the data locally, sends it over the network to a partner process on the namenode, and writes directly into the Hadoop cluster.
5. Other ideas?
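For option 2 above, the core-site.xml fragment on the data box might look something like this (the namenode host and port are hypothetical placeholders):

```
<!-- core-site.xml on the data box: point the default file system at the remote namenode -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode:8020</value>
  </property>
</configuration>
```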
Thanks,
Michael
--- On Fri, 3/5/10, zenMonkey <nu...@gmail.com> wrote:
From: zenMonkey <nu...@gmail.com>
Subject: Copying files between two remote hadoop clusters
To: hadoop-user@lucene.apache.org
Date: Friday, March 5, 2010, 4:25 PM
I want to write a script that pulls data (flat files) from a remote machine
and pushes that into its hadoop cluster.
At the moment, it is done in two steps:
1 - Secure copy the remote files
2 - Put the files into HDFS
I was wondering if it was possible to optimize this by avoiding copying to
local fs before pushing to hdfs; and instead write directly to hdfs. I am
not sure if this is something that hadoop tools already provide.
Thanks for any help.
Re: Copying files between two remote hadoop clusters
Posted by jiang licht <li...@yahoo.com>.
This is something I asked recently :)
Here's a list of what I can think of:
1. On the remote data box: cat filetobesent | ssh hadoopmaster 'hadoop fs -put - dstinhdfs'
2. On the remote data box, configure core-site.xml to set fs.default.name to hdfs://namenode:port, then run "hadoop fs -copyFromLocal" or "hadoop fs -put" as usual; this works if the namenode is reachable from the data box, either directly or through a VPN.
3. HDFS-aware GridFTP; you can read more about it in Brian Bockelman's reply:
http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201003.mbox/%3C2506096C-C00D-40EC-8751-4ABD8F040009@cse.unl.edu%3E
4. Write an HDFS-aware data transfer tool that runs on the data box: it reads the data locally, sends it over the network to a partner process on the namenode, and writes directly into the Hadoop cluster.
5. Other ideas?
Thanks,
Michael
--- On Fri, 3/5/10, zenMonkey <nu...@gmail.com> wrote:
From: zenMonkey <nu...@gmail.com>
Subject: Copying files between two remote hadoop clusters
To: hadoop-user@lucene.apache.org
Date: Friday, March 5, 2010, 4:25 PM
I want to write a script that pulls data (flat files) from a remote machine
and pushes that into its hadoop cluster.
At the moment, it is done in two steps:
1 - Secure copy the remote files
2 - Put the files into HDFS
I was wondering if it was possible to optimize this by avoiding copying to
local fs before pushing to hdfs; and instead write directly to hdfs. I am
not sure if this is something that hadoop tools already provide.
Thanks for any help.