Posted to common-user@hadoop.apache.org by Soulghost <ju...@gmail.com> on 2012/09/04 22:07:18 UTC

Transfer large file >50Gb with DistCp from s3 to cluster

Hello guys 

I have a problem using DistCp to transfer a large file from S3 to an HDFS
cluster. Whenever I try to make the copy, I only see processing work and
memory usage on one of the nodes, not on all of them. I don't know whether
this is the expected behaviour or a configuration problem. If I transfer
multiple files instead, each node handles a single file at a time, so that
case does seem to run in parallel, but the single-file transfer does not.

I am using the Hadoop 0.20.2 distribution on a two-EC2-instance cluster. I
was hoping some of you have an idea of how DistCp works and which
properties I could tweak to improve the transfer rate, which is currently
about 0.7 GB per minute.

Regards.


Re: Transfer large file >50Gb with DistCp from s3 to cluster

Posted by Mischa Tuffield <mi...@mmt.me.uk>.
Hello, 

You could try this jar, which I found a link to on one of the Amazon pages.

s3cmd get s3://eu-west-1.elasticmapreduce/libs/s3distcp/1.0.1/s3distcp.jar

s3distcp.jar copies data to and from S3 via MapReduce.

If your cluster has N reducers available, you can run:

hadoop jar s3distcp.jar -D mapred.reduce.tasks=N --src s3://lame/foo --dest hdfs:///user/hadoop/lamefoo/

I would run it in a screen session. 
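Worth noting for comparison: plain DistCp splits its work at file
granularity, so a single 50 GB file is handled by one map task no matter
how many maps you allow; the -m flag only buys parallelism when there are
many files. A minimal sketch of the plain-DistCp form, assuming the bucket
is reachable through the s3n:// native S3 filesystem (bucket and paths are
placeholders):

hadoop distcp -m 20 s3n://mybucket/biginput hdfs:///user/hadoop/biginput/

s3distcp, by contrast, does its copying inside MapReduce tasks, which is
presumably why mapred.reduce.tasks controls the number of parallel copy
streams in the command above.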
On 4 Sep 2012, at 21:07, Soulghost wrote:

> 
> Hello guys 
> 
> I have a problem using the DistCp to transfer a large file from s3 to HDFS
> cluster, whenever I tried to make the copy, I only saw processing work and
> memory usage in one of the nodes, not in all of them, I don't know if this
> is the proper behaviour of this or if it is a configuration problem. If I
> make the transfer of multiple files each node handles a single file at the
> same time, I understand that this transfer would be in parallel but it
> doesn't seems like that. 
> 
> I am using 0.20.2 distribution for hadoop in a two Ec2Instances cluster, I
> was hoping that any of you have an idea of how it works distCp and which
> properties could I tweak to improve the transfer rate that is currently in
> 0.7 Gb per minute. 
> 
> Regards.
> -- 
> View this message in context: http://old.nabble.com/Transfer-large-file-%3E50Gb-with-DistCp-from-s3-to-cluster-tp34389118p34389118.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> 

_____________________________
Mischa Tuffield PhD
http://mmt.me.uk/
http://mmt.me.uk/foaf.rdf#mischa


Re: Transfer large file >50Gb with DistCp from s3 to cluster

Posted by Kai Voigt <k...@123.org>.
Hi,

My guess is that you run "hadoop distcp" on one of the datanodes. When the HDFS client writing the data runs on a datanode, that node receives the first replica of every block, which is why you see the load concentrated there. You should still see the remaining replicas distributed across the other nodes.
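If you want to confirm where the replicas actually landed, fsck can list block locations. A quick check, assuming the copied file ended up at /user/hadoop/bigfile (the path is a placeholder):

hadoop fsck /user/hadoop/bigfile -files -blocks -locations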

Kai

On 04.09.2012, at 22:07, Soulghost <ju...@gmail.com> wrote:

> [...]

-- 
Kai Voigt
k@123.org