Posted to hdfs-user@hadoop.apache.org by Nik Lam <ni...@gmail.com> on 2014/09/15 19:54:13 UTC

tricks to parallelize distcp of large files

Hello,

I have some directories, each containing a large number of files with a
wide range of sizes.

When I run a distcp, because the smallest unit of transfer is a single
file, the maps that are assigned the largest files take much longer than
the others (or simply fail).

I see that there's an open JIRA (
https://issues.apache.org/jira/browse/MAPREDUCE-2257 - distcp.copy.by.chunk
) to allow multiple maps to copy parts of the same file in parallel and get
around this problem.

In the meantime, can anyone suggest a manual technique I could apply to the
largest files in these directories to split them prior to the distcp, and
then concatenate them back into the original files at the other end?
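
To make it concrete, here's a rough, untested sketch of the sort of
splitter/joiner I'm imagining, using the Hadoop FileSystem API. The class
name, the 1 GB chunk size, and the .partNNNNN naming are all just
placeholders:

// ChunkCopy.java -- hypothetical, untested sketch.
// Usage: hadoop jar chunkcopy.jar ChunkCopy split /data/bigfile
//        (distcp the resulting .part files)
//        hadoop jar chunkcopy.jar ChunkCopy join  /data/bigfile
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChunkCopy {
    static final long CHUNK = 1024L * 1024 * 1024; // 1 GB per part; tune as needed
    static final int BUF = 1 << 16;

    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path(args[1]);
        if ("split".equals(args[0])) split(fs, file);
        else join(fs, file);
    }

    // Write the file's bytes into fixed-size parts:
    // bigfile.part00000, bigfile.part00001, ...
    static void split(FileSystem fs, Path src) throws IOException {
        long len = fs.getFileStatus(src).getLen();
        try (FSDataInputStream in = fs.open(src)) {
            int part = 0;
            for (long off = 0; off < len; off += CHUNK, part++) {
                Path dst = new Path(src + String.format(".part%05d", part));
                try (FSDataOutputStream out = fs.create(dst, true)) {
                    copy(in, out, Math.min(CHUNK, len - off));
                }
            }
        }
    }

    // Concatenate the .partNNNNN files, in order, back into the original name.
    static void join(FileSystem fs, Path dst) throws IOException {
        try (FSDataOutputStream out = fs.create(dst, true)) {
            for (int part = 0; ; part++) {
                Path p = new Path(dst + String.format(".part%05d", part));
                if (!fs.exists(p)) break; // no more parts
                try (FSDataInputStream in = fs.open(p)) {
                    copy(in, out, fs.getFileStatus(p).getLen());
                }
            }
        }
    }

    // Copy exactly n bytes between the streams with a small buffer.
    static void copy(FSDataInputStream in, FSDataOutputStream out, long n)
            throws IOException {
        byte[] buf = new byte[BUF];
        while (n > 0) {
            int r = in.read(buf, 0, (int) Math.min(buf.length, n));
            if (r < 0) throw new IOException("unexpected EOF");
            out.write(buf, 0, r);
            n -= r;
        }
    }
}

The idea would be to run "split" on the source cluster, distcp the .part
files (which should spread across the maps much more evenly), then run
"join" on the destination and delete the parts once the result checks out.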

Regards,

Nik