Posted to common-user@hadoop.apache.org by Mridul Muralidharan <mr...@yahoo-inc.com> on 2010/05/18 02:10:35 UTC

distcp of small number of really large files

Hi,

   Is there a way to parallelize the copy of really large files ?
 From my understanding, each map in distcp currently copies one whole
file. So for really large files this would be pretty slow when the
number of files is small, since each file is limited to a single map.


Thanks,
Mridul

Re: distcp of small number of really large files

Posted by Konstantin Shvachko <sh...@yahoo-inc.com>.
There is a new feature called concat(), which concatenates files consisting of full blocks.
So the idea is to copy the individual blocks in parallel, then concatenate them back into
the original file once they are copied.
You will have to write some code to do this or modify distcp.
This is in 0.21/0.22, but not in 0.20.
--Konstantin
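
A minimal sketch of the reassembly step, assuming the block-aligned
part files have already been copied in parallel to the destination
cluster. The class name and all paths here are hypothetical; the only
real API used is DistributedFileSystem#concat:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    // Hypothetical helper: stitch block-aligned part files back
    // together after a parallel copy.
    public class ConcatParts {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        if (!(fs instanceof DistributedFileSystem)) {
          throw new IllegalStateException("concat() is HDFS-only");
        }
        DistributedFileSystem dfs = (DistributedFileSystem) fs;

        // Example layout: each map copied one block-aligned part file.
        Path target = new Path("/copied/bigfile.part0");
        Path[] rest = {
            new Path("/copied/bigfile.part1"),
            new Path("/copied/bigfile.part2"),
        };

        // Moves the blocks of the remaining parts onto the target and
        // removes the part files. The parts must share the target's
        // block size and, per the restriction above, consist of full
        // blocks.
        dfs.concat(target, rest);
      }
    }

After the concat, the target can be renamed to the original file name;
the part files no longer exist as separate paths.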

On 5/17/2010 5:10 PM, Mridul Muralidharan wrote:
> Hi,
>
> Is there a way to parallelize the copy of really large files ?
>  From my understanding, each map in distcp currently copies one whole
> file. So for really large files this would be pretty slow when the
> number of files is small, since each file is limited to a single map.
>
>
> Thanks,
> Mridul
>