Posted to user@hadoop.apache.org by Hendrik Haddorp <he...@gmx.net> on 2018/04/21 14:00:44 UTC

distcp from plain java program

Hi,

I'm trying to use distcp (org.apache.hadoop.tools.DistCp) from a 
simple Java program to copy files from HDFS to S3 storage. This works 
fine, except that it is very slow. Copying the files to the local 
disk is not much faster either. It seems that the files are copied 
sequentially. My understanding, however, was that distcp creates map 
tasks that can be executed in parallel. Is any configuration 
setting required to get the map tasks executed in parallel?

thanks,
Hendrik

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@hadoop.apache.org
For additional commands, e-mail: user-help@hadoop.apache.org


Re: distcp from plain java program

Posted by Hendrik Haddorp <he...@gmx.net>.
Hi Gour,

I did, but the problem seems to have been the local execution. Local 
execution uses only one thread, which is what I saw in the logs as well. 
So I ended up doing my own copy using the Hadoop FileSystem APIs and 
multiple threads. That worked pretty well and also allowed me to 
control the order of the file copies.
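
For anyone wanting to try the same approach, here is a rough sketch of the
multi-threaded copy idea. It is not the code actually used in this thread.
The class ParallelCopy and the FileCopier interface are made-up names for
illustration, and the sketch uses only the JDK so it runs without Hadoop on
the classpath. With Hadoop available, the body of the copier would presumably
call something like FileUtil.copy(srcFs, srcPath, dstFs, dstPath, false, conf)
for each file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelCopy {

    /** Hypothetical hook that copies a single file; not a Hadoop interface. */
    interface FileCopier {
        void copy(String source) throws Exception;
    }

    /**
     * Copies all sources using a fixed-size thread pool.
     * Tasks are queued in list order, so the caller decides which files
     * are started first. Returns the sources whose copy failed.
     */
    static List<String> copyAll(List<String> sources, int threads, FileCopier copier)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (String source : sources) {
                // Returning null makes the lambda a Callable, which may throw.
                pending.add(pool.submit(() -> {
                    copier.copy(source);
                    return null;
                }));
            }
            List<String> failed = new ArrayList<>();
            for (int i = 0; i < pending.size(); i++) {
                try {
                    pending.get(i).get(); // wait for this copy to finish
                } catch (ExecutionException e) {
                    failed.add(sources.get(i));
                }
            }
            return failed;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder paths and a copier that only logs, for demonstration.
        List<String> sources = Arrays.asList("/data/a", "/data/b", "/data/c");
        List<String> failed = copyAll(sources, 2,
                source -> System.out.println("copying " + source));
        System.out.println("failed copies: " + failed.size());
    }
}
```

The pool size plays the role that the number of maps plays in distcp, and
the order of the list passed to copyAll determines the order in which copies
are started (not necessarily the order in which they finish).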

regards,
Hendrik

On 26.04.2018 02:00, Gour Saha wrote:
> Hendrik,
> Did you try setting maxMaps to a higher number? The default is 20, so you might try setting it to a higher value.
>
> -Gour


Re: distcp from plain java program

Posted by Gour Saha <gs...@hortonworks.com>.
Hendrik,
Did you try setting maxMaps to a higher number? The default is 20, so you might try setting it to a higher value.

-Gour 
