
Posted to common-issues@hadoop.apache.org by "Erik Krogen (JIRA)" <ji...@apache.org> on 2017/02/27 16:51:45 UTC

[jira] [Comment Edited] (HADOOP-14086) Improve DistCp Speed for small files

    [ https://issues.apache.org/jira/browse/HADOOP-14086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15886105#comment-15886105 ] 

Erik Krogen edited comment on HADOOP-14086 at 2/27/17 4:51 PM:
---------------------------------------------------------------

[~zhz] currently there are multiple calls made for each file; even reducing a distcp for 1M files to 1M {{getFileInfo}} calls would be a big improvement over the current implementation.
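
To make the arithmetic concrete, here is a toy model. It is not DistCp code and it does not talk to Hadoop; {{CountingNameNode}}, both copy functions, and the figure of 3 lookups per file are hypothetical and chosen only for illustration. It just counts metadata round trips under the two patterns.

```python
# Hypothetical model: it only counts metadata lookups and never calls Hadoop.

class CountingNameNode:
    """Stand-in for a NameNode that tallies metadata calls."""

    def __init__(self, paths):
        self.paths = set(paths)
        self.calls = 0

    def get_file_info(self, path):
        # Models one getFileInfo round trip.
        self.calls += 1
        return {"path": path, "exists": path in self.paths}


def copy_with_repeated_lookups(nn, paths, lookups_per_file):
    """Each file triggers several lookups (lookups_per_file is an assumed figure)."""
    for p in paths:
        for _ in range(lookups_per_file):
            nn.get_file_info(p)
    return nn.calls


def copy_with_single_lookup(nn, paths):
    """Each file triggers exactly one lookup; later steps reuse the result."""
    for p in paths:
        status = nn.get_file_info(p)
        _ = status  # placeholder for the steps that would reuse this status
    return nn.calls


paths = [f"/data/part-{i:05d}" for i in range(1000)]
repeated = copy_with_repeated_lookups(CountingNameNode(paths), paths, lookups_per_file=3)
single = copy_with_single_lookup(CountingNameNode(paths), paths)
print(repeated, single)
```

Under these assumptions the single-lookup pattern cuts the call count by the per-file multiplier, whatever that multiplier turns out to be in the real code.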

[~stevel@apache.org], what about this JIRA makes you worry that object store performance will be worse? Nothing stands out to me, so I am curious. Also, are you saying that the listFiles performance work is already done, or is it still in progress? Do you have a JIRA link? Sounds very interesting.


was (Author: xkrogen):
[~zhz] currently there are multiple calls made for each file; even reducing a distcp for 1M files to 1M {{getFileInfo}} calls would be a big improvement over the current implementation.

[~stevel@apache.org], what about this JIRA makes you worry that object store performance will be worse? Nothing stands out to me so I am curious. Also, are you saying that the listFiles performance work is already done, or under progress? Do you have a JIRA link?

> Improve DistCp Speed for small files
> ------------------------------------
>
>                 Key: HADOOP-14086
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14086
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 2.6.5
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Minor
>
> When using distcp to copy lots of small files, the NameNode naturally becomes a bottleneck.
> The current distcp code is *not* optimized to reduce NameNode calls. We should restructure the code to make as few NameNode calls as possible and so speed up the copying of small files.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org