Posted to common-issues@hadoop.apache.org by "Zheng Shao (JIRA)" <ji...@apache.org> on 2017/03/14 05:08:41 UTC
[jira] [Assigned] (HADOOP-14137) Faster distcp by taking file list from fsimage or -lsr result
[ https://issues.apache.org/jira/browse/HADOOP-14137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zheng Shao reassigned HADOOP-14137:
-----------------------------------
Assignee: Zheng Shao
> Faster distcp by taking file list from fsimage or -lsr result
> -------------------------------------------------------------
>
> Key: HADOOP-14137
> URL: https://issues.apache.org/jira/browse/HADOOP-14137
> Project: Hadoop Common
> Issue Type: New Feature
> Components: tools/distcp
> Reporter: Zheng Shao
> Assignee: Zheng Shao
> Attachments: HADOOP-14137.branch26.1.patch, HADOOP-14137.branch26.2.patch
>
>
> DistCp is very slow to start when the src directory has a huge number of subdirectories. In our case, we already have the directory listing (via "hdfs oiv -i fsimage" or via nightly "hdfs dfs -ls -R /" dumps), and we would like to use that instead of listing the directory on the NameNode in real time.
> The "-f" option doesn't help in this case because it would try to put everything into a single flat target directory.
> We'd like to introduce a new option "-list <file>" for distcp. The <file> contains the result of listing the src directory.
> In order to achieve this, we plan to:
> 1. Add a new CopyListing class, PregeneratedCopyListing, similar to SimpleCopyListing, which does not recursively list ("-ls -R") the source directory but instead takes the listing supplied via "-list".
> 2. Add an option "-list <file>" that automatically makes distcp use the new PregeneratedCopyListing class.
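
The core of the plan above, turning a saved recursive listing into copy entries that keep their position under the source root, can be sketched roughly as follows. This is not the attached patch and does not use the DistCp API: the class name PregeneratedListingSketch, its Entry type, and the methods parse and targetFor are hypothetical names chosen for illustration. It assumes the dump uses the usual eight-column "hdfs dfs -ls -R" layout (permissions, replication, owner, group, size, date, time, path).

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch (not the HADOOP-14137 patch, no Hadoop dependency):
 * reads lines of a saved "hdfs dfs -ls -R" dump and keeps each entry's path
 * relative to the source root, so the tree can be rebuilt under the target.
 */
final class PregeneratedListingSketch {

    /** One listed file or directory under the source root. */
    static final class Entry {
        final String sourcePath;   // absolute path as it appears in the dump
        final String relativePath; // path with the source root stripped, e.g. "/a/b"
        final boolean isDirectory; // true when the permission string starts with 'd'
        final long length;         // size column from the dump

        Entry(String sourcePath, String relativePath, boolean isDirectory, long length) {
            this.sourcePath = sourcePath;
            this.relativePath = relativePath;
            this.isDirectory = isDirectory;
            this.length = length;
        }
    }

    /** Parses the dump, keeping only entries at or below sourceRoot. */
    static List<Entry> parse(List<String> lines, String sourceRoot) {
        String root = stripTrailingSlashes(sourceRoot);
        List<Entry> entries = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            // Skip blank lines and the "Found N items" headers that -ls prints.
            if (trimmed.isEmpty() || trimmed.startsWith("Found ")) {
                continue;
            }
            // Limit of 8 keeps any spaces inside the path in the last field.
            String[] f = trimmed.split("\\s+", 8);
            if (f.length < 8) {
                throw new IllegalArgumentException("Unrecognized listing line: " + line);
            }
            String path = f[7];
            // Match on a path-component boundary so "/data" does not match "/database".
            if (!path.equals(root) && !path.startsWith(root + "/")) {
                continue;
            }
            entries.add(new Entry(path, path.substring(root.length()),
                    f[0].startsWith("d"), Long.parseLong(f[4])));
        }
        return entries;
    }

    /** Maps an entry to its destination, preserving the directory structure. */
    static String targetFor(Entry entry, String targetRoot) {
        return stripTrailingSlashes(targetRoot) + entry.relativePath;
    }

    private static String stripTrailingSlashes(String p) {
        int end = p.length();
        while (end > 0 && p.charAt(end - 1) == '/') {
            end--;
        }
        return p.substring(0, end);
    }
}
```

Keeping the relative path alongside the source path is what distinguishes this from a flat list of source files: the target location is always the target root plus the relative path, so nested directories are recreated rather than collapsed into one directory. A real CopyListing implementation would also need to carry the remaining file status fields, which this sketch ignores.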
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)