Posted to hdfs-dev@hadoop.apache.org by "Linxiao Jin (JIRA)" <ji...@apache.org> on 2015/08/07 20:45:45 UTC

[jira] [Created] (HDFS-8878) An HDFS built-in DistCp

Linxiao Jin created HDFS-8878:
---------------------------------

             Summary: An HDFS built-in DistCp 
                 Key: HDFS-8878
                 URL: https://issues.apache.org/jira/browse/HDFS-8878
             Project: Hadoop HDFS
          Issue Type: New Feature
            Reporter: Linxiao Jin
            Assignee: Linxiao Jin


For now, we use DistCp to copy directories, which works quite well. However, it would be better to have an efficient directory copy tool built into HDFS. It could be faster by cutting out the redundant communication between HDFS, YARN and MapReduce. It would also free the resources that DistCp consumes in the JobTracker and YARN, and be easier to debug.

We need more discussion on a new protocol between the NameNodes (NN) and DataNodes (DN) of different clusters, so that commands and data can be sent at the HDFS level. One available, if hacky, solution would be: the source NameNode (srcNN) gets the block distribution of the file to be copied and asks each DataNode holding a block to start a DFSClient and copy its local block, read via short-circuit, as a file in the dst cluster. Once all of the block files in the dst cluster are complete, a DFSClient concatenates them to form the destination file. A more optimized solution might implement a newly designed protocol for cross-cluster communication instead of relying on DFSClient, and use methods from a lower layer.
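
To make the flow of the hacky solution concrete, here is a minimal in-memory sketch of its three steps (look up the blocks, copy each block as its own file, concatenate). It is only a simulation under stated assumptions: the dst cluster is modeled as a plain dict, the block size is tiny, and every name in it (BLOCK_SIZE, split_into_blocks, copy_block, concat_parts, builtin_copy, the "._part" suffix) is hypothetical. It does not use any HDFS API, and the real concat operation has its own preconditions that are not modeled here.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 4  # hypothetical, tiny block size used only for illustration


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Stand-in for the srcNN looking up the ordered block list of the source file."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def copy_block(dst_cluster, dst_path, index, block):
    """Stand-in for one source DataNode writing its local block as a part file in dst."""
    part_path = "%s._part%05d" % (dst_path, index)  # hypothetical naming scheme
    dst_cluster[part_path] = block
    return part_path


def concat_parts(dst_cluster, dst_path, part_paths):
    """Stand-in for the final concat: join the part files in block order and remove them."""
    dst_cluster[dst_path] = b"".join(dst_cluster.pop(p) for p in part_paths)


def builtin_copy(src_data, dst_cluster, dst_path):
    """Copy src_data into dst_cluster[dst_path] block by block, then concatenate."""
    blocks = split_into_blocks(src_data)
    # Each block is copied independently, as the DataNodes would do in parallel.
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [
            pool.submit(copy_block, dst_cluster, dst_path, i, block)
            for i, block in enumerate(blocks)
        ]
        # Collect results in submission order so the parts stay in block order.
        part_paths = [f.result() for f in futures]
    concat_parts(dst_cluster, dst_path, part_paths)


if __name__ == "__main__":
    dst = {}
    builtin_copy(b"hello hdfs world", dst, "/dst/file")
    print(dst)
```

The point of the sketch is the ordering constraint: the block copies can run in any order, but the concat step must see the parts in the original block order and can only start after every part file is complete.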



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)