Posted to common-issues@hadoop.apache.org by "Andrew Olson (JIRA)" <ji...@apache.org> on 2019/01/14 15:29:00 UTC

[jira] [Created] (HADOOP-16047) Avoid expensive rename when DistCp is writing to S3

Andrew Olson created HADOOP-16047:
-------------------------------------

             Summary: Avoid expensive rename when DistCp is writing to S3
                 Key: HADOOP-16047
                 URL: https://issues.apache.org/jira/browse/HADOOP-16047
             Project: Hadoop Common
          Issue Type: Improvement
          Components: fs/s3, tools/distcp
            Reporter: Andrew Olson


When writing to an S3-based target, the temp file and rename logic in RetriableFileCopyCommand adds unnecessary cost to the job, because a rename in S3 is carried out as a server-side copy followed by a delete [1]. The renames are parallelized across all of the DistCp map tasks, which mitigates the cost to some extent. However, a configuration property that lets distributed copies skip the rename and write directly to the target path would improve performance considerably.

[1] https://github.com/apache/hadoop/blob/release-3.2.0-RC1/hadoop-common-project/hadoop-common/src/site/markdown/filesystem/introduction.md#object-stores-vs-filesystems
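
The cost described above can be illustrated with a toy model. This is a minimal Python sketch, not DistCp code and not an S3 client: ToyObjectStore, copy_to_store, the ".tmp" key suffix, and the direct_write flag are all hypothetical names chosen for illustration, with direct_write standing in for the proposed configuration property. It assumes only what the description states, namely that a rename on an object store is a copy followed by a delete.

```python
class ToyObjectStore:
    """In-memory stand-in for an object store (hypothetical, not an S3 API)."""

    def __init__(self):
        self.objects = {}        # key -> bytes
        self.bytes_stored = 0    # total bytes the store had to write or copy

    def put(self, key, data):
        self.objects[key] = data
        self.bytes_stored += len(data)

    def rename(self, src, dst):
        # No native rename: copy the whole object, then delete the original.
        self.put(dst, self.objects[src])
        del self.objects[src]


def copy_to_store(store, key, data, direct_write=False):
    """Write data under key, either via a temp key plus rename or directly."""
    if direct_write:
        store.put(key, data)
        return
    tmp_key = key + ".tmp"       # illustrative temp name only
    store.put(tmp_key, data)
    store.rename(tmp_key, key)


if __name__ == "__main__":
    payload = b"x" * 1000

    with_rename = ToyObjectStore()
    copy_to_store(with_rename, "dir/file", payload)

    direct = ToyObjectStore()
    copy_to_store(direct, "dir/file", payload, direct_write=True)

    print(with_rename.bytes_stored)  # 2000: the data is stored twice
    print(direct.bytes_stored)       # 1000: the data is stored once
```

In this model the temp-and-rename path makes the store handle every byte twice, while the direct path handles it once. The trade-off is that a direct write gives up the property that the final key only appears after the copy has completed, which is presumably why the description asks for an opt-in property rather than a change of default.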



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
