Posted to common-issues@hadoop.apache.org by "Corby Wilson (JIRA)" <ji...@apache.org> on 2014/11/07 20:49:34 UTC

[jira] [Created] (HADOOP-11281) Add flag to fs.shell to skip _COPYING_ file

Corby Wilson created HADOOP-11281:
-------------------------------------

             Summary: Add flag to fs.shell to skip _COPYING_ file
                 Key: HADOOP-11281
                 URL: https://issues.apache.org/jira/browse/HADOOP-11281
             Project: Hadoop Common
          Issue Type: Improvement
          Components: fs, fs/s3
         Environment: Hadoop 2.2, but the behavior is present in all versions.
AWS EMR 3.0.4
            Reporter: Corby Wilson
            Priority: Critical


Amazon S3 does not have a rename operation.
When you copy a file with the hadoop shell or distcp, hadoop first uploads it to a temporary path with the ._COPYING_ suffix, then renames it to the final destination.

Code:
org/apache/hadoop/fs/shell/CommandWithDestination.java
      // Write the stream to a temporary "<target>._COPYING_" path first,
      // then rename it to the final target once the write completes.
      PathData tempTarget = target.suffix("._COPYING_");
      targetFs.setWriteChecksum(writeChecksum);
      targetFs.writeStreamToFile(in, tempTarget, lazyPersist);
      targetFs.rename(tempTarget, target);

The problem is that on rename we actually have to download the file again (through an InputStream) and then upload it again.
For very large files (>= 5 GB) the re-upload has to go through multipart upload.
So when processing several TB of multi-GB files, each file is written to S3 twice and read from S3 once.
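
As a minimal sketch of why that costs a second full transfer (assuming the generic copy-then-delete fallback; the real S3 filesystem code paths differ), an emulated rename looks roughly like this:

      import java.io.IOException;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.FileUtil;
      import org.apache.hadoop.fs.Path;

      // Illustrative helper only: a filesystem without a native rename can
      // emulate it as copy-then-delete. FileUtil.copy streams the source
      // back through the client, so a rename right after an upload costs
      // one extra read and one extra write of the whole object.
      public class EmulatedRename {
        static boolean renameByCopy(FileSystem fs, Path src, Path dst,
            Configuration conf) throws IOException {
          return FileUtil.copy(fs, src, fs, dst, /* deleteSource */ true, conf);
        }
      }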

It would be nice to have a flag or core-site.xml setting that tells hadoop to skip the temporary ._COPYING_ file and write the file once, directly to its final path.
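
One possible shape for the change (a sketch only; the property name fs.shell.direct.write is hypothetical, not an existing Hadoop setting) would be to branch in the code above:

      // Hypothetical setting: when true, skip the ._COPYING_ temp file and
      // write straight to the final path. The tradeoff is that an
      // interrupted copy leaves a partial object at the final name instead
      // of a stray ._COPYING_ file.
      boolean direct = getConf().getBoolean("fs.shell.direct.write", false);
      targetFs.setWriteChecksum(writeChecksum);
      if (direct) {
        targetFs.writeStreamToFile(in, target, lazyPersist);
      } else {
        PathData tempTarget = target.suffix("._COPYING_");
        targetFs.writeStreamToFile(in, tempTarget, lazyPersist);
        targetFs.rename(tempTarget, target);
      }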



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)