Posted to user@hadoop.apache.org by Elliot West <te...@gmail.com> on 2016/05/16 14:11:07 UTC

Using DistCp and S3AFileSystem to move data to S3

Hello,

I've been moving files to S3 using DistCp and the S3AFileSystem
(branch-2.8) and have noticed that DistCp always copies to a temporary set
of files in S3 and then performs a move on copy completion. It does this on
a per-task basis, and this location is separate from the temporary location
used by the '-atomic' option. An example path is as follows:

s3://bucket/folder/.distcp.tmp.attempt_0000000000001_000001_m_000001_0
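
For illustration, the per-task behaviour is roughly equivalent to the
following sketch (the bucket, key names, and configuration below are
invented; the real logic lives in RetriableFileCopyCommand):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TempThenRename {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(URI.create("s3a://bucket/"),
        new Configuration());

    // Per-task temporary file, then the intended final destination.
    Path tmp = new Path("s3a://bucket/folder/.distcp.tmp.attempt_x_m_0_0");
    Path target = new Path("s3a://bucket/folder/data-file");

    // 1. Write the copied bytes to the temporary file.
    try (FSDataOutputStream out = fs.create(tmp)) {
      out.write("copied bytes".getBytes("UTF-8"));
    }

    // 2. 'Move' it into place. On S3 this is not an atomic metadata
    // operation: it is a server-side object COPY followed by a DELETE
    // of the source.
    fs.rename(tmp, target);
  }
}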


Now, my understanding is that moves on S3 are actually an asynchronous copy
+ delete, and that once the call to FileSystem.rename(...) returns there is
no guarantee that the data is at the destination at that point in time.
Therefore I can make no guarantees regarding the availability of said data
to downstream processes that may wish to consume it. However, I am led to
believe that file creations are consistent (but not overwrites).
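
To make the concern concrete: a downstream process that probes the
destination immediately after the copy job completes may not yet see the
object (paths again invented):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ProbeAfterRename {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(URI.create("s3a://bucket/"),
        new Configuration());

    // Even if rename(...) returned true in the copy task, my reading is
    // that this probe may still transiently report false.
    Path target = new Path("s3a://bucket/folder/data-file");
    System.out.println("target visible: " + fs.exists(target));
  }
}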

Is there any way to have DistCp write directly to the target location in
S3? If not, is there any reason why it would be undesirable to provide the
option of such behaviour?
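
In other words, instead of create(tmp) followed by rename(tmp, target),
each task could in principle write straight to the final location and rely
on the create consistency mentioned above (sketch only):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DirectWrite {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(URI.create("s3a://bucket/"),
        new Configuration());

    // Hypothetical direct write: no temporary file, no copy + delete.
    try (FSDataOutputStream out =
        fs.create(new Path("s3a://bucket/folder/data-file"))) {
      out.write("copied bytes".getBytes("UTF-8"));
    }
  }
}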

The code in question is located here:
https://github.com/apache/hadoop/blob/branch-2.8/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java#L106-L136

Thanks,

Elliot.

Re: Using DistCp and S3AFileSystem to move data to S3

Posted by Chris Nauroth <cn...@hortonworks.com>.
Hello Elliot,

This is very timely as I have been investigating this recently.  Your assessment is correct: DistCp triggers a rename, and renames on S3 do not satisfy the expectation that rename is fast and atomic, as it is on most file systems.

There has been prior discussion of "direct commit" strategies like you described to improve performance against S3A.  The relevant JIRAs are HADOOP-9565 and HADOOP-11487.  I recommend watching those JIRAs if you'd like to keep track of how the discussion evolves.

Meanwhile, you might be interested in my work-in-progress patch on HADOOP-13145, which prevents some unnecessary calls in DistCp when you're not using the option to preserve metadata attributes.  This does not directly address the rename/copy problem, but it does avoid a potential eventual consistency problem with DistCp to S3A and provides an overall optimization.  We are seeing good results with the patch so far from some manual DistCp testing.  I still need to write some JUnit tests before we can commit that patch.
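
If you want to see where that patch matters, the difference shows up in
whether attribute preservation is requested on the DistCp command line,
e.g. (source and destination paths here are placeholders):

# -pugp asks DistCp to preserve user, group, and permissions, which
# exercises the metadata-preservation path:
hadoop distcp -pugp hdfs://namenode:8020/data s3a://bucket/data

# Without -p, that metadata work can be skipped:
hadoop distcp hdfs://namenode:8020/data s3a://bucket/data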

--Chris Nauroth
