Posted to common-issues@hadoop.apache.org by "Steve Loughran (JIRA)" <ji...@apache.org> on 2019/03/18 19:33:00 UTC

[jira] [Commented] (HADOOP-13600) S3a rename() to copy files in a directory in parallel

    [ https://issues.apache.org/jira/browse/HADOOP-13600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16795337#comment-16795337 ] 

Steve Loughran commented on HADOOP-13600:
-----------------------------------------

I'm reviewing this again. Nominally, the S3 transfer manager is parallelized anyway.

But if a multi-GB copy is in progress there, all the small copy operations queued behind it are held up, even though many of them could be executed in the meantime. So yes, we do need something to do the work in batches.

HADOOP-16189 looks at moving away from the transfer manager and doing it ourselves. I'm not yet ready to take that on, but the 200 error of HADOOP-16188 means I now have some doubts about the transfer manager's longevity. I just don't want to rush into that.

* We know rename will never go away; it's too ubiquitous
* We know that directory renames are a major bottleneck, even in "hadoop fs -rm" commands, let alone large Hive jobs
* If we can show a tangible speedup, it's justified

But: we need to retain consistency with S3Guard in the presence of failure. Proposed: after every copy call completes, S3Guard is updated immediately with the information that the dir exists. Likewise, we'll update S3Guard with the deletions after every bulk delete.
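
To make the proposal concrete, here is a rough sketch of the ordering I have in mind. It is not the patch and it does not use the real S3A or AWS SDK classes: Store, Metadata, renameDir and deleteBatchSize are all hypothetical names made up for illustration, and exactly what gets recorded in S3Guard per copy is an assumption. Copies are submitted to a thread pool, the metadata store is updated as soon as each individual copy finishes, and the source keys are then deleted in batches, with the metadata updated after each batch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;

class ParallelRenameSketch {

  /** Hypothetical stand-in for the object store client; not the AWS SDK API. */
  interface Store {
    void copy(String srcKey, String destKey);
    void deleteBatch(List<String> keys);
  }

  /** Hypothetical stand-in for the S3Guard metadata store. */
  interface Metadata {
    void put(String key);
    void delete(String key);
  }

  /**
   * Copy every key under srcPrefix to destPrefix in parallel, then delete
   * the sources in batches. Assumes every key starts with srcPrefix.
   */
  static void renameDir(List<String> keys, String srcPrefix, String destPrefix,
      Store store, Metadata meta, ExecutorService pool, int deleteBatchSize) {
    List<CompletableFuture<Void>> copies = new ArrayList<>();
    for (String key : keys) {
      String dest = destPrefix + key.substring(srcPrefix.length());
      copies.add(CompletableFuture.runAsync(() -> {
        store.copy(key, dest);
        // Record the new entry as soon as this one copy completes, so a
        // later failure leaves the metadata matching what was copied.
        meta.put(dest);
      }, pool));
    }
    // Wait for all copies; join() throws if any copy failed, in which case
    // no source is deleted.
    CompletableFuture.allOf(copies.toArray(new CompletableFuture[0])).join();

    // Delete the sources in batches, updating the metadata after each batch.
    for (int i = 0; i < keys.size(); i += deleteBatchSize) {
      List<String> batch =
          keys.subList(i, Math.min(i + deleteBatchSize, keys.size()));
      store.deleteBatch(batch);
      batch.forEach(meta::delete);
    }
  }
}
```

With this ordering, a failure partway through the copies leaves the sources untouched and the metadata covering only the destinations that were actually written. A large copy still occupies one pool thread, but the small copies proceed on the others rather than queuing behind it.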

> S3a rename() to copy files in a directory in parallel
> -----------------------------------------------------
>
>                 Key: HADOOP-13600
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13600
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 2.7.3
>            Reporter: Steve Loughran
>            Assignee: Sahil Takiar
>            Priority: Major
>         Attachments: HADOOP-13600.001.patch
>
>
> Currently a directory rename does a one-by-one copy, making the request O(files * data). If the copy operations were launched in parallel, the duration of the copy may be reducible to the duration of the longest copy. For a directory with many files, this will be significant.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
