Posted to common-issues@hadoop.apache.org by "Amir Shenavandeh (Jira)" <ji...@apache.org> on 2019/12/22 05:39:00 UTC

[jira] [Comment Edited] (HADOOP-16775) Hadoop DistCp reuses the same temp file within the task for different files.

    [ https://issues.apache.org/jira/browse/HADOOP-16775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17001830#comment-17001830 ] 

Amir Shenavandeh edited comment on HADOOP-16775 at 12/22/19 5:38 AM:
---------------------------------------------------------------------

The patch is for Hadoop 2.10.0 and adds a timestamp to the temp file name. With it, we can track each temp file by its creation time within each attempt log.

in: [https://github.com/apache/hadoop/blob/release-2.10.0-RC1/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java#L237]
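As a rough illustration of the idea (the attached patch.txt is not reproduced in this message, so this is not the patch itself), the temp file name could be made distinct per file along the following lines. The class name, both helper methods, and the sample attempt ID are hypothetical and chosen only for this sketch; only the ".distcp.tmp." prefix and the use of the task attempt ID come from the getTmpFile method linked above.

```java
import java.util.UUID;

// Minimal sketch, not the actual patch: builds temp file names that
// differ for each file copied within a single task attempt.
final class TmpNameSketch {
    // Prefix used by the getTmpFile method linked above.
    static final String PREFIX = ".distcp.tmp.";

    // Hypothetical helper: appends a caller-supplied timestamp so the
    // name can be matched against entries in the attempt log.
    static String timestampedName(String taskAttemptId, long timestampMillis) {
        return PREFIX + taskAttemptId + "." + timestampMillis;
    }

    // Hypothetical alternative: appends a random UUID, which stays
    // distinct even if two files are started in the same millisecond.
    static String randomizedName(String taskAttemptId) {
        return PREFIX + taskAttemptId + "." + UUID.randomUUID();
    }

    public static void main(String[] args) {
        // Made-up attempt ID, for illustration only.
        String attempt = "attempt_0000000000000_0001_m_000000_0";
        System.out.println(timestampedName(attempt, System.currentTimeMillis()));
        System.out.println(randomizedName(attempt));
    }
}
```

A timestamp keeps the name easy to correlate with the log line written at creation time, while a random suffix is closer to the randomization suggested in the issue description; which trade-off fits better is a design choice left to the patch.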

was (Author: shenavandeh):
The patch is for Hadoop 2.10.0 and adds a timestamp to the temp file name. With it, we can track each temp file by its creation time within each attempt log.

in: [https://github.com/apache/hadoop/blob/release-2.10.0-RC1/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/RetriableFileCopyCommand.java#L237]

 
private Path getTmpFile(Path target, Mapper.Context context) {
  Path targetWorkPath = new Path(context.getConfiguration().
      get(DistCpConstants.CONF_LABEL_TARGET_WORK_PATH));

  Path root = target.equals(targetWorkPath)? targetWorkPath.getParent() : targetWorkPath;
  LOG.info("Creating temp file: " +
      new Path(root, ".distcp.tmp." + context.getTaskAttemptID().toString()));
  return new Path(root, ".distcp.tmp." + context.getTaskAttemptID().toString());
}

> Hadoop DistCp reuses the same temp file within the task for different files.
> ----------------------------------------------------------------------------
>
>                 Key: HADOOP-16775
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16775
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 2.0
>            Reporter: Amir Shenavandeh
>            Priority: Major
>         Attachments: patch.txt
>
>
> Hadoop DistCp reuses the same temp file name for all the files copied within each task attempt and then moves each one to its target name, which is also a server-side copy. For copies over S3 this causes inconsistency, because S3 guarantees read-after-write consistency only for brand-new objects. The contents of overwritten objects on S3 can also be read inconsistently.
> To avoid this, we should randomize the temp file name.
>  


