Posted to common-dev@hadoop.apache.org by "Ravi Gummadi (JIRA)" <ji...@apache.org> on 2009/06/17 07:52:07 UTC

[jira] Updated: (HADOOP-5444) Add atomic move option

     [ https://issues.apache.org/jira/browse/HADOOP-5444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ravi Gummadi updated HADOOP-5444:
---------------------------------

    Attachment: d_retries_atomic.patch

Here is a patch that supports atomic copies and atomic updates.

distcp -atomic <stagedir> src* dst

The staging directory is passed as an argument to the -atomic option. This avoids quota issues (if distcp chose its own stage dir somewhere) and access-permission issues (if the stage dir were created as a sibling of the dest dir).

For an atomic copy, the MapReduce job copies to stagedir, and distcp finally moves the contents of stagedir to the dest dir. For an atomic update (-update together with -atomic <stagedir>), the final move happens file by file, because some of the files/dirs may already exist in the dest dir.
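
To make the two move strategies concrete, here is a minimal sketch on a local filesystem. It is not the code in the patch: the function names atomic_copy and atomic_update are hypothetical, the size comparison is a simplified stand-in for what -update checks, and it assumes the stage dir lives on the same filesystem as the dest dir so that a rename does not turn into a copy.

```python
import os
import shutil


def atomic_copy(src, stage_dir, dst):
    """Copy src into stage_dir, then publish it with a single rename."""
    if os.path.exists(dst):
        raise FileExistsError(dst)  # plain atomic copy expects a fresh dest dir
    shutil.copytree(src, stage_dir, dirs_exist_ok=True)
    # Readers of dst see either nothing or the complete tree.
    os.rename(stage_dir, dst)


def atomic_update(src, stage_dir, dst):
    """Stage only new or changed files, then move them into dst file by file."""
    os.makedirs(stage_dir, exist_ok=True)
    for root, _dirs, files in os.walk(src):
        rel = os.path.relpath(root, src)
        for name in files:
            s = os.path.join(root, name)
            d = os.path.join(dst, rel, name)
            # Assumption: "unchanged" means same size (simplified update check).
            if os.path.exists(d) and os.path.getsize(d) == os.path.getsize(s):
                continue
            t = os.path.join(stage_dir, rel, name)
            os.makedirs(os.path.dirname(t), exist_ok=True)
            shutil.copy2(s, t)

    moved = []
    for root, _dirs, files in os.walk(stage_dir):
        rel = os.path.relpath(root, stage_dir)
        for name in files:
            target_dir = os.path.join(dst, rel)
            os.makedirs(target_dir, exist_ok=True)
            # Each file is replaced atomically, but the update as a whole is not.
            os.replace(os.path.join(root, name), os.path.join(target_dir, name))
            moved.append(os.path.normpath(os.path.join(rel, name)))
    shutil.rmtree(stage_dir)
    return sorted(moved)
```

In this sketch the copy case gets its atomicity from one directory rename, while the update case can only be atomic per file, since the dest dir already exists and cannot be swapped in as a whole.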

This patch also includes the code changes for the -retries <num_tries> option (HADOOP-6060), because the two sets of changes depend on each other.
With -retries <num_tries>, distcp launches at most num_tries jobs in case of transient failures. Retries are run with the -update option enabled.
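
The retry behaviour described above can be sketched roughly as follows. This is an illustration only, not the patch code: run_with_retries and the run_job callable are hypothetical names, and run_job is assumed to return True on success and False on a transient failure.

```python
def run_with_retries(run_job, num_tries):
    """Launch at most num_tries jobs; every retry runs in update mode.

    run_job is a hypothetical callable taking update=<bool> and returning
    True on success. Returns the number of jobs that were launched.
    """
    for attempt in range(num_tries):
        # The first job runs as requested; retries skip files already copied.
        update = attempt > 0
        if run_job(update=update):
            return attempt + 1
    raise RuntimeError("copy failed after %d tries" % num_tries)
```

Running the retries in update mode means a relaunched job only has to copy what the failed job left unfinished, instead of starting over.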

The patch also contains a test case for atomic copy with job retries.

Please review and provide your comments.

> Add atomic move option
> ----------------------
>
>                 Key: HADOOP-5444
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5444
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: tools/distcp
>    Affects Versions: 0.18.3
>            Reporter: Richard Theige
>         Attachments: d_retries_atomic.patch
>
>
> Provide support for update to move directories/files atomically: copy the src directory to a tmp directory (with a random/unique name), then move that directory to its target destination name after all subdirs/files are copied and verified.
> Example option ideas:
>   hadoop ... distcp -update -move src dst
> or
>   hadoop ... distcp -update -atomic src dst
> To ensure file correctness at the destination, distcp should first compute a strong signature/cksum (e.g. MD4) on the files before it performs the 'move' at the end of the copy process.
> The issue/need for this is that applications may attempt to start processing data (because files are present), prior to completion of a whole directory copy -- resulting in work against an incomplete data set.
