Posted to common-dev@hadoop.apache.org by "Richard Theige (JIRA)" <ji...@apache.org> on 2009/03/09 23:23:50 UTC

[jira] Created: (HADOOP-5444) Add atomic move option

Add atomic move option
----------------------

                 Key: HADOOP-5444
                 URL: https://issues.apache.org/jira/browse/HADOOP-5444
             Project: Hadoop Core
          Issue Type: New Feature
          Components: tools/distcp
    Affects Versions: 0.18.3
            Reporter: Richard Theige


Provide an option for distcp -update to move directories/files into place atomically: copy the src directory to a tmp directory (with a random/unique name), then move that directory to its target destination name after all subdirs/files have been copied and verified.

Example option ideas:
  hadoop ... distcp -update -move src dst
or
  hadoop ... distcp -update -atomic src dst

To ensure file correctness at the destination, distcp should compute a strong signature/checksum (e.g. MD4) on the files before it performs the 'move' at the end of the copy process.

The need for this is that applications may start processing data (because files are present) before the copy of a whole directory has completed, resulting in work against an incomplete data set.
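
The copy, verify, then rename sequence described above can be sketched on a local filesystem. This is only an illustration under stated assumptions, not distcp or HDFS code: the function names `file_digest` and `atomic_copy_dir` are hypothetical, SHA-256 stands in for the "strong signature" the request mentions, and the final rename is atomic only when the tmp directory and the destination are on the same POSIX filesystem.

```python
import hashlib
import os
import shutil
import uuid


def file_digest(path):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def atomic_copy_dir(src, dst):
    """Copy src to a uniquely named sibling of dst, verify it, then rename it to dst."""
    if os.path.exists(dst):
        raise FileExistsError(dst)
    parent = os.path.dirname(os.path.abspath(dst))
    # Hypothetical naming scheme for the tmp directory (random/unique name).
    tmp = os.path.join(parent, ".tmp-" + uuid.uuid4().hex)
    try:
        shutil.copytree(src, tmp)
        # Verify every copied file against its source before publishing.
        for root, _dirs, files in os.walk(src):
            for name in files:
                s = os.path.join(root, name)
                t = os.path.join(tmp, os.path.relpath(s, src))
                if file_digest(s) != file_digest(t):
                    raise OSError("checksum mismatch: " + t)
        # Readers see either no dst at all or the complete directory.
        os.rename(tmp, dst)
    except BaseException:
        # Do not leave a partially copied tmp directory behind.
        shutil.rmtree(tmp, ignore_errors=True)
        raise
```

A reader polling for `dst` never observes a partially populated directory in this sketch, because the directory only appears under its final name after verification succeeds.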

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (HADOOP-5444) Add atomic move option

Posted by "Ravi Gummadi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ravi Gummadi reassigned HADOOP-5444:
------------------------------------

    Assignee: Ravi Gummadi



[jira] Updated: (HADOOP-5444) Add atomic move option

Posted by "Ravi Gummadi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ravi Gummadi updated HADOOP-5444:
---------------------------------

    Attachment: d_retries_atomic.patch

Here is a patch that supports atomic copies and atomic updates.

distcp -atomic <stagedir> src* dst

To avoid quota issues (if distcp chose its own stage dir somewhere) or access permission issues (if the stage dir were a sibling of the dest dir), the stagedir is taken as an argument to the -atomic option.

The MapReduce job copies to stagedir in the case of an atomic copy, and distcp finally moves the contents of stagedir to the dest dir. In the case of an atomic update (-update and -atomic <stagedir>), the final move happens file by file, as some of the files/dirs could already exist in the dest dir.
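
The difference between the two final moves can be sketched as follows. This is a local-filesystem illustration of one reading of the description above, not code from the patch: `promote_staged` is a hypothetical name, and the sketch assumes the stage dir and the dest dir are on the same filesystem so that each rename is cheap.

```python
import os
import shutil


def promote_staged(stage_dir, dst_dir, update=False):
    """Move staged content into dst_dir, then remove the stage dir."""
    os.makedirs(dst_dir, exist_ok=True)
    if not update:
        # Atomic copy: move each top-level entry of the stage dir in one rename.
        for name in os.listdir(stage_dir):
            os.rename(os.path.join(stage_dir, name), os.path.join(dst_dir, name))
    else:
        # Atomic update: dst_dir may already hold some files/dirs, so move
        # file by file and create missing subdirectories as needed.
        for root, _dirs, files in os.walk(stage_dir):
            rel = os.path.relpath(root, stage_dir)
            target_root = os.path.normpath(os.path.join(dst_dir, rel))
            os.makedirs(target_root, exist_ok=True)
            for name in files:
                # os.replace overwrites an existing file of the same name.
                os.replace(os.path.join(root, name),
                           os.path.join(target_root, name))
    shutil.rmtree(stage_dir)
```

In the update branch, each file becomes visible individually, so the result is atomic per file rather than for the directory as a whole.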

This patch also includes the code changes for the -retries <num_tries> option (HADOOP-6060), as there are dependent code changes.
With -retries <num_tries>, distcp launches at most num_tries jobs in case of transient failures. Retries are done with the -update option enabled.
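
The retry behaviour described above could look roughly like the sketch below. It is not taken from the patch: `run_with_retries` and the `run_job(update=...)` callback are hypothetical names, and treating `OSError` as the "transient failure" is an assumption made only for illustration.

```python
def run_with_retries(run_job, num_tries):
    """Launch run_job at most num_tries times; retries run with update enabled."""
    if num_tries < 1:
        raise ValueError("num_tries must be at least 1")
    last_error = None
    for attempt in range(num_tries):
        try:
            # The first attempt runs as requested; every retry enables update
            # so that files copied by earlier attempts are not copied again.
            return run_job(update=(attempt > 0))
        except OSError as err:  # assumed stand-in for a transient failure
            last_error = err
    raise RuntimeError("job failed after %d tries" % num_tries) from last_error
```

Enabling update on retries is what makes repeated launches cheap: work completed by a failed attempt is kept rather than redone.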

The patch also contains a test case for atomic copy with job retries.

Please review and provide your comments.



[jira] Commented: (HADOOP-5444) Add atomic move option

Posted by "Hong Tang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720520#action_12720520 ] 

Hong Tang commented on HADOOP-5444:
-----------------------------------

Would the program leave partially copied directories behind if the command fails or is interrupted by the client?
