Posted to common-dev@hadoop.apache.org by "eric baldeschwieler (JIRA)" <ji...@apache.org> on 2007/04/24 19:27:15 UTC

[jira] Created: (HADOOP-1292) dfs -copyToLocal should guarantee file is complete

dfs -copyToLocal should guarantee file is complete
--------------------------------------------------

                 Key: HADOOP-1292
                 URL: https://issues.apache.org/jira/browse/HADOOP-1292
             Project: Hadoop
          Issue Type: Improvement
          Components: dfs
            Reporter: eric baldeschwieler


We should copy to a temporary file, maybe _tmp.<realname>, and then rename the file when the copy is complete.  Restarting a copy should reuse the _tmp file, just checksumming it.  Then ^Cing a copy will do the right thing.
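The restart behaviour proposed here (reuse the _tmp file after checksumming it) could look roughly like the sketch below. It is an illustration only, not code from any patch on this issue: the class and method names are hypothetical, plain local files stand in for the DFS source, and a CRC32 over the already-copied prefix stands in for whatever checksum the file system actually keeps. A real implementation would presumably compare against stored checksums rather than re-read the source.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;
import java.util.zip.CRC32;

// Hypothetical helper, not part of Hadoop.
final class ResumableCopy {
    /** CRC32 of the first len bytes of file. */
    static long crcOfPrefix(Path file, long len) throws IOException {
        CRC32 crc = new CRC32();
        byte[] buf = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            long left = len;
            while (left > 0) {
                int n = in.read(buf, 0, (int) Math.min(buf.length, left));
                if (n < 0) {
                    break; // file shorter than len
                }
                crc.update(buf, 0, n);
                left -= n;
            }
        }
        return crc.getValue();
    }

    /**
     * Copies src to dst through tmp. If tmp already exists and is a verified
     * prefix of src, only the missing tail is appended. Returns the number of
     * bytes that were reused from the old tmp file.
     */
    static long copy(Path src, Path tmp, Path dst) throws IOException {
        long have = Files.exists(tmp) ? Files.size(tmp) : 0L;
        boolean reusable = have > 0 && have <= Files.size(src)
                && crcOfPrefix(tmp, have) == crcOfPrefix(src, have);
        if (!reusable) {
            have = 0L; // start over; tmp is truncated below
        }
        try (InputStream in = Files.newInputStream(src);
             OutputStream out = reusable
                     ? Files.newOutputStream(tmp, StandardOpenOption.APPEND)
                     : Files.newOutputStream(tmp)) {
            long toSkip = have;
            while (toSkip > 0) {
                long skipped = in.skip(toSkip);
                if (skipped <= 0) {
                    throw new IOException("could not skip verified prefix");
                }
                toSkip -= skipped;
            }
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) != -1; ) {
                out.write(buf, 0, n);
            }
        }
        // dst only appears once tmp holds the whole file.
        Files.move(tmp, dst, StandardCopyOption.REPLACE_EXISTING);
        return have;
    }
}
```

With this shape, an interrupted copy leaves only the _tmp file behind, and rerunning the same copy picks up where it stopped if the prefix still matches.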

Original suggestion:

On Apr 23, 2007, at 2:38 AM, Richard Kasperski wrote:

I'd like to have a guarantee that a file copy is both completed and that the file is whole. In the past I've done this by copying the file to a temporary name tmp.<realname> and then moving it to <realname> once the file copy is complete. This has the following very nice property: if <realname> exists, then the file copy is complete and I'm not looking at a partial copy of the file. I believe that the copy to the cluster has both of these properties, in that the file doesn't appear in a DFS directory until the whole file has been copied. The copy from the cluster to a local file system does not have these guarantees, and it would be very nice if it did. There are two scenarios in which I wish to use this. First, if I ctrl-c the 'hadoop dfs -copyToLocal', I know which parts are complete and which parts aren't. Second, I can run a background compressor to compress the files as they are copied.
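The second scenario (a background compressor working alongside the copy) relies on being able to tell finished files from in-progress ones by name alone. A minimal sketch of that consumer side, assuming in-progress files carry a known prefix such as "tmp."; the class name and the prefix argument are hypothetical:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Hadoop.
final class CompletedFiles {
    /**
     * Returns the names of regular files in dir that do not start with
     * tmpPrefix, sorted by name. Under the copy-then-rename convention these
     * are exactly the files whose copy has finished.
     */
    static List<String> list(Path dir, String tmpPrefix) throws IOException {
        List<String> done = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path p : entries) {
                String name = p.getFileName().toString();
                if (Files.isRegularFile(p) && !name.startsWith(tmpPrefix)) {
                    done.add(name);
                }
            }
        }
        Collections.sort(done);
        return done;
    }
}
```

A compressor could poll this list and process each name once, never touching a partially written file.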


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-1292) dfs -copyToLocal should guarantee file is complete

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-1292:
-------------------------------------------

    Attachment:     (was: HADOOP-1292_20070619b.patch)



[jira] Updated: (HADOOP-1292) dfs -copyToLocal should guarantee file is complete

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-1292:
-------------------------------------------

    Attachment:     (was: HADOOP-1292_20070620.patch)



[jira] Updated: (HADOOP-1292) dfs -copyToLocal should guarantee file is complete

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-1292:
-------------------------------------------

    Attachment: HADOOP-1292_20070621.patch

Removed the ChecksumFileSystem.getChecksumFile(File).



[jira] Commented: (HADOOP-1292) dfs -copyToLocal should guarantee file is complete

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507028 ] 

dhruba borthakur commented on HADOOP-1292:
------------------------------------------

+1. Code looks great! Thanks.



[jira] Commented: (HADOOP-1292) dfs -copyToLocal should guarantee file is complete

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507129 ] 

dhruba borthakur commented on HADOOP-1292:
------------------------------------------

The test creates files/directories in /test/copytolocal. This could cause the test to fail if this directory does not have write permissions, etc. It is better to use FsShell.TEST_ROOT_DIR as the temporary work directory for the test.

> dfs -copyToLocal should guarantee file is complete
> --------------------------------------------------
>
>                 Key: HADOOP-1292
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1292
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: eric baldeschwieler
>            Assignee: Tsz Wo (Nicholas), SZE
>         Attachments: HADOOP-1292_20070621c.patch


[jira] Updated: (HADOOP-1292) dfs -copyToLocal should guarantee file is complete

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-1292:
-------------------------------------------

    Attachment: HADOOP-1292_20070619.patch

- Added a protected method FileSystem.getLocalFileSystem().
  All calls to the deprecated method 
  FileSystem.getNamed(String name, Configuration conf)
  for getting a local file system are updated.

- In FsShell, added 
  static final String TMP_FILENAME_PREFIX = "_tmp_";
  If desired, we may add TMP_FILENAME_PREFIX to the XML configuration file.

- In FsShell, added 
  private void copyToLocal(DistributedFileSystem dfs,
                           Path src, Path dst, boolean copyCrc)
  so that src is first copied to a tmp file under the parent directory of dst,
  and tmp is then renamed to dst after the copy succeeds.
  Note that resume is not implemented (the tmp file won't be reused).
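The copy-then-rename flow described in the last item can be sketched with the standard java.nio.file API. This is not the patch itself: the class name is hypothetical, an InputStream stands in for the DFS source file, and there is no crc handling. As in the patch at this stage, nothing is resumed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical stand-in for the FsShell logic, not the committed code.
final class TmpThenRenameCopy {
    // Same value as the constant mentioned above.
    static final String TMP_FILENAME_PREFIX = "_tmp_";

    /** Copies src to dst via a tmp sibling so dst never holds partial data. */
    static void copyToLocal(InputStream src, Path dst) throws IOException {
        Path target = dst.toAbsolutePath();
        // tmp lives in dst's parent directory, so the rename stays on one file system.
        Path tmp = target.getParent().resolve(TMP_FILENAME_PREFIX + target.getFileName());
        try {
            Files.copy(src, tmp, StandardCopyOption.REPLACE_EXISTING);
            // Whether an existing dst is replaced by ATOMIC_MOVE is platform specific.
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        } finally {
            // Cleans up after a failed copy; a no-op once the move has succeeded.
            Files.deleteIfExists(tmp);
        }
    }
}
```

If the process is interrupted during the copy, at most the "_tmp_" file exists; dst appears only after the rename.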



[jira] Updated: (HADOOP-1292) dfs -copyToLocal should guarantee file is complete

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-1292:
-------------------------------------------

    Attachment: HADOOP-1292_20070619b.patch

Updated the patch so that it satisfies checkstyle.



[jira] Updated: (HADOOP-1292) dfs -copyToLocal should guarantee file is complete

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-1292:
-------------------------------------------

    Attachment:     (was: HADOOP-1292_20070619.patch)



[jira] Commented: (HADOOP-1292) dfs -copyToLocal should guarantee file is complete

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507594 ] 

Hudson commented on HADOOP-1292:
--------------------------------

Integrated in Hadoop-Nightly #133 (See [http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/133/])



[jira] Commented: (HADOOP-1292) dfs -copyToLocal should guarantee file is complete

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506427 ] 

dhruba borthakur commented on HADOOP-1292:
------------------------------------------

Looks good. A few comments:

1. I would rather not add a new method to FileSystem. Instead I would use FileSystem.get(Uri, conf) to get the local file system wherever needed in FsShell.java.
2. The tmp file prefix or suffix could be "tmp.fsshell" so that it is helpful for debugging certain scenarios. Most applications use "tmp" or some variation of that.
3. I am unable to understand the behaviour of the "another" file in FsShell.copyToLocal. Will discuss this with you.
4. Maybe enhance TestDFSShell.java to encompass this scenario. At least invoke FsShell.copyToLocal.



[jira] Commented: (HADOOP-1292) dfs -copyToLocal should guarantee file is complete

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12507457 ] 

dhruba borthakur commented on HADOOP-1292:
------------------------------------------

Ok, I misunderstood the code. The /test/copytolocal is a dfs directory and not a local directory.

+1.



[jira] Updated: (HADOOP-1292) dfs -copyToLocal should guarantee file is complete

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-1292:
---------------------------------

       Resolution: Fixed
    Fix Version/s: 0.14.0
           Status: Resolved  (was: Patch Available)

I just committed this.  Thanks, Tsz Wo!



[jira] Updated: (HADOOP-1292) dfs -copyToLocal should guarantee file is complete

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-1292:
-------------------------------------------

    Attachment: HADOOP-1292_20070620.patch

Regarding Dhruba's comments:

1. I would rather not add a new method to FileSystem. Instead I would use FileSystem.get(Uri, conf) to get the local file system wherever needed in FsShell.java.

The method FileSystem.getLocalFileSystem() is not necessary since there is already a method FileSystem.getLocal(conf).  I did not change anything in FileSystem.java in this new patch.

2. The tmp file prefix or suffix could be "tmp.fsshell" so that it is helpful for debugging certain scenarios. Most applications use "tmp" or some variation of that.

Using "_copyToLocal_" now.

3. I am unable to understand the behaviour of "another" file in FsShell.copyToLocal. Will discuss this with you.

If renaming tmp to dst fails, tmp is renamed to another file, since tmp will be deleted on exit.  The error message tells the user that src was copied to "another" successfully but could not be renamed to dst.
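A rough sketch of that fallback, using java.nio.file rather than the Hadoop classes. The class name, the naming of the "another" file, and the choice to return the final path are all assumptions made for illustration; the patch may differ in each of these.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical illustration, not the code in the patch.
final class RenameWithFallback {
    /**
     * Moves tmp to dst. If that fails, moves tmp to a fresh "another" file
     * next to dst and returns that path, so the caller can report where the
     * copied data ended up.
     */
    static Path finish(Path tmp, Path dst) throws IOException {
        // tmp is scheduled for deletion on exit, which is why the data
        // must be moved elsewhere when the rename to dst fails.
        tmp.toFile().deleteOnExit();
        try {
            return Files.move(tmp, dst); // throws if dst already exists
        } catch (IOException renameFailed) {
            Path parent = dst.toAbsolutePath().getParent();
            // Hypothetical naming scheme for the "another" file.
            Path another = Files.createTempFile(parent, dst.getFileName() + "_", ".copied");
            // createTempFile left an empty placeholder; overwrite it with the data.
            return Files.move(tmp, another, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```

The caller compares the returned path with dst to decide whether to print the "copied but could not rename" message.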

4. Maybe enhance TestDFSShell.java to encompass this scenario. At least invoke FsShell.copyToLocal

TestDFSShell.java already has some tests for "-get" (i.e. -copyToLocal).  The scenario not tested is the VM being killed in the middle of copying.  It takes some work to build such tests, and it seems to me that such an exceptional scenario is not worth a lot of testing effort.  Also, FsShell.copyToLocal is private, so we cannot invoke it directly in a unit test.



[jira] Assigned: (HADOOP-1292) dfs -copyToLocal should guarantee file is complete

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur reassigned HADOOP-1292:
----------------------------------------

    Assignee: Tsz Wo (Nicholas), SZE



[jira] Updated: (HADOOP-1292) dfs -copyToLocal should guarantee file is complete

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-1292:
-------------------------------------------

    Attachment:     (was: HADOOP-1292_20070621.patch)



[jira] Updated: (HADOOP-1292) dfs -copyToLocal should guarantee file is complete

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-1292:
-------------------------------------------

    Status: Patch Available  (was: Open)



[jira] Commented: (HADOOP-1292) dfs -copyToLocal should guarantee file is complete

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506774 ] 

dhruba borthakur commented on HADOOP-1292:
------------------------------------------

Code changes look good. However, I would rather not introduce a new API, ChecksumFileSystem.getChecksumFile(File). Instead, I would use the existing ChecksumFileSystem.getChecksumFile(Path) method.




[jira] Updated: (HADOOP-1292) dfs -copyToLocal should guarantee file is complete

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-1292:
-------------------------------------------

    Attachment: HADOOP-1292_20070621c.patch

Fixed a bug in copying a directory tree and added a unit test to TestDFSShell.



[jira] Issue Comment Edited: (HADOOP-1292) dfs -copyToLocal should guarantee file is complete

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506995 ] 

Tsz Wo (Nicholas), SZE edited comment on HADOOP-1292 at 6/21/07 12:28 PM:
--------------------------------------------------------------------------

Removed the method ChecksumFileSystem.getChecksumFile(File).


 was:
Removed the ChecksumFileSystem.getChecksumFile(File).
