Posted to common-dev@hadoop.apache.org by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org> on 2007/06/19 23:13:26 UTC

[jira] Updated: (HADOOP-1292) dfs -copyToLocal should guarantee file is complete

     [ https://issues.apache.org/jira/browse/HADOOP-1292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-1292:
-------------------------------------------

    Attachment: HADOOP-1292_20070619.patch

- Added a protected method FileSystem.getLocalFileSystem().
  All calls to the deprecated method 
  FileSystem.getNamed(String name, Configuration conf)
  for getting a local file system are updated.

- In FsShell, added 
  static final String TMP_FILENAME_PREFIX = "_tmp_";
  If desired, we may add TMP_FILENAME_PREFIX to the XML configuration file.

- In FsShell, added 
  private void copyToLocal(DistributedFileSystem dfs,
                           Path src, Path dst, boolean copyCrc)
  so that src is first copied to a tmp file under the parent directory of dst,
  and tmp is then renamed to dst once the copy succeeds.
  Note that resume is not implemented (tmp file won't be reused).
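
For readers who want to see the copy-then-rename idea in isolation, here is
a minimal sketch using only java.nio.file. It is not the code in the patch:
the class name AtomicLocalCopy, the InputStream-based signature and the
cleanup in the finally block are assumptions made for illustration, and no
Hadoop classes are involved. Only the "_tmp_" prefix is taken from the
description above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical illustration class; not part of Hadoop or of the patch.
class AtomicLocalCopy {
    // Same prefix as the constant described for FsShell.
    static final String TMP_FILENAME_PREFIX = "_tmp_";

    // Copies src into a tmp file next to dst, then renames tmp to dst.
    // dst only appears (or is replaced) after the whole stream was written.
    static void copyToLocal(InputStream src, Path dst) throws IOException {
        Path parent = dst.toAbsolutePath().getParent();
        // The tmp file lives in dst's parent directory so that the final
        // rename stays on one file system.
        Path tmp = Files.createTempFile(parent, TMP_FILENAME_PREFIX, null);
        try {
            // createTempFile already created tmp, so it must be overwritten.
            Files.copy(src, tmp, StandardCopyOption.REPLACE_EXISTING);
            Files.move(tmp, dst, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // No resume: a tmp file left by a failed copy is discarded.
            // After a successful move this is a no-op.
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("copyToLocal-demo");
        Path dst = dir.resolve("part-00000");
        byte[] data = "example contents".getBytes(StandardCharsets.UTF_8);
        copyToLocal(new ByteArrayInputStream(data), dst);
        System.out.println("copied " + Files.size(dst) + " bytes to " + dst);
    }
}
```

If the copy is interrupted before the rename, dst is never created, so the
presence of dst implies a complete file. A process killed outright (for
example by ^C) may still leave a "_tmp_" file behind, since the finally
block is not guaranteed to run in that case.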

> dfs -copyToLocal should guarantee file is complete
> --------------------------------------------------
>
>                 Key: HADOOP-1292
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1292
>             Project: Hadoop
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: eric baldeschwieler
>         Attachments: HADOOP-1292_20070619.patch
>
>
> We should copy to a temporary file, maybe _tmp.<realname>, and then rename the file when the copy is complete.  Restarting a copy should reuse the _tmp file, just checksumming it.  Then ^Cing a copy will do the right thing.
> Original suggestion:
> On Apr 23, 2007, at 2:38 AM, Richard Kasperski wrote:
> I'd like to have a guarantee that a file copy is both completed and that the file is whole. In the past I've done this by copying the file to a temporary name tmp.<realname> and then moving it to <realname> once the file copy is complete. This has the following very nice property: if the <realname> exists then the file copy is complete and I'm not looking at a partial copy of the file. I believe that the copy to the cluster has both of these properties in that the file doesn't appear in a DFS directory until the whole file has been copied. The copy from the cluster to a local file system does not have these guarantees and it would be very nice if it did. There are two scenarios in which I wish to use this. First, if I ctrl-c the 'hadoop dfs -copyToLocal', I know which parts are complete and which parts aren't. Second, I can run a background compressor to compress the files as they are copied.
