Posted to common-issues@hadoop.apache.org by "Laurent Goujon (JIRA)" <ji...@apache.org> on 2014/01/28 07:56:40 UTC

[jira] [Commented] (HADOOP-10295) Allow distcp to automatically identify the checksum type of source files and use it for the target

    [ https://issues.apache.org/jira/browse/HADOOP-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13883830#comment-13883830 ] 

Laurent Goujon commented on HADOOP-10295:
-----------------------------------------

Funny, I have been preparing a patch for this very same issue for a week.

Some comments regarding your patch:
* Instead of adding a new command-line option, it may be better to extend the FileAttribute enum
* MD5MD5CRC32GzipFileChecksum and MD5MD5CRC32CastagnoliFileChecksum are probably HDFS-specific (although they are available in hadoop-common). I opened HADOOP-10297 to request {{FileChecksum.getChecksumOpt()}}
* Instead of doing two instanceof checks, it is possible to check against the superclass MD5MD5CRC32FileChecksum
* EnumSet.of(CreateFlag.OVERWRITE) is not equivalent to setting the overwrite argument to true. In DistributedFileSystem, overwrite=true maps to EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE)
* A test that checks the option actually works would be nice to have (in my opinion)
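
To make the instanceof and CreateFlag points concrete, here is a minimal sketch. It uses small stand-in types so that it compiles without Hadoop on the classpath: the nested classes and enums only borrow the names discussed above and are not the real Hadoop classes, and crcTypeOf and createFlags are hypothetical helper names used for illustration only.

```java
import java.util.EnumSet;

// Stand-in types only: they mirror the names from the comment above so the
// sketch is self-contained. They are NOT the real Hadoop classes.
class ChecksumSketch {
    enum CreateFlag { CREATE, OVERWRITE, APPEND }
    enum CrcType { CRC32, CRC32C }

    static abstract class FileChecksum {}

    // Common superclass of the two CRC-specific checksum classes.
    static class MD5MD5CRC32FileChecksum extends FileChecksum {
        CrcType getCrcType() { return CrcType.CRC32; }
    }
    static class MD5MD5CRC32GzipFileChecksum extends MD5MD5CRC32FileChecksum {}
    static class MD5MD5CRC32CastagnoliFileChecksum extends MD5MD5CRC32FileChecksum {
        @Override CrcType getCrcType() { return CrcType.CRC32C; }
    }
    // A checksum from some other (hypothetical) file system.
    static class OtherChecksum extends FileChecksum {}

    // One instanceof check against the superclass covers both subclasses.
    // Returns null when the checksum is not an MD5MD5CRC32 variant.
    static CrcType crcTypeOf(FileChecksum checksum) {
        if (checksum instanceof MD5MD5CRC32FileChecksum) {
            return ((MD5MD5CRC32FileChecksum) checksum).getCrcType();
        }
        return null;
    }

    // Flag set corresponding to a boolean overwrite argument:
    // CREATE is always present, OVERWRITE is added on top of it.
    static EnumSet<CreateFlag> createFlags(boolean overwrite) {
        return overwrite
            ? EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE)
            : EnumSet.of(CreateFlag.CREATE);
    }

    public static void main(String[] args) {
        System.out.println(crcTypeOf(new MD5MD5CRC32GzipFileChecksum()));
        System.out.println(crcTypeOf(new MD5MD5CRC32CastagnoliFileChecksum()));
        System.out.println(crcTypeOf(new OtherChecksum()));
        // The flag set for overwrite=true differs from OVERWRITE alone.
        System.out.println(createFlags(true).equals(EnumSet.of(CreateFlag.OVERWRITE)));
    }
}
```

In a real patch the same shape would apply to the Hadoop types, with the CRC type read from the source file's checksum and passed to the create call on the target.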

Since I also have a patch, I'll attach it to this ticket too, and let a Hadoop maintainer help us sort them out :)

> Allow distcp to automatically identify the checksum type of source files and use it for the target
> --------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-10295
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10295
>             Project: Hadoop Common
>          Issue Type: Improvement
>    Affects Versions: 2.2.0
>            Reporter: Jing Zhao
>            Assignee: Jing Zhao
>         Attachments: HADOOP-10295.000.patch
>
>
> Currently, when running distcp, users can use "-Ddfs.checksum.type" to specify the checksum type in the target FS. This works fine if all the source files use the same checksum type. If files in the source cluster have mixed checksum types, users have to either use "-skipcrccheck" or hit a checksum mismatch exception. Thus we may need to consider adding a new option to distcp so that it can automatically identify the original checksum type of each source file and use the same checksum type in the target FS.


