Posted to common-issues@hadoop.apache.org by "Kai Xie (JIRA)" <ji...@apache.org> on 2019/03/04 13:57:00 UTC

[jira] [Comment Edited] (HADOOP-16158) DistCp supports checksum validation when copy blocks in parallel

    [ https://issues.apache.org/jira/browse/HADOOP-16158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16783384#comment-16783384 ] 

Kai Xie edited comment on HADOOP-16158 at 3/4/19 1:56 PM:
----------------------------------------------------------

Hi Steve, thanks for the comment.

`isSplit` was introduced by HADOOP-11794 ([commit|https://github.com/apache/hadoop/commit/064c8b25eca9bc825dc07a54d9147d65c9290a03#diff-a3629647166ce008e67f0a93bc9c856bR265]). It indicates whether the source data to copy is only a chunk of the file (one or more blocks, but not all of them).

That patch skips checksum validation in DistCp CopyMapper / RetriableFileCopyCommand for split sources because:
 # the copied target data is only a chunk (a few blocks) of the source file, so it cannot match the checksum of the whole source file.
 # the existing FileSystem API `getFileChecksum` works on a whole file and can't operate at the block level.
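
To make the first reason concrete, here is a minimal sketch. It assumes `java.util.zip.CRC32` as a stand-in for a whole-file checksum (it is not what `getFileChecksum` computes), and the class and method names are hypothetical. It only shows that the checksum of a copied chunk generally cannot be compared with the checksum of the whole source.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.CRC32;

class WholeFileChecksumDemo {
  // Stand-in for a whole-file checksum. Assumption: plain CRC32 is used
  // here only for illustration; it is not the algorithm Hadoop uses.
  static long checksum(byte[] data) {
    CRC32 crc = new CRC32();
    crc.update(data, 0, data.length);
    return crc.getValue();
  }

  public static void main(String[] args) {
    byte[] source = "block-0|block-1|block-2|".getBytes(StandardCharsets.US_ASCII);
    // A split copy task writes only part of the source, here the first "block".
    byte[] copiedChunk = Arrays.copyOfRange(source, 0, 8);

    // Comparing the chunk against the whole file reports a mismatch
    // even though the chunk was copied correctly.
    System.out.println("whole source vs chunk match: "
        + (checksum(source) == checksum(copiedChunk)));

    // Comparing against the same byte range of the source works,
    // but that needs a checksum over a range, not over the whole file.
    byte[] sourceRange = Arrays.copyOfRange(source, 0, 8);
    System.out.println("source range vs chunk match: "
        + (checksum(sourceRange) == checksum(copiedChunk)));
  }
}
```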

I see two options for the patch:
 # add the checksum validation in DistCp CopyCommitter after the chunks are merged back into one file ([code|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/mapred/CopyCommitter.java#L628]). The drawback is that a checksum mismatch would only be detected after the map phase, so we lose the chance to retry the copy there.
 # add an API in FileSystem like `getFileChecksum(path, start, length)` and use it in DistCp CopyMapper to validate the copied blocks against the corresponding range of the source data. But I'm not sure whether this use case is strong enough to justify adding a new API.
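
A rough sketch of how the second option could look, under stated assumptions: byte arrays stand in for files, `java.util.zip.CRC32` stands in for the file checksum, and `rangeChecksum`, `ChunkCopier` and `copyChunkWithValidation` are hypothetical names, not existing Hadoop or DistCp APIs. The point is only that a range-based checksum lets each chunk be validated and retried right where it is copied.

```java
import java.util.zip.CRC32;

class RangeChecksumSketch {
  // Hypothetical analogue of the proposed getFileChecksum(path, start, length),
  // computed over an in-memory byte array instead of a FileSystem path.
  static long rangeChecksum(byte[] data, int start, int length) {
    CRC32 crc = new CRC32();
    crc.update(data, start, length);
    return crc.getValue();
  }

  // Hypothetical copier abstraction so that a faulty copy can be simulated.
  interface ChunkCopier {
    void copy(byte[] source, int start, int length, byte[] target);
  }

  // Copies one chunk, validates it against the same range of the source,
  // and retries on mismatch. Returns the attempts used, or -1 on failure.
  static int copyChunkWithValidation(byte[] source, int start, int length,
      byte[] target, ChunkCopier copier, int maxAttempts) {
    long expected = rangeChecksum(source, start, length);
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      copier.copy(source, start, length, target);
      if (rangeChecksum(target, 0, length) == expected) {
        return attempt;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    byte[] source = new byte[64];
    for (int i = 0; i < source.length; i++) {
      source[i] = (byte) i;
    }
    byte[] target = new byte[16];
    int[] calls = {0};
    // The first attempt flips one bit to simulate corruption;
    // later attempts copy the chunk correctly.
    ChunkCopier flaky = (src, start, length, dst) -> {
      System.arraycopy(src, start, dst, 0, length);
      if (calls[0]++ == 0) {
        dst[0] ^= 0x01;
      }
    };
    int attempts = copyChunkWithValidation(source, 16, 16, target, flaky, 3);
    System.out.println("attempts needed: " + attempts);
  }
}
```

With the first option, by contrast, the comparison would run once over the merged file in the committer, so a mismatch could only fail the job rather than trigger a per-chunk retry like the loop above.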



> DistCp supports checksum validation when copy blocks in parallel
> ----------------------------------------------------------------
>
>                 Key: HADOOP-16158
>                 URL: https://issues.apache.org/jira/browse/HADOOP-16158
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: tools/distcp
>    Affects Versions: 3.2.0, 2.9.2
>            Reporter: Kai Xie
>            Assignee: Kai Xie
>            Priority: Major
>
> Copying blocks in parallel (enabled when blocks per chunk > 0) is a great DistCp improvement that can hugely speed up copying big files.
> However, checksum validation is skipped in this mode, e.g. in `RetriableFileCopyCommand.java`:
>  
> {code:java}
> if (!source.isSplit()) {
>   compareCheckSums(sourceFS, source.getPath(), sourceChecksum,
>       targetFS, targetPath);
> }
> {code}
> This could result in a checksum/data mismatch without notifying developers/users (e.g. HADOOP-16049).
> I'd like to provide a patch to add the checksum validation.
>  


