Posted to common-issues@hadoop.apache.org by "Rohit Pegallapati (JIRA)" <ji...@apache.org> on 2018/04/16 02:19:00 UTC

[jira] [Comment Edited] (HADOOP-13023) Distcp with -update feature on first time raw data not working

    [ https://issues.apache.org/jira/browse/HADOOP-13023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438921#comment-16438921 ] 

Rohit Pegallapati edited comment on HADOOP-13023 at 4/16/18 2:18 AM:
---------------------------------------------------------------------

This looks in line with the intended behavior of the -update option:

[https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html]
{code:java}
-update is used to copy files from source that don’t exist at the target or differ from the target version. -overwrite overwrites target-files that exist at the target.

The Update and Overwrite options warrant special attention since their handling of source-paths varies from the defaults in a very subtle manner. Consider a copy from /source/first/ and /source/second/ to /target/, where the source paths have the following contents:
hdfs://nn1:8020/source/first/1
hdfs://nn1:8020/source/first/2
hdfs://nn1:8020/source/second/10
hdfs://nn1:8020/source/second/20
When DistCp is invoked without -update or -overwrite, the DistCp defaults would create directories first/ and second/, under /target. Thus:
distcp hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target
would yield the following contents in /target:
hdfs://nn2:8020/target/first/1
hdfs://nn2:8020/target/first/2
hdfs://nn2:8020/target/second/10
hdfs://nn2:8020/target/second/20
When either -update or -overwrite is specified, the *contents* of the source-directories are copied to target, and not the source directories themselves. 
Thus:
distcp -update hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target

would yield the following contents in /target:

hdfs://nn2:8020/target/1
hdfs://nn2:8020/target/2
hdfs://nn2:8020/target/10
hdfs://nn2:8020/target/20
{code}
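
The path mapping quoted above can be sketched as a small shell function. This is only a model of the documented example, not DistCp code: the function name distcp_target_path and its arguments are hypothetical, and it assumes the multi-source case the documentation describes, with -overwrite treated the same as -update.

```shell
#!/bin/sh
# Hypothetical helper (not part of Hadoop): models the source-path handling
# that the DistCp documentation describes in the example above.
#   $1 = mode: "default" or "update" (the docs treat -overwrite like -update)
#   $2 = source directory
#   $3 = a file under that source directory
#   $4 = target directory
distcp_target_path() {
    _mode=$1
    _src=${2%/}                      # drop a trailing slash, if any
    _file=$3
    _target=${4%/}
    _rel=${_file#"$_src"/}           # file path relative to the source directory
    case $_mode in
        # -update: the *contents* of the source directory land in the target
        update)  printf '%s/%s\n' "$_target" "$_rel" ;;
        # default: the source directory itself is recreated under the target
        default) printf '%s/%s/%s\n' "$_target" "${_src##*/}" "$_rel" ;;
        *)       echo "unknown mode: $_mode" >&2; return 1 ;;
    esac
}

first=hdfs://nn1:8020/source/first
distcp_target_path default "$first" "$first/1" hdfs://nn2:8020/target
distcp_target_path update  "$first" "$first/1" hdfs://nn2:8020/target
```

Under this model the first call prints hdfs://nn2:8020/target/first/1 and the second prints hdfs://nn2:8020/target/1, matching the two listings in the documentation. The same mapping applied to the command in the report would place test.txt directly under /.reserved/raw/tmp/a-with-update/ted, with the source directory itself not being recreated at the target.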


was (Author: rohit.peg):
This looks in line with the intended behavior of the -update option:

[https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html]

{code}

{{-update}} is used to copy files from source that don’t exist at the target or differ from the target version. {{-overwrite}} overwrites target-files that exist at the target.

The Update and Overwrite options warrant special attention since their handling of source-paths varies from the defaults in a very subtle manner. Consider a copy from {{/source/first/}} and {{/source/second/}} to {{/target/}}, where the source paths have the following contents:
hdfs://nn1:8020/source/first/1
hdfs://nn1:8020/source/first/2
hdfs://nn1:8020/source/second/10
hdfs://nn1:8020/source/second/20
When DistCp is invoked without {{-update}} or {{-overwrite}}, the DistCp defaults would create directories {{first/}} and {{second/}}, under {{/target}}. Thus:
distcp hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target
would yield the following contents in {{/target}}:
hdfs://nn2:8020/target/first/1
hdfs://nn2:8020/target/first/2
hdfs://nn2:8020/target/second/10
hdfs://nn2:8020/target/second/20
When either {{-update}} or {{-overwrite}} is specified, the *contents* of the source-directories are copied to target, and not the source directories themselves. Thus:
distcp -update hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target
{code}

> Distcp with -update feature on first time raw data not working
> --------------------------------------------------------------
>
>                 Key: HADOOP-13023
>                 URL: https://issues.apache.org/jira/browse/HADOOP-13023
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: tools/distcp
>    Affects Versions: 2.6.0
>            Reporter: Mavin Martin
>            Priority: Major
>
> When running distcp with the -update option on encrypted data, the job reports success, but reading the encrypted file at the target path fails because the keyName does not exist.
> Please see my example to reproduce the issue.
> {code}
> [root@xxx bin]# hdfs crypto -listZones
> /tmp/a/ted                                DEF0000000000013
> [root@xxx bin]# hdfs dfs -ls -R /tmp
> drwxr-xr-x   - xxx xxx          0 2016-04-14 00:22 /tmp/a
> drwxr-xr-x   - xxx xxx          0 2016-04-14 00:00 /tmp/a/ted
> -rw-r--r--   3 xxx xxx         33 2016-04-14 00:00 /tmp/a/ted/test.txt
> [root@xxx bin]# hadoop distcp -update /.reserved/raw/tmp/a/ted /.reserved/raw/tmp/a-with-update/ted
> [root@xxx bin]# hdfs crypto -listZones
> /tmp/a/ted                                DEF0000000000013
> [root@xxx bin]# hadoop distcp /.reserved/raw/tmp/a/ted /.reserved/raw/tmp/a-no-update/ted
> [root@xxx bin]# hdfs crypto -listZones
> /tmp/a/ted                                DEF0000000000013
> /tmp/a-no-update/ted                      DEF0000000000013
> {code}
> The crypto zone for 'a-with-update' should have been created since this is a new destination.  You can verify this by looking at 'a-no-update'.


