Posted to common-issues@hadoop.apache.org by "Rohit Pegallapati (JIRA)" <ji...@apache.org> on 2018/04/16 02:19:00 UTC
[jira] [Comment Edited] (HADOOP-13023) Distcp with -update feature on first time raw data not working
[ https://issues.apache.org/jira/browse/HADOOP-13023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16438921#comment-16438921 ]
Rohit Pegallapati edited comment on HADOOP-13023 at 4/16/18 2:18 AM:
---------------------------------------------------------------------
This looks in line with the intended behavior of the -update option, as documented here:
[https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html]
{code:java}
-update is used to copy files from source that don’t exist at the target or differ from the target version. -overwrite overwrites target-files that exist at the target.
The Update and Overwrite options warrant special attention since their handling of source-paths varies from the defaults in a very subtle manner. Consider a copy from /source/first/ and /source/second/ to /target/, where the source paths have the following contents:
hdfs://nn1:8020/source/first/1
hdfs://nn1:8020/source/first/2
hdfs://nn1:8020/source/second/10
hdfs://nn1:8020/source/second/20
When DistCp is invoked without -update or -overwrite, the DistCp defaults would create directories first/ and second/, under /target. Thus:
distcp hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target
would yield the following contents in /target:
hdfs://nn2:8020/target/first/1
hdfs://nn2:8020/target/first/2
hdfs://nn2:8020/target/second/10
hdfs://nn2:8020/target/second/20
When either -update or -overwrite is specified, the *contents* of the source-directories are copied to target, and not the source directories themselves.
Thus:
distcp -update hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target
would yield the following contents in /target:
hdfs://nn2:8020/target/1
hdfs://nn2:8020/target/2
hdfs://nn2:8020/target/10
hdfs://nn2:8020/target/20
{code}
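To make the difference concrete, the path mapping described in the quoted documentation can be sketched in a few lines of Python. This is only an illustrative model, not DistCp's actual implementation: the function name plan_targets and its arguments are made up for this example, and it covers only the multi-source case from the documentation, not other situations such as a single source path or a target that does not exist yet.

```python
def plan_targets(sources, target, update=False):
    """Return the target path for every source file.

    sources: dict mapping a source directory to the files listed under it.
    update:  models -update/-overwrite (hypothetical flag for this sketch).
    """
    target = target.rstrip("/")
    planned = []
    for src_dir, files in sources.items():
        src_dir = src_dir.rstrip("/")
        dir_name = src_dir.rsplit("/", 1)[-1]  # e.g. "first"
        for path in files:
            if not path.startswith(src_dir + "/"):
                raise ValueError(f"{path} is not under {src_dir}")
            relative = path[len(src_dir) + 1:]  # path below the source dir
            if update:
                # -update/-overwrite: copy the *contents* of the source dir
                planned.append(f"{target}/{relative}")
            else:
                # default: recreate the source dir itself under the target
                planned.append(f"{target}/{dir_name}/{relative}")
    return planned


SOURCES = {
    "hdfs://nn1:8020/source/first": [
        "hdfs://nn1:8020/source/first/1",
        "hdfs://nn1:8020/source/first/2",
    ],
    "hdfs://nn1:8020/source/second": [
        "hdfs://nn1:8020/source/second/10",
        "hdfs://nn1:8020/source/second/20",
    ],
}
TARGET = "hdfs://nn2:8020/target"

default_plan = plan_targets(SOURCES, TARGET)
update_plan = plan_targets(SOURCES, TARGET, update=True)
```

With the example layout, default_plan keeps the first/ and second/ levels under /target, while update_plan places 1, 2, 10 and 20 directly under /target, matching the two listings in the documentation.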
was (Author: rohit.peg):
This looks inline with the intended behavior of -update option
[https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html]
{code}
{{-update}} is used to copy files from source that don’t exist at the target or differ from the target version. {{-overwrite}} overwrites target-files that exist at the target.
The Update and Overwrite options warrant special attention since their handling of source-paths varies from the defaults in a very subtle manner. Consider a copy from {{/source/first/}} and {{/source/second/}} to {{/target/}}, where the source paths have the following contents:
hdfs://nn1:8020/source/first/1
hdfs://nn1:8020/source/first/2
hdfs://nn1:8020/source/second/10
hdfs://nn1:8020/source/second/20
When DistCp is invoked without {{-update}} or {{-overwrite}}, the DistCp defaults would create directories {{first/}} and {{second/}}, under {{/target}}. Thus:
distcp hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target
would yield the following contents in {{/target}}:
hdfs://nn2:8020/target/first/1
hdfs://nn2:8020/target/first/2
hdfs://nn2:8020/target/second/10
hdfs://nn2:8020/target/second/20
When either {{-update}} or {{-overwrite}} is specified, the *contents* of the source-directories are copied to target, and not the source directories themselves. Thus:
distcp -update hdfs://nn1:8020/source/first hdfs://nn1:8020/source/second hdfs://nn2:8020/target
{code}
> Distcp with -update feature on first time raw data not working
> --------------------------------------------------------------
>
> Key: HADOOP-13023
> URL: https://issues.apache.org/jira/browse/HADOOP-13023
> Project: Hadoop Common
> Issue Type: Bug
> Components: tools/distcp
> Affects Versions: 2.6.0
> Reporter: Mavin Martin
> Priority: Major
>
> When running distcp with the -update option on encrypted data, the job reports success, but reading the encrypted file at the target path fails because the keyName does not exist there.
> Please see my example to reproduce the issue.
> {code}
> [root@xxx bin]# hdfs crypto -listZones
> /tmp/a/ted DEF0000000000013
> [root@xxx bin]# hdfs dfs -ls -R /tmp
> drwxr-xr-x - xxx xxx 0 2016-04-14 00:22 /tmp/a
> drwxr-xr-x - xxx xxx 0 2016-04-14 00:00 /tmp/a/ted
> -rw-r--r-- 3 xxx xxx 33 2016-04-14 00:00 /tmp/a/ted/test.txt
> [root@xxx bin]# hadoop distcp -update /.reserved/raw/tmp/a/ted /.reserved/raw/tmp/a-with-update/ted
> [root@xxx bin]# hdfs crypto -listZones
> /tmp/a/ted DEF0000000000013
> [root@xxx bin]# hadoop distcp /.reserved/raw/tmp/a/ted /.reserved/raw/tmp/a-no-update/ted
> [root@xxx bin]# hdfs crypto -listZones
> /tmp/a/ted DEF0000000000013
> /tmp/a-no-update/ted DEF0000000000013
> {code}
> The crypto zone for 'a-with-update' should have been created since this is a new destination. You can verify this by looking at 'a-no-update'.
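One way to check for the missing zone is to compare the `hdfs crypto -listZones` output against the directories that are expected to be encryption zones. The sketch below is a hypothetical helper, not part of Hadoop. It assumes each output line consists of a path, whitespace, and a key name, as in the listing above; real output may need more careful parsing.

```python
def parse_zones(list_zones_output):
    """Parse `hdfs crypto -listZones` text into {zone_path: key_name}.

    Assumes one "<path> <keyName>" pair per line; other lines are skipped.
    """
    zones = {}
    for line in list_zones_output.splitlines():
        parts = line.split()
        if len(parts) == 2:
            zones[parts[0]] = parts[1]
    return zones


def missing_zones(list_zones_output, expected_paths):
    """Return the expected paths that are not listed as encryption zones."""
    zones = parse_zones(list_zones_output)
    return [path for path in expected_paths if path not in zones]


# Output copied from the last -listZones call in the report above.
LIST_ZONES_OUTPUT = """\
/tmp/a/ted DEF0000000000013
/tmp/a-no-update/ted DEF0000000000013
"""

missing = missing_zones(
    LIST_ZONES_OUTPUT,
    ["/tmp/a-with-update/ted", "/tmp/a-no-update/ted"],
)
```

For the output in the report, missing contains only /tmp/a-with-update/ted, which is the destination written with -update.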
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)