Posted to common-issues@hadoop.apache.org by GitBox <gi...@apache.org> on 2022/02/10 14:49:45 UTC

[GitHub] [hadoop] ayushtkn commented on pull request #3940: HADOOP-18096. Distcp: Sync moves filtered file to home directory rather than deleting.

ayushtkn commented on pull request #3940:
URL: https://github.com/apache/hadoop/pull/3940#issuecomment-1035009857


   @saintstack 
   The path is actually relative; for rename entries it is made absolute here:
   https://github.com/apache/hadoop/blob/efdec92cab8a88deb5ec9e81f5c8feb7a0fa873b/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpSync.java#L471-L474
   
   For a normal delete there is no target (it is always ``null``), so the entry is added as-is in the normal case:
   https://github.com/apache/hadoop/blob/efdec92cab8a88deb5ec9e81f5c8feb7a0fa873b/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpSync.java#L465-L468
   
   In this particular case, when filters are used, the actual entry is a ``RENAME`` entry, which has a target (a rename has to have one). So it takes this else block:
   https://github.com/apache/hadoop/blob/efdec92cab8a88deb5ec9e81f5c8feb7a0fa873b/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpSync.java#L254
   
   And when converting it to a ``DELETE`` entry, it carries the target over as well:
   https://github.com/apache/hadoop/blob/efdec92cab8a88deb5ec9e81f5c8feb7a0fa873b/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpSync.java#L262-L265
   
   But since it is now a delete entry, the path isn't made absolute with respect to the target directory, so it stays a relative path like `filterDir1`. Since it doesn't start with `/`, the normal path resolution logic resolves it against the home directory by default.
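   
   To make the resolution difference concrete, here is a minimal sketch. It uses `java.nio.file.Path` purely as a stand-in for Hadoop's `Path`, and the class name, the helper names `qualify` and `underTarget`, and the directories `/user/alice` and `/backup/dst` are all hypothetical, not taken from `DistCpSync`:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class RelativeDeleteSketch {

    // Hypothetical stand-in for path qualification: absolute paths are kept,
    // relative ones are resolved against the working directory.
    static Path qualify(Path workingDir, Path p) {
        return p.isAbsolute() ? p.normalize() : workingDir.resolve(p).normalize();
    }

    // What the rename branch effectively does: anchor the relative entry
    // under the sync target directory before it is used.
    static Path underTarget(Path targetDir, Path relative) {
        return targetDir.resolve(relative).normalize();
    }

    public static void main(String[] args) {
        Path home = Paths.get("/user/alice");      // assumed home directory
        Path targetDir = Paths.get("/backup/dst"); // assumed sync target
        Path entry = Paths.get("filterDir1");      // relative target from the diff

        // Left relative (the converted DELETE entry): resolved under home.
        System.out.println(qualify(home, entry));
        // Anchored first (the RENAME entry): stays under the target directory.
        System.out.println(qualify(home, underTarget(targetDir, entry)));
    }
}
```

   The only difference between the two calls is whether the relative entry is anchored under the target directory before it is resolved, which is the step the converted delete entry skips.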
   
   Then the code that you shared does the magic: it moves it there...
   
   One example of the target being set to ``null``:
   https://github.com/apache/hadoop/blob/efdec92cab8a88deb5ec9e81f5c8feb7a0fa873b/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpSync.java#L446-L447
   
   Maybe there could be a sanity check on the target in the delete diff, but I'm not very confident about that part. I'll explore sometime whether there is any use case where it can be non-null, and what the compatibility implications would be.
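   
   Purely as an illustration of what such a check might look like (one possible reading, not the fix in this PR), the sketch below anchors a relative target on a delete entry under the target directory. `Entry`, `Type`, and `sanitize` are hypothetical simplifications rather than the real DistCp classes, and `java.nio.file.Path` again stands in for Hadoop's `Path`:

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class DeleteTargetCheckSketch {

    enum Type { DELETE, RENAME }

    // Hypothetical, simplified diff entry; not the real DistCp class.
    static final class Entry {
        final Type type;
        final Path source;
        final Path target; // null for a plain delete

        Entry(Type type, Path source, Path target) {
            this.type = type;
            this.source = source;
            this.target = target;
        }
    }

    // One possible sanity check: if a DELETE entry carries a relative target,
    // anchor it under the target directory instead of leaving it relative.
    static Entry sanitize(Entry e, Path targetDir) {
        if (e.type == Type.DELETE && e.target != null && !e.target.isAbsolute()) {
            return new Entry(e.type, e.source, targetDir.resolve(e.target).normalize());
        }
        return e; // plain deletes (null target) and other entries pass through
    }

    public static void main(String[] args) {
        Path targetDir = Paths.get("/backup/dst"); // assumed sync target
        Entry converted = new Entry(Type.DELETE, Paths.get("/src/filterDir1"), Paths.get("filterDir1"));
        System.out.println(sanitize(converted, targetDir).target);
    }
}
```

   Whether a delete entry should instead be rejected, or have its target dropped, is exactly the open compatibility question, so treat this as a starting point for discussion only.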
   
   Further general optimisations are possible as well, like deleting directly instead of renaming to tmp and then deleting (there is a reason why it is like that). That is on my TODO list; I'll chase it in future.
   
   General info: filters are used quite a lot in DR setups, where sometimes we don't want to copy some data to replica clusters. One example is Trash data; there are many other use cases as well.
   
   Let me know if it still isn't convincing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


