Posted to common-issues@hadoop.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/01/03 06:29:00 UTC

[jira] [Work logged] (HADOOP-18056) DistCp: Filter duplicates in the source paths

     [ https://issues.apache.org/jira/browse/HADOOP-18056?focusedWorklogId=702853&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-702853 ]

ASF GitHub Bot logged work on HADOOP-18056:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 03/Jan/22 06:28
            Start Date: 03/Jan/22 06:28
    Worklog Time Spent: 10m 
      Work Description: ayushtkn commented on a change in pull request #3825:
URL: https://github.com/apache/hadoop/pull/3825#discussion_r777308040



##########
File path: hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/DistCpOptions.java
##########
@@ -234,7 +236,18 @@ public Path getSourceFileListing() {
 
   public List<Path> getSourcePaths() {
     return sourcePaths == null ?
-        null : Collections.unmodifiableList(sourcePaths);
+        null :
+        Collections.unmodifiableList(getUniquePaths(sourcePaths));
+  }
+
+  private List<Path> getUniquePaths(List<Path> srcPaths) {
+    Set<Path> uniquePaths = new LinkedHashSet<>();
+    for (Path path : srcPaths) {
+      if (!uniquePaths.add(path)) {
+        LOG.warn("Path: {} added multiple times, Ignoring the redundant entry.", path);

Review comment:
       Thanks @tomscut, I have changed it.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 702853)
    Time Spent: 1h 10m  (was: 1h)

> DistCp: Filter duplicates in the source paths
> ---------------------------------------------
>
>                 Key: HADOOP-18056
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18056
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Ayush Saxena
>            Assignee: Ayush Saxena
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Add basic filtering to remove exact duplicate paths passed in for copying.
> If the same srcPath, say /tmp/file1, is passed in the list twice, DistCp fails with DuplicateFileException after building the listing.
> It would be better to do basic filtering of duplicate paths up front.
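
As an illustration of the filtering described above, here is a minimal, self-contained sketch. It is not the code from the pull request: the class name UniquePathsSketch and the method uniquePaths are hypothetical, and plain String values stand in for org.apache.hadoop.fs.Path so the example runs without Hadoop on the classpath. It shows the same idea as the diff, namely an insertion-ordered set that drops exact repeats and warns about each one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class UniquePathsSketch {

  // Hypothetical helper: String is an assumption standing in for Hadoop's Path.
  // LinkedHashSet keeps the first occurrence of each entry and preserves the
  // order in which the source paths were given.
  static List<String> uniquePaths(List<String> srcPaths) {
    Set<String> unique = new LinkedHashSet<>();
    for (String path : srcPaths) {
      // Set.add returns false when the element is already present.
      if (!unique.add(path)) {
        System.err.println(
            "Path: " + path + " added multiple times, ignoring the redundant entry.");
      }
    }
    // Return a read-only view over a copy of the de-duplicated entries.
    return Collections.unmodifiableList(new ArrayList<>(unique));
  }

  public static void main(String[] args) {
    List<String> src = Arrays.asList("/tmp/file1", "/tmp/file2", "/tmp/file1");
    System.out.println(uniquePaths(src)); // prints [/tmp/file1, /tmp/file2]
  }
}
```

In this sketch only entries that compare equal are dropped, which matches the "exact duplicate" scope of the issue. Two different paths that resolve to the same file would still both be kept.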



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
