Posted to common-issues@hadoop.apache.org by "Rich Haase (JIRA)" <ji...@apache.org> on 2015/04/28 01:01:39 UTC

[jira] [Commented] (HADOOP-1540) distcp should support an exclude list

    [ https://issues.apache.org/jira/browse/HADOOP-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14515182#comment-14515182 ] 

Rich Haase commented on HADOOP-1540:
------------------------------------

I have a patch for this JIRA that I've just started testing.  https://github.com/richhaase/hadoop-patches/blob/master/HADOOP-1540.branch-2.6.0.001.patch

The patch adds a -exclusions <arg> option to distcp.  The argument is a file containing a list of Java regex patterns (one per line).  Each file that is to be copied is compared against the list of exclusion patterns.  If any exclusion pattern matches, the file is not copied.
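
To illustrate the matching described above, here is a minimal standalone sketch. It is not taken from the patch: the class name ExclusionSketch, its method names, the sample patterns, and the choice to skip blank lines and require a full-path match (Matcher.matches()) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; not code from the HADOOP-1540 patch.
public class ExclusionSketch {

    // Compile one pattern per line of the exclusions file.
    // Assumption: blank lines are skipped.
    static List<Pattern> compile(List<String> lines) {
        List<Pattern> patterns = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                patterns.add(Pattern.compile(trimmed));
            }
        }
        return patterns;
    }

    // A path is excluded if any pattern matches it.
    // Assumption: the whole path must match (matches(), not find()).
    static boolean isExcluded(String path, List<Pattern> patterns) {
        for (Pattern p : patterns) {
            if (p.matcher(path).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Sample contents of an exclusions file, one regex per line.
        List<Pattern> patterns = compile(Arrays.asList(
                ".*\\.tmp",
                ".*/_temporary/.*"));

        System.out.println(isExcluded("/user/hadoop/radio/part-0001.tmp", patterns)); // true
        System.out.println(isExcluded("/user/hadoop/radio/part-0001", patterns));     // false
    }
}
```

With full-match semantics, a pattern that should hit anywhere in a path needs a leading and trailing .* as in the samples above.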

Example CLI (running with a patched JAR on a Hortonworks HDP 2.2.4 cluster):

*$ export HADOOP_USER_CLASSPATH_FIRST=true; export HADOOP_CLASSPATH=/home/rhaase/hadoop-distcp-2.6.0-20150426160037.jar; mapred distcp -update -exclusions exclude.txt /user/hadoop/radio /user/rhaase/radio*
15/04/27 15:26:55 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=true, deleteMissing=false, ignoreFailures=false, maxMaps=20, sslConfigurationFile='null', copyStrategy='uniformsize', sourceFileListing=null, sourcePaths=[/user/hadoop/radio], targetPath=/user/rhaase/radio, targetPathExists=false, preserveRawXattrs=false, exclusionsFile='exclude.txt'}
...
15/04/27 15:42:27 INFO mapreduce.Job:  map 100% reduce 0%
15/04/27 15:42:27 INFO mapreduce.Job: Job job_1429896015201_0035 completed successfully
15/04/27 15:42:27 INFO mapreduce.Job: Counters: 35
        File System Counters
                FILE: Number of bytes read=0
                FILE: Number of bytes written=2392499
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=358894362945
                HDFS: Number of bytes written=358893418844
                HDFS: Number of read operations=3214
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=633
        Job Counters
                Launched map tasks=21
                Other local map tasks=21
                Total time spent by all maps in occupied slots (ms)=4297461
                Total time spent by all reduces in occupied slots (ms)=0
                Total time spent by all map tasks (ms)=4297461
                Total vcore-seconds taken by all map tasks=4297461
                Total megabyte-seconds taken by all map tasks=4400600064
        Map-Reduce Framework
                Map input records=4296
                Map output records=0
                Input split bytes=2457
                Spilled Records=0
                Failed Shuffles=0
                Merged Map outputs=0
                GC time elapsed (ms)=4573
                CPU time spent (ms)=2571060
                Physical memory (bytes) snapshot=10379874304
                Virtual memory (bytes) snapshot=56655720448
                Total committed heap usage (bytes)=43711463424
        File Input Format Counters
                Bytes Read=941644
        File Output Format Counters
                Bytes Written=0
        org.apache.hadoop.tools.mapred.CopyMapper$Counter
                BYTESCOPIED=358893418844
                *BYTESEXCLUDED=1407553620118*
                BYTESEXPECTED=358893418844
                COPY=322
                *EXCLUDED=3974*




> distcp should support an exclude list
> -------------------------------------
>
>                 Key: HADOOP-1540
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1540
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: util
>            Reporter: Senthil Subramanian
>            Priority: Minor
>
> There should be a way to ignore specific paths (e.g., those that have already been copied over under the current srcPath).


