Posted to common-issues@hadoop.apache.org by "Rich Haase (JIRA)" <ji...@apache.org> on 2015/04/28 01:01:39 UTC
[jira] [Commented] (HADOOP-1540) distcp should support an exclude list
[ https://issues.apache.org/jira/browse/HADOOP-1540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14515182#comment-14515182 ]
Rich Haase commented on HADOOP-1540:
------------------------------------
I have a patch for this JIRA that I've just started testing. https://github.com/richhaase/hadoop-patches/blob/master/HADOOP-1540.branch-2.6.0.001.patch
The patch adds an -exclusions <arg> option to distcp. The argument is a file containing a list of Java regex patterns (one per line). Each file that is to be copied is compared against the list of exclusion patterns. If an exclusion pattern matches, the file is not copied.
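To illustrate the matching described above, here is a minimal sketch of how a list of exclusion patterns could be applied to candidate paths. It is not taken from the patch: the class name ExclusionFilter, the decision to skip blank lines, the use of full-string matches() rather than find(), and the sample patterns and paths are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not the class used in the patch.
class ExclusionFilter {
    private final List<Pattern> patterns = new ArrayList<>();

    // Each non-blank line is compiled as one Java regex pattern.
    ExclusionFilter(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) { // assumption: blank lines are ignored
                patterns.add(Pattern.compile(trimmed));
            }
        }
    }

    // A path is excluded if any pattern matches the whole path string.
    // Assumption: full match (matches()), not substring search (find()).
    boolean isExcluded(String path) {
        for (Pattern p : patterns) {
            if (p.matcher(path).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Sample patterns standing in for the contents of an exclusions file.
        ExclusionFilter filter = new ExclusionFilter(
                Arrays.asList(".*\\.tmp", ".*/_temporary/.*"));
        String[] paths = {
            "/user/hadoop/radio/part-00000",
            "/user/hadoop/radio/part-00000.tmp",
            "/user/hadoop/radio/_temporary/0/part-00000"
        };
        for (String path : paths) {
            System.out.println(path + " excluded=" + filter.isExcluded(path));
        }
    }
}
```

With full-string matching, a pattern must cover the entire path, so patterns typically start with ".*". The pattern lines could be loaded from a local file with java.nio.file.Files.readAllLines before being passed to the constructor.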
Example CLI (running with a patched JAR on a Hortonworks HDP 2.2.4 cluster):
*$ export HADOOP_USER_CLASSPATH_FIRST=true; export HADOOP_CLASSPATH=/home/rhaase/hadoop-distcp-2.6.0-20150426160037.jar; mapred distcp -update -exclusions exclude.txt /user/hadoop/radio /user/rhaase/radio*
15/04/27 15:26:55 INFO tools.DistCp: Input Options: DistCpOptions{atomicCommit=false, syncFolder=true, deleteMissing=false, ignoreFailures=false, maxMaps=20, sslConfigurationFile='null', copyStrategy='uniformsize', sourceFileListing=null, sourcePaths=[/user/hadoop/radio], targetPath=/user/rhaase/radio, targetPathExists=false, preserveRawXattrs=false, exclusionsFile='exclude.txt'}
...
15/04/27 15:42:27 INFO mapreduce.Job: map 100% reduce 0%
15/04/27 15:42:27 INFO mapreduce.Job: Job job_1429896015201_0035 completed successfully
15/04/27 15:42:27 INFO mapreduce.Job: Counters: 35
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=2392499
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=358894362945
HDFS: Number of bytes written=358893418844
HDFS: Number of read operations=3214
HDFS: Number of large read operations=0
HDFS: Number of write operations=633
Job Counters
Launched map tasks=21
Other local map tasks=21
Total time spent by all maps in occupied slots (ms)=4297461
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=4297461
Total vcore-seconds taken by all map tasks=4297461
Total megabyte-seconds taken by all map tasks=4400600064
Map-Reduce Framework
Map input records=4296
Map output records=0
Input split bytes=2457
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=4573
CPU time spent (ms)=2571060
Physical memory (bytes) snapshot=10379874304
Virtual memory (bytes) snapshot=56655720448
Total committed heap usage (bytes)=43711463424
File Input Format Counters
Bytes Read=941644
File Output Format Counters
Bytes Written=0
org.apache.hadoop.tools.mapred.CopyMapper$Counter
BYTESCOPIED=358893418844
*BYTESEXCLUDED=1407553620118*
BYTESEXPECTED=358893418844
COPY=322
*EXCLUDED=3974*
> distcp should support an exclude list
> -------------------------------------
>
> Key: HADOOP-1540
> URL: https://issues.apache.org/jira/browse/HADOOP-1540
> Project: Hadoop Common
> Issue Type: Improvement
> Components: util
> Reporter: Senthil Subramanian
> Priority: Minor
>
> There should be a way to ignore specific paths (eg: those that have already been copied over under the current srcPath).