Posted to mapreduce-commits@hadoop.apache.org by dh...@apache.org on 2010/07/06 08:26:03 UTC

svn commit: r960808 - in /hadoop/mapreduce/trunk: CHANGES.txt src/contrib/raid/src/java/org/apache/hadoop/raid/DistRaid.java

Author: dhruba
Date: Tue Jul  6 06:26:02 2010
New Revision: 960808

URL: http://svn.apache.org/viewvc?rev=960808&view=rev
Log:
MAPREDUCE-1838. Reduce the time needed for raiding a bunch of files
by randomly assigning files to map tasks. (Ramkumar Vadali via dhruba)


Modified:
    hadoop/mapreduce/trunk/CHANGES.txt
    hadoop/mapreduce/trunk/src/contrib/raid/src/java/org/apache/hadoop/raid/DistRaid.java

Modified: hadoop/mapreduce/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/CHANGES.txt?rev=960808&r1=960807&r2=960808&view=diff
==============================================================================
--- hadoop/mapreduce/trunk/CHANGES.txt (original)
+++ hadoop/mapreduce/trunk/CHANGES.txt Tue Jul  6 06:26:02 2010
@@ -146,6 +146,9 @@ Trunk (unreleased changes)
     MAPREDUCE-1894. Fixed a bug in DistributedRaidFileSystem.readFully() 
     that was causing it to loop infinitely. (Ramkumar Vadali via dhruba)
 
+    MAPREDUCE-1838. Reduce the time needed for raiding a bunch of files
+    by randomly assigning files to map tasks. (Ramkumar Vadali via dhruba)
+
 Release 0.21.0 - Unreleased
 
   INCOMPATIBLE CHANGES

Modified: hadoop/mapreduce/trunk/src/contrib/raid/src/java/org/apache/hadoop/raid/DistRaid.java
URL: http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/contrib/raid/src/java/org/apache/hadoop/raid/DistRaid.java?rev=960808&r1=960807&r2=960808&view=diff
==============================================================================
--- hadoop/mapreduce/trunk/src/contrib/raid/src/java/org/apache/hadoop/raid/DistRaid.java (original)
+++ hadoop/mapreduce/trunk/src/contrib/raid/src/java/org/apache/hadoop/raid/DistRaid.java Tue Jul  6 06:26:02 2010
@@ -324,6 +324,11 @@ public class DistRaid {
       opWriter = SequenceFile.createWriter(fs, jobconf, opList, Text.class,
           PolicyInfo.class, SequenceFile.CompressionType.NONE);
       for (RaidPolicyPathPair p : raidPolicyPathPairList) {
+        // If a large set of files are Raided for the first time, files
+        // in the same directory that tend to have the same size will end up
+        // with the same map. This shuffle mixes things up, allowing a better
+        // mix of files.
+        java.util.Collections.shuffle(p.srcPaths);
         for (FileStatus st : p.srcPaths) {
           opWriter.append(new Text(st.getPath().toString()), p.policy);
           opCount++;
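
Note on the change above: the sketch below is a hedged illustration of why shuffling the source list can even out the work per map task. It is not code from DistRaid. It assumes that consecutive entries of the operation list are handed to the same map task in fixed-size groups, which is a simplification made for this example. The class ShuffleSketch, its methods, the file sizes, and the group size of 100 are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; none of these names exist in DistRaid.
class ShuffleSketch {

    // Assumption: consecutive list entries go to the same map task,
    // perTask entries at a time. Returns the total bytes per task.
    public static long[] bytesPerTask(List<Long> sizes, int perTask) {
        int tasks = (sizes.size() + perTask - 1) / perTask;
        long[] totals = new long[tasks];
        for (int i = 0; i < sizes.size(); i++) {
            totals[i / perTask] += sizes.get(i);
        }
        return totals;
    }

    // Largest value in the array, i.e. the load of the slowest task.
    public static long max(long[] values) {
        long m = Long.MIN_VALUE;
        for (long v : values) {
            m = Math.max(m, v);
        }
        return m;
    }

    // Made-up listing: one directory with 100 files of size 1000,
    // followed by another directory with 300 files of size 10.
    public static List<Long> sampleListing() {
        List<Long> sizes = new ArrayList<Long>();
        for (int i = 0; i < 100; i++) {
            sizes.add(1000L);
        }
        for (int i = 0; i < 300; i++) {
            sizes.add(10L);
        }
        return sizes;
    }

    public static void main(String[] args) {
        List<Long> sizes = sampleListing();
        long before = max(bytesPerTask(sizes, 100));

        // Same call as in the patch, with a fixed seed for repeatability.
        Collections.shuffle(sizes, new Random(42));
        long after = max(bytesPerTask(sizes, 100));

        System.out.println("largest task before shuffle: " + before);
        System.out.println("largest task after shuffle:  " + after);
    }
}
```

In the unshuffled listing the first task receives every large file, so the job takes as long as that one task. After the shuffle the large files are likely to be spread across all four tasks, so the largest task is usually much smaller. The shuffle gives no guarantee of balance, only a better expected mix.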