Posted to common-user@hadoop.apache.org by David Saile <da...@uni-koblenz.de> on 2011/02/27 23:27:16 UTC

TeraSort bug?

Hi,

I have a problem concerning the TeraSort benchmark.
I am running the version that ships with hadoop-0.21.0, and if I use it as described (i.e. TeraGen, then TeraSort, then TeraValidate), everything works fine.

However, for some tests I need to run, I added a simple job between TeraGen and TeraSort that does nothing but copy the input. I included its code below. 

If I run this Copy job after TeraGen, TeraSort partitions the input in such a way that most tuples go to the last reducer.
For example, if I run TeraSort with 500 MB of input and 20 reducers, I get the following distribution:
- Reducers 0-18 process ~10,000 tuples each
- Reducer 19 processes ~5,000,000 tuples
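
To make the shape of this skew concrete, here is a minimal, Hadoop-free sketch. It assumes (this is not confirmed anywhere in this post) that the reducer boundaries are range split points derived from a sample of the input, and that the sample only covers the low end of the key range. The class RangePartitionSketch, its methods, and the integer keys are hypothetical and are not part of TeraSort.

```java
import java.util.Arrays;

public class RangePartitionSketch {

    // Returns the partition index for a key, given sorted split points.
    // Keys below the first split point go to partition 0; keys at or above
    // the last split point go to the last partition.
    public static int partition(int key, int[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // For an absent key, binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    // Counts how many of the keys 0 .. keyCount-1 land in each partition.
    public static long[] countPerPartition(int keyCount, int[] splitPoints) {
        long[] counts = new long[splitPoints.length + 1];
        for (int key = 0; key < keyCount; key++) {
            counts[partition(key, splitPoints)]++;
        }
        return counts;
    }

    // Builds (partitions - 1) evenly spaced split points inside [0, sampledRange),
    // i.e. the boundaries one would get if only that range had been sampled.
    public static int[] splitPoints(int sampledRange, int partitions) {
        int[] points = new int[partitions - 1];
        for (int i = 0; i < points.length; i++) {
            points[i] = (int) ((long) sampledRange * (i + 1) / partitions);
        }
        return points;
    }

    public static void main(String[] args) {
        int keys = 5_000_000;
        int reducers = 20;
        // Sample covers the whole key range: every partition gets an equal share.
        System.out.println(Arrays.toString(
                countPerPartition(keys, splitPoints(keys, reducers))));
        // Sample covers only the lowest 200,000 keys: the last partition gets the rest.
        System.out.println(Arrays.toString(
                countPerPartition(keys, splitPoints(200_000, reducers))));
    }
}
```

With the second set of split points, partitions 0-18 receive 10,000 keys each and partition 19 receives 4,810,000, which has the same shape as the distribution above. Whether this is what actually happens in TeraSort here is an open question.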

Can anyone reproduce this behavior? I would really appreciate any help!

David


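import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.examples.terasort.TeraInputFormat;
import org.apache.hadoop.examples.terasort.TeraOutputFormat;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Cluster;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Copy extends Configured implements Tool {

    public int run(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Job job = Job.getInstance(new Cluster(getConf()), getConf());

        // args[0]: directory written by TeraGen
        Path inputDirOld = new Path(args[0]);
        TeraInputFormat.addInputPath(job, inputDirOld);
        job.setInputFormatClass(TeraInputFormat.class);

        // No mapper or reducer is set, so Hadoop's identity defaults are used.
        job.setJobName("Copy");
        job.setJarByClass(Copy.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        // args[1]: directory that TeraSort will read afterwards
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setOutputFormatClass(TeraOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new Copy(), args);
        System.exit(res);
    }
}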