Posted to common-user@hadoop.apache.org by David Saile <da...@uni-koblenz.de> on 2011/02/27 23:27:16 UTC
TeraSort bug?
Hi,
I have a problem concerning the TeraSort benchmark.
I am running the version that ships with hadoop-0.21.0, and if I use it as described (i.e. TeraGen, then TeraSort, then TeraValidate), everything works fine.
However, for some tests I need to run, I added a simple job between TeraGen and TeraSort that does nothing but copy the input. I included its code below.
If I run this Copy job after TeraGen, TeraSort partitions the input in such a way that most tuples go to the last reducer.
For example, if I run TeraSort with 500 MB of input and 20 reducers, I get the following distribution:
- Reducers 0-18 process ~10,000 tuples each
- Reducer 19 processes ~5,000,000 tuples
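
To make the symptom concrete, here is a toy sketch of range partitioning, assuming keys are assigned to reducers by comparing them against a sorted list of cut points. It is not TeraSort's actual partitioner code; the class name PartitionSkew, the keys, and both sets of cut points are made up for illustration. It only shows that cut points clustered at the low end of the key range push almost everything to the last reducer.

```java
import java.util.Arrays;

public class PartitionSkew {
    // Reducer i receives keys in [cuts[i-1], cuts[i]); the last reducer gets the rest.
    static int partitionFor(String key, String[] cuts) {
        int pos = Arrays.binarySearch(cuts, key);
        // binarySearch returns -(insertionPoint) - 1 when the key is not a cut point
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    // Tallies how many keys each reducer would receive for the given cut points.
    static long[] countPerReducer(String[] keys, String[] cuts) {
        long[] counts = new long[cuts.length + 1];
        for (String k : keys) {
            counts[partitionFor(k, cuts)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] keys = {"a", "c", "f", "m", "q", "t", "x", "z"};
        String[] skewedCuts = {"b", "d"}; // hypothetical: cut points taken from low keys only
        String[] evenCuts = {"g", "r"};   // hypothetical: cut points spread over the key range
        System.out.println(Arrays.toString(countPerReducer(keys, skewedCuts))); // [1, 1, 6]
        System.out.println(Arrays.toString(countPerReducer(keys, evenCuts)));   // [3, 2, 3]
    }
}
```

With the skewed cut points, the last of the three reducers receives six of the eight keys, which has the same shape as the distribution listed above.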
Can anyone reproduce this behavior? I would really appreciate any help!
David
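import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.examples.terasort.TeraInputFormat;
import org.apache.hadoop.examples.terasort.TeraOutputFormat;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Cluster;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Copies TeraGen output; no mapper or reducer class is set, so Hadoop's defaults are used.
public class Copy extends Configured implements Tool {
  public int run(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
    Job job = Job.getInstance(new Cluster(getConf()), getConf());
    // args[0]: input directory (the TeraGen output)
    Path inputDirOld = new Path(args[0]);
    TeraInputFormat.addInputPath(job, inputDirOld);
    job.setInputFormatClass(TeraInputFormat.class);
    job.setJobName("Copy");
    job.setJarByClass(Copy.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    // args[1]: output directory (later used as the TeraSort input)
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setOutputFormatClass(TeraOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int res = ToolRunner.run(new Configuration(), new Copy(), args);
    System.exit(res);
  }
}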