Posted to hdfs-user@hadoop.apache.org by Elazar Leibovich <el...@gmail.com> on 2014/09/05 00:14:51 UTC

[ANN] Multireducers - run multiple reducers on the same mapreduce job

I'd appreciate reviews of the code and the API of multireducers - a way to
run several map and reduce classes in the same MapReduce job.

Thanks,

https://github.com/elazarl/multireducers

Usage example:

MultiJob.create().
        withMapper(SelectFirstField.class, Text.class, IntWritable.class).
        withReducer(CountFirstField.class, 1).
        withCombiner(CountFirstField.class).
        withOutputFormat(TextOutputFormat.class, Text.class, IntWritable.class).
        addTo(job);

MultiJob.create().
        withMapper(SelectSecondField.class, IntWritableInRange.class,
                IntWritable.class).
        withReducer(CountSecondField.class, 1).
        withCombiner(CountSecondField.class).
        withOutputFormat(TextOutputFormat.class, Text.class, IntWritable.class).
        addTo(job);

Motivation:

Sometimes one would like to run more than one MapReduce job over the same
input files.

A classic example: select two different fields from a CSV file with two
different mappers, and count the distinct values of each field.

Let's say we have a CSV file with employees' names and heights:

john,120
john,130
joe,180
moe,190
dough,130

We want a single MapReduce job to count how many employees we have with each
name (two johns in our case), and also how many employees we have at each
height (two employees are 130 cm tall).

The code for the mappers looks like:

// i = 0 for the first reducer, 1 for the second
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    context.write(new Text(value.toString().split(",")[i]), one);
}
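The split-and-select step the mapper performs can be sketched in plain Java,
without any Hadoop dependencies (the selectField helper and the sample rows
below are mine, for illustration only):

```java
public class SelectFieldSketch {
    // Mimics the mapper body: pick column i from a CSV line.
    // The first mapper uses i = 0 (name), the second i = 1 (height).
    static String selectField(String line, int i) {
        return line.split(",")[i];
    }

    public static void main(String[] args) {
        String row = "john,120";
        // Each mapper would emit (selected field, 1) for this row.
        System.out.println(selectField(row, 0) + " -> 1"); // keyed by name
        System.out.println(selectField(row, 1) + " -> 1"); // keyed by height
    }
}
```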

The code for the reducers looks like:

protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    context.write(key, new IntWritable(Iterables.size(values)));
}
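To check the expected results against the sample data, the whole
map-shuffle-reduce pipeline can be simulated in plain Java (again, no Hadoop;
countField is a stand-in I wrote for grouping by the selected field and
counting each group, which is what the reducers above do with Iterables.size):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MultiCountSketch {
    // Select column i from each row (the map step), group equal keys
    // (the shuffle), and count each group (the reduce step).
    static Map<String, Long> countField(List<String> rows, int i) {
        return rows.stream()
                .map(r -> r.split(",")[i])
                .collect(Collectors.groupingBy(f -> f, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
                "john,120", "john,130", "joe,180", "moe,190", "dough,130");
        // Field 0: john appears twice, everyone else once.
        System.out.println(countField(rows, 0));
        // Field 1: height 130 appears twice, the other heights once.
        System.out.println(countField(rows, 1));
    }
}
```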