Posted to common-user@hadoop.apache.org by Jørn Schou-Rode <js...@malamute.dk> on 2010/03/03 21:47:33 UTC

Efficient implementation of "MapReduceReduce" in Hadoop

After mapping and reducing some data, I need to do an additional
processing step. This additional step shares the contract of a reduce
function, expecting its input data (the output from the original reduce)
to be grouped by key.

Currently, I achieve the above using two chained jobs, sketched below:

 1. MyMapper -> MyFirstReducer
 2. IdentityMapper -> MySecondReducer
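
For reference, the driver behind these two jobs looks roughly like the
sketch below (old-style org.apache.hadoop.mapred API as in 0.18.x;
MyMapper, MyFirstReducer and MySecondReducer are my own classes, and
the key/value type setters are omitted for brevity):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;

    public class ChainedJobs {
        public static void main(String[] args) throws Exception {
            Path input = new Path(args[0]);
            Path intermediate = new Path(args[1]); // lands on HDFS between jobs
            Path output = new Path(args[2]);

            // Job 1: MyMapper -> MyFirstReducer
            JobConf first = new JobConf(ChainedJobs.class);
            first.setJobName("map-reduce");
            first.setMapperClass(MyMapper.class);
            first.setReducerClass(MyFirstReducer.class);
            FileInputFormat.setInputPaths(first, input);
            FileOutputFormat.setOutputPath(first, intermediate);
            JobClient.runJob(first); // blocks until job 1 completes

            // Job 2: IdentityMapper -> MySecondReducer
            JobConf second = new JobConf(ChainedJobs.class);
            second.setJobName("identity-reduce");
            second.setMapperClass(IdentityMapper.class);
            second.setReducerClass(MySecondReducer.class);
            FileInputFormat.setInputPaths(second, intermediate);
            FileOutputFormat.setOutputPath(second, output);
            JobClient.runJob(second);
        }
    }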

As my project is purely academic, I am wondering if this approach really
is the best I can do with respect to performance. Unless Hadoop has some
built-in optimization around the IdentityMapper class (v0.18.3), I
believe my current approach forces the intermediate data between the
two reduce phases to be written to HDFS in full and immediately read
back, solely to feed the identity map phase of the second job.

Can Hadoop be instructed to completely skip running the IdentityMapper
in my application? Or is there some other/better way to do
"MapReduceReduce"?

Thanks in advance.

/Jørn