Posted to mapreduce-user@hadoop.apache.org by psdc1978 <ps...@gmail.com> on 2010/02/17 11:34:44 UTC

Where is duplicated data ignored?

Hi,

In Hadoop MapReduce, when I set the number of reduce tasks to run,

<property>
        <name>mapred.reduce.tasks</name>
        <value>3</value>
</property>
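
For reference, the same value can also be set programmatically on the job configuration (a minimal sketch, assuming the old org.apache.hadoop.mapred API; the driver class is omitted):

        import org.apache.hadoop.mapred.JobConf;

        // Equivalent to setting mapred.reduce.tasks=3 in the config file
        JobConf conf = new JobConf();
        conf.setNumReduceTasks(3);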

I've noticed that during the execution of a MapReduce example, the reduce
tasks make 9 requests to the MapOutputServlet on the TaskTracker. The value 9
comes from the 3 reduce tasks times the 3 input splits that produced map
output. The purpose of MapOutputServlet is to serve the map output data to a
reduce task.
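
Just to make that arithmetic explicit (the variable names here are mine, not Hadoop's):

        int numReduceTasks = 3;  // mapred.reduce.tasks
        int numMapOutputs  = 3;  // input splits that produced map output
        int requests = numReduceTasks * numMapOutputs;  // = 9 MapOutputServlet requests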

Since the merged result from my example (by the way, the example is the one
that counts words) doesn't contain duplicated data, where is the duplicated
data ignored?

- Is it the MapOutputServlet, which would detect that a split was already
requested?
- Is it the reduce task, after retrieving data from the MapOutputServlet
and before the merge phase?
- Is it during the merge phase?

Thanks for the help,

-- 
Pedro