Posted to common-dev@hadoop.apache.org by Ahmad Humayun <ah...@gmail.com> on 2008/03/03 16:43:00 UTC

intermediate map data

Hello everyone,

I have a question about the intermediate data output by the map function. I
wanted to know whether this intermediate data gets written to HDFS or stays
in the node's local memory? According to the MapReduce paper, the
intermediate data is run through a hash function which maps every key to a
given reduce worker. So how does this whole process happen? Does the map
worker write the intermediate data to HDFS and then tell the JobTracker
(master) which reduce worker should be allotted this data? Or does the map
worker keep the intermediate data in memory and make an RPC call directly to
the reduce worker (the one determined by the hash function) to transfer the
intermediate data?

It would be great if you could point me to where these functionalities are
implemented in Hadoop. It would also be great if you could point me to where
the hash function is applied in the map phase.

Thanks again for the great support on this mailing list.


regards,
-- 
Ahmad Humayun
Research Assistant
Computer Science Dpt., LUMS
+92 321 4457315

Re: intermediate map data

Posted by Ahmad Humayun <ah...@gmail.com>.
Thanks a lot, Amar. As usual, you have cleared a lot of the haze in my head :)


regards,

-- 
Ahmad Humayun
Research Assistant
Computer Science Dpt., LUMS
+92 321 4457315

Re: intermediate map data

Posted by Amar Kamat <am...@yahoo-inc.com>.
On Mon, 3 Mar 2008, Ahmad Humayun wrote:

> Hello everyone,
>
> I have a question about the intermediate data output by the map function. I
> wanted to know whether this intermediate data gets written to HDFS or stays
> in the node's local memory?
It is stored on the local disk.
> According to the MapReduce paper, the intermediate data is run through a
> hash function which maps every key to a given reduce worker. So how does
> this whole process happen? Does the map worker write the intermediate data
> to HDFS and then tell the JobTracker (master) which reduce worker should be
> allotted this data? Or does the map worker keep the intermediate data in
> memory and make an RPC call directly to the reduce worker (the one
> determined by the hash function) to transfer the intermediate data?
>
The map uses something called the partitioner. Each map writes every <k,v>
pair into the partition of the appropriate reducer, as determined by this
partition function. In the end there is a single map output file, which is
nothing but the outputs for each reducer concatenated in sequence by reduce
id. The hash function you are talking about is the partition function in
Hadoop. The JobTracker is not involved in this. Since the map has generated
output for every reducer, whenever a reducer requests a map output, the
TaskTracker indexes into the map output file and sends back the appropriate
chunk.
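
For illustration, here is a minimal sketch of such a partition function,
written against the old org.apache.hadoop.mapred API and similar in spirit
to Hadoop's built-in HashPartitioner (the class name SimpleHashPartitioner
is made up for this example):

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class SimpleHashPartitioner<K, V> implements Partitioner<K, V> {
      public void configure(JobConf job) {
        // plain hash partitioning needs no configuration
      }

      // Maps a key to one of numReduceTasks partitions. Every map task
      // applies the same function, so all <k,v> pairs for a given key
      // end up in the same reducer's partition.
      public int getPartition(K key, V value, int numReduceTasks) {
        // mask off the sign bit so the modulo result is never negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }

With R reducers this is just hash(key) mod R, which is exactly the mapping
described in the MapReduce paper.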
> It would be great if you could point me to where these functionalities are
> implemented in Hadoop.
See TaskTracker$MapOutputServlet.
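
To give a feel for the serving side, here is a hypothetical sketch of how a
concatenated map output file plus a per-reducer index can serve chunks. The
names here (MapOutputChunkServer, IndexEntry, serveChunk) are illustrative,
not Hadoop APIs; the real logic lives in the servlet above:

    import java.io.IOException;
    import java.io.OutputStream;
    import java.io.RandomAccessFile;

    public class MapOutputChunkServer {
      // One entry per reducer: where its partition starts in the
      // map output file, and how many bytes it spans.
      static class IndexEntry {
        final long offset;
        final long length;
        IndexEntry(long offset, long length) {
          this.offset = offset;
          this.length = length;
        }
      }

      private final RandomAccessFile file;  // the concatenated map output
      private final IndexEntry[] index;     // built when the map finishes

      public MapOutputChunkServer(RandomAccessFile file, IndexEntry[] index) {
        this.file = file;
        this.index = index;
      }

      // Streams only reducer `reduceId`'s slice of the map output file:
      // seek to its offset, then copy exactly its length.
      public void serveChunk(int reduceId, OutputStream out) throws IOException {
        IndexEntry e = index[reduceId];
        file.seek(e.offset);
        byte[] buf = new byte[64 * 1024];
        long remaining = e.length;
        while (remaining > 0) {
          int n = file.read(buf, 0, (int) Math.min(buf.length, remaining));
          if (n < 0) break;
          out.write(buf, 0, n);
          remaining -= n;
        }
      }
    }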
> It would also be great if you could point me to where the hash function is
> applied in the map phase.
See o.a.h.m.Partitioner.java (org.apache.hadoop.mapred.Partitioner).
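Assuming a class like the SimpleHashPartitioner sketched above, a job would
plug it in through JobConf roughly like this (again only a sketch, with the
rest of the job setup elided):

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class PartitionerDemo {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(PartitionerDemo.class);
        conf.setNumReduceTasks(4);  // R = 4 reduce workers
        // route each key to reducer hash(key) mod 4;
        // SimpleHashPartitioner is the illustrative class sketched earlier
        conf.setPartitionerClass(SimpleHashPartitioner.class);
        // ... mapper, reducer, and input/output paths would be set here ...
        JobClient.runJob(conf);
      }
    }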
>
> Thanks again for the great support on this mailing list.
>
>
> regards,
>