Posted to common-user@hadoop.apache.org by Stuart Sierra <ma...@stuartsierra.com> on 2008/07/03 16:54:03 UTC

Difference between joining and reducing

Hello all,

After recent talk about joins, I have a (possibly) stupid question:

What is the difference between the "join" operations in
o.a.h.mapred.join and the standard merge step in a MapReduce job?

I understand that doing a join in the Mapper would be much more
efficient if you're lucky enough to have your input pre-sorted and
-partitioned.

But how is a join operation in the Reducer any different from the
shuffle/sort/merge that the MapReduce framework already does?

Be gentle.  Thanks,
-Stuart

RE: Difference between joining and reducing

Posted by Ashish Thusoo <at...@facebook.com>.
Hi Stuart,

Join is a higher-level logical operation, while map/reduce is a technique that can be used to implement it. Specifically, in relational algebra, the join construct specifies how to form a single output row from two rows arising from two input streams. There are many ways of implementing this logical operation, and traditional database systems have a number of such implementations. Map/reduce, being a system that essentially allows you to cluster data by doing a distributed sort, is amenable to the sort-based technique for doing the join. A particular implementation of the reducer gets a combined stream of data from the two or more input streams such that they match on the key. It then proceeds to generate the cartesian product of the rows from the input streams. In order to implement a join, you need to implement this join reducer yourself, which is what org.apache.hadoop.mapred.join does. I hope that clears up the confusion.
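The cartesian-product step Ashish describes can be sketched in plain Java (this is an illustrative sketch, not the Hadoop API; the class and method names are hypothetical). For one join key, the reducer sees the rows from each tagged input stream that shuffled to it, and pairs every row from one stream with every row from the other:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the cartesian-product step a reduce-side join
// reducer performs for a single key: values arrive tagged with their
// source stream, and each row of stream A is paired with each row of B.
public class ReduceJoinSketch {
    static List<String> joinForKey(List<String> streamA, List<String> streamB) {
        List<String> out = new ArrayList<>();
        for (String a : streamA) {
            for (String b : streamB) {
                out.add(a + "," + b);   // one joined output row per pair
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Rows that arrived at the reducer under the same join key.
        List<String> users  = List.of("alice:US", "alice:UK");
        List<String> orders = List.of("order1", "order2", "order3");
        // 2 x 3 = 6 joined rows: the cartesian product for this key.
        System.out.println(joinForKey(users, orders).size()); // prints 6
    }
}
```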

Cheers,
Ashish



Re: Difference between joining and reducing

Posted by Chris Douglas <ch...@yahoo-inc.com>.
Ashish ably outlined the differences between a join and a merge, but might be confusing the o.a.h.mapred.join package and the contrib/data_join framework. The former is used for map-side joins and has nothing to do with either the shuffle or the reduce; the latter effects joins in the reduce.

The critical difference between the merge phase in map/reduce and a join is that merge outputs are grouped by a comparator and consumed in sorted order, while joins involve n datasets and consumers will consider the cartesian product of selected keys (in both frameworks, equal keys). The practical differences between the two aforementioned join frameworks involve tradeoffs in efficiency and constraints on input data. -C
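To make the map-side constraint concrete, here is an illustrative sketch in plain Java (not the o.a.h.mapred.join API; class and method names are hypothetical) of why map-side joins require pre-sorted, identically partitioned inputs: when both streams are already sorted on the join key, a single forward scan lines up equal keys with no shuffle or sort at all.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of a sorted-merge join, the technique a map-side
// join relies on: both inputs must already be sorted on the join key,
// so a two-pointer scan can pair equal keys without any shuffle.
public class MapSideMergeSketch {
    // Each input row is {key, value}; keys are assumed unique per input.
    static List<String> mergeJoin(String[][] left, String[][] right) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.length && j < right.length) {
            int cmp = left[i][0].compareTo(right[j][0]);
            if (cmp < 0) i++;            // left key is behind; advance it
            else if (cmp > 0) j++;       // right key is behind; advance it
            else {                       // keys match: emit a joined row
                out.add(left[i][0] + ":" + left[i][1] + "," + right[j][1]);
                i++; j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Both inputs already sorted on the join key, as the framework requires.
        String[][] a = {{"k1", "A1"}, {"k2", "A2"}, {"k4", "A4"}};
        String[][] b = {{"k2", "B2"}, {"k3", "B3"}, {"k4", "B4"}};
        System.out.println(mergeJoin(a, b)); // prints [k2:A2,B2, k4:A4,B4]
    }
}
```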
