Posted to common-user@hadoop.apache.org by Kevin <kl...@gmail.com> on 2008/07/14 18:20:35 UTC

How does org.apache.hadoop.mapred.join work?

Hi,

I have found little information about this package, which looks like
it could do an "equi?" join. "Given a set of sorted datasets keyed with
the same class and yielding equal partitions, it is possible to effect
a join of those datasets prior to the map." What does "yielding equal
partitions" mean?

Thank you.

-Kevin

Re: How does org.apache.hadoop.mapred.join work?

Posted by Kevin <kl...@gmail.com>.
Thank you, Chris. This answers my questions.
-Kevin


On Mon, Jul 14, 2008 at 11:17 AM, Chris Douglas <ch...@yahoo-inc.com> wrote:
> [...]

Re: How does org.apache.hadoop.mapred.join work?

Posted by Chris Douglas <ch...@yahoo-inc.com>.
"Yielding equal partitions" means that each input source offers n
partitions, and for any given partition i (0 <= i < n), the records in
that partition are 1) sorted on the same key and 2) unique to that
partition. That is, if a key k is in partition i for a given source, k
appears in no other partition from that source, and if any other
source contains k, all of its occurrences appear in partition i of
that source. All the framework really effects is the cartesian product
of the records with matching keys, so yes, that implies equi-joins.
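
To make the "cartesian product of matching keys" concrete, here is a
minimal sketch of joining partition i of one source with partition i
of another. It is not the code in org.apache.hadoop.mapred.join; the
names PartitionJoinSketch, Rec and joinPartition are hypothetical,
keys are plain Strings compared in natural order, and both inputs are
assumed to be already sorted by key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a merge join over one pair of matching partitions.
class PartitionJoinSketch {

    // Hypothetical stand-in for a key/value record read from a partition.
    static final class Rec {
        final String key;
        final String value;
        Rec(String key, String value) { this.key = key; this.value = value; }
    }

    // Inner equi-join of two partitions that are each sorted by key.
    // For every key present in both, emits the cartesian product of the
    // records carrying that key, formatted as "key:valueA,valueB".
    static List<String> joinPartition(List<Rec> a, List<Rec> b) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).key.compareTo(b.get(j).key);
            if (cmp < 0) {
                i++;                      // key only in a: skip
            } else if (cmp > 0) {
                j++;                      // key only in b: skip
            } else {
                String k = a.get(i).key;
                int iEnd = i, jEnd = j;   // find the run of equal keys on each side
                while (iEnd < a.size() && a.get(iEnd).key.equals(k)) iEnd++;
                while (jEnd < b.size() && b.get(jEnd).key.equals(k)) jEnd++;
                for (int x = i; x < iEnd; x++) {
                    for (int y = j; y < jEnd; y++) {
                        out.add(k + ":" + a.get(x).value + "," + b.get(y).value);
                    }
                }
                i = iEnd;
                j = jEnd;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> a = Arrays.asList(new Rec("a", "1"), new Rec("b", "2"), new Rec("b", "3"));
        List<Rec> b = Arrays.asList(new Rec("b", "x"), new Rec("b", "y"), new Rec("c", "z"));
        System.out.println(joinPartition(a, b));
    }
}
```

The single forward pass only works because both partitions are sorted
and because no other partition of either source can contain the same
key, which is exactly what conditions 1) and 2) provide.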

It's a fairly strict requirement. Satisfying it is less onerous if one
is joining the output of several m/r jobs, provided each job uses the
same keys and partitioner and the same number of reduces, and no
output file (part-xxxxx) of any job is splittable. In this case, n is
equal to the number of output files from each job (the number of
reduces); (1) is satisfied if the reduce emits records in the same
order in which it receives them (i.e. no new keys, no records out of
order); and (2) is guaranteed by the partitioner and (1).
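
As an illustration of why a shared partitioner and a shared number of
reduces give (2), the sketch below partitions two key sets with the
same arithmetic that Hadoop's default HashPartitioner uses,
(hashCode & Integer.MAX_VALUE) % n, and then checks the property. The
class and method names (EqualPartitionsSketch, partitionOf, partition,
isWellPartitioned, joinable) are hypothetical, and sorting uses String
natural order where a real job would use its key comparator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative only: simulates the part-xxxxx layout of jobs that share
// a partitioner and a reduce count.
class EqualPartitionsSketch {

    // Same arithmetic as the default HashPartitioner.
    static int partitionOf(String key, int n) {
        return (key.hashCode() & Integer.MAX_VALUE) % n;
    }

    // Splits keys into n partitions and sorts each one, mimicking the
    // sorted output files of a job with n reduces.
    static List<List<String>> partition(List<String> keys, int n) {
        List<List<String>> parts = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            parts.add(new ArrayList<String>());
        }
        for (String k : keys) {
            parts.get(partitionOf(k, n)).add(k);
        }
        for (List<String> p : parts) {
            Collections.sort(p);
        }
        return parts;
    }

    // True if every partition is sorted and holds only keys that the
    // shared partitioner assigns to it.
    static boolean isWellPartitioned(List<List<String>> parts) {
        int n = parts.size();
        for (int i = 0; i < n; i++) {
            String prev = null;
            for (String k : parts.get(i)) {
                if (partitionOf(k, n) != i) return false;
                if (prev != null && prev.compareTo(k) > 0) return false;
                prev = k;
            }
        }
        return true;
    }

    // Two sources can be joined partition by partition only if they
    // offer the same n and both respect the shared partitioner.
    static boolean joinable(List<List<String>> a, List<List<String>> b) {
        return a.size() == b.size() && isWellPartitioned(a) && isWellPartitioned(b);
    }

    public static void main(String[] args) {
        List<String> job1 = Arrays.asList("alice", "bob", "carol", "dave");
        List<String> job2 = Arrays.asList("carol", "erin", "bob");
        System.out.println(joinable(partition(job1, 3), partition(job2, 3)));
        System.out.println(joinable(partition(job1, 3), partition(job2, 4)));
    }
}
```

With the same n on both sides, any key shared by the two sources lands
at the same partition index in each; changing the reduce count of one
job breaks that alignment, which is why the number of reduces has to
match.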

An InputFormat capable of parsing metadata about each source to  
generate partitions from the set of input sources is ideal, but I can  
point to no existing implementation. -C
