Posted to mapreduce-user@hadoop.apache.org by Mike Spreitzer <ms...@us.ibm.com> on 2012/01/15 09:20:36 UTC

What is the right way to do map-side joins in Hadoop 1.0?

I have a problem that needs to be solved by an iteration of MapReduce 
jobs, and in each iteration I need to start by doing an equijoin between a 
large constant dataset and the output of the previous iteration; the 
remainder of my map function works on the joined tuples in a way whose 
details are not important here.  I am happy to describe the reduce output 
as (key,value) pairs, and the large constant dataset can be described that 
way too; the join condition is then equality of keys.  What is the best 
way to get those equijoins done in the maps?
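
To make the shape of the problem concrete, here is a minimal sketch of 
the driver loop I have in mind (the class names, paths, and iteration 
count are placeholders; the open question is how the map input becomes 
the equijoin):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class IterativeDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path previous = new Path("iteration-000");      // seed data
        for (int i = 1; i <= 10; i++) {
          Job job = new Job(conf, "iteration-" + i);
          job.setJarByClass(IterativeDriver.class);
          // Identity classes stand in for my real mapper and reducer; the
          // real mapper needs to see (constant, previous) joined tuples.
          job.setMapperClass(Mapper.class);
          job.setReducerClass(Reducer.class);
          // This only reads the previous output; the join is the question.
          FileInputFormat.addInputPath(job, previous);
          Path next = new Path(String.format("iteration-%03d", i));
          FileOutputFormat.setOutputPath(job, next);
          if (!job.waitForCompletion(true)) System.exit(1);
          previous = next;
        }
      }
    }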

I presume I should be looking for a solution using 
org.apache.hadoop.mapreduce.* rather than org.apache.hadoop.mapred.*.

I do not want to cache the entirety of the large constant dataset in 
memory during the setup method of my Mapper --- that would require way too 
much memory.  I do not even want to make a copy of the entirety of the 
large constant dataset in the local filesystem of each node in my cluster. 
What I want is to have the large constant dataset partitioned among my 
nodes.  It is OK (even preferable) if there is a bit of replication. 
Thus, storing it in HDFS --- as one file or as a collection of files --- 
would be fine.

Thanks,
Mike

Re: What is the right way to do map-side joins in Hadoop 1.0?

Posted by Mike Spreitzer <ms...@us.ibm.com>.
Yes, I did look at CompositeInputFormat.  That is why I remarked that I 
presume I should be looking under org.apache.hadoop.mapreduce.*, and why I 
sent the earlier question about why CompositeInputFormat is not under 
org.apache.hadoop.mapreduce.* in Hadoop 1.0.0.  But I have gotten no 
answers yet.
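
For anyone following along, here is a minimal sketch of how I understand 
the old-API usage would look, assuming both inputs are SequenceFiles that 
are sorted by key and identically partitioned (the paths are 
placeholders):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.join.CompositeInputFormat;

    public class JoinDriver {
      public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(JoinDriver.class);
        job.setJobName("map-side equijoin");
        job.setInputFormat(CompositeInputFormat.class);
        // "inner" keeps only keys present in both inputs; each map record
        // is then (key, TupleWritable), one tuple slot per input.
        job.set("mapred.join.expr", CompositeInputFormat.compose(
            "inner", SequenceFileInputFormat.class,
            new Path("constant-dataset"), new Path("previous-output")));
        // The real job would set a Mapper<K, TupleWritable, ...> that
        // unpacks value.get(0) and value.get(1), plus the reducer.
        FileOutputFormat.setOutputPath(job, new Path("joined-output"));
        JobClient.runJob(job);
      }
    }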

And yes, I very much want to do the joins as 'early' as possible (i.e., in 
the mapper not the reducer); I do not want to waste work sending copies of 
the large constant dataset around if I do not have to.

BTW, my outline below was written too hastily; I see no obvious way to get 
the large constant dataset and the previous iteration's output placed 
consistently, which is going to be needed for top performance.
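
To be precise about the gap: consistent partitioning looks easy enough 
(write both datasets with the same partitioner and the same number of 
reduce tasks, so the part files pair up by key); it is consistent 
placement of the corresponding blocks that I see no handle on.  A sketch 
of the partitioning half, with a placeholder reducer count:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.HashPartitioner;

    public class ConsistentPartitioning {
      // Apply to both the job that writes the constant dataset and every
      // iteration job, so part-00000 ... part-00031 align by key space.
      public static void apply(JobConf job) {
        job.setPartitionerClass(HashPartitioner.class); // same everywhere
        job.setNumReduceTasks(32);                      // same everywhere
      }
    }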

Thanks,
Mike



From:   Bejoy Ks <be...@gmail.com>
To:     mapreduce-user@hadoop.apache.org
Date:   01/15/2012 07:49 AM
Subject:        Re: What is the right way to do map-side joins in Hadoop 1.0?



Hi Mike
           Have a look at CompositeInputFormat. I guess it is what you are 
looking for to achieve map-side joins. If you are fine with a reduce-side 
join, go with MultipleInputs. I have tried the same sort of joins using 
MultipleInputs and have scribbled something on the same. Check out 
whether it'd be useful for you. (A very crude implementation :), you may 
have better ways.)
http://kickstarthadoop.blogspot.com/2011/09/joins-with-plain-map-reduce.html

Hope it helps!

Regards
Bejoy.K.S

On Sun, Jan 15, 2012 at 4:34 PM, Mike Spreitzer <ms...@us.ibm.com> 
wrote:
BTW, each key appears exactly once in the large constant dataset, and 
exactly once in each MR job's output. 

I am thinking the right approach is to consistently partition the job 
output and the large constant dataset, with the number of partitions being 
the number of reduce tasks; each part goes into its own file.  Make an 
InputFormat whose number of splits equals the number of reduce tasks. 
Reading a split will consist of reading the corresponding pair of files, 
stepping through each in tandem.  This seems like something that should 
already be provided somewhere in org.apache.hadoop.mapreduce.*. 

Thanks, 
Mike


Re: What is the right way to do map-side joins in Hadoop 1.0?

Posted by Bejoy Ks <be...@gmail.com>.
Hi Mike
           Have a look at CompositeInputFormat. I guess it is what you are
looking for to achieve map-side joins. If you are fine with a reduce-side
join, go with MultipleInputs. I have tried the same sort of joins
using MultipleInputs and have scribbled something on the same. Check
out whether it'd be useful for you. (A very crude implementation :), you
may have better ways.)
http://kickstarthadoop.blogspot.com/2011/09/joins-with-plain-map-reduce.html
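
The skeleton of it looks something like the following (a rough sketch; 
the paths and the tab-separated record layout are assumptions, and a 
reducer that groups the tagged values for each key would complete the 
join):

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.lib.MultipleInputs;

    public class ReduceSideJoin {
      // Each mapper tags its records so the reducer can tell the two
      // datasets apart when it sees all values for one key together.
      public static class ConstantMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable off, Text line,
            OutputCollector<Text, Text> out, Reporter rep)
            throws IOException {
          String[] kv = line.toString().split("\t", 2);
          out.collect(new Text(kv[0]),
              new Text("C\t" + (kv.length > 1 ? kv[1] : "")));
        }
      }

      public static class IterationMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable off, Text line,
            OutputCollector<Text, Text> out, Reporter rep)
            throws IOException {
          String[] kv = line.toString().split("\t", 2);
          out.collect(new Text(kv[0]),
              new Text("I\t" + (kv.length > 1 ? kv[1] : "")));
        }
      }

      public static void wire(JobConf job) {
        MultipleInputs.addInputPath(job, new Path("constant-dataset"),
            TextInputFormat.class, ConstantMapper.class);
        MultipleInputs.addInputPath(job, new Path("previous-output"),
            TextInputFormat.class, IterationMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        // job.setReducerClass(...) joins the "C" and "I" values per key.
      }
    }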

Hope it helps!

Regards
Bejoy.K.S

On Sun, Jan 15, 2012 at 4:34 PM, Mike Spreitzer <ms...@us.ibm.com> wrote:

> BTW, each key appears exactly once in the large constant dataset, and
> exactly once in each MR job's output.
>
> I am thinking the right approach is to consistently partition the job
> output and the large constant dataset, with the number of partitions being
> the number of reduce tasks; each part goes into its own file.  Make an
> InputFormat whose number of splits equals the number of reduce tasks.
> Reading a split will consist of reading the corresponding pair of files,
> stepping through each in tandem.  This seems like something that should
> already be provided somewhere in org.apache.hadoop.mapreduce.*.
>
> Thanks,
> Mike

Re: What is the right way to do map-side joins in Hadoop 1.0?

Posted by Mike Spreitzer <ms...@us.ibm.com>.
BTW, each key appears exactly once in the large constant dataset, and 
exactly once in each MR job's output.

I am thinking the right approach is to consistently partition the job 
output and the large constant dataset, with the number of partitions being 
the number of reduce tasks; each part goes into its own file.  Make an 
InputFormat whose number of splits equals the number of reduce tasks. 
Reading a split will consist of reading the corresponding pair of files, 
stepping through each in tandem.  This seems like something that should 
already be provided somewhere in org.apache.hadoop.mapreduce.*.
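
Here is a minimal sketch of the stepping I mean, assuming each part is a 
SequenceFile of (Text, Text) records sorted by key; in a real InputFormat 
this loop would live in the RecordReader and feed the mapper rather than 
print:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PairMerge {
      // Merge-join two sorted part files; each key occurs at most once
      // in each file, so matching is a single forward pass over both.
      public static void merge(Configuration conf, Path a, Path b)
          throws IOException {
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Reader ra = new SequenceFile.Reader(fs, a, conf);
        SequenceFile.Reader rb = new SequenceFile.Reader(fs, b, conf);
        Text ka = new Text(), va = new Text();
        Text kb = new Text(), vb = new Text();
        boolean moreA = ra.next(ka, va), moreB = rb.next(kb, vb);
        while (moreA && moreB) {
          int cmp = ka.compareTo(kb);
          if (cmp == 0) {              // matched: the joined tuple
            System.out.println(ka + "\t" + va + "\t" + vb);
            moreA = ra.next(ka, va);
            moreB = rb.next(kb, vb);
          } else if (cmp < 0) {
            moreA = ra.next(ka, va);   // key only in A; drop (inner join)
          } else {
            moreB = rb.next(kb, vb);   // key only in B; drop
          }
        }
        ra.close();
        rb.close();
      }
    }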

Thanks,
Mike