Posted to mapreduce-user@hadoop.apache.org by Mike Spreitzer <ms...@us.ibm.com> on 2012/01/15 09:20:36 UTC
What is the right way to do map-side joins in Hadoop 1.0?
I have a problem that needs to be solved by an iteration of MapReduce
jobs, and in each iteration I need to start by doing an equijoin between a
large constant dataset and the output of the previous iteration; the
remainder of my map function works on a joined tuple in a way whose
details are not important here. I am happy to describe the reduce output
as (key, value) pairs, and the large constant dataset can be described that
way too; the join condition is then equality of keys. What is the best way
to get those equijoins done in the maps?
I presume I should be looking for a solution using
org.apache.hadoop.mapreduce.*, rather than org.apache.hadoop.mapred.*.
I do not want to cache the entirety of the large constant dataset in
memory during the setup method of my Mapper --- that would require way too
much memory. I do not even want to make a copy of the entirety of the
large constant dataset in the local filesystem of each node in my cluster.
What I want is to have the large constant dataset partitioned among my
nodes. It is OK (even preferable) if there is a bit of replication. Thus,
storing it in HDFS --- as one file or as a collection of files --- would
be fine.
Thanks,
Mike
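To make the shape of the iteration concrete, here is a minimal driver
sketch; the paths, the round count, and the buildJoinJob helper are all
invented for illustration, with the join job itself configured as in the
replies below.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class IterationDriver {
      public static void main(String[] args) throws Exception {
        Path constant = new Path("/data/constant"); // hypothetical paths
        Path prev = new Path("/data/iter-0");
        for (int round = 1; round <= 10; round++) {
          Path next = new Path("/data/iter-" + round);
          // buildJoinJob stands in for the join-job configuration
          // discussed in the replies below.
          JobConf conf = buildJoinJob(constant, prev, next);
          JobClient.runJob(conf); // run one round, feed its output forward
          prev = next;
        }
      }

      static JobConf buildJoinJob(Path constant, Path prev, Path out) {
        throw new UnsupportedOperationException("see the sketches below");
      }
    }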
Re: What is the right way to do map-side joins in Hadoop 1.0?
Posted by Mike Spreitzer <ms...@us.ibm.com>.
Yes, I did look at CompositeInputFormat. That is why I remarked that I
supposed I should be looking under org.apache.hadoop.mapreduce.*, and why I
sent the earlier question asking why CompositeInputFormat is not under
org.apache.hadoop.mapreduce.* in Hadoop 1.0.0. But I have gotten no answers
yet.
And yes, I very much want to do the joins as 'early' as possible (i.e., in
the mapper, not the reducer); I do not want to waste work sending copies of
the large constant dataset around if I do not have to.
BTW, my outline below was written too hastily: I see no obvious way to get
the large constant dataset and the previous iteration's output placed
consistently, which will be needed for top performance. (A sketch of
keeping at least the partitioning consistent follows this message.)
Thanks,
Mike
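Physical co-placement of corresponding HDFS blocks is indeed not something
Hadoop 1.0 exposes, but consistent partitioning of the two datasets can be
arranged. A minimal sketch, assuming the old mapred API; the class name is
invented here:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.HashPartitioner;

    public class ConsistentPartitioning {
      // Apply to every job in the chain, including the one that wrote the
      // large constant dataset, so that part-NNNNN of each output covers
      // the same key range.
      public static void configure(JobConf conf, int numParts) {
        conf.setNumReduceTasks(numParts);                // same N everywhere
        conf.setPartitionerClass(HashPartitioner.class); // same partitioner
        // Reducers see keys in sorted order, so a reducer that emits keys
        // as it receives them yields key-sorted part files, which is what
        // a map-side merge join needs.
      }
    }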
Re: What is the right way to do map-side joins in Hadoop 1.0?
Posted by Bejoy Ks <be...@gmail.com>.
Hi Mike
Have a look at CompositeInputFormat. I guess it is what you are looking
for to achieve map-side joins. If you are fine with a reduce-side join,
go with MultipleInputs instead. I have tried that sort of join using
MultipleInputs and have written something up on it; check whether it would
be useful for you (a very crude implementation :) and you may well have
better ways). A configuration sketch for the map-side option follows this
message.
http://kickstarthadoop.blogspot.com/2011/09/joins-with-plain-map-reduce.html
Hope it helps!
Regards
Bejoy.K.S
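For reference, a minimal sketch of the map-side join Bejoy describes, using
the old (mapred) API where CompositeInputFormat lives in Hadoop 1.0; the
class names and argument layout are invented, and both inputs are assumed
to be key-sorted SequenceFiles of Text partitioned identically:

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.mapred.join.CompositeInputFormat;
    import org.apache.hadoop.mapred.join.TupleWritable;

    public class MapJoinJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MapJoinJob.class);
        conf.setJobName("map-side-join");

        // Inner equijoin of the two inputs; both must be sorted by key
        // and partitioned identically for CompositeInputFormat to work.
        conf.setInputFormat(CompositeInputFormat.class);
        conf.set("mapred.join.expr", CompositeInputFormat.compose(
            "inner", SequenceFileInputFormat.class,
            new Path(args[0]),   // large constant dataset
            new Path(args[1]))); // previous iteration's output

        conf.setMapperClass(JoinMapper.class);
        conf.setNumReduceTasks(0); // or keep a reduce phase for the round
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(conf, new Path(args[2]));

        JobClient.runJob(conf);
      }

      // Each map() call receives one key and a tuple holding one value
      // per joined input, in the order the inputs were composed.
      public static class JoinMapper extends MapReduceBase
          implements Mapper<Text, TupleWritable, Text, Text> {
        public void map(Text key, TupleWritable values,
                        OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
          // values.get(0) is from the constant dataset, values.get(1)
          // from the previous iteration; real join logic goes here.
          out.collect(key, (Text) values.get(1));
        }
      }
    }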
Re: What is the right way to do map-side joins in Hadoop 1.0?
Posted by Mike Spreitzer <ms...@us.ibm.com>.
BTW, each key appears exactly once in the large constant dataset, and
exactly once in each MR job's output.
I am thinking the right approach is to consistently partition the job
output and the large constant dataset, with the number of partitions being
the number of reduce tasks and each part going into its own file. Then
make an InputFormat whose number of splits equals the number of reduce
tasks; reading a split consists of reading the corresponding pair of files,
stepping through both in a single merge pass (sketched after this message).
This seems like something that should already be provided somewhere in
org.apache.hadoop.mapreduce.*.
Thanks,
Mike
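To illustrate the merge step Mike describes: both part files in a split are
key-sorted with unique keys, so a single forward pass pairs them up. A
plain-Java sketch, with in-memory arrays standing in for the two HDFS part
files a real RecordReader would step through:

    public class SortedMergeJoin {
      public static void main(String[] args) {
        // Each row is {key, value}; both sides sorted by key, keys unique.
        String[][] constant = {{"k1", "c1"}, {"k2", "c2"}, {"k3", "c3"}};
        String[][] previous = {{"k2", "p2"}, {"k3", "p3"}};
        int i = 0, j = 0;
        while (i < constant.length && j < previous.length) {
          int c = constant[i][0].compareTo(previous[j][0]);
          if (c < 0) {
            i++; // key absent from the previous iteration's output
          } else if (c > 0) {
            j++; // key absent from the constant dataset
          } else {
            // Matching keys: emit the joined tuple, advance both sides.
            System.out.println(constant[i][0] + "\t"
                + constant[i][1] + "," + previous[j][1]);
            i++; j++;
          }
        }
      }
    }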