Posted to common-user@hadoop.apache.org by Edmund Kohlwey <ek...@gmail.com> on 2009/11/04 14:52:06 UTC

Cross Join

Hi,
I'm looking for an efficient way to do a cross join. I've gone through a 
few implementations, and I wanted to seek some advice before attempting 
another. The join is "large collection to large collection," so there 
are no trick optimizations like downloading one side of the join onto 
each node (i.e., a map-side join). The output of the join will be sparse 
(it's basically matching a large collection of regexes against a large 
collection of strings), but because of the nature of the data there's 
not really any way to pre-process either side of the join.

1. Naive approach - on a single node, iterate over both collections, 
resulting in reading the "left" file once and the right file n times 
- I know this is bad.
2. Indexed approach - index each data item with a row/column - requires 
replicating, sorting, and shuffling all the records twice - also not 
good. This actually seemed to perform worse than 1, and resulted in 
running out of disk space on the mappers when output was spilled to disk.

I'm now considering what to try next. One idea is to improve on 1 by 
"blocking" the reads, so that the right side of the join is read b 
times, where b is the number of blocks the left side is split into.

The other (imho, best) idea is to write a "reduce-side" join, which 
would actually be fully parallelized: it basically relies on 
map/reduce to split the left side into blocks, and then lets each 
reducer stream through the right side once. In this version, the 
right side is still downloaded b times, but the operation is done in 
parallel. The only issue is that I would need to iterate over the 
reduce iterators multiple times, which is something M/R doesn't 
allow (I think). I know I could save the contents of the iterator 
locally, but this seems like a bad design choice too. Does anybody know 
if there's a smart way to iterate twice in a reducer?
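
(For illustration, here is a minimal sketch of the "save the contents of the 
iterator locally" idea as a new-API Java reducer. It assumes, purely 
hypothetically, that the mappers tag each value as left ("L\t") or right 
("R\t") and that a secondary sort delivers the left-tagged values for a key 
first, so only one side ever has to be buffered.)

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer for a repartition-style cross join. Mappers are assumed
// to tag each value with "L\t" or "R\t"; a secondary sort is assumed to deliver
// all left-tagged values for a key before any right-tagged ones.
public class BufferedCrossJoinReducer extends Reducer<Text, Text, Text, Text> {

  @Override
  protected void reduce(Text blockKey, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    // The framework's Iterable can only be traversed once, so buffer the
    // left block; Hadoop reuses the Text object, hence the copy to String.
    List<String> leftBlock = new ArrayList<String>();
    for (Text value : values) {
      String tagged = value.toString();
      if (tagged.startsWith("L\t")) {
        leftBlock.add(tagged.substring(2));
      } else {
        // Right-side record: stream it once, pairing it with every buffered
        // left record. The regex match stands in for the real transform and
        // usually produces no output, so the result stays sparse.
        String right = tagged.substring(2);
        for (String left : leftBlock) {
          if (right.matches(left)) {
            context.write(new Text(left), new Text(right));
          }
        }
      }
    }
  }
}

Buffering only the left block keeps memory proportional to a single left 
split rather than to the whole right side.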

There's probably some methods I haven't really thought of. Does anyone 
have any suggestions?



Re: Cross Join

Posted by Edmund Kohlwey <ek...@gmail.com>.
Thanks to all who commented on this. I think there was some confusion 
over what I was trying to do: indeed there was no common key between the 
two tables to join on, which made all the methods I investigated either 
inappropriate or inefficient. In the end I decided to write my own join 
class. It can be written in a reducer or a mapper. While the reducer 
implementation is a bit cleaner, the mapper implementation provides 
(theoretically) better distributed processing. For those who are 
interested, the basic algorithm is:

x is defined as the cross product of two vectors

proc crossproduct:
   allow mapreduce to partition the left side of the input
   on each mapper:
      let left_i = all the left-side key/value pairs processed on that node
      in cleanup (or at the end of the reduce):
         let right = open the right side of the join on each node through hdfs
         for each pair of pairs in left_i x right:
            if transform(pair) != null: emit transform(pair)
            else: continue
         endfor
   end on each
end proc

The important
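
A minimal Java sketch of the mapper variant (new-API Mapper; the configuration 
key "cross.join.right.path" and the regex-based transform are illustrative 
assumptions, not part of the original implementation):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper-side cross join: MapReduce splits the left input across
// mappers, each mapper buffers its split (left_i above), then streams the
// right side once from HDFS in cleanup().
public class CrossJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

  private final List<String> leftSplit = new ArrayList<String>();

  @Override
  protected void map(LongWritable offset, Text value, Context context) {
    // Save every left-side record seen by this mapper.
    leftSplit.add(value.toString());
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // "cross.join.right.path" is an invented job property naming the right side.
    Path rightPath = new Path(context.getConfiguration().get("cross.join.right.path"));
    FileSystem fs = rightPath.getFileSystem(context.getConfiguration());
    BufferedReader right = new BufferedReader(new InputStreamReader(fs.open(rightPath)));
    try {
      String rightLine;
      while ((rightLine = right.readLine()) != null) {
        for (String left : leftSplit) {
          String result = transform(left, rightLine);
          if (result != null) {
            context.write(new Text(left), new Text(result));
          }
        }
      }
    } finally {
      right.close();
    }
  }

  // Stand-in for the sparse pairwise operation, e.g. a regex match.
  private String transform(String left, String right) {
    return right.matches(left) ? right : null;
  }
}

The buffered left split has to fit in memory, or be spilled to local disk, 
for this to work as written.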


Re: Cross Join

Posted by Ashutosh Chauhan <as...@gmail.com>.
Hi Edmund,

If you can prepare your dataset the way org.apache.hadoop.mapred.join
requires, then it might be an efficient way to do joins in your case, though
IMHO the requirements it places are pretty restrictive. Also, instead of
reinventing the wheel, I would suggest you take a look at how Pig tries to
solve the "joining large datasets" problem. It has in fact four different
join algorithms implemented, and one or more of them should satisfy your
requirements. It seems to me that Pig's merge join is well suited to your
case. Its only requirement is that the dataset be sorted on both sides; the
datasets need not be equipartitioned and need not have the same number of
partitions. You said that sorting the dataset is a pain in your case, but
Pig's orderby is quite sophisticated and performs sorting rather
efficiently. If sorting is indeed not an option, then you may want to
consider Pig's hash join or skewed join.

Joins in Pig are explained at high-level here:
http://squarecog.wordpress.com/2009/11/03/apache-pig-apittsburgh-hadoop-user-group/

Hope it helps,
Ashutosh


Re: Cross Join

Posted by Jason Venner <ja...@gmail.com>.
Look at the join package in MapReduce; it provides this functionality quite
cleanly for ordered datasets that have the same partitioning:
org.apache.hadoop.mapred.join in Hadoop 0.19.
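
For reference, a driver using that package might be wired up roughly like 
this (old mapred API, as in 0.19/0.20; the paths are placeholders, and both 
inputs must already be sorted on the join key and identically partitioned):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;
import org.apache.hadoop.mapred.join.TupleWritable;

// Sketch of wiring up the map-side join package (old mapred API). The input
// paths are placeholders; both inputs must be sorted on the join key and
// partitioned identically for CompositeInputFormat to line them up.
public class MapSideJoinDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MapSideJoinDriver.class);
    conf.setJobName("map-side join");

    // The join expression read by CompositeInputFormat: an inner join of two
    // sorted, equally partitioned text inputs.
    conf.setInputFormat(CompositeInputFormat.class);
    conf.set("mapred.join.expr", CompositeInputFormat.compose(
        "inner", KeyValueTextInputFormat.class, "/data/left", "/data/right"));

    // Map-only pass: each map() call receives the join key and a TupleWritable
    // holding one value per input; the default identity mapper just emits it.
    conf.setNumReduceTasks(0);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(TupleWritable.class);
    FileOutputFormat.setOutputPath(conf, new Path("/data/joined"));

    JobClient.runJob(conf);
  }
}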


-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals