You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@pig.apache.org by Renato Marroquín Mogrovejo <re...@gmail.com> on 2011/02/07 19:43:01 UTC
Re: How would you implement a custom join?

I found really interesting all those papers. I haven't finished reading the
band join algorithm paper either, but there are a couple of things that
intrigue me e.g. the Almaden paper compares its results against Pig version
0.2 I mean I think the study made by them is great, but Pig is in a stable
0.8 now, wouldn't Pig perform better now than then? Has Pig embraced any of
the paper suggestions?
Anyways, creating a custom join inside a UDF might not be suitable for some
specialized types of join, but maybe for others such as Parallel
Set-*Similarity
Joins *would be easier (flamingo.ics.uci.edu/pub/sigmod10-vernica.pdf),
don't you think it might be possible? I mean we could take advantage of not
doing only raw MapReduce

Renato M.

2011/1/28 Alan Gates <ga...@yahoo-inc.com>

> Depending on the join algorithm you may be able to implement it with
> cogroup, a custom UDF, and possibly a custom partitioner.  I haven't
> finished reading the band join algorithm paper I sent a link for, but I
> suspect it requires some records to be duplicated (since records within the
> band will need to be sent to multiple reducers to match records from the
> other side).  That you cannot do without implementing a custom join.
>
> For an example of how to implement a custom join take a look at
> https://issues.apache.org/jira/browse/PIG-792  This has a lot of sampling
> code you won't have to worry about.  But it will give you an idea of the
> logical and physical operators inside Pig that would be needed.
>
> Also, here's some input from Chris Olston, one of our research scientists
> at Yahoo with expertise in databases:
>
> >>>
> I have not read the paper you sent but it seems to be about so-called “band
> joins”, which are a special case of non-equijoin that arise frequently in
> practice, and offer obvious opportunities for locality-based strategies e.g.
> using indexes and (distributed) partitioning. One approach that would be
> consistent with the Pig “low-level” philosophy would be to expose “BAND
> JOIN” as an operator and have a corresponding implementation along the lines
> of what that paper proposes.
>
> Also, as you know Utkarsh’s original implementation of CROSS (still the
> same?) performs a “generalized fragment-and-replicate” strategy, which is a
> way to do arbitrary non-equi-joins in a way that spreads work onto lots of
> machines (CROSS can be seen as non-equi-join with a very promiscuous join
> predicate :). There are probably papers that try to optimize the NxM grid
> structure of the generalized f-and-r topology, based on the relative sizes
> of the inputs, the join selectivity, data distributions, etc. I think the
> paper that originally surfaced this idea is:
> http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=250116. Not sure
> whether there were follow-on papers that try to do more optimization.
> Fast-forwarding to modern times, I believe the Almaden SIGMOD’10 paper might
> have investigated f-and-r join strategies for the map-reduce context:
> http://portal.acm.org/citation.cfm?doid=1807167.1807273. There’s also the
> Ullman paper that proposes (but does not evaluate empirically) some
> map-reduce join strategies:
> http://ilpubs.stanford.edu:8090/957/1/mapred-join-report.pdf
> <<<
>
> Alan.
>
>
> On Jan 28, 2011, at 7:35 AM, Jonathan Coveney wrote:
>
>  I'm not sure if this can be done at the UDF level, or if it'd have to be
>> done lower level. Imagine you have a good candidate for a replicated join,
>> but beyond that you know most about the structure of one of the pieces of
>> information you are joining (for example, that you could build a binary
>> search tree from it and do your comparisons really quickly, or something).
>> Is there a way to make your own join, or extend the one in pig? I could
>> imagine a UDF that takes two bags, the left piece and the right piece,
>> constructs your join, etc, but I don't know that that would be as fast.
>>
>> Any thoughts?
>>
>
>