You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@pig.apache.org by Jonathan Coveney <jc...@gmail.com> on 2011/01/27 02:09:37 UTC

Efficient ways to do non-equijoins?

A is (val:int)
B is (thing:chararray, min:int, max:int)

Basically what I want is C = (val, thing) where val is between min and max
for that thing. In sql the syntax for this would not be hard, in pig the
naive solution I have is..

cro = CROSS A,B;
fil = FILTER cro BY val >= min AND val <= max;
C = FOREACH fil GENERATE val,thing;

I am wondering what the most efficient way of doing this sort of operation
is. I imagine with some sort of indexing you could ideally speed things up?
Not sure. But this is important enough that I'd be willing to do some
legwork.

As always, thanks for your help.

Re: Efficient ways to do non-equijoins?

Posted by Jonathan Coveney <jc...@gmail.com>.

Alan, thanks for sharing the paper. I will definitely read it over. For my
specific case, I found a data structure that actually should work perfectly
http://en.wikipedia.org/wiki/Interval_tree

My question becomes this: what is the best way to implement this sort of
custom join in pig? I don't know what would be involved in being able to
write your own join, so I imagine it'd have to be a udf, but would that be
anywhere near as efficient? I can imagine a algebraic UDF that takes all of
your data, uses the (ideally much smaller) joining dataset to make the
datastructure, and then begins spitting out your output...the one issue
being that making the datastructure is definitely costly, and you'd want to
do it as few times as possible (possibly even doing a pre-step, but I doubt
pig is capable of saving a java datastructure). It "feels" like you'd need
to do something lower level to make sure to generate the Interval tree as
few times as possible and then join with it.

2011/1/27 Alan Gates <ga...@yahoo-inc.com>

> This time with the link to the paper:
> http://www.vldb.org/conf/1991/P443.PDF :)
>
> Alan.
>
> On Jan 27, 2011, at 8:48 AM, Alan Gates wrote:
>
>  The script you propose will work, but if your data is of even
>> reasonable size it will be very slow.  A quick search of the web
>> turned up one paper with an algorithm for parallel non-equijoins that
>> at first glance might work in your case.
>>
>> Alan.
>>
>> On Jan 26, 2011, at 5:15 PM, Jonathan Coveney wrote:
>>
>>  Also, it'd be worth thinking about this for the case where the min
>>> and maxes
>>> are arbitrary, and also the case where they aren't overlapping. That
>>> is to
>>> say, there is only one thing for a given value.
>>>
>>> 2011/1/26 Jonathan Coveney <jc...@gmail.com>
>>>
>>>  A is (val:int)
>>>> B is (thing:chararray, min:int, max:int)
>>>>
>>>> Basically what I want is C = (val, thing) where val is between min
>>>> and max
>>>> for that thing. In sql the syntax for this would not be hard, in
>>>> pig the
>>>> naive solution I have is..
>>>>
>>>> cro = CROSS A,B;
>>>> fil = FILTER cro BY val >= min AND val <= max;
>>>> C = FOREACH fil GENERATE val,thing;
>>>>
>>>> I am wondering what the most efficient way of doing this sort of
>>>> operation
>>>> is. I imagine with some sort of indexing you could ideally speed
>>>> things up?
>>>> Not sure. But this is important enough that I'd be willing to do some
>>>> legwork.
>>>>
>>>> As always, thanks for your help.
>>>>
>>>>
>>
>

Re: Efficient ways to do non-equijoins?

Posted by Alan Gates <ga...@yahoo-inc.com>.

This time with the link to the paper:  http://www.vldb.org/conf/1991/P443.PDF 
  :)

Alan.
On Jan 27, 2011, at 8:48 AM, Alan Gates wrote:

> The script you propose will work, but if your data is of even
> reasonable size it will be very slow.  A quick search of the web
> turned up one paper with an algorithm for parallel non-equijoins that
> at first glance might work in your case.
>
> Alan.
>
> On Jan 26, 2011, at 5:15 PM, Jonathan Coveney wrote:
>
>> Also, it'd be worth thinking about this for the case where the min
>> and maxes
>> are arbitrary, and also the case where they aren't overlapping. That
>> is to
>> say, there is only one thing for a given value.
>>
>> 2011/1/26 Jonathan Coveney <jc...@gmail.com>
>>
>>> A is (val:int)
>>> B is (thing:chararray, min:int, max:int)
>>>
>>> Basically what I want is C = (val, thing) where val is between min
>>> and max
>>> for that thing. In sql the syntax for this would not be hard, in
>>> pig the
>>> naive solution I have is..
>>>
>>> cro = CROSS A,B;
>>> fil = FILTER cro BY val >= min AND val <= max;
>>> C = FOREACH fil GENERATE val,thing;
>>>
>>> I am wondering what the most efficient way of doing this sort of
>>> operation
>>> is. I imagine with some sort of indexing you could ideally speed
>>> things up?
>>> Not sure. But this is important enough that I'd be willing to do  
>>> some
>>> legwork.
>>>
>>> As always, thanks for your help.
>>>
>

Re: Efficient ways to do non-equijoins?

Posted by Alan Gates <ga...@yahoo-inc.com>.

The script you propose will work, but if your data is of even  
reasonable size it will be very slow.  A quick search of the web  
turned up one paper with an algorithm for parallel non-equijoins that  
at first glance might work in your case.

Alan.

On Jan 26, 2011, at 5:15 PM, Jonathan Coveney wrote:

> Also, it'd be worth thinking about this for the case where the min  
> and maxes
> are arbitrary, and also the case where they aren't overlapping. That  
> is to
> say, there is only one thing for a given value.
>
> 2011/1/26 Jonathan Coveney <jc...@gmail.com>
>
>> A is (val:int)
>> B is (thing:chararray, min:int, max:int)
>>
>> Basically what I want is C = (val, thing) where val is between min  
>> and max
>> for that thing. In sql the syntax for this would not be hard, in  
>> pig the
>> naive solution I have is..
>>
>> cro = CROSS A,B;
>> fil = FILTER cro BY val >= min AND val <= max;
>> C = FOREACH fil GENERATE val,thing;
>>
>> I am wondering what the most efficient way of doing this sort of  
>> operation
>> is. I imagine with some sort of indexing you could ideally speed  
>> things up?
>> Not sure. But this is important enough that I'd be willing to do some
>> legwork.
>>
>> As always, thanks for your help.
>>

Re: Efficient ways to do non-equijoins?

Posted by Thejas M Nair <te...@yahoo-inc.com>.

If B is small enough to fit into memory of a map task, you can use replicated join to simulate cross, that would be much faster -
    cro = join A by 1, B by 1 using 'replicated';

If the ranges size in B are fixed and contiguous (eg 1-10,11-20,21-30...), you can use a udf to map A.val to values in B.min and do a join.

  J = join A by getMin(val), B by min ;
  C = foreach J generate val, thing;

-Thejas



On 1/26/11 5:15 PM, "Jonathan Coveney" <jc...@gmail.com> wrote:

Also, it'd be worth thinking about this for the case where the min and maxes
are arbitrary, and also the case where they aren't overlapping. That is to
say, there is only one thing for a given value.

2011/1/26 Jonathan Coveney <jc...@gmail.com>

> A is (val:int)
> B is (thing:chararray, min:int, max:int)
>
> Basically what I want is C = (val, thing) where val is between min and max
> for that thing. In sql the syntax for this would not be hard, in pig the
> naive solution I have is..
>
> cro = CROSS A,B;
> fil = FILTER cro BY val >= min AND val <= max;
> C = FOREACH fil GENERATE val,thing;
>
> I am wondering what the most efficient way of doing this sort of operation
> is. I imagine with some sort of indexing you could ideally speed things up?
> Not sure. But this is important enough that I'd be willing to do some
> legwork.
>
> As always, thanks for your help.
>

Re: Efficient ways to do non-equijoins?

Posted by Jonathan Coveney <jc...@gmail.com>.

Also, it'd be worth thinking about this for the case where the min and maxes
are arbitrary, and also the case where they aren't overlapping. That is to
say, there is only one thing for a given value.

2011/1/26 Jonathan Coveney <jc...@gmail.com>

> A is (val:int)
> B is (thing:chararray, min:int, max:int)
>
> Basically what I want is C = (val, thing) where val is between min and max
> for that thing. In sql the syntax for this would not be hard, in pig the
> naive solution I have is..
>
> cro = CROSS A,B;
> fil = FILTER cro BY val >= min AND val <= max;
> C = FOREACH fil GENERATE val,thing;
>
> I am wondering what the most efficient way of doing this sort of operation
> is. I imagine with some sort of indexing you could ideally speed things up?
> Not sure. But this is important enough that I'd be willing to do some
> legwork.
>
> As always, thanks for your help.
>