You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@pig.apache.org by Michael Lok <fu...@gmail.com> on 2012/01/03 09:21:57 UTC

Left joins with != condition

Hi folks,

I have 2 tables which I'd like to perform joins via "!=" condition
similar to the SQL syntax below:

select * from yee a left join yer b on a.loc != b.loc

I've read through Pig Latin basics and books but there isn't a way I
can do it.  Is there some other way I can get around this?


Thanks.

Re: Left joins with != condition

Posted by Michael Lok <fu...@gmail.com>.
Hi Dimitry,

CROSS worked like a charm :)


Thanks!


On Wed, Jan 4, 2012 at 10:15 AM, Dmitriy Ryaboy <dv...@gmail.com> wrote:
> Do CROSS like Alan suggested, and then filter for non-equality, that's
> better.
> What Jacob proposed won't work, I think he misunderstood your question (or
> I did...)
>
> D
>
> On Tue, Jan 3, 2012 at 3:49 PM, Michael <fu...@gmail.com> wrote:
>
>> Hi folks,
>>
>> Thanks for the input. As terrible as the cross join is, I need the result
>> to perform matching, ie. For each record in T1, match with all records in
>> T2.
>>
>> My initial thought is that this can be easily done in the db and have the
>> results exported from there, but due to the sheer amount of output, it
>> makes more sense to have the results in hdfs.
>>
>> Not sure what Jacob proposed will generate, but will try that first as I'm
>> unsure how I can perform a join on a constant :)
>>
>>
>> Thanks again
>>
>>
>>
>> On Jan 4, 2012, at 5:44, Dmitriy Ryaboy <dv...@gmail.com> wrote:
>>
>> > No Michael essentially wants a cross-product. It's a terrible thing and
>> > should be avoided :).
>> >
>> > T1:
>> >
>> > a  1
>> > b  2
>> > c  3
>> > d  4
>> >
>> > T2:
>> >
>> > a x
>> > b y
>> > c z
>> >
>> > joined this way on the first column becomes:
>> >
>> > a 1 b y
>> > a 1 c z
>> > b 2 a x
>> > b 2 c z
>> > c 3 a x
>> > c 3 b y
>> > d 4 a x
>> > d 4 b y
>> > d 4 c z
>> >
>> > Note the cardinality explosion. Now assume that you are doing this in
>> Pig /
>> > Hadoop because one of the relations is  TB-sized, or at least
>> > multi-gigabyte.
>> >
>> > And this is why Pig doesn't support it.
>> >
>> > But if you really want to, join on a constant (so all rows in T1 will
>> match
>> > all rows in T2) and filter out those for which T1.loc == T2.loc
>> >
>> > And don't say I didn't warn you :).
>> >
>> > D
>> >
>> > On Tue, Jan 3, 2012 at 5:34 AM, Jacob Perkins <jacob.a.perkins@gmail.com
>> >wrote:
>> >
>> >> If I understand correctly, this is nothing more than an anti-join which
>> >> can be done with pig using a cogroup.
>> >>
>> >> So your SQL below:
>> >>
>> >>> select * from yee a left join yer b on a.loc != b.loc;
>> >>
>> >> becomes something like:
>> >>
>> >> a = load 'yee' as (loc:chararray, stuff:int);
>> >> b = load 'yer' as (loc:chararray, stuff:int);
>> >>
>> >> c = cogroup a by loc, b by loc;
>> >> d = foreach (filter c by IsEmpty(b)) generate FLATTEN(a);
>> >>
>> >> which will result in d containing only the records from a where the
>> >> 'loc' field doesn't match with the 'loc' field in b.
>> >>
>> >> --jacob
>> >> @thedatachef
>> >>
>> >>
>>

Re: Left joins with != condition

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Do CROSS like Alan suggested, and then filter for non-equality, that's
better.
What Jacob proposed won't work, I think he misunderstood your question (or
I did...)

D

On Tue, Jan 3, 2012 at 3:49 PM, Michael <fu...@gmail.com> wrote:

> Hi folks,
>
> Thanks for the input. As terrible as the cross join is, I need the result
> to perform matching, ie. For each record in T1, match with all records in
> T2.
>
> My initial thought is that this can be easily done in the db and have the
> results exported from there, but due to the sheer amount of output, it
> makes more sense to have the results in hdfs.
>
> Not sure what Jacob proposed will generate, but will try that first as I'm
> unsure how I can perform a join on a constant :)
>
>
> Thanks again
>
>
>
> On Jan 4, 2012, at 5:44, Dmitriy Ryaboy <dv...@gmail.com> wrote:
>
> > No Michael essentially wants a cross-product. It's a terrible thing and
> > should be avoided :).
> >
> > T1:
> >
> > a  1
> > b  2
> > c  3
> > d  4
> >
> > T2:
> >
> > a x
> > b y
> > c z
> >
> > joined this way on the first column becomes:
> >
> > a 1 b y
> > a 1 c z
> > b 2 a x
> > b 2 c z
> > c 3 a x
> > c 3 b y
> > d 4 a x
> > d 4 b y
> > d 4 c z
> >
> > Note the cardinality explosion. Now assume that you are doing this in
> Pig /
> > Hadoop because one of the relations is  TB-sized, or at least
> > multi-gigabyte.
> >
> > And this is why Pig doesn't support it.
> >
> > But if you really want to, join on a constant (so all rows in T1 will
> match
> > all rows in T2) and filter out those for which T1.loc == T2.loc
> >
> > And don't say I didn't warn you :).
> >
> > D
> >
> > On Tue, Jan 3, 2012 at 5:34 AM, Jacob Perkins <jacob.a.perkins@gmail.com
> >wrote:
> >
> >> If I understand correctly, this is nothing more than an anti-join which
> >> can be done with pig using a cogroup.
> >>
> >> So your SQL below:
> >>
> >>> select * from yee a left join yer b on a.loc != b.loc;
> >>
> >> becomes something like:
> >>
> >> a = load 'yee' as (loc:chararray, stuff:int);
> >> b = load 'yer' as (loc:chararray, stuff:int);
> >>
> >> c = cogroup a by loc, b by loc;
> >> d = foreach (filter c by IsEmpty(b)) generate FLATTEN(a);
> >>
> >> which will result in d containing only the records from a where the
> >> 'loc' field doesn't match with the 'loc' field in b.
> >>
> >> --jacob
> >> @thedatachef
> >>
> >>
>

Re: Left joins with != condition

Posted by Michael <fu...@gmail.com>.
Hi folks,

Thanks for the input. As terrible as the cross join is, I need the result to perform matching, ie. For each record in T1, match with all records in T2. 

My initial thought is that this can be easily done in the db and have the results exported from there, but due to the sheer amount of output, it makes more sense to have the results in hdfs. 

Not sure what Jacob proposed will generate, but will try that first as I'm unsure how I can perform a join on a constant :)


Thanks again



On Jan 4, 2012, at 5:44, Dmitriy Ryaboy <dv...@gmail.com> wrote:

> No Michael essentially wants a cross-product. It's a terrible thing and
> should be avoided :).
> 
> T1:
> 
> a  1
> b  2
> c  3
> d  4
> 
> T2:
> 
> a x
> b y
> c z
> 
> joined this way on the first column becomes:
> 
> a 1 b y
> a 1 c z
> b 2 a x
> b 2 c z
> c 3 a x
> c 3 b y
> d 4 a x
> d 4 b y
> d 4 c z
> 
> Note the cardinality explosion. Now assume that you are doing this in Pig /
> Hadoop because one of the relations is  TB-sized, or at least
> multi-gigabyte.
> 
> And this is why Pig doesn't support it.
> 
> But if you really want to, join on a constant (so all rows in T1 will match
> all rows in T2) and filter out those for which T1.loc == T2.loc
> 
> And don't say I didn't warn you :).
> 
> D
> 
> On Tue, Jan 3, 2012 at 5:34 AM, Jacob Perkins <ja...@gmail.com>wrote:
> 
>> If I understand correctly, this is nothing more than an anti-join which
>> can be done with pig using a cogroup.
>> 
>> So your SQL below:
>> 
>>> select * from yee a left join yer b on a.loc != b.loc;
>> 
>> becomes something like:
>> 
>> a = load 'yee' as (loc:chararray, stuff:int);
>> b = load 'yer' as (loc:chararray, stuff:int);
>> 
>> c = cogroup a by loc, b by loc;
>> d = foreach (filter c by IsEmpty(b)) generate FLATTEN(a);
>> 
>> which will result in d containing only the records from a where the
>> 'loc' field doesn't match with the 'loc' field in b.
>> 
>> --jacob
>> @thedatachef
>> 
>> 

Re: Left joins with != condition

Posted by Alan Gates <ga...@hortonworks.com>.
You can do better than joining on a constant.  Pig does support CROSS, and it does it in parallel.  (Joining on a constant would single thread the process.)  It still creates a massive volume of data and is slow, but it will work.

Alan.

On Jan 3, 2012, at 1:44 PM, Dmitriy Ryaboy wrote:

> No Michael essentially wants a cross-product. It's a terrible thing and
> should be avoided :).
> 
> T1:
> 
> a  1
> b  2
> c  3
> d  4
> 
> T2:
> 
> a x
> b y
> c z
> 
> joined this way on the first column becomes:
> 
> a 1 b y
> a 1 c z
> b 2 a x
> b 2 c z
> c 3 a x
> c 3 b y
> d 4 a x
> d 4 b y
> d 4 c z
> 
> Note the cardinality explosion. Now assume that you are doing this in Pig /
> Hadoop because one of the relations is  TB-sized, or at least
> multi-gigabyte.
> 
> And this is why Pig doesn't support it.
> 
> But if you really want to, join on a constant (so all rows in T1 will match
> all rows in T2) and filter out those for which T1.loc == T2.loc
> 
> And don't say I didn't warn you :).
> 
> D
> 
> On Tue, Jan 3, 2012 at 5:34 AM, Jacob Perkins <ja...@gmail.com>wrote:
> 
>> If I understand correctly, this is nothing more than an anti-join which
>> can be done with pig using a cogroup.
>> 
>> So your SQL below:
>> 
>>> select * from yee a left join yer b on a.loc != b.loc;
>> 
>> becomes something like:
>> 
>> a = load 'yee' as (loc:chararray, stuff:int);
>> b = load 'yer' as (loc:chararray, stuff:int);
>> 
>> c = cogroup a by loc, b by loc;
>> d = foreach (filter c by IsEmpty(b)) generate FLATTEN(a);
>> 
>> which will result in d containing only the records from a where the
>> 'loc' field doesn't match with the 'loc' field in b.
>> 
>> --jacob
>> @thedatachef
>> 
>> 


Re: Left joins with != condition

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
No Michael essentially wants a cross-product. It's a terrible thing and
should be avoided :).

T1:

a  1
b  2
c  3
d  4

T2:

a x
b y
c z

joined this way on the first column becomes:

a 1 b y
a 1 c z
b 2 a x
b 2 c z
c 3 a x
c 3 b y
d 4 a x
d 4 b y
d 4 c z

Note the cardinality explosion. Now assume that you are doing this in Pig /
Hadoop because one of the relations is  TB-sized, or at least
multi-gigabyte.

And this is why Pig doesn't support it.

But if you really want to, join on a constant (so all rows in T1 will match
all rows in T2) and filter out those for which T1.loc == T2.loc

And don't say I didn't warn you :).

D

On Tue, Jan 3, 2012 at 5:34 AM, Jacob Perkins <ja...@gmail.com>wrote:

> If I understand correctly, this is nothing more than an anti-join which
> can be done with pig using a cogroup.
>
> So your SQL below:
>
> > select * from yee a left join yer b on a.loc != b.loc;
>
> becomes something like:
>
> a = load 'yee' as (loc:chararray, stuff:int);
> b = load 'yer' as (loc:chararray, stuff:int);
>
> c = cogroup a by loc, b by loc;
> d = foreach (filter c by IsEmpty(b)) generate FLATTEN(a);
>
> which will result in d containing only the records from a where the
> 'loc' field doesn't match with the 'loc' field in b.
>
> --jacob
> @thedatachef
>
>

Re: Left joins with != condition

Posted by Jacob Perkins <ja...@gmail.com>.
If I understand correctly, this is nothing more than an anti-join which
can be done with pig using a cogroup.

So your SQL below:

> select * from yee a left join yer b on a.loc != b.loc;

becomes something like:

a = load 'yee' as (loc:chararray, stuff:int);
b = load 'yer' as (loc:chararray, stuff:int);

c = cogroup a by loc, b by loc;
d = foreach (filter c by IsEmpty(b)) generate FLATTEN(a);

which will result in d containing only the records from a where the
'loc' field doesn't match with the 'loc' field in b.

--jacob
@thedatachef