You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@pig.apache.org by Jonathan Coveney <jc...@gmail.com> on 2011/01/26 15:36:35 UTC

Possible bug in join?

So, bizarrely, I am either not understanding how pig does joins or there is
a bug... it has been quite frustrating to troubleshoot.

The issue is this: after doing a join to get set5, I do a foreach generate
to make set6. Depending on the order in the join statement, one value gets
erased by another. Here is the specific part I am talking about:

set1 = JOIN Z2 by demo,small_table by demo;
set2 = foreach set1 generate Z2::uid as uid,Z2::c2 as c2,Z2::ss2k as
ss2k,Z2::time_id as time_id ,Z2::countryCode as countryCode,Z2::segment as
segment,small_table::value as alsodemo;
set3 = filter set2 BY segment == 1;
set4 = filter set2 BY segment == 2;
set4_a = foreach set4 generate uid, c2, ss2k, time_id, countryCode, alsodemo
as gender;
set5 = join set4_a by (uid,c2,ss2k,time_id,countryCode) full, set3 by
(uid,c2,ss2k,time_id,countryCode);
set6 = foreach set5 generate ((set3::uid IS NULL) ? set4_a::uid : set3::uid)
as uid,
       ((set3::c2 IS NULL) ? set4_a::c2 : set3::c2) as c2,
       ((set3::ss2k IS NULL) ? set4_a::ss2k : set3::ss2k) as ss2k,
       ((set3::time_id IS NULL) ? set4_a::time_id : set3::time_id) as
time_id,
       ((set3::countryCode IS NULL) ? set4_a::countryCode :
set3::countryCode) as countryCode,
       gender, alsodemo as min_age;

If set5 joins set4_a on the left and set3 on the right, then while set5 will
output properly, set6 will not. the "gender" column will erase the alsodemo
column.

If set5 joins set3 on the left and set4_a on the right, then set5 still
works, but set6 will not: in this case, alsodemo will be fine, but it will
erase gender. Basically, the last two columns are always the same. Making
the references set4_a::gender and set3::alsodemo don't change anything (in
fact, originally that's how this was done, but in troubleshooting we have
been changing things to try and fix it).

This is a super bizarre example of the script behaving in a pretty awful
fashion. Not sure what is causing it, would love to know if anyone has any
ideas, and if this is a bug?

I can post the full script, but a lot of it isn't really germane to this.

Re: Possible bug in join?

Posted by Xiaomeng Wan <sh...@gmail.com>.
Isn't it the same problem mentioned in

http://mail-archives.apache.org/mod_mbox/pig-user/201101.mbox/%3C90570B63-0991-4127-8E3E-69E2B6E4B942@few.vu.nl%3E

if so, there is already a fix.

Shawn

On Wed, Jan 26, 2011 at 7:36 AM, Jonathan Coveney <jc...@gmail.com> wrote:
> So, bizarrely, I am either not understanding how pig does joins or there is
> a bug... it has been quite frustrating to troubleshoot.
>
> The issue is this: after doing a join to get set5, I do a foreach generate
> to make set6. Depending on the order in the join statement, one value gets
> erased by another. Here is the specific part I am talking about:
>
> set1 = JOIN Z2 by demo,small_table by demo;
> set2 = foreach set1 generate Z2::uid as uid,Z2::c2 as c2,Z2::ss2k as
> ss2k,Z2::time_id as time_id ,Z2::countryCode as countryCode,Z2::segment as
> segment,small_table::value as alsodemo;
> set3 = filter set2 BY segment == 1;
> set4 = filter set2 BY segment == 2;
> set4_a = foreach set4 generate uid, c2, ss2k, time_id, countryCode, alsodemo
> as gender;
> set5 = join set4_a by (uid,c2,ss2k,time_id,countryCode) full, set3 by
> (uid,c2,ss2k,time_id,countryCode);
> set6 = foreach set5 generate ((set3::uid IS NULL) ? set4_a::uid : set3::uid)
> as uid,
>       ((set3::c2 IS NULL) ? set4_a::c2 : set3::c2) as c2,
>       ((set3::ss2k IS NULL) ? set4_a::ss2k : set3::ss2k) as ss2k,
>       ((set3::time_id IS NULL) ? set4_a::time_id : set3::time_id) as
> time_id,
>       ((set3::countryCode IS NULL) ? set4_a::countryCode :
> set3::countryCode) as countryCode,
>       gender, alsodemo as min_age;
>
> If set5 joins set4_a on the left and set3 on the right, then while set5 will
> output properly, set6 will not. the "gender" column will erase the alsodemo
> column.
>
> If set5 joins set3 on the left and set4_a on the right, then set5 still
> works, but set6 will not: in this case, alsodemo will be fine, but it will
> erase gender. Basically, the last two columns are always the same. Making
> the references set4_a::gender and set3::alsodemo don't change anything (in
> fact, originally that's how this was done, but in troubleshooting we have
> been changing things to try and fix it).
>
> This is a super bizarre example of the script behaving in a pretty awful
> fashion. Not sure what is causing it, would love to know if anyone has any
> ideas, and if this is a bug?
>
> I can post the full script, but a lot of it isn't really germane to this.
>