You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@pig.apache.org by Scott Carey <sc...@richrelevance.com> on 2010/04/16 01:24:10 UTC

incorrect Inner Join result for multi column join with null values in join key

CDH2 Pig 0.5+.   Mapred mode, with CDH2 0.20.1+  Both latest as of 2 weeks ago.

Joins on multiple columns have null key values matching.

IN = LOAD 'test_nulls' using PigStorage(',') as (ind:chararray, ts:int, f1:int, f2:int);
IN2 = LOAD 'test_nulls' using PigStorage(',') as (ind:chararray, ts:int, f1:int, f2:int);
--- both the above are the same

dump IN;
(,1,2,3)
(,-5,5,5)
( ,100,200,300)
(  ,0,200,300)
(a,4,5,6)
(a,7,8,9)
(b,10,11,12)
(b,11,11,12)

IN_NULLS = FILTER IN BY ind is NULL;
dump IN_NULLS;
(,1,2,3)
(,-5,5,5)

J1 = JOIN IN by (ind), IN2 by (ind);
dump J1;
(  ,0,200,300,  ,0,200,300)
(a,4,5,6,a,4,5,6)
(a,4,5,6,a,7,8,9)
(a,7,8,9,a,4,5,6)
(a,7,8,9,a,7,8,9)
( ,100,200,300, ,100,200,300)
(b,10,11,12,b,10,11,12)
(b,10,11,12,b,11,11,12)
(b,11,11,12,b,10,11,12)
(b,11,11,12,b,11,11,12)

The above is the expected result of the self-join on the first column.

J2 = JOIN IN by (ind, ts) IN2 by (ind, ts);
dump J2;
(  ,0,200,300,  ,0,200,300)
( ,100,200,300, ,100,200,300)
(a,4,5,6,a,4,5,6)
(a,7,8,9,a,7,8,9)
(b,10,11,12,b,10,11,12)
(b,11,11,12,b,11,11,12)
(,-5,5,5,,-5,5,5)
(,1,2,3,,1,2,3)


The above is incorrect, since it matched the rows that have NULL for the ind field.

There is a work-around, by explicitly filtering for null on the join columns before the join, but the above still looks incorrect to me.
I suspect it is fixed in 0.6 or later, but I have not been able to find a JIRA ticket or message on this list about this.

Re: incorrect Inner Join result for multi column join with null values in join key

Posted by hc busy <hc...@gmail.com>.

Cool! can't wait until CDH has >0.7...

Kinda surprised that nobody encountered this problem before... Can I file a
ticket?

On Fri, Apr 16, 2010 at 10:21 AM, Alan Gates <ga...@yahoo-inc.com> wrote:

>
> On Apr 16, 2010, at 9:37 AM, hc busy wrote:
>
>  What scott noticed is present when the multiple column join key is used in
>> a
>> distributed setting. The trap is that when you unit test the behavior/PIG
>> script and it does the join right in a local environment and then you get
>> F'ed after u deploy to production in distributed enviro.
>>
>
> In 0.7 local mode uses Hadoop's LocalJobRunner, so hopefully we'll avoid
> that will fix these issues with development and deployment differences.
>
> Alan.
>
>
>
>>
>> On Thu, Apr 15, 2010 at 4:24 PM, Scott Carey <scott@richrelevance.com
>> >wrote:
>>
>>  CDH2 Pig 0.5+.   Mapred mode, with CDH2 0.20.1+  Both latest as of 2
>>> weeks
>>> ago.
>>>
>>> Joins on multiple columns have null key values matching.
>>>
>>> IN = LOAD 'test_nulls' using PigStorage(',') as (ind:chararray, ts:int,
>>> f1:int, f2:int);
>>> IN2 = LOAD 'test_nulls' using PigStorage(',') as (ind:chararray, ts:int,
>>> f1:int, f2:int);
>>> --- both the above are the same
>>>
>>> dump IN;
>>> (,1,2,3)
>>> (,-5,5,5)
>>> ( ,100,200,300)
>>> (  ,0,200,300)
>>> (a,4,5,6)
>>> (a,7,8,9)
>>> (b,10,11,12)
>>> (b,11,11,12)
>>>
>>> IN_NULLS = FILTER IN BY ind is NULL;
>>> dump IN_NULLS;
>>> (,1,2,3)
>>> (,-5,5,5)
>>>
>>> J1 = JOIN IN by (ind), IN2 by (ind);
>>> dump J1;
>>> (  ,0,200,300,  ,0,200,300)
>>> (a,4,5,6,a,4,5,6)
>>> (a,4,5,6,a,7,8,9)
>>> (a,7,8,9,a,4,5,6)
>>> (a,7,8,9,a,7,8,9)
>>> ( ,100,200,300, ,100,200,300)
>>> (b,10,11,12,b,10,11,12)
>>> (b,10,11,12,b,11,11,12)
>>> (b,11,11,12,b,10,11,12)
>>> (b,11,11,12,b,11,11,12)
>>>
>>> The above is the expected result of the self-join on the first column.
>>>
>>> J2 = JOIN IN by (ind, ts) IN2 by (ind, ts);
>>> dump J2;
>>> (  ,0,200,300,  ,0,200,300)
>>> ( ,100,200,300, ,100,200,300)
>>> (a,4,5,6,a,4,5,6)
>>> (a,7,8,9,a,7,8,9)
>>> (b,10,11,12,b,10,11,12)
>>> (b,11,11,12,b,11,11,12)
>>> (,-5,5,5,,-5,5,5)
>>> (,1,2,3,,1,2,3)
>>>
>>>
>>> The above is incorrect, since it matched the rows that have NULL for the
>>> ind field.
>>>
>>> There is a work-around, by explicitly filtering for null on the join
>>> columns before the join, but the above still looks incorrect to me.
>>> I suspect it is fixed in 0.6 or later, but I have not been able to find a
>>> JIRA ticket or message on this list about this.
>>>
>>>
>>>
>>>
>>>
>

Re: incorrect Inner Join result for multi column join with null values in join key

Posted by hc busy <hc...@gmail.com>.

Cool! can't wait until CDH has >0.7...

Kinda surprised that nobody encountered this problem before... Can I file a
ticket?

On Fri, Apr 16, 2010 at 10:21 AM, Alan Gates <ga...@yahoo-inc.com> wrote:

>
> On Apr 16, 2010, at 9:37 AM, hc busy wrote:
>
>  What scott noticed is present when the multiple column join key is used in
>> a
>> distributed setting. The trap is that when you unit test the behavior/PIG
>> script and it does the join right in a local environment and then you get
>> F'ed after u deploy to production in distributed enviro.
>>
>
> In 0.7 local mode uses Hadoop's LocalJobRunner, so hopefully we'll avoid
> that will fix these issues with development and deployment differences.
>
> Alan.
>
>
>
>>
>> On Thu, Apr 15, 2010 at 4:24 PM, Scott Carey <scott@richrelevance.com
>> >wrote:
>>
>>  CDH2 Pig 0.5+.   Mapred mode, with CDH2 0.20.1+  Both latest as of 2
>>> weeks
>>> ago.
>>>
>>> Joins on multiple columns have null key values matching.
>>>
>>> IN = LOAD 'test_nulls' using PigStorage(',') as (ind:chararray, ts:int,
>>> f1:int, f2:int);
>>> IN2 = LOAD 'test_nulls' using PigStorage(',') as (ind:chararray, ts:int,
>>> f1:int, f2:int);
>>> --- both the above are the same
>>>
>>> dump IN;
>>> (,1,2,3)
>>> (,-5,5,5)
>>> ( ,100,200,300)
>>> (  ,0,200,300)
>>> (a,4,5,6)
>>> (a,7,8,9)
>>> (b,10,11,12)
>>> (b,11,11,12)
>>>
>>> IN_NULLS = FILTER IN BY ind is NULL;
>>> dump IN_NULLS;
>>> (,1,2,3)
>>> (,-5,5,5)
>>>
>>> J1 = JOIN IN by (ind), IN2 by (ind);
>>> dump J1;
>>> (  ,0,200,300,  ,0,200,300)
>>> (a,4,5,6,a,4,5,6)
>>> (a,4,5,6,a,7,8,9)
>>> (a,7,8,9,a,4,5,6)
>>> (a,7,8,9,a,7,8,9)
>>> ( ,100,200,300, ,100,200,300)
>>> (b,10,11,12,b,10,11,12)
>>> (b,10,11,12,b,11,11,12)
>>> (b,11,11,12,b,10,11,12)
>>> (b,11,11,12,b,11,11,12)
>>>
>>> The above is the expected result of the self-join on the first column.
>>>
>>> J2 = JOIN IN by (ind, ts) IN2 by (ind, ts);
>>> dump J2;
>>> (  ,0,200,300,  ,0,200,300)
>>> ( ,100,200,300, ,100,200,300)
>>> (a,4,5,6,a,4,5,6)
>>> (a,7,8,9,a,7,8,9)
>>> (b,10,11,12,b,10,11,12)
>>> (b,11,11,12,b,11,11,12)
>>> (,-5,5,5,,-5,5,5)
>>> (,1,2,3,,1,2,3)
>>>
>>>
>>> The above is incorrect, since it matched the rows that have NULL for the
>>> ind field.
>>>
>>> There is a work-around, by explicitly filtering for null on the join
>>> columns before the join, but the above still looks incorrect to me.
>>> I suspect it is fixed in 0.6 or later, but I have not been able to find a
>>> JIRA ticket or message on this list about this.
>>>
>>>
>>>
>>>
>>>
>

Re: incorrect Inner Join result for multi column join with null values in join key

Posted by Alan Gates <ga...@yahoo-inc.com>.

On Apr 16, 2010, at 9:37 AM, hc busy wrote:

> What scott noticed is present when the multiple column join key is  
> used in a
> distributed setting. The trap is that when you unit test the  
> behavior/PIG
> script and it does the join right in a local environment and then  
> you get
> F'ed after u deploy to production in distributed enviro.

In 0.7 local mode uses Hadoop's LocalJobRunner, so hopefully we'll  
avoid that will fix these issues with development and deployment  
differences.

Alan.

>
>
> On Thu, Apr 15, 2010 at 4:24 PM, Scott Carey  
> <sc...@richrelevance.com>wrote:
>
>> CDH2 Pig 0.5+.   Mapred mode, with CDH2 0.20.1+  Both latest as of  
>> 2 weeks
>> ago.
>>
>> Joins on multiple columns have null key values matching.
>>
>> IN = LOAD 'test_nulls' using PigStorage(',') as (ind:chararray,  
>> ts:int,
>> f1:int, f2:int);
>> IN2 = LOAD 'test_nulls' using PigStorage(',') as (ind:chararray,  
>> ts:int,
>> f1:int, f2:int);
>> --- both the above are the same
>>
>> dump IN;
>> (,1,2,3)
>> (,-5,5,5)
>> ( ,100,200,300)
>> (  ,0,200,300)
>> (a,4,5,6)
>> (a,7,8,9)
>> (b,10,11,12)
>> (b,11,11,12)
>>
>> IN_NULLS = FILTER IN BY ind is NULL;
>> dump IN_NULLS;
>> (,1,2,3)
>> (,-5,5,5)
>>
>> J1 = JOIN IN by (ind), IN2 by (ind);
>> dump J1;
>> (  ,0,200,300,  ,0,200,300)
>> (a,4,5,6,a,4,5,6)
>> (a,4,5,6,a,7,8,9)
>> (a,7,8,9,a,4,5,6)
>> (a,7,8,9,a,7,8,9)
>> ( ,100,200,300, ,100,200,300)
>> (b,10,11,12,b,10,11,12)
>> (b,10,11,12,b,11,11,12)
>> (b,11,11,12,b,10,11,12)
>> (b,11,11,12,b,11,11,12)
>>
>> The above is the expected result of the self-join on the first  
>> column.
>>
>> J2 = JOIN IN by (ind, ts) IN2 by (ind, ts);
>> dump J2;
>> (  ,0,200,300,  ,0,200,300)
>> ( ,100,200,300, ,100,200,300)
>> (a,4,5,6,a,4,5,6)
>> (a,7,8,9,a,7,8,9)
>> (b,10,11,12,b,10,11,12)
>> (b,11,11,12,b,11,11,12)
>> (,-5,5,5,,-5,5,5)
>> (,1,2,3,,1,2,3)
>>
>>
>> The above is incorrect, since it matched the rows that have NULL  
>> for the
>> ind field.
>>
>> There is a work-around, by explicitly filtering for null on the join
>> columns before the join, but the above still looks incorrect to me.
>> I suspect it is fixed in 0.6 or later, but I have not been able to  
>> find a
>> JIRA ticket or message on this list about this.
>>
>>
>>
>>

Re: incorrect Inner Join result for multi column join with null values in join key

Posted by hc busy <hc...@gmail.com>.

What scott noticed is present when the multiple column join key is used in a
distributed setting. The trap is that when you unit test the behavior/PIG
script and it does the join right in a local environment and then you get
F'ed after u deploy to production in distributed enviro.


On Thu, Apr 15, 2010 at 4:24 PM, Scott Carey <sc...@richrelevance.com>wrote:

> CDH2 Pig 0.5+.   Mapred mode, with CDH2 0.20.1+  Both latest as of 2 weeks
> ago.
>
> Joins on multiple columns have null key values matching.
>
> IN = LOAD 'test_nulls' using PigStorage(',') as (ind:chararray, ts:int,
> f1:int, f2:int);
> IN2 = LOAD 'test_nulls' using PigStorage(',') as (ind:chararray, ts:int,
> f1:int, f2:int);
> --- both the above are the same
>
> dump IN;
> (,1,2,3)
> (,-5,5,5)
> ( ,100,200,300)
> (  ,0,200,300)
> (a,4,5,6)
> (a,7,8,9)
> (b,10,11,12)
> (b,11,11,12)
>
> IN_NULLS = FILTER IN BY ind is NULL;
> dump IN_NULLS;
> (,1,2,3)
> (,-5,5,5)
>
> J1 = JOIN IN by (ind), IN2 by (ind);
> dump J1;
> (  ,0,200,300,  ,0,200,300)
> (a,4,5,6,a,4,5,6)
> (a,4,5,6,a,7,8,9)
> (a,7,8,9,a,4,5,6)
> (a,7,8,9,a,7,8,9)
> ( ,100,200,300, ,100,200,300)
> (b,10,11,12,b,10,11,12)
> (b,10,11,12,b,11,11,12)
> (b,11,11,12,b,10,11,12)
> (b,11,11,12,b,11,11,12)
>
> The above is the expected result of the self-join on the first column.
>
> J2 = JOIN IN by (ind, ts) IN2 by (ind, ts);
> dump J2;
> (  ,0,200,300,  ,0,200,300)
> ( ,100,200,300, ,100,200,300)
> (a,4,5,6,a,4,5,6)
> (a,7,8,9,a,7,8,9)
> (b,10,11,12,b,10,11,12)
> (b,11,11,12,b,11,11,12)
> (,-5,5,5,,-5,5,5)
> (,1,2,3,,1,2,3)
>
>
> The above is incorrect, since it matched the rows that have NULL for the
> ind field.
>
> There is a work-around, by explicitly filtering for null on the join
> columns before the join, but the above still looks incorrect to me.
> I suspect it is fixed in 0.6 or later, but I have not been able to find a
> JIRA ticket or message on this list about this.
>
>
>
>