You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by hc busy <hc...@gmail.com> on 2010/04/16 20:28:58 UTC

Re: incorrect Inner Join result for multi column join with null values in join key

Cool! can't wait until CDH has >0.7...

Kinda surprised that nobody encountered this problem before... Can I file a
ticket?

On Fri, Apr 16, 2010 at 10:21 AM, Alan Gates <ga...@yahoo-inc.com> wrote:

>
> On Apr 16, 2010, at 9:37 AM, hc busy wrote:
>
>  What scott noticed is present when the multiple column join key is used in
>> a
>> distributed setting. The trap is that when you unit test the behavior/PIG
>> script and it does the join right in a local environment and then you get
>> F'ed after u deploy to production in distributed enviro.
>>
>
> In 0.7 local mode uses Hadoop's LocalJobRunner, so hopefully we'll avoid
> that will fix these issues with development and deployment differences.
>
> Alan.
>
>
>
>>
>> On Thu, Apr 15, 2010 at 4:24 PM, Scott Carey <scott@richrelevance.com
>> >wrote:
>>
>>  CDH2 Pig 0.5+.   Mapred mode, with CDH2 0.20.1+  Both latest as of 2
>>> weeks
>>> ago.
>>>
>>> Joins on multiple columns have null key values matching.
>>>
>>> IN = LOAD 'test_nulls' using PigStorage(',') as (ind:chararray, ts:int,
>>> f1:int, f2:int);
>>> IN2 = LOAD 'test_nulls' using PigStorage(',') as (ind:chararray, ts:int,
>>> f1:int, f2:int);
>>> --- both the above are the same
>>>
>>> dump IN;
>>> (,1,2,3)
>>> (,-5,5,5)
>>> ( ,100,200,300)
>>> (  ,0,200,300)
>>> (a,4,5,6)
>>> (a,7,8,9)
>>> (b,10,11,12)
>>> (b,11,11,12)
>>>
>>> IN_NULLS = FILTER IN BY ind is NULL;
>>> dump IN_NULLS;
>>> (,1,2,3)
>>> (,-5,5,5)
>>>
>>> J1 = JOIN IN by (ind), IN2 by (ind);
>>> dump J1;
>>> (  ,0,200,300,  ,0,200,300)
>>> (a,4,5,6,a,4,5,6)
>>> (a,4,5,6,a,7,8,9)
>>> (a,7,8,9,a,4,5,6)
>>> (a,7,8,9,a,7,8,9)
>>> ( ,100,200,300, ,100,200,300)
>>> (b,10,11,12,b,10,11,12)
>>> (b,10,11,12,b,11,11,12)
>>> (b,11,11,12,b,10,11,12)
>>> (b,11,11,12,b,11,11,12)
>>>
>>> The above is the expected result of the self-join on the first column.
>>>
>>> J2 = JOIN IN by (ind, ts) IN2 by (ind, ts);
>>> dump J2;
>>> (  ,0,200,300,  ,0,200,300)
>>> ( ,100,200,300, ,100,200,300)
>>> (a,4,5,6,a,4,5,6)
>>> (a,7,8,9,a,7,8,9)
>>> (b,10,11,12,b,10,11,12)
>>> (b,11,11,12,b,11,11,12)
>>> (,-5,5,5,,-5,5,5)
>>> (,1,2,3,,1,2,3)
>>>
>>>
>>> The above is incorrect, since it matched the rows that have NULL for the
>>> ind field.
>>>
>>> There is a work-around, by explicitly filtering for null on the join
>>> columns before the join, but the above still looks incorrect to me.
>>> I suspect it is fixed in 0.6 or later, but I have not been able to find a
>>> JIRA ticket or message on this list about this.
>>>
>>>
>>>
>>>
>>>
>