Posted to user@pig.apache.org by Vincent BARAT <vi...@ubikod.com> on 2009/10/15 14:51:14 UTC

Possible bug in NULL fields handling

Hello,

I'm not sure whether this is a bug, but NULL fields do not seem to be
handled correctly:

My data (events):

0,,jawi
,0,juug
,,lfou
0,0,caro

My script:

events = load 'events' using PigStorage(',') AS 
(sessionid:chararray, jobid:chararray, user:chararray);
user_events = group events by user;
dump user_events;
event_count_by_user = foreach user_events generate group, COUNT(events);
dump event_count_by_user;

The results:

user_events (correct):
(caro,{(0,0,caro)})
(jawi,{(0,,jawi)})
(juug,{(,0,juug)})
(lfou,{(,,lfou)})

event_count_by_user (incorrect):
(caro,1L)
(jawi,1L)
(juug,0L)
(lfou,0L)

event_count_by_user should be:

(caro,1L)
(jawi,1L)
(juug,1L)
(lfou,1L)

It seems that tuples whose first field is null (those starting with "(,")
are not counted.

Any suggestions?

Thanks a lot



Re: Possible bug in NULL fields handling

Posted by Vincent BARAT <vi...@ubikod.com>.
Thank you very much for your answer!
I was not aware of the COUNT_STAR() function.
I guess it was introduced recently (otherwise it is a bug in
the documentation :-)).

Anyway, the proposal at the end of PIG-1014 seems OK to me. At the very
least, I think that the current behavior of COUNT when applied to bags is
misleading.

Dmitriy Ryaboy wrote:
> Currently, COUNT of a bag will ignore tuples whose first field is
> null (this stems from the fact that COUNT of a column counts only
> non-null values, for SQL compatibility). You may want to try using
> COUNT_STAR. This behavior is currently being reconsidered:
> https://issues.apache.org/jira/browse/PIG-1014 (please provide input!)
> 
> -Dmitriy

Re: Possible bug in NULL fields handling

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Currently, COUNT of a bag will ignore tuples whose first field is
null (this stems from the fact that COUNT of a column counts only
non-null values, for SQL compatibility). You may want to try using
COUNT_STAR. This behavior is currently being reconsidered:
https://issues.apache.org/jira/browse/PIG-1014 (please provide input!)

-Dmitriy
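
For reference, here is a minimal sketch of the COUNT_STAR variant of the
script from the original message. It assumes a Pig release that ships the
COUNT_STAR builtin and the same 'events' input file; the alias names are
taken from the original script.

```pig
-- Load the comma-separated events; empty fields are read as nulls.
events = load 'events' using PigStorage(',') AS
    (sessionid:chararray, jobid:chararray, user:chararray);
user_events = group events by user;
-- COUNT_STAR counts every tuple in the bag, including tuples whose
-- first field is null, which COUNT skips.
event_count_by_user = foreach user_events generate group, COUNT_STAR(events);
dump event_count_by_user;
```

With the four sample rows from the original message, each user should then
be reported with a count of 1.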
