
Posted to user@pig.apache.org by "Uppuluri, Rohini" <ro...@corp.aol.com> on 2009/08/05 05:37:37 UTC

Duplicates with GROUP statement

Hi all, 
 
In one of my Pig scripts, I am using GROUP on a few fields. I observed
duplicate entries for the fields I grouped on in the output of the GROUP
statement.
 
For example, consider this sample script:
A = LOAD 'myfile.txt' AS (url, urlid, ah, img);
B = GROUP A BY (url, urlid, ah);
C = FOREACH B GENERATE group, COUNT($1);
 
In C, I found duplicate entries for the combination (url, urlid, ah). I
found this while working on large datasets. I couldn't replicate this
scenario on small datasets.
 
Can someone point me to any known issue like this with GROUP, if there is one?
 
Thanks, 
-Rohini

Re: Duplicates with GROUP statement

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.


Hi,

  If you are using PigStorage to read empty strings, IIRC it has a
'behavior' where it silently converts them to null.
That is, when a tuple is read through PigStorage and one of its fields is
an empty string, that field becomes null instead of remaining an empty
string.
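
To make the behavior described above concrete, here is a minimal Python sketch. It is not PigStorage's actual code; `parse_line` is a hypothetical helper and the sample record is made up. It only mimics a tab-delimited loader that turns empty fields into nulls:

```python
def parse_line(line, delimiter="\t"):
    """Split one delimited record; empty fields become None (null).

    Hypothetical stand-in for the loader behavior described in this
    message, not the real PigStorage implementation.
    """
    fields = line.rstrip("\n").split(delimiter)
    return tuple(f if f != "" else None for f in fields)


# Made-up record whose third field (ah) is an empty string.
row = parse_line("http://example.com\t42\t\tlogo.png\n")
print(row)  # ('http://example.com', '42', None, 'logo.png')
```

If loading works this way, an empty string in the file never reaches the script as an empty string.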

Now, I am not sure how closely Pig follows SQL semantics, but from what
I recall, two (url1, urlid1, null) entries can't be grouped together.


If Pig handles nulls the SQL way and your group key has a null in
it, then the keys can be expected not to match.

Of course, the implicit '' -> null conversion by PigStorage is something
that is, imo, highly unintuitive and potentially very buggy (SQL semantics
for null always cause problems for other programmers here too, and most
don't expect Pig to behave that way to begin with).

Of course, I am making a couple of assumptions here, which might need to
be validated!


Regards,
Mridul




RE: Duplicates with GROUP statement

Posted by "Uppuluri, Rohini" <ro...@corp.aol.com>.

Hi all, 

After further experimentation on the data set, I found that there were actually two different lines in my input, as follows:
(url, urlid, null, null) - null values
(url, urlid, '', '') - empty strings

However, after the GROUP BY they became something like this:
(url, urlid, '', '') - empty strings
(url, urlid, '', '') - empty strings

I guess that is why I was seeing two records with exactly the same fields that I grouped on.

Apparently, in the 'putField' method of the 'PigStorage' class, when a value is found to be null, it is skipped and nothing is output. Could anyone please let me know whether GROUP uses PigStorage or a similar function to store its temporary output after grouping?
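
A minimal Python sketch of how this could produce what I saw, assuming the null keys and the empty-string keys are grouped separately and only look alike once written out. `to_text` is a hypothetical stand-in for the null-skipping output described above, not the real 'putField' code, and the records are made up:

```python
from collections import Counter


def to_text(fields, delimiter="\t"):
    """Serialize a tuple; None is written as nothing, like an empty string."""
    return delimiter.join("" if f is None else str(f) for f in fields)


# Made-up input: same url and urlid, but ah is null in one row
# and an empty string in the other.
records = [
    ("url", "urlid", None, "img1"),
    ("url", "urlid", "", "img2"),
]

# Group on (url, urlid, ah): None and "" are different keys.
counts = Counter(r[:3] for r in records)

# Once written as text, the two groups are indistinguishable.
lines = [to_text(key + (n,)) for key, n in counts.items()]
print(len(counts), len(set(lines)))  # 2 1
```

Two groups go in, but both come out as the same line of text, which would look like a duplicate key in the stored output.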

Also, was skipping null data like this intentional, leaving it to the application to decide how to deal with it? Please let me know.

Thanks, 
-Rohini



RE: Duplicates with GROUP statement

Posted by zjffdu <zj...@gmail.com>.

I guess there may be whitespace around the url, urlid, or ah fields,
which makes you think they are duplicates.
