Posted to user@pig.apache.org by Russell Jurney <ru...@gmail.com> on 2012/07/24 07:49:51 UTC

None. wtf is None?

Can someone explain this script to me? It is freaking me out. When did Pig
start spitting out 'None' in place of null?

register /me/pig/build/ivy/lib/Pig/avro-1.5.3.jar
register /me/pig/build/ivy/lib/Pig/json-simple-1.1.jar
register /me/pig/contrib/piggybank/java/piggybank.jar

define AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();

rmf /tmp/sent_mails
rmf /tmp/replies

/* Get rid of emails with reply_to, as they confuse everything in mailing
lists. */
avro_emails = load '/me/tmp/thu_emails' using AvroStorage();
clean_emails = filter avro_emails by froms is not null and reply_tos is null;

/* Treat emails without in_reply_to as sent emails */
combined_emails = foreach clean_emails generate froms, tos, message_id;
sent_mails = foreach combined_emails generate flatten(froms.address) as from,
                                              flatten(tos.address) as to,
                                              message_id;
store sent_mails into '/tmp/sent_mails';

/* Treat in_reply_tos separately, as our FLATTEN() will filter out the
nulls */
replies = filter clean_emails by in_reply_to is not null;
replies = foreach replies generate flatten(froms.address) as from,
                                   flatten(tos.address) as to,
                                   in_reply_to;
store replies into '/tmp/replies';


Despite filtering replies down to only the emails that have the 'in_reply_to'
field, I get the same number of records in both relations I store:

russell-jurneys-macbook-pro:pig rjurney$ cat /tmp/sent_mails/p*|wc -l
   17431
russell-jurneys-macbook-pro:pig rjurney$ cat /tmp/replies/p*|wc -l
   17431


Investigating shows me:

cat /tmp/replies/part-00001

crna@hotmail.com russell.jurney@gmail.com None
crna@hotmail.com russell.jurney@gmail.com <CANSvDjqLTC=NOXiup9SABZ40j7Bfy=EMUT5t=LyWcuzJmg7AVQ@mail.gmail.com
voice-noreply@google.com russell.jurney@gmail.com None


Where did *None* come from? I thought FLATTEN would prune records with
empty columns, and I'm OK with it not doing so, but what operators does
None respond to? It is not null. How do I prune these?
-- 
Russell Jurney twitter.com/rjurney russell.jurney@gmail.com datasyndrome.com

Re: None. wtf is None?

Posted by Robert Yerex <ro...@civitaslearning.com>.
What's in part-r-00000?



On Tue, Jul 24, 2012 at 9:30 AM, Russell Jurney <ru...@gmail.com> wrote:

> No. No python UDF.
>
> Russell Jurney http://datasyndrome.com
>
> On Jul 24, 2012, at 6:50 AM, Robert Yerex
> <ro...@civitaslearning.com> wrote:
>
> > Python UDF? That would explain the None instead of null



-- 
Robert Yerex
Data Scientist
Civitas Learning
www.civitaslearning.com

Re: None. wtf is None?

Posted by Russell Jurney <ru...@gmail.com>.
No. No python UDF.

Russell Jurney http://datasyndrome.com

On Jul 24, 2012, at 6:50 AM, Robert Yerex
<ro...@civitaslearning.com> wrote:

> Python UDF? That would explain the None instead of null

Re: None. wtf is None?

Posted by Robert Yerex <ro...@civitaslearning.com>.
Python UDF? That would explain the None instead of null




-- 
Robert Yerex
Data Scientist
Civitas Learning
www.civitaslearning.com

Re: None. wtf is None?

Posted by Alan Gates <ga...@hortonworks.com>.
Can you attach a sample of the input data? I'm guessing None came from the input data.

Alan.
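
If the guess above is right, one possibility is that in_reply_to holds the literal string 'None' rather than a true null, in which case "is not null" will never remove it. The sketch below is only that: it assumes in_reply_to loads as a chararray and that 'None' is the exact stored value, neither of which is confirmed in this thread. The names none_rows, none_group and none_count are hypothetical; everything else reuses relations from the script in the original post.

```pig
/* Diagnostic (assumption: the bad values are the exact string 'None').
   Count how many rows carry that literal string. */
none_rows  = filter clean_emails by in_reply_to == 'None';
none_group = group none_rows all;
none_count = foreach none_group generate COUNT(none_rows);
dump none_count;

/* If the count is non-zero, exclude the string as well as real nulls. */
replies = filter clean_emails by in_reply_to is not null
                              and in_reply_to != 'None';
replies = foreach replies generate flatten(froms.address) as from,
                                   flatten(tos.address) as to,
                                   in_reply_to;
store replies into '/tmp/replies';
```

If the count comes back as zero, the value is probably something else (for example, padded with whitespace), and looking at a sample of the input data is the better next step.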
