Posted to user@pig.apache.org by "Mustafi, Priyo" <pm...@paypal.com> on 2012/04/23 21:05:10 UTC

Problem with dereferencing and alias

Hi All,
I am pretty new to Pig and am having some issues with dereferencing. My data, in simplified form, looks like this:

data = load 'visitevent' using PigStorage() AS (visit:tuple(visitorid, visitid, browser), events:bag{event:tuple(pagename, pagevar)});

cat visitevent   (note there is a tab between the visit and the events)
(vr1,vi1,ff)	{((pagea,eb1)),((pageb,eb3))}
(vr1,vi2,ff)	{((pageb,eb2))}
(vr2,vi3,ff)	{((pageb,eb4))}
(vr3,vi4,ie)	{((pagec,eb3)),((pagea,eb5))}


My task is the following
1)  Generate count(visitid) and count(distinct visitorid) by browser
2)  Generate count(events), count(visitid) and count(distinct visitorid) by pagename


I have issues with the first task. I tried the following after flattening visit, and it worked.

data = load 'c:/shared/visitevent' using PigStorage() AS (visit:tuple(visitorid, visitid, browser), events:bag{event:tuple(pagename, pagevar)});
data2 = foreach data generate FLATTEN(visit);
data3 = group data2 by browser;
dc = foreach data3 {d1 = data2.visitorid; d2 = distinct d1; generate group, COUNT(d2), COUNT(d1);};
describe dc;
dump dc;


I don't understand why I need to flatten visit. I tried the following without flattening, and nothing I try works. I'm not sure why.

data = load 'c:/shared/visitevent' using PigStorage() AS (visit:tuple(visitorid, visitid, browser), events:bag{event:tuple(pagename, pagevar)});
data2 = foreach data generate visit;
data3 = group data2 by browser;
--  describe data3 produces the following:
--       data3: {group: bytearray,data2: {visit: (visitorid: bytearray,visitid: bytearray,browser: bytearray)}}
--  Neither of the statements below works; somehow the alias is not found. Why?
dc = foreach data3 {d1 = data2.visitorid; d2 = distinct d1; generate group, COUNT(d2), COUNT(d1);};
dc = foreach data3 {d1 = visit.visitorid; d2 = distinct d1; generate group, COUNT(d2), COUNT(d1);};

What am I doing wrong? Since my task #2 is going to group by pagename, which is in a bag->tuple, do I have to flatten that one twice to get this working? Is there any documentation on dereferencing complex and nested structures? Any help is appreciated.

Thanks 
Priyo




RE: Problem with dereferencing and alias

Posted by "Mustafi, Priyo" <pm...@paypal.com>.
Thanks Gianmarco!  I see why it makes sense now.  I guess when I see multiple levels of nesting, I should flatten for ease of processing.  



Re: Problem with dereferencing and alias

Posted by Gianmarco De Francisci Morales <gd...@apache.org>.
Hi,

The point is that visit is a nested tuple inside the tuples that make up
your original relation.
If you describe the data2 relation, it should become clear:

WITH FLATTEN
grunt> describe data2
data2: {visit::visitorid: bytearray,visit::visitid: bytearray,visit::browser: bytearray}

WITHOUT FLATTEN
grunt> data2 = foreach data generate visit;
grunt> describe data2
data2: {visit: (visitorid: bytearray,visitid: bytearray,browser: bytearray)}

If you don't want to flatten (for whatever reason), you need to modify
your script like this:

data3 = group data2 by visit.browser;

But then you have a double nesting, which I find cumbersome to work with.
grunt> describe data3
data3: {group: bytearray,data2: {(visit: (visitorid: bytearray,visitid: bytearray,browser: bytearray))}}

Now data3 is a relation in which each tuple holds a bag, data2; each tuple
in that bag holds a nested tuple, visit, which in turn has 3 fields.
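
For what it's worth, one way to keep working with this doubly nested form
might be a nested FOREACH over the inner bag. This is only a sketch: it
assumes Pig 0.10 or later (where FOREACH is accepted inside a nested
block), and the aliases ids and uniq are made up for illustration.

```pig
-- Sketch only: assumes Pig 0.10+ and the unflattened data2 / data3 defined above.
-- ids and uniq are illustrative aliases, not part of the original script.
dc = FOREACH data3 {
    -- reach into the nested visit tuple of every element of the data2 bag
    ids  = FOREACH data2 GENERATE visit.visitorid;
    uniq = DISTINCT ids;
    -- same columns as the flattened version: browser, distinct visitors, visits
    GENERATE group, COUNT(uniq), COUNT(ids);
};
```

The extra inner FOREACH is exactly the bookkeeping that the flattened
version avoids.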

That's why flattening comes in handy in this case.
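
The same idea may carry over to the second task (counts by pagename). The
sketch below assumes the declared schema holds, so that each element of
events is a single (pagename, pagevar) tuple and one FLATTEN is enough; if
the data really carries the extra level of parentheses shown in the sample
file, a further FLATTEN may be needed. The aliases (flat, by_page, and so
on) are illustrative, and treating count(visitid) as a count of distinct
visits per page is only my reading of the task.

```pig
-- Sketch only: aliases are illustrative; assumes events elements are (pagename, pagevar).
data = LOAD 'visitevent' USING PigStorage()
       AS (visit:tuple(visitorid, visitid, browser),
           events:bag{event:tuple(pagename, pagevar)});

-- One output row per event, carrying the fields of the visit it belongs to.
flat = FOREACH data GENERATE
           FLATTEN(visit)  AS (visitorid, visitid, browser),
           FLATTEN(events) AS (pagename, pagevar);

by_page = GROUP flat BY pagename;

page_counts = FOREACH by_page {
    visit_ids       = flat.visitid;
    unique_visits   = DISTINCT visit_ids;
    visitor_ids     = flat.visitorid;
    unique_visitors = DISTINCT visitor_ids;
    -- events per page, distinct visits per page, distinct visitors per page
    GENERATE group AS pagename,
             COUNT(flat)            AS event_count,
             COUNT(unique_visits)   AS visit_count,
             COUNT(unique_visitors) AS visitor_count;
};
```

Flattening visit and events in the same GENERATE pairs every event with
its visit, so all three counts can be taken from one grouped relation.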

I hope it helps.

Cheers,
--
Gianmarco


