You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@pig.apache.org by di...@email.com on 2010/06/30 08:37:23 UTC

Help with writing Pig Query

Hi 
I'm absolutely new with using Pig, only just picked it up like 3 days ago, and still trying to wrap my head around it. I'm stuck with putting together a query.


A DUMP of my sample dataset is as follows, 


log = LOAD 'example-users.txt' AS (user:chararray, page:chararray);
DUMP log;



(User1,a)
(User1,b)
(User2,f)
(User3,b)
(User2,a)
(User1,e)
(User2,b)
(User2,c)
(User3,d)
(User1,d)
(User2,e)
(User2,a)
(User3,c)
(User1,d)
(User2,c)
(User3,a)
(User1,d)
(User2,b)
(User2,e)
(User3,c)


What I'm trying to do is to say, Users visiting page 'a' also visited this list of other pages ranked by number of times the page was visited. Can anyone help or give me some guidance?


Thanks
Leslie

Re: Help with writing Pig Query

Posted by di...@email.com.

Hi Thejas
Thanks, with your input I managed to work it out. Here's my solution. Hope it's useful to someone.



/* load in the log file */
log = LOAD 'example-user-tracking.txt' AS (user:chararray, page:chararray);


/* generate the list of unique users grouped by page */
unique_log = DISTINCT log;
g_users = GROUP unique_log BY page;
l_page_users = FOREACH g_users GENERATE group AS page, FLATTEN(unique_log.user) AS user;


/* generate a list of pages grouped by users, and add on a counter */
l_counted = FOREACH log GENERATE user, page, 1 AS counter;
g_userpages = GROUP l_counted BY (user,page);
l_userpages = FOREACH g_userpages GENERATE FLATTEN (group), COUNT(l_counted) AS occurs;


/* joined the 2 lists together using user as key */
joined = JOIN l_page_users BY user, l_userpages BY group::user;
l_page_page = FOREACH joined GENERATE l_page_users::page AS page1, l_userpages::group::page AS page2, l_userp
ages::occurs AS occurs;
g_page_page = GROUP l_page_page BY (page1, page2);
/* generate a list showing the number of occurence of starting page moving to landing page */
result = FOREACH g_page_page GENERATE group, SUM(l_page_page.occurs) AS occurs;


/* display the result */
DUMP result;


If someone is able to optimize this, please do share. Not sure if my version is the best way to achieve the result.
Thanks
L.





-----Original Message-----
From: Thejas Nair <te...@yahoo-inc.com>
To: pig-user@hadoop.apache.org <pi...@hadoop.apache.org>; diagnostix@email.com
Sent: Thu, Jul 1, 2010 6:10 am
Subject: Re: Help with writing Pig Query


Does this do what you want ? -

L1 = LOAD 'example-users.txt' AS (user:chararray, page:chararray);
L2 = LOAD 'example-users.txt' AS (user:chararray, page:chararray);
 -- as of current version of pig, you need to use two different loads for
self join 

J = join L1 by page, L2 by page; -- self join on page
F1 = foreach j generate L1::page as p1, L2::page as p2;
G = group F1 by p1,p2;
F2 = foreach G generate group.p1 as p1, group.p2 as p2 , COUNT(F1) as
visitcount; -- now you have the number of times user who visited p1 has
visited p2

O = order F2 by p1, visitcount;
dump O; -- you results


I haven't checked the syntax of above query.

One optimization you can do to reduce the output size of join, is to do a
group-by on user,page , then generate the count. Then do self-join on that
result, replace COUNT(F1) in F2(above) with SUM(F1.cnt)

-Thejas


On 6/29/10 11:37 PM, "diagnostix@email.com" <di...@email.com> wrote:

> Hi 
> I'm absolutely new with using Pig, only just picked it up like 3 days ago, and
> still trying to wrap my head around it. I'm stuck with putting together a
> query.
> 
> 
> A DUMP of my sample dataset is as follows,
> 
> 
> log = LOAD 'example-users.txt' AS (user:chararray, page:chararray);
> DUMP log;
> 
> 
> 
> (User1,a)
> (User1,b)
> (User2,f)
> (User3,b)
> (User2,a)
> (User1,e)
> (User2,b)
> (User2,c)
> (User3,d)
> (User1,d)
> (User2,e)
> (User2,a)
> (User3,c)
> (User1,d)
> (User2,c)
> (User3,a)
> (User1,d)
> (User2,b)
> (User2,e)
> (User3,c)
> 
> 
> What I'm trying to do is to say, Users visiting page 'a' also visited this
> list of other pages ranked by number of times the page was visited. Can anyone
> help or give me some guidance?
> 
> 
> Thanks
> Leslie
> 
> 
> 
> 
>

Re: Help with writing Pig Query

Posted by Thejas Nair <te...@yahoo-inc.com>.

Does this do what you want ? -

L1 = LOAD 'example-users.txt' AS (user:chararray, page:chararray);
L2 = LOAD 'example-users.txt' AS (user:chararray, page:chararray);
 -- as of current version of pig, you need to use two different loads for
self join 

J = join L1 by page, L2 by page; -- self join on page
F1 = foreach j generate L1::page as p1, L2::page as p2;
G = group F1 by p1,p2;
F2 = foreach G generate group.p1 as p1, group.p2 as p2 , COUNT(F1) as
visitcount; -- now you have the number of times user who visited p1 has
visited p2

O = order F2 by p1, visitcount;
dump O; -- you results


I haven't checked the syntax of above query.

One optimization you can do to reduce the output size of join, is to do a
group-by on user,page , then generate the count. Then do self-join on that
result, replace COUNT(F1) in F2(above) with SUM(F1.cnt)

-Thejas


On 6/29/10 11:37 PM, "diagnostix@email.com" <di...@email.com> wrote:

> Hi 
> I'm absolutely new with using Pig, only just picked it up like 3 days ago, and
> still trying to wrap my head around it. I'm stuck with putting together a
> query.
> 
> 
> A DUMP of my sample dataset is as follows,
> 
> 
> log = LOAD 'example-users.txt' AS (user:chararray, page:chararray);
> DUMP log;
> 
> 
> 
> (User1,a)
> (User1,b)
> (User2,f)
> (User3,b)
> (User2,a)
> (User1,e)
> (User2,b)
> (User2,c)
> (User3,d)
> (User1,d)
> (User2,e)
> (User2,a)
> (User3,c)
> (User1,d)
> (User2,c)
> (User3,a)
> (User1,d)
> (User2,b)
> (User2,e)
> (User3,c)
> 
> 
> What I'm trying to do is to say, Users visiting page 'a' also visited this
> list of other pages ranked by number of times the page was visited. Can anyone
> help or give me some guidance?
> 
> 
> Thanks
> Leslie
> 
> 
> 
> 
>