Posted to user@pig.apache.org by Vincent <vi...@gmail.com> on 2010/09/06 16:06:03 UTC
Matches on tuples
Good afternoon,
I am using Pig on server logs to compute statistics on visited pages.
For now I am able to detect these matches:
- a user has visited a page matching a given aim.
- a user has visited a page matching any one of the URLs listed for an aim.
*aim.id file:*
1@1@add_to_kart --> aim 1 can be add_to_kart
1@1@browse --> or aim 1 can be browse
1@2@paid --> aim 2 is only paid
*site.log file:*
user1,1,www.site.com/browse
user1,1,www.site.com/browse
user2,1,www.site.com/add_to_kart
user2,1,www.site.com/add_to_kart
user2,1,www.site.com/paid
user2,1,www.site.com/browse
user3,1,www.site.com/browse
*Pig script:*
register 'piggybank.jar';
-- load aim id database:
aim_ids = LOAD 'aim.id' USING PigStorage('@')
    AS (aim_site_id : int, aim_id : int, aim_url : chararray);
DUMP aim_ids;
-- load site log:
site = LOAD 'site.log' USING PigStorage(',') AS (user_id : chararray ,
site_id : int, url : chararray);
site_all_aims = JOIN site BY site_id, aim_ids BY aim_site_id;
site_match = FOREACH site_all_aims GENERATE user_id, site_id, aim_id,
org.apache.pig.piggybank.evaluation.string.INDEXOF(url, aim_url) AS match;
site_aims = FILTER site_match BY (match != -1) AND (match IS NOT null);
DUMP site_aims;
*results:*
(user1,1,1,13) --> user 1 achieved aim 1
(user1,1,1,13) --> user 1 achieved aim 1
(user2,1,1,13) --> user 2 achieved aim 1
(user2,1,1,13) --> user 2 achieved aim 1
(user2,1,2,13) --> user 2 achieved aim 2
(user2,1,1,13) --> user 2 achieved aim 1
(user3,1,1,13) --> user 3 achieved aim 1
Now I would like to check that a user has visited several pages to achieve
one aim. For example, to achieve aim 3, a user needs to visit "browse" AND
"add_to_kart" AND "paid".
My idea was to load each aim's URLs as a bag of tuples:
1@3@{(browse),(add_to_kart),(paid)}
and to write a UDF that compares the aim's URL bag with the bag of URLs the
user visited on the site.
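As a rough illustration of the comparison such a UDF would perform, here is a minimal sketch in Python, outside Pig. The function name achieved_aim is hypothetical, and the substring test is an assumption chosen to mirror the INDEXOF check in the script above. An actual Pig UDF would have to wrap this logic in Pig's UDF interface.

```python
def achieved_aim(required_fragments, visited_urls):
    """Return True if every required fragment occurs in at least one visited URL.

    Hypothetical helper: substring matching stands in for the INDEXOF != -1
    test used in the Pig script.
    """
    return all(
        any(fragment in url for url in visited_urls)
        for fragment in required_fragments
    )


# Aim 3 from the example: all three pages must have been visited.
aim_3 = ["browse", "add_to_kart", "paid"]

# Distinct URLs from site.log for user2 and user1.
user2_urls = [
    "www.site.com/add_to_kart",
    "www.site.com/paid",
    "www.site.com/browse",
]
user1_urls = ["www.site.com/browse"]

print(achieved_aim(aim_3, user2_urls))  # True: user2 visited all three pages
print(achieved_aim(aim_3, user1_urls))  # False: user1 only browsed
```

With the sample log, only user2 would achieve aim 3, since user1 and user3 only visited the browse page.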
But I am not able to load bags with an undefined number of elements, and
aims might look like:
1@3@{(browse),(add_to_kart),(paid)}
1@4@{(browse_something_else),(add_to_kart_something_else),(paid_something_else),(another_page)}
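To make the intended file format concrete, the following minimal Python sketch (outside Pig) splits such a line into a site id, an aim id and a list of URL fragments of any length. The function name parse_aim_line is hypothetical, and the sketch assumes that fragments never contain parentheses and that the first two '@' characters are the field separators.

```python
import re


def parse_aim_line(line):
    """Split 'site@aim@{(a),(b),...}' into (site_id, aim_id, [a, b, ...])."""
    # Split on the first two '@' only, so the bag text stays in one piece.
    site_id, aim_id, bag = line.strip().split("@", 2)
    # Each bag element is written as a one-field tuple: (fragment)
    fragments = re.findall(r"\(([^()]*)\)", bag)
    return int(site_id), int(aim_id), fragments


print(parse_aim_line("1@3@{(browse),(add_to_kart),(paid)}"))
# (1, 3, ['browse', 'add_to_kart', 'paid'])
```

The number of elements in the braces does not need to be known in advance, which is the property the aim.id format needs.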
So I am stuck on this problem for now, and still looking for another way
to write this script and the aim.id file.
If any of you has any idea, please mail me.
Thanks
Vincent HERVIEUX