Posted to user@pig.apache.org by Vincent <vi...@gmail.com> on 2010/09/06 16:06:03 UTC

Matches on tuples

Good afternoon,

I am using Pig on server logs to compute statistics on visited pages.

So far I am able to detect the following matches:

- a user has visited a page that matches a given aim.
- a user has visited any one of the pages that belong to a given aim.
*aim.id file:*

1@1@add_to_kart              --> aim 1 can be add_to_kart
1@1@browse                     --> or aim 1 can be browse
1@2@paid                         --> aim 2 is only paid

*site.log file:*

user1,1,www.site.com/browse
user1,1,www.site.com/browse
user2,1,www.site.com/add_to_kart
user2,1,www.site.com/add_to_kart
user2,1,www.site.com/paid
user2,1,www.site.com/browse
user3,1,www.site.com/browse

*Pig script:*

register 'piggybank.jar';

-- load aim id database:
aim_ids = LOAD 'aim.id' USING PigStorage('@')
    AS (aim_site_id : int, aim_id : int, aim_url : chararray);

DUMP aim_ids;

-- load site log:
site = LOAD 'site.log' USING PigStorage(',')
    AS (user_id : chararray, site_id : int, url : chararray);

site_all_aims = JOIN site BY site_id, aim_ids BY aim_site_id;

site_match = FOREACH site_all_aims GENERATE user_id, site_id, aim_id,
org.apache.pig.piggybank.evaluation.string.INDEXOF(url, aim_url) AS match;

site_aims = FILTER site_match BY (match != -1) AND (match IS NOT null);

DUMP site_aims;

*results:*

(user1,1,1,13) --> user 1 achieved aim 1
(user1,1,1,13) --> user 1 achieved aim 1
(user2,1,1,13) --> user 2 achieved aim 1
(user2,1,1,13) --> user 2 achieved aim 1
(user2,1,2,13) --> user 2 achieved aim 2
(user2,1,1,13) --> user 2 achieved aim 1
(user3,1,1,13) --> user 3 achieved aim 1


Now I would like to check that a user has visited several pages in order to
achieve one aim. For example, to achieve aim 3, a user needs to visit
"browse" AND "add_to_kart" AND "paid".

My idea was to load tuples of aim:

1@3@{(browse),(add_to_kart),(paid)}

And to write a UDF that compares the aim's URL tuple with the bag of URLs
the user has visited on the site.

But I am not able to load tuples with an undefined number of elements, and
the aims might look like this:

1@3@{(browse),(add_to_kart),(paid)}
1@4@{(browse_something_else),(add_to_kart_something_else),(paid_something_else),(another_page)}

So I am stuck on this problem for now, and still looking for another way
to write the script and the aim.id file.
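
One possible direction, given here only as an untested sketch, is to avoid
loading variable-length tuples at all: keep one required page per line in
aim.id and compare counts. It assumes (this is not how the current file is
read) that all lines sharing the same site and aim id must ALL be visited,
that aim.id contains no duplicate lines, and that aim 3 is written as three
lines (1@3@browse, 1@3@add_to_kart, 1@3@paid). The relation names
aim_sizes, uniq_hits, user_counts and achieved are made up for the example.

```pig
register 'piggybank.jar';

-- Assumed layout: one required page per line, e.g. 1@3@browse
aim_ids = LOAD 'aim.id' USING PigStorage('@')
    AS (aim_site_id : int, aim_id : int, aim_url : chararray);
site = LOAD 'site.log' USING PigStorage(',')
    AS (user_id : chararray, site_id : int, url : chararray);

-- How many pages each aim requires.
aims_grouped = GROUP aim_ids BY (aim_site_id, aim_id);
aim_sizes = FOREACH aims_grouped GENERATE
    FLATTEN(group) AS (size_site_id, size_aim_id),
    COUNT(aim_ids) AS required;

-- Same substring match as in the script above, keeping the matched aim_url.
site_all_aims = JOIN site BY site_id, aim_ids BY aim_site_id;
site_match = FOREACH site_all_aims GENERATE user_id, site_id, aim_id, aim_url,
    org.apache.pig.piggybank.evaluation.string.INDEXOF(url, aim_url) AS match;
site_hits = FILTER site_match BY (match IS NOT null) AND (match != -1);

-- Count each required page once per user, however often it was visited.
hit_pages = FOREACH site_hits GENERATE user_id, site_id, aim_id, aim_url;
uniq_hits = DISTINCT hit_pages;
hits_grouped = GROUP uniq_hits BY (user_id, site_id, aim_id);
user_counts = FOREACH hits_grouped GENERATE
    FLATTEN(group) AS (user_id, site_id, aim_id),
    COUNT(uniq_hits) AS visited;

-- An aim is achieved when every required page has been seen.
joined = JOIN user_counts BY (site_id, aim_id),
              aim_sizes BY (size_site_id, size_aim_id);
achieved = FILTER joined BY visited == required;
result = FOREACH achieved GENERATE user_id, site_id, aim_id;

DUMP result;
```

With the site.log above, only user2 visits browse, add_to_kart and paid, so
user2 would be the one expected to achieve aim 3. Note that this changes
the meaning of aim 1: with one page per line read as AND, the "either
page" aims would need a separate file or an extra column to tell the two
kinds apart.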

If any of you has an idea, please mail me.

Thanks

Vincent HERVIEUX