Posted to commits@pig.apache.org by Apache Wiki <wi...@apache.org> on 2008/11/11 20:33:04 UTC
[Pig Wiki] Update of "PigExercise1" by breed
The following page has been changed by breed:
http://wiki.apache.org/pig/PigExercise1
New page:
In this exercise we will work through the example shown in the presentation. We have two datasets: users and pages. {{{users}}} contains the userid and age of every user of some service. {{{pages}}} contains a userid and the url visited by that user. We are going to work through this exercise using the interactive shell: java -jar pig.jar -
We start off by loading the users and pages datasets.
{{{
Users = load '/data/users' as (name, age);
Pages = load 'data/pages' as (user, url);
}}}
What is the format of this data? (Use {{{describe Users;}}} or {{{dump Users;}}} to answer the question.)
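If the dataset is large, dumping all of it may be slow and noisy. One way to look at only a few records is sketched below; the alias UsersHead and the count of 5 are made up for illustration.

{{{
-- print the schema Pig has recorded for Users
describe Users;
-- keep only a handful of records, then print them
UsersHead = limit Users 5;
dump UsersHead;
}}}

describe only reports the schema, while dump actually runs the query and prints the resulting records.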
Now we filter:
{{{
Fltrd = filter Users by
age >= 18 and age <= 25;
}}}
Now let's do the join.
{{{
Jnd = join Fltrd by name, Pages by user;
}}}
What does this data look like? You can use describe to verify your answer.
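Each joined record carries the fields of both inputs. In Pig versions that support it, a field can be qualified with the alias it came from using {{{::}}}, which helps when both inputs have a field of the same name. A minimal sketch, assuming the aliases defined above (NameUrl is a made-up alias):

{{{
-- schema of the joined relation
describe Jnd;
-- pick one field from each side of the join
NameUrl = foreach Jnd generate Fltrd::name, Pages::url;
describe NameUrl;
}}}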
{{{
-- collect the joined records that share the same url
Grpd = group Jnd by url;
}}}
How does group differ from join? Again, use describe to check.
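One way to see the difference: group keeps one record per url and nests the matching joined records in a bag named after the input alias, so flattening that bag should give back records shaped like those in Jnd. A sketch under that assumption (Ungrpd is a made-up alias):

{{{
-- one record per url: the key plus a bag of joined records
describe Grpd;
-- unnest the bag again
Ungrpd = foreach Grpd generate flatten(Jnd);
describe Ungrpd;
}}}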
{{{
Smmd = foreach Grpd generate group,
COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top100 = limit Srtd 100;
store Top100 into 'top100sites';
}}}
This finishes the exercise: we count the clicks for each url, sort by that count, and keep the top 100. Does top100sites contain what you expect?
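To check the result from the interactive shell, the stored output can be loaded back and printed. This is only a sketch: it assumes the store above succeeded with the default storage format, and Top100Check is a made-up alias.

{{{
-- read the stored result back in with the same two fields
Top100Check = load 'top100sites' as (url, clicks);
dump Top100Check;
}}}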