Posted to commits@pig.apache.org by Apache Wiki <wi...@apache.org> on 2008/11/11 20:33:04 UTC

[Pig Wiki] Update of "PigExercise1" by breed

The following page has been changed by breed:
http://wiki.apache.org/pig/PigExercise1

New page:
In this exercise we will work through the example shown in the presentation. We have two datasets: users and pages. {{{users}}} contains the userid and age of every user using some service. {{{pages}}} contains the userid and url visited by that user. We are going to work through this exercise using the interactive shell: {{{java -jar pig.jar -}}}

We start off by loading the users and pages datasets.

{{{
Users = load '/data/users' as (name, age);
Pages = load 'data/pages' as (user, url);
}}}

What is the format of this data? (Use {{{describe Users;}}} or {{{dump Users;}}} to answer the question.)
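
The inspection commands mentioned above can be typed directly at the interactive prompt. The following is a minimal sketch rather than part of the original exercise: the alias {{{TypedUsers}}} is hypothetical, and the typed schema assumes a Pig release that supports declared field types, with the types guessed from the description of the datasets.

{{{
-- Print the schema Pig has recorded for each relation.
describe Users;
describe Pages;

-- Execute the load and print every tuple (can be slow on large inputs).
dump Users;

-- Hypothetical variant: declare the field types explicitly (assumed types).
TypedUsers = load '/data/users' as (name:chararray, age:int);
describe TypedUsers;
}}}

Comparing the two {{{describe}}} outputs for {{{Users}}} and {{{TypedUsers}}} is one way to see what Pig knows about each field.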

Now we filter the users, keeping only those aged 18 to 25:

{{{
Fltrd = filter Users by
        age >= 18 and age <= 25;
}}}

Now let's do the join.

{{{
Jnd = join Fltrd by name, Pages by user;
}}}

What does this data look like? You can use {{{describe Jnd;}}} to verify your answer.
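
As a hint for reading the join result: a joined relation carries the fields of both inputs, and Pig can qualify each field name with the alias it came from using {{{::}}}. The sketch below assumes a Pig release that supports alias-qualified names; the alias {{{Proj}}} is hypothetical and is not used in the rest of the exercise.

{{{
-- Show the schema of the joined relation.
describe Jnd;

-- Project one field from each side of the join using alias-qualified names.
Proj = foreach Jnd generate Fltrd::name, Pages::url;
dump Proj;
}}}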

Next we group the joined records by url:

{{{
Grpd = group Jnd by url;
}}}

How does group differ from join? Again, use {{{describe Grpd;}}} to check.

{{{
Smmd = foreach Grpd generate group,
       COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top100 = limit Srtd 100;
store Top100 into 'top100sites';
}}}

Finish it up. Does top100sites contain what you expect?
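
One way to check the stored result is to load it back and print it. This is a sketch under assumptions: the alias {{{Result}}} is hypothetical, and the field names in the schema are labels chosen to match the {{{Smmd}}} relation above.

{{{
-- Reload the stored output; the field names are labels chosen for illustration.
Result = load 'top100sites' as (url, clicks);
describe Result;
dump Result;
}}}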