Posted to commits@pig.apache.org by Apache Wiki <wi...@apache.org> on 2009/03/03 22:30:25 UTC

[Pig Wiki] Trivial Update of "PigTutorial" by CorinneC

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.

The following page has been changed by CorinneC:
http://wiki.apache.org/pig/PigTutorial

------------------------------------------------------------------------------
   * '''Local Mode''': To run the scripts in local mode, no Hadoop or HDFS installation is required. All files are installed and run from your local host and file system.
   * '''Hadoop Mode''': To run the scripts in hadoop (mapreduce) mode, you need access to a Hadoop cluster and HDFS installation.
  
- The Pig tutorial file (attachment:pigtutorial.tar.gz or the tutorial/pigtutorial.tar.gz file in the Pig distribution) includes the Pig JAR file (pig.jar) and the tutorial files (tutorial.jar, Pig scripts, log files). These files work with Hadoop 0.17 and provide everything you need to run the Pig scripts. To get started, follow these basic steps: 
+ The Pig tutorial file (attachment:pigtutorial.tar.gz or the tutorial/pigtutorial.tar.gz file in the Pig distribution) includes the Pig JAR file (pig.jar) and the tutorial files (tutorial.jar, Pig scripts, log files). These files work with Hadoop 0.18 and provide everything you need to run the Pig scripts. To get started, follow these basic steps: 
  
   1. Install Java.
   1. Download the Pig tutorial file and install Pig.
@@ -112, +112 @@

  REGISTER ./tutorial.jar; 
  }}}
  
-  * Use the [http://wiki.apache.org/pig/PigBuiltins PigStorage] function to load the excite log file (excite.log or excite-small.log) into the “raw” bag as an array of records with the fields '''user''', '''time''', and '''query'''. 
+  * Use the [http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm#_PigStorage_ PigStorage] function to load the excite log file (excite.log or excite-small.log) into the “raw” bag as an array of records with the fields '''user''', '''time''', and '''query'''. 
  {{{
  raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, time, query);
  }}}
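For readers new to PigStorage, the LOAD statement above splits each input line on tabs and names the three resulting fields. A rough Python analogue is sketched below; the function name `load_excite` and the sample row are hypothetical, and the null padding only approximates how Pig treats short rows.

```python
def load_excite(lines):
    """Split tab-delimited lines into (user, time, query) tuples."""
    records = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Assumption: pad missing trailing fields with None and drop extras,
        # roughly mirroring how a three-field schema is applied.
        fields = (fields + [None] * 3)[:3]
        records.append(tuple(fields))
    return records


# Hypothetical sample row, not taken from excite.log.
raw = load_excite(["u1\t970916000000\tpig latin\n"])
```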
@@ -138, +138 @@

  ngramed1 = FOREACH houred GENERATE user, hour, flatten(org.apache.pig.tutorial.NGramGenerator(query)) as ngram;
  }}}
  
-  * Use the [http://wiki.apache.org/pig/PigLatin#DISTINCT:_Eliminating_duplicates_in_data DISTINCT] command to get the unique n-grams for all records. 
+  * Use the [http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm#_DISTINCT DISTINCT] command to get the unique n-grams for all records. 
  {{{ 
  ngramed2 = DISTINCT ngramed1;
  }}}
  
-  * Use the [http://wiki.apache.org/pig/PigLatin#COGROUP:_Getting_the_relevant_data_together GROUP] command to group records by n-gram and hour.
+  * Use the [http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm#_GROUP GROUP] command to group records by n-gram and hour.
  {{{ 
  hour_frequency1 = GROUP ngramed2 BY (ngram, hour);
  }}}
  
-  * Use the [http://wiki.apache.org/pig/PigBuiltins COUNT] function to get the count (occurrences) of each n-gram. 
+  * Use the [http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm#_COUNT COUNT] function to get the count (occurrences) of each n-gram. 
  {{{ 
  hour_frequency2 = FOREACH hour_frequency1 GENERATE flatten($0), COUNT($1) as count;
  }}}
  
-  * Use the [http://wiki.apache.org/pig/PigLatin#COGROUP:_Getting_the_relevant_data_together GROUP] command to group records by n-gram only. Each group now corresponds to a distinct n-gram and has the count for each hour.
+  * Use the [http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm#_GROUP GROUP] command to group records by n-gram only. Each group now corresponds to a distinct n-gram and has the count for each hour.
  {{{ 
  uniq_frequency1 = GROUP hour_frequency2 BY group::ngram;
  }}}
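The DISTINCT, GROUP, and COUNT steps above can be read as a small aggregation pipeline. The Python sketch below mimics their effect on in-memory tuples; it is an illustration only, not how Pig executes the script, and the function name `hourly_counts` and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def hourly_counts(ngramed1):
    """ngramed1: iterable of (user, hour, ngram) tuples."""
    # DISTINCT: drop exact duplicate records.
    ngramed2 = set(ngramed1)
    # GROUP BY (ngram, hour) followed by COUNT.
    hour_frequency2 = Counter((ngram, hour) for _user, hour, ngram in ngramed2)
    # GROUP BY ngram: collect the per-hour counts under each n-gram.
    uniq_frequency1 = defaultdict(dict)
    for (ngram, hour), count in hour_frequency2.items():
        uniq_frequency1[ngram][hour] = count
    return dict(uniq_frequency1)


# Hypothetical sample data; the second row is a duplicate of the first.
rows = [
    ("u1", "00", "pig"),
    ("u1", "00", "pig"),
    ("u2", "00", "pig"),
    ("u2", "12", "pig"),
    ("u3", "12", "hadoop"),
]
result = hourly_counts(rows)
```

Here each n-gram ends up with one count per hour, which matches the shape described for the final GROUP step.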
@@ -163, +163 @@

  uniq_frequency2 = FOREACH uniq_frequency1 GENERATE flatten($0), flatten(org.apache.pig.tutorial.ScoreGenerator($1));
  }}}
  
-  * Use the [http://wiki.apache.org/pig/PigLatin#FOREACH_..._GENERATE:_Applying_transformations_to_the_data FOREACH-GENERATE] command to assign names to the fields. 
+  * Use the [http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm#_FOREACH_…_GENERATE FOREACH-GENERATE] command to assign names to the fields. 
  {{{ 
  uniq_frequency3 = FOREACH uniq_frequency2 GENERATE $1 as hour, $0 as ngram, $2 as score, $3 as count, $4 as mean;
  }}}
  
-  * Use the [http://wiki.apache.org/pig/PigLatin#FILTER:_Getting_rid_of_data_you_are_not_interested_in_ FILTER] command to remove all records with a score less than or equal to 2.0.
+  * Use the [http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm#_FILTER_ FILTER] command to remove all records with a score less than or equal to 2.0.
  {{{ 
  filtered_uniq_frequency = FILTER uniq_frequency3 BY score > 2.0;
  }}}
  
-  * Use the [http://wiki.apache.org/pig/PigLatin#ORDER:_Sorting_data_according_to_some_field ORDER] command to sort the remaining records by hour and score. 
+  * Use the [http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm#_ORDER_ ORDER] command to sort the remaining records by hour and score. 
  {{{ 
  ordered_uniq_frequency = ORDER filtered_uniq_frequency BY (hour, score);
  }}}
  
-  * Use the [http://wiki.apache.org/pig/PigBuiltins PigStorage] function to store the results. The output file contains a list of n-grams with the following fields: '''hour''', '''ngram''', '''score''', '''count''', '''mean'''.
+  * Use the [http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm#_PigStorage_ PigStorage] function to store the results. The output file contains a list of n-grams with the following fields: '''hour''', '''ngram''', '''score''', '''count''', '''mean'''.
  {{{ 
  STORE ordered_uniq_frequency INTO '/tmp/tutorial-results' USING PigStorage(); 
  }}}
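The FILTER and ORDER steps above keep only records scoring above 2.0 and then sort by hour and score. A minimal Python sketch of the same logic follows; the function name `filter_and_order` and the sample records are hypothetical, and the tuple layout assumes the field order (hour, ngram, score, count, mean) given for the output.

```python
def filter_and_order(records, min_score=2.0):
    """records: iterable of (hour, ngram, score, count, mean) tuples."""
    # FILTER ... BY score > 2.0: records at or below the threshold are dropped.
    kept = [r for r in records if r[2] > min_score]
    # ORDER ... BY (hour, score): ascending on both keys.
    return sorted(kept, key=lambda r: (r[0], r[2]))


# Hypothetical sample records.
sample = [
    ("12", "pig", 3.5, 7, 2.0),
    ("00", "hadoop", 2.0, 4, 2.0),
    ("00", "latin", 4.1, 9, 3.0),
    ("00", "excite", 2.5, 5, 3.0),
]
ordered = filter_and_order(sample)
```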
@@ -194, +194 @@

  REGISTER ./tutorial.jar;
  }}}
   
-  * Use the [http://wiki.apache.org/pig/PigBuiltins PigStorage] function to load the excite log file (excite.log or excite-small.log) into the “raw” bag as an array of records with the fields '''user''', '''time''', and '''query'''.
+  * Use the [http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm#_PigStorage_ PigStorage] function to load the excite log file (excite.log or excite-small.log) into the “raw” bag as an array of records with the fields '''user''', '''time''', and '''query'''.
  {{{
  raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, time, query);
  }}}
@@ -223, +223 @@

  }}}
  
   
-  * Use the [http://wiki.apache.org/pig/PigLatin#DISTINCT:_Eliminating_duplicates_in_data DISTINCT] command to get the unique n-grams for all records. 
+  * Use the [http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm#_DISTINCT DISTINCT] command to get the unique n-grams for all records. 
  {{{
  ngramed2 = DISTINCT ngramed1;
  }}}
  
  
-  * Use the [http://wiki.apache.org/pig/PigLatin#COGROUP:_Getting_the_relevant_data_together GROUP] command to group the records by n-gram and hour. 
+  * Use the [http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm#_GROUP GROUP] command to group the records by n-gram and hour. 
  {{{
  hour_frequency1 = GROUP ngramed2 BY (ngram, hour);
  }}}
  
  
-  * Use the [http://wiki.apache.org/pig/PigBuiltins COUNT] function to get the count (occurrences) of each n-gram. 
+  * Use the [http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm#_COUNT COUNT] function to get the count (occurrences) of each n-gram. 
  {{{
  hour_frequency2 = FOREACH hour_frequency1 GENERATE flatten($0), COUNT($1) as count;
  }}}
  
  
-  * Use the [http://wiki.apache.org/pig/PigLatin#FOREACH_..._GENERATE:_Applying_transformations_to_the_data FOREACH-GENERATE] command to assign names to the fields.
+  * Use the [http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm#_FOREACH_…_GENERATE FOREACH-GENERATE] command to assign names to the fields.
  {{{
  hour_frequency3 = FOREACH hour_frequency2 GENERATE $0 as ngram, $1 as hour, $2 as count;
  }}}
  
  
-  * Use the [http://wiki.apache.org/pig/PigLatin#FILTER:_Getting_rid_of_data_you_are_not_interested_in_ FILTER] command to get the n-grams for hour ‘00’ 
+  * Use the [http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm#_FILTER_ FILTER] command to get the n-grams for hour ‘00’ 
  {{{
  hour00 = FILTER hour_frequency2 BY hour eq '00';
  }}}
  
  
-  * Use the [http://wiki.apache.org/pig/PigLatin#FILTER:_Getting_rid_of_data_you_are_not_interested_in_ FILTER] command to get the n-grams for hour ‘12’
+  * Use the [http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm#_FILTER_ FILTER] command to get the n-grams for hour ‘12’
  {{{
  hour12 = FILTER hour_frequency3 BY hour eq '12';
  }}}
  
   
-  * Use the [http://wiki.apache.org/pig/PigLatin#Joining JOIN] command to get the n-grams that appear in both hours.
+  * Use the [http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm#_JOIN JOIN] command to get the n-grams that appear in both hours.
  {{{
  same = JOIN hour00 BY $0, hour12 BY $0;
  }}}
  
-  * Use the [http://wiki.apache.org/pig/PigLatin#FOREACH_..._GENERATE:_Applying_transformations_to_the_data FOREACH-GENERATE] command to record their frequency.
+  * Use the [http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm#_FOREACH_…_GENERATE FOREACH-GENERATE] command to record their frequency.
  {{{
  same1 = FOREACH same GENERATE hour_frequency2::hour00::group::ngram as ngram, $2 as count00, $5 as count12;
  }}}
  
-  * Use the [http://wiki.apache.org/pig/PigBuiltins PigStorage] function to store the results. The output file contains a list of n-grams with the following fields: '''ngram''', '''count00''', '''count12'''.
+  * Use the [http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.htm#_PigStorage_ PigStorage] function to store the results. The output file contains a list of n-grams with the following fields: '''ngram''', '''count00''', '''count12'''.
  {{{
  STORE same1 INTO '/tmp/tutorial-join-results' USING PigStorage();
  }}}
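The two FILTER steps and the JOIN above amount to an inner join on the n-gram between the hour ‘00’ and hour ‘12’ subsets. The Python sketch below shows that logic on in-memory tuples; the function name `join_hours` and the sample rows are hypothetical, and it assumes each (ngram, hour) pair appears once, as it would after the GROUP and COUNT steps.

```python
def join_hours(hour_frequency3, first="00", second="12"):
    """hour_frequency3: iterable of (ngram, hour, count) tuples."""
    rows = list(hour_frequency3)
    # FILTER ... BY hour eq '00' and FILTER ... BY hour eq '12'.
    counts_first = {ngram: count for ngram, hour, count in rows if hour == first}
    counts_second = {ngram: count for ngram, hour, count in rows if hour == second}
    # JOIN on ngram: keep only n-grams present in both hours.
    shared = counts_first.keys() & counts_second.keys()
    # Emit (ngram, count00, count12), sorted here only for stable output.
    return sorted((n, counts_first[n], counts_second[n]) for n in shared)


# Hypothetical sample rows.
sample = [
    ("pig", "00", 2),
    ("pig", "12", 1),
    ("hadoop", "12", 4),
    ("latin", "00", 3),
]
same1 = join_hours(sample)
```

Only "pig" occurs in both hours in this sample, so it is the only n-gram that survives the join.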