Posted to commits@pig.apache.org by ol...@apache.org on 2009/11/12 19:43:46 UTC

svn commit: r835496 - in /hadoop/pig/trunk: ./ src/docs/src/documentation/content/xdocs/

Author: olga
Date: Thu Nov 12 18:43:45 2009
New Revision: 835496

URL: http://svn.apache.org/viewvc?rev=835496&view=rev
Log:
PIG-1089: Pig 0.6.0 Documentation  (chandec via olgan)

Modified:
    hadoop/pig/trunk/CHANGES.txt
    hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_reference.xml
    hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_users.xml
    hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/tabs.xml
    hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/udf.xml

Modified: hadoop/pig/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/hadoop/pig/trunk/CHANGES.txt?rev=835496&r1=835495&r2=835496&view=diff
==============================================================================
--- hadoop/pig/trunk/CHANGES.txt (original)
+++ hadoop/pig/trunk/CHANGES.txt Thu Nov 12 18:43:45 2009
@@ -26,6 +26,8 @@
 
 IMPROVEMENTS
 
+PIG-1089: Pig 0.6.0 Documentation  (chandec via olgan)
+
 PIG-958: Splitting output data on key field (ankur via pradeepkth)
 
 PIG-1058: FINDBUGS: remaining "Correctness Warnings" (olgan)

Modified: hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_reference.xml
URL: http://svn.apache.org/viewvc/hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_reference.xml?rev=835496&r1=835495&r2=835496&view=diff
==============================================================================
--- hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_reference.xml (original)
+++ hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_reference.xml Thu Nov 12 18:43:45 2009
@@ -5412,7 +5412,7 @@
    <informaltable frame="all">
       <tgroup cols="1"><tbody><row>
             <entry>
-               <para>alias = GROUP alias { ALL | BY expression} [, alias ALL | BY expression …] [PARALLEL n];</para>
+               <para>alias = GROUP alias { ALL | BY expression} [, alias ALL | BY expression …]  [USING "collected"] [PARALLEL n];</para>
             </entry>
          </row></tbody></tgroup>
    </informaltable></section>
@@ -5454,6 +5454,27 @@
                <para>A tuple expression. This is the group key or key field. If the result of the tuple expression is a single field, the key will be the value of the first field rather than a tuple with one field.</para>
             </entry>
          </row>
+         
+         <row>
+            <entry>
+               <para>USING</para>
+            </entry>
+            <entry>
+               <para>Keyword</para>
+            </entry>
+         </row>
+         <row>
+            <entry>
+               <para>"collected"</para>
+            </entry>
+            <entry>
+               <para>Allows for more efficient computation of a group if the loader guarantees that the data for the 
+               same key is contiguous and is given to a single map. As of this release, only the Zebra loader makes this 
+               guarantee. The efficiency is achieved by performing the group operation in the map phase
+               rather than the reduce phase (see <ulink url="piglatin_users.html#Integration+with+Zebra">Integration with Zebra</ulink>). This feature cannot be used with the COGROUP operator.</para>
+            </entry>
+         </row>         
+         
 
          <row>
             <entry>
@@ -5553,6 +5574,10 @@
 (19,{(Mary)})
 (20,{(Bill)})
 </programlisting>
+
+   </section>
+   <section>
+   <title>Example</title>
    
    <para>Suppose we have relation A.</para>
 <programlisting>
@@ -5629,6 +5654,18 @@
 </programlisting>
    
    </section>
+   
+   <section>
+   <title>Example</title>
+<para>This example shows a map-side group.</para>   
+<programlisting>
+ register zebra.jar;
+ A = LOAD 'studentsortedtab' USING org.apache.hadoop.zebra.pig.TableLoader('name, age, gpa', 'sorted');
+ B = GROUP A BY name USING "collected";
+ C = FOREACH B GENERATE group, MAX(A.age), COUNT_STAR(A);
+</programlisting>
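The single-pass grouping that "collected" enables can be sketched outside Pig. The following is an illustrative Java sketch, not Pig's implementation (class and method names are hypothetical); it assumes the rows arrive sorted by key, as the loader guarantees, so each group can be closed as soon as the key changes and no reduce phase is needed:

```java
import java.util.ArrayList;
import java.util.List;

public class CollectedGroup {
    // Each row is {name, age}; rows must arrive pre-sorted by name.
    // Emits "name,maxAge,count" once per contiguous run of a key.
    public static List<String> groupSorted(List<String[]> rows) {
        List<String> out = new ArrayList<>();
        String current = null;
        int maxAge = Integer.MIN_VALUE, count = 0;
        for (String[] row : rows) {
            if (!row[0].equals(current)) {
                // key changed: the previous group is complete, emit it
                if (current != null) out.add(current + "," + maxAge + "," + count);
                current = row[0];
                maxAge = Integer.MIN_VALUE;
                count = 0;
            }
            maxAge = Math.max(maxAge, Integer.parseInt(row[1]));
            count++;
        }
        if (current != null) out.add(current + "," + maxAge + "," + count);
        return out;
    }
}
```

If the same key could appear in two different maps, this one-pass scheme would emit partial groups, which is why the loader's contiguity guarantee is required.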
+    </section>
+
    </section>
    
    <section>
@@ -5793,7 +5830,8 @@
    <informaltable frame="all">
       <tgroup cols="1"><tbody><row>
             <entry>
-               <para>alias = JOIN left-alias BY left-alias-column [LEFT|RIGHT|FULL] [OUTER], right-alias BY right-alias-column [PARALLEL n];  </para>
+               <para>alias = JOIN left-alias BY left-alias-column [LEFT|RIGHT|FULL] [OUTER], right-alias BY right-alias-column 
+               [USING "replicated" | "skewed"] [PARALLEL n];  </para>
             </entry>
          </row></tbody></tgroup>
    </informaltable>
@@ -5865,6 +5903,34 @@
             </entry>
          </row>
 
+  <row>
+            <entry>
+               <para>USING</para>
+            </entry>
+            <entry>
+               <para>Keyword</para>
+            </entry>
+         </row>
+         <row>
+            <entry>
+               <para>"replicated"</para>
+            </entry>
+            <entry>
+               <para>Use to perform fragment replicate joins (see <ulink url="piglatin_users.html#Fragment+Replicate+Joins">Fragment Replicate Joins</ulink>).</para>
+               <para>Only left outer join is supported for replicated outer join.</para>
+            </entry>
+         </row>
+         
+                  <row>
+            <entry>
+               <para>"skewed"</para>
+            </entry>
+            <entry>
+               <para>Use to perform skewed joins (see <ulink url="piglatin_users.html#Skewed+Joins">Skewed Joins</ulink>).</para>
+            </entry>
+         </row>
+
+
          <row>
             <entry>
                <para>PARALLEL n</para>
@@ -5929,6 +5995,21 @@
 B = LOAD 'b.txt' AS (n:chararray, m:chararray);
 C = JOIN A BY $0 FULL, B BY $0;
 </programlisting>
+
+<para>This example shows a replicated left outer join.</para>
+<programlisting>
+A = LOAD 'large';
+B = LOAD 'tiny';
+C = JOIN A BY $0 LEFT, B BY $0 USING "replicated";
+</programlisting>
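A replicated join works by keeping the tiny relation fully in memory and streaming the large relation past it. Here is a hypothetical Java sketch of the underlying build/probe idea, not Pig's actual code (all names are illustrative); it shows the left outer semantics, where unmatched left rows are kept:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReplicatedJoin {
    // large/tiny rows are {key, value} pairs; the tiny side must fit in memory.
    public static List<String> leftJoin(List<String[]> large, List<String[]> tiny) {
        Map<String, String> replicated = new HashMap<>();
        for (String[] t : tiny) replicated.put(t[0], t[1]);   // build phase
        List<String> out = new ArrayList<>();
        for (String[] l : large) {
            String match = replicated.get(l[0]);              // probe phase
            out.add(l[0] + "," + l[1] + "," + match);         // "null" if unmatched (outer)
        }
        return out;
    }
}
```

Because the probe side is only streamed, no shuffle is needed; this is also why all small relations together must fit in memory.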
+
+<para>This example shows a skewed full outer join.</para>
+<programlisting>
+A = LOAD 'studenttab' AS (name, age, gpa);
+B = LOAD 'votertab' AS (name, age, registration, contribution);
+C = JOIN A BY name FULL, B BY name USING "skewed";
+</programlisting>
+
 </section>
 </section>  
   
@@ -8739,12 +8820,78 @@
 </programlisting>
    </section></section></section>
    
+      <!-- Shell COMMANDS-->
+   <section>
+   <title>Shell Commands</title>
+   
+      <section>
+   <title>fs</title>
+   <para>Invokes any FSShell command from within a Pig script or the Grunt shell.</para>
+   
+   <section>
+   <title>Syntax </title>
+   <informaltable frame="all">
+      <tgroup cols="1"><tbody><row>
+            <entry>
+               <para>fs subcommand subcommand_parameters </para>
+            </entry>
+         </row></tbody></tgroup>
+   </informaltable></section>
+   
+   <section>
+   <title>Terms</title>
+   <informaltable frame="all">
+      <tgroup cols="2">
+      <tbody>
+      <row>
+            <entry>
+               <para>subcommand</para>
+            </entry>
+            <entry>
+               <para>The FSShell command.</para>
+            </entry>
+         </row>
+               <row>
+            <entry>
+               <para>subcommand_parameters</para>
+            </entry>
+            <entry>
+               <para>The FSShell command parameters.</para>
+            </entry>
+         </row>
+         </tbody>
+         </tgroup>
+   </informaltable>
    
+   </section>
    
+   <section>
+   <title>Usage</title>
+   <para>Use the fs command to invoke any FSShell command from within a Pig script or the Grunt shell. 
+   The fs command greatly extends the set of supported file system commands and the capabilities 
+   of existing commands such as ls, which now supports globbing. For a complete list of 
+   FSShell commands, see the 
+   <ulink url="http://hadoop.apache.org/common/docs/current/hdfs_shell.html">HDFS File System Shell Guide</ulink>.</para>
+   </section>
+   
+   <section>
+   <title>Examples</title>
+   <para>In these examples, a directory is created, a file is copied, and a file is listed.</para>
+<programlisting>
+fs -mkdir /tmp
+fs -copyFromLocal file-x file-y
+fs -ls file-y
+</programlisting>
+   </section>
+    </section>
+        </section>
+    
    <!-- FILE COMMANDS-->
    <section>
    <title>File Commands</title>
-
+   <para>Note: Beginning with Pig 0.6.0, the file commands are deprecated and will be removed in a future release. 
+   Use Pig's fs command to invoke the <ulink url="piglatin_reference.html#Shell+Commands">shell commands</ulink> instead.   
+   </para>
    <section>
    <title>cat</title>
    <para>Prints the content of one or more files to the screen.</para>
@@ -8786,7 +8933,8 @@
 john adams
 anne white
 </programlisting>
-   </section></section>
+   </section>
+   </section>
    
    <section>
    <title>cd</title>
@@ -8984,96 +9132,7 @@
 </programlisting>
    </section></section>
    
-   <section>
-   <title>exec</title>
-   <para>Run a Pig script.</para>
-   
-   <section>
-   <title>Syntax</title>
-   <informaltable frame="all">
-      <tgroup cols="1"><tbody><row>
-            <entry>
-               <para>exec [–param param_name = param_value] [–param_file file_name] script  </para>
-            </entry>
-         </row></tbody></tgroup>
-   </informaltable></section>
-   
-   <section>
-   <title>Terms</title>
-   <informaltable frame="all">
-   <tgroup cols="2"><tbody>
-        <row>
-            <entry>
-               <para>–param param_name = param_value</para>
-            </entry>
-            <entry>
-               <para>See Parameter Substitution.</para>
-            </entry>
-        </row>
-
-        <row>
-            <entry>
-               <para>–param_file file_name</para>
-            </entry>
-            <entry>
-               <para>See Parameter Substitution. </para>
-            </entry>
-        </row>
-   
-      <row>
-            <entry>
-               <para>script</para>
-            </entry>
-            <entry>
-               <para>The name of a Pig script.</para>
-            </entry>
-         </row>
-         
-   </tbody></tgroup>
-   </informaltable></section>
-   
-   <section>
-   <title>Usage</title>
-   <para>Use the exec command to run a Pig script with no interaction between the script and the Grunt shell (batch mode). Aliases defined in the script are not available to the shell; however, the files produced as the output of the script and stored on the system are visible after the script is run. Aliases defined via the shell are not available to the script. </para>
-   <para>With the exec command, store statements will not trigger execution; rather, the entire script is parsed before execution starts. Unlike the run command, exec does not change the command history or remembers the handles used inside the script. Exec without any parameters can be used in scripts to force execution up to the point in the script where the exec occurs. </para>
-   <para>For comparison, see the run command. Both the exec and run commands are useful for debugging because you can modify a Pig script in an editor and then rerun the script in the Grunt shell without leaving the shell. Also, both commands promote Pig script modularity as they allow you to reuse existing components.</para>
-   </section>
-   
-   <section>
-   <title>Examples</title>
-   <para>In this example the script is displayed and run.</para>
-
-<programlisting>
-grunt&gt; cat myscript.pig
-a = LOAD 'student' AS (name, age, gpa);
-b = LIMIT a 3;
-DUMP b;
-
-grunt&gt; exec myscript.pig
-(alice,20,2.47)
-(luke,18,4.00)
-(holly,24,3.27)
-</programlisting>
-
-   <para>In this example parameter substitution is used with the exec command.</para>
-<programlisting>
-grunt&gt; cat myscript.pig
-a = LOAD 'student' AS (name, age, gpa);
-b = ORDER a BY name;
-
-STORE b into '$out';
-
-grunt&gt; exec –param out=myoutput myscript.pig
-</programlisting>
-
-      <para>In this example multiple parameters are specified.</para>
-<programlisting>
-grunt&gt; exec –param p1=myparam1 –param p2=myparam2 myscript.pig
-</programlisting>
-
-   </section>
-   
-   </section>
+ 
    
    <section>
    <title>ls</title>
@@ -9343,8 +9402,15 @@
 </programlisting>
    </section></section>
    
+
+   </section>
+   
+   
    <section>
-   <title>run</title>
+   <title>Utility Commands</title>
+   
+  <section>
+   <title>exec</title>
    <para>Run a Pig script.</para>
    
    <section>
@@ -9352,7 +9418,7 @@
    <informaltable frame="all">
       <tgroup cols="1"><tbody><row>
             <entry>
-               <para>run [–param param_name = param_value] [–param_file file_name] script </para>
+               <para>exec [–param param_name = param_value] [–param_file file_name] script  </para>
             </entry>
          </row></tbody></tgroup>
    </informaltable></section>
@@ -9361,23 +9427,24 @@
    <title>Terms</title>
    <informaltable frame="all">
    <tgroup cols="2"><tbody>
-         <row>
+        <row>
             <entry>
                <para>–param param_name = param_value</para>
             </entry>
             <entry>
                <para>See Parameter Substitution.</para>
             </entry>
-         </row>
+        </row>
 
-         <row>
+        <row>
             <entry>
                <para>–param_file file_name</para>
             </entry>
             <entry>
                <para>See Parameter Substitution. </para>
             </entry>
-         </row>
+        </row>
+   
       <row>
             <entry>
                <para>script</para>
@@ -9392,49 +9459,47 @@
    
    <section>
    <title>Usage</title>
-   <para>Use the run command to run a Pig script that can interact with the Grunt shell (interactive mode). The script has access to aliases defined externally via the Grunt shell. The Grunt shell has access to aliases defined within the script. All commands from the script are visible in the command history. </para>   
-	<para>With the run command, every store triggers execution. The statements from the script are put into the command history and all the aliases defined in the script can be referenced in subsequent statements after the run command has completed. Issuing a run command on the grunt command line has basically the same effect as typing the statements manually. </para>   
-   <para>For comparison, see the exec command. Both the run and exec commands are useful for debugging because you can modify a Pig script in an editor and then rerun the script in the Grunt shell without leaving the shell. Also, both commands promote Pig script modularity as they allow you to reuse existing components.</para>
-  </section>
+   <para>Use the exec command to run a Pig script with no interaction between the script and the Grunt shell (batch mode). Aliases defined in the script are not available to the shell; however, the files produced as the output of the script and stored on the system are visible after the script is run. Aliases defined via the shell are not available to the script. </para>
+   <para>With the exec command, store statements will not trigger execution; rather, the entire script is parsed before execution starts. Unlike the run command, exec does not change the command history or remember the handles used inside the script. Exec without any parameters can be used in scripts to force execution up to the point in the script where the exec occurs. </para>
+   <para>For comparison, see the run command. Both the exec and run commands are useful for debugging because you can modify a Pig script in an editor and then rerun the script in the Grunt shell without leaving the shell. Also, both commands promote Pig script modularity as they allow you to reuse existing components.</para>
+   </section>
    
    <section>
-   <title>Example</title>
-   <para>In this example the script interacts with the results of commands issued via the Grunt shell.</para>
+   <title>Examples</title>
+   <para>In this example the script is displayed and run.</para>
+
 <programlisting>
 grunt&gt; cat myscript.pig
-b = ORDER a BY name;
-c = LIMIT b 10;
-
-grunt&gt; a = LOAD 'student' AS (name, age, gpa);
-
-grunt&gt; run myscript.pig
-
-grunt&gt; d = LIMIT c 3;
+a = LOAD 'student' AS (name, age, gpa);
+b = LIMIT a 3;
+DUMP b;
 
-grunt&gt; DUMP d;
+grunt&gt; exec myscript.pig
 (alice,20,2.47)
-(alice,27,1.95)
-(alice,36,2.27)
+(luke,18,4.00)
+(holly,24,3.27)
 </programlisting>
-   
-   
-   <para>In this example parameter substitution is used with the run command.</para>
-<programlisting>
-grunt&gt; a = LOAD 'student' AS (name, age, gpa);
 
+   <para>In this example parameter substitution is used with the exec command.</para>
+<programlisting>
 grunt&gt; cat myscript.pig
+a = LOAD 'student' AS (name, age, gpa);
 b = ORDER a BY name;
+
 STORE b into '$out';
 
-grunt&gt; run –param out=myoutput myscript.pig
+grunt&gt; exec –param out=myoutput myscript.pig
 </programlisting>
-   
-   </section></section>
+
+      <para>In this example multiple parameters are specified.</para>
+<programlisting>
+grunt&gt; exec –param p1=myparam1 –param p2=myparam2 myscript.pig
+</programlisting>
+
    </section>
    
+   </section>   
    
-   <section>
-   <title>Utility Commands</title>
    
    <section>
    <title>help</title>
@@ -9557,6 +9622,97 @@
 </programlisting>
    </section></section>
    
+   
+   <section>
+   <title>run</title>
+   <para>Run a Pig script.</para>
+   
+   <section>
+   <title>Syntax</title>
+   <informaltable frame="all">
+      <tgroup cols="1"><tbody><row>
+            <entry>
+               <para>run [–param param_name = param_value] [–param_file file_name] script </para>
+            </entry>
+         </row></tbody></tgroup>
+   </informaltable></section>
+   
+   <section>
+   <title>Terms</title>
+   <informaltable frame="all">
+   <tgroup cols="2"><tbody>
+         <row>
+            <entry>
+               <para>–param param_name = param_value</para>
+            </entry>
+            <entry>
+               <para>See Parameter Substitution.</para>
+            </entry>
+         </row>
+
+         <row>
+            <entry>
+               <para>–param_file file_name</para>
+            </entry>
+            <entry>
+               <para>See Parameter Substitution. </para>
+            </entry>
+         </row>
+      <row>
+            <entry>
+               <para>script</para>
+            </entry>
+            <entry>
+               <para>The name of a Pig script.</para>
+            </entry>
+         </row>
+         
+   </tbody></tgroup>
+   </informaltable></section>
+   
+   <section>
+   <title>Usage</title>
+   <para>Use the run command to run a Pig script that can interact with the Grunt shell (interactive mode). The script has access to aliases defined externally via the Grunt shell. The Grunt shell has access to aliases defined within the script. All commands from the script are visible in the command history. </para>   
+	<para>With the run command, every store triggers execution. The statements from the script are put into the command history and all the aliases defined in the script can be referenced in subsequent statements after the run command has completed. Issuing a run command on the grunt command line has basically the same effect as typing the statements manually. </para>   
+   <para>For comparison, see the exec command. Both the run and exec commands are useful for debugging because you can modify a Pig script in an editor and then rerun the script in the Grunt shell without leaving the shell. Also, both commands promote Pig script modularity as they allow you to reuse existing components.</para>
+  </section>
+   
+   <section>
+   <title>Example</title>
+   <para>In this example the script interacts with the results of commands issued via the Grunt shell.</para>
+<programlisting>
+grunt&gt; cat myscript.pig
+b = ORDER a BY name;
+c = LIMIT b 10;
+
+grunt&gt; a = LOAD 'student' AS (name, age, gpa);
+
+grunt&gt; run myscript.pig
+
+grunt&gt; d = LIMIT c 3;
+
+grunt&gt; DUMP d;
+(alice,20,2.47)
+(alice,27,1.95)
+(alice,36,2.27)
+</programlisting>
+   
+   
+   <para>In this example parameter substitution is used with the run command.</para>
+<programlisting>
+grunt&gt; a = LOAD 'student' AS (name, age, gpa);
+
+grunt&gt; cat myscript.pig
+b = ORDER a BY name;
+STORE b into '$out';
+
+grunt&gt; run –param out=myoutput myscript.pig
+</programlisting>
+   
+   </section></section>   
+   
+   
+   
    <section>
    <title>set</title>
    <para>Assigns values to keys used in Pig.</para>

Modified: hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_users.xml
URL: http://svn.apache.org/viewvc/hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_users.xml?rev=835496&r1=835495&r2=835496&view=diff
==============================================================================
--- hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_users.xml (original)
+++ hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_users.xml Thu Nov 12 18:43:45 2009
@@ -158,12 +158,12 @@
    <title>Increasing Parallelism</title>
    <p>To increase the parallelism of a job, include the PARALLEL clause with the COGROUP, CROSS, DISTINCT, GROUP, JOIN and ORDER operators. 
    PARALLEL controls the number of reducers only; the number of maps is determined by the input data 
-   (see the <a href="http://wiki.apache.org/pig/PigUserCookbook">Pig User Cookbook</a>).</p>
+   (see the <a href="cookbook.html">Pig Cookbook</a>).</p>
    </section>
    
    <section><title>Increasing Performance</title>
    <p>You can increase or optimize the performance of your Pig Latin scripts by following a few simple rules 
-   (see the <a href="http://wiki.apache.org/pig/PigUserCookbook">Pig User Cookbook</a>).</p>
+   (see the <a href="cookbook.html">Pig Cookbook</a>).</p>
    </section>
    
    <section>
@@ -420,8 +420,8 @@
 <title>Specialized Joins</title>
 <p>
 Pig Latin includes three "specialized" joins: fragment replicate joins, skewed joins, and merge joins. 
-These joins are performed using the <a href="piglatin_reference.html#JOIN">JOIN</a> operator (inner, equijoins).
-Currently, these joins <strong>cannot</strong> be performed using outer joins.
+Replicate, skewed, and merge joins can be performed using the <a href="piglatin_reference.html#JOIN">JOIN</a> operator (inner, equijoins).
+Replicate and skewed joins can also be performed using the <a href="piglatin_reference.html#JOIN%2C+OUTER">Outer Join</a> syntax.
 </p>
 
 <!-- FRAGMENT REPLICATE JOINS-->
@@ -434,7 +434,7 @@
  
 <section>
 <title>Usage</title>
-<p>Perform a fragment replicate join with the USING clause (see the <a href="piglatin_reference.html#JOIN">JOIN</a> operator).
+<p>Perform a fragment replicate join with the USING clause (see <a href="piglatin_reference.html#JOIN">JOIN</a> and <a href="piglatin_reference.html#JOIN%2C+OUTER">JOIN, OUTER</a>).
 In this example, a large relation is joined with two smaller relations. Note that the large relation comes first followed by the smaller relations; 
 and, all small relations together must fit into main memory, otherwise an error is generated. </p>
 <source>
@@ -478,11 +478,11 @@
 
 <section>
 <title>Usage</title>
-<p>Perform a skewed join with the USING clause (see the <a href="piglatin_reference.html#JOIN">JOIN</a> operator). </p>
+<p>Perform a skewed join with the USING clause (see <a href="piglatin_reference.html#JOIN">JOIN</a> and <a href="piglatin_reference.html#JOIN%2C+OUTER">JOIN, OUTER</a>). </p>
 <source>
 big = LOAD 'big_data' AS (b1,b2,b3);
 massive = LOAD 'massive_data' AS (m1,m2,m3);
-c = JOIN big BY b1, massive BY m1 USING "skewed";
+C = JOIN big BY b1, massive BY m1 USING "skewed";
 </source>
 </section>
 
@@ -683,9 +683,92 @@
 D = JOIN C BY $1, B BY $1;
 </source>
 </section>
+</section> <!-- END OPTIMIZATION RULES -->
+
+ <!-- MEMORY MANAGEMENT -->
+<section>
+<title>Memory Management</title>
+<p>For Pig 0.6.0 we changed how Pig decides when to spill bags to disk. In the past, Pig tried to figure out when an application was getting close to the memory limit and then spill at that time. However, because Java does not provide an accurate way to determine memory usage, Pig often ran out of memory. </p>
 
+<p>In the current version, we allocate a fixed amount of memory to store bags and spill to disk as soon as the memory limit is reached. This is very similar to how Hadoop decides when to spill data accumulated by the combiner. </p>
 
-</section> <!-- END OPTIMIZATION RULES -->
+<p>The amount of memory allocated to bags is determined by pig.cachedbag.memusage; the default is set to 10% of available memory. Note that this memory is shared across all large bags used by the application.</p>
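The fixed-budget policy described above can be sketched as follows. This is a hypothetical illustration of the idea, not Pig's actual spill code; the class name and the crude byte estimate are made up, and a list stands in for the spill file:

```java
import java.util.ArrayList;
import java.util.List;

public class BudgetedBag {
    private final long budgetBytes;
    private long usedBytes = 0;
    private final List<String> inMemory = new ArrayList<>();
    private final List<String> spilled = new ArrayList<>(); // stands in for a disk file

    public BudgetedBag(long budgetBytes) { this.budgetBytes = budgetBytes; }

    public void add(String tuple) {
        inMemory.add(tuple);
        usedBytes += tuple.length();      // crude size estimate
        if (usedBytes > budgetBytes) {    // budget reached: spill immediately,
            spilled.addAll(inMemory);     // rather than waiting for a low-memory signal
            inMemory.clear();
            usedBytes = 0;
        }
    }

    public int spilledCount() { return spilled.size(); }
    public int inMemoryCount() { return inMemory.size(); }
}
```

Spilling as soon as a known budget is exceeded trades some disk I/O for predictability, which is the same trade-off the combiner's spill buffer makes in Hadoop.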
+
+</section> <!-- END MEMORY MANAGEMENT  -->
+
+ <!-- ZEBRA INTEGRATION -->
+<section>
+<title>Integration with Zebra</title>
+ <p>This version of Pig is integrated with the Zebra storage format. Zebra is a recent Pig contrib project; details can be found at http://wiki.apache.org/pig/zebra. Pig can now: </p>
+ <ul>
+ <li>Load data in Zebra format</li>
+  <li>Take advantage of sorted Zebra tables for map-side groups and merge joins.</li>
+  <li>Store data in Zebra format</li>
+ </ul>
+ <p></p>
+ <p>To load data in Zebra format using TableLoader, do the following:</p>
+ <source>
+register zebra.jar;
+A = LOAD 'studenttab' USING org.apache.hadoop.zebra.pig.TableLoader();
+B = FOREACH A GENERATE name, age, gpa;
+</source>
+  
+ <p>There are a couple of things to note:</p>
+ <ol>
+ <li>You need to register the Zebra jar file, the same way you would for any other UDF.</li>
+ <li>You need to place the jar on your classpath.</li>
+ <li>Zebra data is self-described and always contains schema. This means that the AS clause is unnecessary as long as 
+  you know what the column names and types are. To determine the column names and types, you can run the DESCRIBE statement right after the load:
+ <source>
+A = LOAD 'studenttab' USING org.apache.hadoop.zebra.pig.TableLoader();
+DESCRIBE A;
+A: {name: chararray,age: int,gpa: float}
+</source>
+ </li>
+ </ol>
+   
+<p>You can provide alternative names for the columns with the AS clause. You can also provide types as long as the 
+ original type can be converted to the new type.</p>
+ 
+<p>You can provide multiple, comma-separated files to the loader:</p>
+<source>
+A = LOAD 'studenttab, votertab' USING org.apache.hadoop.zebra.pig.TableLoader();
+</source>
+
+<p>TableLoader supports efficient column selection. The current version of Pig does not support automatically pushing 
+ projections down to the loader. (The work is in progress and will be done after beta.) 
+ Meanwhile, the loader allows passing columns down via a list of arguments. This example tells the loader to only return two columns, name and age.</p>
+<source>
+A = LOAD 'studenttab' USING org.apache.hadoop.zebra.pig.TableLoader('name, age');
+</source>
+
+<p>If the input data is globally sorted, a map-side group or merge join can be used. Note the "sorted" argument passed to the loader. This tells the loader that the data is expected to be globally sorted and that all data for a single key must be given to the same map.</p>
+
+<p>Here is an example of the merge join. Note that the first argument to the loader is left empty to indicate that all columns are requested.</p>
+<source>
+A = LOAD 'studentsortedtab' USING org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
+B = LOAD 'votersortedtab' USING org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');
+G = JOIN A BY $0, B BY $0 USING "merge";
+</source>
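A merge join over globally sorted inputs can be sketched with two cursors advancing in lockstep. This is an illustrative Java sketch of the general technique, not Zebra or Pig internals; for brevity it assumes join keys are unique within each input:

```java
import java.util.ArrayList;
import java.util.List;

public class MergeJoin {
    // Both inputs must be sorted ascending on the join key.
    public static List<String> join(List<String> left, List<String> right) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i).compareTo(right.get(j));
            if (cmp == 0) {            // keys match: emit and advance both cursors
                out.add(left.get(i));
                i++; j++;
            } else if (cmp < 0) i++;   // advance whichever side holds the smaller key
            else j++;
        }
        return out;
    }
}
```

Because matches are found by comparison rather than hashing, the join needs neither a shuffle nor an in-memory table, which is why it requires the global sort guarantee.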
+
+<p>Here is an example of a map-side group. Note that multiple sorted files are passed to the loader and that the loader will perform a sort-preserving merge to make sure the data is globally sorted.</p>
+<source>
+A = LOAD 'studentsortedtab, studentnullsortedtab' USING org.apache.hadoop.zebra.pig.TableLoader('name, age, gpa, source_table', 'sorted');
+B = GROUP A BY $0 USING "collected";
+C = FOREACH B GENERATE group, MAX(A.$1);
+</source>
+
+<p>You can also write data in Zebra format. Note that, since Zebra requires a schema to be stored with the data, the relation being stored must have a name assigned (via an alias) to every column in the relation.</p>
+<source>
+A = LOAD 'studentsortedtab, studentnullsortedtab' USING org.apache.hadoop.zebra.pig.TableLoader('name, age, gpa, source_table', 'sorted');
+B = GROUP A BY $0 USING "collected";
+C = FOREACH B GENERATE group, MAX(A.$1) AS max_val;
+STORE C INTO 'output' USING org.apache.hadoop.zebra.pig.TableStorer('');
+</source>
+
+ </section> <!-- END ZEBRA INTEGRATION  -->
+ 
+ 
  
  </body>
  </document>

Modified: hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/tabs.xml
URL: http://svn.apache.org/viewvc/hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/tabs.xml?rev=835496&r1=835495&r2=835496&view=diff
==============================================================================
--- hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/tabs.xml (original)
+++ hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/tabs.xml Thu Nov 12 18:43:45 2009
@@ -32,6 +32,6 @@
   -->
   <tab label="Project" href="http://hadoop.apache.org/pig/" type="visible" /> 
   <tab label="Wiki" href="http://wiki.apache.org/pig/" type="visible" /> 
-  <tab label="Pig 0.5.0 Documentation" dir="" type="visible" /> 
+  <tab label="Pig 0.6.0 Documentation" dir="" type="visible" /> 
 
 </tabs>

Modified: hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/udf.xml
URL: http://svn.apache.org/viewvc/hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/udf.xml?rev=835496&r1=835495&r2=835496&view=diff
==============================================================================
--- hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/udf.xml (original)
+++ hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/udf.xml Thu Nov 12 18:43:45 2009
@@ -866,8 +866,89 @@
 </section>
 
 <section>
-<title>Advanced Topics</title>
+<title>Accumulate Interface</title>
+
+<p>In Pig, memory problems can occur when the data resulting from a group or cogroup operation needs to be placed in a bag and passed in its entirety to a UDF.</p>
+
+<p>This problem is partially addressed by Algebraic UDFs, which use the combiner and can deal with data being passed to them incrementally during different processing phases (map, combiner, and reduce). However, there are a number of UDFs that are not Algebraic and don't use the combiner, but still don't need to be given all data at once. </p>
+
+<p>The new Accumulator interface is designed to decrease memory usage by targeting such UDFs. For the functions that implement this interface, Pig guarantees that the data for the same key is passed continuously but in small increments. To work with incremental data, here is the interface a UDF needs to implement:</p>
+<source>
+public interface Accumulator &lt;T&gt; {
+   /**
+    * Process tuples. Each DataBag may contain 0 to many tuples for the current key.
+    */
+    public void accumulate(Tuple b) throws IOException;
+    /**
+     * Called when all tuples from current key have been passed to accumulate.
+     * @return the value for the UDF for this key.
+     */
+    public T getValue();
+    /**
+     * Called after getValue() to prepare processing for next key. 
+     */
+    public void cleanup();
+}
+</source>
+
+<p>There are several things to note here:</p>
+
+<ol>
+	<li>Each UDF must extend the EvalFunc class and implement all necessary functions there.</li>
+	<li>If a function is algebraic but can be used in a FOREACH statement with accumulator functions, it needs to implement the Accumulator interface in addition to the Algebraic interface.</li>
+	<li>The interface is parameterized with the return type of the function.</li>
+	<li>The accumulate function is guaranteed to be called one or more times, passing one or more tuples in a bag to the UDF. (Note that the tuple passed to the accumulator has the same content as the one passed to exec: all the parameters passed to the UDF, one of which should be a bag.)</li>
+	<li>The getValue function is called after all the tuples for a particular key have been processed to retrieve the final value.</li>
+	<li>The cleanup function is called after getValue but before the next value is processed.</li>
+</ol>
+
+
+<p>Here is a code snippet of the integer version of the MAX function that implements the interface:</p>
+<source>
+public class IntMax extends EvalFunc&lt;Integer&gt; implements Algebraic, Accumulator&lt;Integer&gt; {
+    …….
+    /* Accumulator interface */
+    
+    private Integer intermediateMax = null;
+    
+    @Override
+    public void accumulate(Tuple b) throws IOException {
+        try {
+            Integer curMax = max(b);
+            if (curMax == null) {
+                return;
+            }
+            /* the bag contained a non-null max, so initialize intermediateMax if needed */
+            if (intermediateMax == null) {
+                intermediateMax = Integer.MIN_VALUE;
+            }
+            intermediateMax = java.lang.Math.max(intermediateMax, curMax);
+        } catch (ExecException ee) {
+            throw ee;
+        } catch (Exception e) {
+            int errCode = 2106;
+            String msg = "Error while computing max in " + this.getClass().getSimpleName();
+            throw new ExecException(msg, errCode, PigException.BUG, e);           
+        }
+    }
+
+    @Override
+    public void cleanup() {
+        intermediateMax = null;
+    }
+
+    @Override
+    public Integer getValue() {
+        return intermediateMax;
+    }
+}
+</source>
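The lifecycle Pig drives per key (repeated accumulate calls, then getValue, then cleanup before the next key) can be exercised with a simplified stand-in. The MiniAccumulator interface below is hypothetical; the real interface is the parameterized Accumulator shown above, whose accumulate takes a Tuple wrapping a bag:

```java
import java.util.List;

public class AccumulatorDemo {
    // Simplified stand-in for org.apache.pig.Accumulator: chunks of ints
    // play the role of successive partial bags for one key.
    interface MiniAccumulator<T> {
        void accumulate(List<Integer> bagChunk);
        T getValue();
        void cleanup();
    }

    static class IntMaxAcc implements MiniAccumulator<Integer> {
        private Integer intermediateMax = null;
        public void accumulate(List<Integer> bagChunk) {
            for (int v : bagChunk)
                intermediateMax = (intermediateMax == null) ? v : Math.max(intermediateMax, v);
        }
        public Integer getValue() { return intermediateMax; }
        public void cleanup() { intermediateMax = null; }
    }

    // Simulates Pig feeding one key's data in several small increments.
    public static Integer maxOf(List<List<Integer>> chunks) {
        IntMaxAcc acc = new IntMaxAcc();
        for (List<Integer> chunk : chunks) acc.accumulate(chunk); // incremental calls
        Integer result = acc.getValue();  // all tuples for the key have been seen
        acc.cleanup();                    // reset state before the next key
        return result;
    }
}
```

The key point the sketch illustrates is that only the intermediate state (here a single Integer) is held in memory, never the whole bag.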
+
+</section>
+
 
+<section>
+<title>Advanced Topics</title>
 
 <section>
 <title>Function Instantiation</title>