Posted to commits@pig.apache.org by ol...@apache.org on 2010/01/05 18:25:58 UTC

svn commit: r896137 - in /hadoop/pig/branches/branch-0.6: CHANGES.txt src/docs/src/documentation/content/xdocs/piglatin_reference.xml src/docs/src/documentation/content/xdocs/piglatin_users.xml

Author: olga
Date: Tue Jan  5 17:25:58 2010
New Revision: 896137

URL: http://svn.apache.org/viewvc?rev=896137&view=rev
Log:
PIG-1175: Pig 0.6 Docs - Store v. Dump (chandec via olgan)

Modified:
    hadoop/pig/branches/branch-0.6/CHANGES.txt
    hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_reference.xml
    hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_users.xml

Modified: hadoop/pig/branches/branch-0.6/CHANGES.txt
URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.6/CHANGES.txt?rev=896137&r1=896136&r2=896137&view=diff
==============================================================================
--- hadoop/pig/branches/branch-0.6/CHANGES.txt (original)
+++ hadoop/pig/branches/branch-0.6/CHANGES.txt Tue Jan  5 17:25:58 2010
@@ -26,6 +26,8 @@
 
 IMPROVEMENTS
 
+PIG-1175: Pig 0.6 Docs - Store v. Dump (chandec via olgan)
+
 PIG-1162: Pig 0.6.0 - UDF doc (chandec via olgan)
 
 PIG-1163: Pig/Zebra 0.6.0 release (chandec via olgan)

Modified: hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_reference.xml
URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_reference.xml?rev=896137&r1=896136&r2=896137&view=diff
==============================================================================
--- hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_reference.xml (original)
+++ hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_reference.xml Tue Jan  5 17:25:58 2010
@@ -4919,58 +4919,7 @@
 
    </section></section>
    
-   <section>
-   <title>DUMP</title>
-   <para>Displays the contents of a relation.</para>
-   
-   <section>
-   <title>Syntax</title>
-   <informaltable frame="all">
-      <tgroup cols="1"><tbody><row>
-            <entry>
-               <para>DUMP alias;        </para>
-            </entry>
-         </row></tbody></tgroup>
-   </informaltable></section>
-   
-   <section>
-   <title>Terms</title>
-   <informaltable frame="all">
-      <tgroup cols="2"><tbody><row>
-            <entry>
-               <para>alias</para>
-            </entry>
-            <entry>
-               <para>The name of a relation.</para>
-            </entry>
-         </row></tbody></tgroup>
-   </informaltable></section>
-   
-   <section>
-   <title>Usage</title>
-   <para>Use the DUMP operator to run (execute) a Pig Latin statement and to display the contents of an alias. You can use DUMP as a debugging device to make sure the results you are expecting are being generated.</para></section>
-   
-   <section>
-   <title>Example</title>
-   <para>In this example a dump is performed after each statement.</para>
-<programlisting>
-A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
-
-DUMP A;
-(John,18,4.0F)
-(Mary,19,3.7F)
-(Bill,20,3.9F)
-(Joe,22,3.8F)
-(Jill,20,4.0F)
-
-B = FILTER A BY name matches 'J.+';
-
-DUMP B;
-(John,18,4.0F)
-(Joe,22,3.8F)
-(Jill,20,4.0F)
-</programlisting>
-</section></section>
+  
    
    <section>
    <title>FILTER </title>
@@ -6521,7 +6470,7 @@
    
    <section>
    <title>STORE </title>
-   <para>Stores data to the file system.</para>
+   <para>Stores or saves results to the file system.</para>
    
    <section>
    <title>Syntax</title>
@@ -6591,7 +6540,10 @@
    
    <section>
    <title>Usage</title>
-   <para>Use the STORE operator to run (execute) Pig Latin statements and to store data on the file system. </para></section>
+   <para>Use the STORE operator to run (execute) Pig Latin statements and save (persist) results to the file system. Use STORE for production scripts and batch mode processing.</para>
+   
+   <para>Note: To debug scripts during development, you can use <ulink url="piglatin_reference.html#DUMP">DUMP</ulink> to check intermediate results.</para>
+</section>
    
    <section>
    <title>Examples</title>
@@ -6962,6 +6914,68 @@
    
    </section></section>
    
+   
+ <section>
+   <title>DUMP</title>
+   <para>Dumps or displays results to screen.</para>
+   
+   <section>
+   <title>Syntax</title>
+   <informaltable frame="all">
+      <tgroup cols="1"><tbody><row>
+            <entry>
+               <para>DUMP alias;        </para>
+            </entry>
+         </row></tbody></tgroup>
+   </informaltable></section>
+   
+   <section>
+   <title>Terms</title>
+   <informaltable frame="all">
+      <tgroup cols="2"><tbody><row>
+            <entry>
+               <para>alias</para>
+            </entry>
+            <entry>
+               <para>The name of a relation.</para>
+            </entry>
+         </row></tbody></tgroup>
+   </informaltable></section>
+   
+   <section>
+   <title>Usage</title>
+   <para>Use the DUMP operator to run (execute) Pig Latin statements and display the results to your screen. DUMP is meant for interactive mode; statements are executed immediately and the results are not saved (persisted). You can use DUMP as a debugging device to make sure that the results you are expecting are actually generated. </para>
+   
+   <para>
+   Note that production scripts <emphasis>should not</emphasis> use DUMP as it will disable multi-query optimizations and is likely to slow down execution 
+   (see <ulink url="piglatin_users.html#Store+vs.+Dump">Store vs. Dump</ulink>).
+   </para>
+   </section>
+   
+   <section>
+   <title>Example</title>
+   <para>In this example a dump is performed after each statement.</para>
+<programlisting>
+A = LOAD 'student' AS (name:chararray, age:int, gpa:float);
+
+DUMP A;
+(John,18,4.0F)
+(Mary,19,3.7F)
+(Bill,20,3.9F)
+(Joe,22,3.8F)
+(Jill,20,4.0F)
+
+B = FILTER A BY name matches 'J.+';
+
+DUMP B;
+(John,18,4.0F)
+(Joe,22,3.8F)
+(Jill,20,4.0F)
+</programlisting>
+</section></section>   
+   
+   
+   
    <section>
    <title>EXPLAIN</title>
    <para>Displays execution plans.</para>

Modified: hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_users.xml
URL: http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_users.xml?rev=896137&r1=896136&r2=896137&view=diff
==============================================================================
--- hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_users.xml (original)
+++ hadoop/pig/branches/branch-0.6/src/docs/src/documentation/content/xdocs/piglatin_users.xml Tue Jan  5 17:25:58 2010
@@ -54,7 +54,7 @@
   
    <section>
    <title>Running Pig Latin </title>
-   <p>You can execute Pig Latin statements interactively or in batch mode using Pig scripts (see the EXEC and RUN operators).</p>
+   <p>You can execute Pig Latin statements interactively or in batch mode using Pig scripts (see the <a href="piglatin_reference.html#exec">exec</a> and <a href="piglatin_reference.html#run">run</a> commands).</p>
    
    <p>Grunt Shell, Interactive or Batch Mode</p>
    <source>
@@ -228,15 +228,12 @@
 <!-- MULTI-QUERY EXECUTION-->
 <section>
 <title>Multi-Query Execution</title>
-<p>With multi-query execution Pig processes an entire script or a batch of statements at once 
-(as opposed to processing statements when a DUMP or STORE is encountered). </p>
-
-
+<p>With multi-query execution Pig processes an entire script or a batch of statements at once.</p>
 
 <section>
 	<title>Turning Multi-Query Execution On or Off</title>	
 	<p>Multi-query execution is turned on by default. 
-	To turn it off and revert to Pi'gs "execute-on-dump/store" behavior, use the "-M" or "-no_multiquery" options. </p>
+	To turn it off and revert to Pig's "execute-on-dump/store" behavior, use the "-M" or "-no_multiquery" options. </p>
 	<p>To run script "myscript.pig" without the optimization, execute Pig as follows: </p>
 <source>
 $ pig -M myscript.pig
@@ -253,7 +250,8 @@
 <li>
 <p>For batch mode execution, the entire script is first parsed to determine if intermediate tasks 
 can be combined to reduce the overall amount of work that needs to be done; execution starts only after the parsing is completed 
-(see the EXPLAIN operator and the EXEC and RUN commands). </p>
+(see the <a href="piglatin_reference.html#EXPLAIN">EXPLAIN</a> operator and the <a href="piglatin_reference.html#exec">exec</a> and <a href="piglatin_reference.html#run">run</a> commands). </p>
+
 </li>
 <li>
 <p>Two run scenarios are optimized, as explained below: explicit and implicit splits, and storing intermediate results.</p>
@@ -316,7 +314,32 @@
 </section>
 </section>
 
+<section>
+	<title>Store vs. Dump</title>
+	<p>With multi-query execution, you want to use <a href="piglatin_reference.html#STORE">STORE</a> to save (persist) your results. 
+	You do not want to use <a href="piglatin_reference.html#DUMP">DUMP</a> as it will disable multi-query execution and is likely to slow down execution. (If you have included DUMP statements in your scripts for debugging purposes, you should remove them.) </p>
+	
+	<p>DUMP Example: In this script, because the DUMP command is interactive, the multi-query execution will be disabled and two separate jobs will be created to execute this script. The first job will execute A > B > DUMP while the second job will execute A > B > C > STORE.</p>
+	
+<source>
+A = LOAD 'input' AS (x, y, z);
+B = FILTER A BY x > 5;
+DUMP B;
+C = FOREACH B GENERATE y, z;
+STORE C INTO 'output';
+</source>
+	
+	<p>STORE Example: In this script, multi-query optimization will kick in allowing the entire script to be executed as a single job. Two outputs are produced: output1 and output2.</p>
+	
+<source>
+A = LOAD 'input' AS (x, y, z);
+B = FILTER A BY x > 5;
+STORE B INTO 'output1';
+C = FOREACH B GENERATE y, z;
+STORE C INTO 'output2';
+</source>
 
+</section>
 <section>
 	<title>Error Handling</title>
 	<p>With multi-query execution Pig processes an entire script or a batch of statements at once. 
@@ -352,10 +375,10 @@
 	<title>Backward Compatibility</title>
 	
 	<p>Most existing Pig scripts will produce the same result with or without the multi-query execution. 
-	There are cases though were this is not true. Path names and schemes are discussed here.</p>
+	There are cases though where this is not true. Path names and schemes are discussed here.</p>
 	
 	<p>Any script is parsed in its entirety before it is sent to execution. Since the current directory can change 
-	throughout the script any path used in load or store is translated to a fully qualified and absolute path.</p>
+	throughout the script, any path used in a LOAD or STORE statement is translated to a fully qualified and absolute path.</p>
 		
 	<p>In map-reduce mode, the following script will load from "hdfs://&lt;host&gt;:&lt;port&gt;/data1" and store into "hdfs://&lt;host&gt;:&lt;port&gt;/tmp/out1". </p>
 <source>
@@ -375,7 +398,7 @@
 		<li><p>Specify a custom scheme for the LoadFunc/Slicer </p></li>
 	</ol>	
 	
-	<p>Arguments used in a load statement that have a scheme other than "hdfs" or "file" will not be expanded and passed to the LoadFunc/Slicer unchanged.</p>
+	<p>Arguments used in a LOAD statement that have a scheme other than "hdfs" or "file" will not be expanded and will be passed to the LoadFunc/Slicer unchanged.</p>
 	<p>In the SQL case, the SQLLoader function is invoked with "sql://mytable". </p>
 
 <source>
@@ -416,7 +439,7 @@
 
 <section>
 	<title>Example</title>
-<p>In this script, the store/load operators have different file paths; however, the load operator depends on the store operator.</p>
+<p>In this script, the STORE/LOAD operators have different file paths; however, the LOAD operator depends on the STORE operator.</p>
 <source>
 A = LOAD '/user/xxx/firstinput' USING PigStorage();
 B = group ....