Posted to commits@pig.apache.org by da...@apache.org on 2011/04/23 02:45:00 UTC

svn commit: r1096096 - in /pig/branches/branch-0.9/src/docs/src/documentation/content/xdocs: basic.xml cont.xml func.xml perf.xml start.xml udf.xml

Author: daijy
Date: Sat Apr 23 00:45:00 2011
New Revision: 1096096

URL: http://svn.apache.org/viewvc?rev=1096096&view=rev
Log:
PIG-1772: Pig 090 Documentation (pig-1772-beta2-1.patch)

Modified:
    pig/branches/branch-0.9/src/docs/src/documentation/content/xdocs/basic.xml
    pig/branches/branch-0.9/src/docs/src/documentation/content/xdocs/cont.xml
    pig/branches/branch-0.9/src/docs/src/documentation/content/xdocs/func.xml
    pig/branches/branch-0.9/src/docs/src/documentation/content/xdocs/perf.xml
    pig/branches/branch-0.9/src/docs/src/documentation/content/xdocs/start.xml
    pig/branches/branch-0.9/src/docs/src/documentation/content/xdocs/udf.xml

Modified: pig/branches/branch-0.9/src/docs/src/documentation/content/xdocs/basic.xml
URL: http://svn.apache.org/viewvc/pig/branches/branch-0.9/src/docs/src/documentation/content/xdocs/basic.xml?rev=1096096&r1=1096095&r2=1096096&view=diff
==============================================================================
--- pig/branches/branch-0.9/src/docs/src/documentation/content/xdocs/basic.xml (original)
+++ pig/branches/branch-0.9/src/docs/src/documentation/content/xdocs/basic.xml Sat Apr 23 00:45:00 2011
@@ -239,11 +239,6 @@
             <td> <p>-- V, W, X, Y, Z </p> </td>
             <td> <p> </p> </td>
          </tr>  
-                    
-        <tr>
-            <td> <p>-- Symbols</p> </td>
-            <td> <p>= =   !=   &lt;  &gt;   &lt;=   &gt;=   +   -   *   /   %   ?   $   .   #   ::   ( )   [ ]   { } </p> </td>
-         </tr> 
             
    </table>
    </section>
@@ -1440,20 +1435,87 @@ X = FILTER A BY (f1==8) OR (NOT (f2+f3 &
 
       <section id="sexp">
           <title>Star expression</title>
-          <p>The star symbol, *, can be used to represent all the fields of a tuple. It is equivalent to writing out the fields explicitly. In the following example the definition of B and C are exactly the same, and MyUDF will be invoked with exactly the same arguments in both cases.</p>
+          <p>A star expression ( * ) can be used to represent all the fields of a tuple; it is equivalent to writing out the fields explicitly. In the following example the definitions of B and C are exactly the same, and MyUDF will be invoked with exactly the same arguments in both cases.</p>
           <source>
 A = LOAD 'data' USING MyStorage() AS (name:chararray, age: int);
 B = FOREACH A GENERATE *, MyUDF(name, age);
 C = FOREACH A GENERATE name, age, MyUDF(*);
           </source>
-          <p>A common error when using the star expression is the following:</p>
+          <p>A common error when using the star expression is shown below. In this example, the programmer really wants to count the number of elements in the bag in the second field: COUNT($1).</p>
           <source>
 G = GROUP A BY $0;
 C = FOREACH G GENERATE COUNT(*)
           </source>
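+          <p>The corrected statement uses COUNT($1):</p>
+          <source>
+G = GROUP A BY $0;
+C = FOREACH G GENERATE COUNT($1);
+          </source>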
-          <p>In this example, the programmer really wants to count the number of elements in the bag in the second field: COUNT($1).</p>
+        
+<p>There are some restrictions on the use of the star expression when the input schema is unknown (null); a sketch follows this list:</p>
+<ul>
+<li>For GROUP/COGROUP, you can't include a star expression in a GROUP BY column. </li>
+<li>For ORDER BY, if you have project-star as an ORDER BY column, you can't have any other ORDER BY column in that statement. </li>
+</ul>
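+<p>For example (a sketch; assumes relation A is loaded with no declared schema):</p>
+<source>
+A = load 'data';          -- unknown schema
+B = group A by *;         -- NOT supported
+C = order A by *, $0;     -- NOT supported: another sort column follows project-star
+D = order A by *;         -- supported
+</source>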
       </section>
 
+<section id="prexp">
+<title>Project-Range expressions</title>
+<p>Project-range ( .. ) expressions can be used to project a range of columns from input. For example:</p>
+<ul>
+<li>.. $x : projects columns $0 through $x, inclusive </li>
+<li>$x .. : projects columns $x through the end, inclusive </li>
+<li>$x .. $y : projects columns $x through $y, inclusive </li>
+</ul>
+<p></p>
+
+<p>If the input relation has a schema, you can refer to columns by alias rather than by column position. You can also combine aliases and column positions in an expression; for example, "col1 .. $5" is valid. </p>
+
+<p>Project-range can be used in all cases where the <a href="#sexp">star expression</a> ( * ) is allowed, except as a UDF argument (support for this use case will be added in <a href="https://issues.apache.org/jira/browse/PIG-1938">PIG-1938</a>).</p>
+
+<p>Project-range can be used in the following statements:
+<a href="#FOREACH">FOREACH</a>, 
+<a href="#JOIN+%28inner%29">JOIN</a>,  
+<a href="#GROUP">GROUP</a>, 
+<a href="#COGROUP">COGROUP</a>, and  
+<a href="#ORDER+BY">ORDER BY</a> (also when ORDER BY is used within a nested FOREACH block).</p>
+
+<p>A few examples are shown here:</p>
+<source>
+..... 
+grunt> F = foreach IN generate (int)col0, col1 .. col3; 
+grunt> describe F; 
+F: {col0: int,col1: bytearray,col2: bytearray,col3: bytearray} 
+..... 
+..... 
+grunt> SORT = order IN by col2 .. col3, col0, col4 ..; 
+..... 
+..... 
+J = join IN1 by $0 .. $3, IN2 by $0 .. $3; 
+..... 
+..... 
+g = group l1 by b .. c; 
+..... 
+</source>
+
+<p>There are some restrictions on the use of the project-to-end form of project-range (e.g., "$x .. ") when the input schema is unknown (null): </p>
+<ul>
+<li>For GROUP/COGROUP, the project-to-end form of project-range is not allowed.</li>
+<li>For ORDER BY, the project-to-end form of project-range is supported only as the last sort column.
+<source>
+..... 
+grunt> describe IN; 
+Schema for IN unknown. 
+
+/* This statement is supported */
+SORT = order IN by $2 .. $3, $6 ..; 
+
+/* This statement is NOT supported ($6 .. is not the last sort column) */ 
+SORT = order IN by $6 .., $2 .. $3; 
+..... 
+</source>
+
+
+</li>
+</ul>
+</section>
+      
+      
       <section id="bexp">
           <title>Boolean expressions</title>
           <p>Boolean expressions can be made up of UDFs that return a boolean value or boolean operators 
@@ -1485,13 +1547,7 @@ C = FOREACH G GENERATE COUNT(*)
    
    <p>Schemas are defined with the <a href="#LOAD">LOAD</a>, <a href="#STREAM">STREAM</a>, and <a href="#FOREACH">FOREACH</a> operators using the AS clause. If you define a schema using the LOAD operator, then it is the load function that enforces the schema 
    (see <a href="#LOAD">LOAD</a> and <a href="udf.html">User Defined Functions</a> for more information).</p>
-   
-   <p></p>
-   <p><strong>Known Schema Handling</strong></p>
-   <p>Schemas enable you to assign names to fields and declare types for fields. Schemas are optional but we encourage you to use them whenever possible; type declarations result in better parse-time error checking and more efficient code execution. </p>
-   <p>Schemas are defined for the <a href="#LOAD">LOAD</a>, <a href="#STREAM">STREAM</a>, and <a href="#FOREACH">FOREACH</a> operators using the AS clause. If you define a schema using the LOAD operator, then it is the load function that enforces the schema 
-   (see <a href="#LOAD">LOAD</a> and <a href="udf.html">User Defined Functions</a> for more information).</p>
-  
+
    <p></p>
    <p><strong>Known Schema Handling</strong></p>
    <p>Note the following:</p>
@@ -1506,11 +1562,40 @@ C = FOREACH G GENERATE COUNT(*)
    <p><strong>Unknown Schema Handling</strong></p>
       <p>Note the following:</p>
    <ul>
-      <li>When you JOIN/COGROUP/CROSS multiple relations, if any relation has a null schema (no defined schema), the schema for the resulting relation is null. </li>
+      <li>When you JOIN/COGROUP/CROSS multiple relations, if any relation has an unknown schema (or no defined schema, also referred to as a null schema), the schema for the resulting relation is null. </li>
      <li>If you FLATTEN a bag with an empty inner schema, the schema for the resulting relation is null.</li>
      <li>If you UNION two relations with incompatible schemas, the schema for the resulting relation is null.</li>
      <li>If the schema is null, Pig treats all fields as bytearray (in the backend, Pig will determine the real type for the fields dynamically). </li>
     </ul>      
+    <p>See the examples below. If a field's data type is not specified, Pig will use bytearray to denote an unknown type. If the number of fields is not known, Pig will derive an unknown schema.</p>
+    
+ <source>
+/* The field data types are not specified ... */
+a = load '1.txt' as (a0, b0);
+a: {a0: bytearray,b0: bytearray}
+
+/* The number of fields is not known ... */
+a = load '1.txt';
+a: Schema for a unknown
+</source>
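+<p>A sketch of the UNION case (hypothetical inputs with incompatible schemas):</p>
+<source>
+a = load '1.txt' as (a0:int, a1:chararray);
+b = load '2.txt' as (b0:chararray);
+c = union a, b;
+c: Schema for c unknown
+</source>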
+
+   <p></p>
+   <p><strong>How Pig Handles Schema</strong></p>
+   
+   <p>As shown above, with a few exceptions Pig can infer the schema of a relation up front. You can examine the schema of a particular relation using <a href="test.html#DESCRIBE">DESCRIBE</a>. Pig enforces this computed schema during the actual execution by casting the input data to the expected data type. If the process is successful the results are returned to the user; otherwise, a warning is generated for each record that failed to convert.  Note that Pig does not know the actual types of the fields in the input data prior to execution; rather, Pig determines the data types and performs the right conversions on the fly.</p>
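+<p>For example (a minimal sketch; assumes '1.txt' contains the lines "1" and "abc"):</p>
+<source>
+A = load '1.txt' as (x:int);
+DUMP A;
+(1)
+()
+</source>
+<p>The value "abc" cannot be converted to int, so a warning is generated for that record and the field becomes null.</p>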
+  
+<p>Having a deterministic schema is very powerful; however, sometimes it comes at the cost of performance. Consider the following example:</p>  
+  
+<source>
+A = load 'input' as (x, y, z);
+B = foreach A generate x+y;
+</source>
+
+ <p>If you do <a href="test.html#DESCRIBE">DESCRIBE</a> on B, you will see a single column of type double. This is because Pig makes the safest choice and uses the largest numeric type when the schema is not known. In practice, the input data could contain integer values; however, Pig will cast the data to double and make sure that a double result is returned.</p>
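+<p>For example, DESCRIBE is expected to show something like this:</p>
+<source>
+grunt> describe B;
+B: {double}
+</source>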
+
+ <p>If the schema of a relation can't be inferred, Pig will just use the runtime data as is and propagate it through the pipeline.</p>
+
+
     
    <section id="schema-load">
    <title>Schemas with LOAD and STREAM Statements</title>
@@ -4903,7 +4988,7 @@ DUMP X;
    <table>
       <tr> 
             <td>
-               <p>alias  = FOREACH { gen_blk | nested_gen_blk };</p>
+               <p>alias  = FOREACH { block | nested_block };</p>
             </td>
          </tr> 
    </table>
@@ -4922,10 +5007,10 @@ DUMP X;
          </tr>
          <tr>
             <td>
-               <p>gen_blk</p>
+               <p>block</p>
             </td>
             <td>
-               <p>FOREACH…GENERATE used with a relation (outer bag). Use this syntax:</p>
+               <p>FOREACH…GENERATE block used with a relation (outer bag). Use this syntax:</p>
                <p></p>
                <p>alias = FOREACH alias GENERATE expression [AS schema] [expression [AS schema]….];</p>
                <p>See <a href="#schemas">Schemas</a></p>
@@ -4934,13 +5019,13 @@ DUMP X;
          </tr>
          <tr>
             <td>
-               <p>nested_gen_blk</p>
+               <p>nested_block</p>
             </td>
             <td>
-               <p>FOREACH...GENERATE used with a inner bag. Use this syntax:</p>
+               <p>Nested FOREACH...GENERATE block used with an inner bag. Use this syntax:</p>
                <p></p>
                <p>alias = FOREACH nested_alias {</p>
-               <p>   alias = nested_op; [alias = nested_op; …]</p>
+               <p>   alias = {nested_op | nested_exp}; [alias = {nested_op | nested_exp}; …]</p>
                <p>   GENERATE expression [AS schema] [expression [AS schema]….]</p>
                <p>};</p>
                <p></p>
@@ -4948,6 +5033,7 @@ DUMP X;
                <p>The nested block is enclosed in opening and closing brackets { … }. </p>
                <p>The GENERATE keyword must be the last statement within the nested block.</p>
                <p>See <a href="#schemas">Schemas</a></p>
+               <p>Macros are NOT allowed inside a nested block.</p>
             </td>
          </tr>
          <tr>
@@ -4978,6 +5064,14 @@ DUMP X;
          </tr>
          <tr>
             <td>
+               <p>nested_exp</p>
+            </td>
+            <td>
+               <p>Any arbitrary, supported expression.</p>
+            </td>
+         </tr>
+         <tr>
+            <td>
                <p>AS</p>
             </td>
             <td>
@@ -5027,48 +5121,6 @@ X = FOREACH B {
    </section>
    
    <section>
-   <title>Examples</title>
-   <p>Suppose we have relations A, B, and C (see the GROUP operator for information about the field names in relation C).</p>
-<source>
-A = LOAD 'data1' AS (a1:int,a2:int,a3:int);
-
-DUMP A;
-(1,2,3)
-(4,2,1)
-(8,3,4)
-(4,3,3)
-(7,2,5)
-(8,4,3)
-
-B = LOAD 'data2' AS (b1:int,b2:int);
-
-DUMP B;
-(2,4)
-(8,9)
-(1,3)
-(2,7)
-(2,9)
-(4,6)
-(4,9)
-
-C = COGROUP A BY a1 inner, B BY b1 inner;
-
-DUMP C;
-(1,{(1,2,3)},{(1,3)})
-(4,{(4,2,1),(4,3,3)},{(4,6),(4,9)})
-(8,{(8,3,4),(8,4,3)},{(8,9)})
-
-ILLUSTRATE C;
-<em>etc ... </em>
---------------------------------------------------------------------------------------
-| c     | group: int | a: bag({a1: int,a2: int,a3: int}) | B: bag({b1: int,b2: int}) |
---------------------------------------------------------------------------------------
-|       | 1          | {(1, 2, 3)}                       | {(1, 3)}                  |
--------------------------------------------------------------------------------------
-</source>
-</section>
-   
-   <section>
    <title>Example: Projection</title>
    <p>In this example the asterisk (*) is used to project all tuples from relation A to relation X. Relation A and X are identical.</p>
 <source>
@@ -5407,7 +5459,7 @@ IMPORT 'my_macro.pig';
                <p>No other operations can be done between the LOAD and COGROUP statements.</p>
                </li>
                <li>
-               <p>Data must be sorted on the cogroup key for all tables in ascending (ASC) order.</p>
+               <p>Data must be sorted on the COGROUP key for all tables in ascending (ASC) order.</p>
                </li> 
                 <li>
               <p>Nulls are considered smaller than everything. If data contains null keys, they should occur before anything else.</p>
@@ -5606,15 +5658,6 @@ X: {group: chararray,A: {owner: chararra
 (Jane,{},{(Paul,Jane)})
 </source>
    
-   <p>In this example tuples are co-grouped and the INNER keyword is used to ensure that only bags with at least one tuple are returned. </p>
-<source>
-X = COGROUP A BY owner INNER, B BY friend2 INNER;
-
-DUMP X;
-(Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)})
-(Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})
-</source>
-   
    <p>In this example tuples are co-grouped and the INNER keyword is used asymmetrically on only one of the relations.</p>
 <source>
 X = COGROUP A BY owner, B BY friend2 INNER;
@@ -5626,21 +5669,6 @@ DUMP X;
 </source>
    </section>
    
-   
- <section>
-   <title>Example</title>
-   <p>This example shows how to compute the number of tuples in an inner join between two relations.</p>   
-<source>
-A = LOAD …
-B = LOAD …
-C = COGROUP A BY f1 INNER, B BY f2 INNER;
-D = FOREACH C GENERATE group, COUNT(A)*COUNT(B) AS count;   -- cross product in each co-group
-E = GROUP D ALL;
-F = FOREACH E GENERATE SUM(D.count) AS sum;  -- sum of cross products
-DUMP F;
-</source>
-</section>
-   
    <section>
    <title>Example</title>
 <p>This example shows how to group using multiple keys.</p>   
@@ -5663,7 +5691,7 @@ DUMP F;
     
        <section>
    <title>Example</title>
-<p>This example shows how to cogroup using the merge keyword.</p>   
+<p>This example shows how to use COGROUP with the merge keyword.</p>   
 <source>
  register zebra.jar;
 A = LOAD 'data1' USING org.apache.hadoop.zebra.pig.TableLoader('id:int', 'sorted');
@@ -7211,6 +7239,8 @@ DUMP U;
 <!-- +++++++++++++++++++++++++++++++++++++++++++++++ --> 
    <p><strong>Macro Definition</strong></p>
    <p>A macro definition can appear anywhere in a script as long as it appears prior to the first use. A macro definition can include references to other macros as long as the referenced macros are defined prior to the macro definition. Recursive references are not allowed. </p>
+   
+   <p>Note that macros are NOT allowed inside a <a href="#FOREACH">FOREACH</a> nested block.</p>
 
 <p>In this example the macro is named my_macro. Note that only aliases A and C are visible from the outside; alias B is not visible from the outside.</p>
 <source>

Modified: pig/branches/branch-0.9/src/docs/src/documentation/content/xdocs/cont.xml
URL: http://svn.apache.org/viewvc/pig/branches/branch-0.9/src/docs/src/documentation/content/xdocs/cont.xml?rev=1096096&r1=1096095&r2=1096096&view=diff
==============================================================================
--- pig/branches/branch-0.9/src/docs/src/documentation/content/xdocs/cont.xml (original)
+++ pig/branches/branch-0.9/src/docs/src/documentation/content/xdocs/cont.xml Sat Apr 23 00:45:00 2011
@@ -21,18 +21,18 @@
   </header>
   <body>
   
- <!-- ++++++++++++++++++++++++++++++++++ -->    
+ <!-- ============================================ -->    
    <section>
    <title>Pig Macros</title> 
    <p>Pig Latin supports the definition, expansion, and import of macros.</p>
    <p>See <a href="basic.html#define-macros">DEFINE (macros)</a> and <a href="/basic.html#IMPORT">IMPORT</a>.</p>
    </section> 
   
- <!-- ++++++++++++++++++++++++++++++++++ -->    
-   <section>
-   <title>Embedded Pig and Control Flow </title>
+  <!-- ============================================ -->       
+   <section id="embed-python">
+   <title>Embedded Pig - Python and JavaScript </title>
    
-<p>To enable control flow, you can embed Pig Latin scripts in a host scripting language via a JDBC-like compile, bind, run model.  Supported host languages include Python (via Jython) and JavaScript (via Rhino). You must make sure that the Jython jar and/or the Rhino jar are included in your class path if you want to used embedded Pig. At runtime Pig will automatically detect the usage of a scripting UDF in the Pig script and will ship the corresponding scripting jar to the backend.</p>  
+<p>To enable control flow, you can embed Pig Latin scripts in a host scripting language via a JDBC-like compile, bind, run model. This section discusses Python (via Jython) and JavaScript (via Rhino). You must make sure that the Jython jar and/or the Rhino jar are included in your class path if you want to use embedded Pig. At runtime Pig will automatically detect the usage of a scripting UDF in the Pig script and will ship the corresponding scripting jar to the backend.</p>  
 
 <p>Note: Currently, more Python than JavaScript examples are shown below.</p>
    
@@ -271,7 +271,6 @@ Pig.compile(...).bind(...).runSingle(pro
 
 </section> 
 
-
 <section>
 <title>Embedded Pig and Pig Runner API</title>
 
@@ -741,8 +740,95 @@ public abstract class PigStats {
 </section>  
 </section>    
 </section> 
+
+
+ <!-- ============================================ -->    
+<section id="embed-java">
+<title>Embedded Pig - Java </title>
+<p>Currently, <a href="http://pig.apache.org/docs/r0.9.0/api/org/apache/pig/PigServer.html">PigServer</a> is the main interface point for embedding Pig in Java. PigServer can now be instantiated from multiple threads. (In the past, PigServer contained references to static data that prevented multiple instances of the object from being created from different threads within your application.) Please note that PigServer is not thread safe; the same object can't be shared across multiple threads. </p>
+
+
+<!-- ++++++++++++++++++++++++++++++++++ -->
+<p><strong>Local Mode</strong></p>
+<p>From your current working directory, compile the program. (Note that idlocal.class is written to your current working directory. Include “.” in the class path when you run the program.) </p>
+<source>
+$ javac -cp pig.jar idlocal.java
+</source>
+<p> </p>
+<p>From your current working directory, run the program. To view the results, check the output file, id.out.</p>
+<source>
+Unix:   $ java -cp pig.jar:. idlocal
+Cygwin: $ java -cp '.;pig.jar' idlocal
+</source>
+
+<p>idlocal.java - The sample code is based on Pig Latin statements that extract all user IDs from the /etc/passwd file. 
+Copy the /etc/passwd file to your local working directory.</p>
+<source>
+import java.io.IOException;
+import org.apache.pig.PigServer;
+public class idlocal{ 
+public static void main(String[] args) {
+try {
+    PigServer pigServer = new PigServer("local");
+    runIdQuery(pigServer, "passwd");
+    }
+    catch(Exception e) {
+        e.printStackTrace();  // report the failure rather than silently swallowing it
+    }
+ }
+public static void runIdQuery(PigServer pigServer, String inputFile) throws IOException {
+    pigServer.registerQuery("A = load '" + inputFile + "' using PigStorage(':');");
+    pigServer.registerQuery("B = foreach A generate $0 as id;");
+    pigServer.store("B", "id.out");
+ }
+}
+</source>
+<p> </p>
+
+<!-- ++++++++++++++++++++++++++++++++++ -->
+<p><strong>Mapreduce Mode</strong></p>
+<p>Point $HADOOPDIR to the directory that contains the hadoop-site.xml file. Example: 
+</p>
+<source>
+$ export HADOOPDIR=/yourHADOOPsite/conf 
+</source>
+<p>From your current working directory, compile the program. (Note that idmapreduce.class is written to your current working directory. Include “.” in the class path when you run the program.)
+</p>
+<source>
+$ javac -cp pig.jar idmapreduce.java
+</source>
+<p></p>
+<p>From your current working directory, run the program. To view the results, check the idout directory on your Hadoop system. </p>
+<source>
+Unix:   $ java -cp pig.jar:.:$HADOOPDIR idmapreduce
+Cygwin: $ java -cp '.;pig.jar;$HADOOPDIR' idmapreduce
+</source>
+
+<p>idmapreduce.java - The sample code is based on Pig Latin statements that extract all user IDs from the /etc/passwd file. 
+Copy the /etc/passwd file to your local working directory.</p>
+<source>
+import java.io.IOException;
+import org.apache.pig.PigServer;
+public class idmapreduce{
+   public static void main(String[] args) {
+   try {
+     PigServer pigServer = new PigServer("mapreduce");
+     runIdQuery(pigServer, "passwd");
+   }
+   catch(Exception e) {
+     e.printStackTrace();  // report the failure rather than silently swallowing it
+   }
+}
+public static void runIdQuery(PigServer pigServer, String inputFile) throws IOException {
+   pigServer.registerQuery("A = load '" + inputFile + "' using PigStorage(':');")
+   pigServer.registerQuery("B = foreach A generate $0 as id;");
+   pigServer.store("B", "idout");
+   }
+}
+</source>
+</section>
+
+
   
- <!-- ++++++++++++++++++++++++++++++++++ -->    
+ <!-- =========================================== -->    
    <section>
    <title>Parameter Substitution</title>
    <section>

Modified: pig/branches/branch-0.9/src/docs/src/documentation/content/xdocs/func.xml
URL: http://svn.apache.org/viewvc/pig/branches/branch-0.9/src/docs/src/documentation/content/xdocs/func.xml?rev=1096096&r1=1096095&r2=1096096&view=diff
==============================================================================
--- pig/branches/branch-0.9/src/docs/src/documentation/content/xdocs/func.xml (original)
+++ pig/branches/branch-0.9/src/docs/src/documentation/content/xdocs/func.xml Sat Apr 23 00:45:00 2011
@@ -511,10 +511,10 @@ SSN = load 'ssn.txt' using PigStorage() 
 
 SSN_NAME = load 'students.txt' using PigStorage() as (ssn:long, name:chararray);
 
--- do a left out join of SSN with SSN_Name
-X = cogroup SSN by ssn inner, SSN_NAME by ssn;
+/* do a left outer join of SSN with SSN_Name */
+X = JOIN SSN by ssn LEFT OUTER, SSN_NAME by ssn;
 
--- only keep those ssn's for which there is no name
+/* only keep those ssn's for which there is no name */
 Y = filter X by IsEmpty(SSN_NAME);
 </source>
    </section></section>    

Modified: pig/branches/branch-0.9/src/docs/src/documentation/content/xdocs/perf.xml
URL: http://svn.apache.org/viewvc/pig/branches/branch-0.9/src/docs/src/documentation/content/xdocs/perf.xml?rev=1096096&r1=1096095&r2=1096096&view=diff
==============================================================================
--- pig/branches/branch-0.9/src/docs/src/documentation/content/xdocs/perf.xml (original)
+++ pig/branches/branch-0.9/src/docs/src/documentation/content/xdocs/perf.xml Sat Apr 23 00:45:00 2011
@@ -700,7 +700,7 @@ F = filter E by C.t == 1;
 <source>
 A = load 'data' as (in: map[]);
 -- get key out of the map
-B = foreach A generate in#k1 as k1, in#k2 as k2;
+B = foreach A generate in#'k1' as k1, in#'k2' as k2;
 -- concatenate the keys
 C = foreach B generate CONCAT(k1, k2);
 .......
@@ -710,7 +710,7 @@ C = foreach B generate CONCAT(k1, k2);
 <source>
 A = load 'data' as (in: map[]);
 -- concatenate the keys from the map
-B = foreach A generate CONCAT(in#k1, in#k2);
+B = foreach A generate CONCAT(in#'k1', in#'k2');
 ....
 </source>
 
@@ -1106,20 +1106,17 @@ A = load 'data1' using org.apache.hadoop
 B = load 'data2' using org.apache.hadoop.zebra.pig.TableLoader('id:int', 'sorted'); 
 C = join A by id left, B by id using 'merge'; 
 </source>
-
-<p></p>
-<p><strong>Both Conditions</strong></p>
-<p>
-For optimal performance, each part file of the left (sorted) input of the join should have a size of at least 
-1 hdfs block size (for example if the hdfs block size is 128 MB, each part file should be less than 128 MB). 
-If the total input size (including all part files) is greater than blocksize, then the part files should be uniform in size 
-(without large skews in sizes). The main idea is to eliminate skew in the amount of input the final map 
-job performing the merge-join will process. 
-</p>
-
 </section>
-</section><!-- END MERGE JOIN -->
-
+</section>
+<!-- END MERGE JOIN -->
+<section>
+<title>Performance Considerations</title>
+<p>Note the following:</p>
+<ul>
+<li>If one of the data sets is small enough to fit into memory, a Replicated Join is very likely to provide better performance (see the sketch below).</li>
+<li>You will also see better performance if the data in the left table is partitioned evenly across part files (no significant skew and each part file contains at least one full block of data).</li>
+</ul>
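+<p>A minimal sketch of a replicated join (assumes B is the small data set; the small relation is listed last):</p>
+<source>
+C = join A by id, B by id using 'replicated';
+</source>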
+</section>
 <!-- END SPECIALIZED JOINS--> 
    
 	</section>

Modified: pig/branches/branch-0.9/src/docs/src/documentation/content/xdocs/start.xml
URL: http://svn.apache.org/viewvc/pig/branches/branch-0.9/src/docs/src/documentation/content/xdocs/start.xml?rev=1096096&r1=1096095&r2=1096096&view=diff
==============================================================================
--- pig/branches/branch-0.9/src/docs/src/documentation/content/xdocs/start.xml (original)
+++ pig/branches/branch-0.9/src/docs/src/documentation/content/xdocs/start.xml Sat Apr 23 00:45:00 2011
@@ -109,7 +109,7 @@ Test the Pig installation with this simp
     <td>yes</td>
 	</tr>
 	<tr>
-	<td>Embedded Programs (embed statements in a host language)</td>
+	<td>Embedded Pig (embed statements in a host language)</td>
     <td>yes</td>
     <td>yes</td>
 	</tr>
@@ -255,89 +255,8 @@ $ pig -x mapreduce id.pig
 
 <!-- ++++++++++++++++++++++++++++++++++ -->
 <section>
-<title>Embedded Programs</title>
-<p>Use the embedded option to embed Pig statements in a host language. Currently Java and Python are supported.</p>
-
-<section>
-<title>Java Example</title>
-
-<!-- ++++++++++++++++++++++++++++++++++ -->
-<p><strong>Local Mode</strong></p>
-<p>From your current working directory, compile the program. (Note that idlocal.class is written to your current working directory. Include “.” in the class path when you run the program.) </p>
-<source>
-$ javac -cp pig.jar idlocal.java
-</source>
-<p> </p>
-<p>From your current working directory, run the program. To view the results, check the output file, id.out.</p>
-<source>
-Unix:   $ java -cp pig.jar:. idlocal
-Cygwin: $ java –cp ‘.;pig.jar’ idlocal
-</source>
-
-<p>idlocal.java - The sample code is based on Pig Latin statements that extract all user IDs from the /etc/passwd file. 
-Copy the /etc/passwd file to your local working directory.</p>
-<source>
-import java.io.IOException;
-import org.apache.pig.PigServer;
-public class idlocal{ 
-public static void main(String[] args) {
-try {
-    PigServer pigServer = new PigServer("local");
-    runIdQuery(pigServer, "passwd");
-    }
-    catch(Exception e) {
-    }
- }
-public static void runIdQuery(PigServer pigServer, String inputFile) throws IOException {
-    pigServer.registerQuery("A = load '" + inputFile + "' using PigStorage(':');");
-    pigServer.registerQuery("B = foreach A generate $0 as id;");
-    pigServer.store("B", "id.out");
- }
-}
-</source>
-<p> </p>
-
-<!-- ++++++++++++++++++++++++++++++++++ -->
-<p><strong>Mapreduce Mode</strong></p>
-<p>Point $HADOOPDIR to the directory that contains the hadoop-site.xml file. Example: 
-</p>
-<source>
-$ export HADOOPDIR=/yourHADOOPsite/conf 
-</source>
-<p>From your current working directory, compile the program. (Note that idmapreduce.class is written to your current working directory. Include “.” in the class path when you run the program.)
-</p>
-<source>
-$ javac -cp pig.jar idmapreduce.java
-</source>
-<p></p>
-<p>From your current working directory, run the program. To view the results, check the idout directory on your Hadoop system. </p>
-<source>
-Unix:   $ java -cp pig.jar:.:$HADOOPDIR idmapreduce
-Cygwin: $ java –cp ‘.;pig.jar;$HADOOPDIR’ idmapreduce
-</source>
-
-<p>idmapreduce.java - The sample code is based on Pig Latin statements that extract all user IDs from the /etc/passwd file. 
-Copy the /etc/passwd file to your local working directory.</p>
-<source>
-import java.io.IOException;
-import org.apache.pig.PigServer;
-public class idmapreduce{
-   public static void main(String[] args) {
-   try {
-     PigServer pigServer = new PigServer("mapreduce");
-     runIdQuery(pigServer, "passwd");
-   }
-   catch(Exception e) {
-   }
-}
-public static void runIdQuery(PigServer pigServer, String inputFile) throws IOException {
-   pigServer.registerQuery("A = load '" + inputFile + "' using PigStorage(':');")
-   pigServer.registerQuery("B = foreach A generate $0 as id;");
-   pigServer.store("B", "idout");
-   }
-}
-</source>
-</section>
+<title>Embedded Pig</title>
+<p>You can embed Pig statements in a host language. Supported languages include Python, JavaScript, and Java (see <a href="cont.html">Control Structures</a>). </p>
 
 </section>
 </section>
@@ -465,11 +384,28 @@ However, in a production environment you
 <!-- PIG PROPERTIES -->
 <section>
 <title>Pig Properties</title>
-   <p>
-The Pig "-propertyfile" option enables you to pass a set of Pig or Hadoop properties to a Pig job. If the value is present in both the property file passed from the command line as well as in default property file bundled into pig.jar, the properties passed from command line take precedence. This property, as well as all other properties defined in Pig, are available to your UDFs via UDFContext.getClientSystemProps()API call (see <a href="udf.html">User Defined Functions</a>.)  </p>
-
-<p>You can retrieve a list of all properties using the <a href="cmds.html#help">help properties</a> command.</p>
-<p>You can set properties using the <a href="cmds.html#set">set</a> command.</p>
+   <p>Pig supports a number of Java properties that you can use to customize Pig behavior. You can retrieve a list of the properties using the <a href="cmds.html#help">help properties</a> command. All of these properties are optional; none are required. </p>
+<p></p>
+<p>To specify Pig properties use one of these mechanisms:</p>
+<ul>
+	<li>The pig.properties file (add the directory that contains the pig.properties file to the classpath)</li>
+	<li>The -D command line option and a Pig property (pig -Dpig.tmpfilecompression=true)</li>
+	<li>The -P command line option and a properties file (pig -P mypig.properties)</li>
+	<li>The <a href="cmds.html#set">set</a> command (set pig.exec.nocombiner true)</li>
+</ul>
+<p><strong>Note:</strong> The properties file uses standard Java property file format.</p>
+<p>The following precedence order is supported: pig.properties &lt; -D Pig property &lt; -P properties file &lt; set command. This means that if the same property is provided using the -D command line option as well as the -P command line option and a properties file, the value of the property in the properties file will take precedence.</p>
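+<p>For example, a minimal pig.properties file (hypothetical values taken from the options above):</p>
+<source>
+# standard Java property file format
+pig.tmpfilecompression=true
+pig.exec.nocombiner=false
+</source>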
+<p>To specify Hadoop properties you can use the same mechanisms:</p>
+<ul>
+	<li>The hadoop-site.xml file (add the directory that contains the hadoop-site.xml file to the classpath)</li>
	<li>The -D command line option and a Hadoop property (pig -Dmapreduce.task.profile=true) </li>
+	<li>The -P command line option and a property file (pig -P property_file)</li>
+	<li>The <a href="cmds.html#set">set</a> command (set mapred.map.tasks.speculative.execution false)</li>
+</ul>
+<p></p>
+<p>The same precedence holds: hadoop-site.xml &lt; -D Hadoop property &lt; -P properties_file &lt; set command.</p>
+<p>Hadoop properties are not interpreted by Pig but are passed directly to Hadoop. Any Hadoop property can be passed this way. </p>
+<p>All properties that Pig collects, including Hadoop properties, are available to any UDF via the UDFContext. To get access to the properties, you can call the getJobConf method.</p>
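+<p>A sketch of reading a property inside a UDF (assumes Pig 0.9 APIs; error handling omitted):</p>
+<source>
+import org.apache.hadoop.conf.Configuration;
+import org.apache.pig.impl.util.UDFContext;
+
+// inside the UDF's exec() method:
+Configuration conf = UDFContext.getUDFContext().getJobConf();
+String value = conf.get("pig.tmpfilecompression");
+</source>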
 </section>  
 
 

Modified: pig/branches/branch-0.9/src/docs/src/documentation/content/xdocs/udf.xml
URL: http://svn.apache.org/viewvc/pig/branches/branch-0.9/src/docs/src/documentation/content/xdocs/udf.xml?rev=1096096&r1=1096095&r2=1096096&view=diff
==============================================================================
--- pig/branches/branch-0.9/src/docs/src/documentation/content/xdocs/udf.xml (original)
+++ pig/branches/branch-0.9/src/docs/src/documentation/content/xdocs/udf.xml Sat Apr 23 00:45:00 2011
@@ -759,7 +759,7 @@ has methods to deal with metadata - most
 <li><a href="http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/LoadPushDown.java?view=markup">LoadPushDown</a> 
 has methods to push operations from Pig runtime into loader implementations. Currently only the pushProjection() method is called by Pig to communicate to the loader the exact fields that are required in the Pig script. The loader implementation can choose to honor the request (return only those fields required by Pig script) or not honor the request (return all fields in the data). If the loader implementation can efficiently honor the request, it should implement LoadPushDown to improve query performance. (Irrespective of whether the implementation can or cannot honor the request, if the implementation also implements getSchema(), the schema returned in getSchema() should describe the entire tuple of data.)
 <ul>
-	<li>pushProjection(): This method tells LoadFunc which fields are required in the Pig script, thus enabling LoadFunc to optimize performance by loading only those fields that are needed. pushProjection() takes a RequiredFieldList. RequiredFieldList includes a list of RequiredField: each RequiredField indicates a field required by the Pig script; each RequiredField includes index, alias, type (which is reserved for future use), and subFields. Pig will use the column index RequiredField.index to communicate with the LoadFunc about the fields required by the Pig script. If the required field is a map, Pig will optionally pass RequiredField.subFields which contains a list of keys that the Pig script needs for the map. For example, if the Pig script needs two keys for the map, "key1" and "key2", the subFields for that map will contain two RequiredField; the alias field for the first RequiredField will be "key1" and the alias for the second RequiredField will be "key2". LoadFunc will use RequiredFieldResponse.requiredFieldRequestHonored to indicate whether the pushProjection() request is honored.
+	<li>pushProjection(): This method tells LoadFunc which fields are required in the Pig script, thus enabling LoadFunc to optimize performance by loading only those fields that are needed. pushProjection() takes a requiredFieldList. requiredFieldList is read only and cannot be changed by LoadFunc. requiredFieldList includes a list of requiredField: each requiredField indicates a field required by the Pig script; each requiredField includes index, alias, type (which is reserved for future use), and subFields. Pig will use the column index requiredField.index to communicate with the LoadFunc about the fields required by the Pig script. If the required field is a map, Pig will optionally pass requiredField.subFields which contains a list of keys that the Pig script needs for the map. For example, if the Pig script needs two keys for the map, "key1" and "key2", the subFields for that map will contain two requiredField; the alias field for the first requiredField will be "key1" and the alias for the second requiredField will be "key2". LoadFunc will use requiredFieldResponse.requiredFieldRequestHonored to indicate whether the pushProjection() request is honored.
 </li>
 </ul>
 </li>