Posted to commits@pig.apache.org by bi...@apache.org on 2012/09/10 06:37:27 UTC

svn commit: r1382638 - in /pig/trunk: CHANGES.txt src/docs/src/documentation/content/xdocs/basic.xml

Author: billgraham
Date: Mon Sep 10 04:37:27 2012
New Revision: 1382638

URL: http://svn.apache.org/viewvc?rev=1382638&view=rev
Log:
PIG-2901: Errors and lacks in document Pig Latin Basics (miyakawataku via billgraham)

Modified:
    pig/trunk/CHANGES.txt
    pig/trunk/src/docs/src/documentation/content/xdocs/basic.xml

Modified: pig/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/pig/trunk/CHANGES.txt?rev=1382638&r1=1382637&r2=1382638&view=diff
==============================================================================
--- pig/trunk/CHANGES.txt (original)
+++ pig/trunk/CHANGES.txt Mon Sep 10 04:37:27 2012
@@ -25,6 +25,8 @@ PIG-1891 Enable StoreFunc to make intell
 
 IMPROVEMENTS
 
+PIG-2901: Errors and lacks in document "Pig Latin Basics" (miyakawataku via billgraham)
+
 PIG-2905: Improve documentation around REPLACE (cheolsoo via billgraham)
 
 PIG-2882: Use Deque instead of Stack (mkhadikov via dvryaboy)

Modified: pig/trunk/src/docs/src/documentation/content/xdocs/basic.xml
URL: http://svn.apache.org/viewvc/pig/trunk/src/docs/src/documentation/content/xdocs/basic.xml?rev=1382638&r1=1382637&r2=1382638&view=diff
==============================================================================
--- pig/trunk/src/docs/src/documentation/content/xdocs/basic.xml (original)
+++ pig/trunk/src/docs/src/documentation/content/xdocs/basic.xml Mon Sep 10 04:37:27 2012
@@ -1155,7 +1155,7 @@ dump X;
 (,{(sam,,3.0),(bob,,3.5)})
    </source>
    
-<p>When using the GROUP (COGROUP) operator with multiple relations, records with a null group key are considered different and are grouped separately. In the example below note that there are two tuples in the output corresponding to the null group key: one that contains tuples from relation A (but not relation B) and one that contains tuples from relation B (but not relation A).</p>
+<p>When using the GROUP (COGROUP) operator with multiple relations, records with a null group key from different relations are considered different and are grouped separately. In the example below note that there are two tuples in the output corresponding to the null group key: one that contains tuples from relation A (but not relation B) and one that contains tuples from relation B (but not relation A).</p>
    
 <source>
 A = load 'student' as (name:chararray, age:int, gpa:float);
@@ -1367,7 +1367,7 @@ dump X;
          <p>A bag is a collection of tuples</p>
       </li>
       <li>
-         <p>A map key must be a scalar; a map value can be any data type</p>
+         <p>A map key must be a chararray; a map value can be any data type</p>
       </li>
    </ul>
    <p></p>
@@ -1597,7 +1597,7 @@ B = foreach A generate x+y;
 
  <p>If you do <a href="test.html#DESCRIBE">DESCRIBE</a> on B, you will see a single column of type double. This is because Pig makes the safest choice and uses the largest numeric type when the schema is not know. In practice, the input data could contain integer values; however, Pig will cast the data to double and make sure that a double result is returned.</p>
 
- <p>If the schema of a relationship can’t be inferred, Pig will just use the runtime data as is and propagate it through the pipeline.</p>
+ <p>If the schema of a relation can’t be inferred, Pig will just use the runtime data as is and propagate it through the pipeline.</p>
 
 
    <!-- ++++++++++++++++++++++++++++++++++ -->     
@@ -3351,7 +3351,7 @@ B = FOREACH A GENERATE $0 + 1, $1 + 1.0
    </ul>
    <ul>
       <li>
-         <p>When two bytearrays are used in arithmetic expressions or with built in aggregate functions (such as SUM) they are implicitly cast to double. If the underlying data is really int or long, you’ll get better performance by declaring the type or explicitly casting the data.</p>
+         <p>When two bytearrays are used in arithmetic expressions or a bytearray expression is used with built in aggregate functions (such as SUM) they are implicitly cast to double. If the underlying data is really int or long, you’ll get better performance by declaring the type or explicitly casting the data.</p>
       </li>
       <li>
          <p>Downcasts may cause loss of data. For example casting from long to int may drop bits.</p>
@@ -3499,7 +3499,7 @@ If the relation contains more than one t
  
 <p>The primary use case for casting relations to scalars is the ability to use the values of global aggregates in follow up computations. </p> 
  
-<p>In this example the percentage of clicks belonging to a particular user are computed. For the FOREACH statement, an explicit cast if used. If the SUM is not given a name, a position can be used as well (userid, clicks/(double)C.$0). </p>
+<p>In this example the percentage of clicks belonging to a particular user are computed. For the FOREACH statement, an explicit cast is used. If the SUM is not given a name, a position can be used as well (userid, clicks/(double)C.$0). </p>
 
 <source>
 A = load 'mydata' as (userid, clicks); 
@@ -3615,7 +3615,7 @@ dump E; 
             <td>
             <p>Takes an expression on the left and a string constant on the right.</p>
             <p><em>expression</em> matches <em>string-constant</em></p>
-            <p>Use the Java <a href="http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Pattern.html">format</a> for regular expressions.</p>
+            <p>Use the Java <a href="http://docs.oracle.com/javase/1.5.0/docs/api/java/util/regex/Pattern.html">format</a> for regular expressions.</p>
 
             </td>
          </tr>
@@ -4034,8 +4034,8 @@ X = FILTER A BY (f1 matches '.*apache.*'
             </td>
          </tr>
    </table>
-   <p>Note 1: boolean (Tuple A is equal to tuple B if they have the same size s, and for all 0 &lt;= i &lt; s A[i] = = B[i])</p>
-   <p>Note 2: boolean (Map A is equal to map B if A and B have the same number of entries, and for every key k1 in A with a value of v1, there is a key k2 in B with a value of v2, such that k1 = = k2 and v1 = = v2)</p>
+   <p>Note 1: boolean (Tuple A is equal to tuple B if they have the same size s, and for all 0 &lt;= i &lt; s A[i] == B[i])</p>
+   <p>Note 2: boolean (Map A is equal to map B if A and B have the same number of entries, and for every key k1 in A with a value of v1, there is a key k2 in B with a value of v2, such that k1 == k2 and v1 == v2)</p>
 </section>
 
    <section id="types-table-not-equal">
@@ -4667,7 +4667,7 @@ Output (results):
 <p><strong>Tuple Example</strong></p>   
 <p>Suppose we have relation A.</p>
 <source>
-LOAD 'data' as (f1:int, f2:tuple(t1:int,t2:int,t3:int));
+A = LOAD 'data' as (f1:int, f2:tuple(t1:int,t2:int,t3:int));
 
 DUMP A;
 (1,(1,2,3))
@@ -4966,7 +4966,6 @@ A = LOAD 'data' as (x, y, z);
 
 B = FOREACH A GENERATE -x, y;
 </source>
-</section>
    
    </section>
    
@@ -5048,6 +5047,7 @@ B = FOREACH A GENERATE -x, y;
    </table>
    </section>
   
+</section>
 </section>   
 
 <!-- =================================================================== -->
@@ -5488,7 +5488,7 @@ X = FOREACH B {
    
    <section id="projection">
    <title>Example: Projection</title>
-   <p>In this example the asterisk (*) is used to project all tuples from relation A to relation X. Relation A and X are identical.</p>
+   <p>In this example the asterisk (*) is used to project all fields from relation A to relation X. Relation A and X are identical.</p>
 <source>
 X = FOREACH A GENERATE *;
 
@@ -5996,12 +5996,12 @@ DUMP B;
 (Paul,Jane)
 </source>
    
-   <p>In this example tuples are co-grouped using field “owner” from relation A and field “friend2” from relation B as the key fields. The DESCRIBE operator shows the schema for relation X, which has two fields, "group" and "A" (see the GROUP operator for information about the field names).</p>
+   <p>In this example tuples are co-grouped using field “owner” from relation A and field “friend2” from relation B as the key fields. The DESCRIBE operator shows the schema for relation X, which has three fields, "group", "A" and "B" (see the GROUP operator for information about the field names).</p>
 <source>
 X = COGROUP A BY owner, B BY friend2;
 
 DESCRIBE X;
-X: {group: chararray,A: {owner: chararray,pet: chararray},b: {firend1: chararray,friend2: chararray}}
+X: {group: chararray,A: {owner: chararray,pet: chararray},B: {friend1: chararray,friend2: chararray}}
 </source>
    
    <p>Relation X looks like this. A tuple is created for each unique key field. The tuple includes the key field and two bags. The first bag is the tuples from the first relation with the matching key field. The second bag is the tuples from the second relation with the matching key field. If no tuples match the key field, the bag is empty.</p>
@@ -6010,16 +6010,6 @@ X: {group: chararray,A: {owner: chararra
 (Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})
 (Jane,{},{(Paul,Jane)})
 </source>
-   
-   <p>In this example tuples are co-grouped and the INNER keyword is used asymmetrically on only one of the relations.</p>
-<source>
-X = COGROUP A BY owner, B BY friend2 INNER;
-
-DUMP X;
-(Bob,{(Bob,dog),(Bob,cat)},{(Paul,Bob)})
-(Jane,{},{(Paul,Jane)})
-(Alice,{(Alice,turtle),(Alice,goldfish),(Alice,cat)},{(Cindy,Alice),(Mark,Alice)})
-</source>
    </section>
    
    <section>
@@ -6185,7 +6175,7 @@ public class SimpleCustomPartitioner ext
    <section>
    <title>Usage</title>
    <p>Use the JOIN operator to perform an inner, equijoin join of two or more relations based on common field values. 
-   The JOIN operator always performs an inner join. Inner joins ignore null keys, so it makes sense to filter them out before the join.</p>
+   Inner joins ignore null keys, so it makes sense to filter them out before the join.</p>
    
    <p>Note the following about the GROUP/COGROUP and JOIN operators:</p>
       <ul>
@@ -6255,7 +6245,7 @@ DUMP X;
 
 <section id="join-outer">
    <title>JOIN (outer) </title>
-   <p>Performs an outer join of two or more relations based on common field values.</p>
+   <p>Performs an outer join of two relations based on common field values.</p>
    
    <section>
    <title>Syntax</title>
@@ -6509,7 +6499,7 @@ C = JOIN A BY name FULL, B BY name USING
    
    <section>
    <title>Examples</title>
-   <p>In this example the lmit is express as a scalar.</p>
+   <p>In this example the limit is expressed as a scalar.</p>
  <source>
 a = load 'a.txt';
 b = group a all;
@@ -6693,7 +6683,7 @@ ILLUSTRATE A;
 ---------------------------------------
 </source>
    <p>
-      For examples of how to specify more complex schemas for use with the LOAD operator, see Schemas for Complex Data Types and Schemas for Multiple Types.
+      For examples of how to specify more complex schemas for use with the LOAD operator, see <a href="#schema-complex">Schemas for Complex Data Types</a> and <a href="#schema-multi">Schemas for Multiple Types</a>.
       </p></section></section>
       
 
@@ -6753,7 +6743,7 @@ ILLUSTRATE A;
             </td>
             <td>
                <p>See <a href="basic.html#LOAD">LOAD</a></p>
-               <p>After running mr1.jar's MapReduce job, load back the data from outputLocation into alias1 using loadFunc as schema.</p>
+               <p>After running mr.jar's MapReduce job, load back the data from outputLocation into alias1 using loadFunc as schema.</p>
             </td>
      </tr>
 
@@ -7744,29 +7734,29 @@ B = STREAM B THROUGH CMD;
 <source>
 interface PigToStream {
 
-        /**
-         * Given a tuple, produce an array of bytes to be passed to the streaming
-         * executable.
-         */
-        public byte[] serialize(Tuple t) throws IOException;
-    }
-
-    interface StreamToPig {
-
-        /**
-         *  Given a byte array from a streaming executable, produce a tuple.
-         */
-        public Tuple deserialize(byte[]) throws IOException;
-
-        /**
-         * This will be called on the front end during planning and not on the back
-         * end during execution.
-         *
-         * @return the {@link LoadCaster} associated with this object.
-         * @throws IOException if there is an exception during LoadCaster
-         */
-        public LoadCaster getLoadCaster() throws IOException;
-    }
+    /**
+     * Given a tuple, produce an array of bytes to be passed to the streaming
+     * executable.
+     */
+    public byte[] serialize(Tuple t) throws IOException;
+}
+
+interface StreamToPig {
+
+    /**
+     *  Given a byte array from a streaming executable, produce a tuple.
+     */
+    public Tuple deserialize(byte[]) throws IOException;
+
+    /**
+     * This will be called on the front end during planning and not on the back
+     * end during execution.
+     *
+     * @return the {@link LoadCaster} associated with this object.
+     * @throws IOException if there is an exception during LoadCaster
+     */
+    public LoadCaster getLoadCaster() throws IOException;
+}
 </source>  
    
    </section>
@@ -7786,7 +7776,7 @@ interface PigToStream {
 OP = stream IP through 'script';
 or
 DEFINE CMD 'script' ship('/a/b/script');
-OP = stream IP through 'CMD';
+OP = stream IP through CMD;
 </source>
 		</li>
 	    <li>
@@ -7897,8 +7887,11 @@ DEFINE Y 'stream.pl' stderr('&lt;dir&gt;
 
 X = STREAM A THROUGH Y;
 </source>
+</section>
 
    
+<section>
+<title>Examples: DEFINE a function</title>
 <p>In this example a function is defined for use with the FOREACH …GENERATE operator.</p>
 <source>
 REGISTER /src/myfunc.jar
@@ -7976,7 +7969,7 @@ pig -Dpig.additional.jars=my.jar:your.ja
 
 <p>In this example a JAR file stored in HDFS is registered.</p>
 <source>
-java -cp pig.jar org.apache.pig.Main  hdfs://nn.mydomain.com:9020/myscripts/script.pig
+pig -Dpig.additional.jars=hdfs://nn.mydomain.com:9020/myjars/my.jar script.pig
 </source>
 
 <p>This example shows how to specify a glob pattern using either a relative path or an absolute path.</p>