Posted to commits@pig.apache.org by ol...@apache.org on 2010/03/23 20:47:25 UTC

svn commit: r926752 [1/2] - in /hadoop/pig/trunk: ./ src/docs/src/documentation/content/xdocs/

Author: olga
Date: Tue Mar 23 19:47:24 2010
New Revision: 926752

URL: http://svn.apache.org/viewvc?rev=926752&view=rev
Log:
PIG-1320: documentation updates for Pig 0.7.0 (chandec via olgan)

Modified:
    hadoop/pig/trunk/CHANGES.txt
    hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/cookbook.xml
    hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_ref1.xml
    hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml
    hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/udf.xml
    hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/zebra_mapreduce.xml
    hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/zebra_overview.xml
    hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/zebra_pig.xml
    hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/zebra_reference.xml

Modified: hadoop/pig/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/hadoop/pig/trunk/CHANGES.txt?rev=926752&r1=926751&r2=926752&view=diff
==============================================================================
--- hadoop/pig/trunk/CHANGES.txt (original)
+++ hadoop/pig/trunk/CHANGES.txt Tue Mar 23 19:47:24 2010
@@ -68,6 +68,8 @@ manner (rding via pradeepkth)
 
 IMPROVEMENTS
 
+PIG-1320: documentation updates for Pig 0.7.0 (chandec via olgan)
+
 PIG-1325: Provide a way to exclude a testcase when running "ant test"
 (pradeepkth)
 

Modified: hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/cookbook.xml
URL: http://svn.apache.org/viewvc/hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/cookbook.xml?rev=926752&r1=926751&r2=926752&view=diff
==============================================================================
--- hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/cookbook.xml (original)
+++ hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/cookbook.xml Tue Mar 23 19:47:24 2010
@@ -246,18 +246,29 @@ For more information see <a href="piglat
 <a href="piglatin_ref2.html#JOIN+%28inner%29">JOIN (inner)</a>, 
 <a href="piglatin_ref2.html#JOIN+%28outer%29">JOIN (outer)</a>, and
 <a href="piglatin_ref2.html#ORDER">ORDER</a>.
+</p>
  
-You can also set the value of PARALLEL for all scripts using the <a href="piglatin_ref2.html#set">set</a> command.</p>
-
-<p>Example</p>
+<p>You can also set the value of PARALLEL for all Pig scripts using the <a href="piglatin_ref2.html#set">set default parallel</a> command.</p>
 
+<p>In this example PARALLEL is used with the GROUP operator. </p>
 <source>
-A = load 'myfile' as (t, u, v);
-B = group A by t PARALLEL 18;
+A = LOAD 'myfile' AS (t, u, v);
+B = GROUP A BY t PARALLEL 18;
 .....
 </source>
+
+<p>In this example all the MapReduce jobs that get launched use 20 reducers.</p>
+<source>
+SET DEFAULT_PARALLEL 20;
+A = LOAD 'myfile.txt' USING PigStorage() AS (t, u, v);
+B = GROUP A BY t;
+C = FOREACH B GENERATE group, COUNT(A.t) as mycount;
+D = ORDER C BY mycount;
+STORE D INTO 'mysortedcount' USING PigStorage();
+</source>
 </section>
 
+
 <section>
 <title>Use the LIMIT Operator</title>
 

Modified: hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_ref1.xml
URL: http://svn.apache.org/viewvc/hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_ref1.xml?rev=926752&r1=926751&r2=926752&view=diff
==============================================================================
--- hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_ref1.xml (original)
+++ hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_ref1.xml Tue Mar 23 19:47:24 2010
@@ -310,21 +310,21 @@ With multi-query execution, the script w
 	<p>DUMP Example: In this script, because the DUMP command is interactive, the multi-query execution will be disabled and two separate jobs will be created to execute this script. The first job will execute A > B > DUMP while the second job will execute A > B > C > STORE.</p>
 	
 <source>
-A = LOAD ‘input’ AS (x, y, z);
+A = LOAD 'input' AS (x, y, z);
 B = FILTER A BY x > 5;
 DUMP B;
 C = FOREACH B GENERATE y, z;
-STORE C INTO ‘output’;
+STORE C INTO 'output';
 </source>
 	
 	<p>STORE Example: In this script, multi-query optimization will kick in allowing the entire script to be executed as a single job. Two outputs are produced: output1 and output2.</p>
 	
 <source>
-A = LOAD ‘input’ AS (x, y, z);
+A = LOAD 'input' AS (x, y, z);
 B = FILTER A BY x > 5;
-STORE B INTO ‘output1’;
+STORE B INTO 'output1';
 C = FOREACH B GENERATE y, z;
-STORE C INTO ‘output2’;	
+STORE C INTO 'output2';	
 </source>
 
 </section>
@@ -387,10 +387,10 @@ STORE A INTO 'out1';
 	</ol>	
 	
 	<p>Arguments used in a LOAD statement that have a scheme other than "hdfs" or "file" will not be expanded and passed to the LoadFunc/Slicer unchanged.</p>
-	<p>In the SQL case, the SQLLoader function is invoked with "sql://mytable". </p>
+	<p>In the SQL case, the SQLLoader function is invoked with 'sql://mytable'. </p>
 
 <source>
-A = LOAD "sql://mytable" USING SQLLoader();
+A = LOAD 'sql://mytable' USING SQLLoader();
 </source>
 </section>
 
@@ -515,7 +515,7 @@ tiny = LOAD 'tiny_data' AS (t1,t2,t3);
 
 mini = LOAD 'mini_data' AS (m1,m2,m3);
 
-C = JOIN big BY b1, tiny BY t1, mini BY m1 USING "replicated";
+C = JOIN big BY b1, tiny BY t1, mini BY m1 USING 'replicated';
 </source>
 </section>
 
@@ -538,7 +538,8 @@ Parallel joins are vulnerable to the pre
 If the underlying data is sufficiently skewed, load imbalances will swamp any of the parallelism gains. 
 In order to counteract this problem, skewed join computes a histogram of the key space and uses this 
 data to allocate reducers for a given key. Skewed join does not place a restriction on the size of the input keys. 
-It accomplishes this by splitting one of the input on the join predicate and streaming the other input. 
+It accomplishes this by splitting the left input on the join predicate and streaming the right input. The left input is 
+sampled to create the histogram.
 </p>
 
 <p>
@@ -553,7 +554,7 @@ associated with a given key is too large
 <source>
 big = LOAD 'big_data' AS (b1,b2,b3);
 massive = LOAD 'massive_data' AS (m1,m2,m3);
-C = JOIN big BY b1, massive BY m1 USING "skewed";
+C = JOIN big BY b1, massive BY m1 USING 'skewed';
 </source>
 </section>
 
@@ -604,7 +605,7 @@ and the right input of the join to be th
 <title>Usage</title>
 <p>Perform a merge join with the USING clause (see <a href="piglatin_ref2.html#JOIN+%28inner%29">inner joins</a>).</p>
 <source>
-C = JOIN A BY a1, B BY b1 USING "merge";
+C = JOIN A BY a1, B BY b1 USING 'merge';
 </source>
 </section>
 
@@ -622,8 +623,8 @@ key when read starting at a and ending i
 part-00002 and part-00003, the data should be sorted if the files are read in the sequence part-00000, part-00001, 
 part-00002 and part-00003. </li>
 <li>The merge join only has two inputs </li>
-<li>The loadfunc for the right input of the join should implement the SamplableLoader interface (PigStorage does 
-implement the SamplableLoader interface). </li>
+<li>The loadfunc for the right input of the join should implement the OrderedLoadFunc interface (PigStorage does 
+implement the OrderedLoadFunc interface). </li>
 <li>Only inner join will be supported </li>
 
 <li>Between the load of the sorted input and the merge join statement there can only be filter statements and 
@@ -759,9 +760,8 @@ D = JOIN C BY $1, B BY $1;
  <!-- MEMORY MANAGEMENT -->
 <section>
 <title>Memory Management</title>
-<p>For Pig 0.6.0 we changed how Pig decides when to spill bags to disk. In the past, Pig tried to figure out when an application was getting close to memory limit and then spill at that time. However, because Java does not include an accurate way to determine when to spill, Pig often ran out of memory. </p>
 
-<p>In the current version, we allocate a fix amount of memory to store bags and spill to disk as soon as the memory limit is reached. This is very similar to how Hadoop decides when to spill data accumulated by the combiner. </p>
+<p>Pig allocates a fixed amount of memory to store bags and spills to disk as soon as the memory limit is reached. This is very similar to how Hadoop decides when to spill data accumulated by the combiner. </p>
 
 <p>The amount of memory allocated to bags is determined by pig.cachedbag.memusage; the default is set to 10% of available memory. Note that this memory is shared across all large bags used by the application.</p>
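+<p>For example, you can raise this limit when invoking Pig from the command line. This is only a sketch: the value 0.2 (20% of available memory) and the script name are illustrative, and it assumes the property takes a fraction of the heap, consistent with the 10% default noted above.</p>
+<source>
+pig -Dpig.cachedbag.memusage=0.2 myscript.pig
+</source>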
 

Modified: hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml
URL: http://svn.apache.org/viewvc/hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml?rev=926752&r1=926751&r2=926752&view=diff
==============================================================================
--- hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml (original)
+++ hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/piglatin_ref2.xml Tue Mar 23 19:47:24 2010
@@ -162,7 +162,7 @@ Also, be sure to review the information 
       
          <row>
             <entry> <para>-- B </para> </entry>
-            <entry> <para>bag, BinaryDeserializer, BinarySerializer, BinStorage, by, bytearray </para> </entry>
+            <entry> <para>bag, BinStorage, by, bytearray </para> </entry>
          </row>   
 
          <row>
@@ -1996,7 +1996,7 @@ $ pig –param_file myparams script2.
    <title>Example: Specifying parameters using the declare statement</title>
    <para>In this example the command is executed and its stdout is used as the parameter value.</para>
 <programlisting>
-%declare CMD `generate_date`;
+%declare CMD 'generate_date';
 A = LOAD '/data/mydata/$CMD';
 B = FILTER A BY $0>'5';
 
@@ -2035,7 +2035,7 @@ $ pig –param data=mydata myscript.p
    <title>Example: Specifying parameter values as a command</title>
    <para>In this example the command is enclosed in back ticks. First, the parameters mycmd and date are substituted when the declare statement is encountered. Then the resulting command is executed and its stdout is placed in the path before the load statement is run.</para>
 <programlisting>
-%declare CMD `$mycmd $date`;
+%declare CMD '$mycmd $date';
 A = LOAD '/data/mydata/$CMD';
 B = FILTER A BY $0>'5';
  
@@ -5309,8 +5309,8 @@ DUMP X;
 
    <para>Another FLATTEN example. Here, relations A and B both have a column x. When forming relation E,  you need to use the :: operator to identify which column x to use - either relation A column x (A::x) or relation B column x (B::x). This example uses relation A column x (A::x).</para>
 <programlisting>
-A = load ‘data’ as (x, y);
-B = load ‘data’ as (x, z);
+A = load 'data' as (x, y);
+B = load 'data' as (x, z);
 C = cogroup A by x, B by x;
 D = foreach C generate flatten(A), flatten(b);
 E = group D by A::x;
@@ -5368,7 +5368,7 @@ readability, programmers usually use GRO
    <informaltable frame="all">
       <tgroup cols="1"><tbody><row>
             <entry>
-               <para>alias = GROUP alias { ALL | BY expression} [, alias ALL | BY expression …]  [USING "collected"] [PARALLEL n];</para>
+               <para>alias = GROUP alias { ALL | BY expression} [, alias ALL | BY expression …]  [USING 'collected'] [PARALLEL n];</para>
             </entry>
          </row></tbody></tgroup>
    </informaltable></section>
@@ -5422,7 +5422,7 @@ readability, programmers usually use GRO
          </row>
          <row>
             <entry>
-               <para>"collected"</para>
+               <para>'collected'</para>
             </entry>
             <entry>
                <para>Allows for more efficient computation of a group if the loader guarantees that the data for the 
@@ -5641,7 +5641,7 @@ DUMP X;
    <informaltable frame="all">
       <tgroup cols="1"><tbody><row>
             <entry>
-               <para>alias = JOIN alias BY {expression|'('expression [, expression …]')'} (, alias BY {expression|'('expression [, expression …]')'} …) [USING "replicated" | "skewed" | "merge"] [PARALLEL n];  </para>
+               <para>alias = JOIN alias BY {expression|'('expression [, expression …]')'} (, alias BY {expression|'('expression [, expression …]')'} …) [USING 'replicated' | 'skewed' | 'merge'] [PARALLEL n];  </para>
             </entry>
          </row></tbody></tgroup>
    </informaltable></section>
@@ -5684,7 +5684,7 @@ DUMP X;
          </row>
          <row>
             <entry>
-               <para>"replicated"</para>
+               <para>'replicated'</para>
             </entry>
             <entry>
                <para>Use to perform replicated joins (see <ulink url="piglatin_ref1.html#Replicated+Joins">Replicated Joins</ulink>).</para>
@@ -5693,7 +5693,7 @@ DUMP X;
          
                   <row>
             <entry>
-               <para>"skewed"</para>
+               <para>'skewed'</para>
             </entry>
             <entry>
                <para>Use to perform skewed joins (see <ulink url="piglatin_ref1.html#Skewed+Joins">Skewed Joins</ulink>).</para>
@@ -5702,7 +5702,7 @@ DUMP X;
          
                   <row>
             <entry>
-               <para>"merge"</para>
+               <para>'merge'</para>
             </entry>
             <entry>
                <para>Use to perform merge joins (see <ulink url="piglatin_ref1.html#Merge+Joins">Merge Joins</ulink>).</para>
@@ -5793,7 +5793,7 @@ DUMP X;
       <tgroup cols="1"><tbody><row>
             <entry>
                <para>alias = JOIN left-alias BY left-alias-column [LEFT|RIGHT|FULL] [OUTER], right-alias BY right-alias-column 
-               [USING "replicated" | "skewed"] [PARALLEL n];  </para>
+               [USING 'replicated' | 'skewed'] [PARALLEL n];  </para>
             </entry>
          </row></tbody></tgroup>
    </informaltable>
@@ -5875,7 +5875,7 @@ DUMP X;
          </row>
          <row>
             <entry>
-               <para>"replicated"</para>
+               <para>'replicated'</para>
             </entry>
             <entry>
                <para>Use to perform replicated joins (see <ulink url="piglatin_ref1.html#Replicated+Joins">Replicated Joins</ulink>).</para>
@@ -5885,7 +5885,7 @@ DUMP X;
          
                   <row>
             <entry>
-               <para>"skewed"</para>
+               <para>'skewed'</para>
             </entry>
             <entry>
                <para>Use to perform skewed joins (see <ulink url="piglatin_ref1.html#Skewed+Joins">Skewed Joins</ulink>).</para>
@@ -5951,16 +5951,16 @@ C = JOIN A BY $0 FULL, B BY $0;
 
 <para>This example shows a replicated left outer join.</para>
 <programlisting>
-A = LOAD ‘large’;
-B = LOAD ‘tiny’;
-C= JOIN A BY $0 LEFT, B BY $0 USING "replicated";
+A = LOAD 'large';
+B = LOAD 'tiny';
+C= JOIN A BY $0 LEFT, B BY $0 USING 'replicated';
 </programlisting>
 
 <para>This example shows a skewed full outer join.</para>
 <programlisting>
-A = LOAD  ‘studenttab’ as (name, age, gpa);
-B = LOAD  'votertab' as (name, age, registration, contribution);
-C = JOIN A BY name FULL, B BY name USING "skewed";
+A = LOAD 'studenttab' as (name, age, gpa);
+B = LOAD 'votertab' as (name, age, registration, contribution);
+C = JOIN A BY name FULL, B BY name USING 'skewed';
 </programlisting>
 
 </section>
@@ -6100,7 +6100,9 @@ DUMP X;
                <para>The load function. </para>
                <itemizedlist>
                   <listitem>
-                     <para>You can use a built-in function (see the load/store functions). PigStorage is the default load function and does not need to be specified (simply omit the USING clause).</para>
+                  
+                  
+                     <para>You can use a built-in function (see the <ulink url="#Load%2FStore+Functions">Load/Store Functions</ulink>). PigStorage is the default load function and does not need to be specified (simply omit the USING clause).</para>
                   </listitem>
                   <listitem>
                      <para>You can write your own load function  
@@ -6514,10 +6516,13 @@ DUMP Z;
                <para>The store function.</para>
                <itemizedlist>
                   <listitem>
-                     <para>You can use a built-in function (see the Load/Store Functions). PigStorage is the default load function and does not need to be specified (simply omit the USING clause).</para>
+                  
+                  
+                     <para>You can use a built-in function (see the <ulink url="#Load%2FStore+Functions">Load/Store Functions</ulink>). PigStorage is the default store function and does not need to be specified (simply omit the USING clause).</para>
                   </listitem>
                   <listitem>
-                     <para>You can write your own store function (see the User-Defined Function Manual) if your data is in a format that cannot be processed by the built-in functions.</para>
+                     <para>You can write your own store function  
+                     if your data is in a format that cannot be processed by the built-in functions (see the <ulink url="udf.html">Pig UDF Manual</ulink>).</para>
                   </listitem>
                </itemizedlist>
             </entry>
@@ -6545,7 +6550,7 @@ DUMP A;
 (7,2,5)
 (8,4,3)
 
-STORE A INTO ‘myoutput’ USING PigStorage (‘*’);
+STORE A INTO 'myoutput' USING PigStorage ('*');
 
 CAT myoutput;
 1*2*3
@@ -6600,7 +6605,7 @@ a:8,b:4,c:3
    <informaltable frame="all">
       <tgroup cols="1"><tbody><row>
             <entry>
-               <para>alias = STREAM alias [, alias …] THROUGH {`command` | cmd_alias } [AS schema] ;</para>
+               <para>alias = STREAM alias [, alias …] THROUGH {'command' | cmd_alias } [AS schema] ;</para>
             </entry>
          </row></tbody></tgroup>
    </informaltable></section>
@@ -6626,7 +6631,7 @@ a:8,b:4,c:3
          </row>
          <row>
             <entry>
-               <para>`command`</para>
+               <para>'command'</para>
             </entry>
             <entry>
                <para>A command, including the arguments, enclosed in back tics (where a command is anything that can be executed).</para>
@@ -6665,13 +6670,13 @@ a:8,b:4,c:3
 <programlisting>
 A = LOAD 'data';
 
-B = STREAM A THROUGH `stream.pl -n 5`;
+B = STREAM A THROUGH 'stream.pl -n 5';
 </programlisting>
    <para>When used with a cmd_alias, a stream statement could look like this, where cmd is the defined alias.</para>
 <programlisting>
 A = LOAD 'data';
 
-DEFINE cmd `stream.pl –n 5`;
+DEFINE cmd 'stream.pl -n 5';
 
 B = STREAM A THROUGH cmd;
 </programlisting>
@@ -6700,7 +6705,7 @@ B = STREAM A THROUGH cmd;
 <programlisting>
 A = LOAD 'data';
 
-B = STREAM A THROUGH `stream.pl`;
+B = STREAM A THROUGH 'stream.pl';
 </programlisting>
    
    <para>In this example the data is grouped.</para>
@@ -6711,7 +6716,7 @@ B = GROUP A BY $1;
 
 C = FOREACH B FLATTEN(A);
 
-D = STREAM C THROUGH `stream.pl`
+D = STREAM C THROUGH 'stream.pl';
 </programlisting>
    
    <para>In this example the data is grouped and ordered.</para>
@@ -6725,7 +6730,7 @@ C = FOREACH B {
       GENERATE D;
 }
 
-E = STREAM C THROUGH `stream.pl`;
+E = STREAM C THROUGH 'stream.pl';
 </programlisting>
    </section>
    
@@ -6733,7 +6738,7 @@ E = STREAM C THROUGH `stream.pl`;
    <title>Example: Schemas</title>
    <para>In this example a schema is specified as part of the STREAM statement.</para>
 <programlisting>
-X = STREAM A THROUGH `stream.pl` as (f1:int, f2;int, f3:int);
+X = STREAM A THROUGH 'stream.pl' as (f1:int, f2:int, f3:int);
 </programlisting>
    </section>
    
@@ -7240,7 +7245,7 @@ ILLUSTRATE num_user_visits;
    <informaltable frame="all">
       <tgroup cols="1"><tbody><row>
             <entry>
-               <para>DEFINE alias {function | [`command` [input] [output] [ship] [cache]] };</para>
+               <para>DEFINE alias {function | ['command' [input] [output] [ship] [cache]] };</para>
             </entry>
          </row></tbody></tgroup>
    </informaltable></section>
@@ -7386,10 +7391,10 @@ ILLUSTRATE num_user_visits;
 		<listitem>
 			<para>It is safe only to ship files to be executed from the current working directory on the task on the cluster.</para>
 			<programlisting>
-OP = stream IP through `script`;
+OP = stream IP through 'script';
 or
-DEFINE CMD `script` ship('/a/b/script');
-OP = stream IP through CMD`;
+DEFINE CMD 'script' ship('/a/b/script');
+OP = stream IP through CMD;
 </programlisting>
 		</listitem>
 	    <listitem>
@@ -7423,9 +7428,9 @@ OP = stream IP through CMD`;
 		<listitem>
 			<para>If Pig determines that it needs to auto-ship an absolute path it will not ship it at all since there is no way to ship files to the necessary location (lack of permissions and so on). </para>
 			<programlisting>
-OP = stream IP through `/a/b/c/script`;
+OP = stream IP through '/a/b/c/script';
 or 
-OP = stream IP through `perl /a/b/c/script.pl`;
+OP = stream IP through 'perl /a/b/c/script.pl';
 </programlisting>
 		</listitem>
 	    <listitem>
@@ -7446,19 +7451,19 @@ OP = stream IP through `perl /a/b/c/scri
  <title>Example: Input/Output</title>
  <para>In this example PigStorage is the default serialization/deserialization function. The tuples from relation A are converted to tab-delimited lines that are passed to the script.</para>
 <programlisting>
-X = STREAM A THROUGH `stream.pl`;
+X = STREAM A THROUGH 'stream.pl';
 </programlisting>
    
    <para>In this example PigStorage is used as the serialization/deserialization function, but a comma is used as the delimiter.</para>
 <programlisting>
-DEFINE Y `stream.pl` INPUT(stdin USING PigStorage(',')) OUTPUT (stdout USING PigStorage(','));
+DEFINE Y 'stream.pl' INPUT(stdin USING PigStorage(',')) OUTPUT (stdout USING PigStorage(','));
 
 X = STREAM A THROUGH Y;
 </programlisting>
    
    <para>In this example user-defined serialization/deserialization functions are used with the script.</para>
 <programlisting>
-DEFINE Y `stream.pl` INPUT(stdin USING MySerializer) OUTPUT (stdout USING MyDeserializer);
+DEFINE Y 'stream.pl' INPUT(stdin USING MySerializer) OUTPUT (stdout USING MyDeserializer);
 
 X = STREAM A THROUGH Y;
 </programlisting>
@@ -7468,14 +7473,14 @@ X = STREAM A THROUGH Y;
    <title>Example: Ship/Cache</title>
    <para>In this example ship is used to send the script to the cluster compute nodes.</para>
 <programlisting>
-DEFINE Y `stream.pl` SHIP('/work/stream.pl');
+DEFINE Y 'stream.pl' SHIP('/work/stream.pl');
 
 X = STREAM A THROUGH Y;
 </programlisting>
    
    <para>In this example cache is used to specify a file located on the cluster compute nodes.</para>
 <programlisting>
-DEFINE Y `stream.pl data.gz` SHIP('/work/stream.pl') CACHE('/input/data.gz#data.gz');
+DEFINE Y 'stream.pl data.gz' SHIP('/work/stream.pl') CACHE('/input/data.gz#data.gz');
 
 X = STREAM A THROUGH Y;
 </programlisting>
@@ -7485,7 +7490,7 @@ X = STREAM A THROUGH Y;
    <title>Example: Logging</title>
    <para>In this example the streaming stderr is stored in the _logs/&lt;dir&gt; directory of the job's output directory. Because the job can have multiple streaming applications associated with it, you need to ensure that different directory names are used to avoid conflicts. Pig stores up to 100 tasks per streaming job.</para>
 <programlisting>
-DEFINE Y `stream.pl` stderr('&lt;dir&gt;' limit 100);
+DEFINE Y 'stream.pl' stderr('&lt;dir&gt;' limit 100);
 
 X = STREAM A THROUGH Y;
 </programlisting>
@@ -7506,9 +7511,9 @@ B = FOREACH A GENERATE myFunc($0);
 <programlisting>
 A = LOAD 'data';
 
-DEFINE cmd `stream_cmd –input file.dat`
+DEFINE cmd 'stream_cmd -input file.dat';
 
-B = STREAM A through cmd.
+B = STREAM A through cmd;
 </programlisting>
 </section>
 </section>
@@ -7542,7 +7547,8 @@ B = STREAM A through cmd.
    
    <section>
    <title>Usage</title>
-   <para>Use the REGISTER statement to specify the path of a Java JAR file containing UDFs.</para>
+   <para>Use the REGISTER statement inside a Pig script to specify the path of a Java JAR file containing UDFs. </para>
+   <para>You can register additional files (to use with your Pig script) via the command line using the -Dpig.additional.jars option.</para>
    <para>For more information about UDFs, see the User Defined Function Guide. Note that Pig currently only supports functions written in Java.</para></section>
    
    <section>
@@ -7552,14 +7558,17 @@ B = STREAM A through cmd.
 /src $ java -jar pig.jar –
 
 REGISTER /src/myfunc.jar;
-
 A = LOAD 'students';
-
 B = FOREACH A GENERATE myfunc.MyEvalFunc($0);
 </programlisting>
-   <para>
+   
+   <para>In this example additional jar files are registered via the command line.</para>
+<programlisting>
+pig -Dpig.additional.jars=my.jar:your.jar script.pig
+</programlisting>
+
 
-   </para></section></section>
+   <para></para></section></section>
    </section>
    
    <!-- BUILT-IN FUNCTIONS --> 
@@ -8575,90 +8584,11 @@ DUMP X;
    
    <section>
    <title>Load/Store Functions</title>
-   <para>Load/Store functions determine how data goes into Pig and comes out of Pig. In addition to the Pig built-in load/store functions, you can also write your functions (see the User-Defined Function Manual).</para>
-   
-   <section>
-   <title>BinarySerializer</title>
-   <para>Converts a file to a byte stream.</para>
-   
-   <section>
-   <title>Syntax</title>
-   <informaltable frame="all">
-      <tgroup cols="1"><tbody><row>
-            <entry>
-               <para>BinarySerializer()        </para>
-            </entry>
-         </row></tbody></tgroup>
-   </informaltable></section>
-   
-   <section>
-   <title>Terms</title>
-   <informaltable frame="all">
-      <tgroup cols="2"><tbody><row>
-            <entry>
-               <para>none</para>
-            </entry>
-            <entry>
-               <para>no parameters</para>
-            </entry>
-         </row></tbody></tgroup>
-   </informaltable></section>
-   
-   <section>
-   <title>Usage</title>
-   <para>Use the BinarySerializer with the DEFINE operator to convert a file to a byte stream. No Formatting or interpretation takes place.</para></section>
+   <para>Load/Store functions determine how data goes into Pig and comes out of Pig. 
+   Pig provides a set of built-in load/store functions, described in the sections below. 
+   You can also write your own load/store functions  (see the <ulink url="udf.html#Load%2FStore+Functions">Pig UDF Manual</ulink>).</para>
    
-   <section>
-   <title>Example</title>
-   <para>In this example the BinarySerializer and BinaryDeserializer are use to convert data to and from streaming format.</para>
-<programlisting>
-DEFINE Y `stream.pl` INPUT(stdin USING BinarySerializer()) OUTPUT (stdout USING BinaryDeserializer());
 
-X = STREAM A THROUGH Y;
-</programlisting>
-   </section></section>
-   
-   <section>
-   <title>BinaryDeserializer</title>
-   <para>Converts a byte stream into a file.</para>
-   
-   <section>
-   <title>Syntax</title>
-   <informaltable frame="all">
-      <tgroup cols="1"><tbody><row>
-            <entry>
-               <para>BinarySerializer()        </para>
-            </entry>
-         </row></tbody></tgroup>
-   </informaltable></section>
-   
-   <section>
-   <title>Terms</title>
-   <informaltable frame="all">
-      <tgroup cols="2"><tbody><row>
-            <entry>
-               <para>none</para>
-            </entry>
-            <entry>
-               <para>no parameters</para>
-            </entry>
-         </row></tbody></tgroup>
-   </informaltable></section>
-   
-   <section>
-   <title>Usage</title>
-   <para>Use the BinaryDeserializer with the DEFINE operator to convert a byte stream into a file. No Formatting or interpretation takes place.</para></section>
-   
-   <section>
-   <title>Example</title>
-   <para>In this example the BinarySerializer and BinaryDeserializer are use to convert data to and from streaming format.</para>
-<programlisting>
-DEFINE Y `stream.pl` INPUT(stdin USING BinarySerializer()) OUTPUT (stdout USING BinaryDeserializer());
-
-X = STREAM A THROUGH Y;
-</programlisting>
-   </section></section>
-   
    <section>
    <title>BinStorage</title>
    <para>Loads and stores data in machine-readable format.</para>
@@ -8689,6 +8619,7 @@ X = STREAM A THROUGH Y;
    <section>
    <title>Usage</title>
    <para>BinStorage works with data that is represented on disk in machine-readable format.</para>
+   <para>BinStorage does not support compression.</para>
    <para>BinStorage is used internally by Pig to store the temporary data that is created between multiple map/reduce jobs.</para></section>
    
    <section>
@@ -8732,15 +8663,19 @@ STORE X into 'output' USING BinStorage()
    
    <section>
    <title>Usage</title>
-   <para>PigStorage is the default function for the LOAD and STORE operators. PigStorage works with structured text files (in human-readable UTF-8 format) and bzip compressed text files. PigStorage also works with simple and complex data types.</para>
+   <para>PigStorage is the default function for the LOAD and STORE operators and works with both simple and complex data types. </para>
+   
+   <para>PigStorage supports structured text files (in human-readable UTF-8 format).</para>
+   
+   <para>PigStorage also supports gzip (.gz) and bzip (.bz or .bz2) compressed files. PigStorage correctly reads compressed files as long as they are NOT CONCATENATED files generated in this manner: cat *.gz > text/concat.gz OR cat *.bz > text/concat.bz (OR cat *.bz2 > text/concat.bz2). If you use concatenated gzip or bzip files with your Pig jobs, you will not see a failure but the results will be INCORRECT.</para>
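+   <para>For example, a compressed input file can be loaded directly. This sketch assumes a single, non-concatenated bzip file named mylogs.bz2.</para>
+<programlisting>
+A = LOAD 'mylogs.bz2' USING PigStorage() AS (t, u, v);
+</programlisting>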
 
   <para>Load statements – PigStorage expects data to be formatted using field delimiters, either the tab character  ('\t') or other specified character.</para>
 
    <para>Store statements – PigStorage outputs data using field deliminters, either the tab character  ('\t') or other specified character, and the line feed record delimiter ('\n').  </para>
 
-   <para>Field Deliminters – For load and store statements the default field delimiter is the tab character ('\t'). You can use other characters as field delimiters, but separators such as ^A or Ctrl-A should be represented in Unicode (\u0001) using UTF-16 encoding (see Wikipedia <ulink url="http://en.wikipedia.org/wiki/ASCII">ASCII</ulink>, <ulink url="http://en.wikipedia.org/wiki/Unicode">Unicode</ulink>, and <ulink url="http://en.wikipedia.org/wiki/UTF-16">UTF-16</ulink>).</para>
+   <para>Field Delimiters – For load and store statements the default field delimiter is the tab character ('\t'). You can use other characters as field delimiters, but separators such as ^A or Ctrl-A should be represented in Unicode (\u0001) using UTF-16 encoding (see Wikipedia <ulink url="http://en.wikipedia.org/wiki/ASCII">ASCII</ulink>, <ulink url="http://en.wikipedia.org/wiki/Unicode">Unicode</ulink>, and <ulink url="http://en.wikipedia.org/wiki/UTF-16">UTF-16</ulink>).</para>
    
-   <para>Record Deliminters – For load statements Pig interprets the line feed ( '\n' ), carriage return ( '\r' or CTRL-M) and combined CR + LF ( '\r\n' ) characters as record delimiters (do not use these characters as field delimiters). For store statements Pig uses the line feed ('\n') character as the record delimiter. For load and store statements, if the input file is a bzip file (ending in .bz or .bz2), Pig uses the line feed ('\n') character as the record delimiter.</para>
+   <para>Record Delimiters – For load statements Pig interprets the line feed ( '\n' ), carriage return ( '\r' or CTRL-M) and combined CR + LF ( '\r\n' ) characters as record delimiters (do not use these characters as field delimiters). For store statements Pig uses the line feed ('\n') character as the record delimiter.</para>
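+   <para>For example, the following sketch (the file names are illustrative) loads comma-delimited data and stores it using the Ctrl-A character, written as \u0001, as the field delimiter.</para>
+<programlisting>
+A = LOAD 'mydata.csv' USING PigStorage(',') AS (f1, f2, f3);
+STORE A INTO 'myoutput' USING PigStorage('\u0001');
+</programlisting>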
    </section>
    
    <section>
@@ -8827,7 +8762,10 @@ STORE X INTO 'output' USING PigDump();
    
    <section>
    <title>Usage</title>
-   <para>TextLoader works with unstructured data in UTF8 format. Each resulting tuple contains a single field with one line of input text. TextLoader cannot be used to store data.</para></section>
+   <para>TextLoader works with unstructured data in UTF8 format. Each resulting tuple contains a single field with one line of input text. </para>
+   <para>Currently, TextLoader support for compression is limited.</para>  
+   <para>TextLoader cannot be used to store data.</para>
+   </section>
    
    <section>
    <title>Example</title>
@@ -9788,7 +9726,8 @@ grunt&gt; run –param out=myoutput m
                <para>a whole number </para>
             </entry>
             <entry>
-               <para>Sets the number of reducers for all MapReduce jobs generated by Pig.</para>
+               <para>Sets the number of reducers for all MapReduce jobs generated by Pig 
+              (see  <ulink url="cookbook.html#Use+the+PARALLEL+Clause">Use the PARALLEL Clause</ulink>).</para>
             </entry>
          </row>
          <row>
@@ -9847,11 +9786,23 @@ grunt&gt; run –param out=myoutput m
    <title>Example</title>
    <para>In this example debug is set on, the job is assigned a name, and the number of reducers is set to 100.</para>
 <programlisting>
-grunt&gt; set debug on
+grunt&gt; set debug 'on'
 grunt&gt; set job.name 'my job'
 grunt&gt; set default_parallel 100
 </programlisting>
-   </section></section>
+
+
+<para>In this example default_parallel is set in the Pig script; all MapReduce jobs that get launched will use 20 reducers.</para>
+<programlisting>
+SET DEFAULT_PARALLEL 20;
+A = LOAD 'myfile.txt' USING PigStorage() AS (t, u, v);
+B = GROUP A BY t;
+C = FOREACH B GENERATE group, COUNT(A.t) as mycount;
+D = ORDER C BY mycount;
+STORE D INTO 'mysortedcount' USING PigStorage();
+</programlisting>
+
+</section></section>
    
    
    </section>

Modified: hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/udf.xml
URL: http://svn.apache.org/viewvc/hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/udf.xml?rev=926752&r1=926751&r2=926752&view=diff
==============================================================================
--- hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/udf.xml (original)
+++ hadoop/pig/trunk/src/docs/src/documentation/content/xdocs/udf.xml Tue Mar 23 19:47:24 2010
@@ -734,146 +734,395 @@ pig -cp sds.jar -Dudf.import.list=com.ya
 
 </section>
 
+<!-- BEGIN LOAD/STORE FUNCTIONS -->
 <section>
 <title> Load/Store Functions</title>
 
-<p>These user-defined functions control how data goes into Pig and comes out of Pig. Often, the same function handles both input and output but that does not have to be the case. </p>
-
+<p>The load/store user-defined functions control how data goes into Pig and comes out of Pig. Often, the same function handles both input and output but that does not have to be the case. </p>
+<p>
+With Pig 0.7.0, the Pig load/store API moves closer to using Hadoop's InputFormat and OutputFormat classes.
+This enables Pig users/developers to create new LoadFunc and StoreFunc implementations based on existing Hadoop InputFormat and OutputFormat classes with minimal code. The complexity of reading the data and creating a record now lies in the InputFormat; likewise, on the writing end, the complexity of writing lies in the OutputFormat. This enables Pig to easily read/write data in new storage formats as and when a Hadoop InputFormat and OutputFormat is available for them. </p>
+<p>
+<strong>Note:</strong> Both the LoadFunc and StoreFunc implementations should use the Hadoop 0.20 API-based classes (InputFormat/OutputFormat and related classes) under the <strong>new</strong> org.apache.hadoop.mapreduce package instead of the old org.apache.hadoop.mapred package. 
+</p>
 
 <section>
 <title> Load Functions</title>
+<p><a href="http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadFunc.java?view=markup">LoadFunc</a> 
+abstract class has the main methods for loading data and for most use cases it would suffice to extend it. There are 3 other optional interfaces which can be implemented to achieve extended functionality: </p>
+
+<ul>
+<li><a href="http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadMetadata.java?view=markup">LoadMetadata</a> 
+has methods to deal with metadata - most loader implementations don't need to implement this unless they interact with a metadata system. The getSchema() method in this interface provides a way for loader implementations to communicate the schema of the data back to Pig. If a loader implementation returns data composed of fields of real types (rather than DataByteArray fields), it should provide the schema describing the data returned through the getSchema() method. The other methods are concerned with other types of metadata, such as partition keys and statistics. Implementations can return null for these methods if they are not applicable.</li>
+<li><a href="http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadPushDown.java?view=markup">LoadPushDown</a> 
+has methods to push operations from the Pig runtime into loader implementations - currently only projections, i.e., the pushProjection() method is called by Pig to communicate to the loader exactly which fields are required in the Pig script. The loader implementation can choose to honor the request, or respond that it will not honor the request and return all fields in the data. If a loader implementation is able to efficiently return only the required fields, it should implement LoadPushDown to improve query performance. (Irrespective of whether the implementation can or cannot return only the required fields, if the implementation also implements getSchema(), the schema returned in getSchema() should be for the entire tuple of data.) </li>
+<li><a href="http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadCaster.java?view=markup">LoadCaster</a> 
+has methods to convert byte arrays to specific types. A loader implementation should implement this if casts (implicit or explicit) from DataByteArray fields to other types need to be supported. </li>
+</ul>
 
-<p>Every load function needs to implement the <code>LoadFunc</code> interface. An abbreviated version is shown below. The full definition can be seen <a href="http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/LoadFunc.java?view=markup"> here</a>. </p>
+ <p>The LoadFunc abstract class is the main class to extend for implementing a loader. The methods which need to be overridden are explained below:</p>
+ <ul>
+ <li>getInputFormat(): This method will be called by Pig to get the InputFormat used by the loader. The methods in the InputFormat (and underlying RecordReader) will be called by Pig in the same manner (and in the same context) as by Hadoop in a map-reduce Java program. If the InputFormat is a Hadoop packaged one, the implementation should use the new API based one under org.apache.hadoop.mapreduce. If it is a custom InputFormat, it should be implemented using the new API in org.apache.hadoop.mapreduce. If a custom loader using a text-based InputFormat or a file-based InputFormat would like to read files in all subdirectories under a given input directory recursively, then it should use the PigFileInputFormat and PigTextInputFormat classes provided in org.apache.pig.backend.hadoop.executionengine.mapReduceLayer. This works around the current limitation in Hadoop's TextInputFormat and FileInputFormat, which only read one level down from the provided input directory. For example, if the input in the load statement is 'dir1' and there are subdirectories 'dir2' and 'dir2/dir3' underneath dir1, then using Hadoop's TextInputFormat or FileInputFormat only files under 'dir1' can be read; using PigFileInputFormat or PigTextInputFormat (or by extending them), files in all the directories can be read. </li>
+ <li>setLocation() :This method is called by Pig to communicate the load location to the loader. The loader should use this method to communicate the same information to the underlying InputFormat. This method is called multiple times by pig - implementations should bear this in mind and should ensure there are no inconsistent side effects due to the multiple calls. </li>
+ <li>prepareToRead() : Through this method the RecordReader associated with the InputFormat provided by the LoadFunc is passed to the LoadFunc. The RecordReader can then be used by the implementation in getNext() to return a tuple representing a record of data back to pig. </li>
+ <li>getNext(): The meaning of getNext() has not changed; it is called by the Pig runtime to get the next tuple in the data - in this method the implementation should use the underlying RecordReader and construct the tuple to return. </li>
+ </ul>
+
+ <p>The following methods have default implementations in LoadFunc and should be overridden only if needed: </p>
+ <ul>
+ <li>setUdfContextSignature():This method will be called by Pig both in the front end and back end to pass a unique signature to the Loader. The signature can be used to store into the UDFContext any information which the Loader needs to store between various method invocations in the front end and back end. A use case is to store RequiredFieldList passed to it in LoadPushDown.pushProjection(RequiredFieldList) for use in the back end before returning tuples in getNext(). The default implementation in LoadFunc has an empty body. This method will be called before other methods. </li>
+ <li>relativeToAbsolutePath():Pig runtime will call this method to allow the Loader to convert a relative load location to an absolute location. The default implementation provided in LoadFunc handles this for FileSystem locations. If the load source is something else, loader implementation may choose to override this.</li>
+ </ul>
 
+<p><strong>Example Implementation</strong></p>
+<p>
+The loader implementation in the example is a loader for text data with '\n' as the line delimiter and '\t' as the default field delimiter (which can be overridden by passing a different field delimiter in the constructor) - this is similar to the current PigStorage loader in Pig. The implementation uses an existing Hadoop-supported InputFormat - TextInputFormat - as the underlying InputFormat.
+</p>
 <source>
-public interface LoadFunc {
-    public void bindTo(String fileName, BufferedPositionedInputStream is, long offset, long end) throws IOException;
-    public Tuple getNext() throws IOException;
-    // conversion functions
-    public Integer bytesToInteger(byte[] b) throws IOException;
-    public Long bytesToLong(byte[] b) throws IOException;
-    ......
-    public RequiredFieldResponse fieldsToRead(RequiredFieldList requiredFieldList) throws FrontendException;
-    public Schema determineSchema(String fileName, ExecType execType, DataStorage storage) throws IOException;
-</source>
+public class SimpleTextLoader extends LoadFunc {
+    protected RecordReader in = null;
+    private byte fieldDel = '\t';
+    private ArrayList&lt;Object&gt; mProtoTuple = null;
+    private TupleFactory mTupleFactory = TupleFactory.getInstance();
+    private static final int BUFFER_SIZE = 1024;
 
-<p><strong>bindTo</strong></p>
-<p>The <code>bindTo</code> function is called once by each Pig task before it starts processing data. It is intended to connect the function to its input. It provides the following information: </p>
-<ul>
-<li><p> <code>fileName</code> - The name of the file from which the data is read. Not used most of the time </p>
-</li>
-<li><p> <code>is</code> - The input stream from which the data is read. It is already positioned at the place where the function needs to start reading </p>
-</li>
-<li><p> <code>offset</code> - The offset into the stream from which to read. It is equivalent to <code>is.getPosition()</code> and not strictly needed </p>
-</li>
-<li><p> <code>end</code> - The position of the last byte that should be read by the function. </p>
-</li>
-</ul>
+    public SimpleTextLoader() {
+    }
 
-<p>In the Hadoop world, the input data is treated as a continuous stream of bytes. A <code>slicer</code>, discussed in the Advanced Topics section, is used to split the data into chunks with each chunk going to a particular task for processing. This chunk is what <code>bindTo</code> provides to the UDF. Note that unless you use a custom slicer, the default slicer is not aware of tuple boundaries. This means that the chunk you get can start and end in the middle of a particular tuple. One common approach is to skip the first partial tuple and continue past the end position to finish processing a tuple. This is what <code>PigStorage</code> does as the example later in this section shows. </p>
+    /**
+     * Constructs a Pig loader that uses specified character as a field delimiter.
+     *
+     * @param delimiter
+     *            the single byte character that is used to separate fields.
+     *            ("\t" is the default.)
+     */
+    public SimpleTextLoader(String delimiter) {
+        this();
+        if (delimiter.length() == 1) {
+            this.fieldDel = (byte)delimiter.charAt(0);
+        } else if (delimiter.length() &gt; 1 &amp;&amp; delimiter.charAt(0) == '\\') {
+            switch (delimiter.charAt(1)) {
+            case 't':
+                this.fieldDel = (byte)'\t';
+                break;
+
+            case 'x':
+               fieldDel =
+                    Integer.valueOf(delimiter.substring(2), 16).byteValue();
+               break;
+
+            case 'u':
+                this.fieldDel =
+                    Integer.valueOf(delimiter.substring(2)).byteValue();
+                break;
 
-<p><strong>getNext</strong></p>
-<p>The <code>getNext</code> function reads the input stream and constructs the next tuple. It returns <code>null</code> when it is done with processing and throws an <code>IOException</code> if it fails to process an input tuple. </p>
+            default:
+                throw new RuntimeException("Unknown delimiter " + delimiter);
+            }
+        } else {
+            throw new RuntimeException("PigStorage delimeter must be a single character");
+        }
+    }
 
-<p><strong>conversion routines</strong></p>
-<p>Next is a bunch of conversion routines that convert data from <code>bytearray</code> to the requested type. This requires further explanation. By default, we would like the loader to do as little per-tuple processing as possible. This is because many tuples can be thrown out during filtering or joins. Also, many fields might not get used because they get projected out. If the data needs to be converted into another form, we would like this conversion to happen as late as possible. The majority of the loaders should return the data as bytearrays and the Pig will request a conversion from bytearray to the actual type when needed. Let's looks at the example below: </p>
+    @Override
+    public Tuple getNext() throws IOException {
+        try {
+            boolean notDone = in.nextKeyValue();
+            if (!notDone) {
+                return null;
+            }
+            Text value = (Text) in.getCurrentValue();
+            byte[] buf = value.getBytes();
+            int len = value.getLength();
+            int start = 0;
+
+            for (int i = 0; i &lt; len; i++) {
+                if (buf[i] == fieldDel) {
+                    readField(buf, start, i);
+                    start = i + 1;
+                }
+            }
+            // pick up the last field
+            readField(buf, start, len);
 
-<source>
-A = load 'student_data' using PigStorage() as (name: chararray, age: int, gpa: float);
-B = filter A by age >25;
-C = foreach B generate name;
-dump C;
-</source>
+            Tuple t =  mTupleFactory.newTupleNoCopy(mProtoTuple);
+            mProtoTuple = null;
+            return t;
+        } catch (InterruptedException e) {
+            int errCode = 6018;
+            String errMsg = "Error while reading input";
+            throw new ExecException(errMsg, errCode,
+                    PigException.REMOTE_ENVIRONMENT, e);
+        }
 
-<p>In this query, only <code>age</code> needs to be converted to its actual type (=int=) right away. <code>name</code> only needs to be converted in the next step of processing where the data is likely to be much smaller. <code>gpa</code> is not used at all and will never need to be converted. </p>
+    }
 
-<p>This is the main reason for Pig to separate the reading of the data (which can happen immediately) from the converting of the data (to the right type, which can happen later). For ASCII data, Pig provides <code>Utf8StorageConverter</code> that your loader class can extend and will take care of all the conversion routines. The code for it can be found <a href="http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/builtin/Utf8StorageConverter.java?view=markup"> here</a>. </p>
+    private void readField(byte[] buf, int start, int end) {
+        if (mProtoTuple == null) {
+            mProtoTuple = new ArrayList&lt;Object&gt;();
+        }
 
-<p>Note that conversion rutines should return null values for data that can't be converted to the specified type. </p>
+        if (start == end) {
+            // NULL value
+            mProtoTuple.add(null);
+        } else {
+            mProtoTuple.add(new DataByteArray(buf, start, end));
+        }
+    }
 
-<p>Loaders that work with binary data like <code>BinStorage</code> are not going to use this model. Instead, they will produce objects of the appropriate types. However, they might still need to define conversion routines in case some of the fields in a tuple are of type <code>bytearray</code>. </p>
+    @Override
+    public InputFormat getInputFormat() {
+        return new TextInputFormat();
+    }
 
-<p><strong>fieldsToRead</strong></p>
-<p>
-The intent of the <code>fieldsToRead</code> function is to reduce the amount of data returned from the loader. Pig will evaluate the script and determine the minimal set of columns needed to execute it. This information will be passed to the <code>fieldsToRead</code> function of the loader in the <code>requiredFieldList</code> parameter. The parameter is of type <code>RequiredFieldList</code> that is defined as part of the
-<a href="http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.6/src/org/apache/pig/LoadFunc.java?view=markup">LoadFunc</a> interface. 
-If the loader chooses not to purge unneeded columns, it can use the following implementation:
-</p>
-<source>
-public LoadFunc.RequiredFieldResponse fieldsToRead(LoadFunc.RequiredFieldList requiredFieldList) throws FrontendException {
-        return new LoadFunc.RequiredFieldResponse(false);
+    @Override
+    public void prepareToRead(RecordReader reader, PigSplit split) {
+        in = reader;
+    }
+
+    @Override
+    public void setLocation(String location, Job job)
+            throws IOException {
+        FileInputFormat.setInputPaths(job, location);
+    }
 }
 </source>
 
+</section>
+<!-- END LOAD FUNCTION -->
+
+<section>
+<title> Store Functions</title>
+
+<p><a href="http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/StoreFunc.java?view=markup">StoreFunc</a> 
+abstract class has the main methods for storing data and for most use cases it should suffice to extend it. There is an optional interface which can be implemented to achieve extended functionality: </p>
+<ul>
+<li><a href="http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/StoreMetadata.java?view=markup">StoreMetadata:</a> 
+This interface has methods to interact with metadata systems to store schema and statistics. This interface is truly optional and should be implemented only if metadata needs to be stored. </li>
+</ul>
+
+<p>The methods which need to be overridden in StoreFunc are explained below: </p>
+<ul>
+<li>getOutputFormat(): This method will be called by Pig to get the OutputFormat used by the storer. The methods in the OutputFormat (and underlying RecordWriter and OutputCommitter) will be called by pig in the same manner (and in the same context) as by Hadoop in a map-reduce java program. If the OutputFormat is a hadoop packaged one, the implementation should use the new API based one under org.apache.hadoop.mapreduce. If it is a custom OutputFormat, it should be implemented using the new API under org.apache.hadoop.mapreduce. The checkOutputSpecs() method of the OutputFormat will be called by pig to check the output location up-front. This method will also be called as part of the Hadoop call sequence when the job is launched. So implementations should ensure that this method can be called multiple times without inconsistent side effects. </li>
+<li>setStoreLocation(): This method is called by Pig to communicate the store location to the storer. The storer should use this method to communicate the same information to the underlying OutputFormat. This method is called multiple times by Pig - implementations should bear this in mind and ensure there are no inconsistent side effects due to the multiple calls. </li>
+<li>prepareToWrite(): In the new API, writing of the data is through the OutputFormat provided by the StoreFunc. In prepareToWrite() the RecordWriter associated with the OutputFormat provided by the StoreFunc is passed to the StoreFunc. The RecordWriter can then be used by the implementation in putNext() to write a tuple representing a record of data in a manner expected by the RecordWriter. </li>
+<li>putNext(): The meaning of putNext() has not changed and is called by the Pig runtime to write the next tuple of data - in the new API, this is the method wherein the implementation will use the underlying RecordWriter to write the Tuple out.</li>
+</ul>
+
+<p>The following methods have default implementations in StoreFunc and should be overridden only if necessary: </p>
+<ul>
+<li>setStoreFuncUDFContextSignature(): This method will be called by Pig both in the front end and back end to pass a unique signature to the Storer. The signature can be used to store into the UDFContext any information which the Storer needs to store between various method invocations in the front end and back end. The default implementation in StoreFunc has an empty body. This method will be called before other methods. 
+</li>
+<li>relToAbsPathForStoreLocation(): Pig runtime will call this method to allow the Storer to convert a relative store location to an absolute location. An implementation is provided in StoreFunc which handles this for FileSystem based locations. </li>
+<li>checkSchema(): A Store function should implement this function to check that a given schema describing the data to be written is acceptable to it. The default implementation in StoreFunc has an empty body. This method will be called before any calls to setStoreLocation(). </li>
+</ul>
+
+<p><strong>Example Implementation</strong></p>
 <p>
-This tells Pig that it should expect the entire column set from the loader. We expect that most loaders will stick to this implementation. In our tests of PigStorage, we saw about 5% improvement when selecting 5 columns out of 40. The loaders that should take advantage of this functionality are the ones, like Zebra, that can pass this information directly to the storage layer. For an example of <code>fieldsToRead</code> see the implementation in <a href="http://svn.apache.org/viewvc/hadoop/pig/branches/branch-0.6/src/org/apache/pig/builtin/PigStorage.java?view=markup"> PigStorage</a>.
+The storer implementation in the example is a storer for text data with '\n' as the line delimiter and '\t' as the default field delimiter (which can be overridden by passing a different field delimiter in the constructor) - this is similar to the current PigStorage storer in Pig. The implementation uses an existing Hadoop-supported OutputFormat - TextOutputFormat - as the underlying OutputFormat. 
 </p>
 
-<p><strong>determineSchema</strong></p>
-<p>The <code>determineSchema</code> function must be implemented by loaders that return real data types rather than <code>bytearray</code> fields. Other loaders should just return <code>null</code>. The idea here is that Pig needs to know the actual types it will be getting; Pig will call <code>determineSchema</code> on the client side to get this information. The function is provided as a way to sample the data to determine its schema.  </p>
+<source>
+public class SimpleTextStorer extends StoreFunc {
+    protected RecordWriter writer = null;
 
-<p>Here is the example of the function implemented by =BinStorage=: </p>
+    private byte fieldDel = '\t';
+    private static final int BUFFER_SIZE = 1024;
+    private static final String UTF8 = "UTF-8";
+    public SimpleTextStorer() {
+    }
 
-<source>
-public Schema determineSchema(String fileName, ExecType execType, DataStorage storage) throws IOException {
-    InputStream is = FileLocalizer.open(fileName, execType, storage);
-    bindTo(fileName, new BufferedPositionedInputStream(is), 0, Long.MAX_VALUE);
-        // get the first record from the input file and figure out the schema 
-        Tuple t = getNext();
-        if(t == null) return null;
-        int numFields = t.size();
-        Schema s = new Schema();
-        for (int i = 0; i &lt; numFields; i++) {
+    public SimpleTextStorer(String delimiter) {
+        this();
+        if (delimiter.length() == 1) {
+            this.fieldDel = (byte)delimiter.charAt(0);
+        } else if (delimiter.length() > 1 &amp;&amp; delimiter.charAt(0) == '\\') {
+            switch (delimiter.charAt(1)) {
+            case 't':
+                this.fieldDel = (byte)'\t';
+                break;
+
+            case 'x':
+               fieldDel =
+                    Integer.valueOf(delimiter.substring(2), 16).byteValue();
+               break;
+            case 'u':
+                this.fieldDel =
+                    Integer.valueOf(delimiter.substring(2)).byteValue();
+                break;
+
+            default:
+                throw new RuntimeException("Unknown delimiter " + delimiter);
+            }
+        } else {
+            throw new RuntimeException("PigStorage delimeter must be a single character");
+        }
+    }
+
+    ByteArrayOutputStream mOut = new ByteArrayOutputStream(BUFFER_SIZE);
+
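+    // putNext() is called once per output tuple; the fields are buffered into
+    // mOut and then handed to the RecordWriter as a single Text line.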
+    @Override
+    public void putNext(Tuple f) throws IOException {
+        int sz = f.size();
+        for (int i = 0; i &lt; sz; i++) {
+            Object field;
             try {
-                s.add(DataType.determineFieldSchema(t.get(i)));
-            } catch (Exception e) {
-                throw WrappedIOException.wrap(e);
+                field = f.get(i);
+            } catch (ExecException ee) {
+                throw ee;
+            }
+
+            putField(field);
+
+            if (i != sz - 1) {
+                mOut.write(fieldDel);
             }
         }
-        return s;
+        Text text = new Text(mOut.toByteArray());
+        try {
+            writer.write(null, text);
+            mOut.reset();
+        } catch (InterruptedException e) {
+            throw new IOException(e);
+        }
     }
-</source>
 
-<p>Note that this approach assumes that the data has a uniform schema. The function needs to make sure that the data it produces conforms to the schema returned by <code>determineSchema</code>, otherwise the processing will fail. This means producing the right number of fields in the tuple (dropping fields or emitting null values if needed) and producing fields of the right type (again emitting null values as needed). </p>
-<p>For complete examples, see <a  href="http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/builtin/BinStorage.java?view=markup">BinStorage</a> and <a href="http://svn.apache.org/viewvc/hadoop/pig/trunk/src/org/apache/pig/builtin/PigStorage.java?view=markup"> PigStorage</a>. </p>
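+    // putField() serializes a single field, recursing into maps, tuples, and bags
+    // and writing the delimiters Pig uses for complex types.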
+    @SuppressWarnings("unchecked")
+    private void putField(Object field) throws IOException {
+        //string constants for each delimiter
+        String tupleBeginDelim = "(";
+        String tupleEndDelim = ")";
+        String bagBeginDelim = "{";
+        String bagEndDelim = "}";
+        String mapBeginDelim = "[";
+        String mapEndDelim = "]";
+        String fieldDelim = ",";
+        String mapKeyValueDelim = "#";
+
+        switch (DataType.findType(field)) {
+        case DataType.NULL:
+            break; // just leave it empty
+
+        case DataType.BOOLEAN:
+            mOut.write(((Boolean)field).toString().getBytes());
+            break;
+
+        case DataType.INTEGER:
+            mOut.write(((Integer)field).toString().getBytes());
+            break;
+
+        case DataType.LONG:
+            mOut.write(((Long)field).toString().getBytes());
+            break;
+
+        case DataType.FLOAT:
+            mOut.write(((Float)field).toString().getBytes());
+            break;
+
+        case DataType.DOUBLE:
+            mOut.write(((Double)field).toString().getBytes());
+            break;
+
+        case DataType.BYTEARRAY: {
+            byte[] b = ((DataByteArray)field).get();
+            mOut.write(b, 0, b.length);
+            break;
+        }
+
+        case DataType.CHARARRAY:
+            // convert the chararray to bytes using UTF-8 before writing
+            mOut.write(((String)field).getBytes(UTF8));
+            break;
+
+        case DataType.MAP:
+            boolean mapHasNext = false;
+            Map&lt;String, Object&gt; m = (Map&lt;String, Object&gt;)field;
+            mOut.write(mapBeginDelim.getBytes(UTF8));
+            for(Map.Entry&lt;String, Object&gt; e: m.entrySet()) {
+                if(mapHasNext) {
+                    mOut.write(fieldDelim.getBytes(UTF8));
+                } else {
+                    mapHasNext = true;
+                }
+                putField(e.getKey());
+                mOut.write(mapKeyValueDelim.getBytes(UTF8));
+                putField(e.getValue());
+            }
+            mOut.write(mapEndDelim.getBytes(UTF8));
+            break;
 
-</section>
+        case DataType.TUPLE:
+            boolean tupleHasNext = false;
+            Tuple t = (Tuple)field;
+            mOut.write(tupleBeginDelim.getBytes(UTF8));
+            for(int i = 0; i &lt; t.size(); ++i) {
+                if(tupleHasNext) {
+                    mOut.write(fieldDelim.getBytes(UTF8));
+                } else {
+                    tupleHasNext = true;
+                }
+                try {
+                    putField(t.get(i));
+                } catch (ExecException ee) {
+                    throw ee;
+                }
+            }
+            mOut.write(tupleEndDelim.getBytes(UTF8));
+            break;
 
-<section>
-<title> Store Functions</title>
+        case DataType.BAG:
+            boolean bagHasNext = false;
+            mOut.write(bagBeginDelim.getBytes(UTF8));
+            Iterator&lt;Tuple&gt; tupleIter = ((DataBag)field).iterator();
+            while(tupleIter.hasNext()) {
+                if(bagHasNext) {
+                    mOut.write(fieldDelim.getBytes(UTF8));
+                } else {
+                    bagHasNext = true;
+                }
+                putField((Object)tupleIter.next());
+            }
+            mOut.write(bagEndDelim.getBytes(UTF8));
+            break;
 
-<p>All store functions need to implement the <code>StoreFunc</code> interface: </p>
+        default: {
+            int errCode = 2108;
+            String msg = "Could not determine data type of field: " + field;
+            throw new ExecException(msg, errCode, PigException.BUG);
+        }
 
-<source>
-public interface StoreFunc {
-    public abstract void bindTo(OutputStream os) throws IOException;
-    public abstract void putNext(Tuple f) throws IOException;
-    public abstract void finish() throws IOException;
-}
-</source>
+        }
+    }
 
-<p>The <code>bindTo</code> method is called in the beginning of the processing to connect the store function to the output stream it will write to. The <code>putNext</code> method is called for every tuple to be stored and is responsible for writing the tuple into the output. The <code>finish</code> function is called at the end of the processing to do all needed cleanup like flushing the output stream. </p>
-<p>Here is an example of a simple store function that writes data as a string returned from the <code>toString</code> function. </p>
+    @Override
+    public OutputFormat getOutputFormat() {
+        return new TextOutputFormat&lt;WritableComparable, Text&gt;();
+    }
 
-<source>
-public class StringStore implements StoreFunc {
-    OutputStream os;
-    private byte recordDel = (byte)'\n';
-    public void bindTo(OutputStream os) throws IOException
-    {
-        this.os = os;
-    }
-    public void putNext(Tuple t) throws IOException
-    {
-        os.write((t.toString() + (char)this.recordDel).getBytes("utf8"));
-    }
-    public void finish() throws IOException
-    {
-         os.flush();
+    @Override
+    public void prepareToWrite(RecordWriter writer) {
+        this.writer = writer;
+    }
+
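+    // setStoreLocation() points the OutputFormat at the store location and enables
+    // bzip2 or gzip compression when the location ends in .bz2 or .gz.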
+    @Override
+    public void setStoreLocation(String location, Job job) throws IOException {
+        job.getConfiguration().set("mapred.textoutputformat.separator", "");
+        FileOutputFormat.setOutputPath(job, new Path(location));
+        if (location.endsWith(".bz2")) {
+            FileOutputFormat.setCompressOutput(job, true);
+            FileOutputFormat.setOutputCompressorClass(job,  BZip2Codec.class);
+        }  else if (location.endsWith(".gz")) {
+            FileOutputFormat.setCompressOutput(job, true);
+            FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
+        }
     }
 }
 </source>
-</section></section>
+
+</section>
+<!-- END STORE FUNCTION -->
+</section>
+<!-- END LOAD/STORE FUNCTIONS -->
 
 <section>
 <title> Comparison Functions</title>