Posted to commits@pig.apache.org by da...@apache.org on 2011/09/01 01:04:08 UTC
svn commit: r1163862 - in /pig/trunk: CHANGES.txt src/docs/src/documentation/content/xdocs/perf.xml src/docs/src/documentation/content/xdocs/pig-index.xml src/org/apache/pig/Main.java

Author: daijy
Date: Wed Aug 31 23:04:08 2011
New Revision: 1163862

URL: http://svn.apache.org/viewvc?rev=1163862&view=rev
Log:
PIG-2221: Couldnt find documentation for ColumnMapKeyPrune optimization rule

Modified:
    pig/trunk/CHANGES.txt
    pig/trunk/src/docs/src/documentation/content/xdocs/perf.xml
    pig/trunk/src/docs/src/documentation/content/xdocs/pig-index.xml
    pig/trunk/src/org/apache/pig/Main.java

Modified: pig/trunk/CHANGES.txt
URL: http://svn.apache.org/viewvc/pig/trunk/CHANGES.txt?rev=1163862&r1=1163861&r2=1163862&view=diff
==============================================================================
--- pig/trunk/CHANGES.txt (original)
+++ pig/trunk/CHANGES.txt Wed Aug 31 23:04:08 2011
@@ -233,6 +233,8 @@ IMPROVEMENTS
 
 PIG-2213: Pig 0.9.1 Documentation (chandec via daijy)
 
+PIG-2221: Couldnt find documentation for ColumnMapKeyPrune optimization rule (chandec via daijy)
+
 BUG FIXES
 
 PIG-2231: Limit produce wrong number of records after foreach flatten (daijy)

Modified: pig/trunk/src/docs/src/documentation/content/xdocs/perf.xml
URL: http://svn.apache.org/viewvc/pig/trunk/src/docs/src/documentation/content/xdocs/perf.xml?rev=1163862&r1=1163861&r2=1163862&view=diff
==============================================================================
--- pig/trunk/src/docs/src/documentation/content/xdocs/perf.xml (original)
+++ pig/trunk/src/docs/src/documentation/content/xdocs/perf.xml Wed Aug 31 23:04:08 2011
@@ -430,6 +430,7 @@ STORE Gtab INTO '/user/vxj/finalresult2'
  <!-- OPTIMIZATION RULES -->
 <section id="optimization-rules">
 <title>Optimization Rules</title>
+
 <p>Pig supports various optimization rules. By default optimization, and all optimization rules, are turned on. 
 To turn off optimiztion, use:</p>
 
@@ -440,28 +441,9 @@ pig -optimizer_off [opt_rule | all ]
 <p>Note that some rules are mandatory and cannot be turned off.</p>
 
 <!-- +++++++++++++++++++++++++++++++ -->
-<section id="ImplicitSplitInserter">
-<title>ImplicitSplitInserter</title>
-<p>Status: Mandatory</p>
-<p>
-<a href="basic.html#SPLIT">SPLIT</a> is the only operator that models multiple outputs in Pig. 
-To ease the process of building logical plans, all operators are allowed to have multiple outputs. As part of the 
-optimization, all non-split operators that have multiple outputs are altered to have a SPLIT operator as the output 
-and the outputs of the operator are then made outputs of the SPLIT operator. An example will illustrate the point. 
-Here, a split will be inserted after the LOAD and the split outputs will be connected to the FILTER (b) and the COGROUP (c).
-</p>
-<source>
-A = LOAD 'input';
-B = FILTER A BY $1 == 1;
-C = COGROUP A BY $0, B BY $0;
-</source>
-</section>
-
-<!-- +++++++++++++++++++++++++++++++ -->
-<section id="LogicalExpressionSimplifier">
-<title>LogicalExpressionSimplifier</title>
-<p>This rule contains several types of simplifications.</p>
-
+<section id="FilterLogicExpressionSimplifier">
+<title>FilterLogicExpressionSimplifier</title>
+<p>This rule simplifies the expressions in FILTER statements.</p>
 <source>
 1) Constant pre-calculation 
 
@@ -505,53 +487,45 @@ is simplified to non-filtering 
 </source>
 </section>
 
-
 <!-- +++++++++++++++++++++++++++++++ -->
-<section id="MergeForEach">
-<title>MergeForEach</title>
-<p>The objective of this rule is to merge together two feach statements, if these preconditions are met:</p>
-<ul>
-	<li>The foreach statements are consecutive. </li>
-	<li>The first foreach statement does not contain flatten. </li>
-	<li>The second foreach is not nested. </li>
-</ul>
+<section id="SplitFilter">
+<title>SplitFilter</title>
+<p>Splits filter conditions so that each condition can be pushed up more aggressively.</p>
 <source>
--- Original code: 
-
-A = LOAD 'file.txt' AS (a, b, c); 
-B = FOREACH A GENERATE a+b AS u, c-b AS v; 
-C = FOREACH B GENERATE $0+5, v; 
-
--- Optimized code: 
-
-A = LOAD 'file.txt' AS (a, b, c); 
-C = FOREACH A GENERATE a+b+5, c-b; 
+A = LOAD 'input1' as (a0, a1);
+B = LOAD 'input2' as (b0, b1);
+C = JOIN A by a0, B by b0;
+D = FILTER C BY a1&gt;0 and b1&gt;0;
+</source>
+<p>Here D will be split into:</p>
+<source>
+X = FILTER C BY a1&gt;0;
+D = FILTER X BY b1&gt;0;
 </source>
+<p>So "a1&gt;0" and "b1&gt;0" can be pushed up individually.</p>
 </section>
 
 <!-- +++++++++++++++++++++++++++++++ -->
-<section id="OpLimitOptimizer">
-<title>OpLimitOptimizer</title>
-<p>
-The objective of this rule is to push the <a href="basic.html#LIMIT">LIMIT</a> operator up the data flow graph 
-(or down the tree for database folks). In addition, for top-k (ORDER BY followed by a LIMIT) the LIMIT is pushed into the ORDER BY.
-</p>
+<section id="PushUpFilter">
+<title>PushUpFilter</title>
+<p>The objective of this rule is to push the FILTER operators up the data flow graph. As a result, the number of records that flow through the pipeline is reduced. </p>
 <source>
 A = LOAD 'input';
-B = ORDER A BY $0;
-C = LIMIT B 10;
+B = GROUP A BY $0;
+C = FILTER B BY $0 &lt; 10;
 </source>
 </section>
 
 <!-- +++++++++++++++++++++++++++++++ -->
-<section id="PushDownExplodes">
-<title>PushDownExplodes</title>
-<p>
-The objective of this rule is to reduce the number of records that flow through the pipeline by moving 
-<a href="basic.html#FOREACH">FOREACH</a> operators with a 
-<a href="basic.html#Flatten">FLATTEN</a> down the data flow graph. 
-In the example shown below, it would be more efficient to move the foreach after the join to reduce the cost of the join operation.
-</p>
+<section id="MergeFilter">
+<title>MergeFilter</title>
+<p>Merges filter conditions after the PushUpFilter rule has run, to decrease the number of filter statements.</p>
+</section>
+
+<!-- +++++++++++++++++++++++++++++++ -->
+<section id="PushDownForEachFlatten">
+<title>PushDownForEachFlatten</title>
+<p>The objective of this rule is to reduce the number of records that flow through the pipeline by moving FOREACH operators with a FLATTEN down the data flow graph. In the example shown below, it would be more efficient to move the foreach after the join to reduce the cost of the join operation.</p>
 <source>
 A = LOAD 'input' AS (a, b, c);
 B = LOAD 'input2' AS (x, y, z);
@@ -561,50 +535,88 @@ D = JOIN C BY $1, B BY $1;
 </section>
 
 <!-- +++++++++++++++++++++++++++++++ -->
-<section id="pushupfilters">
-<title>PushUpFilters</title>
-<p>
-The objective of this rule is to push the <a href="basic.html#FILTER">FILTER</a> operators up the data flow graph. 
-As a result, the number of records that flow through the pipeline is reduced. 
-</p>
+<section id="LimitOptimizer">
+<title>LimitOptimizer</title>
+<p>The objective of this rule is to push the LIMIT operator up the data flow graph (or down the tree for database folks). In addition, for top-k (ORDER BY followed by a LIMIT) the LIMIT is pushed into the ORDER BY.</p>
 <source>
 A = LOAD 'input';
-B = GROUP A BY $0;
-C = FILTER B BY $0 &lt; 10;
+B = ORDER A BY $0;
+C = LIMIT B 10;
 </source>
 </section>
 
+<!-- +++++++++++++++++++++++++++++++ -->
+<section id="ColumnMapKeyPrune">
+<title>ColumnMapKeyPrune</title>
+<p>Prunes the loader so that it loads only the necessary columns. The performance gain is more significant if the corresponding loader supports column pruning and loads only the necessary columns (see LoadPushDown.pushProjection). Otherwise, ColumnMapKeyPrune will insert a ForEach statement right after the loader.</p>
+<source>
+A = load 'input' as (a0, a1, a2);
+B = ORDER A by a0;
+C = FOREACH B GENERATE a0, a1;
+</source>
+<p>a2 is irrelevant in this query, so we can prune it earlier. The loader in this query is PigStorage and it supports column pruning. So we only load a0 and a1 from the input file.</p>
+<p>ColumnMapKeyPrune also prunes unused map keys:</p>
+<source>
+A = load 'input' as (a0:map[]);
+B = FOREACH A generate a0#'key1';
+</source>
+</section>
 
+<!-- +++++++++++++++++++++++++++++++ -->
+<section id="AddForEach">
+<title>AddForEach</title>
+<p>Prunes unused columns as soon as possible. In addition to pruning the loader in ColumnMapKeyPrune, we can prune a column as soon as it is no longer used in the rest of the script.</p>
+<source>
+-- Original code: 
+
+A = LOAD 'input' AS (a0, a1, a2); 
+B = ORDER A BY a0;
+C = FILTER B BY a1&gt;0;
+</source>
+<p>We can only prune a2 from the loader. However, a0 is never used after "ORDER BY", so we can drop a0 right after the "ORDER BY" statement.</p>
+<source>
+-- Optimized code: 
+
+A = LOAD 'input' AS (a0, a1, a2); 
+B = ORDER A BY a0;
+B1 = FOREACH B GENERATE a1;  -- drop a0
+C = FILTER B1 BY a1&gt;0;
+</source>
+</section>
 
 <!-- +++++++++++++++++++++++++++++++ -->
-<section id="StreamOptimizer">
-<title>StreamOptimizer</title>
-<p>
-Optimize when <a href="basic.html#LOAD">LOAD</a> precedes <a href="basic.html#STREAM">STREAM</a> 
-and the loader class is the same as the serializer for the stream. Similarly, optimize when STREAM is followed by 
-<a href="basic.html#STORE">STORE</a> and the deserializer class is same as the storage class. 
-For both of these cases the optimization is to replace the loader/serializer with BinaryStorage which just moves bytes 
-around and to replace the storer/deserializer with BinaryStorage.
-</p>
+<section id="MergeForEach">
+<title>MergeForEach</title>
+<p>The objective of this rule is to merge two foreach statements, if these preconditions are met:</p>
+<ul>
+<li>The foreach statements are consecutive.</li>
+<li>The first foreach statement does not contain flatten.</li>
+<li>The second foreach is not nested.</li>
+</ul>
+<source>
+-- Original code: 
+
+A = LOAD 'file.txt' AS (a, b, c); 
+B = FOREACH A GENERATE a+b AS u, c-b AS v; 
+C = FOREACH B GENERATE $0+5, v; 
+
+-- Optimized code: 
+
+A = LOAD 'file.txt' AS (a, b, c); 
+C = FOREACH A GENERATE a+b+5, c-b;
+</source>
 </section>
 
 <!-- +++++++++++++++++++++++++++++++ -->
-<section id="TypeCastInserter">
-<title>TypeCastInserter</title>
-<p>Status: Mandatory</p>
-<p>
-If you specify a <a href="basic.html#Schemas">schema</a> with the 
-<a href="basic.html#LOAD">LOAD</a> statement, the optimizer will perform a pre-fix projection of the columns 
-and <a href="basic.html#Cast">cast</a> the columns to the appropriate types. An example will illustrate the point. 
-The LOAD statement (a) has a schema associated with it. The optimizer will insert a FOREACH operator that will project columns 0, 1 and 2 
-and also cast them to chararray, int and float respectively. 
-</p>
+<section id="GroupByConstParallelSetter">
+<title>GroupByConstParallelSetter</title>
+<p>Forces parallel "1" for a "group all" statement. Even if we set parallel to N, only 1 reducer will be used in this case, and all other reducers will produce empty results.</p>
 <source>
-A = LOAD 'input' AS (name: chararray, age: int, gpa: float);
-B = FILER A BY $1 == 1;
-C = GROUP A By $0;
+A = LOAD 'input';
+B = GROUP A all PARALLEL 10;
 </source>
 </section>
+
 </section>
 
   

Modified: pig/trunk/src/docs/src/documentation/content/xdocs/pig-index.xml
URL: http://svn.apache.org/viewvc/pig/trunk/src/docs/src/documentation/content/xdocs/pig-index.xml?rev=1163862&r1=1163861&r2=1163862&view=diff
==============================================================================
--- pig/trunk/src/docs/src/documentation/content/xdocs/pig-index.xml (original)
+++ pig/trunk/src/docs/src/documentation/content/xdocs/pig-index.xml Wed Aug 31 23:04:08 2011
@@ -96,6 +96,8 @@
 
 <p><a href="func.html#acos">ACOS</a> function</p>
 
+<p> <a href="perf.html#AddForEach">AddForEach</a> optimization rule</p>
+
 <p><a href="udf.html#aggregate-functions">aggregate functions</a></p>
 
 <p><a href="udf.html#algebraic-interface">algebraic interface</a></p>
@@ -176,6 +178,8 @@
 
 <p><a href="basic.html#cogroup">COGROUP</a> operator</p>
 
+<p> <a href="perf.html#ColumnMapKeyPrune">ColumnMapKeyPrune</a> optimization rule</p>
+
 <p><a href="perf.html#combiner">combiner</a></p>
 
 <p><a href="start.html#comments">comments</a> (in Pig Scripts)</p>
@@ -335,6 +339,8 @@
 
 <p> <a href="udf.html#filter-functions">filter functions</a></p>
 
+<p> <a href="perf.html#FilterLogicExpressionSimplifier">FilterLogicExpressionSimplifier</a> optimization rule</p>
+
 <p><a href="basic.html#flatten">flatten operator</a></p>
 
 <p><a href="func.html#floor">FLOOR</a> function</p>
@@ -369,6 +375,8 @@
 
 <p><a href="basic.html#group">GROUP</a> operator</p>
 
+<p> <a href="perf.html#GroupByConstParallelSetter">GroupByConstParallelSetter</a> optimization rule</p>
+
 <p><a href="start.html#interactive-mode">grunt shell</a></p>
 
 <!-- ==== H ================================================================== -->
@@ -397,7 +405,6 @@
 
 <p><a href="test.html#illustrate">ILLUSTRATE</a> operator</p>
 
-<p><a href="perf.html#ImplicitSplitInserter">ImplicitSplitInserter</a> optimization rule</p>
 
 <p><a href="cont.html#import-macros">IMPORT (macros)</a> operator</p>
 
@@ -492,6 +499,8 @@
 
 <p><a href="perf.html#limit">LIMIT and optimization</a></p>
 
+<p> <a href="perf.html#LimitOptimizer">LimitOptimizer</a> optimization rule</p>
+
 <p><a href="basic.html#load">LOAD</a> operator</p>
 
 <p><a href="udf.html#LoadCaster">LoadCaster</a> interface</p>
@@ -528,7 +537,6 @@
 
 <p><a href="test.html#logical-plan">logical execution plan</a></p>
 
-<p> <a href="perf.html#LogicalExpressionSimplifier">LogicalExpressionSimplifier</a> optimization rule</p>
 
 <p><a href="func.html#lower">LOWER</a> function</p>
 
@@ -568,7 +576,9 @@
 
 <p><a href="perf.html#memory-management">memory management</a>. <em>See also</em> batch mode</p>
 
-<p><a href="perf.html#MergeForEach">MergeForEach</a> optimization rule</p>
+<p> <a href="perf.html#MergeFilter">MergeFilter</a> optimization rule</p>
+
+<p> <a href="perf.html#MergeForEach">MergeForEach</a> optimization rule</p>
 
 <p><a href="perf.html#merge-joins">merge joins</a></p>
 
@@ -603,18 +613,17 @@
 <p></p>
 <p id="o-index"><strong>O</strong> (<a href="#top">top</a>) ----------------------------------------------</p>
 
-
-<p> <a href="perf.html#OpLimitOptimizer">OpLimitOptimizer</a> optimization rule</p>
-
 <p><a href="perf.html#optimization-rules">optimization rules</a>
-<br></br>&nbsp;&nbsp;&nbsp; <a href="perf.html#ImplicitSplitInserter">ImplicitSplitInserter </a>
-<br></br>&nbsp;&nbsp;&nbsp; <a href="perf.html#LogicalExpressionSimplifier">LogicalExpressionSimplifier </a>
+<br></br>&nbsp;&nbsp;&nbsp; <a href="perf.html#AddForEach">AddForEach </a>
+<br></br>&nbsp;&nbsp;&nbsp; <a href="perf.html#ColumnMapKeyPrune">ColumnMapKeyPrune</a>
+<br></br>&nbsp;&nbsp;&nbsp; <a href="perf.html#FilterLogicExpressionSimplifier">FilterLogicExpressionSimplifier</a>
+<br></br>&nbsp;&nbsp;&nbsp; <a href="perf.html#GroupByConstParallelSetter">GroupByConstParallelSetter</a>
+<br></br>&nbsp;&nbsp;&nbsp; <a href="perf.html#LimitOptimizer">LimitOptimizer</a>
+<br></br>&nbsp;&nbsp;&nbsp; <a href="perf.html#MergeFilter">MergeFilter </a>
 <br></br>&nbsp;&nbsp;&nbsp; <a href="perf.html#MergeForEach">MergeForEach</a>
-<br></br>&nbsp;&nbsp;&nbsp; <a href="perf.html#OpLimitOptimizer">OpLimitOptimizer</a>
-<br></br>&nbsp;&nbsp;&nbsp; <a href="perf.html#PushDownExplodes">PushDownExplodes</a>
-<br></br>&nbsp;&nbsp;&nbsp; <a href="perf.html#PushUpFilters">PushUpFilters </a>
-<br></br>&nbsp;&nbsp;&nbsp; <a href="perf.html#StreamOptimizer">StreamOptimizer</a>
-<br></br>&nbsp;&nbsp;&nbsp; <a href="perf.html#TypeCastInserter">TypeCastInserter</a>
+<br></br>&nbsp;&nbsp;&nbsp; <a href="perf.html#PushDownForEachFlatten">PushDownForEachFlatten</a>
+<br></br>&nbsp;&nbsp;&nbsp; <a href="perf.html#PushUpFilter">PushUpFilter</a>
+<br></br>&nbsp;&nbsp;&nbsp; <a href="perf.html#SplitFilter">SplitFilter</a>
 </p>
 
 <p><a href="basic.html#boolops">OR</a> (Boolean)</p>
@@ -745,11 +754,12 @@
 <br></br>&nbsp;&nbsp;&nbsp; <a href="start.html#pig-properties">specifying Pig properties</a>
 </p>
 
-<p> <a href="perf.html#PushDownExplodes">PushDownExplodes</a> optimization rule</p>
+<p> <a href="perf.html#PushDownForEachFlatten">PushDownForEachFlatten</a> optimization rule</p>
+
 
 <p><a href="udf.html#pushprojection">pushProjection</a> method</p>
 
-<p><a href="perf.html#PushUpFilters">PushUpFilters</a> optimization rule</p>
+<p> <a href="perf.html#PushUpFilter">PushUpFilter</a> optimization rule</p>
 
 <p><a href="udf.html#putNext">putNext</a> method</p>
 
@@ -878,6 +888,8 @@
 
 <p><a href="basic.html#split">SPLIT</a> operator</p>
 
+<p> <a href="perf.html#SplitFilter">SplitFilter</a> optimization rule</p>
+
 <p><a href="perf.html#splits">splits</a> (implicit, explicit)</p>
 
 <p><a href="func.html#sqrt">SQRT</a> function</p>
@@ -911,7 +923,7 @@
 
 <p><a href="basic.html#define-udfs">streaming</a> (DEFINE operator)</p>
 
-<p> <a href="perf.html#StreamOptimizer">StreamOptimizer</a> optimization rule</p>
+
 
 <p><a href="func.html#string-functions">string functions</a></p>
 
@@ -951,7 +963,7 @@
 <br></br>&nbsp;&nbsp;&nbsp; <a href="basic.html#tuple">syntax</a>
 </p>
 
-<p> <a href="perf.html#TypeCastInserter">TypeCastInserter</a> optimiztion rule</p>
+
 
 <p>type conversions. <em>See</em> casting types, types tables</p>
 

Modified: pig/trunk/src/org/apache/pig/Main.java
URL: http://svn.apache.org/viewvc/pig/trunk/src/org/apache/pig/Main.java?rev=1163862&r1=1163861&r2=1163862&view=diff
==============================================================================
--- pig/trunk/src/org/apache/pig/Main.java (original)
+++ pig/trunk/src/org/apache/pig/Main.java Wed Aug 31 23:04:08 2011
@@ -757,15 +757,16 @@ public static void usage()
         System.out.println("    -p, -param - Key value pair of the form param=val");
         System.out.println("    -r, -dryrun - Produces script with substituted parameters. Script is not executed.");
         System.out.println("    -t, -optimizer_off - Turn optimizations off. The following values are supported:");
+        System.out.println("            FilterLogicExpressionSimplifier - Simplify filter expressions");
         System.out.println("            SplitFilter - Split filter conditions");
-        System.out.println("            MergeFilter - Merge filter conditions");
         System.out.println("            PushUpFilter - Filter as early as possible");
+        System.out.println("            MergeFilter - Merge filter conditions");
         System.out.println("            PushDownForeachFlatten - Join or explode as late as possible");
-        System.out.println("            ColumnMapKeyPrune - Remove unused data");
         System.out.println("            LimitOptimizer - Limit as early as possible");
+        System.out.println("            ColumnMapKeyPrune - Remove unused data");
         System.out.println("            AddForEach - Add ForEach to remove unneeded columns");
         System.out.println("            MergeForEach - Merge adjacent ForEach");
-        System.out.println("            FilterLogicExpressionSimplifier - Combine multiple expressions");
+        System.out.println("            GroupByConstParallelSetter - Force parallel 1 for \"group all\" statement");
         System.out.println("            All - Disable all optimizations");
         System.out.println("        All optimizations are enabled by default. Optimization values are case insensitive.");
         System.out.println("    -v, -verbose - Print all error messages to screen");