Posted to dev@pig.apache.org by kelly zhang <li...@intel.com> on 2016/01/14 07:43:26 UTC
Re: Review Request 40743: PIG-4709 Improve performance of GROUPBY operator on Spark
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/40743/#review114423
-----------------------------------------------------------
src/org/apache/pig/backend/hadoop/executionengine/spark/converter/PigSecondaryKeyComparatorSpark.java (line 20)
<https://reviews.apache.org/r/40743/#comment175263>
Please reorder the imports according to PIG-4604.
The correct import order is:
import java.io.Serializable;
import java.util.Comparator;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.DataType;
import org.apache.pig.data.Tuple;
src/org/apache/pig/backend/hadoop/executionengine/spark/converter/ReduceByConverter.java (line 20)
<https://reviews.apache.org/r/40743/#comment175264>
The imports should be reordered according to PIG-4604.
src/org/apache/pig/backend/hadoop/executionengine/spark/operator/POReduceBySpark.java (line 22)
<https://reviews.apache.org/r/40743/#comment175265>
The imports should be reordered according to PIG-4604.
src/org/apache/pig/backend/hadoop/executionengine/spark/optimizer/CombinerOptimizer.java (line 20)
<https://reviews.apache.org/r/40743/#comment175266>
The imports should be reordered according to PIG-4604.
src/org/apache/pig/backend/hadoop/executionengine/spark/optimizer/CombinerOptimizer.java (line 53)
<https://reviews.apache.org/r/40743/#comment175272>
"import java.util.ArrayList" is not needed.
- kelly zhang
On Dec. 18, 2015, 6:47 a.m., Pallavi Rao wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/40743/
> -----------------------------------------------------------
>
> (Updated Dec. 18, 2015, 6:47 a.m.)
>
>
> Review request for pig, Mohit Sabharwal and Xuefu Zhang.
>
>
> Bugs: PIG-4709
> https://issues.apache.org/jira/browse/PIG-4709
>
>
> Repository: pig-git
>
>
> Description
> -------
>
> Currently, the GROUPBY operator of Pig is mapped to Spark's CoGroup. When the grouped data is consumed by subsequent algebraic operations, this is sub-optimal because it generates a lot of shuffle traffic.
> The Spark Plan must be optimized to use reduceBy, where possible, so that a combiner is used.
>
> Introduced a combiner optimizer that does the following:
> // Checks for algebraic operations and, if they exist,
> // replaces global rearrange (cogroup) with reduceBy as follows:
> // Input:
> //   foreach (using algebraicOp)
> //     -> packager
> //       -> globalRearrange
> //         -> localRearrange
> // Output:
> //   foreach (using algebraicOp.Final)
> //     -> reduceBy (uses algebraicOp.Intermediate)
> //       -> foreach (using algebraicOp.Initial)
> //         -> localRearrange
>
>
> Diffs
> -----
>
> src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POForEach.java f8c1658
> src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/PORollupHIIForEach.java aca347d
> src/org/apache/pig/backend/hadoop/executionengine/spark/SparkLauncher.java a4dbadd
> src/org/apache/pig/backend/hadoop/executionengine/spark/converter/GlobalRearrangeConverter.java 5f74992
> src/org/apache/pig/backend/hadoop/executionengine/spark/converter/LocalRearrangeConverter.java 9ce0492
> src/org/apache/pig/backend/hadoop/executionengine/spark/converter/PigSecondaryKeyComparatorSpark.java PRE-CREATION
> src/org/apache/pig/backend/hadoop/executionengine/spark/converter/ReduceByConverter.java PRE-CREATION
> src/org/apache/pig/backend/hadoop/executionengine/spark/operator/POReduceBySpark.java PRE-CREATION
> src/org/apache/pig/backend/hadoop/executionengine/spark/optimizer/CombinerOptimizer.java PRE-CREATION
> src/org/apache/pig/backend/hadoop/executionengine/util/CombinerOptimizerUtil.java 6b66ca1
> src/org/apache/pig/backend/hadoop/executionengine/util/SecondaryKeyOptimizerUtil.java 546d91e
> test/org/apache/pig/test/TestCombiner.java df44293
>
> Diff: https://reviews.apache.org/r/40743/diff/
>
>
> Testing
> -------
>
> The patch unblocked one unit test in TestCombiner. I added another unit test to the same class and also did some manual testing.
>
>
> Thanks,
>
> Pallavi Rao
>
>
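The Initial / Intermediate / Final split in the description quoted above can be illustrated with a small sketch. This is not code from the patch: the class name CombinerSketch and all of its methods are hypothetical, AVG is chosen only as a familiar algebraic operation, and plain Java collections stand in for Spark's reduce-by-key so that the example runs without a cluster. The point it tries to show is that each record is first turned into a small partial aggregate, partials for the same key are merged pairwise (which is what allows combining before the shuffle), and the result is computed only once per key at the end.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not part of the PIG-4709 patch.
public class CombinerSketch {

    // Partial aggregate for AVG: a running sum and a count.
    static final class Partial {
        final double sum;
        final long count;

        Partial(double sum, long count) {
            this.sum = sum;
            this.count = count;
        }
    }

    // "Initial": turn a single input value into a partial aggregate.
    static Partial initial(double value) {
        return new Partial(value, 1L);
    }

    // "Intermediate": merge two partials. The merge is associative and
    // commutative, so it can be applied in any grouping or order.
    static Partial intermediate(Partial a, Partial b) {
        return new Partial(a.sum + b.sum, a.count + b.count);
    }

    // "Final": produce the result from the fully merged partial.
    static double finalStep(Partial p) {
        return p.sum / p.count;
    }

    // Stand-in for a reduce-by-key: keeps one partial per key instead of
    // collecting every value for a key before aggregating.
    static Map<String, Double> averageByKey(String[] keys, double[] values) {
        Map<String, Partial> partials = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            partials.merge(keys[i], initial(values[i]), CombinerSketch::intermediate);
        }
        Map<String, Double> result = new HashMap<>();
        for (Map.Entry<String, Partial> e : partials.entrySet()) {
            result.put(e.getKey(), finalStep(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        String[] keys = {"a", "b", "a"};
        double[] values = {1.0, 10.0, 3.0};
        Map<String, Double> avg = averageByKey(keys, values);
        System.out.println(avg.get("a")); // prints 2.0
        System.out.println(avg.get("b")); // prints 10.0
    }
}
```

In the cogroup-style plan, all values for a key would be gathered before the aggregate is applied; in this sketch only one Partial per key is ever held, which is the property the optimizer relies on to reduce shuffle traffic.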
Re: Review Request 40743: PIG-4709 Improve performance of GROUPBY operator on Spark
Posted by Pallavi Rao <pa...@inmobi.com>.
> On Jan. 14, 2016, 6:43 a.m., kelly zhang wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/spark/optimizer/CombinerOptimizer.java, line 53
> > <https://reviews.apache.org/r/40743/diff/4/?file=1170584#file1170584line53>
> >
> > "import java.util.ArrayList" is not needed.
It is used on line 304.
- Pallavi
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/40743/#review114423
-----------------------------------------------------------