Posted to dev@pig.apache.org by Quang-Nhat HOANG-XUAN <hx...@gmail.com> on 2014/09/11 16:25:22 UTC

Re: Review Request 23804: PIG-4066 An optimization for ROLLUP operation on Pig

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23804/
-----------------------------------------------------------

(Updated Sept. 11, 2014, 2:25 p.m.)


Review request for pig.


Bugs: PIG-4066
    https://issues.apache.org/jira/browse/PIG-4066


Repository: pig


Description
-------

This patch addresses a limitation of the current ROLLUP operator in Pig: most of the work is done in the Map phase of the underlying MapReduce job, which generates all possible intermediate keys that the reducers use to aggregate and produce the ROLLUP output. Based on our previous work, “Duy-Hung Phan, Matteo Dell’Amico, Pietro Michiardi: On the design space of MapReduce ROLLUP aggregates” (http://www.eurecom.fr/en/publication/4212/download/rs-publi-4212_2.pdf), we show that the design space for a ROLLUP implementation allows for a different approach (in-reducer grouping, IRG), in which less work is done in the Map phase and the grouping is done in the Reduce phase. This patch presents the most efficient implementation we designed (Hybrid IRG), which exposes a parameter to balance parallelism (in the reducers) against communication cost.
This patch contains the following features:
1. The new ROLLUP approach: IRG, Hybrid IRG.
2. The PIVOT clause in CUBE operators.
3. Test cases.
The new syntax to use our ROLLUP approach:
alias = CUBE rel BY
{ CUBE col_ref | ROLLUP col_ref [PIVOT pivot_value]} [, { CUBE col_ref | ROLLUP col_ref [PIVOT pivot_value]}
...]
If there are multiple ROLLUP operators in one CUBE clause, only the last ROLLUP operator is executed with our approach (IRG, Hybrid IRG); the preceding ROLLUP operators are executed with the default approach.
We have already run some experiments comparing our ROLLUP implementation with the current ROLLUP. More information can be found here: http://hxquangnhat.github.io/PIG-ROLLUP-H2IRG/
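
As a rough illustration of the difference described above (this is not code from the patch), the following Python sketch contrasts the two plans on in-memory data. It assumes a SUM aggregate, and the function names `rollup_keys`, `map_side_rollup` and `in_reducer_rollup` are hypothetical. The first plan emits one record per ROLLUP level in the "map" step; the second emits one record per input row and derives the coarser levels while scanning the sorted keys, which is the general idea behind in-reducer grouping. The Hybrid IRG variant and its pivot parameter are not modeled here.

```python
from collections import defaultdict


def rollup_keys(dims):
    """All ROLLUP keys for one row: the full key, then each shorter prefix padded with None."""
    n = len(dims)
    return [tuple(dims[:k]) + (None,) * (n - k) for k in range(n, -1, -1)]


def map_side_rollup(rows):
    """Map-heavy plan: emit n+1 keyed records per row, then sum per key."""
    emitted = 0
    totals = defaultdict(int)
    for dims, value in rows:
        for key in rollup_keys(dims):
            emitted += 1  # one shuffled record per ROLLUP level
            totals[key] += value
    return dict(totals), emitted


def in_reducer_rollup(rows):
    """Reduce-heavy plan: emit one record per row, roll partial sums up while scanning sorted keys."""
    if not rows:
        return {}, 0
    n = len(rows[0][0])
    out = {}
    acc = [0] * (n + 1)  # acc[k] holds the running sum for the current prefix of length k
    prev = None
    emitted = 0

    def flush(down_to):
        # Close every level deeper than `down_to`, pushing each sum one level up.
        for k in range(n, down_to, -1):
            out[tuple(prev[:k]) + (None,) * (n - k)] = acc[k]
            acc[k - 1] += acc[k]
            acc[k] = 0

    for dims, value in sorted(rows, key=lambda r: r[0]):  # stands in for the shuffle sort
        emitted += 1  # one shuffled record per input row
        if prev is not None:
            # First position where the key differs from the previous one.
            split = next((i for i in range(n) if dims[i] != prev[i]), n)
            flush(split)
        acc[n] += value
        prev = dims

    flush(0)
    out[(None,) * n] = acc[0]  # grand total
    return out, emitted
```

On three rows with two dimensions, both functions return the same aggregates, but the first counts nine shuffled records and the second three. The price of the second plan in this sketch is that a single scan sees all keys, which is the parallelism versus communication trade-off the description mentions.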


Diffs (updated)
-----

  trunk/src/org/apache/pig/Main.java 1624212 
  trunk/src/org/apache/pig/PigConfiguration.java 1624212 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1624212 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java 1624212 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigGenericMapReduce.java 1624212 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/partitioners/RollupHIIPartitioner.java PRE-CREATION 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1624212 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PhyPlanVisitor.java 1624212 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPackage.java 1624212 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/PORollupHIIForEach.java PRE-CREATION 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/util/PlanHelper.java 1624212 
  trunk/src/org/apache/pig/builtin/RollupDimensions.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/expression/ExpToPhyTranslationVisitor.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/optimizer/LogicalPlanOptimizer.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/relational/LOCogroup.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/relational/LOCube.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/relational/LORollupHIIForEach.java PRE-CREATION 
  trunk/src/org/apache/pig/newplan/logical/relational/LogToPhyTranslationVisitor.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/relational/LogicalPlan.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/relational/LogicalRelationalNodesVisitor.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/rules/OptimizerUtils.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/rules/RollupHIIOptimizer.java PRE-CREATION 
  trunk/src/org/apache/pig/parser/AliasMasker.g 1624212 
  trunk/src/org/apache/pig/parser/AstPrinter.g 1624212 
  trunk/src/org/apache/pig/parser/AstValidator.g 1624212 
  trunk/src/org/apache/pig/parser/LogicalPlanBuilder.java 1624212 
  trunk/src/org/apache/pig/parser/LogicalPlanGenerator.g 1624212 
  trunk/src/org/apache/pig/parser/QueryLexer.g 1624212 
  trunk/src/org/apache/pig/parser/QueryParser.g 1624212 
  trunk/test/org/apache/pig/test/TestCubeOperator.java 1624212 

Diff: https://reviews.apache.org/r/23804/diff/


Testing
-------


Thanks,

Quang-Nhat HOANG-XUAN


Re: Review Request 23804: PIG-4066 An optimization for ROLLUP operation on Pig

Posted by Cheolsoo Park <pi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23804/#review59777
-----------------------------------------------------------


I am also wondering whether RollupHIIOptimizer has any side effect in non-MR mode. This is an MR-specific optimization, but RollupHIIOptimizer runs on the logical plan and sets some boolean flags in POUserFunc. Does this have any side effect in non-MR modes such as Tez?

Is it possible to run this optimizer only in MR mode and disable it by default as it's experimental? What do you think?

Lastly, I think we should add documentation to this section: http://pig.apache.org/docs/r0.13.0/perf.html#optimization-rules

- Cheolsoo Park


On Nov. 4, 2014, 8:56 a.m., Quang-Nhat HOANG-XUAN wrote:


Re: Review Request 23804: PIG-4066 An optimization for ROLLUP operation on Pig

Posted by Cheolsoo Park <pi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23804/#review59776
-----------------------------------------------------------


Let me run unit tests.


trunk/src/org/apache/pig/PigConfiguration.java
<https://reviews.apache.org/r/23804/#comment101079>

    These are internal-facing properties, right? If so, can you move them to PigImplConstants.java?


- Cheolsoo Park


On Nov. 4, 2014, 8:56 a.m., Quang-Nhat HOANG-XUAN wrote:


Re: Review Request 23804: PIG-4066 An optimization for ROLLUP operation on Pig

Posted by Cheolsoo Park <pi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23804/#review64932
-----------------------------------------------------------

Ship it!


Ran unit tests and e2e tests.

- Cheolsoo Park


On Dec. 7, 2014, 12:32 p.m., Quang-Nhat HOANG-XUAN wrote:


Re: Review Request 23804: PIG-4066 An optimization for ROLLUP operation on Pig

Posted by Quang-Nhat HOANG-XUAN <hx...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23804/
-----------------------------------------------------------

(Updated Dec. 7, 2014, 12:32 p.m.)


Review request for pig.


Bugs: PIG-4066
    https://issues.apache.org/jira/browse/PIG-4066


Repository: pig


Diffs (updated)
-----

  trunk/src/org/apache/pig/Main.java 1642549 
  trunk/src/org/apache/pig/PigConstants.java 1642549 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1642549 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java 1642549 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigGenericMapReduce.java 1642549 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/partitioners/RollupHIIPartitioner.java PRE-CREATION 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1642549 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PhyPlanVisitor.java 1642549 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPackage.java 1642549 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/PORollupHIIForEach.java PRE-CREATION 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/util/PlanHelper.java 1642549 
  trunk/src/org/apache/pig/builtin/RollupDimensions.java 1642549 
  trunk/src/org/apache/pig/newplan/logical/expression/ExpToPhyTranslationVisitor.java 1642549 
  trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 1642549 
  trunk/src/org/apache/pig/newplan/logical/optimizer/LogicalPlanOptimizer.java 1642549 
  trunk/src/org/apache/pig/newplan/logical/relational/LOCogroup.java 1642549 
  trunk/src/org/apache/pig/newplan/logical/relational/LOCube.java 1642549 
  trunk/src/org/apache/pig/newplan/logical/relational/LORollupHIIForEach.java PRE-CREATION 
  trunk/src/org/apache/pig/newplan/logical/relational/LogToPhyTranslationVisitor.java 1642549 
  trunk/src/org/apache/pig/newplan/logical/relational/LogicalPlan.java 1642549 
  trunk/src/org/apache/pig/newplan/logical/relational/LogicalRelationalNodesVisitor.java 1642549 
  trunk/src/org/apache/pig/newplan/logical/rules/OptimizerUtils.java 1642549 
  trunk/src/org/apache/pig/newplan/logical/rules/RollupHIIOptimizer.java PRE-CREATION 
  trunk/src/org/apache/pig/parser/AliasMasker.g 1642549 
  trunk/src/org/apache/pig/parser/AstPrinter.g 1642549 
  trunk/src/org/apache/pig/parser/AstValidator.g 1642549 
  trunk/src/org/apache/pig/parser/LogicalPlanBuilder.java 1642549 
  trunk/src/org/apache/pig/parser/LogicalPlanGenerator.g 1642549 
  trunk/src/org/apache/pig/parser/QueryLexer.g 1642549 
  trunk/src/org/apache/pig/parser/QueryParser.g 1642549 
  trunk/test/org/apache/pig/test/TestCubeOperator.java 1642549 

Diff: https://reviews.apache.org/r/23804/diff/


Testing
-------


Thanks,

Quang-Nhat HOANG-XUAN


Re: Review Request 23804: PIG-4066 An optimization for ROLLUP operation on Pig

Posted by Quang-Nhat HOANG-XUAN <hx...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23804/
-----------------------------------------------------------

(Updated Nov. 17, 2014, 10:50 a.m.)


Review request for pig.


Bugs: PIG-4066
    https://issues.apache.org/jira/browse/PIG-4066


Repository: pig


Diffs (updated)
-----

  trunk/src/org/apache/pig/Main.java 1624212 
  trunk/src/org/apache/pig/PigConfiguration.java 1624212 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1624212 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java 1624212 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigGenericMapReduce.java 1624212 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/partitioners/RollupHIIPartitioner.java PRE-CREATION 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1624212 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PhyPlanVisitor.java 1624212 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPackage.java 1624212 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/PORollupHIIForEach.java PRE-CREATION 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/util/PlanHelper.java 1624212 
  trunk/src/org/apache/pig/builtin/RollupDimensions.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/expression/ExpToPhyTranslationVisitor.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/optimizer/LogicalPlanOptimizer.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/relational/LOCogroup.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/relational/LOCube.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/relational/LORollupHIIForEach.java PRE-CREATION 
  trunk/src/org/apache/pig/newplan/logical/relational/LogToPhyTranslationVisitor.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/relational/LogicalPlan.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/relational/LogicalRelationalNodesVisitor.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/rules/OptimizerUtils.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/rules/RollupHIIOptimizer.java PRE-CREATION 
  trunk/src/org/apache/pig/parser/AliasMasker.g 1624212 
  trunk/src/org/apache/pig/parser/AstPrinter.g 1624212 
  trunk/src/org/apache/pig/parser/AstValidator.g 1624212 
  trunk/src/org/apache/pig/parser/LogicalPlanBuilder.java 1624212 
  trunk/src/org/apache/pig/parser/LogicalPlanGenerator.g 1624212 
  trunk/src/org/apache/pig/parser/QueryLexer.g 1624212 
  trunk/src/org/apache/pig/parser/QueryParser.g 1624212 
  trunk/test/org/apache/pig/test/TestCubeOperator.java 1624212 

Diff: https://reviews.apache.org/r/23804/diff/


Testing
-------


Thanks,

Quang-Nhat HOANG-XUAN


Re: Review Request 23804: PIG-4066 An optimization for ROLLUP operation on Pig

Posted by Quang-Nhat HOANG-XUAN <hx...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23804/
-----------------------------------------------------------

(Updated Nov. 4, 2014, 8:56 a.m.)


Review request for pig.


Bugs: PIG-4066
    https://issues.apache.org/jira/browse/PIG-4066


Repository: pig


Description
-------

This patch aims at addressing the current limitation of the ROLLUP operator in PIG: most of the work is done in the Map phase of the underlying MapReduce job to generate all possible intermediate keys that the reducer use to aggregate and produce the ROLLUP output. Based on our previous work: “Duy-Hung Phan, Matteo Dell’Amico, Pietro Michiardi: On the design space of MapReduce ROLLUP aggregates” (http://www.eurecom.fr/en/publication/4212/download/rs-publi-4212_2.pdf), we show that the design space for a ROLLUP implementation allows for a different approach (in-reducer grouping, IRG), in which less work is done in the Map phase and the grouping is done in the Reduce phase. This patch presents the most efficient implementation we designed (Hybrid IRG), which allows defining a parameter to balance between parallelism (in the reducers) and communication cost.
This patch contains the following features:
1. The new ROLLUP approach: IRG, Hybrid IRG.
2. The PIVOT clause in CUBE operators.
3. Test cases.
The new syntax to use our ROLLUP approach:
alias = CUBE rel BY
{ CUBE col_ref | ROLLUP col_ref [PIVOT pivot_value]} [, { CUBE col_ref | ROLLUP col_ref [PIVOT pivot_value]}
...]
If there are multiple ROLLUP operators in one CUBE clause, the last ROLLUP operator will be executed with our approach (IRG, Hybrid IRG), while the preceding ROLLUP operators will be executed with the default approach.
We have already run experiments comparing our ROLLUP implementation with the current one. More information can be found here: http://hxquangnhat.github.io/PIG-ROLLUP-H2IRG/
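
To illustrate the difference between the two approaches described above, here is a minimal single-process sketch in Python. It is not the code from this patch: the function names and the sample records are hypothetical, every key is assumed to have the same number of dimensions, and SUM is used as the aggregate. The first function mimics the default approach (one intermediate record per key prefix is emitted in the "map" step), the second mimics the in-reducer grouping idea (only the full key is emitted, and coarser levels are derived from finer ones). The Hybrid IRG pivot parameter is not modelled here.

```python
from collections import defaultdict


def rollup_map_side(records):
    """Default-style ROLLUP: emit every key prefix per record, then sum per key."""
    emitted = 0
    totals = defaultdict(int)
    for key, value in records:
        # One intermediate record for each prefix, from the full key down to ().
        for level in range(len(key), -1, -1):
            totals[key[:level]] += value
            emitted += 1
    return dict(totals), emitted


def rollup_in_reducer(records):
    """IRG-style ROLLUP: emit only the full key, roll up inside the 'reducer'."""
    emitted = 0
    finest = defaultdict(int)
    for key, value in records:
        finest[key] += value
        emitted += 1  # a single intermediate record per input record
    totals = dict(finest)
    depth = max((len(k) for k in finest), default=0)
    current = finest
    # Derive each coarser level from the level just below it.
    for level in range(depth - 1, -1, -1):
        coarser = defaultdict(int)
        for key, value in current.items():
            coarser[key[:level]] += value
        totals.update(coarser)
        current = coarser
    return totals, emitted


if __name__ == "__main__":
    # Hypothetical (year, month, day) -> count records.
    sample = [((2014, 9, 11), 5), ((2014, 9, 16), 3), ((2014, 11, 4), 2)]
    print(rollup_map_side(sample))
    print(rollup_in_reducer(sample))
```

Both functions return the same aggregates, but the second emits one intermediate record per input record instead of one per key prefix. In this sketch that saving comes at the cost of doing all the rollup work in a single place, which is the parallelism versus communication trade-off the description refers to.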


Diffs (updated)
-----

  trunk/src/org/apache/pig/Main.java 1624212 
  trunk/src/org/apache/pig/PigConfiguration.java 1624212 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1624212 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java 1624212 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigGenericMapReduce.java 1624212 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/partitioners/RollupHIIPartitioner.java PRE-CREATION 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1624212 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PhyPlanVisitor.java 1624212 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPackage.java 1624212 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/PORollupHIIForEach.java PRE-CREATION 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/util/PlanHelper.java 1624212 
  trunk/src/org/apache/pig/builtin/RollupDimensions.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/expression/ExpToPhyTranslationVisitor.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/optimizer/LogicalPlanOptimizer.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/relational/LOCogroup.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/relational/LOCube.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/relational/LORollupHIIForEach.java PRE-CREATION 
  trunk/src/org/apache/pig/newplan/logical/relational/LogToPhyTranslationVisitor.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/relational/LogicalPlan.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/relational/LogicalRelationalNodesVisitor.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/rules/OptimizerUtils.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/rules/RollupHIIOptimizer.java PRE-CREATION 
  trunk/src/org/apache/pig/parser/AliasMasker.g 1624212 
  trunk/src/org/apache/pig/parser/AstPrinter.g 1624212 
  trunk/src/org/apache/pig/parser/AstValidator.g 1624212 
  trunk/src/org/apache/pig/parser/LogicalPlanBuilder.java 1624212 
  trunk/src/org/apache/pig/parser/LogicalPlanGenerator.g 1624212 
  trunk/src/org/apache/pig/parser/QueryLexer.g 1624212 
  trunk/src/org/apache/pig/parser/QueryParser.g 1624212 
  trunk/test/org/apache/pig/test/TestCubeOperator.java 1624212 

Diff: https://reviews.apache.org/r/23804/diff/


Testing
-------


Thanks,

Quang-Nhat HOANG-XUAN


Re: Review Request 23804: PIG-4066 An optimization for ROLLUP operation on Pig

Posted by Quang-Nhat HOANG-XUAN <hx...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/23804/
-----------------------------------------------------------

(Updated Sept. 16, 2014, 1:31 p.m.)


Review request for pig.


Bugs: PIG-4066
    https://issues.apache.org/jira/browse/PIG-4066


Repository: pig


Description
-------

This patch addresses a limitation of the current ROLLUP operator in Pig: most of the work is done in the Map phase of the underlying MapReduce job, which generates all possible intermediate keys that the reducers use to aggregate and produce the ROLLUP output. In our previous work, “Duy-Hung Phan, Matteo Dell’Amico, Pietro Michiardi: On the design space of MapReduce ROLLUP aggregates” (http://www.eurecom.fr/en/publication/4212/download/rs-publi-4212_2.pdf), we show that the design space for a ROLLUP implementation allows for a different approach (in-reducer grouping, IRG), in which less work is done in the Map phase and the grouping is done in the Reduce phase. This patch presents the most efficient implementation we designed (Hybrid IRG), which exposes a parameter to balance parallelism (in the reducers) against communication cost.
This patch contains the following features:
1. The new ROLLUP approach: IRG, Hybrid IRG.
2. The PIVOT clause in CUBE operators.
3. Test cases.
The new syntax to use our ROLLUP approach:
alias = CUBE rel BY
{ CUBE col_ref | ROLLUP col_ref [PIVOT pivot_value]} [, { CUBE col_ref | ROLLUP col_ref [PIVOT pivot_value]}
...]
If there are multiple ROLLUP operators in one CUBE clause, the last ROLLUP operator will be executed with our approach (IRG, Hybrid IRG), while the preceding ROLLUP operators will be executed with the default approach.
We have already run experiments comparing our ROLLUP implementation with the current one. More information can be found here: http://hxquangnhat.github.io/PIG-ROLLUP-H2IRG/


Diffs (updated)
-----

  trunk/src/org/apache/pig/Main.java 1624212 
  trunk/src/org/apache/pig/PigConfiguration.java 1624212 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1624212 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/MRCompiler.java 1624212 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigGenericMapReduce.java 1624212 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/partitioners/RollupHIIPartitioner.java PRE-CREATION 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1624212 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/plans/PhyPlanVisitor.java 1624212 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPackage.java 1624212 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/PORollupHIIForEach.java PRE-CREATION 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/util/PlanHelper.java 1624212 
  trunk/src/org/apache/pig/builtin/RollupDimensions.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/expression/ExpToPhyTranslationVisitor.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/expression/UserFuncExpression.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/optimizer/LogicalPlanOptimizer.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/relational/LOCogroup.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/relational/LOCube.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/relational/LORollupHIIForEach.java PRE-CREATION 
  trunk/src/org/apache/pig/newplan/logical/relational/LogToPhyTranslationVisitor.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/relational/LogicalPlan.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/relational/LogicalRelationalNodesVisitor.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/rules/OptimizerUtils.java 1624212 
  trunk/src/org/apache/pig/newplan/logical/rules/RollupHIIOptimizer.java PRE-CREATION 
  trunk/src/org/apache/pig/parser/AliasMasker.g 1624212 
  trunk/src/org/apache/pig/parser/AstPrinter.g 1624212 
  trunk/src/org/apache/pig/parser/AstValidator.g 1624212 
  trunk/src/org/apache/pig/parser/LogicalPlanBuilder.java 1624212 
  trunk/src/org/apache/pig/parser/LogicalPlanGenerator.g 1624212 
  trunk/src/org/apache/pig/parser/QueryLexer.g 1624212 
  trunk/src/org/apache/pig/parser/QueryParser.g 1624212 
  trunk/test/org/apache/pig/test/TestCubeOperator.java 1624212 

Diff: https://reviews.apache.org/r/23804/diff/


Testing
-------


Thanks,

Quang-Nhat HOANG-XUAN