You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by Alex Bain <am...@gmail.com> on 2014/01/08 02:47:12 UTC
Review Request 16717: PIG-3562 Implement combiner optimizations for DISTINCT
in Tez
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16717/
-----------------------------------------------------------
Review request for pig, Cheolsoo Park, Daniel Dai, Mark Wagner, and Rohini Palaniswamy.
Bugs: PIG-3562
https://issues.apache.org/jira/browse/PIG-3562
Repository: pig-git
Description
-------
Implement DISTINCT combiner optimizations in Tez
1. Use a combiner with normal uses of DISTINCT. In MR Pig, there are some global variables and a special DistinctCombiner class that throws away the duplicate tuples. We could hack this into Pig-on-Tez, but instead I just reused the reduce plan as the combiner plan, which does the same thing (through a POPackage->POProject->POForEach with the setDistinct property set to true).
I'm a little bit concerned that this combiner plan could somehow be slower than the special DistinctCombiner class, but I don't see how.
There is also a special CombinerPackager packager that I did NOT use for this. I think that packager is really intended for use with the algebraic UDF combiner optimizations only.
2. I carefully verified that DISTINCT nested inside a FOREACH code block is optimized by the CombinerOptimizer into an algebraic UDF version of DISTINCT. I added TestTezCompiler and e2e tests for this. Cheolsoo already made all the combiner changes for this to work correctly - I didn't make any code changes here.
Diffs
-----
src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java a7de3a7
test/e2e/pig/tests/tez.conf 71cdcbc
test/org/apache/pig/test/data/GoldenFiles/TEZC13.gld PRE-CREATION
test/org/apache/pig/test/data/GoldenFiles/TEZC5.gld 35d9313
test/org/apache/pig/tez/TestTezCompiler.java 2252531
Diff: https://reviews.apache.org/r/16717/diff/
Testing
-------
Updated golden file for existing TestTezCompiler DISTINCT test to include combiner plan
Added TestTezCompiler test and golden file for DISTINCT algebraic udf combiner
Added e2e test that runs DISTINCT with algebraic udf combiner
I am getting some test-e2e-tez failures in ORDER BY tests, but I am also getting these in a clean Tez branch. My new e2e test passes.
Thanks,
Alex Bain
Re: Review Request 16717: PIG-3562 Implement combiner optimizations for
DISTINCT in Tez
Posted by Cheolsoo Park <pi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16717/#review31403
-----------------------------------------------------------
Ship it!
I will commit it after running tests.
- Cheolsoo Park
On Jan. 8, 2014, 1:47 a.m., Alex Bain wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16717/
> -----------------------------------------------------------
>
> (Updated Jan. 8, 2014, 1:47 a.m.)
>
>
> Review request for pig, Cheolsoo Park, Daniel Dai, Mark Wagner, and Rohini Palaniswamy.
>
>
> Bugs: PIG-3562
> https://issues.apache.org/jira/browse/PIG-3562
>
>
> Repository: pig-git
>
>
> Description
> -------
>
> Implement DISTINCT combiner optimizations in Tez
>
> 1. Use a combiner with normal uses of DISTINCT. In MR Pig, there are some global variables and a special DistinctCombiner class that throws away the duplicate tuples. We could hack this into Pig-on-Tez, but instead I just reused the reduce plan as the combiner plan, which does the same thing (through a POPackage->POProject->POForEach with the setDistinct property set to true).
>
> I'm a little bit concerned that this combiner plan could somehow be slower than the special DistinctCombiner class, but I don't see how.
>
> There is also a special CombinerPackager packager that I did NOT use for this. I think that packager is really intended for use with the algebraic UDF combiner optimizations only.
>
> 2. I carefully verified that DISTINCT nested inside a FOREACH code block is optimized by the CombinerOptimizer into an algebraic UDF version of DISTINCT. I added TestTezCompiler and e2e tests for this. Cheolsoo already made all the combiner changes for this to work correctly - I didn't make any code changes here.
>
>
> Diffs
> -----
>
> src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java a7de3a7
> test/e2e/pig/tests/tez.conf 71cdcbc
> test/org/apache/pig/test/data/GoldenFiles/TEZC13.gld PRE-CREATION
> test/org/apache/pig/test/data/GoldenFiles/TEZC5.gld 35d9313
> test/org/apache/pig/tez/TestTezCompiler.java 2252531
>
> Diff: https://reviews.apache.org/r/16717/diff/
>
>
> Testing
> -------
>
> Updated golden file for existing TestTezCompiler DISTINCT test to include combiner plan
> Added TestTezCompiler test and golden file for DISTINCT algebraic udf combiner
> Added e2e test that runs DISTINCT with algebraic udf combiner
> I am getting some test-e2e-tez failures in ORDER BY tests, but I am also getting these in a clean Tez branch. My new e2e test passes.
>
>
> Thanks,
>
> Alex Bain
>
>