You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by Alex Bain <am...@gmail.com> on 2014/01/08 02:47:12 UTC

Review Request 16717: PIG-3562 Implement combiner optimizations for DISTINCT in Tez

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16717/
-----------------------------------------------------------

Review request for pig, Cheolsoo Park, Daniel Dai, Mark Wagner, and Rohini Palaniswamy.


Bugs: PIG-3562
    https://issues.apache.org/jira/browse/PIG-3562


Repository: pig-git


Description
-------

Implement DISTINCT combiner optimizations in Tez

1. Use a combiner with normal uses of DISTINCT. In MR Pig, there are some global variables and a special DistinctCombiner class that throws away the duplicate tuples. We could hack this into Pig-on-Tez, but instead I just reused the reduce plan as the combiner plan, which does the same thing (through a POPackage->POProject->POForEach with the setDistinct property set to true).

I'm a little bit concerned that this combiner plan could somehow be slower than the special DistinctCombiner class, but I don't see how.

There is also a special CombinerPackager packager that I did NOT use for this. I think that packager is really intended for use with the algebraic UDF combiner optimizations only.

2. I carefully verified that DISTINCT nested inside a FOREACH code block is optimized by the CombinerOptimizer into an algebraic UDF version of DISTINCT. I added TestTezCompiler and e2e tests for this. Cheolsoo already made all the combiner changes for this to work correctly - I didn't make any code changes here.


Diffs
-----

  src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java a7de3a7 
  test/e2e/pig/tests/tez.conf 71cdcbc 
  test/org/apache/pig/test/data/GoldenFiles/TEZC13.gld PRE-CREATION 
  test/org/apache/pig/test/data/GoldenFiles/TEZC5.gld 35d9313 
  test/org/apache/pig/tez/TestTezCompiler.java 2252531 

Diff: https://reviews.apache.org/r/16717/diff/


Testing
-------

Updated golden file for existing TestTezCompiler DISTINCT test to include combiner plan
Added TestTezCompiler test and golden file for DISTINCT algebraic udf combiner
Added e2e test that runs DISTINCT with algebraic udf combiner
I am getting some test-e2e-tez failures in ORDER BY tests, but I am also getting these in a clean Tez branch. My new e2e test passes.


Thanks,

Alex Bain


Re: Review Request 16717: PIG-3562 Implement combiner optimizations for DISTINCT in Tez

Posted by Cheolsoo Park <pi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16717/#review31403
-----------------------------------------------------------

Ship it!


I will commit it after running tests.

- Cheolsoo Park


On Jan. 8, 2014, 1:47 a.m., Alex Bain wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16717/
> -----------------------------------------------------------
> 
> (Updated Jan. 8, 2014, 1:47 a.m.)
> 
> 
> Review request for pig, Cheolsoo Park, Daniel Dai, Mark Wagner, and Rohini Palaniswamy.
> 
> 
> Bugs: PIG-3562
>     https://issues.apache.org/jira/browse/PIG-3562
> 
> 
> Repository: pig-git
> 
> 
> Description
> -------
> 
> Implement DISTINCT combiner optimizations in Tez
> 
> 1. Use a combiner with normal uses of DISTINCT. In MR Pig, there are some global variables and a special DistinctCombiner class that throws away the duplicate tuples. We could hack this into Pig-on-Tez, but instead I just reused the reduce plan as the combiner plan, which does the same thing (through a POPackage->POProject->POForEach with the setDistinct property set to true).
> 
> I'm a little bit concerned that this combiner plan could somehow be slower than the special DistinctCombiner class, but I don't see how.
> 
> There is also a special CombinerPackager packager that I did NOT use for this. I think that packager is really intended for use with the algebraic UDF combiner optimizations only.
> 
> 2. I carefully verified that DISTINCT nested inside a FOREACH code block is optimized by the CombinerOptimizer into an algebraic UDF version of DISTINCT. I added TestTezCompiler and e2e tests for this. Cheolsoo already made all the combiner changes for this to work correctly - I didn't make any code changes here.
> 
> 
> Diffs
> -----
> 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java a7de3a7 
>   test/e2e/pig/tests/tez.conf 71cdcbc 
>   test/org/apache/pig/test/data/GoldenFiles/TEZC13.gld PRE-CREATION 
>   test/org/apache/pig/test/data/GoldenFiles/TEZC5.gld 35d9313 
>   test/org/apache/pig/tez/TestTezCompiler.java 2252531 
> 
> Diff: https://reviews.apache.org/r/16717/diff/
> 
> 
> Testing
> -------
> 
> Updated golden file for existing TestTezCompiler DISTINCT test to include combiner plan
> Added TestTezCompiler test and golden file for DISTINCT algebraic udf combiner
> Added e2e test that runs DISTINCT with algebraic udf combiner
> I am getting some test-e2e-tez failures in ORDER BY tests, but I am also getting these in a clean Tez branch. My new e2e test passes.
> 
> 
> Thanks,
> 
> Alex Bain
> 
>