Posted to dev@pig.apache.org by Cheolsoo Park <pi...@gmail.com> on 2014/01/14 20:55:55 UTC

Review Request 16860: PIG-3644: Implement skewed join in Tez

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16860/
-----------------------------------------------------------

Review request for pig, Alex Bain, Daniel Dai, Mark Wagner, and Rohini Palaniswamy.


Bugs: PIG-3644
    https://issues.apache.org/jira/browse/PIG-3644


Repository: pig-git


Description
-------

Skewed join in Tez is implemented in 5 vertices:
Vertex 1) Sample/load skewed table => broadcast sampling input to vertex 2 and shuffle entire input to vertex 3.
Vertex 2) Sampling aggregation vertex => build distribution map and broadcast it to vertices 3 and 4.
Vertex 3) POLocalRearrangeTez for skewed table => partition skewed table using SkewedPartitioner and shuffle it to vertex 5.
Vertex 4) POPartitionRearrangeTez for streaming table => shuffle streaming table to vertex 5.
Vertex 5) Join inputs from vertices 3 and 4.
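
The data flow between the five vertices can be written down as a small edge list. The sketch below is only an illustration of the description above: the class SkewedJoinDagSketch, its Edge type, and the payload strings are hypothetical names made up for this example, and it does not use the Tez or Pig APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of the five-vertex plan; not the Tez or Pig API. */
final class SkewedJoinDagSketch {

    /** The two kinds of data movement named in the description. */
    enum EdgeType { BROADCAST, SHUFFLE }

    /** One directed edge between two numbered vertices. */
    static final class Edge {
        final int from;
        final int to;
        final EdgeType type;
        final String payload;

        Edge(int from, int to, EdgeType type, String payload) {
            this.from = from;
            this.to = to;
            this.type = type;
            this.payload = payload;
        }
    }

    /** Edges as listed in the description, in vertex order. */
    static List<Edge> edges() {
        return Arrays.asList(
            new Edge(1, 2, EdgeType.BROADCAST, "samples of the skewed table"),
            new Edge(1, 3, EdgeType.SHUFFLE, "entire skewed table"),
            new Edge(2, 3, EdgeType.BROADCAST, "distribution map"),
            new Edge(2, 4, EdgeType.BROADCAST, "distribution map"),
            new Edge(3, 5, EdgeType.SHUFFLE, "partitioned skewed table"),
            new Edge(4, 5, EdgeType.SHUFFLE, "streaming table"));
    }

    /** Returns the vertices that feed the given vertex, in edge order. */
    static List<Integer> inputsOf(int vertex) {
        List<Integer> inputs = new ArrayList<>();
        for (Edge e : edges()) {
            if (e.to == vertex) {
                inputs.add(e.from);
            }
        }
        return inputs;
    }

    public static void main(String[] args) {
        for (Edge e : edges()) {
            System.out.println("vertex " + e.from + " -> vertex " + e.to
                + " [" + e.type + "] " + e.payload);
        }
    }
}
```

Calling SkewedJoinDagSketch.inputsOf(5) returns vertices 3 and 4, which matches the join vertex reading both rearranged tables.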

New classes for Tez:
- POPoissonSample) Sampling operator for skewed join.
- POPartitionRearrangeTez) Sub-class of POPartitionRearrange for Tez.
- SkewedPartitionerTez) Sub-class of SkewedPartitioner for Tez.
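
For readers unfamiliar with skewed join, the general idea behind the two rearrange operators can be sketched as follows. This is a simplified illustration and not the code in this patch: SkewedPartitionSketch, Range, and the method names are hypothetical, and the sketch assumes that each hot key is given a contiguous range of reducers by the distribution map, that the skewed table rotates a hot key's rows over that range, and that the streaming table copies matching rows to every reducer in the range.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of skewed-join partitioning; not Pig's classes. */
final class SkewedPartitionSketch {

    /** Inclusive reducer range reserved for one hot key (assumed layout). */
    static final class Range {
        final int first;
        final int last;

        Range(int first, int last) {
            this.first = first;
            this.last = last;
        }
    }

    private final Map<String, Range> distribution;
    private final Map<String, Integer> nextOffset = new HashMap<>();
    private final int numReducers;

    SkewedPartitionSketch(Map<String, Range> distribution, int numReducers) {
        this.distribution = distribution;
        this.numReducers = numReducers;
    }

    /** Skewed-table side: rotate a hot key's rows over its reducer range. */
    int partitionSkewed(String key) {
        Range r = distribution.get(key);
        if (r == null) {
            return hashPartition(key);
        }
        int width = r.last - r.first + 1;
        int offset = nextOffset.getOrDefault(key, 0);
        nextOffset.put(key, (offset + 1) % width);
        return r.first + offset;
    }

    /** Streaming-table side: copy a hot key's row to every reducer in range. */
    List<Integer> partitionStreaming(String key) {
        List<Integer> targets = new ArrayList<>();
        Range r = distribution.get(key);
        if (r == null) {
            targets.add(hashPartition(key));
            return targets;
        }
        for (int p = r.first; p <= r.last; p++) {
            targets.add(p);
        }
        return targets;
    }

    /** Keys missing from the distribution map fall back to hashing. */
    private int hashPartition(String key) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}
```

Because every reducer in a hot key's range receives all streaming rows for that key, each slice of the skewed table can be joined locally without any one reducer holding the whole hot key.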

Note that there are a couple of places I can refactor, for example:
- POPoissonSample and PoissonSampleLoader
- POPartitionRearrangeTez and POLocalRearrangeTez

I will do it in follow-up JIRAs.


Diffs
-----

  src/org/apache/pig/PigConfiguration.java ccf3635 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/partitioners/SkewedPartitioner.java 4790abe 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPoissonSample.java e69de29 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POReservoirSample.java bcb339c 
  src/org/apache/pig/backend/hadoop/executionengine/tez/POLocalRearrangeTez.java 585509d 
  src/org/apache/pig/backend/hadoop/executionengine/tez/POPartitionRearrangeTez.java e69de29 
  src/org/apache/pig/backend/hadoop/executionengine/tez/POShuffleTezLoad.java e9d8e64 
  src/org/apache/pig/backend/hadoop/executionengine/tez/PigProcessor.java e22c319 
  src/org/apache/pig/backend/hadoop/executionengine/tez/SkewedPartitionerTez.java e69de29 
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java d35e87d 
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java 83e5d2c 
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezOperator.java 93e522f 
  src/org/apache/pig/backend/hadoop/executionengine/tez/WeightedRangePartitionerTez.java 7bcc79e 
  src/org/apache/pig/impl/builtin/PartitionSkewedKeys.java 7ce0e82 
  src/org/apache/pig/impl/builtin/PoissonSampleLoader.java 5ce5b9e 
  test/e2e/pig/tests/tez.conf ac254e5 

Diff: https://reviews.apache.org/r/16860/diff/


Testing
-------

- Added e2e test cases for inner and outer skewed joins.
- Unit tests pass.
- e2e tests pass.


Thanks,

Cheolsoo Park


Re: Review Request 16860: PIG-3644: Implement skewed join in Tez

Posted by Daniel Dai <da...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16860/#review31947
-----------------------------------------------------------

Ship it!


Looks right to me. Feel free to commit.

- Daniel Dai




Re: Review Request 16860: PIG-3644: Implement skewed join in Tez

Posted by Cheolsoo Park <pi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16860/
-----------------------------------------------------------

(Updated Jan. 17, 2014, 12:49 a.m.)


Review request for pig, Alex Bain, Daniel Dai, Mark Wagner, and Rohini Palaniswamy.


Changes
-------

Per Rohini's request, I am uploading the final patch that I committed to tez branch.


Bugs: PIG-3644
    https://issues.apache.org/jira/browse/PIG-3644


Repository: pig-git


Description
-------

Skewed join in Tez is implemented in 5 vertices:
Vertex 1) Sample/load skewed table => broadcast sampling input to vertex 2 and shuffle entire input to vertex 3.
Vertex 2) Sampling aggregation vertex => build distribution map and broadcast it to vertices 3 and 4.
Vertex 3) POLocalRearrangeTez for skewed table => partition skewed table using SkewedPartitioner and shuffle it to vertex 5.
Vertex 4) POPartitionRearrangeTez for streaming table => shuffle streaming table to vertex 5.
Vertex 5) Join inputs from vertices 3 and 4.

New classes for Tez:
- POPoissonSample) Sampling operator for skewed join.
- POPartitionRearrangeTez) Sub-class of POPartitionRearrange for Tez.
- SkewedPartitionerTez) Sub-class of SkewedPartitioner for Tez.

Note that there are a couple of places I can refactor, for example:
- POPoissonSample and PoissonSampleLoader
- POPartitionRearrangeTez and POLocalRearrangeTez

I will do it in follow-up JIRAs.


Diffs (updated)
-----

  src/org/apache/pig/PigConfiguration.java ccf3635 
  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/partitioners/SkewedPartitioner.java 4790abe 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPoissonSample.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POReservoirSample.java bcb339c 
  src/org/apache/pig/backend/hadoop/executionengine/tez/POLocalRearrangeTez.java 585509d 
  src/org/apache/pig/backend/hadoop/executionengine/tez/POPartitionRearrangeTez.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/tez/POShuffleTezLoad.java e9d8e64 
  src/org/apache/pig/backend/hadoop/executionengine/tez/PigProcessor.java e22c319 
  src/org/apache/pig/backend/hadoop/executionengine/tez/SkewedPartitionerTez.java PRE-CREATION 
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java 632eae5 
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java 53b255e 
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezOperator.java 93e522f 
  src/org/apache/pig/backend/hadoop/executionengine/tez/WeightedRangePartitionerTez.java 7bcc79e 
  src/org/apache/pig/impl/builtin/PartitionSkewedKeys.java 7ce0e82 
  src/org/apache/pig/impl/builtin/PoissonSampleLoader.java 5ce5b9e 
  test/e2e/pig/tests/tez.conf ac254e5 

Diff: https://reviews.apache.org/r/16860/diff/


Testing
-------

- Added e2e test cases for inner and outer skewed joins.
- Unit tests pass.
- e2e tests pass.


Thanks,

Cheolsoo Park