You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by Cheolsoo Park <pi...@gmail.com> on 2014/01/27 08:12:58 UTC

Review Request 17375: PIG-3719 Fix skewed join e2e tests in tez branch

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/17375/
-----------------------------------------------------------

Review request for pig, Alex Bain, Daniel Dai, Mark Wagner, and Rohini Palaniswamy.


Bugs: PIG-3719
    https://issues.apache.org/jira/browse/PIG-3719


Repository: pig-git


Description
-------

This patch fixes two sets of skewed join e2e tests-
1) tez.conf Join_[7_8].
2) nightly.conf SkewedJoin_[1_10].

There are mainly two changes-
1) In POShuffleTezLoad, we copy PigNullableWritable using PigNullableWritable.newInstance(). But this methods doesn't copy the key of NullablePartitionWritable (subclass of PigNullableWritable for skewed join). This was causing NPE later.
2) In POPartitionRearrangeTez, the init() method builds reduceMap <key, pair<int, int>>. The problem was that the key of this map was always wrapped by tuple even though there is only a single join key, resulting in wrong join output. Now the key is wrapped only if there are more than one join keys.


Diffs
-----

  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/partitioners/SkewedPartitioner.java 57af63f 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartitionRearrange.java 4ef55b8 
  src/org/apache/pig/backend/hadoop/executionengine/tez/POLocalRearrangeTez.java 13d9ec9 
  src/org/apache/pig/backend/hadoop/executionengine/tez/POPartitionRearrangeTez.java c07e69e 
  src/org/apache/pig/backend/hadoop/executionengine/tez/POShuffleTezLoad.java 761ae90 
  src/org/apache/pig/backend/hadoop/executionengine/tez/SkewedPartitionerTez.java 654f350 
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java 197486f 
  src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 637246c 
  test/e2e/pig/tests/nightly.conf e1a55e6 

Diff: https://reviews.apache.org/r/17375/diff/


Testing
-------

* ant clean test-tez passes except TestCustomPartitioner. I confirmed TestCustomPartitioner doesn't pass in current tez branch, so it's not related.
* All nightly.conf SkewedJoin tests pass except #6 self join case.
* All tez.conf e2e tests pass.

Note that for nightly.conf SkewedJoin #9 and #10, I changed the parallel from 8 to 4. The reason is because for right/full outer join, it's important that every task in partition vertex is given at least one row of skewed keys. However, in e2e tests, we arbitrarily re-distributes keys to 8 reducers while keys are not skewed. As a result, some tasks are given no row of these fake "skewed" keys, and that makes the tests fail. So I am reducing the parallel to avoid this situation. For real skewed data, this shouldn't be an issue.

In addition, I'd like to fix the self join case (#6) in a separate jira because it requires quite a few changes in TezCompiler due to POSplit.


Thanks,

Cheolsoo Park


Re: Review Request 17375: PIG-3719 Fix skewed join e2e tests in tez branch

Posted by Daniel Dai <da...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/17375/#review32896
-----------------------------------------------------------

Ship it!


Ship It!

- Daniel Dai


On Jan. 27, 2014, 9:13 p.m., Cheolsoo Park wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/17375/
> -----------------------------------------------------------
> 
> (Updated Jan. 27, 2014, 9:13 p.m.)
> 
> 
> Review request for pig, Alex Bain, Daniel Dai, Mark Wagner, and Rohini Palaniswamy.
> 
> 
> Bugs: PIG-3719
>     https://issues.apache.org/jira/browse/PIG-3719
> 
> 
> Repository: pig-git
> 
> 
> Description
> -------
> 
> This patch fixes two sets of skewed join e2e tests-
> 1) tez.conf Join_[7_8].
> 2) nightly.conf SkewedJoin_[1_10].
> 
> There are mainly two changes-
> 1) In POShuffleTezLoad, we copy PigNullableWritable using PigNullableWritable.newInstance(). But this methods doesn't copy the key of NullablePartitionWritable (subclass of PigNullableWritable for skewed join). This was causing NPE later.
> 2) In POPartitionRearrangeTez, the init() method builds reduceMap <key, pair<int, int>>. The problem was that the key of this map was always wrapped by tuple even though there is only a single join key, resulting in wrong join output. Now the key is wrapped only if there are more than one join keys.
> 
> 
> Diffs
> -----
> 
>   src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/partitioners/SkewedPartitioner.java 57af63f 
>   src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartitionRearrange.java 4ef55b8 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/POLocalRearrangeTez.java 13d9ec9 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/POPartitionRearrangeTez.java 5e94e61 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/POShuffleTezLoad.java 761ae90 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/SkewedPartitionerTez.java 8d4c2a9 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java d8646cb 
>   src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 637246c 
> 
> Diff: https://reviews.apache.org/r/17375/diff/
> 
> 
> Testing
> -------
> 
> * ant clean test-tez passes except TestCustomPartitioner. I confirmed TestCustomPartitioner doesn't pass in current tez branch, so it's not related.
> * All nightly.conf SkewedJoin tests pass except #6 self join case.
> * All tez.conf e2e tests pass.
> 
> Note that for nightly.conf SkewedJoin #9 and #10, I changed the parallel from 8 to 4. The reason is because for right/full outer join, it's important that every reducer task assigned to skewed keys is given at least one row of those skewed keys. For eg, if we assign 8 reducers to key X, we expect these 8 reducers are all given at least one row of key X.
> 
> However, in e2e tests, we arbitrarily assign 8 reducers to two particularly keys for testing purpose. As a result, some reducer tasks are given no row of these fake "skewed" keys, and that makes the tests fail. So I am reducing the parallel to avoid this situation.
> 
> In MR mode, this is not a problem because 1) a single task loads the input data (10k rows), and 2) a single task RRs more than 8 rows of the skewed keys to 8 reducers using SkewedPartitioner. So each reducer is given at lease one row of the skewed keys. But in Tez mode, 1) we transfer the input data to the partition vertex that has a parallelism of 8, and 2) 8 mapper tasks RR the skewed keys using SkewedPartitioner. Since each mapper task is given less than 8 rows of the skewed keys, the skewed keys are not evenly distribute to 8 reducer tasks. i.e. reducer #8 ends up with no rows of the skewed keys.
> 
> In fact, this is a test issue IMO. If we have enough skewed keys to RR to 8 reducer tasks, we won't run into this situation.
> 
> Lastly, I'd like to fix the self join case (#6) in a separate jira because it requires quite a few changes in TezCompiler due to POSplit.
> 
> 
> Thanks,
> 
> Cheolsoo Park
> 
>


Re: Review Request 17375: PIG-3719 Fix skewed join e2e tests in tez branch

Posted by Cheolsoo Park <pi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/17375/
-----------------------------------------------------------

(Updated Jan. 27, 2014, 9:13 p.m.)


Review request for pig, Alex Bain, Daniel Dai, Mark Wagner, and Rohini Palaniswamy.


Changes
-------

Rebase to the current head.


Bugs: PIG-3719
    https://issues.apache.org/jira/browse/PIG-3719


Repository: pig-git


Description
-------

This patch fixes two sets of skewed join e2e tests-
1) tez.conf Join_[7_8].
2) nightly.conf SkewedJoin_[1_10].

There are mainly two changes-
1) In POShuffleTezLoad, we copy PigNullableWritable using PigNullableWritable.newInstance(). But this methods doesn't copy the key of NullablePartitionWritable (subclass of PigNullableWritable for skewed join). This was causing NPE later.
2) In POPartitionRearrangeTez, the init() method builds reduceMap <key, pair<int, int>>. The problem was that the key of this map was always wrapped by tuple even though there is only a single join key, resulting in wrong join output. Now the key is wrapped only if there are more than one join keys.


Diffs (updated)
-----

  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/partitioners/SkewedPartitioner.java 57af63f 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartitionRearrange.java 4ef55b8 
  src/org/apache/pig/backend/hadoop/executionengine/tez/POLocalRearrangeTez.java 13d9ec9 
  src/org/apache/pig/backend/hadoop/executionengine/tez/POPartitionRearrangeTez.java 5e94e61 
  src/org/apache/pig/backend/hadoop/executionengine/tez/POShuffleTezLoad.java 761ae90 
  src/org/apache/pig/backend/hadoop/executionengine/tez/SkewedPartitionerTez.java 8d4c2a9 
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java d8646cb 
  src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 637246c 

Diff: https://reviews.apache.org/r/17375/diff/


Testing
-------

* ant clean test-tez passes except TestCustomPartitioner. I confirmed TestCustomPartitioner doesn't pass in current tez branch, so it's not related.
* All nightly.conf SkewedJoin tests pass except #6 self join case.
* All tez.conf e2e tests pass.

Note that for nightly.conf SkewedJoin #9 and #10, I changed the parallel from 8 to 4. The reason is because for right/full outer join, it's important that every reducer task assigned to skewed keys is given at least one row of those skewed keys. For eg, if we assign 8 reducers to key X, we expect these 8 reducers are all given at least one row of key X.

However, in e2e tests, we arbitrarily assign 8 reducers to two particularly keys for testing purpose. As a result, some reducer tasks are given no row of these fake "skewed" keys, and that makes the tests fail. So I am reducing the parallel to avoid this situation.

In MR mode, this is not a problem because 1) a single task loads the input data (10k rows), and 2) a single task RRs more than 8 rows of the skewed keys to 8 reducers using SkewedPartitioner. So each reducer is given at lease one row of the skewed keys. But in Tez mode, 1) we transfer the input data to the partition vertex that has a parallelism of 8, and 2) 8 mapper tasks RR the skewed keys using SkewedPartitioner. Since each mapper task is given less than 8 rows of the skewed keys, the skewed keys are not evenly distribute to 8 reducer tasks. i.e. reducer #8 ends up with no rows of the skewed keys.

In fact, this is a test issue IMO. If we have enough skewed keys to RR to 8 reducer tasks, we won't run into this situation.

Lastly, I'd like to fix the self join case (#6) in a separate jira because it requires quite a few changes in TezCompiler due to POSplit.


Thanks,

Cheolsoo Park


Re: Review Request 17375: PIG-3719 Fix skewed join e2e tests in tez branch

Posted by Cheolsoo Park <pi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/17375/
-----------------------------------------------------------

(Updated Jan. 27, 2014, 5:51 p.m.)


Review request for pig, Alex Bain, Daniel Dai, Mark Wagner, and Rohini Palaniswamy.


Changes
-------

On a second thought, a better solution for right/full outer join is to adjust the parallelism of POLocalRearrange. For eg, given the following query, I change it as follows-

x = join foo by name right outer, bar by name using 'skewed' parallel 8;

From:
LocalRearrange vertex w/ parallel 8
PartitionRearrange vertex w/ parallel 8
Package vertex w/ parallel 8

To:
LocalRearrange vertex w/ parallel determined by size of skewed table
PartitionRearrange vertex w/ parallel determined by size of streaming table
Package vertex w/ parallel 8

Now I don't have to modify the e2e tests.


Bugs: PIG-3719
    https://issues.apache.org/jira/browse/PIG-3719


Repository: pig-git


Description
-------

This patch fixes two sets of skewed join e2e tests-
1) tez.conf Join_[7_8].
2) nightly.conf SkewedJoin_[1_10].

There are mainly two changes-
1) In POShuffleTezLoad, we copy PigNullableWritable using PigNullableWritable.newInstance(). But this methods doesn't copy the key of NullablePartitionWritable (subclass of PigNullableWritable for skewed join). This was causing NPE later.
2) In POPartitionRearrangeTez, the init() method builds reduceMap <key, pair<int, int>>. The problem was that the key of this map was always wrapped by tuple even though there is only a single join key, resulting in wrong join output. Now the key is wrapped only if there are more than one join keys.


Diffs (updated)
-----

  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/partitioners/SkewedPartitioner.java 57af63f 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartitionRearrange.java 4ef55b8 
  src/org/apache/pig/backend/hadoop/executionengine/tez/POLocalRearrangeTez.java 13d9ec9 
  src/org/apache/pig/backend/hadoop/executionengine/tez/POPartitionRearrangeTez.java c07e69e 
  src/org/apache/pig/backend/hadoop/executionengine/tez/POShuffleTezLoad.java 761ae90 
  src/org/apache/pig/backend/hadoop/executionengine/tez/SkewedPartitionerTez.java 654f350 
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java 197486f 
  src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 637246c 

Diff: https://reviews.apache.org/r/17375/diff/


Testing
-------

* ant clean test-tez passes except TestCustomPartitioner. I confirmed TestCustomPartitioner doesn't pass in current tez branch, so it's not related.
* All nightly.conf SkewedJoin tests pass except #6 self join case.
* All tez.conf e2e tests pass.

Note that for nightly.conf SkewedJoin #9 and #10, I changed the parallel from 8 to 4. The reason is because for right/full outer join, it's important that every reducer task assigned to skewed keys is given at least one row of those skewed keys. For eg, if we assign 8 reducers to key X, we expect these 8 reducers are all given at least one row of key X.

However, in e2e tests, we arbitrarily assign 8 reducers to two particularly keys for testing purpose. As a result, some reducer tasks are given no row of these fake "skewed" keys, and that makes the tests fail. So I am reducing the parallel to avoid this situation.

In MR mode, this is not a problem because 1) a single task loads the input data (10k rows), and 2) a single task RRs more than 8 rows of the skewed keys to 8 reducers using SkewedPartitioner. So each reducer is given at lease one row of the skewed keys. But in Tez mode, 1) we transfer the input data to the partition vertex that has a parallelism of 8, and 2) 8 mapper tasks RR the skewed keys using SkewedPartitioner. Since each mapper task is given less than 8 rows of the skewed keys, the skewed keys are not evenly distribute to 8 reducer tasks. i.e. reducer #8 ends up with no rows of the skewed keys.

In fact, this is a test issue IMO. If we have enough skewed keys to RR to 8 reducer tasks, we won't run into this situation.

Lastly, I'd like to fix the self join case (#6) in a separate jira because it requires quite a few changes in TezCompiler due to POSplit.


Thanks,

Cheolsoo Park


Re: Review Request 17375: PIG-3719 Fix skewed join e2e tests in tez branch

Posted by Cheolsoo Park <pi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/17375/
-----------------------------------------------------------

(Updated Jan. 27, 2014, 4:37 p.m.)


Review request for pig, Alex Bain, Daniel Dai, Mark Wagner, and Rohini Palaniswamy.


Bugs: PIG-3719
    https://issues.apache.org/jira/browse/PIG-3719


Repository: pig-git


Description
-------

This patch fixes two sets of skewed join e2e tests-
1) tez.conf Join_[7_8].
2) nightly.conf SkewedJoin_[1_10].

There are mainly two changes-
1) In POShuffleTezLoad, we copy PigNullableWritable using PigNullableWritable.newInstance(). But this methods doesn't copy the key of NullablePartitionWritable (subclass of PigNullableWritable for skewed join). This was causing NPE later.
2) In POPartitionRearrangeTez, the init() method builds reduceMap <key, pair<int, int>>. The problem was that the key of this map was always wrapped by tuple even though there is only a single join key, resulting in wrong join output. Now the key is wrapped only if there are more than one join keys.


Diffs
-----

  src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/partitioners/SkewedPartitioner.java 57af63f 
  src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPartitionRearrange.java 4ef55b8 
  src/org/apache/pig/backend/hadoop/executionengine/tez/POLocalRearrangeTez.java 13d9ec9 
  src/org/apache/pig/backend/hadoop/executionengine/tez/POPartitionRearrangeTez.java c07e69e 
  src/org/apache/pig/backend/hadoop/executionengine/tez/POShuffleTezLoad.java 761ae90 
  src/org/apache/pig/backend/hadoop/executionengine/tez/SkewedPartitionerTez.java 654f350 
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java 197486f 
  src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 637246c 
  test/e2e/pig/tests/nightly.conf e1a55e6 

Diff: https://reviews.apache.org/r/17375/diff/


Testing (updated)
-------

* ant clean test-tez passes except TestCustomPartitioner. I confirmed TestCustomPartitioner doesn't pass in current tez branch, so it's not related.
* All nightly.conf SkewedJoin tests pass except #6 self join case.
* All tez.conf e2e tests pass.

Note that for nightly.conf SkewedJoin #9 and #10, I changed the parallel from 8 to 4. The reason is because for right/full outer join, it's important that every reducer task assigned to skewed keys is given at least one row of those skewed keys. For eg, if we assign 8 reducers to key X, we expect these 8 reducers are all given at least one row of key X.

However, in e2e tests, we arbitrarily assign 8 reducers to two particularly keys for testing purpose. As a result, some reducer tasks are given no row of these fake "skewed" keys, and that makes the tests fail. So I am reducing the parallel to avoid this situation.

In MR mode, this is not a problem because 1) a single task loads the input data (10k rows), and 2) a single task RRs more than 8 rows of the skewed keys to 8 reducers using SkewedPartitioner. So each reducer is given at lease one row of the skewed keys. But in Tez mode, 1) we transfer the input data to the partition vertex that has a parallelism of 8, and 2) 8 mapper tasks RR the skewed keys using SkewedPartitioner. Since each mapper task is given less than 8 rows of the skewed keys, the skewed keys are not evenly distribute to 8 reducer tasks. i.e. reducer #8 ends up with no rows of the skewed keys.

In fact, this is a test issue IMO. If we have enough skewed keys to RR to 8 reducer tasks, we won't run into this situation.

Lastly, I'd like to fix the self join case (#6) in a separate jira because it requires quite a few changes in TezCompiler due to POSplit.


Thanks,

Cheolsoo Park