You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by Cheolsoo Park <pi...@gmail.com> on 2014/03/25 21:50:28 UTC

Review Request 19633: PIG-3743: Use VertexGroup and Alias vertex for union

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/19633/
-----------------------------------------------------------

Review request for pig, Daniel Dai and Rohini Palaniswamy.


Bugs: PIG-3743
    https://issues.apache.org/jira/browse/PIG-3743


Repository: pig-git


Description
-------

The patch reimplements union using Tez VertexGroup and GroupInputEdge.

The changes include-
* Implemented POVertexGroupInputTez that takes ConcatenatedMergedKeyValuesInput from VertexGroup.
* TezCompiler inserts an alias vertex for union, and the alias vertex is converted to VertexGroup by TezDagBuilder.
* TezStats JobGraphBuilder removes alias vertices since they're not materialized by Tez, and thus, there is no status for them.

Note that-
* Further optimization is possible for the case where union is only followed by store. In that case, we could directly attach a MROutput to VertexGroup instead of adding another vertex that runs the MROutput. I'll follow up with this soon.
* POLocalRearrangeTez is added to each union source because ConcatenatedMergedKeyValuesInput expected ShuffledMergedInputs that requires sorting.


Diffs
-----

  src/org/apache/pig/backend/hadoop/executionengine/tez/POUnionTezLoad.java e496ca8 
  src/org/apache/pig/backend/hadoop/executionengine/tez/POVertexGroupInputTez.java e69de29 
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java 245cade 
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java bce8963 
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezOperator.java b3aa020 
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezPrinter.java f00946c 
  src/org/apache/pig/tools/pigstats/tez/TezStats.java feac11d 
  test/org/apache/pig/test/data/GoldenFiles/TEZC19.gld e69de29 
  test/org/apache/pig/tez/TestTezCompiler.java e71d838 

Diff: https://reviews.apache.org/r/19633/diff/


Testing
-------

ant test-tez passes except TestTezCompiler (known issue).
tez e2e tests all pass.


Thanks,

Cheolsoo Park


Re: Review Request 19633: PIG-3743: Use VertexGroup and Alias vertex for union

Posted by Cheolsoo Park <pi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/19633/
-----------------------------------------------------------

(Updated March 25, 2014, 10 p.m.)


Review request for pig, Daniel Dai and Rohini Palaniswamy.


Changes
-------

Incorporated Rohini's comments.


Bugs: PIG-3743
    https://issues.apache.org/jira/browse/PIG-3743


Repository: pig-git


Description
-------

The patch reimplements union using Tez VertexGroup and GroupInputEdge.

The changes include-
* Implemented POVertexGroupInputTez that takes ConcatenatedMergedKeyValuesInput from VertexGroup.
* TezCompiler inserts an alias vertex for union, and the alias vertex is converted to VertexGroup by TezDagBuilder.
* TezStats JobGraphBuilder removes alias vertices since they're not materialized by Tez, and thus, there is no status for them.

Note that-
* Further optimization is possible for the case where union is only followed by store. In that case, we could directly attach a MROutput to VertexGroup instead of adding another vertex that runs the MROutput. I'll follow up with this soon.
* POLocalRearrangeTez is added to each union source because ConcatenatedMergedKeyValuesInput expected ShuffledMergedInputs that requires sorting.


Diffs (updated)
-----

  src/org/apache/pig/backend/hadoop/executionengine/tez/POUnionTezLoad.java e496ca8 
  src/org/apache/pig/backend/hadoop/executionengine/tez/POVertexGroupInputTez.java e69de29 
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java 245cade 
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java bce8963 
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezOperator.java b3aa020 
  src/org/apache/pig/backend/hadoop/executionengine/tez/TezPrinter.java f00946c 
  src/org/apache/pig/tools/pigstats/tez/TezStats.java feac11d 
  test/org/apache/pig/test/data/GoldenFiles/TEZC19.gld e69de29 
  test/org/apache/pig/tez/TestTezCompiler.java e71d838 

Diff: https://reviews.apache.org/r/19633/diff/


Testing
-------

ant test-tez passes except TestTezCompiler (known issue).
tez e2e tests all pass.


Thanks,

Cheolsoo Park


Re: Review Request 19633: PIG-3743: Use VertexGroup and Alias vertex for union

Posted by Cheolsoo Park <pi...@gmail.com>.

> On March 25, 2014, 9:27 p.m., Rohini Palaniswamy wrote:
> > This will not handle union followed by store right? Can we create a separate jira for that?

Correct. Like I said in the description, MROutput should be added directly to VertexGroup.

I will file a jira.


> On March 25, 2014, 9:27 p.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/tez/POVertexGroupInputTez.java, lines 82-83
> > <https://reviews.apache.org/r/19633/diff/1/?file=535690#file535690line82>
> >
> >     Don't we have to iterate through all the values? Won't there be more than one if shuffle edge is used and input grouped?

No, I don't think so. It should be handled by ConcatenatedMergedKeyValuesInput. If I am understanding it correctly, calling reader.next() updates the underlying reader. So the following reader.getCurrentValues().iterator() returns the new iterator if the previous reader is done.

In fact, I can confirm the result by testing union with multiple sources, e.g. 3+. It seems correct.


> On March 25, 2014, 9:27 p.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java, line 1979
> > <https://reviews.apache.org/r/19633/diff/1/?file=535691#file535691line1979>
> >
> >     Can you please add a TODO to use POValueOutputTez instead of POLocalRearrange and unsorted shuffle with TEZ-661 and PIG-3775

Sure.


> On March 25, 2014, 9:27 p.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/backend/hadoop/executionengine/tez/TezPrinter.java, line 54
> > <https://reviews.apache.org/r/19633/diff/1/?file=535694#file535694line54>
> >
> >     hasVertexGroup()

I 100% agree hasVertexGroup() reads better, but since explain doesn't go through TezDagBuilder, VertexGroup of tezOper is not set when TezPrinter traverses the plan.

Maybe I consolidate hasVertexGroup() and isUnion() into one, and then use it everywhere.


- Cheolsoo


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/19633/#review38532
-----------------------------------------------------------


On March 25, 2014, 8:50 p.m., Cheolsoo Park wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/19633/
> -----------------------------------------------------------
> 
> (Updated March 25, 2014, 8:50 p.m.)
> 
> 
> Review request for pig, Daniel Dai and Rohini Palaniswamy.
> 
> 
> Bugs: PIG-3743
>     https://issues.apache.org/jira/browse/PIG-3743
> 
> 
> Repository: pig-git
> 
> 
> Description
> -------
> 
> The patch reimplements union using Tez VertexGroup and GroupInputEdge.
> 
> The changes include-
> * Implemented POVertexGroupInputTez that takes ConcatenatedMergedKeyValuesInput from VertexGroup.
> * TezCompiler inserts an alias vertex for union, and the alias vertex is converted to VertexGroup by TezDagBuilder.
> * TezStats JobGraphBuilder removes alias vertices since they're not materialized by Tez, and thus, there is no status for them.
> 
> Note that-
> * Further optimization is possible for the case where union is only followed by store. In that case, we could directly attach a MROutput to VertexGroup instead of adding another vertex that runs the MROutput. I'll follow up with this soon.
> * POLocalRearrangeTez is added to each union source because ConcatenatedMergedKeyValuesInput expected ShuffledMergedInputs that requires sorting.
> 
> 
> Diffs
> -----
> 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/POUnionTezLoad.java e496ca8 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/POVertexGroupInputTez.java e69de29 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java 245cade 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java bce8963 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezOperator.java b3aa020 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezPrinter.java f00946c 
>   src/org/apache/pig/tools/pigstats/tez/TezStats.java feac11d 
>   test/org/apache/pig/test/data/GoldenFiles/TEZC19.gld e69de29 
>   test/org/apache/pig/tez/TestTezCompiler.java e71d838 
> 
> Diff: https://reviews.apache.org/r/19633/diff/
> 
> 
> Testing
> -------
> 
> ant test-tez passes except TestTezCompiler (known issue).
> tez e2e tests all pass.
> 
> 
> Thanks,
> 
> Cheolsoo Park
> 
>


Re: Review Request 19633: PIG-3743: Use VertexGroup and Alias vertex for union

Posted by Rohini Palaniswamy <ro...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/19633/#review38532
-----------------------------------------------------------


This will not handle union followed by store right? Can we create a separate jira for that?


src/org/apache/pig/backend/hadoop/executionengine/tez/POVertexGroupInputTez.java
<https://reviews.apache.org/r/19633/#comment70769>

    Don't we have to iterate through all the values? Won't there be more than one if shuffle edge is used and input grouped?



src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java
<https://reviews.apache.org/r/19633/#comment70770>

    Can you please add a TODO to use POValueOutputTez instead of POLocalRearrange and unsorted shuffle with TEZ-661 and PIG-3775



src/org/apache/pig/backend/hadoop/executionengine/tez/TezPrinter.java
<https://reviews.apache.org/r/19633/#comment70772>

    hasVertexGroup()


- Rohini Palaniswamy


On March 25, 2014, 8:50 p.m., Cheolsoo Park wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/19633/
> -----------------------------------------------------------
> 
> (Updated March 25, 2014, 8:50 p.m.)
> 
> 
> Review request for pig, Daniel Dai and Rohini Palaniswamy.
> 
> 
> Bugs: PIG-3743
>     https://issues.apache.org/jira/browse/PIG-3743
> 
> 
> Repository: pig-git
> 
> 
> Description
> -------
> 
> The patch reimplements union using Tez VertexGroup and GroupInputEdge.
> 
> The changes include-
> * Implemented POVertexGroupInputTez that takes ConcatenatedMergedKeyValuesInput from VertexGroup.
> * TezCompiler inserts an alias vertex for union, and the alias vertex is converted to VertexGroup by TezDagBuilder.
> * TezStats JobGraphBuilder removes alias vertices since they're not materialized by Tez, and thus, there is no status for them.
> 
> Note that-
> * Further optimization is possible for the case where union is only followed by store. In that case, we could directly attach a MROutput to VertexGroup instead of adding another vertex that runs the MROutput. I'll follow up with this soon.
> * POLocalRearrangeTez is added to each union source because ConcatenatedMergedKeyValuesInput expected ShuffledMergedInputs that requires sorting.
> 
> 
> Diffs
> -----
> 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/POUnionTezLoad.java e496ca8 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/POVertexGroupInputTez.java e69de29 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java 245cade 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java bce8963 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezOperator.java b3aa020 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezPrinter.java f00946c 
>   src/org/apache/pig/tools/pigstats/tez/TezStats.java feac11d 
>   test/org/apache/pig/test/data/GoldenFiles/TEZC19.gld e69de29 
>   test/org/apache/pig/tez/TestTezCompiler.java e71d838 
> 
> Diff: https://reviews.apache.org/r/19633/diff/
> 
> 
> Testing
> -------
> 
> ant test-tez passes except TestTezCompiler (known issue).
> tez e2e tests all pass.
> 
> 
> Thanks,
> 
> Cheolsoo Park
> 
>