Posted to dev@pig.apache.org by Rohini Palaniswamy <ro...@gmail.com> on 2015/06/16 09:20:00 UTC

Review Request 35491: PIG-4574: Eliminate identity vertex for order by and skewed join right after LOAD

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35491/
-----------------------------------------------------------

Review request for pig.


Bugs: PIG-4574
    https://issues.apache.org/jira/browse/PIG-4574


Repository: pig


Description
-------

The Partitioner vertex now reads the order by/skewed join data directly from HDFS instead of getting it from the sampler vertex.

This jira does not optimize the following case:

A = LOAD 'x' ...;
B = LOAD 'y' ...;
C = UNION A, B;
D = ORDER C BY ..;

That case depends on UnionOptimizer being turned on and will need more changes, so I will leave it for another jira.


Diffs
-----

  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/TezCompiler.java 1685498 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/operator/POIdentityInOutTez.java 1685498 
  http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/operator/POLocalRearrangeTez.java 1685498 
  http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/data/GoldenFiles/tez/TEZC-Limit-2.gld 1685498 
  http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/data/GoldenFiles/tez/TEZC-Order-1.gld 1685498 
  http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/data/GoldenFiles/tez/TEZC-Order-2.gld PRE-CREATION 
  http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/data/GoldenFiles/tez/TEZC-SkewJoin-1.gld 1685498 
  http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/data/GoldenFiles/tez/TEZC-SkewJoin-2.gld PRE-CREATION 
  http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/data/GoldenFiles/tez/TEZC-Union-16-OPTOFF.gld 1685498 
  http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/data/GoldenFiles/tez/TEZC-Union-16.gld 1685498 
  http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/tez/TestTezAutoParallelism.java 1685498 
  http://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/tez/TestTezCompiler.java 1685498 

Diff: https://reviews.apache.org/r/35491/diff/


Testing
-------

Ran a subset of the e2e tests: SkewedJoin, Union, Order, MultiQuery_Self, MultiQuery_Union.

Also ran L9.pig. Before the patch:

File System Counters
        FILE_BYTES_READ=2028282366911
        FILE_BYTES_WRITTEN=4049785379197
        HDFS_BYTES_READ=1011533488395
        HDFS_BYTES_WRITTEN=1010554380555

After the patch:

File System Counters
        FILE_BYTES_READ=1007449863330
        FILE_BYTES_WRITTEN=2016036957653
        HDFS_BYTES_READ=2023066976790
        HDFS_BYTES_WRITTEN=1010554380555
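
As a rough reading of the two counter dumps, the Python sketch below computes the after/before ratio for each counter. The values are copied from the L9.pig runs; the names `before`, `after` and `ratios` are illustrative only and are not part of the patch.

```python
# Counter values copied from the L9.pig runs reported in this review request.
before = {
    "FILE_BYTES_READ": 2028282366911,
    "FILE_BYTES_WRITTEN": 4049785379197,
    "HDFS_BYTES_READ": 1011533488395,
    "HDFS_BYTES_WRITTEN": 1010554380555,
}
after = {
    "FILE_BYTES_READ": 1007449863330,
    "FILE_BYTES_WRITTEN": 2016036957653,
    "HDFS_BYTES_READ": 2023066976790,
    "HDFS_BYTES_WRITTEN": 1010554380555,
}


def ratios(before, after):
    """Return the after/before ratio for every counter."""
    return {name: after[name] / before[name] for name in before}


if __name__ == "__main__":
    for name, ratio in ratios(before, after).items():
        print(f"{name}: {ratio:.3f}x")
```

HDFS_BYTES_READ exactly doubles, which is consistent with the input being read from HDFS a second time, while FILE_BYTES_READ and FILE_BYTES_WRITTEN drop to roughly half and HDFS_BYTES_WRITTEN is unchanged. Summed over the four counters, fewer bytes are moved after the patch.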


Thanks,

Rohini Palaniswamy


Re: Review Request 35491: PIG-4574: Eliminate identity vertex for order by and skewed join right after LOAD

Posted by Daniel Dai <da...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35491/#review89294
-----------------------------------------------------------

Ship It!

- Daniel Dai




Re: Review Request 35491: PIG-4574: Eliminate identity vertex for order by and skewed join right after LOAD

Posted by Rohini Palaniswamy <ro...@gmail.com>.

> On June 24, 2015, 6:38 p.m., Daniel Dai wrote:
> > http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/operator/POLocalRearrangeTez.java, line 150
> > <https://reviews.apache.org/r/35491/diff/1/?file=985529#file985529line150>
> >
> >     Can you add a comment explaining why we need to wrap the key in NullablePartitionWritable for skewed join?

Sure. POPartitionRearrange on the right table creates a NullablePartitionWritable as the key. Since the left side uses LocalRearrange, we have to wrap its key explicitly to match the key type of the right side.
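
To illustrate the key-type point, here is a toy Python model; it is not Pig code. `PartitionKey` and `shuffle` are hypothetical stand-ins, and the choice to ignore the partition index when comparing keys is an assumption of this sketch rather than a statement about NullablePartitionWritable.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PartitionKey:
    """Toy stand-in for a partition-aware key wrapper (hypothetical)."""
    key: str
    # Assumption of this toy model: the partition index takes no part in
    # equality or hashing, so two wrapped keys match on `key` alone.
    partition_index: int = field(default=-1, compare=False)


def shuffle(records):
    """Group (key, value) pairs by key, roughly as a shuffle would."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return dict(groups)


# The right side already emits wrapped keys; the left side emits raw keys.
right = [(PartitionKey("a", 0), "r1")]
left_raw = [("a", "l1")]
left_wrapped = [(PartitionKey(k), v) for k, v in left_raw]

# Raw "a" and PartitionKey("a") are different keys, so the rows never meet.
unmatched = shuffle(left_raw + right)
# Once the left key is wrapped, both sides land in the same group.
matched = shuffle(left_wrapped + right)
```

In this model the unwrapped left rows end up in a separate group from the right rows, which is the mismatch the explicit wrapping avoids.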


- Rohini


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35491/#review89225
-----------------------------------------------------------




Re: Review Request 35491: PIG-4574: Eliminate identity vertex for order by and skewed join right after LOAD

Posted by Daniel Dai <da...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/35491/#review89225
-----------------------------------------------------------



http://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/operator/POLocalRearrangeTez.java (line 150)
<https://reviews.apache.org/r/35491/#comment141821>

    Can you add a comment explaining why we need to wrap the key in NullablePartitionWritable for skewed join?


- Daniel Dai

