You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by Alex Bain <am...@gmail.com> on 2013/10/24 03:42:05 UTC

Re: Review Request 14897: PIG-3538 Implement LIMIT in Tez

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/14897/
-----------------------------------------------------------

(Updated Oct. 23, 2013, 6:42 p.m.)


Review request for pig, Cheolsoo Park, Daniel Dai, Mark Wagner, and Rohini Palaniswamy.


Summary (updated)
-----------------

PIG-3538 Implement LIMIT in Tez


Bugs: PIG-3538
    https://issues.apache.org/jira/browse/PIG-3538


Repository: pig-git


Description
-------

Implement LIMIT in Tez by providing an implementation of visitLimit in TezCompiler.java.


Diffs
-----

  src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java 0c20214 

Diff: https://reviews.apache.org/r/14897/diff/


Testing (updated)
-------

[abain@abain-ld pig]$ cat data/1.dat
1,orange
2,apple
3,strawberry

[abain@abain-ld pig]$ cat test3.pig
a = load './1.dat' using PigStorage(',') as (id:int, fruit:chararray);
b = LIMIT a 2;
STORE b INTO 'foo';

I ran with with "pig -x tez -f test3.pig" and got the following (correct results):

[abain@abain-ld pig]$ hadoop fs -ls /user/abain/foo
Found 2 items
-rw-r--r--   1 abain supergroup          0 2013-10-23 18:38 /user/abain/foo/_SUCCESS
-rw-r--r--   1 abain supergroup         17 2013-10-23 18:38 /user/abain/foo/part-r-00000

[abain@abain-ld pig]$ hadoop fs -cat /user/abain/foo/part-r-00000
1	orange
2	apple


Thanks,

Alex Bain


Re: Review Request 14897: PIG-3538 Implement LIMIT in Tez

Posted by Daniel Dai <da...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/14897/#review27498
-----------------------------------------------------------

Ship it!


Ship It!

- Daniel Dai


On Oct. 24, 2013, 11:47 p.m., Alex Bain wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/14897/
> -----------------------------------------------------------
> 
> (Updated Oct. 24, 2013, 11:47 p.m.)
> 
> 
> Review request for pig, Cheolsoo Park, Daniel Dai, Mark Wagner, and Rohini Palaniswamy.
> 
> 
> Bugs: PIG-3538
>     https://issues.apache.org/jira/browse/PIG-3538
> 
> 
> Repository: pig-git
> 
> 
> Description
> -------
> 
> Implement LIMIT in Tez by providing an implementation of visitLimit in TezCompiler.java.
> 
> UPDATED (Oct 24 4:37 PM):
> 1. I added a test to TestTezCompiler.java and a GLD file
> 2. I included Daniel's patch for a new e2e test
> 
> 
> Diffs
> -----
> 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java 0c20214 
>   test/e2e/pig/tests/tez.conf 5edc093 
>   test/org/apache/pig/test/data/GoldenFiles/TEZC4.gld PRE-CREATION 
>   test/org/apache/pig/tez/TestTezCompiler.java ef51876 
>   test/org/apache/pig/tez/TestTezJobControlCompiler.java 0a23513 
> 
> Diff: https://reviews.apache.org/r/14897/diff/
> 
> 
> Testing
> -------
> 
> [abain@abain-ld pig]$ cat data/1.dat
> 1,orange
> 2,apple
> 3,strawberry
> 
> [abain@abain-ld pig]$ cat test3.pig
> a = load './1.dat' using PigStorage(',') as (id:int, fruit:chararray);
> b = LIMIT a 2;
> STORE b INTO 'foo';
> 
> I ran with with "pig -x tez -f test3.pig" and got the following (correct results):
> 
> [abain@abain-ld pig]$ hadoop fs -ls /user/abain/foo
> Found 2 items
> -rw-r--r--   1 abain supergroup          0 2013-10-23 18:38 /user/abain/foo/_SUCCESS
> -rw-r--r--   1 abain supergroup         17 2013-10-23 18:38 /user/abain/foo/part-r-00000
> 
> [abain@abain-ld pig]$ hadoop fs -cat /user/abain/foo/part-r-00000
> 1	orange
> 2	apple
> 
> UPDATED (Oct 24 4:37 PM):
> 1. ant -Dtestcase=TestTezCompiler test passes
> 2. I ran test-e2e-tez. The new test seems to pass (although something else failed).
> 
> 
> Thanks,
> 
> Alex Bain
> 
>


Re: Review Request 14897: PIG-3538 Implement LIMIT in Tez

Posted by Daniel Dai <da...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/14897/#review27494
-----------------------------------------------------------


2. I ran test-e2e-tez. The new test seems to pass (although something else failed).
Yes, Checkin_2 fail for me. Should be a simple fix.

- Daniel Dai


On Oct. 24, 2013, 11:47 p.m., Alex Bain wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/14897/
> -----------------------------------------------------------
> 
> (Updated Oct. 24, 2013, 11:47 p.m.)
> 
> 
> Review request for pig, Cheolsoo Park, Daniel Dai, Mark Wagner, and Rohini Palaniswamy.
> 
> 
> Bugs: PIG-3538
>     https://issues.apache.org/jira/browse/PIG-3538
> 
> 
> Repository: pig-git
> 
> 
> Description
> -------
> 
> Implement LIMIT in Tez by providing an implementation of visitLimit in TezCompiler.java.
> 
> UPDATED (Oct 24 4:37 PM):
> 1. I added a test to TestTezCompiler.java and a GLD file
> 2. I included Daniel's patch for a new e2e test
> 
> 
> Diffs
> -----
> 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java 0c20214 
>   test/e2e/pig/tests/tez.conf 5edc093 
>   test/org/apache/pig/test/data/GoldenFiles/TEZC4.gld PRE-CREATION 
>   test/org/apache/pig/tez/TestTezCompiler.java ef51876 
>   test/org/apache/pig/tez/TestTezJobControlCompiler.java 0a23513 
> 
> Diff: https://reviews.apache.org/r/14897/diff/
> 
> 
> Testing
> -------
> 
> [abain@abain-ld pig]$ cat data/1.dat
> 1,orange
> 2,apple
> 3,strawberry
> 
> [abain@abain-ld pig]$ cat test3.pig
> a = load './1.dat' using PigStorage(',') as (id:int, fruit:chararray);
> b = LIMIT a 2;
> STORE b INTO 'foo';
> 
> I ran with with "pig -x tez -f test3.pig" and got the following (correct results):
> 
> [abain@abain-ld pig]$ hadoop fs -ls /user/abain/foo
> Found 2 items
> -rw-r--r--   1 abain supergroup          0 2013-10-23 18:38 /user/abain/foo/_SUCCESS
> -rw-r--r--   1 abain supergroup         17 2013-10-23 18:38 /user/abain/foo/part-r-00000
> 
> [abain@abain-ld pig]$ hadoop fs -cat /user/abain/foo/part-r-00000
> 1	orange
> 2	apple
> 
> UPDATED (Oct 24 4:37 PM):
> 1. ant -Dtestcase=TestTezCompiler test passes
> 2. I ran test-e2e-tez. The new test seems to pass (although something else failed).
> 
> 
> Thanks,
> 
> Alex Bain
> 
>


Re: Review Request 14897: PIG-3538 Implement LIMIT in Tez

Posted by Alex Bain <am...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/14897/
-----------------------------------------------------------

(Updated Oct. 24, 2013, 4:47 p.m.)


Review request for pig, Cheolsoo Park, Daniel Dai, Mark Wagner, and Rohini Palaniswamy.


Changes
-------

New diff generated with --no-prefix


Bugs: PIG-3538
    https://issues.apache.org/jira/browse/PIG-3538


Repository: pig-git


Description
-------

Implement LIMIT in Tez by providing an implementation of visitLimit in TezCompiler.java.

UPDATED (Oct 24 4:37 PM):
1. I added a test to TestTezCompiler.java and a GLD file
2. I included Daniel's patch for a new e2e test


Diffs (updated)
-----

  src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java 0c20214 
  test/e2e/pig/tests/tez.conf 5edc093 
  test/org/apache/pig/test/data/GoldenFiles/TEZC4.gld PRE-CREATION 
  test/org/apache/pig/tez/TestTezCompiler.java ef51876 
  test/org/apache/pig/tez/TestTezJobControlCompiler.java 0a23513 

Diff: https://reviews.apache.org/r/14897/diff/


Testing
-------

[abain@abain-ld pig]$ cat data/1.dat
1,orange
2,apple
3,strawberry

[abain@abain-ld pig]$ cat test3.pig
a = load './1.dat' using PigStorage(',') as (id:int, fruit:chararray);
b = LIMIT a 2;
STORE b INTO 'foo';

I ran with with "pig -x tez -f test3.pig" and got the following (correct results):

[abain@abain-ld pig]$ hadoop fs -ls /user/abain/foo
Found 2 items
-rw-r--r--   1 abain supergroup          0 2013-10-23 18:38 /user/abain/foo/_SUCCESS
-rw-r--r--   1 abain supergroup         17 2013-10-23 18:38 /user/abain/foo/part-r-00000

[abain@abain-ld pig]$ hadoop fs -cat /user/abain/foo/part-r-00000
1	orange
2	apple

UPDATED (Oct 24 4:37 PM):
1. ant -Dtestcase=TestTezCompiler test passes
2. I ran test-e2e-tez. The new test seems to pass (although something else failed).


Thanks,

Alex Bain


Re: Review Request 14897: PIG-3538 Implement LIMIT in Tez

Posted by Alex Bain <am...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/14897/
-----------------------------------------------------------

(Updated Oct. 24, 2013, 4:45 p.m.)


Review request for pig, Cheolsoo Park, Daniel Dai, Mark Wagner, and Rohini Palaniswamy.


Changes
-------

Updated diff based on comments from reviewers. Incorporated Daniel's patch for the e2e test.


Bugs: PIG-3538
    https://issues.apache.org/jira/browse/PIG-3538


Repository: pig-git


Description (updated)
-------

Implement LIMIT in Tez by providing an implementation of visitLimit in TezCompiler.java.

UPDATED (Oct 24 4:37 PM):
1. I added a test to TestTezCompiler.java and a GLD file
2. I included Daniel's patch for a new e2e test


Diffs (updated)
-----

  src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java 0c20214 
  test/e2e/pig/tests/tez.conf 5edc093 
  test/org/apache/pig/test/data/GoldenFiles/TEZC4.gld PRE-CREATION 
  test/org/apache/pig/tez/TestTezCompiler.java ef51876 
  test/org/apache/pig/tez/TestTezJobControlCompiler.java 0a23513 

Diff: https://reviews.apache.org/r/14897/diff/


Testing (updated)
-------

[abain@abain-ld pig]$ cat data/1.dat
1,orange
2,apple
3,strawberry

[abain@abain-ld pig]$ cat test3.pig
a = load './1.dat' using PigStorage(',') as (id:int, fruit:chararray);
b = LIMIT a 2;
STORE b INTO 'foo';

I ran with with "pig -x tez -f test3.pig" and got the following (correct results):

[abain@abain-ld pig]$ hadoop fs -ls /user/abain/foo
Found 2 items
-rw-r--r--   1 abain supergroup          0 2013-10-23 18:38 /user/abain/foo/_SUCCESS
-rw-r--r--   1 abain supergroup         17 2013-10-23 18:38 /user/abain/foo/part-r-00000

[abain@abain-ld pig]$ hadoop fs -cat /user/abain/foo/part-r-00000
1	orange
2	apple

UPDATED (Oct 24 4:37 PM):
1. ant -Dtestcase=TestTezCompiler test passes
2. I ran test-e2e-tez. The new test seems to pass (although something else failed).


Thanks,

Alex Bain


Re: Review Request 14897: PIG-3538 Implement LIMIT in Tez

Posted by Daniel Dai <da...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/14897/#review27470
-----------------------------------------------------------


Thanks Alex. Patch looks good in general. Might consider some optimization in the future:
1. If the previous stage using 1 reduce, no need to add one more vertex
2. If the limitplan is null (ie, not the "limited order by" case), we might not need a shuffle edge, a pass through edge should be enough if possible
3. Similar to PIG-1270, we can push limit to InputHandler
4. We also need to think through the "limited order by" case once "order by" is implemented.

Once Choelsoo's comments are addressed, we are ready to go. I will add a e2e test case for it.

- Daniel Dai


On Oct. 24, 2013, 1:42 a.m., Alex Bain wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/14897/
> -----------------------------------------------------------
> 
> (Updated Oct. 24, 2013, 1:42 a.m.)
> 
> 
> Review request for pig, Cheolsoo Park, Daniel Dai, Mark Wagner, and Rohini Palaniswamy.
> 
> 
> Bugs: PIG-3538
>     https://issues.apache.org/jira/browse/PIG-3538
> 
> 
> Repository: pig-git
> 
> 
> Description
> -------
> 
> Implement LIMIT in Tez by providing an implementation of visitLimit in TezCompiler.java.
> 
> 
> Diffs
> -----
> 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java 0c20214 
> 
> Diff: https://reviews.apache.org/r/14897/diff/
> 
> 
> Testing
> -------
> 
> [abain@abain-ld pig]$ cat data/1.dat
> 1,orange
> 2,apple
> 3,strawberry
> 
> [abain@abain-ld pig]$ cat test3.pig
> a = load './1.dat' using PigStorage(',') as (id:int, fruit:chararray);
> b = LIMIT a 2;
> STORE b INTO 'foo';
> 
> I ran with with "pig -x tez -f test3.pig" and got the following (correct results):
> 
> [abain@abain-ld pig]$ hadoop fs -ls /user/abain/foo
> Found 2 items
> -rw-r--r--   1 abain supergroup          0 2013-10-23 18:38 /user/abain/foo/_SUCCESS
> -rw-r--r--   1 abain supergroup         17 2013-10-23 18:38 /user/abain/foo/part-r-00000
> 
> [abain@abain-ld pig]$ hadoop fs -cat /user/abain/foo/part-r-00000
> 1	orange
> 2	apple
> 
> 
> Thanks,
> 
> Alex Bain
> 
>


Re: Review Request 14897: PIG-3538 Implement LIMIT in Tez

Posted by Cheolsoo Park <pi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/14897/#review27443
-----------------------------------------------------------


Looks great! Thanks a lot!

In terms of tests, can you please add a test case to TestTezCompiler? Please look at existing test cases and TEZCx.gld files. With that, I think we can commit this patch. Later we should add more tests using either mini cluster or e2e.

I also made few minor comments below.


src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java
<https://reviews.apache.org/r/14897/#comment53329>

    Please remove all the trailing white spaces in the patch.



src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java
<https://reviews.apache.org/r/14897/#comment53333>

    Can you mark this comment as TODO?



src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java
<https://reviews.apache.org/r/14897/#comment53335>

    This is not your fault but mine. I realized that blocking(op) has no need to take an argument. Can you please change it to blocking()? It just confuses me when reading the code.



src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java
<https://reviews.apache.org/r/14897/#comment53334>

    In the same line.



src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java
<https://reviews.apache.org/r/14897/#comment53330>

    Do you mind using Lists.newArrayList() instead of new ArrayList<PhysicalPlan>()? Just for cleaner code.
    
    There are several places.



src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java
<https://reviews.apache.org/r/14897/#comment53331>

    Why is this method public? Can't we keep it private?
    
    Also please put the method signature and curly brace in a single line.


- Cheolsoo Park


On Oct. 24, 2013, 1:42 a.m., Alex Bain wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/14897/
> -----------------------------------------------------------
> 
> (Updated Oct. 24, 2013, 1:42 a.m.)
> 
> 
> Review request for pig, Cheolsoo Park, Daniel Dai, Mark Wagner, and Rohini Palaniswamy.
> 
> 
> Bugs: PIG-3538
>     https://issues.apache.org/jira/browse/PIG-3538
> 
> 
> Repository: pig-git
> 
> 
> Description
> -------
> 
> Implement LIMIT in Tez by providing an implementation of visitLimit in TezCompiler.java.
> 
> 
> Diffs
> -----
> 
>   src/org/apache/pig/backend/hadoop/executionengine/tez/TezCompiler.java 0c20214 
> 
> Diff: https://reviews.apache.org/r/14897/diff/
> 
> 
> Testing
> -------
> 
> [abain@abain-ld pig]$ cat data/1.dat
> 1,orange
> 2,apple
> 3,strawberry
> 
> [abain@abain-ld pig]$ cat test3.pig
> a = load './1.dat' using PigStorage(',') as (id:int, fruit:chararray);
> b = LIMIT a 2;
> STORE b INTO 'foo';
> 
> I ran with with "pig -x tez -f test3.pig" and got the following (correct results):
> 
> [abain@abain-ld pig]$ hadoop fs -ls /user/abain/foo
> Found 2 items
> -rw-r--r--   1 abain supergroup          0 2013-10-23 18:38 /user/abain/foo/_SUCCESS
> -rw-r--r--   1 abain supergroup         17 2013-10-23 18:38 /user/abain/foo/part-r-00000
> 
> [abain@abain-ld pig]$ hadoop fs -cat /user/abain/foo/part-r-00000
> 1	orange
> 2	apple
> 
> 
> Thanks,
> 
> Alex Bain
> 
>