You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by Aniket Mokashi <an...@gmail.com> on 2013/10/26 03:29:38 UTC

Review Request 14964: PIG-3047 Check the size of a relation before adding it to distributed cache in Replicated join

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/14964/
-----------------------------------------------------------

Review request for pig, Cheolsoo Park, Daniel Dai, Dmitriy Ryaboy, and Julien Le Dem.


Bugs: PIG-3047
    https://issues.apache.org/jira/browse/PIG-3047


Repository: pig


Description
-------

-Check the size of a relation before adding it to distributed cache in Replicated join - 1G by default


Diffs
-----

  trunk/conf/pig.properties 1531787 
  trunk/src/org/apache/pig/PigConfiguration.java 1531787 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/InputSizeReducerEstimator.java 1531787 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1531787 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 1531787 
  trunk/src/org/apache/pig/impl/util/Utils.java 1531787 
  trunk/test/org/apache/pig/test/PigStorageWithStatistics.java 1531787 
  trunk/test/org/apache/pig/test/TestFRJoin2.java 1531787 

Diff: https://reviews.apache.org/r/14964/diff/


Testing
-------


Thanks,

Aniket Mokashi


Re: Review Request 14964: PIG-3047 Check the size of a relation before adding it to distributed cache in Replicated join

Posted by Aniket Mokashi <an...@gmail.com>.

> On Oct. 28, 2013, 7:20 p.m., Cheolsoo Park wrote:
> > Looks good, and TestFRJoin2 passes.
> > 
> > Aniket, do you mind opening a documentation jira for this? Or you can update it when committing the patch. I think we should change the following section-
> > 
> > Conditions
> > Fragment replicate joins are experimental; we don't have a strong sense of how small the small relation must be to fit into memory. In our tests with a simple query that involves just a JOIN, a relation of up to 100 M can be used if the process overall gets 1 GB of memory. Please share your observations and experience with us.

Thanks Cheolsoo, let me submit a RB update to incorporate documentation for this.


- Aniket


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/14964/#review27628
-----------------------------------------------------------


On Oct. 28, 2013, 6:45 a.m., Aniket Mokashi wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/14964/
> -----------------------------------------------------------
> 
> (Updated Oct. 28, 2013, 6:45 a.m.)
> 
> 
> Review request for pig, Cheolsoo Park, Daniel Dai, Dmitriy Ryaboy, and Julien Le Dem.
> 
> 
> Bugs: PIG-3047
>     https://issues.apache.org/jira/browse/PIG-3047
> 
> 
> Repository: pig
> 
> 
> Description
> -------
> 
> -Check the size of a relation before adding it to distributed cache in Replicated join - 1G by default
> 
> 
> Diffs
> -----
> 
>   trunk/conf/pig.properties 1536246 
>   trunk/src/org/apache/pig/PigConfiguration.java 1536246 
>   trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/InputSizeReducerEstimator.java 1536246 
>   trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1536246 
>   trunk/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 1536246 
>   trunk/src/org/apache/pig/impl/util/Utils.java 1536246 
>   trunk/test/org/apache/pig/test/PigStorageWithStatistics.java 1536246 
>   trunk/test/org/apache/pig/test/TestFRJoin2.java 1536246 
> 
> Diff: https://reviews.apache.org/r/14964/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Aniket Mokashi
> 
>


Re: Review Request 14964: PIG-3047 Check the size of a relation before adding it to distributed cache in Replicated join

Posted by Cheolsoo Park <pi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/14964/#review27628
-----------------------------------------------------------

Ship it!


Looks good, and TestFRJoin2 passes.

Aniket, do you mind opening a documentation jira for this? Or you can update it when committing the patch. I think we should change the following section-

Conditions
Fragment replicate joins are experimental; we don't have a strong sense of how small the small relation must be to fit into memory. In our tests with a simple query that involves just a JOIN, a relation of up to 100 M can be used if the process overall gets 1 GB of memory. Please share your observations and experience with us.

- Cheolsoo Park


On Oct. 28, 2013, 6:45 a.m., Aniket Mokashi wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/14964/
> -----------------------------------------------------------
> 
> (Updated Oct. 28, 2013, 6:45 a.m.)
> 
> 
> Review request for pig, Cheolsoo Park, Daniel Dai, Dmitriy Ryaboy, and Julien Le Dem.
> 
> 
> Bugs: PIG-3047
>     https://issues.apache.org/jira/browse/PIG-3047
> 
> 
> Repository: pig
> 
> 
> Description
> -------
> 
> -Check the size of a relation before adding it to distributed cache in Replicated join - 1G by default
> 
> 
> Diffs
> -----
> 
>   trunk/conf/pig.properties 1536246 
>   trunk/src/org/apache/pig/PigConfiguration.java 1536246 
>   trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/InputSizeReducerEstimator.java 1536246 
>   trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1536246 
>   trunk/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 1536246 
>   trunk/src/org/apache/pig/impl/util/Utils.java 1536246 
>   trunk/test/org/apache/pig/test/PigStorageWithStatistics.java 1536246 
>   trunk/test/org/apache/pig/test/TestFRJoin2.java 1536246 
> 
> Diff: https://reviews.apache.org/r/14964/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Aniket Mokashi
> 
>


Re: Review Request 14964: PIG-3047 Check the size of a relation before adding it to distributed cache in Replicated join

Posted by Cheolsoo Park <pi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/14964/#review27643
-----------------------------------------------------------

Ship it!


Ship It!

- Cheolsoo Park


On Oct. 28, 2013, 10:28 p.m., Aniket Mokashi wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/14964/
> -----------------------------------------------------------
> 
> (Updated Oct. 28, 2013, 10:28 p.m.)
> 
> 
> Review request for pig, Cheolsoo Park, Daniel Dai, Dmitriy Ryaboy, and Julien Le Dem.
> 
> 
> Bugs: PIG-3047
>     https://issues.apache.org/jira/browse/PIG-3047
> 
> 
> Repository: pig
> 
> 
> Description
> -------
> 
> -Check the size of a relation before adding it to distributed cache in Replicated join - 1G by default
> 
> 
> Diffs
> -----
> 
>   trunk/conf/pig.properties 1536246 
>   trunk/src/docs/src/documentation/content/xdocs/perf.xml 1536246 
>   trunk/src/org/apache/pig/PigConfiguration.java 1536246 
>   trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/InputSizeReducerEstimator.java 1536246 
>   trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1536246 
>   trunk/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 1536246 
>   trunk/src/org/apache/pig/impl/util/Utils.java 1536246 
>   trunk/test/org/apache/pig/test/PigStorageWithStatistics.java 1536246 
>   trunk/test/org/apache/pig/test/TestFRJoin2.java 1536246 
> 
> Diff: https://reviews.apache.org/r/14964/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Aniket Mokashi
> 
>


Re: Review Request 14964: PIG-3047 Check the size of a relation before adding it to distributed cache in Replicated join

Posted by Aniket Mokashi <an...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/14964/
-----------------------------------------------------------

(Updated Oct. 28, 2013, 10:28 p.m.)


Review request for pig, Cheolsoo Park, Daniel Dai, Dmitriy Ryaboy, and Julien Le Dem.


Changes
-------

added documentation.


Bugs: PIG-3047
    https://issues.apache.org/jira/browse/PIG-3047


Repository: pig


Description
-------

-Check the size of a relation before adding it to distributed cache in Replicated join - 1G by default


Diffs (updated)
-----

  trunk/conf/pig.properties 1536246 
  trunk/src/docs/src/documentation/content/xdocs/perf.xml 1536246 
  trunk/src/org/apache/pig/PigConfiguration.java 1536246 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/InputSizeReducerEstimator.java 1536246 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1536246 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 1536246 
  trunk/src/org/apache/pig/impl/util/Utils.java 1536246 
  trunk/test/org/apache/pig/test/PigStorageWithStatistics.java 1536246 
  trunk/test/org/apache/pig/test/TestFRJoin2.java 1536246 

Diff: https://reviews.apache.org/r/14964/diff/


Testing
-------


Thanks,

Aniket Mokashi


Re: Review Request 14964: PIG-3047 Check the size of a relation before adding it to distributed cache in Replicated join

Posted by Aniket Mokashi <an...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/14964/
-----------------------------------------------------------

(Updated Oct. 28, 2013, 6:45 a.m.)


Review request for pig, Cheolsoo Park, Daniel Dai, Dmitriy Ryaboy, and Julien Le Dem.


Changes
-------

Updated patch with code review changes (rebased).


Bugs: PIG-3047
    https://issues.apache.org/jira/browse/PIG-3047


Repository: pig


Description
-------

-Check the size of a relation before adding it to distributed cache in Replicated join - 1G by default


Diffs (updated)
-----

  trunk/conf/pig.properties 1536246 
  trunk/src/org/apache/pig/PigConfiguration.java 1536246 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/InputSizeReducerEstimator.java 1536246 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1536246 
  trunk/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 1536246 
  trunk/src/org/apache/pig/impl/util/Utils.java 1536246 
  trunk/test/org/apache/pig/test/PigStorageWithStatistics.java 1536246 
  trunk/test/org/apache/pig/test/TestFRJoin2.java 1536246 

Diff: https://reviews.apache.org/r/14964/diff/


Testing
-------


Thanks,

Aniket Mokashi


Re: Review Request 14964: PIG-3047 Check the size of a relation before adding it to distributed cache in Replicated join

Posted by Cheolsoo Park <pi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/14964/#review27579
-----------------------------------------------------------


Thank you Aniket for the patch! Overall it looks good.

Some lines become quite wide. Do you mind breaking them into multiple lines? I have few more minor comments below.


trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/InputSizeReducerEstimator.java
<https://reviews.apache.org/r/14964/#comment53521>

    Unused. Please remove.



trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java
<https://reviews.apache.org/r/14964/#comment53518>

    White space.



trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java
<https://reviews.apache.org/r/14964/#comment53524>

    I think we should not duplicate this catch block. Perhaps we can move it to the outside of the for loop and catch all the exceptions in one place. For example,
    
    try {
      long runningSum = 0;
      long maxSize = configured value;
    
      for (each of replFiles) {
        runningSum += size of replFile;
        if (runningSum > maxSize) {
          throw new IOException();
        }
      }
    
      setupDistributedCache();
    
    } catch (IOException e) {
      throw new VisitorException(e);
    }



trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java
<https://reviews.apache.org/r/14964/#comment53517>

    Please remove the tab.



trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java
<https://reviews.apache.org/r/14964/#comment53523>

    Why remove "Internal error"? I think keeping it is consistent with other error messages in this file.



trunk/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java
<https://reviews.apache.org/r/14964/#comment53520>

    Typo: "if a file" => "if a directory".



trunk/src/org/apache/pig/impl/util/Utils.java
<https://reviews.apache.org/r/14964/#comment53516>

    Unused. Please remove.


- Cheolsoo Park


On Oct. 26, 2013, 1:29 a.m., Aniket Mokashi wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/14964/
> -----------------------------------------------------------
> 
> (Updated Oct. 26, 2013, 1:29 a.m.)
> 
> 
> Review request for pig, Cheolsoo Park, Daniel Dai, Dmitriy Ryaboy, and Julien Le Dem.
> 
> 
> Bugs: PIG-3047
>     https://issues.apache.org/jira/browse/PIG-3047
> 
> 
> Repository: pig
> 
> 
> Description
> -------
> 
> -Check the size of a relation before adding it to distributed cache in Replicated join - 1G by default
> 
> 
> Diffs
> -----
> 
>   trunk/conf/pig.properties 1531787 
>   trunk/src/org/apache/pig/PigConfiguration.java 1531787 
>   trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/InputSizeReducerEstimator.java 1531787 
>   trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1531787 
>   trunk/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 1531787 
>   trunk/src/org/apache/pig/impl/util/Utils.java 1531787 
>   trunk/test/org/apache/pig/test/PigStorageWithStatistics.java 1531787 
>   trunk/test/org/apache/pig/test/TestFRJoin2.java 1531787 
> 
> Diff: https://reviews.apache.org/r/14964/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Aniket Mokashi
> 
>