You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@pig.apache.org by Lorand Bendig <lb...@gmail.com> on 2013/12/30 00:19:44 UTC

Review Request 16507: PIG-3642 Direct HDFS access for small jobs (fetch)

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16507/
-----------------------------------------------------------

Review request for pig.


Bugs: PIG-3642
    https://issues.apache.org/jira/browse/PIG-3642


Repository: pig


Description
-------

With this patch I'd like to add the possibility to directly read data from HDFS instead of launching MR jobs in case of simple (map-only) tasks. Hive already has this feature (fetch). This patch shares some similarities with the local mode of Pig 0.6. Here, fetching kicks off when the following holds for a script:

    it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, (nested) FOREACH with expression operators, custom UDFs..etc
    no scalar aliases
    no SampleLoader
    single leaf job
    DUMP (no STORE)

The feature is enabled by default and can be toggled with:

    -N or -no_fetch
    set opt.fetch true/false;

There's no STORE support because I wanted to make it explicit that this "optimization" is for launching small/simple scripts during development, rather than querying and filtering large number of rows on the client machine. However, a threshold could be given on the input size (an estimation) to determine whether to prefer fetch over MR jobs, similar to what Hive's 'hive.fetch.task.conversion.threshold' does. (through Pig's LoadMetadata#getStatistic ?)


Diffs
-----

  /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java 1553596 
  /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/FixedWidthLoader.java 1553596 
  /trunk/src/org/apache/pig/Main.java 1553596 
  /trunk/src/org/apache/pig/PigServer.java 1553596 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java 1553596 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchLauncher.java PRE-CREATION 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java PRE-CREATION 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchPOStoreImpl.java PRE-CREATION 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchProgressableReporter.java PRE-CREATION 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1553596 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1553596 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java 1553596 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 1553596 
  /trunk/src/org/apache/pig/impl/util/PropertiesUtil.java 1553596 
  /trunk/src/org/apache/pig/newplan/logical/expression/ExpToPhyTranslationVisitor.java 1553596 
  /trunk/src/org/apache/pig/tools/pigstats/SimpleFetchPigStats.java PRE-CREATION 
  /trunk/test/org/apache/pig/test/TestAssert.java 1553596 
  /trunk/test/org/apache/pig/test/TestEvalPipeline2.java 1553596 
  /trunk/test/org/apache/pig/test/TestFetch.java PRE-CREATION 
  /trunk/test/org/apache/pig/test/TestPigRunner.java 1553596 

Diff: https://reviews.apache.org/r/16507/diff/


Testing
-------

- new testcase added:  TestFetch
- the patch was checked against test-commit and test-core
- Because opt.fetch is set by default, the testcases were using fetch instead of MR jobs wherever it was possible


Thanks,

Lorand Bendig

Re: Review Request 16507: PIG-3642 Direct HDFS access for small jobs (fetch)

Posted by Lorand Bendig <lb...@gmail.com>.


> On Jan. 3, 2014, 1:18 a.m., Cheolsoo Park wrote:
> > I have one last comment below. Other than that, everything looks good.
> > 
> > Also, can you document this? It think it's worth to mention in the "Performance and Efficiency" section in the manual. You can post a doc patch in a separate jira if you'd like.

Great, thank you. Sure, I'll create a separate jira for this.


> On Jan. 3, 2014, 1:18 a.m., Cheolsoo Park wrote:
> > /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java, line 178
> > <https://reviews.apache.org/r/16507/diff/2/?file=413423#file413423line178>
> >
> >     This won't work if the temporary file storage is not InterStorage. It can be one of Inter, TFile, and SequenceFile storages.
> >     
> >     See here-
> >     https://github.com/apache/pig/blob/trunk/src/org/apache/pig/impl/util/Utils.java#L347
> >

Great observation! I wasn't aware of this. Now I the verification is done in two steps:
- check whether the storage class is the same as the configured one
- check if the target path equals to the temporary path (it's unlikely that someone will use the generated temp.
  path of the current session) 
What do you think?


- Lorand


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16507/#review31099
-----------------------------------------------------------


On Jan. 3, 2014, 10:57 p.m., Lorand Bendig wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16507/
> -----------------------------------------------------------
> 
> (Updated Jan. 3, 2014, 10:57 p.m.)
> 
> 
> Review request for pig.
> 
> 
> Bugs: PIG-3642
>     https://issues.apache.org/jira/browse/PIG-3642
> 
> 
> Repository: pig
> 
> 
> Description
> -------
> 
> With this patch I'd like to add the possibility to directly read data from HDFS instead of launching MR jobs in case of simple (map-only) tasks. Hive already has this feature (fetch). This patch shares some similarities with the local mode of Pig 0.6. Here, fetching kicks off when the following holds for a script:
> 
>     it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, (nested) FOREACH with expression operators, custom UDFs..etc
>     no scalar aliases
>     no SampleLoader
>     single leaf job
>     DUMP (no STORE)
> 
> The feature is enabled by default and can be toggled with:
> 
>     -N or -no_fetch
>     set opt.fetch true/false;
> 
> There's no STORE support because I wanted to make it explicit that this "optimization" is for launching small/simple scripts during development, rather than querying and filtering large number of rows on the client machine. However, a threshold could be given on the input size (an estimation) to determine whether to prefer fetch over MR jobs, similar to what Hive's 'hive.fetch.task.conversion.threshold' does. (through Pig's LoadMetadata#getStatistic ?)
> 
> 
> Diffs
> -----
> 
>   /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java 1555255 
>   /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/FixedWidthLoader.java 1555255 
>   /trunk/src/org/apache/pig/Main.java 1555255 
>   /trunk/src/org/apache/pig/PigConfiguration.java 1555255 
>   /trunk/src/org/apache/pig/PigServer.java 1555255 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java 1555255 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchLauncher.java PRE-CREATION 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java PRE-CREATION 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchPOStoreImpl.java PRE-CREATION 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchProgressableReporter.java PRE-CREATION 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1555255 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1555255 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java 1555255 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 1555255 
>   /trunk/src/org/apache/pig/impl/io/FileLocalizer.java 1555255 
>   /trunk/src/org/apache/pig/impl/util/PropertiesUtil.java 1555255 
>   /trunk/src/org/apache/pig/impl/util/Utils.java 1555255 
>   /trunk/src/org/apache/pig/newplan/logical/expression/ExpToPhyTranslationVisitor.java 1555255 
>   /trunk/src/org/apache/pig/tools/pigstats/SimpleFetchPigStats.java PRE-CREATION 
>   /trunk/test/org/apache/pig/test/TestAssert.java 1555255 
>   /trunk/test/org/apache/pig/test/TestEvalPipeline2.java 1555255 
>   /trunk/test/org/apache/pig/test/TestFetch.java PRE-CREATION 
>   /trunk/test/org/apache/pig/test/TestPigRunner.java 1555255 
> 
> Diff: https://reviews.apache.org/r/16507/diff/
> 
> 
> Testing
> -------
> 
> - new testcase added:  TestFetch
> - the patch was checked against test-commit and test-core
> - Because opt.fetch is set by default, the testcases were using fetch instead of MR jobs wherever it was possible
> 
> 
> Thanks,
> 
> Lorand Bendig
> 
>

Re: Review Request 16507: PIG-3642 Direct HDFS access for small jobs (fetch)

Posted by Cheolsoo Park <pi...@gmail.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16507/#review31099
-----------------------------------------------------------


I have one last comment below. Other than that, everything looks good.

Also, can you document this? It think it's worth to mention in the "Performance and Efficiency" section in the manual. You can post a doc patch in a separate jira if you'd like.


/trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java
<https://reviews.apache.org/r/16507/#comment59452>

    This won't work if the temporary file storage is not InterStorage. It can be one of Inter, TFile, and SequenceFile storages.
    
    See here-
    https://github.com/apache/pig/blob/trunk/src/org/apache/pig/impl/util/Utils.java#L347
    


- Cheolsoo Park


On Jan. 2, 2014, 2:05 p.m., Lorand Bendig wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16507/
> -----------------------------------------------------------
> 
> (Updated Jan. 2, 2014, 2:05 p.m.)
> 
> 
> Review request for pig.
> 
> 
> Bugs: PIG-3642
>     https://issues.apache.org/jira/browse/PIG-3642
> 
> 
> Repository: pig
> 
> 
> Description
> -------
> 
> With this patch I'd like to add the possibility to directly read data from HDFS instead of launching MR jobs in case of simple (map-only) tasks. Hive already has this feature (fetch). This patch shares some similarities with the local mode of Pig 0.6. Here, fetching kicks off when the following holds for a script:
> 
>     it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, (nested) FOREACH with expression operators, custom UDFs..etc
>     no scalar aliases
>     no SampleLoader
>     single leaf job
>     DUMP (no STORE)
> 
> The feature is enabled by default and can be toggled with:
> 
>     -N or -no_fetch
>     set opt.fetch true/false;
> 
> There's no STORE support because I wanted to make it explicit that this "optimization" is for launching small/simple scripts during development, rather than querying and filtering large number of rows on the client machine. However, a threshold could be given on the input size (an estimation) to determine whether to prefer fetch over MR jobs, similar to what Hive's 'hive.fetch.task.conversion.threshold' does. (through Pig's LoadMetadata#getStatistic ?)
> 
> 
> Diffs
> -----
> 
>   /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java 1554785 
>   /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/FixedWidthLoader.java 1554785 
>   /trunk/src/org/apache/pig/Main.java 1554785 
>   /trunk/src/org/apache/pig/PigConfiguration.java 1554785 
>   /trunk/src/org/apache/pig/PigServer.java 1554785 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java 1554785 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchLauncher.java PRE-CREATION 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java PRE-CREATION 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchPOStoreImpl.java PRE-CREATION 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchProgressableReporter.java PRE-CREATION 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1554785 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1554785 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java 1554785 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 1554785 
>   /trunk/src/org/apache/pig/impl/util/PropertiesUtil.java 1554785 
>   /trunk/src/org/apache/pig/newplan/logical/expression/ExpToPhyTranslationVisitor.java 1554785 
>   /trunk/src/org/apache/pig/tools/pigstats/SimpleFetchPigStats.java PRE-CREATION 
>   /trunk/test/org/apache/pig/test/TestAssert.java 1554785 
>   /trunk/test/org/apache/pig/test/TestEvalPipeline2.java 1554785 
>   /trunk/test/org/apache/pig/test/TestFetch.java PRE-CREATION 
>   /trunk/test/org/apache/pig/test/TestPigRunner.java 1554785 
> 
> Diff: https://reviews.apache.org/r/16507/diff/
> 
> 
> Testing
> -------
> 
> - new testcase added:  TestFetch
> - the patch was checked against test-commit and test-core
> - Because opt.fetch is set by default, the testcases were using fetch instead of MR jobs wherever it was possible
> 
> 
> Thanks,
> 
> Lorand Bendig
> 
>

Re: Review Request 16507: PIG-3642 Direct HDFS access for small jobs (fetch)

Posted by Lorand Bendig <lb...@gmail.com>.


> On Jan. 5, 2014, 12:49 a.m., Cheolsoo Park wrote:
> > /trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java, line 383
> > <https://reviews.apache.org/r/16507/diff/3/?file=415059#file415059line383>
> >
> >     I think "return" is omitted here. The explain still outputs the MR plan even if the plan is fetchable.

This is ugly. Thanks for fixing it!


- Lorand


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16507/#review31213
-----------------------------------------------------------


On Jan. 3, 2014, 10:57 p.m., Lorand Bendig wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16507/
> -----------------------------------------------------------
> 
> (Updated Jan. 3, 2014, 10:57 p.m.)
> 
> 
> Review request for pig.
> 
> 
> Bugs: PIG-3642
>     https://issues.apache.org/jira/browse/PIG-3642
> 
> 
> Repository: pig
> 
> 
> Description
> -------
> 
> With this patch I'd like to add the possibility to directly read data from HDFS instead of launching MR jobs in case of simple (map-only) tasks. Hive already has this feature (fetch). This patch shares some similarities with the local mode of Pig 0.6. Here, fetching kicks off when the following holds for a script:
> 
>     it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, (nested) FOREACH with expression operators, custom UDFs..etc
>     no scalar aliases
>     no SampleLoader
>     single leaf job
>     DUMP (no STORE)
> 
> The feature is enabled by default and can be toggled with:
> 
>     -N or -no_fetch
>     set opt.fetch true/false;
> 
> There's no STORE support because I wanted to make it explicit that this "optimization" is for launching small/simple scripts during development, rather than querying and filtering large number of rows on the client machine. However, a threshold could be given on the input size (an estimation) to determine whether to prefer fetch over MR jobs, similar to what Hive's 'hive.fetch.task.conversion.threshold' does. (through Pig's LoadMetadata#getStatistic ?)
> 
> 
> Diffs
> -----
> 
>   /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java 1555255 
>   /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/FixedWidthLoader.java 1555255 
>   /trunk/src/org/apache/pig/Main.java 1555255 
>   /trunk/src/org/apache/pig/PigConfiguration.java 1555255 
>   /trunk/src/org/apache/pig/PigServer.java 1555255 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java 1555255 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchLauncher.java PRE-CREATION 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java PRE-CREATION 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchPOStoreImpl.java PRE-CREATION 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchProgressableReporter.java PRE-CREATION 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1555255 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1555255 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java 1555255 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 1555255 
>   /trunk/src/org/apache/pig/impl/io/FileLocalizer.java 1555255 
>   /trunk/src/org/apache/pig/impl/util/PropertiesUtil.java 1555255 
>   /trunk/src/org/apache/pig/impl/util/Utils.java 1555255 
>   /trunk/src/org/apache/pig/newplan/logical/expression/ExpToPhyTranslationVisitor.java 1555255 
>   /trunk/src/org/apache/pig/tools/pigstats/SimpleFetchPigStats.java PRE-CREATION 
>   /trunk/test/org/apache/pig/test/TestAssert.java 1555255 
>   /trunk/test/org/apache/pig/test/TestEvalPipeline2.java 1555255 
>   /trunk/test/org/apache/pig/test/TestFetch.java PRE-CREATION 
>   /trunk/test/org/apache/pig/test/TestPigRunner.java 1555255 
> 
> Diff: https://reviews.apache.org/r/16507/diff/
> 
> 
> Testing
> -------
> 
> - new testcase added:  TestFetch
> - the patch was checked against test-commit and test-core
> - Because opt.fetch is set by default, the testcases were using fetch instead of MR jobs wherever it was possible
> 
> 
> Thanks,
> 
> Lorand Bendig
> 
>

Re: Review Request 16507: PIG-3642 Direct HDFS access for small jobs (fetch)

Posted by Cheolsoo Park <pi...@gmail.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16507/#review31213
-----------------------------------------------------------

Ship it!


Looks good to me. I will commit it after running unit tests and e2e tests.

I found a minor bug below. Let me fix it when I commit it.


/trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java
<https://reviews.apache.org/r/16507/#comment59615>

    I think "return" is omitted here. The explain still outputs the MR plan even if the plan is fetchable.


- Cheolsoo Park


On Jan. 3, 2014, 10:57 p.m., Lorand Bendig wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16507/
> -----------------------------------------------------------
> 
> (Updated Jan. 3, 2014, 10:57 p.m.)
> 
> 
> Review request for pig.
> 
> 
> Bugs: PIG-3642
>     https://issues.apache.org/jira/browse/PIG-3642
> 
> 
> Repository: pig
> 
> 
> Description
> -------
> 
> With this patch I'd like to add the possibility to directly read data from HDFS instead of launching MR jobs in case of simple (map-only) tasks. Hive already has this feature (fetch). This patch shares some similarities with the local mode of Pig 0.6. Here, fetching kicks off when the following holds for a script:
> 
>     it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, (nested) FOREACH with expression operators, custom UDFs..etc
>     no scalar aliases
>     no SampleLoader
>     single leaf job
>     DUMP (no STORE)
> 
> The feature is enabled by default and can be toggled with:
> 
>     -N or -no_fetch
>     set opt.fetch true/false;
> 
> There's no STORE support because I wanted to make it explicit that this "optimization" is for launching small/simple scripts during development, rather than querying and filtering large number of rows on the client machine. However, a threshold could be given on the input size (an estimation) to determine whether to prefer fetch over MR jobs, similar to what Hive's 'hive.fetch.task.conversion.threshold' does. (through Pig's LoadMetadata#getStatistic ?)
> 
> 
> Diffs
> -----
> 
>   /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java 1555255 
>   /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/FixedWidthLoader.java 1555255 
>   /trunk/src/org/apache/pig/Main.java 1555255 
>   /trunk/src/org/apache/pig/PigConfiguration.java 1555255 
>   /trunk/src/org/apache/pig/PigServer.java 1555255 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java 1555255 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchLauncher.java PRE-CREATION 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java PRE-CREATION 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchPOStoreImpl.java PRE-CREATION 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchProgressableReporter.java PRE-CREATION 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1555255 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1555255 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java 1555255 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 1555255 
>   /trunk/src/org/apache/pig/impl/io/FileLocalizer.java 1555255 
>   /trunk/src/org/apache/pig/impl/util/PropertiesUtil.java 1555255 
>   /trunk/src/org/apache/pig/impl/util/Utils.java 1555255 
>   /trunk/src/org/apache/pig/newplan/logical/expression/ExpToPhyTranslationVisitor.java 1555255 
>   /trunk/src/org/apache/pig/tools/pigstats/SimpleFetchPigStats.java PRE-CREATION 
>   /trunk/test/org/apache/pig/test/TestAssert.java 1555255 
>   /trunk/test/org/apache/pig/test/TestEvalPipeline2.java 1555255 
>   /trunk/test/org/apache/pig/test/TestFetch.java PRE-CREATION 
>   /trunk/test/org/apache/pig/test/TestPigRunner.java 1555255 
> 
> Diff: https://reviews.apache.org/r/16507/diff/
> 
> 
> Testing
> -------
> 
> - new testcase added:  TestFetch
> - the patch was checked against test-commit and test-core
> - Because opt.fetch is set by default, the testcases were using fetch instead of MR jobs wherever it was possible
> 
> 
> Thanks,
> 
> Lorand Bendig
> 
>

Re: Review Request 16507: PIG-3642 Direct HDFS access for small jobs (fetch)

Posted by Lorand Bendig <lb...@gmail.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16507/
-----------------------------------------------------------

(Updated Jan. 3, 2014, 10:57 p.m.)


Review request for pig.


Changes
-------

Updated patch: PIG-3642-3.patch


Bugs: PIG-3642
    https://issues.apache.org/jira/browse/PIG-3642


Repository: pig


Description
-------

With this patch I'd like to add the possibility to directly read data from HDFS instead of launching MR jobs in case of simple (map-only) tasks. Hive already has this feature (fetch). This patch shares some similarities with the local mode of Pig 0.6. Here, fetching kicks off when the following holds for a script:

    it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, (nested) FOREACH with expression operators, custom UDFs..etc
    no scalar aliases
    no SampleLoader
    single leaf job
    DUMP (no STORE)

The feature is enabled by default and can be toggled with:

    -N or -no_fetch
    set opt.fetch true/false;

There's no STORE support because I wanted to make it explicit that this "optimization" is for launching small/simple scripts during development, rather than querying and filtering large number of rows on the client machine. However, a threshold could be given on the input size (an estimation) to determine whether to prefer fetch over MR jobs, similar to what Hive's 'hive.fetch.task.conversion.threshold' does. (through Pig's LoadMetadata#getStatistic ?)


Diffs (updated)
-----

  /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java 1555255 
  /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/FixedWidthLoader.java 1555255 
  /trunk/src/org/apache/pig/Main.java 1555255 
  /trunk/src/org/apache/pig/PigConfiguration.java 1555255 
  /trunk/src/org/apache/pig/PigServer.java 1555255 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java 1555255 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchLauncher.java PRE-CREATION 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java PRE-CREATION 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchPOStoreImpl.java PRE-CREATION 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchProgressableReporter.java PRE-CREATION 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1555255 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1555255 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java 1555255 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 1555255 
  /trunk/src/org/apache/pig/impl/io/FileLocalizer.java 1555255 
  /trunk/src/org/apache/pig/impl/util/PropertiesUtil.java 1555255 
  /trunk/src/org/apache/pig/impl/util/Utils.java 1555255 
  /trunk/src/org/apache/pig/newplan/logical/expression/ExpToPhyTranslationVisitor.java 1555255 
  /trunk/src/org/apache/pig/tools/pigstats/SimpleFetchPigStats.java PRE-CREATION 
  /trunk/test/org/apache/pig/test/TestAssert.java 1555255 
  /trunk/test/org/apache/pig/test/TestEvalPipeline2.java 1555255 
  /trunk/test/org/apache/pig/test/TestFetch.java PRE-CREATION 
  /trunk/test/org/apache/pig/test/TestPigRunner.java 1555255 

Diff: https://reviews.apache.org/r/16507/diff/


Testing
-------

- new testcase added:  TestFetch
- the patch was checked against test-commit and test-core
- Because opt.fetch is set by default, the testcases were using fetch instead of MR jobs wherever it was possible


Thanks,

Lorand Bendig

Re: Review Request 16507: PIG-3642 Direct HDFS access for small jobs (fetch)

Posted by Lorand Bendig <lb...@gmail.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16507/
-----------------------------------------------------------

(Updated Jan. 2, 2014, 2:05 p.m.)


Review request for pig.


Changes
-------

Updated patch: PIG-3642-2.patch


Bugs: PIG-3642
    https://issues.apache.org/jira/browse/PIG-3642


Repository: pig


Description
-------

With this patch I'd like to add the possibility to directly read data from HDFS instead of launching MR jobs in case of simple (map-only) tasks. Hive already has this feature (fetch). This patch shares some similarities with the local mode of Pig 0.6. Here, fetching kicks off when the following holds for a script:

    it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, (nested) FOREACH with expression operators, custom UDFs..etc
    no scalar aliases
    no SampleLoader
    single leaf job
    DUMP (no STORE)

The feature is enabled by default and can be toggled with:

    -N or -no_fetch
    set opt.fetch true/false;

There's no STORE support because I wanted to make it explicit that this "optimization" is for launching small/simple scripts during development, rather than querying and filtering large number of rows on the client machine. However, a threshold could be given on the input size (an estimation) to determine whether to prefer fetch over MR jobs, similar to what Hive's 'hive.fetch.task.conversion.threshold' does. (through Pig's LoadMetadata#getStatistic ?)


Diffs (updated)
-----

  /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java 1554785 
  /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/FixedWidthLoader.java 1554785 
  /trunk/src/org/apache/pig/Main.java 1554785 
  /trunk/src/org/apache/pig/PigConfiguration.java 1554785 
  /trunk/src/org/apache/pig/PigServer.java 1554785 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java 1554785 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchLauncher.java PRE-CREATION 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java PRE-CREATION 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchPOStoreImpl.java PRE-CREATION 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchProgressableReporter.java PRE-CREATION 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1554785 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1554785 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java 1554785 
  /trunk/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 1554785 
  /trunk/src/org/apache/pig/impl/util/PropertiesUtil.java 1554785 
  /trunk/src/org/apache/pig/newplan/logical/expression/ExpToPhyTranslationVisitor.java 1554785 
  /trunk/src/org/apache/pig/tools/pigstats/SimpleFetchPigStats.java PRE-CREATION 
  /trunk/test/org/apache/pig/test/TestAssert.java 1554785 
  /trunk/test/org/apache/pig/test/TestEvalPipeline2.java 1554785 
  /trunk/test/org/apache/pig/test/TestFetch.java PRE-CREATION 
  /trunk/test/org/apache/pig/test/TestPigRunner.java 1554785 

Diff: https://reviews.apache.org/r/16507/diff/


Testing
-------

- new testcase added:  TestFetch
- the patch was checked against test-commit and test-core
- Because opt.fetch is set by default, the testcases were using fetch instead of MR jobs wherever it was possible


Thanks,

Lorand Bendig

Re: Review Request 16507: PIG-3642 Direct HDFS access for small jobs (fetch)

Posted by Lorand Bendig <lb...@gmail.com>.


> On Dec. 30, 2013, 9:50 p.m., Cheolsoo Park wrote:
> > This a great work. Thank you very much!
> > 
> > I have few minor comments below mostly about tests.

Cheolsoo, thanks for taking your time to review it!
I fixed/commented the issues.


> On Dec. 30, 2013, 9:50 p.m., Cheolsoo Park wrote:
> > /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java, line 74
> > <https://reviews.apache.org/r/16507/diff/1/?file=404117#file404117line74>
> >
> >     Can you move this to PigConfiguration?

PigConfiguration seems to be a better place to put OPT_FETCH


> On Dec. 30, 2013, 9:50 p.m., Cheolsoo Park wrote:
> > /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java, line 23
> > <https://reviews.apache.org/r/16507/diff/1/?file=404122#file404122line23>
> >
> >     Do you mind removing unused imports?
> >     - import java.util.LinkedList;
> >     - import org.apache.pig.impl.util.IdentityHashSet;
> >     - import org.apache.pig.pen.util.LineageTracer;

Sure. Intially didn't want to remove these leftovers from PIG-1712 


> On Dec. 30, 2013, 9:50 p.m., Cheolsoo Park wrote:
> > /trunk/test/org/apache/pig/test/TestAssert.java, lines 91-94
> > <https://reviews.apache.org/r/16507/diff/1/?file=404127#file404127line91>
> >
> >     Does this else block ever get executed given the we're running the test with opt.fetch on?
> >     
> >     I think you can do either-
> >     
> >     1) explicitly set opt.fetch to true or false in setup(),
> >     
> >     or
> >     
> >     2) change the test to run the query twice with opt.fetch on and off to ensure we're not breaking anything when opt.fetch is off.

Not really. I chose the second option


> On Dec. 30, 2013, 9:50 p.m., Cheolsoo Park wrote:
> > /trunk/test/org/apache/pig/test/TestPigRunner.java, line 174
> > <https://reviews.apache.org/r/16507/diff/1/?file=404130#file404130line174>
> >
> >     Why is this changed? I think the default value for opt.multiquery is true.

I accidentally changed it


- Lorand


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16507/#review30936
-----------------------------------------------------------


On Dec. 29, 2013, 11:19 p.m., Lorand Bendig wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16507/
> -----------------------------------------------------------
> 
> (Updated Dec. 29, 2013, 11:19 p.m.)
> 
> 
> Review request for pig.
> 
> 
> Bugs: PIG-3642
>     https://issues.apache.org/jira/browse/PIG-3642
> 
> 
> Repository: pig
> 
> 
> Description
> -------
> 
> With this patch I'd like to add the possibility to directly read data from HDFS instead of launching MR jobs in case of simple (map-only) tasks. Hive already has this feature (fetch). This patch shares some similarities with the local mode of Pig 0.6. Here, fetching kicks off when the following holds for a script:
> 
>     it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, (nested) FOREACH with expression operators, custom UDFs..etc
>     no scalar aliases
>     no SampleLoader
>     single leaf job
>     DUMP (no STORE)
> 
> The feature is enabled by default and can be toggled with:
> 
>     -N or -no_fetch
>     set opt.fetch true/false;
> 
> There's no STORE support because I wanted to make it explicit that this "optimization" is for launching small/simple scripts during development, rather than querying and filtering large number of rows on the client machine. However, a threshold could be given on the input size (an estimation) to determine whether to prefer fetch over MR jobs, similar to what Hive's 'hive.fetch.task.conversion.threshold' does. (through Pig's LoadMetadata#getStatistic ?)
> 
> 
> Diffs
> -----
> 
>   /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java 1553596 
>   /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/FixedWidthLoader.java 1553596 
>   /trunk/src/org/apache/pig/Main.java 1553596 
>   /trunk/src/org/apache/pig/PigServer.java 1553596 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java 1553596 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchLauncher.java PRE-CREATION 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java PRE-CREATION 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchPOStoreImpl.java PRE-CREATION 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchProgressableReporter.java PRE-CREATION 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1553596 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1553596 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java 1553596 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 1553596 
>   /trunk/src/org/apache/pig/impl/util/PropertiesUtil.java 1553596 
>   /trunk/src/org/apache/pig/newplan/logical/expression/ExpToPhyTranslationVisitor.java 1553596 
>   /trunk/src/org/apache/pig/tools/pigstats/SimpleFetchPigStats.java PRE-CREATION 
>   /trunk/test/org/apache/pig/test/TestAssert.java 1553596 
>   /trunk/test/org/apache/pig/test/TestEvalPipeline2.java 1553596 
>   /trunk/test/org/apache/pig/test/TestFetch.java PRE-CREATION 
>   /trunk/test/org/apache/pig/test/TestPigRunner.java 1553596 
> 
> Diff: https://reviews.apache.org/r/16507/diff/
> 
> 
> Testing
> -------
> 
> - new testcase added:  TestFetch
> - the patch was checked against test-commit and test-core
> - Because opt.fetch is set by default, the testcases were using fetch instead of MR jobs wherever it was possible
> 
> 
> Thanks,
> 
> Lorand Bendig
> 
>

Re: Review Request 16507: PIG-3642 Direct HDFS access for small jobs (fetch)

Posted by Cheolsoo Park <pi...@gmail.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16507/#review30936
-----------------------------------------------------------


This a great work. Thank you very much!

I have few minor comments below mostly about tests.


/trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchLauncher.java
<https://reviews.apache.org/r/16507/#comment59191>

    Replace UDFContext.getUdFContext() with udfContext.



/trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java
<https://reviews.apache.org/r/16507/#comment59252>

    Do you mind removing trailing white spaces in the patch?



/trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java
<https://reviews.apache.org/r/16507/#comment59192>

    Can you move this to PigConfiguration?



/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java
<https://reviews.apache.org/r/16507/#comment59194>

    Do you mind removing unused imports?
    - import java.util.LinkedList;
    - import org.apache.pig.impl.util.IdentityHashSet;
    - import org.apache.pig.pen.util.LineageTracer;



/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java
<https://reviews.apache.org/r/16507/#comment59276>

    Do you mind adding a comment explaining about isFetchable in POStream?



/trunk/test/org/apache/pig/test/TestAssert.java
<https://reviews.apache.org/r/16507/#comment59195>

    Does this else block ever get executed given the we're running the test with opt.fetch on?
    
    I think you can do either-
    
    1) explicitly set opt.fetch to true or false in setup(),
    
    or
    
    2) change the test to run the query twice with opt.fetch on and off to ensure we're not breaking anything when opt.fetch is off.



/trunk/test/org/apache/pig/test/TestEvalPipeline2.java
<https://reviews.apache.org/r/16507/#comment59197>

    Can you change the test to not take into account the order of records instead of handling fetchEnabled separately? Since union doesn't guarantee any ordering, the test shouldn't assume it in the first place.
    



/trunk/test/org/apache/pig/test/TestEvalPipeline2.java
<https://reviews.apache.org/r/16507/#comment59248>

    Same as above. Given that opt.fetch is on in the test, the bincond will be always true.
    
    I think you can do either-
    
    1) explicitly set opt.fetch to true or false in setup(),
    
    or
    
    2) change the test to run the query twice with opt.fetch on and off.



/trunk/test/org/apache/pig/test/TestPigRunner.java
<https://reviews.apache.org/r/16507/#comment59196>

    Why is this changed? I think the default value for opt.multiquery is true.



/trunk/test/org/apache/pig/test/TestPigRunner.java
<https://reviews.apache.org/r/16507/#comment59250>

    Same here. The default value for opt.multiquery is true.


- Cheolsoo Park


On Dec. 29, 2013, 11:19 p.m., Lorand Bendig wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16507/
> -----------------------------------------------------------
> 
> (Updated Dec. 29, 2013, 11:19 p.m.)
> 
> 
> Review request for pig.
> 
> 
> Bugs: PIG-3642
>     https://issues.apache.org/jira/browse/PIG-3642
> 
> 
> Repository: pig
> 
> 
> Description
> -------
> 
> With this patch I'd like to add the possibility to directly read data from HDFS instead of launching MR jobs in case of simple (map-only) tasks. Hive already has this feature (fetch). This patch shares some similarities with the local mode of Pig 0.6. Here, fetching kicks off when the following holds for a script:
> 
>     it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, (nested) FOREACH with expression operators, custom UDFs..etc
>     no scalar aliases
>     no SampleLoader
>     single leaf job
>     DUMP (no STORE)
> 
> The feature is enabled by default and can be toggled with:
> 
>     -N or -no_fetch
>     set opt.fetch true/false;
> 
> There's no STORE support because I wanted to make it explicit that this "optimization" is for launching small/simple scripts during development, rather than querying and filtering large number of rows on the client machine. However, a threshold could be given on the input size (an estimation) to determine whether to prefer fetch over MR jobs, similar to what Hive's 'hive.fetch.task.conversion.threshold' does. (through Pig's LoadMetadata#getStatistic ?)
> 
> 
> Diffs
> -----
> 
>   /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java 1553596 
>   /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/FixedWidthLoader.java 1553596 
>   /trunk/src/org/apache/pig/Main.java 1553596 
>   /trunk/src/org/apache/pig/PigServer.java 1553596 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java 1553596 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchLauncher.java PRE-CREATION 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java PRE-CREATION 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchPOStoreImpl.java PRE-CREATION 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchProgressableReporter.java PRE-CREATION 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1553596 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1553596 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java 1553596 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 1553596 
>   /trunk/src/org/apache/pig/impl/util/PropertiesUtil.java 1553596 
>   /trunk/src/org/apache/pig/newplan/logical/expression/ExpToPhyTranslationVisitor.java 1553596 
>   /trunk/src/org/apache/pig/tools/pigstats/SimpleFetchPigStats.java PRE-CREATION 
>   /trunk/test/org/apache/pig/test/TestAssert.java 1553596 
>   /trunk/test/org/apache/pig/test/TestEvalPipeline2.java 1553596 
>   /trunk/test/org/apache/pig/test/TestFetch.java PRE-CREATION 
>   /trunk/test/org/apache/pig/test/TestPigRunner.java 1553596 
> 
> Diff: https://reviews.apache.org/r/16507/diff/
> 
> 
> Testing
> -------
> 
> - new testcase added:  TestFetch
> - the patch was checked against test-commit and test-core
> - Because opt.fetch is set by default, the testcases were using fetch instead of MR jobs wherever it was possible
> 
> 
> Thanks,
> 
> Lorand Bendig
> 
>