You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by Lorand Bendig <lb...@gmail.com> on 2014/01/02 15:04:22 UTC

Re: Review Request 16507: PIG-3642 Direct HDFS access for small jobs (fetch)


> On Dec. 30, 2013, 9:50 p.m., Cheolsoo Park wrote:
> > This a great work. Thank you very much!
> > 
> > I have few minor comments below mostly about tests.

Cheolsoo, thanks for taking your time to review it!
I fixed/commented the issues.


> On Dec. 30, 2013, 9:50 p.m., Cheolsoo Park wrote:
> > /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java, line 74
> > <https://reviews.apache.org/r/16507/diff/1/?file=404117#file404117line74>
> >
> >     Can you move this to PigConfiguration?

PigConfiguration seems to be a better place to put OPT_FETCH


> On Dec. 30, 2013, 9:50 p.m., Cheolsoo Park wrote:
> > /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java, line 23
> > <https://reviews.apache.org/r/16507/diff/1/?file=404122#file404122line23>
> >
> >     Do you mind removing unused imports?
> >     - import java.util.LinkedList;
> >     - import org.apache.pig.impl.util.IdentityHashSet;
> >     - import org.apache.pig.pen.util.LineageTracer;

Sure. Intially didn't want to remove these leftovers from PIG-1712 


> On Dec. 30, 2013, 9:50 p.m., Cheolsoo Park wrote:
> > /trunk/test/org/apache/pig/test/TestAssert.java, lines 91-94
> > <https://reviews.apache.org/r/16507/diff/1/?file=404127#file404127line91>
> >
> >     Does this else block ever get executed given the we're running the test with opt.fetch on?
> >     
> >     I think you can do either-
> >     
> >     1) explicitly set opt.fetch to true or false in setup(),
> >     
> >     or
> >     
> >     2) change the test to run the query twice with opt.fetch on and off to ensure we're not breaking anything when opt.fetch is off.

Not really. I chose the second option


> On Dec. 30, 2013, 9:50 p.m., Cheolsoo Park wrote:
> > /trunk/test/org/apache/pig/test/TestPigRunner.java, line 174
> > <https://reviews.apache.org/r/16507/diff/1/?file=404130#file404130line174>
> >
> >     Why is this changed? I think the default value for opt.multiquery is true.

I accidentally changed it


- Lorand


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16507/#review30936
-----------------------------------------------------------


On Dec. 29, 2013, 11:19 p.m., Lorand Bendig wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/16507/
> -----------------------------------------------------------
> 
> (Updated Dec. 29, 2013, 11:19 p.m.)
> 
> 
> Review request for pig.
> 
> 
> Bugs: PIG-3642
>     https://issues.apache.org/jira/browse/PIG-3642
> 
> 
> Repository: pig
> 
> 
> Description
> -------
> 
> With this patch I'd like to add the possibility to directly read data from HDFS instead of launching MR jobs in case of simple (map-only) tasks. Hive already has this feature (fetch). This patch shares some similarities with the local mode of Pig 0.6. Here, fetching kicks off when the following holds for a script:
> 
>     it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, (nested) FOREACH with expression operators, custom UDFs..etc
>     no scalar aliases
>     no SampleLoader
>     single leaf job
>     DUMP (no STORE)
> 
> The feature is enabled by default and can be toggled with:
> 
>     -N or -no_fetch
>     set opt.fetch true/false;
> 
> There's no STORE support because I wanted to make it explicit that this "optimization" is for launching small/simple scripts during development, rather than querying and filtering large number of rows on the client machine. However, a threshold could be given on the input size (an estimation) to determine whether to prefer fetch over MR jobs, similar to what Hive's 'hive.fetch.task.conversion.threshold' does. (through Pig's LoadMetadata#getStatistic ?)
> 
> 
> Diffs
> -----
> 
>   /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java 1553596 
>   /trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/FixedWidthLoader.java 1553596 
>   /trunk/src/org/apache/pig/Main.java 1553596 
>   /trunk/src/org/apache/pig/PigServer.java 1553596 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/HExecutionEngine.java 1553596 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchLauncher.java PRE-CREATION 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchOptimizer.java PRE-CREATION 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchPOStoreImpl.java PRE-CREATION 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/fetch/FetchProgressableReporter.java PRE-CREATION 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java 1553596 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/expressionOperators/POUserFunc.java 1553596 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POStream.java 1553596 
>   /trunk/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java 1553596 
>   /trunk/src/org/apache/pig/impl/util/PropertiesUtil.java 1553596 
>   /trunk/src/org/apache/pig/newplan/logical/expression/ExpToPhyTranslationVisitor.java 1553596 
>   /trunk/src/org/apache/pig/tools/pigstats/SimpleFetchPigStats.java PRE-CREATION 
>   /trunk/test/org/apache/pig/test/TestAssert.java 1553596 
>   /trunk/test/org/apache/pig/test/TestEvalPipeline2.java 1553596 
>   /trunk/test/org/apache/pig/test/TestFetch.java PRE-CREATION 
>   /trunk/test/org/apache/pig/test/TestPigRunner.java 1553596 
> 
> Diff: https://reviews.apache.org/r/16507/diff/
> 
> 
> Testing
> -------
> 
> - new testcase added:  TestFetch
> - the patch was checked against test-commit and test-core
> - Because opt.fetch is set by default, the testcases were using fetch instead of MR jobs wherever it was possible
> 
> 
> Thanks,
> 
> Lorand Bendig
> 
>