Posted to dev@pig.apache.org by "Cheolsoo Park (JIRA)" <ji...@apache.org> on 2014/02/02 06:36:09 UTC

[jira] [Updated] (PIG-3642) Direct HDFS access for small jobs (fetch)

     [ https://issues.apache.org/jira/browse/PIG-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheolsoo Park updated PIG-3642:
-------------------------------

    Attachment: PIG-3642-6.patch

+1.

All the e2e tests pass except PIG-3679 and PIG-3739. (Both are unrelated.) 

I also confirmed that the fetch optimization is explicitly disabled in all the affected e2e test cases, so they will run in the same manner as before. In fact, I found one "commented-out" test case that could be affected by the fetch optimization, so I made a one-line change, just to be safe, as follows:
{code:title=cmdline.conf#L71}
#			'java_params' => ['-Dopt.fetch=false'],
{code}
I will commit PIG-3642-6.patch if I don't hear any objections.

> Direct HDFS access for small jobs (fetch) 
> ------------------------------------------
>
>                 Key: PIG-3642
>                 URL: https://issues.apache.org/jira/browse/PIG-3642
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Lorand Bendig
>            Assignee: Lorand Bendig
>             Fix For: 0.13.0
>
>         Attachments: PIG-3642-3.patch, PIG-3642-4.patch, PIG-3642-5.patch, PIG-3642-6.patch, PIG-3642.patch
>
>
> With this patch I'd like to add the possibility to read data directly from HDFS instead of launching MR jobs for simple (map-only) tasks. Hive already has this feature (fetch). This patch shares some similarities with the local mode of Pig 0.6. Here, fetching kicks in when all of the following hold for a script:
> * it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, (nested) FOREACH with expression operators, custom UDFs, etc.
> * no scalar aliases
> * no SampleLoader
> * single leaf job
> * DUMP (no STORE)
> The feature is enabled by default and can be toggled with:
> * -N or -no_fetch 
> * set opt.fetch true/false; 
> There's no STORE support because I wanted to make it explicit that this "optimization" is for launching small/simple scripts during development, rather than for querying and filtering a large number of rows on the client machine. However, a threshold could be given on the (estimated) input size to determine whether to prefer fetch over MR jobs, similar to what Hive's '{{hive.fetch.task.conversion.threshold}}' does. (Perhaps through Pig's LoadMetadata#getStatistics?)
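
For readers skimming the thread, the eligibility conditions listed in the quoted description can be read as a single predicate. The following is only a rough sketch of that reading, not the patch's actual implementation: `ScriptSummary`, `can_fetch`, and `FETCHABLE_OPS` are hypothetical names invented for illustration, the operator set contains just the operators the description names (plus LOAD, which is an assumption), and the description's "etc." means the real list is longer.

```python
from dataclasses import dataclass

# Operators named in the description, plus LOAD (an assumption: every script
# reads some input). The description's list ends with "etc.", so this set is
# illustrative and not complete.
FETCHABLE_OPS = frozenset({"LOAD", "LIMIT", "FILTER", "UNION", "STREAM", "FOREACH"})


@dataclass
class ScriptSummary:
    """Hypothetical summary of a Pig script; not a real Pig class."""
    operators: tuple
    output: str = "DUMP"                 # "DUMP" or "STORE"
    leaf_jobs: int = 1
    has_scalar_alias: bool = False
    uses_sample_loader: bool = False
    union_generates_split: bool = False
    opt_fetch: bool = True               # mirrors "set opt.fetch true/false;"


def can_fetch(s: ScriptSummary) -> bool:
    """Return True if the script meets every condition listed in the description."""
    if not s.opt_fetch:
        return False                     # feature switched off by the user
    if any(op not in FETCHABLE_OPS for op in s.operators):
        return False                     # e.g. GROUP or JOIN needs a real MR job
    if "UNION" in s.operators and s.union_generates_split:
        return False                     # UNION qualifies only without a split
    return (
        s.output == "DUMP"               # no STORE support
        and s.leaf_jobs == 1             # single leaf job
        and not s.has_scalar_alias       # no scalar aliases
        and not s.uses_sample_loader     # no SampleLoader
    )
```

Under this reading, a LOAD/FILTER/LIMIT script ending in DUMP would be fetched, while the same script ending in STORE, or any script containing an operator outside the allowed set, would fall back to MR jobs.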



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)