Posted to dev@pig.apache.org by "Jie Li (JIRA)" <ji...@apache.org> on 2012/08/01 01:48:34 UTC

[jira] [Updated] (PIG-483) PERFORMANCE: different strategies for large and small order bys

     [ https://issues.apache.org/jira/browse/PIG-483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jie Li updated PIG-483:
-----------------------

    Attachment: PIG-483.1.patch

Now that PIG-2779 is fixed, we know the runtime number of reducers of the orderby/skewjoin job before submitting the sample job. If that job uses only one reducer, we can skip the sample job. When the sample job is skipped, the orderby/skewjoin job changes as follows:

* do not add the partition file to the distributed cache, as no such file exists.
* do not set the specialized Partitioner, i.e. WeightedRangePartitioner for orderby and SkewedPartitioner for skewjoin.
* for skew join, do not load the partition file in POPartitionRearrange.
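
To make the three changes above concrete, here is a minimal sketch of the decision in isolation. It is not taken from the patch: SampleSkipSketch, JobSettings, plan() and the "partition-file" path are hypothetical names for illustration only, and the partitioners are represented by their class names as plain strings rather than the real Pig classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch; none of these types are Pig or Hadoop classes. */
public class SampleSkipSketch {

    /** Simplified stand-in for the settings of an orderby/skewjoin job. */
    public static final class JobSettings {
        public final boolean runSampleJob;
        public final Optional<String> partitioner;        // empty = keep the default partitioner
        public final List<String> distributedCacheFiles;  // files shipped to the tasks
        public final boolean loadPartitionFileInRearrange; // only meaningful for skew join

        JobSettings(boolean runSampleJob, Optional<String> partitioner,
                    List<String> distributedCacheFiles, boolean loadPartitionFileInRearrange) {
            this.runSampleJob = runSampleJob;
            this.partitioner = partitioner;
            this.distributedCacheFiles = distributedCacheFiles;
            this.loadPartitionFileInRearrange = loadPartitionFileInRearrange;
        }
    }

    /** Decide the settings once the runtime number of reducers is known. */
    public static JobSettings plan(String operator, int runtimeReducers, String partitionFile) {
        if (runtimeReducers == 1) {
            // Single reducer: no sample job, so no partition file and no special partitioner.
            return new JobSettings(false, Optional.empty(), new ArrayList<>(), false);
        }
        boolean skewJoin = operator.equals("skewjoin");
        String partitioner = skewJoin ? "SkewedPartitioner" : "WeightedRangePartitioner";
        List<String> cacheFiles = new ArrayList<>();
        cacheFiles.add(partitionFile); // produced by the sample job
        return new JobSettings(true, Optional.of(partitioner), cacheFiles, skewJoin);
    }

    public static void main(String[] args) {
        JobSettings single = plan("orderby", 1, "partition-file");
        System.out.println("run sample job: " + single.runSampleJob);
        System.out.println("partitioner set: " + single.partitioner.isPresent());
    }
}
```

The point of the sketch is that all three changes follow from one condition, so they can be decided together at the place where the runtime reducer count becomes available.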

We then return the sample job as a SkipJob whose status is already set to successful, so JobControl puts it directly in the successful job queue without submitting it. From then on the SkipJob is processed just like a regular job.
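
The SkipJob idea can be illustrated with a toy model of a job controller. This is only a sketch under assumptions: ToyJob, ToySkipJob and runAll() are hypothetical and much simpler than Hadoop's JobControl; they only show how a job created in the successful state bypasses submission while still ending up in the same queue as regular jobs.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical toy model; not Hadoop's JobControl and not the patch's SkipJob. */
public class SkipJobSketch {

    public enum State { WAITING, SUCCESS }

    /** A regular job starts out waiting and succeeds once submitted. */
    public static class ToyJob {
        public final String name;
        public State state = State.WAITING;
        public boolean submitted = false;

        public ToyJob(String name) { this.name = name; }

        public void submit() {
            submitted = true;       // a real controller would launch the job here
            state = State.SUCCESS;  // the toy model assumes every submitted job succeeds
        }
    }

    /** A skipped job is created already successful, so it is never submitted. */
    public static class ToySkipJob extends ToyJob {
        public ToySkipJob(String name) {
            super(name);
            state = State.SUCCESS;
        }
    }

    /** Submit only jobs that are not yet successful; collect all of them as successful. */
    public static List<ToyJob> runAll(List<ToyJob> jobs) {
        List<ToyJob> successful = new ArrayList<>();
        for (ToyJob job : jobs) {
            if (job.state != State.SUCCESS) {
                job.submit();
            }
            successful.add(job);
        }
        return successful;
    }

    public static void main(String[] args) {
        List<ToyJob> jobs = new ArrayList<>();
        jobs.add(new ToySkipJob("sample"));
        jobs.add(new ToyJob("orderby"));
        for (ToyJob job : runAll(jobs)) {
            System.out.println(job.name + " submitted=" + job.submitted);
        }
    }
}
```

Because the skipped job lands in the same successful list, whatever post-processing runs over successful jobs does not need a special case for it.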

Any comments on this approach? I will work on unit tests if it looks good.
                
> PERFORMANCE: different strategies for large and small order bys
> ---------------------------------------------------------------
>
>                 Key: PIG-483
>                 URL: https://issues.apache.org/jira/browse/PIG-483
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.2.0
>            Reporter: Olga Natkovich
>              Labels: gsoc2011, performance
>         Attachments: PIG-483.0.patch, PIG-483.1.patch
>
>
> Currently pig always does a multi-pass order by where it first determines a distribution for the keys and then orders in a second pass.  This avoids the necessity of having a single reducer.  However, in cases where the data is small enough to fit into a single reducer, this is inefficient.  For small data sets it would be good to realize the small size of the set and do the order by in a single pass with a single reducer.
> This is a candidate project for Google summer of code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011
