Posted to dev@pig.apache.org by "Alan Gates (JIRA)" <ji...@apache.org> on 2009/02/06 00:03:59 UTC

[jira] Commented: (PIG-545) PERFORMANCE: Sampler for order bys does not produce a good distribution

    [ https://issues.apache.org/jira/browse/PIG-545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12670938#action_12670938 ] 

Alan Gates commented on PIG-545:
--------------------------------

I ran the PigMix L9 (order by on a single field) and L10 (order by on multiple fields) queries.  L9 went from 14 minutes to 8, so this patch holds huge promise.  But L10 went from 8 minutes to 11, so it doesn't seem to be working well in the multiple-field case.  (It could also be related to the fact that L10 sorts one of the columns in descending order; I don't know whether the new partitioner can handle that.)  I also ran our end-to-end order by tests on it, and all passed except bigdata_1, which fails with an IndexOutOfBounds exception in the new WeightedRangePartitioner class.

As for the caveat that it needs to know the number of reducers up front, I believe that in cases where the user doesn't specify parallel, we can determine the parallelism of the reduces using JobClient.getDefaultReduces().  We need to double-check that this gives us the right information in both the HOD and non-HOD cases.
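
To make the fallback concrete, here is a rough sketch of the selection logic only; it is not code from the patch. The class name ReducerCount, the resolve method, and the convention that -1 means "no parallel given" are all made up for illustration. The IntSupplier stands in for a call such as JobClient.getDefaultReduces(), so the sketch runs without Hadoop on the classpath.

```java
import java.util.function.IntSupplier;

// Hypothetical helper: picks the reducer count the sampler should plan for.
final class ReducerCount {

    /**
     * @param requestedParallel value the user gave with parallel, or -1 if
     *                          none was given (assumed convention)
     * @param clusterDefault    stand-in for a lookup such as
     *                          JobClient.getDefaultReduces()
     */
    static int resolve(int requestedParallel, IntSupplier clusterDefault) {
        if (requestedParallel > 0) {
            // The user asked for a specific parallelism; use it as-is.
            return requestedParallel;
        }
        // Otherwise ask the cluster, and never plan for fewer than one reducer.
        return Math.max(1, clusterDefault.getAsInt());
    }

    public static void main(String[] args) {
        System.out.println(resolve(40, () -> 10)); // explicit parallel
        System.out.println(resolve(-1, () -> 10)); // fall back to cluster default
    }
}
```

Whatever value comes back would have to be fixed before the sampling job writes its partition boundaries, which is why the HOD and non-HOD answers need to be verified.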

> PERFORMANCE: Sampler for order bys does not produce a good distribution
> -----------------------------------------------------------------------
>
>                 Key: PIG-545
>                 URL: https://issues.apache.org/jira/browse/PIG-545
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Amir Youssefi
>             Fix For: types_branch
>
>         Attachments: WRP.patch
>
>
> In running tests on actual data, I've noticed that the final reduce of an order by has skewed partitions.  Some reduces finish in a few seconds while some run for 20 minutes.  Getting a better distribution should lead to much better performance for order by.
