Posted to dev@pig.apache.org by "Alan Gates (JIRA)" <ji...@apache.org> on 2009/01/21 01:35:59 UTC

[jira] Commented: (PIG-545) PERFORMANCE: Sampler for order bys does not produce a good distribution

    [ https://issues.apache.org/jira/browse/PIG-545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665650#action_12665650 ] 

Alan Gates commented on PIG-545:
--------------------------------

I ran the PigMix queries L9 (order by a single column) and L10 (order by multiple columns) and found some interesting results.

For L9, the total ordering job (job 3) took 587 seconds.  Min and max times for individual reducers were 92 and 589 seconds (I'm not sure how one reducer ran 2 seconds longer than the total job time, but all these numbers come from the Hadoop web UI).  Seven of the 40 reducers (including the 92-second one) received no records to sort.  The long-running 589-second reducer received one key, which had 2M values.

For L10, the total ordering job took 238 seconds.  Min and max times for individual reducers were 99 seconds (3 keys, 32K records) and 232 seconds (413K keys, 496K records).

From this I draw a couple of conclusions:

One, our order by partitioner could be better built.  There is no reason a reducer should ever receive 0 records.  And in a job with 3 uncorrelated keys we still see a > 10x disparity in data distribution.  The partitioner needs to do a better job of producing even distributions of the keys to reducers.
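
To make the failure mode concrete, here is a minimal sketch of a sample-based range partitioner. It is not Pig's actual sampler or partitioner; the function names (quantile_boundaries, range_partition, records_per_reducer) and the toy data are made up for illustration. It picks cut points at evenly spaced ranks of a sorted sample and sends each key to the range it falls in. With distinct keys the reducers come out even, but once one key dominates the sample the cut points collapse onto that key and some reducers receive nothing.

```python
import bisect
from collections import Counter


def quantile_boundaries(sample, num_reducers):
    """Pick num_reducers - 1 cut points at evenly spaced ranks of the sorted sample."""
    ordered = sorted(sample)
    n = len(ordered)
    return [ordered[(i * n) // num_reducers] for i in range(1, num_reducers)]


def range_partition(key, boundaries):
    """Reducer index is the number of cut points that are <= key."""
    return bisect.bisect_right(boundaries, key)


def records_per_reducer(keys, boundaries, num_reducers):
    """Count how many records each reducer would receive."""
    counts = Counter(range_partition(k, boundaries) for k in keys)
    return [counts.get(r, 0) for r in range(num_reducers)]


# Distinct keys: every reducer gets an equal share.
uniform = list(range(1000))
print(records_per_reducer(uniform, quantile_boundaries(uniform, 4), 4))
# [250, 250, 250, 250]

# Hypothetical skewed input: one key (5) accounts for about 90% of the records.
skewed = [5] * 900 + list(range(100))
cuts = quantile_boundaries(skewed, 4)
print(cuts)                                  # [5, 5, 5]
print(records_per_reducer(skewed, cuts, 4))  # [5, 0, 0, 995]
```

In the skewed case two of the four reducers sit idle and one does nearly all the work, which is the same shape as the L9 numbers above.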

Two, just getting better sampling won't resolve the issue for order by queries that have one or a few keys with a very high number of values, such as in a Zipf distribution.  Unfortunately for us, Zipf is a very common data distribution.  In this case our partitioner may need to be able to detect and split large keys by round-robining them to a group of reducers.
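
One possible way to do the splitting, sketched below under assumptions: if a key is so frequent that it appears as several consecutive cut points, treat the whole run of reducers around those cut points as a group and round-robin that key's records across the group. The class name SkewAwarePartitioner, the cut points, and the data are hypothetical, and this is not a description of what Pig does today.

```python
import bisect


class SkewAwarePartitioner:
    """Range partitioner that spreads a key across every reducer whose
    range touches that key, instead of sending it all to one reducer."""

    def __init__(self, boundaries):
        self.boundaries = boundaries  # sorted cut points, may contain repeats
        self._turn = {}               # per-key round-robin counter

    def partition(self, key):
        lo = bisect.bisect_left(self.boundaries, key)   # first reducer that may hold key
        hi = bisect.bisect_right(self.boundaries, key)  # last reducer that may hold key
        if lo == hi:
            return lo  # key is not a cut point: exactly one reducer owns it
        # Key equals one or more cut points: rotate through reducers lo..hi.
        turn = self._turn.get(key, 0)
        self._turn[key] = turn + 1
        return lo + turn % (hi - lo + 1)


def reducer_loads(keys, partitioner, num_reducers):
    """Count how many records each reducer would receive."""
    loads = [0] * num_reducers
    for k in keys:
        loads[partitioner.partition(k)] += 1
    return loads


# Hypothetical skewed input: key 5 dominates, so all three cut points are 5.
skewed = [5] * 900 + list(range(100))
print(reducer_loads(skewed, SkewAwarePartitioner([5, 5, 5]), 4))
# [231, 225, 225, 319]
```

Because the group of reducers is contiguous, and keys smaller than the heavy key only go to the first reducer of the group while larger keys only go to the last, concatenating the sorted reducer outputs in order should still give a total ordering. The residual imbalance in the last reducer comes from the keys above the heavy key, which a weighted (rather than plain round-robin) split could even out.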

> PERFORMANCE: Sampler for order bys does not produce a good distribution
> -----------------------------------------------------------------------
>
>                 Key: PIG-545
>                 URL: https://issues.apache.org/jira/browse/PIG-545
>             Project: Pig
>          Issue Type: Bug
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Alan Gates
>            Assignee: Amir Youssefi
>             Fix For: types_branch
>
>
> In running tests on actual data, I've noticed that the final reduce of an order by has skewed partitions.  Some reduces finish in a few seconds while some run for 20 minutes.  Getting a better distribution should lead to much better performance for order by.
