You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@pig.apache.org by "Amir Youssefi (JIRA)" <ji...@apache.org> on 2008/08/19 04:41:44 UTC

[jira] Commented: (PIG-171) Top K

    [ https://issues.apache.org/jira/browse/PIG-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12623540#action_12623540 ] 

Amir Youssefi commented on PIG-171:
-----------------------------------

Daniel, 

 Thanks for the patches. Going back to single reducer over-load issue I mentioned in May 21st comment:

  https://issues.apache.org/jira/browse/PIG-171?focusedCommentId=12598862#action_12598862

 We can now consider having multiple MR steps. I was avoiding it because of logic complexity. Having more MR steps is rather cheap and cluster is already allocated for us.

To address this, instead of sending map results to one reducer,

 Solutions A (constant number of MR jobs):  we can use many reducers to do in-memory merge LIMIT between output of multiple maps. Results will hit the disk and second MR job does the same task but this time sends all to single reducer. This way we have greatly reduced load on a single reducer (it has to merge far less rows).

 Solution B (merge degree / fan-out): We use multiple MR jobs and in each step map/reduce task merge (does in-memory LIMIT) between D (degree)  number of inputs. With each set of tasks we divide number of inputs to be merged by D till it becomes 1. That is when we finish.

 Set of tasks above can be both Map and Reduce side. We can skip map side for simplicity (no multi input format). We use identity mappers and each reducer merges D to 1. 

 We can discuss detecting when we hit 1. One ways is to assume reducers with no output row don't create a part file and number of input files gets reduced in each step and eventually we stop when only one input file exists.

  There is a JIRA for makings sure reducers with not output don't create a file. 

 There may be better ways to do this e.g. any new MR job uses reducer 0 to r and in each step we divide r by D (just like number in PARALLEL command) till we hit 1.

 Thoughts?



 

> Top K
> -----
>
>                 Key: PIG-171
>                 URL: https://issues.apache.org/jira/browse/PIG-171
>             Project: Pig
>          Issue Type: Sub-task
>    Affects Versions: types_branch
>            Reporter: Amir Youssefi
>             Fix For: types_branch
>
>         Attachments: limit1.patch, limit2.patch, limit3.patch
>
>
> Frequently, users are interested on Top results (especially Top K rows) . This can be implemented efficiently in Pig /Map Reduce settings to deliver rapid results and low Network Bandwidth/Memory usage.
>  
>  Key point is to prune all data on the map side and keep only small set of rows with Top criteria . We can do it in Algebraic function (combiner) with multiple value output. Only a small data-set gets out of mapper node.
> The same idea is applicable to solve variants of this problem:
>   - An Algebraic Function for 'Top K Rows'
>   - An Algebraic Function for 'Top K' values ('Top Rank K' and 'Top Dense Rank K')
>   - TOP K ORDER BY.
> Another words implementation is similar to combiners for aggregate functions but instead of one value we get multiple ones. 
> I will add a sample implementation for Top K Rows and possibly TOP K ORDER BY to clarify details.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.