Posted to dev@pig.apache.org by "Pi Song (JIRA)" <ji...@apache.org> on 2008/06/08 02:35:45 UTC
[jira] Commented: (PIG-171) Top K
[ https://issues.apache.org/jira/browse/PIG-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12603350#action_12603350 ]
Pi Song commented on PIG-171:
-----------------------------
Seems like all efficient histogram generation algorithms are probabilistic so my optimization idea wouldn't work.
> Top K
> -----
>
> Key: PIG-171
> URL: https://issues.apache.org/jira/browse/PIG-171
> Project: Pig
> Issue Type: New Feature
> Reporter: Amir Youssefi
> Assignee: Amir Youssefi
>
> Frequently, users are interested in Top results (especially Top K rows). This can be implemented efficiently in Pig/MapReduce settings to deliver rapid results with low network bandwidth and memory usage.
>
> The key point is to prune data on the map side and keep only a small set of rows that meet the Top criteria. We can do this in an Algebraic function (combiner) with multiple-value output, so that only a small data set leaves each mapper node.
> The same idea is applicable to solve variants of this problem:
> - An Algebraic Function for 'Top K Rows'
> - An Algebraic Function for 'Top K' values ('Top Rank K' and 'Top Dense Rank K')
> - TOP K ORDER BY.
> In other words, the implementation is similar to combiners for aggregate functions, but instead of one value we get multiple ones.
> I will add a sample implementation for Top K Rows and possibly TOP K ORDER BY to clarify details.
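
The pruning idea described in the issue can be sketched outside of Pig in a few lines of Python. This is only an illustration under stated assumptions, not the implementation proposed for PIG-171: the function names `top_k_partial` and `top_k_final`, the `(name, score)` row layout, and the three-way partitioning are all hypothetical. Each partition keeps at most K candidate rows (the map/combine step), and the final step merges those small candidate lists (the reduce step).

```python
import heapq
from itertools import chain


def top_k_partial(rows, k, key):
    """Map/combine step (hypothetical name): keep the k largest rows of one partition."""
    return heapq.nlargest(k, rows, key=key)


def top_k_final(partials, k, key):
    """Reduce step (hypothetical name): merge per-partition candidates, keep the k largest."""
    return heapq.nlargest(k, chain.from_iterable(partials), key=key)


def score(row):
    """Ranking criterion for the made-up (name, score) rows below."""
    return row[1]


# Made-up input: (name, score) rows already split across three mappers.
partitions = [
    [("a", 5), ("b", 9), ("c", 1)],
    [("d", 7), ("e", 3)],
    [("f", 8), ("g", 2), ("h", 6)],
]
k = 3

# Each mapper emits at most k rows, so little data crosses the network.
partials = [top_k_partial(p, k, score) for p in partitions]
result = top_k_final(partials, k, score)
print(result)
```

Because the top K of the whole data set is always contained in the union of the per-partition top K lists, the partial step can be applied repeatedly (as a combiner would be) without changing the final answer. Ties at the K-th position, which matter for the 'Top Rank K' and 'Top Dense Rank K' variants, are not handled by this sketch.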
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
Re: [jira] Commented: (PIG-171) Top K
Posted by Ted Dunning <te...@gmail.com>.
An efficient implementation of top K without full histogramming would still
be very, very useful.
--
ted