Posted to issues@spark.apache.org by "koert kuipers (JIRA)" <ji...@apache.org> on 2014/12/08 04:25:12 UTC

[jira] [Comment Edited] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)

    [ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14237299#comment-14237299 ] 

koert kuipers edited comment on SPARK-3655 at 12/8/14 3:24 AM:
---------------------------------------------------------------

I have a new pull request that implements just groupByKeyAndSortValues in Scala and Java. I will need some help with Python.

The pull request is here:
https://github.com/apache/spark/pull/3632

I changed the methods to return RDD[(K, TraversableOnce[V])] instead of RDD[(K, Iterable[V])], since I don't see a reasonable way to return Iterables without keeping the data in memory.
The assumption is that once you move on to the next key within a partition, the previous key's values (the TraversableOnce[V]) will no longer be used.
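
A rough illustration of that single-pass assumption, in plain Python rather than Spark. This is only a sketch: group_sorted_values is a hypothetical helper, not the code in the pull request, and the in-memory sorted() call stands in for the ordering that the sort-based shuffle would provide within a partition.

```python
from itertools import groupby
from operator import itemgetter


def group_sorted_values(records):
    """Yield (key, values_iterator) pairs from an iterable of (key, value).

    Hypothetical stand-in for one partition: records are ordered by key and
    then by value, and every values_iterator is single-pass.
    """
    # Assumption: in Spark the shuffle would deliver this order; here we sort locally.
    ordered = iter(sorted(records))
    for key, group in groupby(ordered, key=itemgetter(0)):
        # The generator reads lazily from the shared underlying iterator.
        yield key, (value for _, value in group)


records = [("b", 2), ("a", 3), ("a", 1), ("b", 1)]

# Safe: each key's values are consumed before moving on to the next key.
consumed = [(key, list(values)) for key, values in group_sorted_values(records)]

# Unsafe: the value iterators are held and only read after all keys were visited.
held = list(group_sorted_values(records))
stale = [(key, list(values)) for key, values in held]
```

In the safe case every key comes back with its values in sorted order. In the unsafe case the values of an earlier key are already gone by the time they are read, which is the kind of mistake a TraversableOnce-style return type makes possible.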

Personally, I find this API too generic and too easy to abuse or make mistakes with, so I prefer a more constrained API like foldLeft. Or perhaps groupByKeyAndSortValues could be marked as DeveloperApi?
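
To make the foldLeft alternative concrete, here is a minimal sketch in plain Python, not Spark. The name fold_left_by_key and its signature are assumptions made for illustration only. The point is that the caller supplies a zero value and a combining function, and never gets hold of the single-pass iterator.

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter


def fold_left_by_key(records, zero, op):
    """Fold each key's values, in sorted order, into a single result per key.

    Hypothetical helper: zero is the starting accumulator (it is shared by
    all keys, so it should be immutable) and op(acc, value) returns the next
    accumulator.
    """
    # Assumption: the local sort stands in for the per-partition sorted order.
    ordered = sorted(records)
    return [
        (key, reduce(op, (value for _, value in group), zero))
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


# Example: concatenate each key's values in sorted order.
result = fold_left_by_key(
    [("b", 2), ("a", 3), ("a", 1)],
    "",
    lambda acc, value: acc + str(value),
)
```

Because the values are consumed inside the helper, there is no iterator left over that could be read after the next key has started.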



> Support sorting of values in addition to keys (i.e. secondary sort)
> -------------------------------------------------------------------
>
>                 Key: SPARK-3655
>                 URL: https://issues.apache.org/jira/browse/SPARK-3655
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 1.1.0, 1.2.0
>            Reporter: koert kuipers
>            Assignee: Koert Kuipers
>            Priority: Minor
>
> Now that Spark has a sort-based shuffle, can we expect a secondary sort soon? There are some use cases where getting a sorted iterator of values per key is helpful.


