Posted to issues@spark.apache.org by "koert kuipers (JIRA)" <ji...@apache.org> on 2015/04/28 22:19:06 UTC

[jira] [Comment Edited] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)

    [ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517946#comment-14517946 ] 

koert kuipers edited comment on SPARK-3655 at 4/28/15 8:18 PM:
---------------------------------------------------------------

since the last pullreq for this ticket i created spark-sorted (based on suggestions from imran), a small library for spark that supports the target features of this ticket, but without the burden of having to be fully compatible with the current spark api conventions (with regard to orderings being implicit).
i also got a chance to catch up with sandy at spark summit east and we exchanged some emails afterward about this jira ticket and possible design choices.

so based on those experiences i think there are better alternatives than the current pullreq (https://github.com/apache/spark/pull/3632), and i will close it. the pullreq does bring secondary sort to spark, but only in memory, which is a very limited feature (since if the values can be stored in memory then sorting after the shuffle isn't really that hard, just wasteful).
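to make that concrete, here is a minimal sketch (illustration only, not the pullreq code) of what the in-memory approach boils down to with the existing api:
{noformat}
import org.apache.spark.{SparkContext, HashPartitioner}

// illustration only: group all values per key, then sort in memory.
// this breaks down as soon as one key's values no longer fit in memory.
val sc = new SparkContext("local[*]", "naive-secondary-sort")
val rdd = sc.parallelize(Seq(("a", 3), ("a", 1), ("b", 2), ("a", 2)))
val sorted = rdd
  .groupByKey(new HashPartitioner(2))
  .mapValues(_.toSeq.sorted) // per-key sort happens entirely in memory
sorted.collect().foreach(println) // e.g. (a,List(1, 2, 3)) and (b,List(2))
{noformat}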

instead of the current pullreq i see 2 alternatives:
1) a new pullreq that introduces the mapStreamByKey api, which is very similar to the reduce operation as we know it in hadoop: a sorted streaming reduce. its signature would be something like this on RDD[(K, V)]:
{noformat}
  def mapStreamByKey[W](partitioner: Partitioner, f: Iterator[V] => Iterator[W])(implicit o1: Ordering[K], o2: Ordering[V]): RDD[(K, W)]
{noformat}
(note that the implicits would not actually be on the method as shown here, but on an implicit conversion class, similar to how PairRDDFunctions works. see the implementation sketch after the list below.)

2) don't do anything. the functionality this jira targets is already available in the small spark-sorted library on spark-packages, and that's good enough.
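to make alternative 1) concrete, below is a rough sketch (mine, not the proposed patch; helper names like keyPartitioner are made up) of how mapStreamByKey could be built on spark's existing repartitionAndSortWithinPartitions: shuffle on the composite (key, value) pair, partition on the key alone, then walk each sorted partition as a stream of per-key value iterators.
{noformat}
import scala.reflect.ClassTag
import org.apache.spark.{HashPartitioner, Partitioner}
import org.apache.spark.rdd.RDD

// sketch only: secondary sort via a composite (key, value) shuffle key.
// assumes keys that compare equal under the ordering are also == equal.
def mapStreamByKey[K, V, W](rdd: RDD[(K, V)], numPartitions: Int)(
    f: Iterator[V] => Iterator[W])(
    implicit ordK: Ordering[K], ordV: Ordering[V],
    ctK: ClassTag[K], ctV: ClassTag[V], ctW: ClassTag[W]): RDD[(K, W)] = {
  // route records by key alone, while the shuffle sorts by (key, value)
  val hash = new HashPartitioner(numPartitions)
  val keyPartitioner = new Partitioner {
    def numPartitions: Int = hash.numPartitions
    def getPartition(key: Any): Int =
      hash.getPartition(key.asInstanceOf[(K, V)]._1)
  }
  rdd.map(kv => (kv, ()))
    .repartitionAndSortWithinPartitions(keyPartitioner)
    .mapPartitions { iter =>
      val it = iter.map(_._1).buffered
      new Iterator[(K, W)] {
        private var group: Iterator[V] = Iterator.empty
        private var out: Iterator[(K, W)] = Iterator.empty
        def hasNext: Boolean = {
          while (!out.hasNext) {
            while (group.hasNext) group.next() // drain values f left unread
            if (!it.hasNext) return false
            val k = it.head._1
            group = new Iterator[V] { // one key's values, streamed
              def hasNext: Boolean = it.hasNext && it.head._1 == k
              def next(): V = it.next()._2
            }
            out = f(group).map(w => (k, w))
          }
          true
        }
        def next(): (K, W) = if (hasNext) out.next() else Iterator.empty.next()
      }
    }
}

// usage: smallest value per key, without materializing any group:
// val mins = mapStreamByKey(pairs, 8)(vs => Iterator(vs.next()))
{noformat}
the point is that f sees one key's values in sorted order without them ever being materialized in memory, which is exactly what the in-memory pullreq could not offer.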



> Support sorting of values in addition to keys (i.e. secondary sort)
> -------------------------------------------------------------------
>
>                 Key: SPARK-3655
>                 URL: https://issues.apache.org/jira/browse/SPARK-3655
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 1.1.0, 1.2.0
>            Reporter: koert kuipers
>            Assignee: Koert Kuipers
>
> Now that spark has a sort based shuffle, can we expect a secondary sort soon? There are some use cases where getting a sorted iterator of values per key is helpful.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org