Posted to issues@spark.apache.org by "Reynold Xin (JIRA)" <ji...@apache.org> on 2015/12/09 16:16:11 UTC

[jira] [Resolved] (SPARK-3461) Support external groupByKey using repartitionAndSortWithinPartitions

     [ https://issues.apache.org/jira/browse/SPARK-3461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-3461.
--------------------------------
       Resolution: Implemented
         Assignee: Reynold Xin  (was: Sandy Ryza)
    Fix Version/s: 1.6.0

> Support external groupByKey using repartitionAndSortWithinPartitions
> --------------------------------------------------------------------
>
>                 Key: SPARK-3461
>                 URL: https://issues.apache.org/jira/browse/SPARK-3461
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Patrick Wendell
>            Assignee: Reynold Xin
>            Priority: Critical
>             Fix For: 1.6.0
>
>
> Given that we have SPARK-2978, it seems like we could support an external group-by operator pretty easily. We'd just have to wrap the existing iterator exposed by SPARK-2978 with a lookahead iterator that detects the group boundaries. Also, we'd have to override the cache() operator to cache the parent RDD, so that if this object is cached it doesn't wind up going through the iterator.
> I haven't totally followed all the sort-shuffle internals, but just given the stated semantics of SPARK-2978 it seems like this would be possible.
> It would be really nice to externalize this because many beginner users write jobs in terms of groupByKey.
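
For readers unfamiliar with the approach proposed above, here is a minimal Scala sketch of the lookahead-iterator idea: repartition and sort by key, then split each partition's sorted stream at key boundaries so only one group is held in memory at a time. The helper name groupByKeySorted and the driver code are illustrative assumptions, not the code that actually shipped in 1.6.0.

    import scala.reflect.ClassTag

    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    object ExternalGroupBySketch {

      // Hypothetical helper: group values by key without building an in-memory
      // map of every group. repartitionAndSortWithinPartitions routes equal keys
      // to the same partition and sorts them, so each group arrives as a
      // contiguous run of records; a buffered (lookahead) iterator splits the
      // stream wherever the key changes, holding only one group at a time.
      def groupByKeySorted[K : Ordering : ClassTag, V : ClassTag](
          rdd: RDD[(K, V)], numPartitions: Int): RDD[(K, Iterable[V])] = {
        val ord = implicitly[Ordering[K]]
        val sorted =
          rdd.repartitionAndSortWithinPartitions(new HashPartitioner(numPartitions))

        sorted.mapPartitions { iter =>
          new Iterator[(K, Iterable[V])] {
            private val buffered = iter.buffered
            override def hasNext: Boolean = buffered.hasNext
            override def next(): (K, Iterable[V]) = {
              val key = buffered.head._1
              val values = scala.collection.mutable.ArrayBuffer.empty[V]
              // Consume the contiguous run of records that share this key.
              while (buffered.hasNext && ord.equiv(buffered.head._1, key)) {
                values += buffered.next()._2
              }
              (key, values)
            }
          }
        }
      }

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("external-group-by").setMaster("local[2]"))
        val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))
        groupByKeySorted(pairs, numPartitions = 2).collect().foreach(println)
        sc.stop()
      }
    }

Because the shuffle sorts and spills to disk, the per-partition state is just the current group rather than a hash map of all groups, which is what makes the operation "external".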



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org